Hello!
I'm trying to make use of some remote CPU cores on a machine I can SSH to. idle-maltair.ucc
is an SSH alias for a machine running Julia 1.6, just as I am.
From the documentation, it looks like addprocs(["idle-maltair.ucc"]; dir="/tmp", topology=:master_worker, multiplex=true)
should work nicely, but it doesn't.
ERROR: TaskFailedException
nested task error: IOError: connect: connection timed out (ETIMEDOUT)
I've tried a bit of googling, but had no luck so far. Any help would be greatly appreciated.
I've seen similar timeouts when the connection is not open already. I enabled multiplexing directly in my ssh config and made sure to open a connection to the server before I connect via Julia.
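A rough sketch of the "open the connection first" approach, driven from Julia rather than from the ssh config. The helper name master_connection_cmd, the control-socket path, and the persistence time are illustrative assumptions, not something taken from this thread; the ControlMaster, ControlPath, and ControlPersist options and the -f and -N flags are standard OpenSSH.

```julia
using Distributed

# Assumed location of the shared control socket (any writable path works).
const CONTROL_PATH = "~/.ssh/cm-%r@%h:%p"

# Hypothetical helper: build (but do not run) an ssh command that opens a
# background master connection which later ssh invocations can reuse.
function master_connection_cmd(host::AbstractString;
                               control_path::AbstractString = CONTROL_PATH,
                               persist::AbstractString = "10m")
    return `ssh -o ControlMaster=auto -o ControlPath=$control_path -o ControlPersist=$persist -fN $host`
end

cmd = master_connection_cmd("idle-maltair.ucc")
println(cmd)

# Usage (needs working SSH access, so it is left commented out):
# run(cmd)   # open the shared connection first
# addprocs(["idle-maltair.ucc"]; dir="/tmp",
#          sshflags=`-o ControlMaster=auto -o ControlPath=$CONTROL_PATH`)
```

The later ssh call only reuses the connection if it sees the same ControlPath, which is why the sketch passes it again through sshflags; setting it once in the ssh config achieves the same thing.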
https://discourse.julialang.org/t/addprocs-with-ssh-and-a-proxyjump-times-out/26647
I have this problem too. Did you come up with anything, @Timothy?
julia> run(`ssh $addr -- '/u/local/apps/julia/1.8.1/bin/julia --version '`);
julia version 1.8.1
julia> procs = addprocs([addr]; exename="/u/local/apps/julia/1.8.1/bin/julia", dir="/u/home/", sshflags="-vv")
debug2: channel_input_open_confirmation: channel 0: callback start
debug2: fd 3 setting TCP_NODELAY
debug2: client_session2_setup: id 0
debug1: Sending command: sh -l -c 'cd -- /u/home/
exec '\\''/u/local/apps/julia/1.8.1/bin/julia'\\'' --worker'
debug2: channel 0: request exec confirm 1
debug2: channel_input_open_confirmation: channel 0: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug2: channel 0: rcvd adjust 2097152
debug2: channel_input_status_confirm: type 99 id 0
debug2: exec request accepted on channel 0
debug2: channel 0: read<=0 rfd 4 len 0
debug2: channel 0: read failed
debug2: chan_shutdown_read: channel 0: (i0 o0 sock -1 wfd 4 efd 6 [write])
debug2: channel 0: input open -> drain
debug2: channel 0: ibuf empty
debug2: channel 0: send eof
debug2: channel 0: input drain -> closed
debug2: channel 0: rcvd eof
debug2: channel 0: output open -> drain
debug2: channel 0: obuf empty
debug2: chan_shutdown_write: channel 0: (i3 o1 sock -1 wfd 5 efd 6 [write])
debug2: channel 0: output drain -> closed
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug2: channel 0: rcvd close
debug2: channel 0: almost dead
debug2: channel 0: gc: notify user
debug2: channel 0: gc: user detached
debug2: channel 0: send close
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: client-session, nchannels 1
Transferred: sent 3060, received 3132 bytes, in 63.9 seconds
Bytes per second: sent 47.9, received 49.0
debug1: Exit status 1
ERROR: TaskFailedException
nested task error: IOError: connect: connection timed out (ETIMEDOUT)
caused by: IOError: connect: connection timed out (ETIMEDOUT)
I never had any luck :(
It feels like the exec line has way too many quotes.
Opened https://github.com/JuliaLang/julia/issues/47863
If I try to run that command manually on the server
-bash-4.2$ sh -l -c 'cd -- /u/home/
> exec '\\''/u/local/apps/julia/1.8.1/bin/julia'\\'' --worker'
sh: line 1: /u/local/apps/julia/1.8.1/bin/julia\: No such file or directory
I'm not sure whether that is the command actually being run or just some kind of string representation of it.
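One way to see how a doubled backslash can come from a string representation rather than from the command itself is to quote the exec line by hand and print it two ways. This is only a sketch: posix_quote is a hypothetical helper, not the function Distributed uses, and the idea that the ssh debug output shows an escaped rendering is an assumption, not something confirmed in this thread.

```julia
# Hypothetical helper: POSIX-style single quoting, where an embedded ' is
# written as '\'' (close quote, escaped quote, reopen quote).
posix_quote(s::AbstractString) = "'" * replace(s, "'" => "'\\''") * "'"

inner  = "exec " * posix_quote("/u/local/apps/julia/1.8.1/bin/julia") * " --worker"
quoted = posix_quote(inner)

println(quoted)        # the raw text: each embedded quote appears as '\''
println(repr(quoted))  # an escaped rendering of the same text: backslashes doubled
```

If the debug log is of the second kind, pasting it into a shell verbatim adds a literal backslash to the path, which would match the "No such file or directory" message for a path ending in a backslash.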
Maybe you need to specify the remote shell? Edit: Never mind, I see from the last output that it is bash, so the default option should work. Edit 2: But maybe sh
isn't bash?
On the server,
$ realpath $(which sh)
/usr/bin/bash
$ sh
sh-4.2$ help
GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
I tried overriding the line at https://github.com/JuliaLang/julia/blob/master/stdlib/Distributed/src/managers.jl#L298 to remove one layer of quoting by deleting the call to shell_escape_posixly(),
so @show remotecmd
now looks like
remotecmd = `sh -l -c "cd -- /tmp
exec '/u/local/apps/julia/1.8.1/bin/julia' --worker"`
but addprocs
still times out.
If I manually run exec '/u/local/apps/julia/1.8.1/bin/julia' --worker
on the server it just sits there, which I guess is what it's supposed to do
I don't really know how to diagnose this
julia> Distributed.addprocs_locked(
Distributed.SSHManager([addr]),
exename="/u/local/apps/julia/1.8.1/bin/julia",
dir="/tmp",
)
ERROR: TaskFailedException
nested task error: IOError: connect: connection timed out (ETIMEDOUT)
Stacktrace:
[1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
@ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:1092
[2] worker_from_id
@ /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:1089 [inlined]
[3] #remote_do#170
@ /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:557 [inlined]
[4] remote_do
@ /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:557 [inlined]
[5] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
@ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/managers.jl:692
[6] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
@ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:603
[7] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:544
[8] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:484
caused by: IOError: connect: connection timed out (ETIMEDOUT)
Stacktrace:
[1] wait_connected(x::Sockets.TCPSocket)
@ Sockets /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Sockets/src/Sockets.jl:529
[2] connect
@ /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Sockets/src/Sockets.jl:564 [inlined]
[3] connect_to_worker(host::String, port::Int64)
@ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/managers.jl:651
[4] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
@ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/managers.jl:578
[5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
@ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:599
[6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:544
[7] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:484
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:436
[2] macro expansion
@ ./task.jl:455 [inlined]
[3] addprocs_locked(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, String, Tuple{Symbol, Symbol}, NamedTuple{(:exename, :dir), Tuple{String, String}}})
@ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:490
jar said:
If I manually run
exec '/u/local/apps/julia/1.8.1/bin/julia' --worker
on the server it just sits there, which I guess is what it's supposed to do
Does the main process see the new worker, though? If I start a Julia session locally, run using Distributed, and then start another process with julia --worker, the latter sits there but the former still shows nprocs() = 1. I don't really know how this is supposed to work; I've never used Distributed.
nprocs() (basically) just returns length(Distributed.PGRP.workers), so I don't think it could work like that.
Shouldn't it increase when you add more workers?
It increases when you call register_worker(worker), which happens, like, somewhere.
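A small local sketch of the difference: a worker launched through addprocs goes through the full handshake and is counted by nprocs(), whereas a bare julia --worker started by hand is never connected to any master. This runs entirely on the local machine and assumes nothing beyond the Distributed standard library.

```julia
using Distributed

before = nprocs()      # counts the master plus all registered workers
pids   = addprocs(1)   # launches one local worker and completes the handshake
after  = nprocs()

println("nprocs: ", before, " -> ", after, ", new worker ids: ", pids)

rmprocs(pids)          # remove the worker again
```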
If I manually run
exec '/u/local/apps/julia/1.8.1/bin/julia' --worker
on the server it just sits there, which I guess is what it's supposed to do
If you start julia --worker
and then press Enter once more, the worker should reply on stdout with a line like
julia_worker:9794#192.0.2.23
which gives the TCP port and IPv4 address that the master process must then contact to talk to the worker outside SSH, because SSH is used only for that initial handshake. That IP address must be reachable from the machine running the master process.
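To check that reachability directly, the announcement line can be parsed and the address probed with a plain TCP connection from the master machine. This is a minimal sketch: parse_worker_line and can_connect are hypothetical helper names, and the 5-second timeout is an arbitrary choice.

```julia
using Sockets

# Parse a worker announcement such as "julia_worker:9794#192.0.2.23".
# Returns a (host, port) named tuple, or `nothing` if the line does not match.
function parse_worker_line(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(.+)$", strip(line))
    m === nothing && return nothing
    return (host = String(m.captures[2]), port = parse(Int, m.captures[1]))
end

# Hypothetical helper: try a plain TCP connection and give up after `timeout` seconds.
function can_connect(host::AbstractString, port::Integer; timeout::Float64 = 5.0)
    attempt = @async try
        close(connect(host, port))
        true
    catch
        false
    end
    return timedwait(() -> istaskdone(attempt), timeout) === :ok && fetch(attempt)
end

info = parse_worker_line("julia_worker:9794#192.0.2.23")
println(info)

# Usage, run on the master machine with the line your worker actually printed:
# can_connect(info.host, info.port)
```

If the probe fails, the ETIMEDOUT from addprocs is probably the same network problem rather than a quoting issue. addprocs also accepts tunnel=true, which routes the master-to-worker connection through SSH instead; whether that resolves this particular case is not confirmed in this thread.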
Last updated: Nov 06 2024 at 04:40 UTC