Stream: helpdesk (published)

Topic: addprocs hangs over ssh


view this post on Zulip Timothy (Apr 18 2021 at 08:24):

Hello!

I'm trying to make use of some remote CPU cores on a machine I can SSH to. idle-maltair.ucc is an SSH alias for a machine running Julia 1.6, just as I am.

From the documentation, it looks like addprocs(["idle-maltair.ucc"]; dir="/tmp", topology=:master_worker, multiplex=true) should work nicely; however, it doesn't.

ERROR: TaskFailedException

    nested task error: IOError: connect: connection timed out (ETIMEDOUT)

I've tried a bit of googling, but had no luck so far. Any help would be greatly appreciated.

view this post on Zulip Lasse Peters (Apr 23 2021 at 21:30):

I've seen similar timeouts when the connection is not open already. I enabled multiplexing directly in my ssh config and made sure to open a connection to the server before I connect via Julia.

https://discourse.julialang.org/t/addprocs-with-ssh-and-a-proxyjump-times-out/26647
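For anyone wanting to try the "open a connection first" step from Julia itself, here is a minimal sketch. It only builds the ssh command; nothing is run unless you uncomment the last line. The helper name master_connection_cmd, the control path, and the persist time are my own illustrative choices, not something from the setup described above. ControlMaster, ControlPath, ControlPersist, -f and -N are standard OpenSSH options.

```julia
# Hypothetical helper (not part of Distributed): build an ssh command that
# opens a persistent master connection in the background.
# control_path and persist are assumed example values.
function master_connection_cmd(host::AbstractString;
                               control_path::AbstractString = "~/.ssh/cm-%r@%h:%p",
                               persist::AbstractString = "10m")
    # Interpolated values are passed literally, so `~` and `%` need no quoting;
    # ssh itself expands them in ControlPath.
    return `ssh -o ControlMaster=auto -o ControlPath=$control_path -o ControlPersist=$persist -f -N $host`
end

cmd = master_connection_cmd("idle-maltair.ucc")
println(cmd)
# run(cmd)  # uncomment to actually open the connection before calling addprocs
```

For the ssh process that addprocs starts to reuse this connection, the same ControlPath would have to be visible to it, for example through ~/.ssh/config as described above.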

view this post on Zulip jar (Dec 10 2022 at 18:56):

I have this problem too. Did you come up with anything @Timothy ?

julia> run(`ssh $addr -- '/u/local/apps/julia/1.8.1/bin/julia --version '`);
julia version 1.8.1


julia> procs = addprocs([addr]; exename="/u/local/apps/julia/1.8.1/bin/julia", dir="/u/home/", sshflags="-vv")
debug2: channel_input_open_confirmation: channel 0: callback start
debug2: fd 3 setting TCP_NODELAY
debug2: client_session2_setup: id 0
debug1: Sending command: sh -l -c 'cd -- /u/home/
exec '\\''/u/local/apps/julia/1.8.1/bin/julia'\\'' --worker'
debug2: channel 0: request exec confirm 1
debug2: channel_input_open_confirmation: channel 0: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug2: channel 0: rcvd adjust 2097152
debug2: channel_input_status_confirm: type 99 id 0
debug2: exec request accepted on channel 0
debug2: channel 0: read<=0 rfd 4 len 0
debug2: channel 0: read failed
debug2: chan_shutdown_read: channel 0: (i0 o0 sock -1 wfd 4 efd 6 [write])
debug2: channel 0: input open -> drain
debug2: channel 0: ibuf empty
debug2: channel 0: send eof
debug2: channel 0: input drain -> closed
debug2: channel 0: rcvd eof
debug2: channel 0: output open -> drain
debug2: channel 0: obuf empty
debug2: chan_shutdown_write: channel 0: (i3 o1 sock -1 wfd 5 efd 6 [write])
debug2: channel 0: output drain -> closed
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug2: channel 0: rcvd close
debug2: channel 0: almost dead
debug2: channel 0: gc: notify user
debug2: channel 0: gc: user detached
debug2: channel 0: send close
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: client-session, nchannels 1
Transferred: sent 3060, received 3132 bytes, in 63.9 seconds
Bytes per second: sent 47.9, received 49.0
debug1: Exit status 1


ERROR: TaskFailedException

    nested task error: IOError: connect: connection timed out (ETIMEDOUT)

    caused by: IOError: connect: connection timed out (ETIMEDOUT)

view this post on Zulip Timothy (Dec 10 2022 at 19:13):

I never had any luck :(

view this post on Zulip Mosè Giordano (Dec 10 2022 at 19:57):

It feels like the exec line has way too many quotes

view this post on Zulip jar (Dec 10 2022 at 20:14):

Opened https://github.com/JuliaLang/julia/issues/47863

view this post on Zulip jar (Dec 10 2022 at 20:18):

If I try to run that command manually on the server

-bash-4.2$ sh -l -c 'cd -- /u/home/
> exec '\\''/u/local/apps/julia/1.8.1/bin/julia'\\'' --worker'
sh: line 1: /u/local/apps/julia/1.8.1/bin/julia\: No such file or directory

view this post on Zulip jar (Dec 10 2022 at 20:21):

I'm not sure if that's actually what it's doing or that's just some kind of string representation.

view this post on Zulip Fredrik Ekre (Dec 10 2022 at 20:36):

Maybe you need to specify the remote shell? Edit: Never mind, I see from the last output that it is bash, so the default option should work. Edit2: But maybe sh isn't bash?

view this post on Zulip jar (Dec 10 2022 at 20:42):

On the server,

$ realpath $(which sh)
/usr/bin/bash
$ sh
sh-4.2$ help
GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)

view this post on Zulip jar (Dec 10 2022 at 20:51):

I tried overriding this line https://github.com/JuliaLang/julia/blob/master/stdlib/Distributed/src/managers.jl#L298 to remove one layer of quoting by deleting the shell_escape_posixly() call, so that @show remotecmd looks like

remotecmd = `sh -l -c "cd -- /tmp
exec '/u/local/apps/julia/1.8.1/bin/julia' --worker"`

but addprocs still times out.

view this post on Zulip jar (Dec 10 2022 at 20:53):

If I manually run exec '/u/local/apps/julia/1.8.1/bin/julia' --worker on the server it just sits there, which I guess is what it's supposed to do

view this post on Zulip jar (Dec 10 2022 at 23:05):

I don't really know how to diagnose this

julia> Distributed.addprocs_locked(
           Distributed.SSHManager([addr]),
           exename="/u/local/apps/julia/1.8.1/bin/julia",
           dir="/tmp",
       )
ERROR: TaskFailedException

    nested task error: IOError: connect: connection timed out (ETIMEDOUT)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:1092
     [2] worker_from_id
       @ /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:1089 [inlined]
     [3] #remote_do#170
       @ /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:557 [inlined]
     [4] remote_do
       @ /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:557 [inlined]
     [5] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/managers.jl:692
     [6] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:603
     [7] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:544
     [8] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:484

    caused by: IOError: connect: connection timed out (ETIMEDOUT)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Sockets/src/Sockets.jl:529
     [2] connect
       @ /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Sockets/src/Sockets.jl:564 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/managers.jl:651
     [4] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/managers.jl:578
     [5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:599
     [6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:544
     [7] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:484
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:436
 [2] macro expansion
   @ ./task.jl:455 [inlined]
 [3] addprocs_locked(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, String, Tuple{Symbol, Symbol}, NamedTuple{(:exename, :dir), Tuple{String, String}}})
   @ Distributed /nix/store/ca4hhym3f57vpmhgvylvqp86cmz9gbis-julia-bin-1.8.1/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:490
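
The inner trace ends in connect_to_worker calling Sockets.connect, so what times out is a direct TCP connection from the master to the worker's host and port, not the ssh command. A rough way to check that kind of reachability with a short timeout is sketched below. can_connect is a hypothetical helper of mine, and the local listener only stands in for a worker so the snippet runs anywhere; in practice you would pass the worker's real address and port.

```julia
using Sockets

# Hypothetical helper (not part of Distributed): try a TCP connect and give up
# after `timeout` seconds instead of waiting for the OS-level ETIMEDOUT.
function can_connect(host::AbstractString, port::Integer; timeout::Real = 5.0)
    task = @async connect(host, port)
    # timedwait polls the callback and returns :ok or :timed_out
    timedwait(() -> istaskdone(task), timeout) === :ok || return false
    istaskfailed(task) && return false   # e.g. connection refused
    close(fetch(task))                   # connected: close the socket again
    return true
end

# Demo against a local listener standing in for a worker (assumed setup).
port, server = listenany(IPv4(127, 0, 0, 1), 9000)
println(can_connect("127.0.0.1", port; timeout = 2.0))
close(server)
```

If this returns false for the worker's address from the master machine, a firewall or unroutable address is a more likely culprit than the quoting of the remote command.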

view this post on Zulip Mosè Giordano (Dec 10 2022 at 23:23):

jar said:

If I manually run exec '/u/local/apps/julia/1.8.1/bin/julia' --worker on the server it just sits there, which I guess is what it's supposed to do

Does the main process see the new worker, though? If I start a Julia session locally, do using Distributed, and then start another process with julia --worker, the latter sits there but the former still shows nprocs() = 1. I don't really know how this stuff is supposed to work; I've never used Distributed.

view this post on Zulip jar (Dec 10 2022 at 23:30):

nprocs() (basically) just returns length(Distributed.PGRP.workers) so I don't think it could work like that

view this post on Zulip Mosè Giordano (Dec 10 2022 at 23:31):

shouldn't it increase when you add more workers?

view this post on Zulip jar (Dec 10 2022 at 23:34):

It increases when you register_worker(worker), which happens, like, somewhere

view this post on Zulip Markus Kuhn (Dec 14 2022 at 19:03):

jar said:

If I manually run exec '/u/local/apps/julia/1.8.1/bin/julia' --worker on the server it just sits there, which I guess is what it's supposed to do

If you start julia --worker and then press Enter once more, the worker should reply to stdout with a line like

julia_worker:9794#192.0.2.23

which gives the TCP port and IPv4 address that the master process now needs to contact in order to talk to the worker outside SSH, because SSH is only used for that initial handshake. That IP address must be reachable from the master.
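
To make the handshake line concrete, here is a small sketch that splits it into host and port. parse_worker_line is a hypothetical helper written for illustration and is not the parser Distributed uses internally; the regex simply follows the julia_worker:PORT#HOST shape shown above.

```julia
# Hypothetical helper (not Distributed's own parser): split the line a worker
# prints on startup into the host and port the master must connect to.
function parse_worker_line(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(.+)$", strip(line))
    m === nothing && error("not a worker handshake line: $line")
    return (host = String(m[2]), port = parse(Int, m[1]))
end

info = parse_worker_line("julia_worker:9794#192.0.2.23")
println(info.host, " ", info.port)   # prints: 192.0.2.23 9794
```

If that address cannot be reached directly from the master, addprocs has a documented tunnel=true keyword that asks it to connect to the worker through an SSH tunnel instead, e.g. addprocs([addr]; tunnel=true, exename=..., dir=...). Whether that resolves the timeouts reported in this thread is untested here.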


Last updated: Oct 02 2023 at 04:34 UTC