Stream: helpdesk (published)

Topic: Distributed connection timeout


DrChainsaw (Oct 21 2021 at 12:51):

Does anyone have experience with debugging Distributed/Dagger jobs on a cluster?

My task works when I use a small number of workers (10-40), but when I try to use more than that it pretty much always comes crashing down at some point with "Peer N didn't connect to M in X seconds". Ideally I would like to be in the 500-worker range for maximum parallelism.

Is it possible to log when and how much data is sent, so I can get an overview of whether I am overloading the network? I have tried to set things up so that the bulk of the data is aggregated in the same Thunk before it is handed off, but maybe this is not enough. Increasing the timeout by a factor of 10 didn't help, but maybe things scale exponentially in some way that is not clear to me.
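For reference, a minimal sketch of one way such a timeout can be raised. It assumes the error comes from Distributed's connection wait, which reads its limit (in seconds, default 60) from the `JULIA_WORKER_TIMEOUT` environment variable. The value 600 is only an illustration of "a factor of 10", and `Distributed.worker_timeout` is an unexported internal function used here just to check the setting.

```julia
using Distributed

# Assumption: the timeout in question is the one Distributed reads from
# JULIA_WORKER_TIMEOUT (seconds; the default is 60).
# Set it before any workers are started.
ENV["JULIA_WORKER_TIMEOUT"] = "600"  # illustrative value: 10x the default

# Unexported internal helper; it parses the variable each time it is called.
println("worker timeout in seconds: ", Distributed.worker_timeout())
```

Each worker reads the variable from its own environment, so it probably also has to be set wherever the workers are launched. Whether it is propagated depends on the cluster manager.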

Assuming network load is the problem, will concentrating workers on the same node help, or will data be sent over the network anyway when it is passed between workers?


Last updated: Oct 02 2023 at 04:34 UTC