Stream: helpdesk (published)

Topic: ✔ Distributed connection timeout


DrChainsaw (Oct 21 2021 at 12:51):

Does anyone have experience debugging Distributed/Dagger jobs on a cluster?

My task works when I use a small number of workers (10-40), but when I try to use more than that, the whole thing pretty much always comes crashing down with "Peer N didn't connect to M in X seconds" at some point. Ideally, I would like to be in the 500 range for maximum parallelism.

Is it possible to log when and how much data is sent, so I can get an overview of whether I am overloading the network or something? I have tried to set things up so that the bulk of the data is aggregated in the same Thunk before handing off, but maybe this is not enough. Increasing the timeout by a factor of 10 didn't help, but maybe things scale in some exponential fashion that is not clear to me.
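For reference, a minimal sketch of one way the timeout can be raised. This assumes the error is governed by the `JULIA_WORKER_TIMEOUT` environment variable (in seconds, 60 by default, as far as I know) and that the workers are started locally, so they inherit the master's environment. The value 600 and the two local workers are only illustrative; the post does not say how the timeout was actually changed, and a cluster manager may need the variable passed to the workers explicitly.

```julia
using Distributed

# Assumption: JULIA_WORKER_TIMEOUT controls how long processes wait for
# each other to connect. It has to be set before the workers are launched.
ENV["JULIA_WORKER_TIMEOUT"] = "600"

# Locally launched workers inherit the environment of the master process.
# With a cluster manager this may not hold, so it is worth checking.
addprocs(2)

# Ask every process (master included) which value it actually sees.
timeouts = Dict(
    p => remotecall_fetch(() -> get(ENV, "JULIA_WORKER_TIMEOUT", "unset"), p)
    for p in procs()
)

for p in sort(collect(keys(timeouts)))
    println("process ", p, ": JULIA_WORKER_TIMEOUT = ", timeouts[p])
end
```

If some process reports `unset` or a different value, the longer timeout never reached it, which could explain why raising it appears to have no effect.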

Assuming network load is the problem, will concentrating workers on the same node help, or will data be sent over the network anyway when passed between workers?

DrChainsaw (Oct 22 2021 at 11:23):

Fwiw, I found an issue with the parallelism in my program where 95% of work ended up on a single worker. After fixing it I seem to be able to scale things up as much as I want to. Perhaps the worker was so overworked it could not respond?
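A rough sketch of how this kind of imbalance could be spotted: have each task report the id of the worker it ran on and tally the results. This uses `pmap` from Distributed as a stand-in rather than Dagger, and `work` is a hypothetical placeholder for the real task, so it only illustrates the idea and is not the actual program.

```julia
using Distributed

addprocs(4)  # illustrative; a real run would use the cluster's workers

# Hypothetical stand-in for the real task: do a bit of "work" and
# return the id of the process that executed it.
@everywhere work(i) = (sleep(0.01); myid())

# Run the tasks and record which worker handled each one.
owners = pmap(work, 1:200)

# Tally the number of tasks per worker.
counts = Dict{Int,Int}()
for id in owners
    counts[id] = get(counts, id, 0) + 1
end

# Print each worker's share; a heavily skewed share hints at an imbalance.
for id in sort(collect(keys(counts)))
    share = round(100 * counts[id] / length(owners); digits=1)
    println("worker ", id, ": ", counts[id], " tasks (", share, "%)")
end
```

With Dagger the same idea should carry over by returning `myid()` from the function run in each Thunk, although the scheduler decides placement there, so the numbers may look different.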

Notification Bot (Oct 22 2021 at 11:24):

DrChainsaw has marked this topic as resolved.


Last updated: Nov 06 2024 at 04:40 UTC