Hello,
I am trying to automate running many multi-week simulations. I want these simulations to be restartable after a reboot, so I am storing checkpoints and logs in a per-job output directory.
Normally each simulation job should have only one Julia process running it at a time. However, I would like to automatically detect when multiple processes are trying to write to the same output directory at the same time.
I'm not sure if this is a good idea but currently I am trying to have each process constantly append a byte to a file, then check if the file size is as expected. If two processes are running with the same output directory, the file size will be larger than expected.
I initially tried this with open("file_name", "a"), but this didn't work: multiple processes just overwrote each other's data.
So now I am using Base.Filesystem.open like so:
function main()
    flags = Base.Filesystem.JL_O_APPEND | Base.Filesystem.JL_O_CREAT | Base.Filesystem.JL_O_WRONLY
    perm = Base.Filesystem.S_IROTH | Base.Filesystem.S_IRGRP | Base.Filesystem.S_IWGRP | Base.Filesystem.S_IRUSR | Base.Filesystem.S_IWUSR
    detect_mult_runners_f = Base.Filesystem.open("detect-mult-runners", flags, perm)
    detect_mult_runners_i = Ref(filesize(detect_mult_runners_f))
    Timer(0.0; interval=1.0) do t
        write(detect_mult_runners_f, 0x41)
        flush(detect_mult_runners_f)
        detect_mult_runners_i[] += 1
        if filesize(detect_mult_runners_f) != detect_mult_runners_i[]
            @error "multiple runners are running this job, exiting"
            exit()
        end
    end
    sleep(1000)
end

main()
Is there a simpler way of locking an output directory against multiple Julia processes that is safe when a random power loss can occur?
Also, is my use of Base.Filesystem.open something that may break in a future Julia release?
Update: I am no longer using Base.Filesystem.open with O_APPEND, because O_APPEND is very broken on Linux when combined with pwrite; see the BUGS section at the end of https://man7.org/linux/man-pages/man2/pwrite.2.html . I am instead using the Pidfile module from the FileWatching standard library: https://docs.julialang.org/en/v1/stdlib/FileWatching/#Pidfile
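For anyone who lands here later, this is a rough sketch of how a Pidfile-based guard around a job might look. It is not my exact code: the helper name run_exclusively and the lock file name "runner.pid" are made up for illustration, and it assumes a Julia version where FileWatching.Pidfile is available as a standard library (1.9 or later, as far as I know).

```julia
using FileWatching.Pidfile  # provides mkpidlock

# Hypothetical helper: run `job()` only if no other runner currently holds
# the pidfile lock in `output_dir`. Returns `nothing` if the lock is taken.
function run_exclusively(job, output_dir::AbstractString)
    lock_path = joinpath(output_dir, "runner.pid")  # illustrative file name
    pidlock = try
        # wait=false: fail immediately instead of blocking until the lock frees up
        mkpidlock(lock_path; wait=false)
    catch err
        # Note: this also catches unrelated failures (e.g. a missing directory);
        # a real script would probably want to tell those apart.
        @error "could not lock output directory, another runner may be active" output_dir exception=err
        return nothing
    end
    try
        return job()
    finally
        close(pidlock)  # releases the lock and removes the pidfile
    end
end
```

Because job is the first argument, this can be called with a do block, e.g. run_exclusively("path/to/output") do ... end. For the power-loss case, a pidfile left behind by a dead process is only cleaned up if the stale_age keyword of mkpidlock is set (it is disabled by default), so that value would need to be chosen to suit the job.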
Nathan Zimmerberg has marked this topic as resolved.
Last updated: Nov 22 2024 at 04:41 UTC