Stream: helpdesk (published)

Topic: Heisenbug from hell


DrChainsaw (Jun 07 2021 at 16:52):

The program randomly terminates without any output whatsoever: no stacktrace, no "Unreachable reached" message, no segfault.

Has anyone seen anything like this? Are there any lower-level debugging tools that could help pinpoint it?

It seems to happen randomly while Zygote and CUDA are running the first backpropagation through normal Flux building blocks. If it gets through that phase once, things seem to work for the duration of the session.

My computer has spent the last couple of days trying to narrow down the problem by running smaller parts of the program in a shell loop and recording how often each one terminates without reaching the end:

using Serialization  # needed for serialize/deserialize

function runtest(f, w)
    pname = "errors"
    mkpath(pname)
    fname = joinpath(pname, string(f, '_', w, ".jls"))
    current_errors = isfile(fname) ? deserialize(fname) : 0
    @info "Test $fname, current_errors: $current_errors"
    # Bump the persisted count up front; it is only restored in the finally
    # block, so a process that dies abnormally leaves the +1 behind on disk.
    serialize(fname, current_errors + 1)
    try
        f(w)
    catch e
        @warn "  Threw $e"
    finally
        @info "  Completed!"
        serialize(fname, current_errors)
    end
end
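
For reading the results back afterwards, something like this minimal sketch works (the summarize helper is illustrative, assuming the errors directory layout used by runtest above):

using Serialization

# Print the persisted abnormal-exit count for each test case.
function summarize(pname = "errors")
    for fname in readdir(pname; join = true)
        endswith(fname, ".jls") || continue
        println(basename(fname), " => ", deserialize(fname), " abnormal exits")
    end
end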

But I'm not getting anything that I could turn into an actionable issue. Here is the catch from the last 24 hours of fishing:

# In powershell
for ($i = 0; $i -le 10000; $i++) {
    # Other similar configs
    julia --project="." -e 'include(\"ZCtest.jl\"); ZCtest.runtest(ZCtest.dsmresnet, ZCtest.cpu)'
    julia --project="." -e 'include(\"ZCtest.jl\"); ZCtest.runtest(ZCtest.dsmresnet, ZCtest.gpu)'
    julia --project="." -e 'include(\"ZCtest.jl\"); ZCtest.runtest(ZCtest.jresnet_graph, ZCtest.cpu)'
    # Other similar configs
}

[ Info: Test errors\dsmresnet_cpu.jls, current_errors: 0
[ Info:   Completed!
[ Info: Test errors\dsmresnet_gpu.jls, current_errors: 0
[ Info: Test errors\jresnet_graph_cpu.jls, current_errors: 0
[ Info:   Completed!

[ Info: Test errors\dsmresnet_cpu.jls, current_errors: 0
[ Info:   Completed!
[ Info: Test errors\dsmresnet_gpu.jls, current_errors: 1
[ Info: Test errors\jresnet_graph_cpu.jls, current_errors: 0
[ Info:   Completed!

Note that no "Completed!" follows the dsmresnet_gpu runs above: the process died before the finally block could restore the count, which is why the second run reads current_errors: 1.

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)
Environment:
  JULIA_DEPOT_PATH = E:/Programs/julia/.julia
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 1

julia> CUDA.version()
v"11.3.0"

mbaz (Jun 07 2021 at 17:27):

One possibility might be getting an rr trace: https://julialang.org/blog/2020/05/rr/
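
If I remember the post right, recent Julia binaries can record a failing run directly with the --bug-report=rr flag (Linux only), e.g.:

julia --bug-report=rr --project="." -e 'include("ZCtest.jl"); ZCtest.runtest(ZCtest.dsmresnet, ZCtest.gpu)'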

Brian Chen (Jun 07 2021 at 17:33):

For clarification, are you trying to run the first gradient pass multiple times until an error is encountered or running the script end-to-end? Can the issue be replicated if you extract just that part and feed in a dummy input?

Brian Chen (Jun 07 2021 at 17:34):

Likewise, is there any chance Windows is OOM-killing the process silently?

DrChainsaw (Jun 07 2021 at 18:33):

@mbaz rr would be useful to try, but does it work on Windows nowadays? I have WSL2, but I have not yet dared to jump through the hoops to get CUDA working in it. I have never seen the crash trigger when not using CUDA.

@Brian Chen
Yes, what I'm doing now is basically running just one of a few different things (e.g. a single forward pass, one or a few gradients of the loss, a single Jacobian, etc., all with dummy input) and then exiting. Is that what you meant by "that part"? If the program does not finish, the error count for the test case is incremented by one. The shell loop keeps running test cases even after abnormal failures, so the whole thing does not stop at the first error.

A watchdog of some kind is definitely something that has crossed my mind, but I haven't been able to figure out if there is one in Windows. Fwiw, I have some intentionally too-large-to-fit jobs in the suite, and they throw OOM exceptions with 100% (so far) reliability. The test that triggers the failure computes the gradient of the loss, but there are other tests that compute the full Jacobian of the whole model; they use more memory (I think) and complete successfully (no OOM exception either).

I remember that a long time ago there was a GPU watchdog in Windows (TDR) that one needed to disable, but I think one notices when it kicks in (the screen flickers and a message appears). Also, GPU utilization is very low during the initial backpropagation, since the time is spent building kernels and adjoints (I assume).
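
To rule out silent OOM more directly, I suppose one could log free memory right before each test case; a minimal sketch (the log_memory helper is mine, not part of the test suite):

using CUDA

# Log host free memory and GPU memory usage just before running a test.
function log_memory()
    @info "Host free memory: $(Sys.free_memory() ÷ 2^20) MiB"
    CUDA.functional() && CUDA.memory_status()  # prints GPU memory usage
end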

Here is the code for the test that failed once in 24 hours.

# Assumes `using Flux` (Conv, BatchNorm, MaxPool, SkipConnection, Chain, Dense, relu).
function resnet()
    # conv -> batchnorm block, optionally followed by pooling
    function cbr(insize, outsize, maybepool = MaxPool((2,2); stride=(2,2)))
        c = Conv((3,3), insize => outsize; pad=1, bias=false)
        b = BatchNorm(outsize, relu)
        return c, b, maybepool
    end

    # two conv-bn blocks wrapped in a residual connection
    function resblock(insize)
        res1 = cbr(insize, insize, identity)
        res2 = cbr(insize, insize, identity)
        return SkipConnection(Chain(res1..., res2...), +)
    end

    Chain(
        cbr(3, 64, identity)...,
        cbr(64, 128)...,
        resblock(128),
        cbr(128, 256)...,
        cbr(256, 512)...,
        resblock(512),
        GlobalMaxPool(),
        Flux.flatten,
        Dense(512, 10; bias=false),
        x -> convert(eltype(x), 0.125) .* x   # fixed output scaling
    )
end

It is run like this:

dsmresnet(w) = dsmtest(resnet(), w)

function dsmtest(l, w)
    for _ in 1:10
        dsmtest(l, ones(Float32, 32, 32, 3, 512), Flux.onehotbatch(rand(0:9, 512), 0:9), w)
    end
end

# Wrap model and data with w (cpu or gpu), then take the gradient of the loss.
dsmtest(l, d, o, w) = dsmtest(w(l), w(d), w(o))
dsmtest(l, d, o) = gradient(() -> Flux.logitcrossentropy(l(d), o), params(l))

Sorry for the messy code; I have a lot of very similar tests. In case it is not clear, w (for wrap :/) is either gpu or cpu.
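
Put together, the failing case boils down to something like this condensed sketch of the pieces above (with w = gpu):

using Flux, CUDA

m = gpu(resnet())                               # model from above, on the GPU
x = gpu(ones(Float32, 32, 32, 3, 512))          # dummy input batch
y = gpu(Flux.onehotbatch(rand(0:9, 512), 0:9))  # dummy one-hot labels
gradient(() -> Flux.logitcrossentropy(m(x), y), params(m))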

mbaz (Jun 07 2021 at 18:52):

Oh, I missed that you're on Windows, sorry. Yeah, I think rr works only on Linux.

DrChainsaw (Jun 07 2021 at 20:03):

Ok, I just realized that it is possible to see exit codes in Windows too :blush:

I seem to be getting a mixture of 0xC0000005 (access violation/segfault?) and 0xC00000FF (bad function table?) when running the full program. Does anyone know the significance of them not generating the normal "please submit a bug report with this message" stacktrace? Potential hardware fault?
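
In case anyone else ends up fishing like this: exit codes can also be captured from Julia itself by running each test case as a subprocess, e.g. this sketch (reusing the earlier command):

# Run one test case as a subprocess and record its exit code.
code = "include(\"ZCtest.jl\"); ZCtest.runtest(ZCtest.dsmresnet, ZCtest.gpu)"
cmd = `julia --project=. -e $code`
p = run(ignorestatus(cmd))   # ignorestatus: don't throw on a nonzero exit
@info "Finished" p.exitcode  # 0xC0000005 shows up here as 3221225477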

