Stream: helpdesk (published)

Topic: Heisenbug from hell


DrChainsaw (Jun 07 2021 at 16:52):

The program randomly terminates without any output whatsoever: no stacktrace, no "Unreachable reached" message, no segfault.

Has anyone seen anything like this? Are there any lower-level debugging tools that could help pinpoint it?

It seems to happen randomly while Zygote and CUDA are running the first backpropagation through normal Flux building blocks. If it gets through that phase once, things seem to work for the duration of the session.

My computer has spent the last couple of days trying to narrow down the problem by running smaller parts of the program in a shell loop and recording how often each one terminates without reaching the end:

using Serialization  # needed for serialize/deserialize

function runtest(f, w)
    pname = "errors"
    mkpath(pname)
    fname = joinpath(pname, string(f, '_', w, ".jls"))
    current_errors = isfile(fname) ? deserialize(fname) : 0
    @info "Test $fname, current_errors: $current_errors"
    # Bump the persisted count up front; it is only restored in the finally
    # block, so a process that dies abnormally leaves the +1 behind on disk.
    serialize(fname, current_errors + 1)
    try
        f(w)
    catch e
        @warn "  Threw $e"
    finally
        @info "  Completed!"
        serialize(fname, current_errors)
    end
end
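
For reading the results back afterwards, something like this minimal sketch works (the summarize helper is illustrative, assuming the errors directory layout used by runtest above):

using Serialization

# Print the persisted abnormal-exit count for each test case.
function summarize(pname = "errors")
    for fname in readdir(pname; join = true)
        endswith(fname, ".jls") || continue
        println(basename(fname), " => ", deserialize(fname), " abnormal exits")
    end
end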

But I'm not getting anything that I could turn into an actionable issue. Here is the catch from the last 24 hours of fishing:

# In powershell
for ($i = 0; $i -le 10000; $i++) {
    # Other similar configs
    julia --project="." -e 'include(\"ZCtest.jl\"); ZCtest.runtest(ZCtest.dsmresnet, ZCtest.cpu)'
    julia --project="." -e 'include(\"ZCtest.jl\"); ZCtest.runtest(ZCtest.dsmresnet, ZCtest.gpu)'
    julia --project="." -e 'include(\"ZCtest.jl\"); ZCtest.runtest(ZCtest.jresnet_graph, ZCtest.cpu)'
    # Other similar configs
}

[ Info: Test errors\dsmresnet_cpu.jls, current_errors: 0
[ Info:   Completed!
[ Info: Test errors\dsmresnet_gpu.jls, current_errors: 0
[ Info: Test errors\jresnet_graph_cpu.jls, current_errors: 0
[ Info:   Completed!

[ Info: Test errors\dsmresnet_cpu.jls, current_errors: 0
[ Info:   Completed!
[ Info: Test errors\dsmresnet_gpu.jls, current_errors: 1
[ Info: Test errors\jresnet_graph_cpu.jls, current_errors: 0
[ Info:   Completed!

Note that no "Completed!" follows the dsmresnet_gpu runs above: the process died before the finally block could restore the count, which is why the second run reads current_errors: 1.

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)
Environment:
  JULIA_DEPOT_PATH = E:/Programs/julia/.julia
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 1

julia> CUDA.version()
v"11.3.0"

mbaz (Jun 07 2021 at 17:27):

One possibility might be getting an rr trace: https://julialang.org/blog/2020/05/rr/
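
If I remember the post right, recent Julia binaries can record a failing run directly with the --bug-report=rr flag (Linux only), e.g.:

julia --bug-report=rr --project="." -e 'include("ZCtest.jl"); ZCtest.runtest(ZCtest.dsmresnet, ZCtest.gpu)'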

Brian Chen (Jun 07 2021 at 17:33):

For clarification, are you trying to run the first gradient pass multiple times until an error is encountered or running the script end-to-end? Can the issue be replicated if you extract just that part and feed in a dummy input?

Brian Chen (Jun 07 2021 at 17:34):

Likewise, is there any chance Windows is OOM-killing the process silently?

DrChainsaw (Jun 07 2021 at 18:33):

@mbaz rr would be useful to try, but does it work on Windows nowadays? I have WSL2, but I have not yet dared to jump through the hoops to get CUDA working in it. I have never seen the crash trigger when not using CUDA.

@Brian Chen
Yes, what I'm doing now is basically running just one of a few different things (e.g. a single forward pass, one or a few gradients of the loss, a single Jacobian, etc., all with dummy input) and then exiting. Is that what you meant by "that part"? If the program does not finish, the error count for the test case is incremented by one. The shell loop keeps running test cases even after abnormal failures, so the whole thing does not stop at the first error.

A watchdog of some kind is definitely something that has crossed my mind, but I haven't been able to figure out if there is one in Windows. Fwiw, I have some intentionally too-large-to-fit jobs in the suite, and they throw OOM exceptions with 100% (so far) reliability. The test that triggers the failure computes the gradient of the loss, but there are other tests that compute the full Jacobian of the whole model; they use more memory (I think) and complete successfully (no OOM exception either).

I remember that a long time ago there was a GPU watchdog in Windows (TDR) that one needed to disable, but I think one notices when it kicks in (the screen flickers and a message appears). Also, GPU utilization is very low during the initial backpropagation, since the time is spent building kernels and adjoints (I assume).
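
To rule out silent OOM more directly, I suppose one could log free memory right before each test case; a minimal sketch (the log_memory helper is mine, not part of the test suite):

using CUDA

# Log host free memory and GPU memory usage just before running a test.
function log_memory()
    @info "Host free memory: $(Sys.free_memory() ÷ 2^20) MiB"
    CUDA.functional() && CUDA.memory_status()  # prints GPU memory usage
end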

Here is the code for the test that failed once in 24 hours.

# Assumes `using Flux` (Conv, BatchNorm, MaxPool, SkipConnection, Chain, Dense, relu).
function resnet()
    # conv -> batchnorm block, optionally followed by pooling
    function cbr(insize, outsize, maybepool = MaxPool((2,2); stride=(2,2)))
        c = Conv((3,3), insize => outsize; pad=1, bias=false)
        b = BatchNorm(outsize, relu)
        return c, b, maybepool
    end

    # two conv-bn blocks wrapped in a residual connection
    function resblock(insize)
        res1 = cbr(insize, insize, identity)
        res2 = cbr(insize, insize, identity)
        return SkipConnection(Chain(res1..., res2...), +)
    end

    Chain(
        cbr(3, 64, identity)...,
        cbr(64, 128)...,
        resblock(128),
        cbr(128, 256)...,
        cbr(256, 512)...,
        resblock(512),
        GlobalMaxPool(),
        Flux.flatten,
        Dense(512, 10; bias=false),
        x -> convert(eltype(x), 0.125) .* x   # fixed output scaling
    )
end

It is run like this:

dsmresnet(w) = dsmtest(resnet(), w)

function dsmtest(l, w)
    for _ in 1:10
        dsmtest(l, ones(Float32, 32, 32, 3, 512), Flux.onehotbatch(rand(0:9, 512), 0:9), w)
    end
end

# Wrap model and data with w (cpu or gpu), then take the gradient of the loss.
dsmtest(l, d, o, w) = dsmtest(w(l), w(d), w(o))
dsmtest(l, d, o) = gradient(() -> Flux.logitcrossentropy(l(d), o), params(l))

Sorry for the messy code; I have a lot of very similar tests. In case it is not clear, w (for wrap :/) is either gpu or cpu.
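
Put together, the failing case boils down to something like this condensed sketch of the pieces above (with w = gpu):

using Flux, CUDA

m = gpu(resnet())                               # model from above, on the GPU
x = gpu(ones(Float32, 32, 32, 3, 512))          # dummy input batch
y = gpu(Flux.onehotbatch(rand(0:9, 512), 0:9))  # dummy one-hot labels
gradient(() -> Flux.logitcrossentropy(m(x), y), params(m))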

mbaz (Jun 07 2021 at 18:52):

Oh, I missed that you're on Windows, sorry. Yeah, I think rr works only on Linux.

DrChainsaw (Jun 07 2021 at 20:03):

Ok, I just realized that it is possible to see exit codes in Windows too :blush:

I seem to be getting a mixture of 0xC0000005 (access violation/segfault?) and 0xC00000FF (bad function table?) when running the full program. Does anyone know the significance of them not generating the normal "please submit a bug report with this message" stacktrace? Potential hardware fault?
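
In case anyone else ends up fishing like this: exit codes can also be captured from Julia itself by running each test case as a subprocess, e.g. this sketch (reusing the earlier command):

# Run one test case as a subprocess and record its exit code.
code = "include(\"ZCtest.jl\"); ZCtest.runtest(ZCtest.dsmresnet, ZCtest.gpu)"
cmd = `julia --project=. -e $code`
p = run(ignorestatus(cmd))   # ignorestatus: don't throw on a nonzero exit
@info "Finished" p.exitcode  # 0xC0000005 shows up here as 3221225477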

