I have a julia script that somehow requires 60+ gb of memory. However, running
varinfo(imported=true,sortby=:size,all=true)
shows something which adds up to 100mb. If I turn recursive=true on in varinfo, the process allocates an extra 40gb and runs out of memory...
How can I find out how this memory is wasted? Are there any "usual suspects"?
I'd guess you may have lots of intermediate allocations that are freed, but the garbage collector isn't keeping up. You could try @allocated on different pieces of the code, or annotate the code with TimerOutputs.jl
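A minimal sketch of the @allocated suggestion, assuming the script can be split into separately callable pieces. The functions build_matrix and column_norms are hypothetical stand-ins, not part of the script discussed here:

```julia
# Hypothetical stand-ins for two stages of a larger script.
build_matrix(n) = rand(n, n)
column_norms(A) = [sum(abs2, col) for col in eachcol(A)]

# Warm up first so that compilation allocations are not counted.
A_small = build_matrix(8)
column_norms(A_small)

A = build_matrix(1_000)

# @allocated returns the number of bytes allocated while evaluating the expression.
bytes_build = @allocated build_matrix(1_000)
bytes_norms = @allocated column_norms(A)

println("build_matrix: ", bytes_build, " bytes")
println("column_norms: ", bytes_norms, " bytes")
```

Comparing the numbers per stage gives a rough idea of which part of the code produces the most temporary garbage.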
I'm not sure. If I run just half of the script and get back REPL access, then after GC.gc() there shouldn't be any intermediate variables left, right? Yet when I do this, the memory usage remains ridiculously high. Nothing is running, and the variables I see in varinfo are quite modest in size.
it's also bugging me that varinfo goes and takes up 40+ gigabytes before crashing.
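One way to narrow this down is to compare what the garbage collector considers live with what the operating system reports for the process. The following is only a sketch: churn is a hypothetical stand-in for the real workload, and Base.gc_live_bytes and Base.format_bytes are unexported helpers, so their behaviour may differ between Julia versions.

```julia
# Allocate and drop a temporary (about 40 MB) to mimic intermediate work.
function churn()
    tmp = zeros(5_000_000)
    return sum(tmp)
end
churn()

GC.gc()  # full collection; unreachable temporaries are freed from Julia's point of view

live = Base.gc_live_bytes()  # bytes the GC still considers reachable
peak = Sys.maxrss()          # peak resident set size reported by the OS

println("GC live bytes: ", Base.format_bytes(live))
println("peak RSS:      ", Base.format_bytes(peak))
```

If the live-bytes figure is small while the process still holds a lot of memory, the memory is probably not referenced by any Julia object and the cause lies elsewhere.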
Even if no variables are leaked into the REPL? Generated code is not GCd, do you by any chance do a lot of codegen?
In either case, it would be nice if you could share the script as a MWE of a potential GC bug
at the moment it's not very minimal. There is a bit of generated code; can I check the size of the compiled code? There are a few variables leaked into the repl, but those I can see with varinfo, and they amount to 100mb. I can try to reduce the script a bit, and then share it.
I modified varinfo to spit out the intermediary results and I see lines like:
MPSKit.MPSKit.LinearAlgebra.MPSKit.MPSKit.LinearAlgebra.BLAS.MPSKit.MPSKit.LinearAlgebra.MPSKit.MPSKit.LinearAlgebra.BLAS.LinearAlgebra.*
which implies that it revisits modules that it already visited, so it's pretty bugged.
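To illustrate the revisiting problem, here is a rough sketch of a module walk that visits each module only once. It is not how varinfo is implemented; binding_sizes! and VarinfoDemo are hypothetical names introduced for this example, and Base.summarysize is the only sizing function used.

```julia
# Sketch: collect (module, name, bytes) for every non-module binding, visiting each module once.
function binding_sizes!(out, m::Module, seen::Set{Module}=Set{Module}())
    m in seen && return out          # already visited: do not walk it again
    push!(seen, m)
    for name in names(m; all=true, imported=true)
        isdefined(m, name) || continue
        val = getfield(m, name)
        if val isa Module
            # Only descend into true submodules, not into imported packages or m itself.
            parentmodule(val) === m && binding_sizes!(out, val, seen)
        else
            push!(out, (m, name, Base.summarysize(val)))
        end
    end
    return out
end

# Hypothetical demo module holding one large global.
module VarinfoDemo
const buffer = zeros(100_000)
end

sizes = binding_sizes!(Tuple{Module,Symbol,Int}[], VarinfoDemo)
sort!(sizes; by=last, rev=true)
foreach(println, sizes[1:min(3, end)])
```

The seen set and the parentmodule check together prevent paths like MPSKit.LinearAlgebra.MPSKit from being followed.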
Is the script performing disk IO? If so, is your issue similar to https://github.com/JuliaData/CSV.jl/issues/850?
a tiny bit of disk io, so it's not similar
I wrote a dirty-patched version of varinfo (derp.txt), and the total memory usage known to julia is estimated at 600 MB
MPSKit.MPSKit.LinearAlgebra.MPSKit
This does seem troubling. How does the binding LinearAlgebra.MPSKit come to exist?
That turned out to be a problem in varinfo itself, which I think I have patched locally. However, I'm not entirely sure how the function was meant to work (why did it filter out Base,Main and Core), so I'm not sure if I should open a pull request
there was also a "mistake" in base, summarysize on bitarrays failed as size appears undefined
whut
julia> total = 0; for (k,v) in SUNRepresentations.CGCCACHE[(3,Float64)]
           total += Base.summarysize(v)
       end
julia> total
2184221472
julia> Base.summarysize(SUNRepresentations.CGCCACHE[(3,Float64)])
463032
I understand summarysize is a lower bound, but this seems a bit too crazy
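The same comparison can be tried on a small, self-contained container to check whether a given Julia version behaves sensibly. This is a sketch with a hypothetical cache, not the SUNRepresentations data:

```julia
# Hypothetical cache standing in for a dictionary of large arrays.
cache = Dict(i => rand(1_000) for i in 1:100)

# Sum of the sizes of the stored values, measured one by one.
per_value = sum(Base.summarysize(v) for v in values(cache))

# Size of the container as a whole, which should also reach the values.
whole = Base.summarysize(cache)

println("sum over values: ", per_value, " bytes")
println("whole container: ", whole, " bytes")
```

When the values do not share memory, one would expect the whole-container figure to be at least as large as the per-value sum; a much smaller figure, as in the session above, suggests that summarysize is not reaching the values.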
aha, perhaps this is related https://github.com/JuliaLang/julia/issues/41941
This does seem related and it was fixed in nightly already - https://github.com/JuliaLang/julia/pull/41492
Though I'm not sure that will fix all your varinfo problems here.
indeed it was! I changed varinfo a bit; I'm not sure if I should open a pull request, as it does things differently from the other varinfo, but in combination with the latest julia it now shows nicely where the memory usage is coming from! :D
the varinfo thing has apparently also been fixed: https://github.com/JuliaLang/julia/pull/42061
why did it filter out Base,Main and Core
I don't know either (author of that fix here), but I did give it a lovely fake name in my patch :)
thanks a lot for the fix, the code is now also much nicer to read :) we also resolved the underlying issue in our own code, and now everything works again with much less memory usage
I'm on julia 1.8.0 and am running into the same issue once again. Julia takes up 34.1G, the output of varinfo is
julia> varinfo(all=true,minsize=1024,imported=true,sortby=:size)
name                    size summary
–––––––––––––––– ––––––––––– –––––––––––––––––––––––––––––––
Base                         Module
Core                         Module
Main                         Module
TensorKit        122.409 MiB Module
LinearAlgebra      4.727 MiB Module
JLD2               3.486 MiB Module
MPSKit             2.533 MiB Module
TensorOperations   2.154 MiB Module
KrylovKit          1.668 MiB Module
Test               1.205 MiB Module
Printf             1.092 MiB Module
OptimKit         973.529 KiB Module
DelimitedFiles   957.035 KiB Module
Parameters       952.041 KiB Module
FastClosures     879.606 KiB Module
InteractiveUtils 326.286 KiB Module
err                4.919 KiB 1-element Base.ExceptionStack
##meta#58          3.129 KiB IdDict{Any, Any} with 2 entries
ans                1.469 KiB Markdown.MD
How do I even start debugging such a thing? The script takes a while to run before reaching this level of memory usage, so it's not easy to make a minimal working example
have you done any allocation/performance optimization on your code? Are you doing everything in global scope or are you working in functions?
Julia 1.8.0 will use a lot more memory for LinearAlgebra
(but it should be faster)
Sukera said:
have you done any allocation/performance optimization on your code? Are you doing everything in global scope or are you working in functions?
I have not - the code heavily relies on krylov eigenvalue solvers which temporarily may allocate a lot (it's pretty much all you see if you run pprof, it's the dominating contribution). However, after the simulation is done this shouldn't matter - GC.gc() should clear it up. The simulation uses functions.
I was hoping that whatever it will turn out to be would turn up in varinfo(all=true,imported=true,recursive=true)
Michael Fiano said:
Julia 1.8.0 will use a lot more memory for LinearAlgebra
What exactly changed there? Varinfo claims it doesn't really use too much:
julia> varinfo(LinearAlgebra,all=true,minsize=1024,imported=true,sortby=:size,recursive=true)
name                             size summary
––––––––––––––––––––––––– ––––––––––– –––––––––––––––––––––––––––––––––
Broadcast                   6.830 MiB Module
LinearAlgebra               4.682 MiB Module
BLAS.LinearAlgebra          4.682 MiB Module
BLAS                        1.216 MiB Module
BLAS.BLAS                   1.216 MiB Module
LAPACK                    997.065 KiB Module
LAPACK.LAPACK             997.065 KiB Module
##meta#58                 359.730 KiB IdDict{Any, Any} with 174 entries
LAPACK.##meta#58          131.451 KiB IdDict{Any, Any} with 90 entries
BLAS.##meta#58             82.208 KiB IdDict{Any, Any} with 56 entries
Libdl                       2.386 KiB Module
BlasHessenbergQ             1.297 KiB UnionAll
StridedMaybeAdjOrTransMat   1.281 KiB UnionAll
AdjOrTransStridedMat        1.094 KiB UnionAll
it's possible that you're running into the GC not returning the memory to the system after it was allocated
shouldn't matter too much, as it's all virtual memory
gimme a sec, there was an issue about this...
A lot more BLAS threads by default
Tune it with BLAS.set_num_threads() if you need to
I set the blas threads to 1, as they conflict with julia multithreading
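For reference, a short sketch of pinning BLAS to a single thread and checking the result. It assumes a Julia version where BLAS.get_num_threads is available (1.6 or later):

```julia
using LinearAlgebra

BLAS.set_num_threads(1)  # keep BLAS single-threaded alongside Julia's own threads

println("BLAS threads:  ", BLAS.get_num_threads())
println("Julia threads: ", Threads.nthreads())
```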
there we go https://github.com/JuliaLang/julia/issues/30653
and now that I read that issue, this one as well https://github.com/JuliaLang/julia/issues/42566
Sukera said:
there we go https://github.com/JuliaLang/julia/issues/30653
thanks a ton!!!!
ccall(:malloc_trim, Cvoid, (Cint,), 0)
clears it right up! So it's nothing inherently wrong in my code, I'm perfectly fine with julia not returning that memory to the os
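A slightly more defensive version of that call might look like the sketch below. It assumes Linux with glibc, since malloc_trim is a glibc extension; trim_heap is a hypothetical helper name. The C prototype is int malloc_trim(size_t pad), so the sketch declares the argument as Csize_t and keeps the integer return value rather than discarding it.

```julia
# Sketch: ask glibc to hand freed heap pages back to the OS.
# Assumption: Linux with glibc; malloc_trim is not available in other C libraries.
function trim_heap()
    GC.gc()  # let Julia free whatever is unreachable first
    Sys.islinux() || return nothing
    # C prototype: int malloc_trim(size_t pad); returns 1 if memory was released, else 0
    return ccall(:malloc_trim, Cint, (Csize_t,), 0)
end

released = trim_heap()
println("malloc_trim result: ", released)
```

On other platforms the helper returns nothing and does not attempt the call.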
I don't think that would be Julia
That would be your libc implementation
indeed
julia has already called free on that memory, malloc just then was like "nah I don't need it yet, you keep it" in its internal table
what lots of people don't know is that malloc is not just a simple table mapping pointers to ranges of memory, but a somewhat complex system for managing allocations systemwide. And it's optimized for returning fast, so if you have another metric to optimize for, it may not be optimal
there's still something weird going on - no other threads are running, yet I see a fluctuating summarysize. Probably nothing to worry about?
julia> for i in 1:10; @show Base.summarysize(envs)/(1024*1024*1024); end;
Base.summarysize(envs) / (1024 * 1024 * 1024) = 0.6670088469982147
Base.summarysize(envs) / (1024 * 1024 * 1024) = 3.7884459123015404
Base.summarysize(envs) / (1024 * 1024 * 1024) = 2.5368779748678207
Base.summarysize(envs) / (1024 * 1024 * 1024) = 0.6451764479279518
Base.summarysize(envs) / (1024 * 1024 * 1024) = 2.2040650248527527
Base.summarysize(envs) / (1024 * 1024 * 1024) = 2.928450770676136
Base.summarysize(envs) / (1024 * 1024 * 1024) = 0.72379120439291
Base.summarysize(envs) / (1024 * 1024 * 1024) = 0.8031216561794281
Base.summarysize(envs) / (1024 * 1024 * 1024) = 3.0689545273780823
Base.summarysize(envs) / (1024 * 1024 * 1024) = 2.5368198826909065
Last updated: Nov 06 2024 at 04:40 UTC