I have a task which spends 10% of its time on GC when executed on a single thread, but as soon as I thread it, 70% of the execution time is GC. Does anyone have general suggestions on how I could improve this situation?
unfortunately, the only "real" thing you can do right now is to allocate less in the threaded sections
reason being that GC runs stop other threads from executing as well
so lots of GC in a threaded section also means a large slowdown
That's a pity to hear. I've got to say, it's a bit disappointing that for something embarrassingly parallel, going from 1 thread to 60 only improves performance by ~1.4x.
related to this: is there a chance to get another GC strategy in the near future (if at all possible)?
you can always preallocate your working memory
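For instance, a minimal sketch of hoisting the allocation out of the threaded loop (the row-sum workload is hypothetical):

    using Base.Threads

    # Hypothetical workload: sum each row of X. The output is allocated once
    # up front, so the threaded loop itself does not allocate at all.
    function rowsums!(out::Vector{Float64}, X::Matrix{Float64})
        @threads for i in axes(X, 1)
            s = 0.0
            for j in axes(X, 2)
                s += X[i, j]
            end
            out[i] = s
        end
        return out
    end

    out = Vector{Float64}(undef, 1_000)   # preallocated working memory
    rowsums!(out, rand(1_000, 100))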
I think @Gabriel Baraldi does some investigative work, but nothing in the near future. There is some work on parallelizing GC entirely, but no ETA on that.
Sukera said:
you can always preallocate your working memory
yes I know, this is what I do. I was just wondering if other strategies exist and, if so, whether they're being investigated
it's a difficult problem and even with a parallel GC, you're going to want to minimize interaction with it either way :shrug:
memory management is expensive, no matter whether it's julia, C, C++ or Java
(relatively speaking to a hot loop)
that said, @Timothy if you only observe a speedup of 1.4x when you expect 60x, it suggests to me that your single threaded path is not at all optimized, so I'd suggest starting with that
I don't really see what optimising the single-threaded path has to do with how well it parallelizes, if there's no interaction between threads.
because GC is interacting across threads.
Oh, you're talking about optimising it to produce less GC?
yes
GC is single threaded, so if any thread wants to allocate memory, it needs to take a lock and blocks all other threads from allocating, IIRC.
on top of that, depending on how you're threading, you may accidentally share data/state between threads, which can lead to more boxing/allocations
the easiest way to tackle both is to optimize the single threaded version
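One way to quantify that first, as a sketch (single_threaded_step and data are hypothetical stand-ins for the real work):

    # @timed reports allocation volume and GC time alongside the result
    stats = @timed single_threaded_step(data)
    stats.bytes    # total bytes allocated by the call
    stats.gctime   # seconds spent in GC during the call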
That would be nice, but I have a feeling that would be quite difficult. The code is basically doing two major things: training decision trees (with DecisionTree.jl) and computing permutation importances.
do you have an example?
keep in mind that e.g. Threads.@threads more or less has to create an anonymous function under the hood, so anything you capture in there is shared across threads
and that capturing may also lead to type instabilities, boxing etc.
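A minimal illustration of that boxing (visible as Core.Box in @code_warntype):

    function boxed_counter()
        x = 0
        # x is assigned inside the closure, so the compiler boxes it,
        # making every subsequent use of x type-unstable
        bump() = (x += 1)
        bump()
        return x
    end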
This is the rough structure of the code (which uses Transducers.jl):
    (sample generator) |>
    Map(pairs -> zip(pairs...) .|> Iterators.flatten .|> collect) |>
    Map(function ((train, test),)
        tree = train(decision tree on train data)   # pseudocode
        oob = predict(tree, test)
        importance = permutation_importance(tree, test)
        (; tree, oob, importance)
    end) |> tcollect
The actual code is a bit more complicated, but if you're interested you can find it here: http://ix.io/4hNh
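For reference, a runnable skeleton of that structure on assumed inputs: splits is a hypothetical vector of ((trainX, trainy), (testX, testy)) tuples, build_tree/apply_tree are the DecisionTree.jl calls standing in for the training and prediction steps, and a simple accuracy score stands in for the importance computation:

    using DecisionTree: build_tree, apply_tree
    using Transducers: Map, tcollect
    using Statistics: mean

    function fit_all(splits)
        splits |> Map(function ((train, test),)
            trainX, trainy = train
            testX, testy = test
            tree = build_tree(trainy, trainX)   # DecisionTree.jl training
            oob = apply_tree(tree, testX)       # predictions on held-out data
            acc = mean(oob .== testy)           # stand-in for the importance step
            (; tree, oob, acc)
        end) |> tcollect                        # tcollect runs the Maps on threads
    end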
how large is pairs?
Usually a single-element vector (hence the if statement in the full version)
your full version is type unstable - you only assign importance conditionally
Yea, importance and oob are both Union{Nothing, X}s, but I'm assuming that type instability isn't having a big impact overall.
unprofiled assumptions are bad for making a decision :)
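The instability in question, in miniature (names are hypothetical):

    # importance is only assigned on one branch, so its inferred type is
    # Union{Nothing, Vector{Float64}}, which downstream code must handle
    function step(compute_importance::Bool)
        importance = nothing
        if compute_importance
            importance = rand(3)
        end
        return importance
    end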
trainX also creates a copy, does it really need to?
since slicing copies here: trainX, trainY = X1[train, :], y1[train]
It doesn't need to be a copy, but neither that nor Matrix(testX) seems to have much of an impact overall
because both copy
why not a @view?
I suppose I could prefix @views just for fun
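The difference in miniature (toy data):

    X1, y1 = rand(100, 5), rand(100)
    train = 1:50
    trainX = X1[train, :]          # slicing copies: allocates a new Matrix
    trainXv = @view X1[train, :]   # no copy: reads through to X1
    # prefixing a whole function or block with @views turns every slice in it into a view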
same general thought goes for zip(pairs...), do you really need to collect it, thereby allocating new memory?
I'm not |> collecting, I'm .|> collecting, so it changes the structure
same thought though, do you need to collect each flattened array?
and I also think the broadcast (which allocates yet another array) can be elided by using Iterators.map(Iterators.flatten, zip(pairs...))
Yes, otherwise I get a method error
Also, in this particular case, I know that branch is never hit
I'm not saying doing just that change will fix things, I'm just pointing out where you get those allocations
I appreciate it :)
The map change is nice, I'll make that regardless of the impact
tbh, transducers are supposed to compose, so I think you should be able to use the Transducers.jl native Map just as well
regardless, every broadcast in there (like getproperty.(submachines, :test) later on) has to allocate a result array
Timothy said:
Yes, otherwise I get a method error
I'm very happy to discover that I'm wrong here! and it does help a bit!
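The change in question, on toy data (the groups here are hypothetical):

    groups = [([1, 2], [3, 4]), ([5, 6], [7, 8])]
    # eager: each broadcast stage materializes a fresh array
    eager = Iterators.flatten.(groups) .|> collect
    # lazy: elements are produced on demand, no intermediate arrays
    lazy = Iterators.map(Iterators.flatten, groups)
    first(first(lazy))   # == 1, and nothing has been collected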
the mapreduce over dictionaries is going to be expensive too, since it needs to allocate a new dictionary and rehash all entries
Any recommendations there?
put the getproperty thing into the map of the mapreduce
you're already mapping
so why have another loop allocating memory before that?
(or get rid of the Dict entirely, but that is probably a bit more of an undertaking)
Since the keys are the same, I suppose I could sort each importance list on the keys and then directly sum the vectors of values?
that's also an option, yeah
or you just iterate over keys(..) of one importance list and sum all entries of all lists
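Both options side by side on toy data (the dicts here are hypothetical):

    importances = [Dict("a" => 1.0, "b" => 2.0), Dict("a" => 0.5, "b" => 1.5)]
    # mapreduce with mergewith allocates and rehashes a new Dict at every step
    total = mapreduce(identity, mergewith(+), importances)
    # cheaper: iterate the keys of one dict and sum across all of them
    ks = keys(first(importances))
    sums = Dict(k => sum(d[k] for d in importances) for k in ks)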
Ok, with the .|> collect removal and the Iterators.map change, going from 1 to 60 threads now gives a 1.9x perf improvement
so still 30x missing, possibly due to the other places with copy slicing instead of views
and depending on how much the code you're calling allocates, of course
I could share tree_permutation_importance too, if that's of interest?
I'm a bit short on time for in-depth reviews, sorry :/
That's fine, the 1.4 -> 1.9x improvement is already very much appreciated :)
If allocations can't be avoided, running multi-processing (e.g. Distributed) can help, since then each instance gets its own allocator and GC.
It has its own set of drawbacks (sending data over sockets, multiple compilations etc.), so it is absolutely no silver bullet.
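A minimal sketch of that approach (the workload is hypothetical):

    using Distributed
    addprocs(4)   # each worker process has its own heap and GC

    @everywhere function heavy(i)
        # allocating workload: GC pauses here don't stop the other workers
        sum(rand(10_000)) + i
    end

    results = pmap(heavy, 1:100)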
I use Distributed a lot for this reason.
Or MPI.jl, if it makes sense
Yea, the problem I have with distributed + a few packages + a large data set is an explosion in the memory required.
My solution to that has been just buying a lot of RAM, though sometimes it's not enough
I have 128G and it's not enough :sweat_smile:
there's always swap
When I installed the OS I assumed I'd need no swap with so much memory :laughter_tears:
(image attachment) it's not always fast but it does go