Reduce GC time · helpdesk (published)

Timothy (Dec 05 2022 at 09:24):

I have a task which spends 10% of the time on GC when executed on a single thread, but as soon as I thread it, 70% of the execution time is GC. Might anyone have general suggestions on how I could improve this situation?

Sukera (Dec 05 2022 at 09:27):

unfortunately, the only "real" thing you can do right now is to allocate less in the threaded sections

Sukera (Dec 05 2022 at 09:28):

reason being that GC runs stop other threads from executing as well

Sukera (Dec 05 2022 at 09:28):

so lots of GC in a threaded section also means a large slowdown
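A rough way to observe this is to compare the GC share of a serial run with that of a threaded run using `@timed`. This is only a sketch: `work`, `run_serial` and `run_threaded` are hypothetical stand-ins for the real task, and the numbers depend heavily on the machine and on how many threads Julia was started with.

```julia
# Hypothetical workload: allocates a fresh vector on every call.
work(n) = sum(rand(n))

function run_serial(iters, n)
    acc = 0.0
    for _ in 1:iters
        acc += work(n)
    end
    return acc
end

function run_threaded(iters, n)
    partials = zeros(iters)
    Threads.@threads for i in 1:iters
        partials[i] = work(n)   # each iteration writes to its own slot
    end
    return sum(partials)
end

# Fraction of wall-clock time spent in GC, from the fields returned by @timed.
gc_fraction(stats) = stats.gctime / stats.time

# Warm up so compilation time is not measured.
run_serial(2, 10)
run_threaded(2, 10)

serial = @timed run_serial(200, 100_000)
threaded = @timed run_threaded(200, 100_000)

println("threads:           ", Threads.nthreads())
println("serial GC share:   ", gc_fraction(serial))
println("threaded GC share: ", gc_fraction(threaded))
```

Start Julia with several threads (for example `julia -t 8`) to see a difference between the two measurements.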

Timothy (Dec 05 2022 at 10:56):

That's a pity to hear. I've got to say, it's a bit disappointing when doing something embarrassingly parallel that when going from 1 thread to 60 performance only improves by ~1.4x.

Max Köhler (Dec 05 2022 at 10:58):

related to this: is there a chance to get another GC strategy in the near future (if at all possible)?

Sukera (Dec 05 2022 at 11:07):

you can always preallocate your working memory
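A minimal sketch of what preallocating working memory can look like with tasks, assuming each task can own one scratch buffer and reuse it for all of its items. `kernel!` and `process_chunks` are hypothetical names, not part of the code discussed in this thread.

```julia
# Hypothetical kernel: fills a caller-supplied buffer, then reduces it.
# It allocates nothing itself because the buffer is passed in.
function kernel!(buf::Vector{Float64}, x::Float64)
    for i in eachindex(buf)
        buf[i] = (x + i)^2
    end
    return sum(buf)
end

function process_chunks(xs::Vector{Float64}; buflen::Int = 1024,
                        nchunks::Int = Threads.nthreads())
    chunksize = max(1, cld(length(xs), nchunks))
    chunks = collect(Iterators.partition(eachindex(xs), chunksize))
    out = similar(xs)
    tasks = map(chunks) do idxs
        Threads.@spawn begin
            # One allocation per task, reused for every item in the chunk.
            buf = Vector{Float64}(undef, buflen)
            for i in idxs
                out[i] = kernel!(buf, xs[i])
            end
        end
    end
    foreach(wait, tasks)
    return out
end

xs = collect(1.0:8.0)
out = process_chunks(xs; buflen = 4)
```

The buffer is created inside the task rather than indexed by `Threads.threadid()`, so it stays private to that task no matter which thread runs it.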

Sukera (Dec 05 2022 at 11:09):

I think @Gabriel Baraldi does some investigative work, but nothing in the near future. There is some work on parallelizing GC entirely, but no ETA on that.

Max Köhler (Dec 05 2022 at 11:09):

Sukera said:

you can always preallocate your working memory

yes I know, this is what I do. I was just wondering if other strategies exist and if so, if investigations are carried out

Sukera (Dec 05 2022 at 11:10):

it's a difficult problem and even with a parallel GC, you're going to want to minimize interaction with it either way :shrug:

Sukera (Dec 05 2022 at 11:11):

memory management is expensive, no matter whether it's julia, C, C++ or Java

Sukera (Dec 05 2022 at 11:11):

(relatively speaking to a hot loop)

Sukera (Dec 05 2022 at 11:12):

that said, @Timothy if you only observe a speedup of 1.4x when you expect 60x, it suggests to me that your single threaded path is not at all optimized, so I'd suggest starting with that

Timothy (Dec 05 2022 at 11:17):

I don't really see what the optimisation of the single-threaded path has to do with how well it parallelizes if there's no interaction between threads.

Sukera (Dec 05 2022 at 11:18):

because GC is interacting across threads.

Timothy (Dec 05 2022 at 11:18):

Oh, you're talking about optimising it to produce less GC?

Sukera (Dec 05 2022 at 11:18):

yes

Sukera (Dec 05 2022 at 11:19):

GC is single threaded, so if any thread wants to allocate memory, it needs to take a lock and blocks all other threads from allocating, IIRC.

Sukera (Dec 05 2022 at 11:19):

on top of that, depending on how you're threading, you may accidentally share data/state between threads, which can lead to more boxing/allocations

Sukera (Dec 05 2022 at 11:20):

the easiest way to tackle both is to optimize the single threaded version

Timothy (Dec 05 2022 at 11:20):

That would be nice, but I have a feeling that would be quite difficult. The code is basically doing two major things:

Constructing and evaluating decision trees (from DecisionTree.jl)
Calculating permutation importance

Sukera (Dec 05 2022 at 11:21):

do you have an example?

Sukera (Dec 05 2022 at 11:22):

keep in mind that e.g. Threads.@threads more or less has to create an anonymous function under the hood, so anything you capture in there is shared across threads

Sukera (Dec 05 2022 at 11:22):

and that capturing may also lead to type instabilities, boxing etc.
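A small sketch of the boxing issue, independent of threading: a local variable that is reassigned inside a closure gets boxed, and inference loses its type. Mutating a `Ref` instead keeps the captured binding constant. `boxed_sum` and `ref_sum` are hypothetical names for illustration.

```julia
# `total` is reassigned inside the closure, so Julia stores it in a box
# and can no longer infer its type.
function boxed_sum(xs)
    total = 0
    foreach(x -> (total += x), xs)
    return total
end

# `total` is never reassigned here; only the contents of the Ref change,
# so the closure captures a concretely typed Ref{Int}.
function ref_sum(xs)
    total = Ref(0)
    foreach(x -> (total[] += x), xs)
    return total[]
end

boxed_type = only(Base.return_types(boxed_sum, (Vector{Int},)))
ref_type = only(Base.return_types(ref_sum, (Vector{Int},)))

println("boxed_sum infers: ", boxed_type)
println("ref_sum infers:   ", ref_type)
```

The same pattern applies to the body of a `Threads.@threads` loop, since that body is turned into a closure as well.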

Timothy (Dec 05 2022 at 11:23):

This is the rough structure of the code (which uses Transducers.jl)

(sample generator) |>
  Map(pairs -> zip(pairs...) .|> Iterators.flatten .|> collect) |>
  Map(function ((train, test),)
    tree = train(decision tree on train data)
    oob = predict(tree, test)
    importance = permutation_importance(tree, test)
    (; tree, oob, importance)
  end) |> tcollect

Timothy (Dec 05 2022 at 11:24):

The actual code is a bit more complicated, but if you're interested you can find it here: http://ix.io/4hNh

Sukera (Dec 05 2022 at 11:25):

how large is pairs?

Timothy (Dec 05 2022 at 11:26):

Usually a single-element vector (hence the if statement in the full version)

Sukera (Dec 05 2022 at 11:26):

your full version is type unstable - you only assign importance conditionally

Timothy (Dec 05 2022 at 11:27):

Yea, importance and oob are both Union{Nothing, X}s, but I'm assuming that type instability isn't having a big impact overall.

Sukera (Dec 05 2022 at 11:28):

unprofiled assumptions are bad for making a decision :)
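Whether a `Union{Nothing, X}` matters can be checked rather than assumed. The sketch below only shows how to inspect what the compiler infers for a conditionally assigned value; `maybe_importance` is a hypothetical stand-in, not the function from the linked code. Small unions like this one are often handled efficiently by the compiler, so the actual cost should be confirmed with a profiler.

```julia
# Hypothetical stand-in: the result is only computed on one branch.
maybe_importance(xs::Vector{Float64}, compute::Bool) =
    compute ? sum(abs, xs) : nothing

# Ask inference what this method can return for these argument types.
inferred = only(Base.return_types(maybe_importance, (Vector{Float64}, Bool)))

println("inferred return type: ", inferred)
println("concrete?             ", isconcretetype(inferred))
```

For a real workload, the `Profile` standard library or `@code_warntype` on the actual hot function gives a more reliable picture than inspecting a toy function.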

Sukera (Dec 05 2022 at 11:28):

trainX also creates a copy, does it really need to?

Sukera (Dec 05 2022 at 11:28):

since slicing copies here: trainX, trainY = X1[train, :], y1[train]

Timothy (Dec 05 2022 at 11:29):

It doesn't need to be a copy, but neither that nor Matrix(testX) seem to have much of an impact overall

Sukera (Dec 05 2022 at 11:29):

because both copy

Sukera (Dec 05 2022 at 11:29):

why not a @view?
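A short sketch of the difference between slicing and a view, using made-up data (`X` and `train` here are hypothetical, not the arrays from the linked code). Indexing with `X[train, :]` allocates a new matrix, while `@view X[train, :]` wraps the original memory.

```julia
X = reshape(collect(1.0:12.0), 4, 3)   # 4×3 matrix, hypothetical data
train = [1, 3]                         # hypothetical row indices

copied = X[train, :]        # allocates a new 2×3 Matrix
viewed = @view X[train, :]  # SubArray that refers to X's memory

# Changing the parent shows which one shares memory with it.
X[1, 1] = 100.0

println("copied[1, 1] = ", copied[1, 1])
println("viewed[1, 1] = ", viewed[1, 1])
```

Whether a downstream function accepts a `SubArray` depends on its method signatures, so a view is only a drop-in replacement where the callee takes an `AbstractMatrix`.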

Timothy (Dec 05 2022 at 11:29):

I suppose I could prefix @views just for fun

Sukera (Dec 05 2022 at 11:30):

same general thought goes for zip(pairs...), do you really need to collect it, thereby allocating new memory?

Timothy (Dec 05 2022 at 11:30):

I'm not |> collecting, I'm .|> collecting, so it changes the structure

Sukera (Dec 05 2022 at 11:31):

same thought though, do you need to collect each flattened array?

Sukera (Dec 05 2022 at 11:33):

and I also think the broadcast (which allocates yet-another array) can be elided by using Iterators.map(Iterators.flatten, zip(pairs...))
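A sketch comparing the two forms on made-up input. `sample_pairs` is a hypothetical stand-in for `pairs`, assumed here to be a vector of `(train, test)` index tuples. The broadcast version materialises an array of collected vectors, while `Iterators.map` builds a lazy iterator that can still be destructured.

```julia
# Hypothetical input: each element is a (train, test) pair of index vectors.
sample_pairs = [([1, 2], [3, 4]), ([5], [6])]

# Eager: broadcasting collects the zip and every flattened group.
eager = zip(sample_pairs...) .|> Iterators.flatten .|> collect

# Lazy: nothing is materialised until the result is iterated.
lazy = Iterators.map(Iterators.flatten, zip(sample_pairs...))

# Destructuring works on the lazy iterator too.
train, test = lazy

println("eager:         ", eager)
println("train (lazy):  ", collect(train))
println("test (lazy):   ", collect(test))
```

The `collect` calls in the last two lines are only there for printing; code that just iterates over `train` and `test` does not need them.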

Timothy (Dec 05 2022 at 11:33):

Yes, otherwise I get a method error

Timothy (Dec 05 2022 at 11:33):

Also, in this particular case, I know that branch is never hit

Sukera (Dec 05 2022 at 11:34):

I'm not saying doing just that change will fix things, I'm just pointing out where you get those allocations

Timothy (Dec 05 2022 at 11:34):

I appreciate it :)

Timothy (Dec 05 2022 at 11:34):

The map change is nice, I'll make that regardless of the impact

Sukera (Dec 05 2022 at 11:35):

tbh, transducers are supposed to compose, so I think you should be able to use the Transducers.jl native Map just as well

Sukera (Dec 05 2022 at 11:36):

regardless, every broadcast in there (like getproperty.(submachines, :test) later on) has to allocate a result array

Timothy (Dec 05 2022 at 11:36):

Timothy said:

Yes, otherwise I get a method error

I'm very happy to discover that I'm wrong here! and it does help a bit!

Sukera (Dec 05 2022 at 11:36):

the mapreduce over dictionaries is going to be expensive too, since it needs to allocate a new dictionary and rehash all entries

Timothy (Dec 05 2022 at 11:37):

Any recommendations there?

Sukera (Dec 05 2022 at 11:39):

put the getproperty thing into the map of the mapreduce

Sukera (Dec 05 2022 at 11:39):

you're already mapping

Sukera (Dec 05 2022 at 11:39):

so why have another loop allocating memory before that?
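A sketch of that suggestion with made-up data. `submachines` is assumed here to be a vector of objects with a `test` field, and `sum` with `+` are placeholders for whatever the real map and reduction are. Moving the field access into the mapping function avoids the intermediate array that the broadcast creates.

```julia
# Hypothetical stand-ins for the real submachines.
submachines = [(; test = [1.0, 2.0]), (; test = [3.0, 4.0]), (; test = [5.0, 6.0])]

# Two passes: the broadcast allocates a temporary vector of fields first.
two_pass = mapreduce(sum, +, getproperty.(submachines, :test))

# One pass: the field access happens inside the map, no temporary vector.
one_pass = mapreduce(m -> sum(m.test), +, submachines)

println("two_pass = ", two_pass, ", one_pass = ", one_pass)
```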

Sukera (Dec 05 2022 at 11:39):

(or get rid of the Dict entirely, but that is probably a bit more of an undertaking)

Timothy (Dec 05 2022 at 11:40):

Since the keys are the same, I suppose I could sort each importance list on the keys and then directly sum the vectors of values?

Sukera (Dec 05 2022 at 11:41):

that's also an option, yeah

Sukera (Dec 05 2022 at 11:41):

or you just iterate over keys(..) of one importance list and sum all entries of all lists
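A sketch of the key-iteration idea, assuming every importance `Dict` has exactly the same keys (as stated above). `importances` and `sum_by_key` are hypothetical; the merging version is shown only for comparison, since it is not known exactly how the original code reduces its dictionaries.

```julia
# Hypothetical importance lists with identical key sets.
importances = [
    Dict(:a => 1.0, :b => 2.0),
    Dict(:a => 0.5, :b => 1.5),
    Dict(:a => 2.0, :b => 0.0),
]

# For comparison: reducing by merging builds intermediate dictionaries.
merged = reduce(mergewith(+), importances)

# Take the keys once, then accumulate into a plain vector of totals.
function sum_by_key(dicts)
    ks = collect(keys(first(dicts)))
    totals = zeros(length(ks))
    for d in dicts, (i, k) in enumerate(ks)
        totals[i] += d[k]   # assumes every dict contains every key
    end
    return ks, totals
end

ks, totals = sum_by_key(importances)
println(Dict(zip(ks, totals)))
```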

Timothy (Dec 05 2022 at 11:46):

Ok, with the .|> collection removal and Iterators.map the 60x thread increase now causes a 1.9x perf improvement

Sukera (Dec 05 2022 at 11:50):

so still 30x missing, possibly due to the other places with copy slicing instead of views

Sukera (Dec 05 2022 at 11:50):

and depending on how much the code you're calling allocates, of course

Timothy (Dec 05 2022 at 11:52):

I could share tree_permutation_importance too, if that's of interest?

Sukera (Dec 05 2022 at 11:53):

I'm a bit short on time for indepth reviews, sorry :/

Timothy (Dec 05 2022 at 11:54):

That's fine, the 1.4 -> 1.9x improvement is already very much appreciated :)

DrChainsaw (Dec 05 2022 at 17:08):

If allocations can't be avoided, running multi-processing (e.g. Distributed) can help since then each instance gets its own allocator and gc.

It has its own sets of drawbacks (sending data over sockets, multiple compilations etc.) so it is absolutely no silver bullet.
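A minimal sketch of the multi-process approach with the `Distributed` standard library, assuming two local workers and a made-up allocating workload. Each worker is a separate Julia process with its own heap and GC, so a collection in one does not pause the others.

```julia
using Distributed

# Start two local worker processes (hypothetical count; tune to the machine).
addprocs(2)

# Hypothetical allocating workload. The anonymous function is serialised
# and sent to the workers; arguments and results travel over sockets.
results = pmap(n -> sum(abs2, rand(n)), fill(10_000, 8))

println("workers used: ", nworkers())
println("results:      ", length(results))

# Shut the workers down again.
rmprocs(workers())
```

Any large input data has to be sent to, or loaded on, every worker, which is where the extra memory use mentioned in this thread comes from.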

chriselrod (Dec 06 2022 at 23:20):

I use Distributed a lot for this reason.

Brenhin Keller (Dec 06 2022 at 23:44):

Or MPI.jl, if it makes sense

Timothy (Dec 07 2022 at 01:24):

Yea, the problem I have with distributed + a few packages + large data sets is an explosion in memory required.

jar (Dec 07 2022 at 01:36):

My solution to that has been just buying a lot of ram, though sometimes it's not enough

Timothy (Dec 07 2022 at 01:38):

I have 128G and it's not enough :sweat_smile:

jar (Dec 07 2022 at 01:41):

there's always swap

Timothy (Dec 07 2022 at 01:44):

When I installed the OS I assumed I'd need no swap with so much memory :laughter_tears:

jar (Dec 29 2022 at 03:32):

[image.png] it's not always fast but it does go

Last updated: Oct 02 2023 at 04:34 UTC
