Stream: helpdesk (published)

Topic: Reduce GC time


view this post on Zulip Timothy (Dec 05 2022 at 09:24):

I have a task which spends 10% of the time on GC when executed on a single thread, but as soon as I thread it 70% of the execution time is GC. Might anyone have general suggestions on how I could improve this situation?

view this post on Zulip Sukera (Dec 05 2022 at 09:27):

unfortunately, the only "real" thing you can do right now is to allocate less in the threaded sections

view this post on Zulip Sukera (Dec 05 2022 at 09:28):

reason being that GC runs stop other threads from executing as well

view this post on Zulip Sukera (Dec 05 2022 at 09:28):

so lots of GC in a threaded section also means a large slowdown

view this post on Zulip Timothy (Dec 05 2022 at 10:56):

That's a pity to hear. I've got to say, it's a bit disappointing when doing something embarrassingly parallel that when going from 1 thread to 60 performance only improves by ~1.4x.

view this post on Zulip Max Köhler (Dec 05 2022 at 10:58):

related to this: is there a change to get another GC strategy in the near future (if at all possible)?

view this post on Zulip Sukera (Dec 05 2022 at 11:07):

you can always preallocate your working memory

view this post on Zulip Sukera (Dec 05 2022 at 11:09):

I think @Gabriel Baraldi does some investigative work, but nothing in the near future. There is some work on parallelizing GC entirely, but no ETA on that.

view this post on Zulip Max Köhler (Dec 05 2022 at 11:09):

Sukera said:

you can always preallocate your working memory

yes I know, this is what I do. I was just wondering if other strategies exist and if so, if investigations are carried out

view this post on Zulip Sukera (Dec 05 2022 at 11:10):

it's a difficult problem and even with a parallel GC, you're going to want to minimize interaction with it either way :shrug:

view this post on Zulip Sukera (Dec 05 2022 at 11:11):

memory management is expensive, no matter whether it's julia, C, C++ or Java

view this post on Zulip Sukera (Dec 05 2022 at 11:11):

(relatively speaking to a hot loop)

view this post on Zulip Sukera (Dec 05 2022 at 11:12):

that said, @Timothy if you only observe a speedup of 1.4x when you expect 60x, it suggests to me that your single threaded path is not at all optimized, so I'd suggest starting with that

view this post on Zulip Timothy (Dec 05 2022 at 11:17):

I don't really see what the optimisation of the single-threaded path has to do with how well it paralelizes if there's no interaction between threads.

view this post on Zulip Sukera (Dec 05 2022 at 11:18):

because GC is interacting across threads.

view this post on Zulip Timothy (Dec 05 2022 at 11:18):

Oh, you're talking about optimising it to produce less GC?

view this post on Zulip Sukera (Dec 05 2022 at 11:18):

yes

view this post on Zulip Sukera (Dec 05 2022 at 11:19):

GC is single threaded, so if any thread wants to allocate memory, it needs to take a lock and blocks all other threads from allocating, IIRC.

view this post on Zulip Sukera (Dec 05 2022 at 11:19):

on top of that, depending on how you're threading, you may accidentally share data/state between threads, which can lead to more boxing/allocations

view this post on Zulip Sukera (Dec 05 2022 at 11:20):

the easiest way to tackle both is to optimize the single threaded version

view this post on Zulip Timothy (Dec 05 2022 at 11:20):

That would be nice, but I have a feeling that would be quite difficult. The code is basically doing two major things:

view this post on Zulip Sukera (Dec 05 2022 at 11:21):

do you have an example?

view this post on Zulip Sukera (Dec 05 2022 at 11:22):

keep in mind that e.g. Threads.@threads more or less has to create an anonymous function under the hood, so anything you capture in there is shared across threads

view this post on Zulip Sukera (Dec 05 2022 at 11:22):

and that capturing may too lead to type instabilities, boxing etc.

view this post on Zulip Timothy (Dec 05 2022 at 11:23):

This is the rough structure of the code (which uses Transducers.jl)

(sample generator) |>
  Map(pairs -> zip(pairs...) .|> Iterators.flatten .|> collect) |>
  Map(function ((train, test),)
    tree = train(decision tree on train data)
    oob = predict(tree, test)
    importance = permutation_importance(tree, test)
    (; tree, oob, importance)
  end) |> tcollect

view this post on Zulip Timothy (Dec 05 2022 at 11:24):

The actual code is a bit more complicated, but if you're interested you can find it here: http://ix.io/4hNh

view this post on Zulip Sukera (Dec 05 2022 at 11:25):

how large is pairs?

view this post on Zulip Timothy (Dec 05 2022 at 11:26):

Usually a single-element vector (hence the if statement in the full version)

view this post on Zulip Sukera (Dec 05 2022 at 11:26):

your full version is type unstable - you only assign importance conditionally

view this post on Zulip Timothy (Dec 05 2022 at 11:27):

Yea, importance and oob are both Union{Nothing, X}s, but I'm assuming that type instability isn't having a big impact overall.

view this post on Zulip Sukera (Dec 05 2022 at 11:28):

unprofiled assumptions are bad for making a decision :)

view this post on Zulip Sukera (Dec 05 2022 at 11:28):

trainX also creates a copy, does it really need to?

view this post on Zulip Sukera (Dec 05 2022 at 11:28):

since slicing copies here trainX, trainY = X1[train, :], y1[train]

view this post on Zulip Timothy (Dec 05 2022 at 11:29):

It doesn't need to be a copy, but neither that or Matrix(testX) seem to have much of an impact overall

view this post on Zulip Sukera (Dec 05 2022 at 11:29):

because both copy

view this post on Zulip Sukera (Dec 05 2022 at 11:29):

why not a @view?

view this post on Zulip Timothy (Dec 05 2022 at 11:29):

I suppose I could prefix @views just for fun

view this post on Zulip Sukera (Dec 05 2022 at 11:30):

same general thought goes for zip(pairs...), do you really need to collect it, thereby allocating new memory?

view this post on Zulip Timothy (Dec 05 2022 at 11:30):

I'm not |> collecting, I'm .|> collecting, so it changes the structure

view this post on Zulip Sukera (Dec 05 2022 at 11:31):

same thought though, do you need to collect each flattened array?

view this post on Zulip Sukera (Dec 05 2022 at 11:33):

and I also think the broadcast (which allocates yet-another array) can be elided by using Iterators.map(Iterators.flatten, zip(pairs...))

view this post on Zulip Timothy (Dec 05 2022 at 11:33):

Yes, otherwise I get a method error

view this post on Zulip Timothy (Dec 05 2022 at 11:33):

Also, in this particular case, I know that branch is never hit

view this post on Zulip Sukera (Dec 05 2022 at 11:34):

I'm not saying doing just that change will fix things, I'm just pointing out where you get those allocations

view this post on Zulip Timothy (Dec 05 2022 at 11:34):

I appreciate it :)

view this post on Zulip Timothy (Dec 05 2022 at 11:34):

The map change is nice, I'll make that regardless of the impact

view this post on Zulip Sukera (Dec 05 2022 at 11:35):

tbh, transducers are supposed to compose, so I think you should be able to use the Transducers.jl native Map just as well

view this post on Zulip Sukera (Dec 05 2022 at 11:36):

regardless, every broadcast in there (like getproperty.(submachines, :test) later on) has to allocate a result array

view this post on Zulip Timothy (Dec 05 2022 at 11:36):

Timothy said:

Yes, otherwise I get a method error

I'm very happy to discover that I'm wrong here! and it does help a bit!

view this post on Zulip Sukera (Dec 05 2022 at 11:36):

the mapreduce over dictionaries is going to be expensive too, since it needs to allocate a new dictionary and rehash all entries

view this post on Zulip Timothy (Dec 05 2022 at 11:37):

Any recommendations there?

view this post on Zulip Sukera (Dec 05 2022 at 11:39):

put the getproperty thing into the map of the mapreduce

view this post on Zulip Sukera (Dec 05 2022 at 11:39):

you're already mapping

view this post on Zulip Sukera (Dec 05 2022 at 11:39):

so why have another loop allocating memory before that?

view this post on Zulip Sukera (Dec 05 2022 at 11:39):

(or get rid of the Dict entirely, but that is probably a bit more of an undertaking)

view this post on Zulip Timothy (Dec 05 2022 at 11:40):

Since the keys are the same, I suppose I could sort each importance list on the keys and then directly sum the vectors of values?

view this post on Zulip Sukera (Dec 05 2022 at 11:41):

that's also an option, yeah

view this post on Zulip Sukera (Dec 05 2022 at 11:41):

or you just iterate over keys(..) of one importance list and sum all entries of all lists

view this post on Zulip Timothy (Dec 05 2022 at 11:46):

Ok, with the .|> collection removal and Iterators.map the 60x thread increase now causes a 1.9x perf improvement

view this post on Zulip Sukera (Dec 05 2022 at 11:50):

so still 30x missing, possibly due to the other places with copy slicing instead of views

view this post on Zulip Sukera (Dec 05 2022 at 11:50):

and depending on how much the code you're calling allocates, of course

view this post on Zulip Timothy (Dec 05 2022 at 11:52):

I could share tree_permutation_importance too, if that's of interest?

view this post on Zulip Sukera (Dec 05 2022 at 11:53):

I'm a bit short on time for indepth reviews, sorry :/

view this post on Zulip Timothy (Dec 05 2022 at 11:54):

That's fine, the 1.4 -> 1.9x improvement is already very much appreciated :)

view this post on Zulip DrChainsaw (Dec 05 2022 at 17:08):

If allocations can't be avoided, running multi-processing (e.g. Distributed) can help since then each instance gets its own allocator and gc.

It has its own sets of drawbacks (sending data over sockets, multiple compilations etc.) so it is absolutely no silver bullet.

view this post on Zulip chriselrod (Dec 06 2022 at 23:20):

I use Distributed a lot for this reason.

view this post on Zulip Brenhin Keller (Dec 06 2022 at 23:44):

Or MPI.jl, if it makes sense

view this post on Zulip Timothy (Dec 07 2022 at 01:24):

Yea, the problem I have with distributed + a few packages + a large data sets is an explosion in memory required.

view this post on Zulip jar (Dec 07 2022 at 01:36):

My solution to that has been just buying a lot of ram, though sometimes it's not enough

view this post on Zulip Timothy (Dec 07 2022 at 01:38):

I have 128G and it's not enough :sweat_smile:

view this post on Zulip jar (Dec 07 2022 at 01:41):

there's always swap

view this post on Zulip Timothy (Dec 07 2022 at 01:44):

When I installed the OS I assumed I'd need no swap with so much memory :laughter_tears:

view this post on Zulip jar (Dec 29 2022 at 03:32):

image.png it's not always fast but it does go


Last updated: Oct 02 2023 at 04:34 UTC