Stream: helpdesk (published)

Topic: Unwanted allocations with threads


Júlio Hoffimann (Jul 04 2025 at 17:33):

The two functions below differ only in the Threads.@threads annotation:

function update1!(A)
  m, n = size(A)
  for j in 1:n
    for i in 1:m
      @inbounds A[i, j] = sin(i) * cos(j)
    end
  end
end

function update2!(A)
  m, n = size(A)
  Threads.@threads for j in 1:n
    for i in 1:m
      @inbounds A[i, j] = sin(i) * cos(j)
    end
  end
end

I get the following results with 8 threads:

julia> using BenchmarkTools

julia> A = rand(100, 100);

julia> @btime update1!($A)
  159.424 μs (0 allocations: 0 bytes)

julia> @btime update2!($A)
  24.672 μs (42 allocations: 4.25 KiB)

julia> @allocated update1!(A)
0

julia> @allocated update2!(A)
4720

Is this expected behavior? Any way to avoid the allocations? I need to call this function some tens of millions of times in a hot loop.

Simeon Schaub (Jul 04 2025 at 17:40):

Yes, it's expected since new tasks need to be allocated. If you're calling this inside a hot loop though, why not thread the outer loop? That will almost certainly be faster.
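For instance, a minimal sketch of that idea, where a hypothetical caller owns a batch of independent matrices (the batch setup is illustrative, not from the original packages):

function update_all!(matrices)
  # Pay the task-spawning cost once per batch at the call site,
  # and run the serial kernel inside each task.
  Threads.@threads for k in eachindex(matrices)
    update1!(matrices[k])
  end
end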

Júlio Hoffimann (Jul 04 2025 at 17:41):

The outer loop is very high up in the abstraction layers. This function lives in a package and the outer loop lives in another separate package, for instance.

Júlio Hoffimann (Jul 04 2025 at 17:43):

I will benchmark the other package to see the impact.

Júlio Hoffimann (Jul 04 2025 at 17:54):

Unfortunately the performance decreases using threads due to the allocations.

Mosè Giordano (Jul 04 2025 at 18:32):

Júlio Hoffimann said:

Unfortunately the performance decreases using threads due to the allocations.

That suggests that the body of the loop isn't expensive enough to make spawning tasks worth it. Do more work inside each task.
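One simple way to act on that, as a sketch (the threshold value is a placeholder, to be calibrated by benchmarking on the target machine):

# Hypothetical wrapper: skip threading entirely when the matrix is too
# small for the task overhead to pay off.
function update!(A; threshold=90)
  size(A, 2) < threshold ? update1!(A) : update2!(A)
end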

Júlio Hoffimann (Jul 04 2025 at 18:35):

My original idea was to gain performance in the inner function and in the outer function as a consequence. The issue is that the inner allocations decrease the outer performance. The inner function is much faster with threads.

wheeheee (Jul 05 2025 at 03:38):

If this is the actual function and not just an example, you could try

function update3!(A, s, c)
    # s and c are preallocated buffers of length m and n, so this does
    # only O(m + n) trig evaluations and allocates nothing
    m, n = size(A)
    s .= sin.(1:m)
    c .= cos.(1:n)
    A .= s .* c'  # outer product: A[i, j] = sin(i) * cos(j)
end

julia> m1 = n1 = 100
100

julia> s = zeros(m1); c = zeros(n1);

julia> @btime update2!(A) setup = (A = rand(m1, n1));
  40.200 μs (42 allocations: 5.00 KiB)

julia> @btime update3!(A, $s, $c) setup = (A = rand(m1, n1));
  2.500 μs (0 allocations: 0 bytes)

julia> A = zeros(m1, n1); B = similar(A);

julia> update1!(A); update3!(B, s, c);

julia> A == B
true

Júlio Hoffimann (Jul 05 2025 at 08:59):

That is interesting. Broadcasting is faster than using multiple threads in this case? Why?

Unfortunately my use case is not that simple.

Mosè Giordano (Jul 05 2025 at 10:17):

Júlio Hoffimann said:

That is interesting. Broadcasting is faster than using multiple threads in this case? Why?

Again, because of the overhead of spawning tasks in the first place. Actual numbers strongly depend on your system, but expect spawning a task to cost on the order of a microsecond: if your entire function takes 2 microseconds to run, spawning even a single task would kill performance, and spawning several of them would completely dwarf the actual computing time.

Again, to make multithreading worth it you need to do substantial computations within each task, something which takes much longer than a few microseconds.
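A rough way to put a number on that overhead on a given machine (a minimal sketch; results are machine-dependent):

using BenchmarkTools

# Lower bound on per-task cost: spawn an empty task and wait for it.
@btime wait(Threads.@spawn nothing)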

Mosè Giordano (Jul 05 2025 at 10:19):

See the section "Multi-threading: is it always worth it?" of https://github.com/JuliaParallel/julia-hpc-tutorial-sc24/blob/9d2c54a52cc38f7f0159be79591232575333b7c9/parts/multithreading/multithreading.ipynb

Mosè Giordano (Jul 05 2025 at 10:24):

A very tall matrix (many more rows than columns, ideally with at least tens or hundreds of thousands of rows) may benefit from multithreading, but for a square matrix with just 100 rows you won't see any benefit

wheeheee (Jul 05 2025 at 10:26):

Júlio Hoffimann said:

That is interesting. Broadcasting is faster than using multiple threads in this case? Why?

Unfortunately my use case is not that simple.

I don't see that broadcasting is faster? As you said, whatever is in the loop you actually do isn't as simple as a separable function like sin(i) * cos(j) in this example.

Anyway, I think you might be looking for Polyester.jl. It doesn't allocate. If that doesn't help you that much, I would take Mosè's advice...

Júlio Hoffimann (Jul 05 2025 at 10:27):

@Mosè Giordano I think I'm missing something here. My example with update1! and update2! shows that threads improve the performance. The code with broadcasting, called update3!, beats both.

Júlio Hoffimann (Jul 05 2025 at 10:29):

In other words, update2! shows that tasks have enough work. But update3! beats it.

Júlio Hoffimann (Jul 05 2025 at 10:30):

@wheeheee oh I see the issue now! The update3! assumes separability of the kernel.

Júlio Hoffimann (Jul 05 2025 at 10:30):

It is doing less computation: O(m + n) trig evaluations instead of O(mn).

Júlio Hoffimann (Jul 05 2025 at 10:31):

I'll produce a complete MWE with the outer loop I mentioned before to see if you have improvement suggestions.

Júlio Hoffimann (Jul 05 2025 at 12:21):

The threaded performance depends on the kernel, the problem size n, and the number of threads. On my machine, the threaded update2! only becomes advantageous for n > 90 when running with 8 threads:

using BenchmarkTools
using StaticArrays
using Distances

function kernel(i, j)
  u = SVector(float(i), float(j))
  v = SVector(float(i + 1), float(j + 1))
  euclidean(u, v)
end

function update1!(A)
  m, n = size(A)
  for j in 1:n
    for i in 1:m
      @inbounds A[i, j] = kernel(i, j)
    end
  end
end

function update2!(A)
  m, n = size(A)
  Threads.@threads for j in 1:n
    for i in 1:m
      @inbounds A[i, j] = kernel(i, j)
    end
  end
end

@show Threads.nthreads()

results = map(1:10:150) do n
  A = rand(n, n)
  t1 = median(@benchmark update1!($A)).time
  t2 = median(@benchmark update2!($A)).time
  (n=n, t1=t1, t2=t2)
end
Threads.nthreads() = 8
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 6.286, t2 = 6960.0)
 (n = 11, t1 = 151.07542579075425, t2 = 7117.875)
 (n = 21, t1 = 518.8802083333334, t2 = 7134.5)
 (n = 31, t1 = 1122.6, t2 = 7110.375)
 (n = 41, t1 = 1968.7, t2 = 7036.5)
 (n = 51, t1 = 3033.875, t2 = 7280.571428571428)
 (n = 61, t1 = 4352.142857142857, t2 = 7336.5)
 (n = 71, t1 = 5883.666666666667, t2 = 7573.0)
 (n = 81, t1 = 7673.5, t2 = 7964.642857142857)
 (n = 91, t1 = 9691.0, t2 = 8256.166666666666)
 (n = 101, t1 = 11952.0, t2 = 8688.25)
 (n = 111, t1 = 14423.0, t2 = 9112.3)
 (n = 121, t1 = 17149.0, t2 = 9868.7)
 (n = 131, t1 = 20071.0, t2 = 10540.7)
 (n = 141, t1 = 23638.0, t2 = 11317.2)

I wonder whether these numbers are similar on other machines. I'd appreciate it if you could run the script on your machine for comparison :slight_smile:

Jeffrey Chandler (Jul 05 2025 at 18:17):

Run on a Ryzen 7 5800X:

Threads.nthreads() = 8
(n = 1, t1 = 8.2, t2 = 4487.5)
(n = 11, t1 = 168.39378238341968, t2 = 4787.5)
(n = 21, t1 = 514.8717948717949, t2 = 4857.142857142857)
(n = 31, t1 = 1130.0, t2 = 4771.428571428572)
(n = 41, t1 = 1890.0, t2 = 5012.5)
(n = 51, t1 = 3088.8888888888887, t2 = 5128.571428571428)
(n = 61, t1 = 4257.142857142857, t2 = 5600.0)
(n = 71, t1 = 5833.333333333333, t2 = 5285.714285714285)
(n = 81, t1 = 7325.0, t2 = 5514.285714285715)
(n = 91, t1 = 9433.333333333334, t2 = 6016.666666666667)
(n = 101, t1 = 11500.0, t2 = 6400.0)
(n = 111, t1 = 13700.0, t2 = 7250.0)
(n = 121, t1 = 16200.0, t2 = 7920.0)
(n = 131, t1 = 19600.0, t2 = 8200.0)
(n = 141, t1 = 22300.0, t2 = 8925.0)

Jeffrey Chandler (Jul 05 2025 at 18:19):

From what I'm seeing, the startup cost for the threads (for me) is ~4500 ns: once the sequential run time gets notably longer than that, I see better times multithreading.

Júlio Hoffimann (Jul 05 2025 at 18:21):

Very interesting to see that on your machine n > 70 is where things get better with threads.

Mosè Giordano (Jul 06 2025 at 00:50):

Grace:

Threads.nthreads() = 36
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 3.648, t2 = 32769.0)
 (n = 11, t1 = 82.30466321243523, t2 = 31424.0)
 (n = 21, t1 = 298.67054263565893, t2 = 32289.0)
 (n = 31, t1 = 658.2981366459627, t2 = 32288.0)
 (n = 41, t1 = 1155.3, t2 = 31936.0)
 (n = 51, t1 = 1788.9, t2 = 33473.0)
 (n = 61, t1 = 2545.777777777778, t2 = 31873.0)
 (n = 71, t1 = 3456.125, t2 = 33377.0)
 (n = 81, t1 = 4503.0, t2 = 33409.0)
 (n = 91, t1 = 5690.666666666667, t2 = 33728.0)
 (n = 101, t1 = 7008.0, t2 = 34368.0)
 (n = 111, t1 = 8480.0, t2 = 34241.0)
 (n = 121, t1 = 10112.0, t2 = 34112.0)
 (n = 131, t1 = 11809.0, t2 = 34464.0)
 (n = 141, t1 = 13696.0, t2 = 33696.0)

AMD Ryzen Threadripper PRO 7995WX 96-Cores:

Threads.nthreads() = 96
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 4.216, t2 = 2.9395215e6)
 (n = 11, t1 = 148.04912280701754, t2 = 3.0838e6)
 (n = 21, t1 = 500.24226804123714, t2 = 3.204477e6)
 (n = 31, t1 = 1059.6, t2 = 3.353132e6)
 (n = 41, t1 = 839.3170731707318, t2 = 3.332551e6)
 (n = 51, t1 = 1569.3, t2 = 3.386172e6)
 (n = 61, t1 = 2520.5555555555557, t2 = 3.5372155e6)
 (n = 71, t1 = 3675.625, t2 = 3.404369e6)
 (n = 81, t1 = 3164.75, t2 = 7.638604e6)
 (n = 91, t1 = 4449.571428571428, t2 = 3.314338e6)
 (n = 101, t1 = 5942.166666666667, t2 = 3.4618665e6)
 (n = 111, t1 = 7586.5, t2 = 3.367734e6)
 (n = 121, t1 = 6630.0, t2 = 3.3118545e6)
 (n = 131, t1 = 8499.333333333334, t2 = 3.512002e6)
 (n = 141, t1 = 10596.0, t2 = 3.284583e6)

Mosè Giordano (Jul 06 2025 at 00:59):

With a tall matrix (A = rand(100 * n, n)) on Grace I get:

15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 69.50819672131148, t2 = 32800.0)
 (n = 11, t1 = 8437.333333333334, t2 = 32768.0)
 (n = 21, t1 = 30816.0, t2 = 33216.0)
 (n = 31, t1 = 67200.0, t2 = 32864.0)
 (n = 41, t1 = 117698.0, t2 = 35040.0)
 (n = 51, t1 = 181922.0, t2 = 35201.0)
 (n = 61, t1 = 262178.0, t2 = 36272.5)
 (n = 71, t1 = 355268.0, t2 = 40352.0)
 (n = 81, t1 = 459269.0, t2 = 42433.0)
 (n = 91, t1 = 579781.0, t2 = 42465.0)
 (n = 101, t1 = 720487.0, t2 = 46241.0)
 (n = 111, t1 = 862538.0, t2 = 52417.0)
 (n = 121, t1 = 1.021932e6, t2 = 54177.0)
 (n = 131, t1 = 1.19835e6, t2 = 55969.0)
 (n = 141, t1 = 1.389072e6, t2 = 64928.0)

wheeheee (Jul 06 2025 at 04:08):

Added a new update3! to the benchmarks (reusing the name, now with the @batch macro from Polyester.jl replacing @threads), and also fill!(A, sqrt(2)), just to see if the compiler was able to do magic, since barring fp error or overflow that is the result of the computation (it wasn't).

julia> results = map(1:10:150) do n
           A = rand(n, n)
           t1 = median(@benchmark update1!($A)).time
           t2 = median(@benchmark update2!($A)).time
           t3 = median(@benchmark update3!($A)).time
           tfill = median(@benchmark fill!($A, sqrt(2))).time
           (;n, t1, t2, t3, tfill)
       end
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64, t3::Float64, tfill::Float64}}:
 (n = 1, t1 = 5.8, t2 = 6100.0, t3 = 6.3, tfill = 2.7)
 (n = 11, t1 = 157.53768844221105, t2 = 5950.0, t3 = 379.04761904761904, tfill = 11.7)
 (n = 21, t1 = 361.21495327102804, t2 = 6100.0, t3 = 410.44776119402985, tfill = 18.236472945891784)
 (n = 31, t1 = 887.5, t2 = 6183.333333333333, t3 = 517.0103092783505, tfill = 36.052366565961734)
 (n = 41, t1 = 1290.0, t2 = 6333.333333333333, t3 = 658.2417582417582, tfill = 57.520325203252035)
 (n = 51, t1 = 2200.0, t2 = 7166.666666666667, t3 = 877.3972602739726, tfill = 87.26315789473684)
 (n = 61, t1 = 2800.0, t2 = 7416.666666666667, t3 = 1030.0, tfill = 121.38919514884233)
 (n = 71, t1 = 4071.4285714285716, t2 = 8400.0, t3 = 1420.0, tfill = 162.8140703517588)
 (n = 81, t1 = 4871.428571428572, t2 = 8900.0, t3 = 1640.0, tfill = 583.9779005524862)
 (n = 91, t1 = 6520.0, t2 = 9300.0, t3 = 2055.5555555555557, tfill = 798.9473684210526)
 (n = 101, t1 = 7525.0, t2 = 10125.0, t3 = 2366.6666666666665, tfill = 980.0)
 (n = 111, t1 = 9600.0, t2 = 10000.0, t3 = 2922.222222222222, tfill = 1180.0)
 (n = 121, t1 = 10800.0, t2 = 11100.0, t3 = 3262.5, tfill = 1390.0)
 (n = 131, t1 = 13200.0, t2 = 11800.0, t3 = 3912.5, tfill = 1630.0)
 (n = 141, t1 = 14600.0, t2 = 13000.0, t3 = 4357.142857142857, tfill = 1890.0)

julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b73 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, tigerlake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_DEPOT_PATH = Q:\.julia;
  JULIA_NUM_THREADS = auto

wheeheee (Jul 06 2025 at 04:11):

So the version using Polyester threads starts beating the serial version somewhere between n = 21 and n = 31, and always beats @threads...

Júlio Hoffimann (Jul 06 2025 at 05:10):

That is interesting! I'll benchmark on my machine to see how it performs. If it beats the serial version with n=21 that is a game changer for us.

Mason Protter (Jul 06 2025 at 07:30):

That probably has less to do with Polyester's threading performance, and more to do with the fact that it cheats by converting your arrays to StrideArrays.PtrArray out from under you.

Júlio Hoffimann (Jul 06 2025 at 09:26):

@wheeheee could you please share the version written with Polyester.jl?

wheeheee (Jul 06 2025 at 09:32):

After using Polyester, I merely replaced Threads.@threads in update2! with @batch.
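For reference, a sketch of that variant (update_batch! is an arbitrary name; in the benchmark above it reused the name update3!):

using Polyester

# update2! with Threads.@threads swapped for Polyester's @batch, which
# reuses a pool of tasks instead of spawning new ones, so it does not
# allocate per call.
function update_batch!(A)
  m, n = size(A)
  @batch for j in 1:n
    for i in 1:m
      @inbounds A[i, j] = kernel(i, j)
    end
  end
end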

wheeheee (Jul 08 2025 at 04:36):

Mason Protter said:

That probably has less to do with Polyester's threading performance, and more to do with the fact that it cheats by converting your arrays to StrideArrays.PtrArray out from under you.

While I do agree that there's not much point multithreading at n = 21 or even n = 30, why should that be faster?

Júlio Hoffimann (Jul 09 2025 at 15:46):

@wheeheee I ran with Polyester.jl and reproduced the improvement with n=21 and beyond:

15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 6.295, t2 = 7.3993993993994)
 (n = 11, t1 = 147.77980535279806, t2 = 410.6108490566038)
 (n = 21, t1 = 509.26943005181346, t2 = 445.1029411764706)
 (n = 31, t1 = 1122.7, t2 = 550.0837563451776)
 (n = 41, t1 = 1968.2, t2 = 687.1114130434783)
 (n = 51, t1 = 3033.8888888888887, t2 = 812.6372549019608)
 (n = 61, t1 = 4257.857142857143, t2 = 1034.9117647058824)
 (n = 71, t1 = 5764.5, t2 = 1303.4)
 (n = 81, t1 = 7510.75, t2 = 1564.7)
 (n = 91, t1 = 9694.0, t2 = 1832.7)
 (n = 101, t1 = 11949.0, t2 = 2218.1)
 (n = 111, t1 = 14416.0, t2 = 2631.0555555555557)
 (n = 121, t1 = 17125.0, t2 = 3016.222222222222)
 (n = 131, t1 = 20053.0, t2 = 3442.3333333333335)
 (n = 141, t1 = 23155.0, t2 = 4009.875)

Could you please elaborate on why there is not much point multi-threading with n=21?

Júlio Hoffimann (Jul 09 2025 at 15:50):

@Mosè Giordano in your output I am seeing 36 and 96 threads. Are these physical cores?

I wonder how the latest processors on the market compare with your setup. I have an old Intel i7 processor, and was assuming that most processors today still had around 8 physical cores. Perhaps I need a processor upgrade?

Mosè Giordano (Jul 09 2025 at 15:53):

Júlio Hoffimann said:

Mosè Giordano in your output I am seeing 36 and 96 threads. Are these physical cores?

Yes, but Grace has 72 physical cores; I was using half of the system.

Júlio Hoffimann said:

I wonder how the latest processors on the market compare with your setup. I have an old Intel i7 processor, and was assuming that most processors today still had around 8 physical cores. Perhaps I need a processor upgrade?

These are datacenter-grade CPUs, not laptop processors.

wheeheee (Jul 10 2025 at 01:26):

Júlio Hoffimann said:

Could you please elaborate on why there is not much point multi-threading with n=21?

Because there’s not much of a speedup (10%) for using so many cores. Even though there is less overhead compared to @threads, most of the work is still done setting up the threading or whatnot, although tasks are reused. Might as well just do without.

wheeheee (Jul 10 2025 at 01:36):

You can also look at Mosè's numbers; I think the overhead increases with the number of cores/threads. With the Threadripper's monstrous number of threads, even at n = 141 the serial version is still better. Basically this has all been mentioned before in the thread: if there is only a small amount of work, multithreading costs more than it's worth.

wheeheee (Jul 10 2025 at 01:37):

Also, as mentioned earlier, use multithreading in the outer loop. Being further up the abstraction layers doesn't necessarily prevent multithreading, as long as the computation is really parallelizable.


Last updated: Jul 22 2025 at 04:56 UTC