The two functions below differ only in the `Threads.@threads` annotation:
```julia
function update1!(A)
    m, n = size(A)
    for j in 1:n
        for i in 1:m
            @inbounds A[i, j] = sin(i) * cos(j)
        end
    end
end

function update2!(A)
    m, n = size(A)
    Threads.@threads for j in 1:n
        for i in 1:m
            @inbounds A[i, j] = sin(i) * cos(j)
        end
    end
end
```
I get the following results with 8 threads:
```julia
julia> using BenchmarkTools

julia> A = rand(100, 100);

julia> @btime update1!($A)
  159.424 μs (0 allocations: 0 bytes)

julia> @btime update2!($A)
  24.672 μs (42 allocations: 4.25 KiB)

julia> @allocated update1!(A)
0

julia> @allocated update2!(A)
4720
```
Is this expected behavior? Any way to avoid the allocations? I need to call this function some tens of millions of times in a hot loop.
Yes, it's expected since new tasks need to be allocated. If you're calling this inside a hot loop though, why not thread the outer loop? That will almost certainly be faster
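To illustrate that suggestion (a minimal sketch; the batch-of-work-items outer loop and the name `run_all!` are hypothetical, not from the thread):

```julia
# Sketch: keep update1! serial and thread the outer loop instead, so the
# task-spawning overhead is paid once per batch of work items rather than
# tens of millions of times.
function run_all!(As::Vector{<:AbstractMatrix})
    Threads.@threads for k in eachindex(As)
        update1!(As[k])  # the serial, allocation-free version
    end
end
```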
The outer loop is very high up in the abstraction layers. This function lives in a package and the outer loop lives in another separate package, for instance.
I will benchmark the other package to see the impact.
Unfortunately the performance decreases using threads due to the allocations.
Júlio Hoffimann said:
> Unfortunately the performance decreases using threads due to the allocations.
That suggests that the body of the loop isn't expensive enough to make spawning tasks worth it. Do more work inside each task
My original idea was to gain performance in the inner function and in the outer function as a consequence. The issue is that the inner allocations decrease the outer performance. The inner function is much faster with threads.
If this is the actual function and not just an example, you could try
```julia
function update3!(A, s, c)
    m, n = size(A)
    s .= sin.(1:m)
    c .= cos.(1:n)
    A .= s .* c'    # outer product: A[i, j] = sin(i) * cos(j)
end
```
```julia
julia> m1 = n1 = 100
100

julia> s = zeros(m1); c = zeros(n1);

julia> @btime update2!(A) setup = (A = rand(m1, n1));
  40.200 μs (42 allocations: 5.00 KiB)

julia> @btime update3!(A, $s, $c) setup = (A = rand(m1, n1));
  2.500 μs (0 allocations: 0 bytes)

julia> A = zeros(m1, n1); B = similar(A);

julia> update1!(A); update3!(B, s, c);

julia> A == B
true
```
That is interesting. Broadcasting is faster than using multiple threads in this case? Why?
Unfortunately my use case is not that simple.
Júlio Hoffimann said:
> That is interesting. Broadcasting is faster than using multiple threads in this case? Why?
Again, because of the overhead of spawning tasks in the first place. Actual numbers strongly depend on your system, but expect spawning a task to be of the order of ~microseconds: if your entire function takes 2 microseconds to run, even spawning a single task would kill performance; if you spawn multiple of them, that'd completely dwarf the actual computing time.
Again, to make multithreading worth it you need to do substantial computations within each task, something which takes much more than ~microseconds.
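For a rough feel of that overhead on a given machine, one can time trivial tasks directly (a sketch; `spawn_one` and `empty_threads` are made-up helpers, and the numbers are machine-dependent):

```julia
using BenchmarkTools

# Cost of spawning, scheduling, and fetching a single trivial task.
spawn_one() = fetch(Threads.@spawn 1 + 1)

# Cost of an empty Threads.@threads loop: one task per thread, doing no work.
function empty_threads()
    Threads.@threads for i in 1:Threads.nthreads()
    end
end

@btime spawn_one()
@btime empty_threads()
```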
See the section "Multi-threading: is it always worth it?" of https://github.com/JuliaParallel/julia-hpc-tutorial-sc24/blob/9d2c54a52cc38f7f0159be79591232575333b7c9/parts/multithreading/multithreading.ipynb
A very tall matrix (many more rows than columns, ideally with at least tens or hundreds of thousands of rows) may benefit from multithreading, but for a square matrix with just 100 rows you won't see any benefit
Júlio Hoffimann said:
> That is interesting. Broadcasting is faster than using multiple threads in this case? Why?
> Unfortunately my use case is not that simple.
I don't see that broadcasting is faster? As you said, whatever is in the loop you actually do isn't as simple as a separable function like `sin(i) * cos(j)` in this example.
Anyway, I think you might be looking for Polyester.jl. It doesn't allocate. If that doesn't help you that much, I would take Mosè's advice...
@Mosè Giordano I think I'm missing something here. My example with `update1!` and `update2!` shows that threads improve the performance. The code with broadcasting, called `update3!`, beats both. In other words, `update2!` shows that tasks have enough work. But `update3!` beats it.
@wheeheee oh I see the issue now! The `update3!` assumes separability of the kernel. It is doing less computation.
I'll produce a complete MWE with the outer loop I mentioned before to see if you have improvement suggestions.
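To put rough numbers on "doing less computation" (illustrative arithmetic, not from the thread): for an m × n matrix, the loop versions evaluate sin and cos per entry, while the separable `update3!` needs only m sines and n cosines plus a cheap outer product.

```julia
# Transcendental-call counts for m = n = 100 (upper bounds; the compiler
# may hoist the loop-invariant cos(j) out of the inner loop in update1!):
m = n = 100
calls_loops     = 2 * m * n  # update1!/update2!: up to 20_000 sin/cos calls
calls_separable = m + n      # update3!: sin.(1:m) and cos.(1:n) once -> 200
```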
The threaded performance is a function of the `kernel` function, the size `n` of the problem, and the number of threads. On my machine, the threaded `update2!` only becomes advantageous for n > 90 when running with 8 threads (times below are in nanoseconds):
```julia
using BenchmarkTools
using StaticArrays
using Distances

function kernel(i, j)
    u = SVector(float(i), float(j))
    v = SVector(float(i + 1), float(j + 1))
    euclidean(u, v)
end

function update1!(A)
    m, n = size(A)
    for j in 1:n
        for i in 1:m
            @inbounds A[i, j] = kernel(i, j)
        end
    end
end

function update2!(A)
    m, n = size(A)
    Threads.@threads for j in 1:n
        for i in 1:m
            @inbounds A[i, j] = kernel(i, j)
        end
    end
end

@show Threads.nthreads()

results = map(1:10:150) do n
    A = rand(n, n)
    t1 = median(@benchmark update1!($A)).time
    t2 = median(@benchmark update2!($A)).time
    (n=n, t1=t1, t2=t2)
end
```
```
Threads.nthreads() = 8
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 6.286, t2 = 6960.0)
 (n = 11, t1 = 151.07542579075425, t2 = 7117.875)
 (n = 21, t1 = 518.8802083333334, t2 = 7134.5)
 (n = 31, t1 = 1122.6, t2 = 7110.375)
 (n = 41, t1 = 1968.7, t2 = 7036.5)
 (n = 51, t1 = 3033.875, t2 = 7280.571428571428)
 (n = 61, t1 = 4352.142857142857, t2 = 7336.5)
 (n = 71, t1 = 5883.666666666667, t2 = 7573.0)
 (n = 81, t1 = 7673.5, t2 = 7964.642857142857)
 (n = 91, t1 = 9691.0, t2 = 8256.166666666666)
 (n = 101, t1 = 11952.0, t2 = 8688.25)
 (n = 111, t1 = 14423.0, t2 = 9112.3)
 (n = 121, t1 = 17149.0, t2 = 9868.7)
 (n = 131, t1 = 20071.0, t2 = 10540.7)
 (n = 141, t1 = 23638.0, t2 = 11317.2)
```
I wonder if these numbers are similar on other machines. I'd appreciate it if you could run the script on your machine for comparison :slight_smile:
Run on a Ryzen 7 5800X:
```
Threads.nthreads() = 8
 (n = 1, t1 = 8.2, t2 = 4487.5)
 (n = 11, t1 = 168.39378238341968, t2 = 4787.5)
 (n = 21, t1 = 514.8717948717949, t2 = 4857.142857142857)
 (n = 31, t1 = 1130.0, t2 = 4771.428571428572)
 (n = 41, t1 = 1890.0, t2 = 5012.5)
 (n = 51, t1 = 3088.8888888888887, t2 = 5128.571428571428)
 (n = 61, t1 = 4257.142857142857, t2 = 5600.0)
 (n = 71, t1 = 5833.333333333333, t2 = 5285.714285714285)
 (n = 81, t1 = 7325.0, t2 = 5514.285714285715)
 (n = 91, t1 = 9433.333333333334, t2 = 6016.666666666667)
 (n = 101, t1 = 11500.0, t2 = 6400.0)
 (n = 111, t1 = 13700.0, t2 = 7250.0)
 (n = 121, t1 = 16200.0, t2 = 7920.0)
 (n = 131, t1 = 19600.0, t2 = 8200.0)
 (n = 141, t1 = 22300.0, t2 = 8925.0)
```
From what I'm seeing, the startup cost for the threads (for me) is ~4500 ns: once the sequential run time gets notably longer than that, I see better times with multithreading.
Very interesting to see that on your machine n > 70 is where things get better with threads.
Grace:
```
Threads.nthreads() = 36
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 3.648, t2 = 32769.0)
 (n = 11, t1 = 82.30466321243523, t2 = 31424.0)
 (n = 21, t1 = 298.67054263565893, t2 = 32289.0)
 (n = 31, t1 = 658.2981366459627, t2 = 32288.0)
 (n = 41, t1 = 1155.3, t2 = 31936.0)
 (n = 51, t1 = 1788.9, t2 = 33473.0)
 (n = 61, t1 = 2545.777777777778, t2 = 31873.0)
 (n = 71, t1 = 3456.125, t2 = 33377.0)
 (n = 81, t1 = 4503.0, t2 = 33409.0)
 (n = 91, t1 = 5690.666666666667, t2 = 33728.0)
 (n = 101, t1 = 7008.0, t2 = 34368.0)
 (n = 111, t1 = 8480.0, t2 = 34241.0)
 (n = 121, t1 = 10112.0, t2 = 34112.0)
 (n = 131, t1 = 11809.0, t2 = 34464.0)
 (n = 141, t1 = 13696.0, t2 = 33696.0)
```
AMD Ryzen Threadripper PRO 7995WX 96-Cores:
```
Threads.nthreads() = 96
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 4.216, t2 = 2.9395215e6)
 (n = 11, t1 = 148.04912280701754, t2 = 3.0838e6)
 (n = 21, t1 = 500.24226804123714, t2 = 3.204477e6)
 (n = 31, t1 = 1059.6, t2 = 3.353132e6)
 (n = 41, t1 = 839.3170731707318, t2 = 3.332551e6)
 (n = 51, t1 = 1569.3, t2 = 3.386172e6)
 (n = 61, t1 = 2520.5555555555557, t2 = 3.5372155e6)
 (n = 71, t1 = 3675.625, t2 = 3.404369e6)
 (n = 81, t1 = 3164.75, t2 = 7.638604e6)
 (n = 91, t1 = 4449.571428571428, t2 = 3.314338e6)
 (n = 101, t1 = 5942.166666666667, t2 = 3.4618665e6)
 (n = 111, t1 = 7586.5, t2 = 3.367734e6)
 (n = 121, t1 = 6630.0, t2 = 3.3118545e6)
 (n = 131, t1 = 8499.333333333334, t2 = 3.512002e6)
 (n = 141, t1 = 10596.0, t2 = 3.284583e6)
```
With a tall matrix (`A = rand(100 * n, n)`) on Grace I get
```
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 69.50819672131148, t2 = 32800.0)
 (n = 11, t1 = 8437.333333333334, t2 = 32768.0)
 (n = 21, t1 = 30816.0, t2 = 33216.0)
 (n = 31, t1 = 67200.0, t2 = 32864.0)
 (n = 41, t1 = 117698.0, t2 = 35040.0)
 (n = 51, t1 = 181922.0, t2 = 35201.0)
 (n = 61, t1 = 262178.0, t2 = 36272.5)
 (n = 71, t1 = 355268.0, t2 = 40352.0)
 (n = 81, t1 = 459269.0, t2 = 42433.0)
 (n = 91, t1 = 579781.0, t2 = 42465.0)
 (n = 101, t1 = 720487.0, t2 = 46241.0)
 (n = 111, t1 = 862538.0, t2 = 52417.0)
 (n = 121, t1 = 1.021932e6, t2 = 54177.0)
 (n = 131, t1 = 1.19835e6, t2 = 55969.0)
 (n = 141, t1 = 1.389072e6, t2 = 64928.0)
```
Added to the benchmarks a new `update3!` with the `@batch` macro from Polyester.jl replacing `@threads`, and also `fill!(A, sqrt(2))`, just to see if the compiler was able to do magic, since, barring floating-point error or overflow, that is the result of the computation (it wasn't).
```julia
julia> results = map(1:10:150) do n
           A = rand(n, n)
           t1 = median(@benchmark update1!($A)).time
           t2 = median(@benchmark update2!($A)).time
           t3 = median(@benchmark update3!($A)).time
           tfill = median(@benchmark fill!($A, sqrt(2))).time
           (;n, t1, t2, t3, tfill)
       end
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64, t3::Float64, tfill::Float64}}:
 (n = 1, t1 = 5.8, t2 = 6100.0, t3 = 6.3, tfill = 2.7)
 (n = 11, t1 = 157.53768844221105, t2 = 5950.0, t3 = 379.04761904761904, tfill = 11.7)
 (n = 21, t1 = 361.21495327102804, t2 = 6100.0, t3 = 410.44776119402985, tfill = 18.236472945891784)
 (n = 31, t1 = 887.5, t2 = 6183.333333333333, t3 = 517.0103092783505, tfill = 36.052366565961734)
 (n = 41, t1 = 1290.0, t2 = 6333.333333333333, t3 = 658.2417582417582, tfill = 57.520325203252035)
 (n = 51, t1 = 2200.0, t2 = 7166.666666666667, t3 = 877.3972602739726, tfill = 87.26315789473684)
 (n = 61, t1 = 2800.0, t2 = 7416.666666666667, t3 = 1030.0, tfill = 121.38919514884233)
 (n = 71, t1 = 4071.4285714285716, t2 = 8400.0, t3 = 1420.0, tfill = 162.8140703517588)
 (n = 81, t1 = 4871.428571428572, t2 = 8900.0, t3 = 1640.0, tfill = 583.9779005524862)
 (n = 91, t1 = 6520.0, t2 = 9300.0, t3 = 2055.5555555555557, tfill = 798.9473684210526)
 (n = 101, t1 = 7525.0, t2 = 10125.0, t3 = 2366.6666666666665, tfill = 980.0)
 (n = 111, t1 = 9600.0, t2 = 10000.0, t3 = 2922.222222222222, tfill = 1180.0)
 (n = 121, t1 = 10800.0, t2 = 11100.0, t3 = 3262.5, tfill = 1390.0)
 (n = 131, t1 = 13200.0, t2 = 11800.0, t3 = 3912.5, tfill = 1630.0)
 (n = 141, t1 = 14600.0, t2 = 13000.0, t3 = 4357.142857142857, tfill = 1890.0)
```
```julia
julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b73 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, tigerlake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_DEPOT_PATH = Q:\.julia;
  JULIA_NUM_THREADS = auto
```
So the version using Polyester threads starts beating the serial version somewhere between n = 21 and n = 31, and always beats `@threads`...
That is interesting! I'll benchmark on my machine to see how it performs. If it beats the serial version at n = 21, that is a game changer for us.
That probably has less to do with Polyester's threading performance, and more to do with the fact that it cheats by converting your arrays to `StrideArrays.PtrArray` out from under you.
@wheeheee could you please share the version written with Polyester.jl?
After `using Polyester`, I merely replaced `Threads.@threads` in `update2!` with `@batch`.
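Presumably something along these lines (a sketch of the stated replacement, reusing the `kernel` from the MWE above; it assumes Polyester.jl is installed):

```julia
using Polyester  # provides the @batch macro

function update3!(A)
    m, n = size(A)
    @batch for j in 1:n  # @batch in place of Threads.@threads; reuses tasks, no allocations
        for i in 1:m
            @inbounds A[i, j] = kernel(i, j)
        end
    end
end
```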
Mason Protter said:
> That probably has less to do with Polyester's threading performance, and more to do with the fact that it cheats by converting your arrays to `StrideArrays.PtrArray` out from under you.

While I do agree that there's not much point multithreading at n = 21 or even n = 30, why should that be faster?
@wheeheee I ran with Polyester.jl and reproduced the improvement at n = 21 and beyond:
```
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 6.295, t2 = 7.3993993993994)
 (n = 11, t1 = 147.77980535279806, t2 = 410.6108490566038)
 (n = 21, t1 = 509.26943005181346, t2 = 445.1029411764706)
 (n = 31, t1 = 1122.7, t2 = 550.0837563451776)
 (n = 41, t1 = 1968.2, t2 = 687.1114130434783)
 (n = 51, t1 = 3033.8888888888887, t2 = 812.6372549019608)
 (n = 61, t1 = 4257.857142857143, t2 = 1034.9117647058824)
 (n = 71, t1 = 5764.5, t2 = 1303.4)
 (n = 81, t1 = 7510.75, t2 = 1564.7)
 (n = 91, t1 = 9694.0, t2 = 1832.7)
 (n = 101, t1 = 11949.0, t2 = 2218.1)
 (n = 111, t1 = 14416.0, t2 = 2631.0555555555557)
 (n = 121, t1 = 17125.0, t2 = 3016.222222222222)
 (n = 131, t1 = 20053.0, t2 = 3442.3333333333335)
 (n = 141, t1 = 23155.0, t2 = 4009.875)
```
Could you please elaborate on why there is not much point multi-threading with n = 21?
@Mosè Giordano in your output I am seeing 36 and 96 threads. Are these physical cores?
I wonder how the latest processors on the market compare with your setup. I have an old Intel i7 processor, and was assuming that most processors today still had around 8 physical cores. Perhaps I need a processor upgrade?
Júlio Hoffimann said:
> @Mosè Giordano in your output I am seeing 36 and 96 threads. Are these physical cores?
Yes, but Grace has 72 physical cores; I was using half of the system.
Júlio Hoffimann said:
> I wonder how the latest processors on the market compare with your setup. I have an old Intel i7 processor, and was assuming that most processors today still had around 8 physical cores. Perhaps I need a processor upgrade?
These are datacenter-grade CPUs, not for laptops.
Júlio Hoffimann said:
> Could you please elaborate on why there is not much point multi-threading with n = 21?
Because there's not much of a speedup (10%) for using so many cores. Even though there is less overhead compared to `@threads`, most of the work is still done setting up the threading or whatnot, although tasks are reused. Might as well just do without.
You can also look at Mosè's numbers; I think the overhead increases with the number of cores / threads. With the Threadripper's monstrous number of threads, even at n = 141 the serial version is still better. Basically, this has all been mentioned before in the thread: if there is only a small amount of work, multithreading costs more than it's worth.
Also, as mentioned earlier, use multithreading in the outer loop. Being further up the abstraction layers doesn't necessarily prevent multithreading, as long as the computation is really parallelizable.
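One way to reconcile that advice with the package layering described earlier (my sketch, not a design proposed in the thread; `update!` and `_fill_column!` are made-up names, and `kernel` is the one from the MWE above) is for the inner package to expose a switch so the outermost caller decides where the threads go:

```julia
# Sketch: the inner package offers both code paths behind a keyword.
function _fill_column!(A, j)
    for i in axes(A, 1)
        @inbounds A[i, j] = kernel(i, j)
    end
end

function update!(A; threaded::Bool=false)
    if threaded
        Threads.@threads for j in axes(A, 2)
            _fill_column!(A, j)
        end
    else
        for j in axes(A, 2)
            _fill_column!(A, j)
        end
    end
end
```

The outer package would then call `update!(A)` inside its own `Threads.@threads` loop, or `update!(A; threaded=true)` when it is not parallelizing anything itself.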