The two functions below differ only in the `Threads.@threads` annotation:
```julia
function update1!(A)
    m, n = size(A)
    for j in 1:n
        for i in 1:m
            @inbounds A[i, j] = sin(i) * cos(j)
        end
    end
end

function update2!(A)
    m, n = size(A)
    Threads.@threads for j in 1:n
        for i in 1:m
            @inbounds A[i, j] = sin(i) * cos(j)
        end
    end
end
```
I get the following results with 8 threads:
```julia
julia> using BenchmarkTools

julia> A = rand(100, 100);

julia> @btime update1!($A)
  159.424 μs (0 allocations: 0 bytes)

julia> @btime update2!($A)
  24.672 μs (42 allocations: 4.25 KiB)

julia> @allocated update1!(A)
0

julia> @allocated update2!(A)
4720
```
Is this expected behavior? Any way to avoid the allocations? I need to call this function some tens of millions of times in a hot loop.
Yes, it's expected since new tasks need to be allocated. If you're calling this inside a hot loop though, why not thread the outer loop? That will almost certainly be faster
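To illustrate that suggestion (a minimal sketch; the batch-of-work-items outer loop and the name `run_all!` are hypothetical, not from the thread):

```julia
# Sketch: keep update1! serial and thread the outer loop instead, so the
# task-spawning overhead is paid once per batch of work items rather than
# tens of millions of times.
function run_all!(As::Vector{<:AbstractMatrix})
    Threads.@threads for k in eachindex(As)
        update1!(As[k])  # the serial, allocation-free version
    end
end
```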
The outer loop is very high up in the abstraction layers. This function lives in a package and the outer loop lives in another separate package, for instance.
I will benchmark the other package to see the impact.
Unfortunately the performance decreases using threads due to the allocations.
Júlio Hoffimann said:
> Unfortunately the performance decreases using threads due to the allocations.
That suggests that the body of the loop isn't expensive enough to make spawning tasks worth it. Do more work inside each task
My original idea was to gain performance in the inner function and in the outer function as a consequence. The issue is that the inner allocations decrease the outer performance. The inner function is much faster with threads.
If this is the actual function and not just an example, you could try
```julia
function update3!(A, s, c)
    m, n = size(A)
    s .= sin.(1:m)
    c .= cos.(1:n)
    A .= s .* c'    # outer product: A[i, j] = sin(i) * cos(j)
end
```
```julia
julia> m1 = n1 = 100
100

julia> s = zeros(m1); c = zeros(n1);

julia> @btime update2!(A) setup = (A = rand(m1, n1));
  40.200 μs (42 allocations: 5.00 KiB)

julia> @btime update3!(A, $s, $c) setup = (A = rand(m1, n1));
  2.500 μs (0 allocations: 0 bytes)

julia> A = zeros(m1, n1); B = similar(A);

julia> update1!(A); update3!(B, s, c);

julia> A == B
true
```
That is interesting. Broadcasting is faster than using multiple threads in this case? Why?
Unfortunately my use case is not that simple.
Júlio Hoffimann said:
> That is interesting. Broadcasting is faster than using multiple threads in this case? Why?
Again, because of the overhead of spawning tasks in the first place. Actual numbers strongly depend on your system, but expect spawning a task to be of the order of ~microseconds: if your entire function takes 2 microseconds to run, even spawning a single task would kill performance; if you spawn multiple of them, that'd completely dwarf the actual computing time.
Again, to make multithreading worth it you need to do substantial computations within each task, something which takes much more than ~microseconds.
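For a rough feel of that overhead on a given machine, one can time trivial tasks directly (a sketch; `spawn_one` and `empty_threads` are made-up helpers, and the numbers are machine-dependent):

```julia
using BenchmarkTools

# Cost of spawning, scheduling, and fetching a single trivial task.
spawn_one() = fetch(Threads.@spawn 1 + 1)

# Cost of an empty Threads.@threads loop: one task per thread, doing no work.
function empty_threads()
    Threads.@threads for i in 1:Threads.nthreads()
    end
end

@btime spawn_one()
@btime empty_threads()
```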
See the section "Multi-threading: is it always worth it?" of https://github.com/JuliaParallel/julia-hpc-tutorial-sc24/blob/9d2c54a52cc38f7f0159be79591232575333b7c9/parts/multithreading/multithreading.ipynb
A very tall matrix (many more rows than columns, ideally with at least tens or hundreds of thousands of rows) may benefit from multithreading, but for a square matrix with just 100 rows you won't see any benefit
Júlio Hoffimann said:
> That is interesting. Broadcasting is faster than using multiple threads in this case? Why?
> Unfortunately my use case is not that simple.
I don't see that broadcasting is faster? As you said, whatever is in the loop you actually do isn't as simple as a separable function like `sin(i) * cos(j)` in this example.
Anyway, I think you might be looking for Polyester.jl. It doesn't allocate. If that doesn't help you that much, I would take Mosè's advice...
@Mosè Giordano I think I'm missing something here. My example with `update1!` and `update2!` shows that threads improve the performance. The code with broadcasting, called `update3!`, beats both. In other words, `update2!` shows that tasks have enough work. But `update3!` beats it.
@wheeheee oh I see the issue now! The `update3!` assumes separability of the kernel. It is doing less computation.
I'll produce a complete MWE with the outer loop I mentioned before to see if you have improvement suggestions.
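To put rough numbers on "doing less computation" (illustrative arithmetic, not from the thread): for an m × n matrix, the loop versions evaluate sin and cos per entry, while the separable `update3!` needs only m sines and n cosines plus a cheap outer product.

```julia
# Transcendental-call counts for m = n = 100 (upper bounds; the compiler
# may hoist the loop-invariant cos(j) out of the inner loop in update1!):
m = n = 100
calls_loops     = 2 * m * n  # update1!/update2!: up to 20_000 sin/cos calls
calls_separable = m + n      # update3!: sin.(1:m) and cos.(1:n) once -> 200
```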
The threaded performance is a function of the `kernel` function, the size `n` of the problem, and the number of threads. On my machine, the threaded `update2!` only becomes advantageous for n > 90 when running with 8 threads (times below are in nanoseconds):
```julia
using BenchmarkTools
using StaticArrays
using Distances

function kernel(i, j)
    u = SVector(float(i), float(j))
    v = SVector(float(i + 1), float(j + 1))
    euclidean(u, v)
end

function update1!(A)
    m, n = size(A)
    for j in 1:n
        for i in 1:m
            @inbounds A[i, j] = kernel(i, j)
        end
    end
end

function update2!(A)
    m, n = size(A)
    Threads.@threads for j in 1:n
        for i in 1:m
            @inbounds A[i, j] = kernel(i, j)
        end
    end
end

@show Threads.nthreads()

results = map(1:10:150) do n
    A = rand(n, n)
    t1 = median(@benchmark update1!($A)).time
    t2 = median(@benchmark update2!($A)).time
    (n=n, t1=t1, t2=t2)
end
```
```
Threads.nthreads() = 8
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 6.286, t2 = 6960.0)
 (n = 11, t1 = 151.07542579075425, t2 = 7117.875)
 (n = 21, t1 = 518.8802083333334, t2 = 7134.5)
 (n = 31, t1 = 1122.6, t2 = 7110.375)
 (n = 41, t1 = 1968.7, t2 = 7036.5)
 (n = 51, t1 = 3033.875, t2 = 7280.571428571428)
 (n = 61, t1 = 4352.142857142857, t2 = 7336.5)
 (n = 71, t1 = 5883.666666666667, t2 = 7573.0)
 (n = 81, t1 = 7673.5, t2 = 7964.642857142857)
 (n = 91, t1 = 9691.0, t2 = 8256.166666666666)
 (n = 101, t1 = 11952.0, t2 = 8688.25)
 (n = 111, t1 = 14423.0, t2 = 9112.3)
 (n = 121, t1 = 17149.0, t2 = 9868.7)
 (n = 131, t1 = 20071.0, t2 = 10540.7)
 (n = 141, t1 = 23638.0, t2 = 11317.2)
```
I wonder if these numbers are similar on other machines. I'd appreciate it if you could run the script on your machine for comparison :slight_smile:
Run on a Ryzen 7 5800X:
```
Threads.nthreads() = 8
 (n = 1, t1 = 8.2, t2 = 4487.5)
 (n = 11, t1 = 168.39378238341968, t2 = 4787.5)
 (n = 21, t1 = 514.8717948717949, t2 = 4857.142857142857)
 (n = 31, t1 = 1130.0, t2 = 4771.428571428572)
 (n = 41, t1 = 1890.0, t2 = 5012.5)
 (n = 51, t1 = 3088.8888888888887, t2 = 5128.571428571428)
 (n = 61, t1 = 4257.142857142857, t2 = 5600.0)
 (n = 71, t1 = 5833.333333333333, t2 = 5285.714285714285)
 (n = 81, t1 = 7325.0, t2 = 5514.285714285715)
 (n = 91, t1 = 9433.333333333334, t2 = 6016.666666666667)
 (n = 101, t1 = 11500.0, t2 = 6400.0)
 (n = 111, t1 = 13700.0, t2 = 7250.0)
 (n = 121, t1 = 16200.0, t2 = 7920.0)
 (n = 131, t1 = 19600.0, t2 = 8200.0)
 (n = 141, t1 = 22300.0, t2 = 8925.0)
```
From what I'm seeing, the startup cost for the threads (for me) is ~4500 ns: once the sequential run time gets notably longer than that, I see better times with multithreading.
Very interesting to see that on your machine n > 70 is where things get better with threads.
Grace:
```
Threads.nthreads() = 36
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 3.648, t2 = 32769.0)
 (n = 11, t1 = 82.30466321243523, t2 = 31424.0)
 (n = 21, t1 = 298.67054263565893, t2 = 32289.0)
 (n = 31, t1 = 658.2981366459627, t2 = 32288.0)
 (n = 41, t1 = 1155.3, t2 = 31936.0)
 (n = 51, t1 = 1788.9, t2 = 33473.0)
 (n = 61, t1 = 2545.777777777778, t2 = 31873.0)
 (n = 71, t1 = 3456.125, t2 = 33377.0)
 (n = 81, t1 = 4503.0, t2 = 33409.0)
 (n = 91, t1 = 5690.666666666667, t2 = 33728.0)
 (n = 101, t1 = 7008.0, t2 = 34368.0)
 (n = 111, t1 = 8480.0, t2 = 34241.0)
 (n = 121, t1 = 10112.0, t2 = 34112.0)
 (n = 131, t1 = 11809.0, t2 = 34464.0)
 (n = 141, t1 = 13696.0, t2 = 33696.0)
```
AMD Ryzen Threadripper PRO 7995WX 96-Cores:
```
Threads.nthreads() = 96
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 4.216, t2 = 2.9395215e6)
 (n = 11, t1 = 148.04912280701754, t2 = 3.0838e6)
 (n = 21, t1 = 500.24226804123714, t2 = 3.204477e6)
 (n = 31, t1 = 1059.6, t2 = 3.353132e6)
 (n = 41, t1 = 839.3170731707318, t2 = 3.332551e6)
 (n = 51, t1 = 1569.3, t2 = 3.386172e6)
 (n = 61, t1 = 2520.5555555555557, t2 = 3.5372155e6)
 (n = 71, t1 = 3675.625, t2 = 3.404369e6)
 (n = 81, t1 = 3164.75, t2 = 7.638604e6)
 (n = 91, t1 = 4449.571428571428, t2 = 3.314338e6)
 (n = 101, t1 = 5942.166666666667, t2 = 3.4618665e6)
 (n = 111, t1 = 7586.5, t2 = 3.367734e6)
 (n = 121, t1 = 6630.0, t2 = 3.3118545e6)
 (n = 131, t1 = 8499.333333333334, t2 = 3.512002e6)
 (n = 141, t1 = 10596.0, t2 = 3.284583e6)
```
With a tall matrix (`A = rand(100 * n, n)`) on Grace I get
```
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 69.50819672131148, t2 = 32800.0)
 (n = 11, t1 = 8437.333333333334, t2 = 32768.0)
 (n = 21, t1 = 30816.0, t2 = 33216.0)
 (n = 31, t1 = 67200.0, t2 = 32864.0)
 (n = 41, t1 = 117698.0, t2 = 35040.0)
 (n = 51, t1 = 181922.0, t2 = 35201.0)
 (n = 61, t1 = 262178.0, t2 = 36272.5)
 (n = 71, t1 = 355268.0, t2 = 40352.0)
 (n = 81, t1 = 459269.0, t2 = 42433.0)
 (n = 91, t1 = 579781.0, t2 = 42465.0)
 (n = 101, t1 = 720487.0, t2 = 46241.0)
 (n = 111, t1 = 862538.0, t2 = 52417.0)
 (n = 121, t1 = 1.021932e6, t2 = 54177.0)
 (n = 131, t1 = 1.19835e6, t2 = 55969.0)
 (n = 141, t1 = 1.389072e6, t2 = 64928.0)
```
Added to the benchmarks a new `update3!` with the `@batch` macro from Polyester.jl replacing `@threads`, and also `fill!(A, sqrt(2))`, just to see if the compiler was able to do magic, since, barring floating-point error or overflow, that is the result of the computation (it wasn't).
```julia
julia> results = map(1:10:150) do n
           A = rand(n, n)
           t1 = median(@benchmark update1!($A)).time
           t2 = median(@benchmark update2!($A)).time
           t3 = median(@benchmark update3!($A)).time
           tfill = median(@benchmark fill!($A, sqrt(2))).time
           (;n, t1, t2, t3, tfill)
       end
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64, t3::Float64, tfill::Float64}}:
 (n = 1, t1 = 5.8, t2 = 6100.0, t3 = 6.3, tfill = 2.7)
 (n = 11, t1 = 157.53768844221105, t2 = 5950.0, t3 = 379.04761904761904, tfill = 11.7)
 (n = 21, t1 = 361.21495327102804, t2 = 6100.0, t3 = 410.44776119402985, tfill = 18.236472945891784)
 (n = 31, t1 = 887.5, t2 = 6183.333333333333, t3 = 517.0103092783505, tfill = 36.052366565961734)
 (n = 41, t1 = 1290.0, t2 = 6333.333333333333, t3 = 658.2417582417582, tfill = 57.520325203252035)
 (n = 51, t1 = 2200.0, t2 = 7166.666666666667, t3 = 877.3972602739726, tfill = 87.26315789473684)
 (n = 61, t1 = 2800.0, t2 = 7416.666666666667, t3 = 1030.0, tfill = 121.38919514884233)
 (n = 71, t1 = 4071.4285714285716, t2 = 8400.0, t3 = 1420.0, tfill = 162.8140703517588)
 (n = 81, t1 = 4871.428571428572, t2 = 8900.0, t3 = 1640.0, tfill = 583.9779005524862)
 (n = 91, t1 = 6520.0, t2 = 9300.0, t3 = 2055.5555555555557, tfill = 798.9473684210526)
 (n = 101, t1 = 7525.0, t2 = 10125.0, t3 = 2366.6666666666665, tfill = 980.0)
 (n = 111, t1 = 9600.0, t2 = 10000.0, t3 = 2922.222222222222, tfill = 1180.0)
 (n = 121, t1 = 10800.0, t2 = 11100.0, t3 = 3262.5, tfill = 1390.0)
 (n = 131, t1 = 13200.0, t2 = 11800.0, t3 = 3912.5, tfill = 1630.0)
 (n = 141, t1 = 14600.0, t2 = 13000.0, t3 = 4357.142857142857, tfill = 1890.0)
```
```julia
julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b73 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, tigerlake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_DEPOT_PATH = Q:\.julia;
  JULIA_NUM_THREADS = auto
```
So the version using Polyester threads starts beating the serial version somewhere between n = 21 and n = 31, and always beats `@threads`...
That is interesting! I'll benchmark on my machine to see how it performs. If it beats the serial version at n = 21, that is a game changer for us.
That probably has less to do with Polyester's threading performance, and more to do with the fact that it cheats by converting your arrays to `StrideArrays.PtrArray` out from under you.
@wheeheee could you please share the version written with Polyester.jl?
After `using Polyester`, I merely replaced `Threads.@threads` in `update2!` with `@batch`.
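Presumably something along these lines (a sketch of the stated replacement, reusing the `kernel` from the MWE above; it assumes Polyester.jl is installed):

```julia
using Polyester  # provides the @batch macro

function update3!(A)
    m, n = size(A)
    @batch for j in 1:n  # @batch in place of Threads.@threads; reuses tasks, no allocations
        for i in 1:m
            @inbounds A[i, j] = kernel(i, j)
        end
    end
end
```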
Mason Protter said:
> That probably has less to do with Polyester's threading performance, and more to do with the fact that it cheats by converting your arrays to `StrideArrays.PtrArray` out from under you.

While I do agree that there's not much point multithreading at n = 21 or even n = 30, why should that be faster?
@wheeheee I ran with Polyester.jl and reproduced the improvement at n = 21 and beyond:
```
15-element Vector{@NamedTuple{n::Int64, t1::Float64, t2::Float64}}:
 (n = 1, t1 = 6.295, t2 = 7.3993993993994)
 (n = 11, t1 = 147.77980535279806, t2 = 410.6108490566038)
 (n = 21, t1 = 509.26943005181346, t2 = 445.1029411764706)
 (n = 31, t1 = 1122.7, t2 = 550.0837563451776)
 (n = 41, t1 = 1968.2, t2 = 687.1114130434783)
 (n = 51, t1 = 3033.8888888888887, t2 = 812.6372549019608)
 (n = 61, t1 = 4257.857142857143, t2 = 1034.9117647058824)
 (n = 71, t1 = 5764.5, t2 = 1303.4)
 (n = 81, t1 = 7510.75, t2 = 1564.7)
 (n = 91, t1 = 9694.0, t2 = 1832.7)
 (n = 101, t1 = 11949.0, t2 = 2218.1)
 (n = 111, t1 = 14416.0, t2 = 2631.0555555555557)
 (n = 121, t1 = 17125.0, t2 = 3016.222222222222)
 (n = 131, t1 = 20053.0, t2 = 3442.3333333333335)
 (n = 141, t1 = 23155.0, t2 = 4009.875)
```
Could you please elaborate on why there is not much point multi-threading with n = 21?
@Mosè Giordano in your output I am seeing 36 and 96 threads. Are these physical cores?
I wonder how the latest processors on the market compare with your setup. I have an old Intel i7 processor, and was assuming that most processors today still had around 8 physical cores. Perhaps I need a processor upgrade?
Júlio Hoffimann said:
> @Mosè Giordano in your output I am seeing 36 and 96 threads. Are these physical cores?
Yes, but Grace has 72 physical cores; I was using half of the system.
Júlio Hoffimann said:
> I wonder how the latest processors on the market compare with your setup. I have an old Intel i7 processor, and was assuming that most processors today still had around 8 physical cores. Perhaps I need a processor upgrade?
These are datacenter-grade CPUs, not for laptops.
Júlio Hoffimann said:
> Could you please elaborate on why there is not much point multi-threading with n = 21?
Because there's not much of a speedup (10%) for using so many cores. Even though there is less overhead compared to `@threads`, most of the work is still done setting up the threading or whatnot, although tasks are reused. Might as well just do without.
You can also look at Mosè's numbers; I think the overhead increases with the number of cores / threads. With the Threadripper's monstrous number of threads, even at n = 141 the serial version is still better. Basically, this has all been mentioned before in the thread: if there is only a small amount of work, multithreading costs more than it's worth.
Also, as mentioned earlier, use multithreading in the outer loop. Being further up the abstraction layers doesn't necessarily prevent multithreading, as long as the computation is really parallelizable.
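One way to reconcile that advice with the package layering described earlier (my sketch, not a design proposed in the thread; `update!` and `_fill_column!` are made-up names, and `kernel` is the one from the MWE above) is for the inner package to expose a switch so the outermost caller decides where the threads go:

```julia
# Sketch: the inner package offers both code paths behind a keyword.
function _fill_column!(A, j)
    for i in axes(A, 1)
        @inbounds A[i, j] = kernel(i, j)
    end
end

function update!(A; threaded::Bool=false)
    if threaded
        Threads.@threads for j in axes(A, 2)
            _fill_column!(A, j)
        end
    else
        for j in axes(A, 2)
            _fill_column!(A, j)
        end
    end
end
```

The outer package would then call `update!(A)` inside its own `Threads.@threads` loop, or `update!(A; threaded=true)` when it is not parallelizing anything itself.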