I'm trying to beat Numba and failing:
In [55]: @numba.jit(nopython=True, fastmath=False, parallel=True)
    ...: def sum_squares(a):
    ...:     return np.sum(np.square(a))
    ...:

In [58]: %timeit sum_squares(A)
874 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
julia> A = rand(2048, 2408);
julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A)
sum_squares (generic function with 1 method)
julia> @btime sum_squares($A)
982.900 μs (2328 allocations: 159.38 KiB)
1.6443015832629544e6
I'm getting about the same. Also tried A |> Map(x -> x^2) |> sum with Transducers and got slightly worse. I often find the performance of threading in Julia frustratingly fiddly, especially for the additive cost. Not sure how to fix this. @Takafumi Arakaki (tkf) might have some insight.
My insight is that it's best to try conjuring @chriselrod for this kind of close-to-the-metal thing :)
I'm curious how it scales. In my experience the additive cost with the threading libraries usually sucks, but they'll scale pretty well
@Mark Kittisopikul, how big is the array in Python? I noticed you wrote rand(2048, 2408) in the Julia code
I tried that; it's still really slow regardless, it's only like 5/4 or so
Here is what I see with LoopVectorization.jl
julia> using LoopVectorization, ThreadsX
julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A);
julia> function sum_squares_turbo(A::Matrix{T}) where {T}
           out = zero(T)
           @tturbo for i in eachindex(A)
               out += A[i]^2
           end
           out
       end;
julia> let A = rand(2048, 2048)
           @btime sum_squares($A)
           @btime sum_squares_turbo($A)
       end;
804.960 μs (669 allocations: 42.56 KiB)
747.194 μs (0 allocations: 0 bytes)
julia> let A = rand(2048, 2408)
           @btime sum_squares($A)
           @btime sum_squares_turbo($A)
       end;
966.097 μs (672 allocations: 42.66 KiB)
913.550 μs (0 allocations: 0 bytes)
There's some profit from using good vectorization here and lower overhead in the multithreading, but this operation is almost totally dominated by memory bandwidth, I believe, so it can only be so fast.
It's worrying that performance is only worse by a factor of 2 or 3 if you do it sequentially
I think Julia multi-threading still needs a lot of work. It seems fast for very large operations, but it still seems to really suck for relatively small arrays
I believe that's because it's dominated by memory bandwidth, not a problem with Julia's threading
Is it possible that the Julia and Python benchmarks work a little differently? I think I basically know what happens when benchmarks run on non-parallel code, but I don't know if there are any weird caveats with multi-threading
Actually, now that I look at the OP carefully, the difference to Numba is actually "only" about 100 μs. This kind of overhead is actually not super crazy if you have many cores at the moment, unfortunately.
(I)Python's timeit disables GC
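A minimal sketch of how one could check whether GC shows up on the Julia side (assuming the same A and sum_squares as above; GC.enable is Base API, and BenchmarkTools already reports GC time in its output):

julia> using BenchmarkTools

julia> @benchmark sum_squares($A)   # the full report includes a "GC (min … max)" line

julia> GC.enable(false);            # crude way to mimic timeit's GC-off behavior

julia> @btime sum_squares($A);

julia> GC.enable(true);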
I would really appreciate it if someone could come along and explain this to me.
The fact that there's not much difference from LoopVectorization's @tturbo makes me think the overhead here is not the problem
LV has super low overhead threading
I agree it's memory-bound, but if it's completely a hardware property, I think you'd see more similar timing in Python.
Or maybe it's just a difference in benchmarking tools as @Expanding Man said
Actually, I'm now puzzled why Numba is fast for that code. Does Numba do map-reduce fusion to avoid allocating np.square(a)?
Doesn't it have all sorts of insane numpy-specific JIT optimizations?
Yeah. I’m sure they’re fusing operations there
I definitely recall one of the problems with Numba being that you can never really know what it's doing, because they had to be so aggressive in optimizing innocent-looking calls that it's basically re-writing all your code for you.
And then you'll have something it can't optimize and it uses regular python code and surprise! your code now takes 100 ms
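For illustration, a rough Julia analogue of that fused-vs-unfused distinction (a sketch, not necessarily what Numba does internally):

julia> sum_squares_alloc(A) = sum(A .^ 2);       # materializes a temporary the size of A before reducing

julia> sum_squares_fused(A) = sum(x -> x^2, A);  # single pass over A, no temporary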
Actually, just enabling SIMD (simd = true) helps a lot here:
julia> A = rand(2048, 2408);
julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A);
julia> @btime sum_squares($A)
1.350 ms (416 allocations: 26.41 KiB)
1.643906959681204e6
julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A; simd = true);
julia> @btime sum_squares($A)
586.626 μs (415 allocations: 26.36 KiB)
1.6439069596812006e6
julia> Threads.nthreads()
4
Well, that basically solves it then, I pretty much guarantee you Numba is doing SIMD
Expanding Man said:
And then you'll have something it can't optimize and it uses regular python code and surprise! your code now takes 100 ms
This is true for any optimizing compiler, including Julia. I think Julia experts don't feel "you can never really know what it's doing" mainly because it has great introspection tools, and maybe also because the language design is more transparent in this respect. It's conceivable that Numba has or will have nice tools for diving into code gen.
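For reference, the kind of introspection tools meant here (standard macros available in the REPL; the kernel below is just a stand-in):

julia> square_sum(A) = sum(x -> x^2, A);

julia> @code_warntype square_sum(rand(4, 4))   # check that type inference succeeded

julia> @code_llvm square_sum(rand(4, 4))       # inspect the generated LLVM IR

julia> @code_native square_sum(rand(4, 4))     # or the native assembly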
It's a lot less extreme in the Julia case though. "Normal" Julia code doesn't run slower than everything else by a factor of 50; "normal" Python code is ungodly slow. I guess I'm thinking more about things I have more experience with here, like trying to use anonymous functions to solve diff eqs and stuff like that, but it seems like the same kind of thing was happening with Numba.
FWIW, this is what I get
julia> using LoopVectorization, ThreadsX
julia> A = rand(2048, 2408);
julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A; simd = true);
julia> function sum_squares_turbo(A::Matrix{T}) where {T}
           out = zero(T)
           @tturbo for i in eachindex(A)
               out += A[i]^2
           end
           out
       end;
julia> @btime sum_squares($A)
314.468 μs (5455 allocations: 346.11 KiB)
1.6439709895118927e6
julia> @btime sum_squares_turbo($A)
167.646 μs (0 allocations: 0 bytes)
1.6439709895118927e6
nthreads?
In terms of being memory bound:
julia> sizeof(A) / (1 << 20)
37.625
A is 37.625 MiB. Thus, my computer has to stream A through RAM. This should be entirely hardware limited.
julia> Threads.nthreads()
36
julia> LoopVectorization.num_cores()
static(18)
lol the benefits of having a huge EPYC machine or whatever it is you're running :laughing:
While we're at it, @Takafumi Arakaki (tkf), any idea why this is so slow? I would have thought this should be almost exactly the same thing as ThreadsX is doing, no?
◖◗ @btime $A |> Map(x -> x^2) |> sum
3.461 ms (0 allocations: 0 bytes)
1.6446086881129586e6
julia> LoopVectorization.num_cores()
static(18)
julia> versioninfo()
Julia Version 1.8.0-DEV.607
Commit 523d851be2* (2021-09-23 22:21 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36
Probably the most relevant feature is that it has 4 memory channels.
IIRC, my RAM is clocked at 3.6 GHz (I should spend some time to see if I can get it higher).
DDR4 can do 8 bytes/channel/clock, and I have 4 channels, so the bandwidth should be
julia> 3.6 * 8 * 4
115.2
I.e., 115.2 GiB/s. So, dividing the number of GiB in A by the GiB/s gives us
julia> (sizeof(A) / (1 << 30)) / (3.6 * 8 * 4) # seconds
0.0003189510769314236
julia> 1e6ans # microseconds
318.95107693142364
So streaming that much memory should take 319 microseconds.
I think that's simply because it's single-threaded and also it doesn't enable SIMD
julia> @btime $A |> Map(x -> x^2) |> sum
5.230 ms (0 allocations: 0 bytes)
1.6439112758022486e6
julia> @btime sum($A |> Map(x -> x^2); simd = true)
3.202 ms (0 allocations: 0 bytes)
1.6439112758022742e6
julia> @btime ThreadsX.sum($A |> Map(x -> x^2), simd = true)
643.555 μs (413 allocations: 26.30 KiB)
1.6439112758022815e6
julia> Threads.nthreads()
4
It's a bit puzzling that the speedup is somewhat superlinear though. Maybe some part of the array sits in L3 (but it's not 30 MB)?
julia> sizeof(A) / (CPUSummary.num_l2cache()*CPUSummary.cache_size(Val(2)) + CPUSummary.num_l3cache()*CPUSummary.cache_size(Val(3)))
0.8801169590643275
So I guess A actually fits in my CPU's cache, hence why I can get much better performance than the 320-microsecond theoretical peak based on memory bandwidth.
I now wonder why @Mason Protter didn't see the difference, even with non-SIMD ThreadsX.mapreduce.
I also see the boost from LoopVectorization:
julia> @btime sum_squares_turbo($A)
163.249 μs (0 allocations: 0 bytes)
1.6439112758022822e6
julia> @btime sum_squares($A)
644.125 μs (415 allocations: 26.36 KiB)
1.6439112758022815e6
julia> versioninfo()
Julia Version 1.8.0-DEV.590
Commit 2c9e051c46 (2021-09-23 21:35 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD EPYC 7502 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)
julia> Threads.nthreads()
4
julia> LoopVectorization.num_cores()
static(64)
shell> nproc
128
Maybe LoopVectorization is making some bad optimizations for my system
LoopVectorization really doesn't like znver1...
If you're willing to debug, maybe we can find out why.
But: tkf's EPYC machine and my Intel machine both have enough cache to hold A without touching RAM.
So it could also be that this is the issue. (i.e., we see a huge speedup only because it fits in cache, and otherwise we wouldn't)
Ah, yes, there was a race condition between my question and chriselrod's answer
julia> sizeof(A) / (CPUSummary.num_l2cache()*CPUSummary.cache_size(Val(2)) + CPUSummary.num_l3cache()*CPUSummary.cache_size(Val(3)))
0.1306423611111111
In terms of flops, 58 GFLOPS wouldn't be a bad number to hit with a single core:
julia> 2e-9length(A) / 167.646e-6
58.83330350858356
(but it is pretty bad for 18 or 64 cores)
julia> B = rand(2048*2, 2408*2);
julia> sizeof(B) / (CPUSummary.num_l2cache()*CPUSummary.cache_size(Val(2)) + CPUSummary.num_l3cache()*CPUSummary.cache_size(Val(3)))
3.52046783625731
julia> @btime sum_squares_turbo($B)
1.657 ms (0 allocations: 0 bytes)
6.575621490774e6
julia> @btime sum_squares($B)
1.905 ms (5478 allocations: 346.83 KiB)
6.575621490774e6
julia> 2e-9length(B) / 1.676e-3
23.539780429594273
Making B 4x larger than A, so that it is now too large for my cache, makes it take 10x longer.
I get half the GFLOPS from LV and performance is closer.
Most likely explanation: tkf's machine and mine are fast enough that overhead makes a huge difference.
LV only takes less than 170 microseconds for each of us.
julia> @benchmark fetch(Threads.@spawn 1+1)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
Range (min … max): 2.022 μs … 68.784 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 59.636 μs ┊ GC (median): 0.00%
Time (mean ± σ): 55.873 μs ± 13.982 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▁▁ ▁▂▁▂▁▁ ▁ ▄▆████▇▇▅▄ ▂
███▇██████████▇▇▇▆▅▅▅▆▄▆▄▄▄▅▂▄▅▅▆▄▅▅▄▅▄▄▅▅▅▅▅▄▄▄▇██████████ █
2.02 μs Histogram: log(frequency) by time 65.8 μs <
Memory estimate: 473 bytes, allocs estimate: 4.
Hence, the overhead of base threading makes a huge difference for us, but not for slower (fewer core) machines.
Oh wait, if I use Transducers' Map it's not multi-threaded by default? I guess I need to read the docs? :sweat_smile:
Tacking on 150 microseconds of overhead makes a huge difference when LV takes 170 microseconds (for A), but much less difference when LV takes 1700 microseconds (for B).
I think LV handles scaling with many threads better than base's threading, hence @Takafumi Arakaki (tkf) gets 400+ microseconds of overhead.
With only 12 threads, maybe @Mason Protter 's threading overhead is around 50 microseconds, while LV's baseline time is much slower, combining to make only a small proportional difference.
@Expanding Man Yeah, you're right. You'd need to combine it with ThreadsX.jl or Folds.jl (or use foldxt explicitly).
Long answer: Map does not care how it's executed, so it all depends on how you reduce it. sum (or rather its internal foldl) is sequential by default. You'd need to opt into parallelism, since I can't analyze Map(f) to see if it is safe to do. You opt in by choosing the parallel variants of the "last function you apply", like sum, collect, Set, etc. I guess it's imaginable to make it parallel by default, but Transducers.jl started as a sequential computing package.
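Concretely, the opt-in might look like this (a sketch; ThreadsX.sum and foldxt appear elsewhere in this thread, and Folds.sum is assumed to follow Folds.jl's Base-like signature):

julia> using Transducers, ThreadsX, Folds

julia> A |> Map(x -> x^2) |> sum;           # sequential: sum falls back to a sequential foldl

julia> ThreadsX.sum(A |> Map(x -> x^2));    # parallel, by picking a parallel "last function"

julia> Folds.sum(x -> x^2, A);              # same idea via Folds.jl

julia> foldxt(+, A |> Map(x -> x^2));       # or an explicit parallel fold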
That's really cool. I was recently discussing on Slack how I love transducers, but I tend to avoid them because it's such basic functionality (not necessarily a basic implementation) that I expect it from Base, so I hesitate to add it as a dependency. I should stop doing this.
Well, I'd love to have tailwind for shoving transducers into Base :laughing:
@chriselrod Hmm... Isn't it (also?) simply because LV generates better code than the Julia compiler? I see improvements with one thread and small array:
julia> function sum_squares_turbo(A::Array{T}) where {T}
           out = zero(T)
           @tturbo for i in eachindex(A)
               out += A[i]^2
           end
           out
       end;
julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A; simd = true, basesize = typemax(Int));
julia> sum_squares_foldl(A) = foldxl(+, A |> Map(x->x^2); simd = true);
julia> xs = rand(2048);
julia> @btime sum_squares_turbo($xs);
104.054 ns (0 allocations: 0 bytes)
julia> @btime sum_squares($xs); # base case + some surrounding overhead
275.578 ns (2 allocations: 32 bytes)
julia> @btime sum_squares_foldl($xs); # base case, without overhead
199.763 ns (0 allocations: 0 bytes)
Takafumi Arakaki (tkf) said:
Well, I'd love to have tailwind for shoving transducers into Base :laughing:
Transducers in Base would be super awesome. In particular, my complaint in my aforementioned discussion on Slack was about how Base.Iterators is really weird because it's very basic functionality that's shoved into another module. A really cool thing about the idea of transducers in Base (though perhaps this is a frivolous reason to want it) is that it can naturally fit without interfering with the existing namespace.
Takafumi Arakaki (tkf) said:
chriselrod Hmm... Isn't it (also?) simply because LV generates better code than the Julia compiler? I see improvements with one thread and small array:
I figured the arrays being memory bound would mean this doesn't matter.
You'll notice the difference go away if you make the arrays much larger.
If you're curious, LV is unrolling by 8 while LLVM unrolls by 4. If you have up to 2 loads per cycle and 2 fmas per cycle, the CPU ideally wouldn't be bottlenecked by memory if it is all in the L1 cache. With a latency of 4 cycles per fma and 2 fma issued/cycle, you'd need 8 in flight at a time without dependency chain conflicts. Hence LV will unroll it by 8.
For a dot product, on the other hand, which requires 2 loads per fma, it'll only unroll by 4 because the loads limit it to at best 1 fma/cycle.
The theoretical peak flops of a single core of my CPU is 124.8 GFLOPS. The multithreaded benchmark was making less than half that using 18 cores, meaning the cores themselves were probably idle most of the time rather than hitting the limit on fmas issued/cycle.
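For what it's worth, the 124.8 GFLOPS figure is consistent with back-of-the-envelope arithmetic like the following (the ~3.9 GHz AVX-512 clock is an assumption, not something stated here):

julia> 3.9 * 2 * 8 * 2   # GHz * FMA units * Float64 lanes per 512-bit vector * flops per FMA
124.8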
Ah, you are right. There's still some difference if it fits in L2 but if it hits L3 there's no difference
julia> xs = rand(2^15);
julia> sizeof(xs) / CPUSummary.cache_size(Val(1))
8.0
julia> xs = rand(2^15);
julia> @btime sum_squares_turbo($xs);
2.463 μs (0 allocations: 0 bytes)
julia> @btime sum_squares_foldl($xs);
3.081 μs (0 allocations: 0 bytes)
julia> xs = rand(2^19);
julia> sizeof(xs) / CPUSummary.cache_size(Val(2))
8.0
julia> @btime sum_squares_turbo($xs);
57.410 μs (0 allocations: 0 bytes)
julia> @btime sum_squares_foldl($xs);
57.509 μs (0 allocations: 0 bytes)
Thanks for sorting this out guys. I learned a lot from reading over the conversation.
In case it is helpful to you, here is what Numba is doing in terms of LLVM and asm:
sum_squares_numba_llvm.txt
sum_squares_numba_x86_64_asm.txt
Mason Protter said:
Mark Kittisopikul, how big is the array in Python? I noticed you wrote rand(2048, 2408) in the Julia code
@Mark Kittisopikul can you confirm that the array you benchmarked in Numba is the same size as the array in Julia?
Yes. Same size.
Okay. The 2408 seemed like it was a typo
That’s quite strange
Oh, hmm, maybe
Looking over the history, I think you were absolutely correct, @Mason Protter. I screwed up the inputs between the two languages at the end of my testing. Originally, my test suite was driven entirely from pyjulia, using Main.eval from Python to conduct my tests, but I simplified to a pure Julia example for this post, and that is when I made the typo.
So is there a performance difference between the Julia and Numba versions?
In short, no. The longer version is that it depends on how one measures it.