Stream: helpdesk (published)

Topic: Sum of Squares, Numba vs Julia


view this post on Zulip Mark Kittisopikul (Sep 24 2021 at 21:39):

I'm trying to beat Numba and failing:

In [55]: @numba.jit(nopython=True, fastmath=False, parallel=True)
    ...: def sum_squares(a):
    ...:     return np.sum(np.square(a))
    ...:

In [58]: %timeit sum_squares(A)
874 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
julia> A = rand(2048, 2408);

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A)
sum_squares (generic function with 1 method)

julia> @btime sum_squares($A)
  982.900 μs (2328 allocations: 159.38 KiB)
1.6443015832629544e6

view this post on Zulip Expanding Man (Sep 24 2021 at 21:55):

I'm getting about the same. Also tried A |> Map(x -> x^2) |> sum with Transducers and got slightly worse. I often find the performance of threading in Julia frustratingly fiddly, especially for the additive cost. Not sure how to fix this. @Takafumi Arakaki (tkf) might have some insight.

view this post on Zulip Takafumi Arakaki (tkf) (Sep 24 2021 at 23:57):

My insight is that it's best to try conjuring @chriselrod for this kind of close-to-the-metal thing :)

view this post on Zulip Expanding Man (Sep 24 2021 at 23:59):

I'm curious how it scales. In my experience the additive cost with the threading libraries usually sucks, but they'll scale pretty well

view this post on Zulip Mason Protter (Sep 25 2021 at 00:02):

@Mark Kittisopikul , how big is the array in Python? I noticed you wrote rand(2048, 2408) in the julia code

view this post on Zulip Expanding Man (Sep 25 2021 at 00:03):

I tried, that, it's still really slow regardless, it's only like 5/4 or so

view this post on Zulip Mason Protter (Sep 25 2021 at 00:03):

Here is what I see with LoopVectorization.jl

julia> using LoopVectorization, ThreadsX

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A);

julia> function sum_squares_turbo(A::Matrix{T}) where {T}
           out = zero(T)
           @tturbo for i in eachindex(A)
               out += A[i]^2
           end
           out
       end;

julia> let A = rand(2048, 2048)
           @btime sum_squares($A)
           @btime sum_squares_turbo($A)
       end;
  804.960 μs (669 allocations: 42.56 KiB)
  747.194 μs (0 allocations: 0 bytes)

julia> let A = rand(2048, 2408)
           @btime sum_squares($A)
           @btime sum_squares_turbo($A)
       end;
  966.097 μs (672 allocations: 42.66 KiB)
  913.550 μs (0 allocations: 0 bytes)

view this post on Zulip Mason Protter (Sep 25 2021 at 00:05):

Theres some profit from using good vecotorization here and lower overhead in the multithreading, but this operation is almost totally dominated by memory bandwidth I believe, so it can only be so fast.

view this post on Zulip Expanding Man (Sep 25 2021 at 00:06):

It's worrying that performance is only worse by a factor of 2 or 3 if you do it sequentially

view this post on Zulip Expanding Man (Sep 25 2021 at 00:06):

I think Julia multi-threading still needs a lot of work. It seems fast for very large oprations, but it still seems to really suck for relatively small arrays

view this post on Zulip Mason Protter (Sep 25 2021 at 00:06):

I beleive that's because it's dominated by memory bandwidth, not a problem with julia's threading

view this post on Zulip Expanding Man (Sep 25 2021 at 00:07):

Is it possible that the Julia and python benchmarks work a little differently? I think I basically know what happens when bnechmarks run on non-parallel code, but I don't know if there are any weird caveats with multi-threading

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 00:09):

Actually, now that I look at the OP carefully, the difference to Numba is actually "only" about 100 μs. This kind of overhead is actually not super crazy if you have many cores at the moment, unfortunately.

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 00:10):

(I)Python's timeit disables GC

view this post on Zulip Expanding Man (Sep 25 2021 at 00:12):

I would really appreciate if someone could come along and explain to me:

view this post on Zulip Mason Protter (Sep 25 2021 at 00:14):

The fact that there's not much difference from LoopVectorization's @tturbo makes me think the overhead here is not the problem

view this post on Zulip Mason Protter (Sep 25 2021 at 00:14):

LV has super low overhead threading

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 00:17):

I agree it's memory-bound, but if it's completely a hardware property, I think you'd see more similar timing in Python.

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 00:18):

Or maybe it's just a difference in benchmarking tools as @Expanding Man said

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 00:19):

Actually, I'm now puzzled why Numba is fast for that code. Does Numba do map-reduce fusion to avoid allocating np.square(a)?

view this post on Zulip Expanding Man (Sep 25 2021 at 00:19):

Doesn't it have all sorts of insane numpy-specific JIT optimizations?

view this post on Zulip Mason Protter (Sep 25 2021 at 00:20):

Yeah. I’m sure they’re fusing operations there

view this post on Zulip Expanding Man (Sep 25 2021 at 00:21):

I definitely recall one of the problems with numba being that you can never really know what it's doing, because they had to be so aggressive in optimizing innocent looking calls that it's basically re-writing all your code for you.

view this post on Zulip Expanding Man (Sep 25 2021 at 00:22):

And then you'll have something it can't optimize and it uses regular python code and surprise! your code now takes 100 ms

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 00:22):

Actually, just enabling SIMD (simd = true) helps a lot here:

julia> A = rand(2048, 2408);

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A);

julia> @btime sum_squares($A)
  1.350 ms (416 allocations: 26.41 KiB)
1.643906959681204e6

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A; simd = true);

julia> @btime sum_squares($A)
  586.626 μs (415 allocations: 26.36 KiB)
1.6439069596812006e6

julia> Threads.nthreads()
4

view this post on Zulip Expanding Man (Sep 25 2021 at 00:24):

Well, that basically solves it then, I pretty much guarantee you numba is doing simd

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 00:31):

Expanding Man said:

And then you'll have something it can't optimize and it uses regular python code and surprise! your code now takes 100 ms

This is true for any optimizing compiler including Julia. I think Julia experts don't feel "you can never really know what it's doing" mainly since it has great introspection tools and maybe also the language design is more transparent to this. It's conceivable that Numba has or will have nice tools for diving into code gen.

view this post on Zulip Expanding Man (Sep 25 2021 at 00:34):

It's a lot less extreme in the Julia case though. "Normal" Julia code doesn't run slower than everything else by a factor of 50. "Normal" python code is ungodly slow. I guess I'm thinking more about things that I have more experience here like trying to use anonymous functions to solve diff eq and stuff like that, but seems like the same kind of thing was happening with numba.

view this post on Zulip chriselrod (Sep 25 2021 at 00:37):

FWIW, this is what I get

julia> using LoopVectorization, ThreadsX

julia> A = rand(2048, 2408);

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A; simd = true);

julia> function sum_squares_turbo(A::Matrix{T}) where {T}
                  out = zero(T)
                  @tturbo for i in eachindex(A)
                      out += A[i]^2
                  end
                  out
              end;

julia> @btime sum_squares($A)
  314.468 μs (5455 allocations: 346.11 KiB)
1.6439709895118927e6

julia> @btime sum_squares_turbo($A)
  167.646 μs (0 allocations: 0 bytes)
1.6439709895118927e6

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 00:38):

nthreads?

view this post on Zulip chriselrod (Sep 25 2021 at 00:38):

In terms of memory bound:

julia> sizeof(A) / (1 << 20)
37.625

A is 37.625 MiB. Thus, my computer has to stream A through RAM. This should be entirely hardware limited.

view this post on Zulip chriselrod (Sep 25 2021 at 00:38):

julia> Threads.nthreads()
36

julia> LoopVectorization.num_cores()
static(18)

view this post on Zulip Expanding Man (Sep 25 2021 at 00:39):

lol the benefits of having a huge EPYC machine or whatever it is you're running :laughing:

view this post on Zulip Expanding Man (Sep 25 2021 at 00:46):

While we're at it, @Takafumi Arakaki (tkf) , any idea why this is so slow? I would have thought this should be almost exactly the same thing as ThreadsX is doing, no?

◖◗ @btime $A |> Map(x -> x^2) |> sum
  3.461 ms (0 allocations: 0 bytes)
1.6446086881129586e6

view this post on Zulip chriselrod (Sep 25 2021 at 00:53):

julia> LoopVectorization.num_cores()
static(18)

julia> versioninfo()
Julia Version 1.8.0-DEV.607
Commit 523d851be2* (2021-09-23 22:21 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

Probably the most relevant feature is that it has 4 memory channels.

IIRC, my RAM is clocked at 3.6 GHz (I should spend some time to see if I can get it higher).
DDR4 can do 8 bytes/channel/clock, and I have 4 channels, so the bandwidth should be

julia> 3.6 * 8 * 4
115.2

I.e., 115.2 GiB/s. So, dividing the number of GiB in A by the GiB gives us

julia> (sizeof(A) / (1 << 30)) / (3.6 * 8 * 4) # seconds
0.0003189510769314236

julia> 1e6ans # microseconds
318.95107693142364

So streaming that much memory should take 319 microseconds.

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 00:54):

I think that's simply because it's single-threaded and also it doesn't enable SIMD

julia> @btime $A |> Map(x -> x^2) |> sum
  5.230 ms (0 allocations: 0 bytes)
1.6439112758022486e6

julia> @btime sum($A |> Map(x -> x^2); simd = true)
  3.202 ms (0 allocations: 0 bytes)
1.6439112758022742e6

julia> @btime ThreadsX.sum($A |> Map(x -> x^2), simd = true)
  643.555 μs (413 allocations: 26.30 KiB)
1.6439112758022815e6

julia> Threads.nthreads()
4

It's a bit puzzling that the speedup is somewhat superlinear though. Maybe some part of the array sits in L3 (but it's not 30 MB )?

view this post on Zulip chriselrod (Sep 25 2021 at 00:55):

julia> sizeof(A) / (CPUSummary.num_l2cache()*CPUSummary.cache_size(Val(2)) + CPUSummary.num_l3cache()*CPUSummary.cache_size(Val(3)))
0.8801169590643275

So I guess A actually fits in my CPU's cache, hence why I can get much better performance than the 320 microseconds theoretical peak based on memory bandwidth.

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 00:55):

I now wonder why @Mason Protter didn't see the difference, even with non-SIMD ThreadsX.mapreduce.

I also see the boost from LoopVectorization:

julia> @btime sum_squares_turbo($A)
  163.249 μs (0 allocations: 0 bytes)
1.6439112758022822e6

julia> @btime sum_squares($A)
  644.125 μs (415 allocations: 26.36 KiB)
1.6439112758022815e6

julia> versioninfo()
Julia Version 1.8.0-DEV.590
Commit 2c9e051c46 (2021-09-23 21:35 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD EPYC 7502 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)

julia> Threads.nthreads()
4

julia> LoopVectorization.num_cores()
static(64)

shell> nproc
128

view this post on Zulip Mason Protter (Sep 25 2021 at 00:58):

Maybe LoopVectorization is making some bad optimizations for my system

view this post on Zulip chriselrod (Sep 25 2021 at 00:59):

LoopVectorization really doesn't like znver1...

view this post on Zulip chriselrod (Sep 25 2021 at 01:00):

If you're willing to debug, maybe we can find out why.
But, --

view this post on Zulip chriselrod (Sep 25 2021 at 01:00):

tkf's EPYC machine and my Intel machine both have enough cache to hold A without touching RAM.

view this post on Zulip chriselrod (Sep 25 2021 at 01:00):

So it could also be that this is the issue. (i.e., we see a huge speedup only because it fits in cache, and otherwise we wouldn't)

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 01:01):

Ah, yes, there was a race condition in my question and chriselrod's answer

julia> sizeof(A) / (CPUSummary.num_l2cache()*CPUSummary.cache_size(Val(2)) + CPUSummary.num_l3cache()*CPUSummary.cache_size(Val(3)))
0.1306423611111111

view this post on Zulip chriselrod (Sep 25 2021 at 01:01):

In terms of flops, 58 GFLOPS wouldn't be a bad number to hit with a single core:

julia> 2e-9length(A) / 167.646e-6
58.83330350858356

(but it is pretty bad for 18 or 64 cores)

view this post on Zulip chriselrod (Sep 25 2021 at 01:04):

julia> B = rand(2048*2, 2408*2);

julia> sizeof(B) / (CPUSummary.num_l2cache()*CPUSummary.cache_size(Val(2)) + CPUSummary.num_l3cache()*CPUSummary.cache_size(Val(3)))
3.52046783625731

julia> @btime sum_squares_turbo($B)
  1.657 ms (0 allocations: 0 bytes)
6.575621490774e6

julia> @btime sum_squares($B)
  1.905 ms (5478 allocations: 346.83 KiB)
6.575621490774e6

julia> 2e-9length(B) / 1.676e-3
23.539780429594273

Making B 4x larger than A so that it is now too large for my cache makes it take 10x longer.
I get half the GFLOPS from LV and performance is closer.

Most likely explanation: tkf and my machine are fast enough that overhead makes a huge difference.
LV only takes less than 170 microseconds for each of us.

julia> @benchmark fetch(Threads.@spawn 1+1)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   2.022 μs … 68.784 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     59.636 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   55.873 μs ± 13.982 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▁▁ ▁▂▁▂▁▁ ▁                                     ▄▆████▇▇▅▄ ▂
  ███▇██████████▇▇▇▆▅▅▅▆▄▆▄▄▄▅▂▄▅▅▆▄▅▅▄▅▄▄▅▅▅▅▅▄▄▄▇██████████ █
  2.02 μs      Histogram: log(frequency) by time      65.8 μs <

 Memory estimate: 473 bytes, allocs estimate: 4.

Hence, the overhead of base threading makes a huge difference for us, but not for slower (fewer core) machines.

view this post on Zulip Expanding Man (Sep 25 2021 at 01:08):

Oh wait, if I use transducers Map it's not multi-threaded by default? I guess I need to read the docs? :sweat_smile:

view this post on Zulip chriselrod (Sep 25 2021 at 01:10):

Tacking on 150 microseconds of overhead makes a huge difference when LV takes 170 microseconds (for A), but much less difference when LV takes 1700 microseconds (for B).

I think LV handles scaling with many threads better than base's threading, hence @Takafumi Arakaki (tkf) gets 400+ microseconds of overhead.

With only 12 threads, maybe @Mason Protter 's threading overhead is around 50 microseconds, while LV's baseline time is much slower, combining to make only a small proportional difference.

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 01:19):

@Expanding Man Yeah, your right. You'd need to combine it with ThreadsX.jl or Folds.jl (or use foldxt explicitly).

Long answer: Map does not care how it's executed and so it all depends on how you reduce it. sum (or rather it's internal foldl) is by default sequential. You'd need to opt-in parallelism since I can't analyze `Map(f) to see if it is safe to do it. You can opt-in parallelism it by choosing the variants of the "last function you apply" like sum, collect, Set, etc. I guess it's imaginable to make it parallel by default, but Transducers.jl started as a sequential computing package.

view this post on Zulip Expanding Man (Sep 25 2021 at 01:21):

That's really cool. I was recently discussing on slack how I love transducers but I tend to avoid it because it's such basic functionality (not necessarily basic implementation) that I expect it from base so I hesitate to add it as a dependency. I should stop doing this.

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 01:34):

Well, I'd love to have tailwind for shoving transducers into Base :laughing:

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 01:39):

@chriselrod Hmm... Isn't it (also?) simply because LV generates better code than the Julia compiler? I see improvements with one thread and small array:

julia> function sum_squares_turbo(A::Array{T}) where {T}
           out = zero(T)
           @tturbo for i in eachindex(A)
               out += A[i]^2
           end
           out
       end;

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A; simd = true, basesize = typemax(Int));

julia> sum_squares_foldl(A) = foldxl(+, A |> Map(x->x^2); simd = true);

julia> xs = rand(2048);

julia> @btime sum_squares_turbo($xs);
  104.054 ns (0 allocations: 0 bytes)

julia> @btime sum_squares($xs);  # base case + some surrounding overhead
  275.578 ns (2 allocations: 32 bytes)

julia> @btime sum_squares_foldl($xs);  # base case, without overhead
  199.763 ns (0 allocations: 0 bytes)

view this post on Zulip Expanding Man (Sep 25 2021 at 01:42):

Takafumi Arakaki (tkf) said:

Well, I'd love to have tailwind for shoving transducers into Base :laughing:

Transducers in Base would be super awesome. In particular my complaints in my aforementioned discussion on slack was about how Base.Iterators is really weird because it's very basic functionality that's shoved into another module. A really cool thing about the idea of transducers in base (though perhaps this is a frivolous reason to want it) is that it can naturally fit without interfering with the existing namespace.

view this post on Zulip chriselrod (Sep 25 2021 at 02:00):

Takafumi Arakaki (tkf) said:

chriselrod Hmm... Isn't it (also?) simply because LV generates better code than the Julia compiler? I see improvements with one thread and small array:

I figured the arrays being memory bound would mean this doesn't matter.
You'll notice the difference go away if you make the arrays much larger.

If you're curious, LV is unrolling by 8 while LLVM unrolls by 4. If you have up to 2 loads per cycle and 2 fmas per cycle, the CPU ideally wouldn't be bottlenecked by memory if it is all in the L1 cache. With a latency of 4 cycles per fma and 2 fma issued/cycle, you'd need 8 in flight at a time without dependency chain conflicts. Hence LV will unroll it by 8.
A dot product on the other hand, which requires 2 loads per fma, it'll only unroll by 4 because the loads limit it to at best 1 fma/cycle.

view this post on Zulip chriselrod (Sep 25 2021 at 02:03):

The theoretical peak flops of a single core of my CPU is 124.8 GFLOPS. The multithreaded benchmark was making less than half that using 18 cores, meaning the cores themselves were probably idle most of the time rather than hitting the limit on fmas issued/cycle.

view this post on Zulip Takafumi Arakaki (tkf) (Sep 25 2021 at 02:11):

Ah, you are right. There's still some difference if it fits in L2 but if it hits L3 there's no difference

julia> xs = rand(2^15);

julia> sizeof(xs) / CPUSummary.cache_size(Val(1))
8.0

julia> xs = rand(2^15);

julia> @btime sum_squares_turbo($xs);
  2.463 μs (0 allocations: 0 bytes)

julia> @btime sum_squares_foldl($xs);
  3.081 μs (0 allocations: 0 bytes)

julia> xs = rand(2^19);

julia> sizeof(xs) / CPUSummary.cache_size(Val(2))
8.0

julia> @btime sum_squares_turbo($xs);
  57.410 μs (0 allocations: 0 bytes)

julia> @btime sum_squares_foldl($xs);
  57.509 μs (0 allocations: 0 bytes)

view this post on Zulip Mark Kittisopikul (Sep 25 2021 at 06:07):

Thanks for sorting this out guys. I learned a lot from reading over the conversation.

view this post on Zulip Mark Kittisopikul (Sep 25 2021 at 06:20):

In case it is help to you, here's is what Numba is doing in terms of LLVM and asm:
sum_squares_numba_llvm.txt
sum_squares_numba_x86_64_asm.txt

view this post on Zulip Mason Protter (Sep 25 2021 at 06:26):

Mason Protter said:

Mark Kittisopikul , how big is the array in Python? I noticed you wrote rand(2048, 2408) in the julia code

@Mark Kittisopikul can you confirm that the array you benchmarked in Numba is the same size as the array in Julia?

view this post on Zulip Mark Kittisopikul (Sep 25 2021 at 06:26):

Yes. Same size.

view this post on Zulip Mason Protter (Sep 25 2021 at 06:27):

Okay. The 2408 seemed like it was a typo

view this post on Zulip Mason Protter (Sep 25 2021 at 06:27):

That’s quite strange

view this post on Zulip Mark Kittisopikul (Sep 25 2021 at 06:28):

Oh, hmm, maybe

view this post on Zulip Mark Kittisopikul (Sep 25 2021 at 21:19):

Looking over the history, I think you were absolutely correct, @Mason Protter . I screwed up the inputs between the two languages in the end of my testing. Originally, my test suite was completely starting from pyjulia and using Main.eval from Python to conduct my tests, but I simplified to a pure Julia example for this post, and that is when I made the typo.

view this post on Zulip Mason Protter (Sep 25 2021 at 21:20):

So is there a performance difference between the Julia and Numba versions?

view this post on Zulip Mark Kittisopikul (Sep 25 2021 at 21:52):

In short, no. The longer version is that it depends how one measures it.


Last updated: Oct 23 2025 at 04:41 UTC