Sum of Squares, Numba vs Julia · helpdesk (published)

In [55]: @numba.jit(nopython=True, fastmath=False, parallel=True)
    ...: def sum_squares(a):
    ...:     return np.sum(np.square(a))
    ...:

In [58]: %timeit sum_squares(A)
874 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

julia> A = rand(2048, 2408);

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A)
sum_squares (generic function with 1 method)

julia> @btime sum_squares($A)
  982.900 μs (2328 allocations: 159.38 KiB)
1.6443015832629544e6

Expanding Man (Sep 24 2021 at 21:55):

I'm getting about the same. Also tried A |> Map(x -> x^2) |> sum with Transducers and got slightly worse. I often find the performance of threading in Julia frustratingly fiddly, especially for the additive cost. Not sure how to fix this. @Takafumi Arakaki (tkf) might have some insight.

Takafumi Arakaki (tkf) (Sep 24 2021 at 23:57):

My insight is that it's best to try conjuring @chriselrod for this kind of close-to-the-metal thing :)

Expanding Man (Sep 24 2021 at 23:59):

I'm curious how it scales. In my experience the additive cost with the threading libraries usually sucks, but they'll scale pretty well

Mason Protter (Sep 25 2021 at 00:02):

@Mark Kittisopikul , how big is the array in Python? I noticed you wrote rand(2048, 2408) in the julia code

Expanding Man (Sep 25 2021 at 00:03):

Mason Protter (Sep 25 2021 at 00:03):

julia> using LoopVectorization, ThreadsX

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A);

julia> function sum_squares_turbo(A::Matrix{T}) where {T}
           out = zero(T)
           @tturbo for i in eachindex(A)
               out += A[i]^2
           end
           out
       end;

julia> let A = rand(2048, 2048)
           @btime sum_squares($A)
           @btime sum_squares_turbo($A)
       end;
  804.960 μs (669 allocations: 42.56 KiB)
  747.194 μs (0 allocations: 0 bytes)

julia> let A = rand(2048, 2408)
           @btime sum_squares($A)
           @btime sum_squares_turbo($A)
       end;
  966.097 μs (672 allocations: 42.66 KiB)
  913.550 μs (0 allocations: 0 bytes)

Mason Protter (Sep 25 2021 at 00:05):

Theres some profit from using good vecotorization here and lower overhead in the multithreading, but this operation is almost totally dominated by memory bandwidth I believe, so it can only be so fast.

Expanding Man (Sep 25 2021 at 00:06):

It's worrying that performance is only worse by a factor of 2 or 3 if you do it sequentially

Expanding Man (Sep 25 2021 at 00:06):

I think Julia multi-threading still needs a lot of work. It seems fast for very large oprations, but it still seems to really suck for relatively small arrays

Mason Protter (Sep 25 2021 at 00:06):

I beleive that's because it's dominated by memory bandwidth, not a problem with julia's threading

Expanding Man (Sep 25 2021 at 00:07):

Is it possible that the Julia and python benchmarks work a little differently? I think I basically know what happens when bnechmarks run on non-parallel code, but I don't know if there are any weird caveats with multi-threading

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:09):

Actually, now that I look at the OP carefully, the difference to Numba is actually "only" about 100 μs. This kind of overhead is actually not super crazy if you have many cores at the moment, unfortunately.

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:10):

Expanding Man (Sep 25 2021 at 00:12):

Mason Protter (Sep 25 2021 at 00:14):

The fact that there's not much difference from LoopVectorization's @tturbo makes me think the overhead here is not the problem

Mason Protter (Sep 25 2021 at 00:14):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:17):

I agree it's memory-bound, but if it's completely a hardware property, I think you'd see more similar timing in Python.

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:18):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:19):

Actually, I'm now puzzled why Numba is fast for that code. Does Numba do map-reduce fusion to avoid allocating np.square(a)?

Expanding Man (Sep 25 2021 at 00:19):

Mason Protter (Sep 25 2021 at 00:20):

Expanding Man (Sep 25 2021 at 00:21):

I definitely recall one of the problems with numba being that you can never really know what it's doing, because they had to be so aggressive in optimizing innocent looking calls that it's basically re-writing all your code for you.

Expanding Man (Sep 25 2021 at 00:22):

And then you'll have something it can't optimize and it uses regular python code and surprise! your code now takes 100 ms

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:22):

julia> A = rand(2048, 2408);

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A);

julia> @btime sum_squares($A)
  1.350 ms (416 allocations: 26.41 KiB)
1.643906959681204e6

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A; simd = true);

julia> @btime sum_squares($A)
  586.626 μs (415 allocations: 26.36 KiB)
1.6439069596812006e6

julia> Threads.nthreads()
4

Expanding Man (Sep 25 2021 at 00:24):

Well, that basically solves it then, I pretty much guarantee you numba is doing simd

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:31):

This is true for any optimizing compiler including Julia. I think Julia experts don't feel "you can never really know what it's doing" mainly since it has great introspection tools and maybe also the language design is more transparent to this. It's conceivable that Numba has or will have nice tools for diving into code gen.

Expanding Man (Sep 25 2021 at 00:34):

It's a lot less extreme in the Julia case though. "Normal" Julia code doesn't run slower than everything else by a factor of 50. "Normal" python code is ungodly slow. I guess I'm thinking more about things that I have more experience here like trying to use anonymous functions to solve diff eq and stuff like that, but seems like the same kind of thing was happening with numba.

chriselrod (Sep 25 2021 at 00:37):

julia> using LoopVectorization, ThreadsX

julia> A = rand(2048, 2408);

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A; simd = true);

julia> function sum_squares_turbo(A::Matrix{T}) where {T}
                  out = zero(T)
                  @tturbo for i in eachindex(A)
                      out += A[i]^2
                  end
                  out
              end;

julia> @btime sum_squares($A)
  314.468 μs (5455 allocations: 346.11 KiB)
1.6439709895118927e6

julia> @btime sum_squares_turbo($A)
  167.646 μs (0 allocations: 0 bytes)
1.6439709895118927e6

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:38):

chriselrod (Sep 25 2021 at 00:38):

julia> sizeof(A) / (1 << 20)
37.625

A is 37.625 MiB. Thus, my computer has to stream A through RAM. This should be entirely hardware limited.

chriselrod (Sep 25 2021 at 00:38):

julia> Threads.nthreads()
36

julia> LoopVectorization.num_cores()
static(18)

Expanding Man (Sep 25 2021 at 00:39):

lol the benefits of having a huge EPYC machine or whatever it is you're running :laughing:

Expanding Man (Sep 25 2021 at 00:46):

While we're at it, @Takafumi Arakaki (tkf) , any idea why this is so slow? I would have thought this should be almost exactly the same thing as ThreadsX is doing, no?

◖◗ @btime $A |> Map(x -> x^2) |> sum
  3.461 ms (0 allocations: 0 bytes)
1.6446086881129586e6

chriselrod (Sep 25 2021 at 00:53):

julia> LoopVectorization.num_cores()
static(18)

julia> versioninfo()
Julia Version 1.8.0-DEV.607
Commit 523d851be2* (2021-09-23 22:21 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

IIRC, my RAM is clocked at 3.6 GHz (I should spend some time to see if I can get it higher).
DDR4 can do 8 bytes/channel/clock, and I have 4 channels, so the bandwidth should be

julia> 3.6 * 8 * 4
115.2

I.e., 115.2 GiB/s. So, dividing the number of GiB in A by the GiB gives us

julia> (sizeof(A) / (1 << 30)) / (3.6 * 8 * 4) # seconds
0.0003189510769314236

julia> 1e6ans # microseconds
318.95107693142364

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:54):

I think that's simply because it's single-threaded and also it doesn't enable SIMD

julia> @btime $A |> Map(x -> x^2) |> sum
  5.230 ms (0 allocations: 0 bytes)
1.6439112758022486e6

julia> @btime sum($A |> Map(x -> x^2); simd = true)
  3.202 ms (0 allocations: 0 bytes)
1.6439112758022742e6

julia> @btime ThreadsX.sum($A |> Map(x -> x^2), simd = true)
  643.555 μs (413 allocations: 26.30 KiB)
1.6439112758022815e6

julia> Threads.nthreads()
4

It's a bit puzzling that the speedup is somewhat superlinear though. Maybe some part of the array sits in L3 (but it's not 30 MB )?

chriselrod (Sep 25 2021 at 00:55):

julia> sizeof(A) / (CPUSummary.num_l2cache()*CPUSummary.cache_size(Val(2)) + CPUSummary.num_l3cache()*CPUSummary.cache_size(Val(3)))
0.8801169590643275

So I guess A actually fits in my CPU's cache, hence why I can get much better performance than the 320 microseconds theoretical peak based on memory bandwidth.

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:55):

I now wonder why @Mason Protter didn't see the difference, even with non-SIMD ThreadsX.mapreduce.

julia> @btime sum_squares_turbo($A)
  163.249 μs (0 allocations: 0 bytes)
1.6439112758022822e6

julia> @btime sum_squares($A)
  644.125 μs (415 allocations: 26.36 KiB)
1.6439112758022815e6

julia> versioninfo()
Julia Version 1.8.0-DEV.590
Commit 2c9e051c46 (2021-09-23 21:35 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD EPYC 7502 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)

julia> Threads.nthreads()
4

julia> LoopVectorization.num_cores()
static(64)

shell> nproc
128

Mason Protter (Sep 25 2021 at 00:58):

chriselrod (Sep 25 2021 at 00:59):

chriselrod (Sep 25 2021 at 01:00):

tkf's EPYC machine and my Intel machine both have enough cache to hold A without touching RAM.

chriselrod (Sep 25 2021 at 01:00):

So it could also be that this is the issue. (i.e., we see a huge speedup only because it fits in cache, and otherwise we wouldn't)

Takafumi Arakaki (tkf) (Sep 25 2021 at 01:01):

julia> sizeof(A) / (CPUSummary.num_l2cache()*CPUSummary.cache_size(Val(2)) + CPUSummary.num_l3cache()*CPUSummary.cache_size(Val(3)))
0.1306423611111111

chriselrod (Sep 25 2021 at 01:01):

In terms of flops, 58 GFLOPS wouldn't be a bad number to hit with a single core:

julia> 2e-9length(A) / 167.646e-6
58.83330350858356

chriselrod (Sep 25 2021 at 01:04):

julia> B = rand(2048*2, 2408*2);

julia> sizeof(B) / (CPUSummary.num_l2cache()*CPUSummary.cache_size(Val(2)) + CPUSummary.num_l3cache()*CPUSummary.cache_size(Val(3)))
3.52046783625731

julia> @btime sum_squares_turbo($B)
  1.657 ms (0 allocations: 0 bytes)
6.575621490774e6

julia> @btime sum_squares($B)
  1.905 ms (5478 allocations: 346.83 KiB)
6.575621490774e6

julia> 2e-9length(B) / 1.676e-3
23.539780429594273

Making B 4x larger than A so that it is now too large for my cache makes it take 10x longer.
I get half the GFLOPS from LV and performance is closer.

Most likely explanation: tkf and my machine are fast enough that overhead makes a huge difference.
LV only takes less than 170 microseconds for each of us.

julia> @benchmark fetch(Threads.@spawn 1+1)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   2.022 μs … 68.784 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     59.636 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   55.873 μs ± 13.982 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▁▁ ▁▂▁▂▁▁ ▁                                     ▄▆████▇▇▅▄ ▂
  ███▇██████████▇▇▇▆▅▅▅▆▄▆▄▄▄▅▂▄▅▅▆▄▅▅▄▅▄▄▅▅▅▅▅▄▄▄▇██████████ █
  2.02 μs      Histogram: log(frequency) by time      65.8 μs <

 Memory estimate: 473 bytes, allocs estimate: 4.

Hence, the overhead of base threading makes a huge difference for us, but not for slower (fewer core) machines.

Expanding Man (Sep 25 2021 at 01:08):

Oh wait, if I use transducers Map it's not multi-threaded by default? I guess I need to read the docs? :sweat_smile:

chriselrod (Sep 25 2021 at 01:10):

Tacking on 150 microseconds of overhead makes a huge difference when LV takes 170 microseconds (for A), but much less difference when LV takes 1700 microseconds (for B).

I think LV handles scaling with many threads better than base's threading, hence @Takafumi Arakaki (tkf) gets 400+ microseconds of overhead.

With only 12 threads, maybe @Mason Protter 's threading overhead is around 50 microseconds, while LV's baseline time is much slower, combining to make only a small proportional difference.

Takafumi Arakaki (tkf) (Sep 25 2021 at 01:19):

@Expanding Man Yeah, your right. You'd need to combine it with ThreadsX.jl or Folds.jl (or use foldxt explicitly).

Long answer: Map does not care how it's executed and so it all depends on how you reduce it. sum (or rather it's internal foldl) is by default sequential. You'd need to opt-in parallelism since I can't analyze `Map(f) to see if it is safe to do it. You can opt-in parallelism it by choosing the variants of the "last function you apply" like sum, collect, Set, etc. I guess it's imaginable to make it parallel by default, but Transducers.jl started as a sequential computing package.

Expanding Man (Sep 25 2021 at 01:21):

That's really cool. I was recently discussing on slack how I love transducers but I tend to avoid it because it's such basic functionality (not necessarily basic implementation) that I expect it from base so I hesitate to add it as a dependency. I should stop doing this.

Takafumi Arakaki (tkf) (Sep 25 2021 at 01:34):

Takafumi Arakaki (tkf) (Sep 25 2021 at 01:39):

@chriselrod Hmm... Isn't it (also?) simply because LV generates better code than the Julia compiler? I see improvements with one thread and small array:

julia> function sum_squares_turbo(A::Array{T}) where {T}
           out = zero(T)
           @tturbo for i in eachindex(A)
               out += A[i]^2
           end
           out
       end;

julia> sum_squares(A) = ThreadsX.mapreduce(x->x^2, +, A; simd = true, basesize = typemax(Int));

julia> sum_squares_foldl(A) = foldxl(+, A |> Map(x->x^2); simd = true);

julia> xs = rand(2048);

julia> @btime sum_squares_turbo($xs);
  104.054 ns (0 allocations: 0 bytes)

julia> @btime sum_squares($xs);  # base case + some surrounding overhead
  275.578 ns (2 allocations: 32 bytes)

julia> @btime sum_squares_foldl($xs);  # base case, without overhead
  199.763 ns (0 allocations: 0 bytes)

Expanding Man (Sep 25 2021 at 01:42):

Transducers in Base would be super awesome. In particular my complaints in my aforementioned discussion on slack was about how Base.Iterators is really weird because it's very basic functionality that's shoved into another module. A really cool thing about the idea of transducers in base (though perhaps this is a frivolous reason to want it) is that it can naturally fit without interfering with the existing namespace.

chriselrod (Sep 25 2021 at 02:00):

I figured the arrays being memory bound would mean this doesn't matter.
You'll notice the difference go away if you make the arrays much larger.

If you're curious, LV is unrolling by 8 while LLVM unrolls by 4. If you have up to 2 loads per cycle and 2 fmas per cycle, the CPU ideally wouldn't be bottlenecked by memory if it is all in the L1 cache. With a latency of 4 cycles per fma and 2 fma issued/cycle, you'd need 8 in flight at a time without dependency chain conflicts. Hence LV will unroll it by 8.
A dot product on the other hand, which requires 2 loads per fma, it'll only unroll by 4 because the loads limit it to at best 1 fma/cycle.

chriselrod (Sep 25 2021 at 02:03):

The theoretical peak flops of a single core of my CPU is 124.8 GFLOPS. The multithreaded benchmark was making less than half that using 18 cores, meaning the cores themselves were probably idle most of the time rather than hitting the limit on fmas issued/cycle.

Takafumi Arakaki (tkf) (Sep 25 2021 at 02:11):

Ah, you are right. There's still some difference if it fits in L2 but if it hits L3 there's no difference

julia> xs = rand(2^15);

julia> sizeof(xs) / CPUSummary.cache_size(Val(1))
8.0

julia> xs = rand(2^15);

julia> @btime sum_squares_turbo($xs);
  2.463 μs (0 allocations: 0 bytes)

julia> @btime sum_squares_foldl($xs);
  3.081 μs (0 allocations: 0 bytes)

julia> xs = rand(2^19);

julia> sizeof(xs) / CPUSummary.cache_size(Val(2))
8.0

julia> @btime sum_squares_turbo($xs);
  57.410 μs (0 allocations: 0 bytes)

julia> @btime sum_squares_foldl($xs);
  57.509 μs (0 allocations: 0 bytes)

Mark Kittisopikul (Sep 25 2021 at 06:07):

Thanks for sorting this out guys. I learned a lot from reading over the conversation.

Mark Kittisopikul (Sep 25 2021 at 06:20):

Mason Protter (Sep 25 2021 at 06:26):

@Mark Kittisopikul can you confirm that the array you benchmarked in Numba is the same size as the array in Julia?

Mark Kittisopikul (Sep 25 2021 at 06:26):

Mason Protter (Sep 25 2021 at 06:27):

Mark Kittisopikul (Sep 25 2021 at 06:28):

Mark Kittisopikul (Sep 25 2021 at 21:19):

Looking over the history, I think you were absolutely correct, @Mason Protter . I screwed up the inputs between the two languages in the end of my testing. Originally, my test suite was completely starting from pyjulia and using Main.eval from Python to conduct my tests, but I simplified to a pure Julia example for this post, and that is when I made the typo.

Stream: helpdesk (published)

Topic: Sum of Squares, Numba vs Julia

Mark Kittisopikul (Sep 24 2021 at 21:39):

Expanding Man (Sep 24 2021 at 21:55):

Takafumi Arakaki (tkf) (Sep 24 2021 at 23:57):

Expanding Man (Sep 24 2021 at 23:59):

Mason Protter (Sep 25 2021 at 00:02):

Expanding Man (Sep 25 2021 at 00:03):

Mason Protter (Sep 25 2021 at 00:03):

Mason Protter (Sep 25 2021 at 00:05):

Expanding Man (Sep 25 2021 at 00:06):

Expanding Man (Sep 25 2021 at 00:06):

Mason Protter (Sep 25 2021 at 00:06):

Expanding Man (Sep 25 2021 at 00:07):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:09):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:10):

Expanding Man (Sep 25 2021 at 00:12):

Mason Protter (Sep 25 2021 at 00:14):

Mason Protter (Sep 25 2021 at 00:14):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:17):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:18):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:19):

Expanding Man (Sep 25 2021 at 00:19):

Mason Protter (Sep 25 2021 at 00:20):

Expanding Man (Sep 25 2021 at 00:21):

Expanding Man (Sep 25 2021 at 00:22):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:22):

Expanding Man (Sep 25 2021 at 00:24):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:31):

Expanding Man (Sep 25 2021 at 00:34):

chriselrod (Sep 25 2021 at 00:37):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:38):

chriselrod (Sep 25 2021 at 00:38):

chriselrod (Sep 25 2021 at 00:38):

Expanding Man (Sep 25 2021 at 00:39):

Expanding Man (Sep 25 2021 at 00:46):

chriselrod (Sep 25 2021 at 00:53):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:54):

chriselrod (Sep 25 2021 at 00:55):

Takafumi Arakaki (tkf) (Sep 25 2021 at 00:55):

Mason Protter (Sep 25 2021 at 00:58):

chriselrod (Sep 25 2021 at 00:59):

chriselrod (Sep 25 2021 at 01:00):

chriselrod (Sep 25 2021 at 01:00):

chriselrod (Sep 25 2021 at 01:00):

Takafumi Arakaki (tkf) (Sep 25 2021 at 01:01):

chriselrod (Sep 25 2021 at 01:01):

chriselrod (Sep 25 2021 at 01:04):

Expanding Man (Sep 25 2021 at 01:08):

chriselrod (Sep 25 2021 at 01:10):

Takafumi Arakaki (tkf) (Sep 25 2021 at 01:19):

Expanding Man (Sep 25 2021 at 01:21):

Takafumi Arakaki (tkf) (Sep 25 2021 at 01:34):

Takafumi Arakaki (tkf) (Sep 25 2021 at 01:39):

Expanding Man (Sep 25 2021 at 01:42):

chriselrod (Sep 25 2021 at 02:00):

chriselrod (Sep 25 2021 at 02:03):

Takafumi Arakaki (tkf) (Sep 25 2021 at 02:11):

Mark Kittisopikul (Sep 25 2021 at 06:07):

Mark Kittisopikul (Sep 25 2021 at 06:20):

Mason Protter (Sep 25 2021 at 06:26):

Mark Kittisopikul (Sep 25 2021 at 06:26):

Mason Protter (Sep 25 2021 at 06:27):

Mason Protter (Sep 25 2021 at 06:27):

Mark Kittisopikul (Sep 25 2021 at 06:28):

Mark Kittisopikul (Sep 25 2021 at 21:19):

Mason Protter (Sep 25 2021 at 21:20):

Mark Kittisopikul (Sep 25 2021 at 21:52):