Multithreaded copyto! and permutedims! · helpdesk (published)

My initial experiments with SIMD.jl show it is possible to speed up these procedures. I'm wondering if someone has already done this?

Mark Kittisopikul (May 13 2022 at 18:12):

julia> A = Array{UInt16}(undef, 4608, 2592, 128);

julia> B = Array{UInt16}(undef, 4608, 2688, 128);

julia> @time copyto!(@view(B[:, axes(A)[2], :]), A);
  3.951746 seconds (13 allocations: 384 bytes)

Mark Kittisopikul (May 13 2022 at 18:15):

julia> @time Threads.@threads for i in 1:128
           copyto!(@view(B[:, axes(A)[2], :]), @view(A[:,:,i]))
       end
  0.647388 seconds (21.29 k allocations: 1.134 MiB, 6.40% compilation time)

Mason Protter (May 13 2022 at 18:21):

julia> using LoopVectorization

julia> function my_copyto_tturbo!(A, B)
           @tturbo for i ∈ eachindex(A, B)
               A[i] = B[i]
           end
           A
       end;

julia> let A = Array{UInt16}(undef, 4608, 2592, 128),  B = Array{UInt16}(undef, 4608, 2688, 128)
           @btime copyto!(@view($B[:, axes($A)[2], :]), $A);
           @btime my_copyto_tturbo!(@view($B[:, axes($A)[2], :]), $A);
       end;
  4.636 s (0 allocations: 0 bytes)
  200.800 ms (0 allocations: 0 bytes)

Michael Abbott (May 13 2022 at 23:20):

julia> using TensorOperations, LoopVectorization

julia> function my_permute_tturbo!(A, B)
           @tturbo for i in axes(A,1), j in axes(A,2), k in axes(A,3)
               B[k,j,i] = A[i,j,k]
           end
           A
       end;

julia> let A = rand(UInt16, 4608, 2592, 128),  B = Array{UInt16}(undef, 128, 2592, 4608)
           @btime permutedims!($B, $A, (3,2,1))
           @btime my_permute_tturbo!($B, $A)
           @btime @tensor $B[k,j,i] = $A[i,j,k]
       end;
  13.564 s (0 allocations: 0 bytes)
  6.656 s (0 allocations: 0 bytes)
  1.068 s (62 allocations: 4.89 KiB)

chriselrod (May 13 2022 at 23:47):

julia> function my_permute_polyester!(A, B)
           @batch for j in axes(A,2)
           @turbo for i in axes(A,1), k in axes(A,3)
               B[k,j,i] = A[i,j,k]
           end; end
           A
       end;

julia> let A = rand(UInt16, 4608, 2592, 128),  B = Array{UInt16}(undef, 128, 2592, 4608)
           @time permutedims!(B, A, (3,2,1))
           @time my_permute_tturbo!(B, A)
           @time my_permute_polyester!(B, A)
           @time @tensor B[k,j,i] = A[i,j,k]
       end;
 14.993049 seconds
  3.187791 seconds (3 allocations: 96 bytes)
  0.495429 seconds
  0.721222 seconds (515 allocations: 41.172 KiB)

The allocation is because I'm using @time, and Polyester's threads have to wake up.

chriselrod (May 13 2022 at 23:52):

Anyway, as the Polyester example shows, LV is obviously making a stupid threading decision. I doubt my @batch on axis 2 is particularly smart, either, it was just the first thing I tried, and apparently 6x less stupid.

Mason Protter (May 13 2022 at 23:54):

I noticed with the copyto! example, there was no advantage to using @tturbo vs @turbo even though it was taking hundreds of miliseconds, so probably a bad heuristic there too

chriselrod (May 13 2022 at 23:58):

chriselrod (May 13 2022 at 23:59):

copyto! forwards to memcpy which can have optimized implementations using non-temporal stores (as well as arch-based dispatch).

chriselrod (May 14 2022 at 00:00):

vmapnt!(identity, dst, src) matched it in that example, but of course the advantage of the memcpy implementations is that they'll only use non-temporal stores for large arrays, where early copied elements are going to be flushed from your cache anyway.

chriselrod (May 14 2022 at 00:04):

julia> LoopVectorization.choose_order(ls)
([:k, :i, :j], :i, :k, :i, 1, 1)

So it looks like it's putting the j loop as the inner-most loop, and then threading the k and i loop. The i loop is also what it is vectorizing, meaning it is going with contiguous loads + scatters, which also seems questionable. gathers + contiguous stores should be faster, but I think as a heuristic it prefers faster loads as you're more likely to be bottlenecked on latency.

Actually, a really clever thing I should try to get the rewrite to do is have a combined kernel that uses both gathers + contiguous stores and contiguous loads + scatters, to balance loads and store (in proportion to the actual number of each the CPU is capable of)...

Michael Abbott (May 14 2022 at 00:26):

It's a pity Base's permutedims! is so much slower, even without multithreading. Maybe TO's algorithm should be built-in? BLAS doesn't do this if google is telling the truth, but do libraries like MKL have fast routines? copyto! is just tripping over the view somehow, costing 20x.

Michael Abbott (May 14 2022 at 00:27):

(IIRC there are actually at 2 or 3 different blocking stories for permutedims! in different parts of Base.)

chriselrod (May 14 2022 at 00:35):

julia> @time my_permute_tturbo!(B, A);
  0.145698 seconds

julia> @time my_permute_tturbo!(B, A);
  0.139114 seconds

chriselrod (May 14 2022 at 00:39):

I just pushed to master.
It is still far from optimal, but I at least made a minimal tweak to fix the really bad behavior seen currently.

I don't want to make more substantial changes, as my LV efforts are being focused on the rewrite.
But possibly handling something like this really well via balanced gather/scatters is an interesting idea I'll try and get around to trying out eventually...
Of course, part of the trick there is that you need to divide up the iteration space, and then fuse overlapping pieces.
That could perhaps be guarded by a "I don't care how much code you're generating!" flag.

chriselrod (May 14 2022 at 02:03):

The @tturbo version's performance should be better with the newly released LoopVectorization v0.12.109.

Maarten (May 14 2022 at 18:57):

do the polyester threads already play nice with the "normal" base julia threads?

Mason Protter (May 14 2022 at 19:46):

Brenhin Keller (May 14 2022 at 20:27):

chriselrod (May 14 2022 at 21:01):

Kiran was in Boston for a week a few weeks ago (and was the hecklee at a CAJUN "heckle a Julia developer").
Improving performance of Base threads was a priority of his.

He also said he'd create a benchmark repo that contains Julia and OpenMP versions for comparison, and would accept a PR that also adds Polyester/LV versions for comparison as well.

Brenhin Keller (May 14 2022 at 21:31):

Mason Protter (May 14 2022 at 21:34):

Mason Protter (May 14 2022 at 21:35):

If I was to have a language with only one threading system, I wouldn't choose Polyester's set of tradeoffs. But as a choice of one way to thread code, it's fantastic

Brenhin Keller (May 14 2022 at 21:56):

Oh not being too serious or anything -- and "better" is super subjective. That said, while it's great that there is multithreading in base, and the async / spawn approach is cool and all, for me personally the cases where multithreading is super useful are kind of squeezed between the efficiency you can get from a single thread with SIMD on one side, and multiprocessing on the other side. For the cases in-between where I want more than one thread, but don't want to reach all the way over to message passing and such, low overhead is kinda the key priority personally.

chriselrod (May 14 2022 at 22:35):

I have never worked in/with HPC, but I've been told people who do really want composable task-based systems (i.e., base Julia, not Polyester).

chriselrod (May 14 2022 at 22:36):

Dynamic scheduling will also become increasingly important as architectures becoming increasingly heterogenous.

chriselrod (May 14 2022 at 22:39):

In principle, the idea of a few big cores optimized for single threaded performance + a bunch of little cores optimized for performance/die area makes a lot of sense for maximizing a CPU's performance.
Many programs just aren't going to scale beyond a few cores, so have a few big cores to make them fast. For programs that do scale well with more cores, throw as many as you can at them.
Intel is going this route, with their next generation of CPUs being rumored to have 16 little cores.
I'm not sure about Apple, which tends to have much fewer little cores aimed at background processes rather than doing work?
I watched a presentation by Tenstorrent that advertised a similar concept (fast big core + tons of small ones) for ML work.

So maybe breadth first or work stealing work well here; i.e. we want something dynamic to best be able to take advantage of heterogenous compute, where it would be difficult to know how to actually divide work up a priori.

chriselrod (May 14 2022 at 22:40):

There's a decent chance Julia moves to work stealing in the future. With work stealing, we can't be better than the competition, but -- given it is what the competition uses -- at least we'll actually be able to match it.

Kiran maintains that depth first is the better algorithm in theory, but there's a lot of unsolved issues for a good implementation, while work stealing is much more of a known entity.

chriselrod (May 14 2022 at 22:46):

Julia probably wants heterogenous architectures.
Compiling sucks on the A64FX, which is basically just a bunch of small cores (that have big vectors). Bolting a big core or two on there to at least run the JIT would make it much nicer for Julia

Brenhin Keller (May 14 2022 at 22:50):

Yeah, that does make sense. I'm probably an outlier; real HPC folks have been all about "hierarchical parallelism" for a while now

Mosè Giordano (May 14 2022 at 22:50):

Brenhin Keller (May 14 2022 at 22:52):

Mosè Giordano (May 14 2022 at 22:54):

With parallel precompilation in v1.7, I have to go through Julia-1.0-level precompilation times. It is very common to stare the screen for 5-10 minutes for precompilation to complete

Brenhin Keller (May 14 2022 at 22:55):

Brenhin Keller (May 14 2022 at 22:56):

Mosè Giordano (May 14 2022 at 23:00):

Brenhin Keller (May 14 2022 at 23:03):

Oh huh, so slow but not insanely slow.. so then it's lack of cache and low instructions-per-cycle and such then that limits precompilation?

Mosè Giordano (May 14 2022 at 23:38):

I'm not sure about why it's so slow, maybe Chris has some ideas. In general Julia's compilation is very slow on some aarch64 boards with low memory, but I don't think this is the case for this CPU. In any case the experience is pretty bad.

In the last weeks I'm doing some MPI performance to measure latency when using MPI.jl and Julia overhead compared to C. Some results are good, others less so, but I want to run more benchmarks in the next weeks.

Mosè Giordano (May 15 2022 at 00:41):

chriselrod (May 15 2022 at 01:57):

Skeptical may not have been the right word. I guess I could say "skeptical that the approach will be able enable small multithreaded BLAS that is competitive with MKL", but that's also not really the use case they're targeting, and other than looking good on benchmarks, I have to wonder how valuable that really is.

chriselrod (May 15 2022 at 01:57):

chriselrod (May 15 2022 at 01:58):

chriselrod (May 15 2022 at 02:00):

"Commit stack entry" of 128, which if I'm reading correctly, should be compared with Skylake's ROB of 224.

chriselrod (May 15 2022 at 02:00):

chriselrod (May 15 2022 at 02:04):

I.e., it looks slower than Skylake on paper, but compare that with chips like the A53 and A55, which are only dual-issue (half the A64FX's 4, which basically matches Skylake).

Stream: helpdesk (published)

Topic: Multithreaded copyto! and permutedims!

Mark Kittisopikul (May 13 2022 at 13:07):

Mark Kittisopikul (May 13 2022 at 18:12):

Mark Kittisopikul (May 13 2022 at 18:15):

Mason Protter (May 13 2022 at 18:21):

Michael Abbott (May 13 2022 at 23:20):

chriselrod (May 13 2022 at 23:47):

chriselrod (May 13 2022 at 23:52):

Mason Protter (May 13 2022 at 23:54):

chriselrod (May 13 2022 at 23:58):

chriselrod (May 13 2022 at 23:58):

chriselrod (May 13 2022 at 23:59):

chriselrod (May 14 2022 at 00:00):

chriselrod (May 14 2022 at 00:04):

Michael Abbott (May 14 2022 at 00:26):

Michael Abbott (May 14 2022 at 00:27):

chriselrod (May 14 2022 at 00:35):

chriselrod (May 14 2022 at 00:39):

chriselrod (May 14 2022 at 02:03):

Maarten (May 14 2022 at 18:57):

Mason Protter (May 14 2022 at 19:46):

Brenhin Keller (May 14 2022 at 20:27):

chriselrod (May 14 2022 at 21:01):

Brenhin Keller (May 14 2022 at 21:31):

Mason Protter (May 14 2022 at 21:34):

Mason Protter (May 14 2022 at 21:35):

Brenhin Keller (May 14 2022 at 21:56):

chriselrod (May 14 2022 at 22:35):

chriselrod (May 14 2022 at 22:36):

chriselrod (May 14 2022 at 22:39):

chriselrod (May 14 2022 at 22:40):

chriselrod (May 14 2022 at 22:46):

Brenhin Keller (May 14 2022 at 22:50):

Mosè Giordano (May 14 2022 at 22:50):

Brenhin Keller (May 14 2022 at 22:52):

Mosè Giordano (May 14 2022 at 22:54):

Brenhin Keller (May 14 2022 at 22:55):

Brenhin Keller (May 14 2022 at 22:56):

Mosè Giordano (May 14 2022 at 23:00):

Brenhin Keller (May 14 2022 at 23:03):

Mosè Giordano (May 14 2022 at 23:38):

Mosè Giordano (May 15 2022 at 00:41):

chriselrod (May 15 2022 at 01:57):

chriselrod (May 15 2022 at 01:57):

chriselrod (May 15 2022 at 01:58):

chriselrod (May 15 2022 at 02:00):

chriselrod (May 15 2022 at 02:00):

chriselrod (May 15 2022 at 02:04):

Mark Kittisopikul (May 17 2022 at 16:20):