Stream: helpdesk (published)

Topic: Multithreaded copyto! and permutedims!


Mark Kittisopikul (May 13 2022 at 13:07):

Is there an accelerated version of copyto! and permutedims! somewhere?

My initial experiments with SIMD.jl show it is possible to speed up these procedures. I'm wondering if someone has already done this?

Mark Kittisopikul (May 13 2022 at 18:12):

Here is a concrete problem I have. I want to make the following faster:

julia> A = Array{UInt16}(undef, 4608, 2592, 128);

julia> B = Array{UInt16}(undef, 4608, 2688, 128);

julia> @time copyto!(@view(B[:, axes(A)[2], :]), A);
  3.951746 seconds (13 allocations: 384 bytes)

Mark Kittisopikul (May 13 2022 at 18:15):

I could do this, but is there a faster way?

julia> @time Threads.@threads for i in 1:128
           copyto!(@view(B[:, axes(A)[2], i]), @view(A[:,:,i]))
       end
  0.647388 seconds (21.29 k allocations: 1.134 MiB, 6.40% compilation time)
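A per-slice version of that idea can be wrapped in a function and sanity-checked at toy sizes (`threaded_copyto!` is just an illustrative name, not an existing API):

```julia
# Copy A into the matching leading block of B, one third-axis slice
# per Threads.@threads iteration.
function threaded_copyto!(B, A)
    Threads.@threads for k in axes(A, 3)
        copyto!(@view(B[axes(A, 1), axes(A, 2), k]), @view(A[:, :, k]))
    end
    return B
end

A = rand(UInt16, 8, 6, 4)
B = zeros(UInt16, 8, 7, 4)
threaded_copyto!(B, A)
@assert B[:, 1:6, :] == A   # copied block matches the source
```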

Mason Protter (May 13 2022 at 18:21):

Chris Elrod is your friend here.

julia> using LoopVectorization

julia> function my_copyto_tturbo!(A, B)
           @tturbo for i in eachindex(A, B)
               A[i] = B[i]
           end
           A
       end;

julia> let A = Array{UInt16}(undef, 4608, 2592, 128),  B = Array{UInt16}(undef, 4608, 2688, 128)
           @btime copyto!(@view($B[:, axes($A)[2], :]), $A);
           @btime my_copyto_tturbo!(@view($B[:, axes($A)[2], :]), $A);
       end;
  4.636 s (0 allocations: 0 bytes)
  200.800 ms (0 allocations: 0 bytes)

Michael Abbott (May 13 2022 at 23:20):

For permutedims!, btw, TensorOperations.jl is often very fast:

julia> using TensorOperations, LoopVectorization

julia> function my_permute_tturbo!(A, B)
           @tturbo for i in axes(A,1), j in axes(A,2), k in axes(A,3)
               B[k,j,i] = A[i,j,k]
           end
           A
       end;

julia> let A = rand(UInt16, 4608, 2592, 128),  B = Array{UInt16}(undef, 128, 2592, 4608)
           @btime permutedims!($B, $A, (3,2,1))
           @btime my_permute_tturbo!($B, $A)
           @btime @tensor $B[k,j,i] = $A[i,j,k]
       end;
  13.564 s (0 allocations: 0 bytes)
  6.656 s (0 allocations: 0 bytes)
  1.068 s (62 allocations: 4.89 KiB)
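For reference, the index convention in the hand-written loops above matches Base `permutedims!` with perm `(3,2,1)`: the destination must have the reversed size, and `B[k,j,i] == A[i,j,k]`. A toy-sized sanity check:

```julia
A = rand(UInt16, 5, 4, 3)
B = Array{UInt16}(undef, 3, 4, 5)   # size reversed relative to A

permutedims!(B, A, (3, 2, 1))

# Same convention as the hand-written loops: B[k, j, i] == A[i, j, k]
@assert all(B[k, j, i] == A[i, j, k]
            for i in axes(A, 1), j in axes(A, 2), k in axes(A, 3))
```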

chriselrod (May 13 2022 at 23:47):

LoopVectorization's multithreading seems to be doing something bad here.

julia> function my_permute_polyester!(A, B)
           @batch for j in axes(A,2)
               @turbo for i in axes(A,1), k in axes(A,3)
                   B[k,j,i] = A[i,j,k]
               end
           end
           A
       end;

julia> let A = rand(UInt16, 4608, 2592, 128),  B = Array{UInt16}(undef, 128, 2592, 4608)
           @time permutedims!(B, A, (3,2,1))
           @time my_permute_tturbo!(B, A)
           @time my_permute_polyester!(B, A)
           @time @tensor B[k,j,i] = A[i,j,k]
       end;
 14.993049 seconds
  3.187791 seconds (3 allocations: 96 bytes)
  0.495429 seconds
  0.721222 seconds (515 allocations: 41.172 KiB)

The allocation is because I'm using @time, and Polyester's threads have to wake up.
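For comparison, the same per-column split can be written with only Base threads, no Polyester or LoopVectorization (`my_permute_basethreads!` is just an illustrative name):

```julia
# Parallelize over the second axis with Base threads; plain inner
# loops instead of @turbo.
function my_permute_basethreads!(A, B)
    Threads.@threads for j in axes(A, 2)
        for k in axes(A, 3), i in axes(A, 1)
            B[k, j, i] = A[i, j, k]
        end
    end
    A
end

A = rand(UInt16, 6, 5, 4)
B = Array{UInt16}(undef, 4, 5, 6)
my_permute_basethreads!(A, B)
@assert B == permutedims(A, (3, 2, 1))   # matches Base's permutedims
```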

chriselrod (May 13 2022 at 23:52):

Anyway, as the Polyester example shows, LV is obviously making a stupid threading decision. I doubt my @batch on axis 2 is particularly smart, either, it was just the first thing I tried, and apparently 6x less stupid.

Mason Protter (May 13 2022 at 23:54):

I noticed that with the copyto! example there was no advantage to using @tturbo vs @turbo, even though it was taking hundreds of milliseconds, so there's probably a bad heuristic there too.

chriselrod (May 13 2022 at 23:58):

I think it isn't picking good loops to thread.

chriselrod (May 13 2022 at 23:58):

Also, copyto! will likely be faster than @turbo.

chriselrod (May 13 2022 at 23:59):

copyto! forwards to memcpy which can have optimized implementations using non-temporal stores (as well as arch-based dispatch).
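One way to see why the whole copy in the earlier example can't be forwarded to a single memcpy: the destination view is strided rather than contiguous, while each individual column of it still is. (`Base.iscontiguous` is an internal, but long-standing, helper for `SubArray`s.)

```julia
A = rand(UInt16, 8, 6, 4)
B = rand(UInt16, 8, 7, 4)

# The view omits B's 7th column in every slab, so its memory is
# not one contiguous block:
@assert !Base.iscontiguous(@view B[:, 1:6, :])

# A single column view is contiguous, so column-at-a-time copies
# can still hit the memcpy fast path:
@assert Base.iscontiguous(@view B[:, 1, 1])
```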

chriselrod (May 14 2022 at 00:00):

vmapnt!(identity, dst, src) matched it in that example, but of course the advantage of the memcpy implementations is that they'll only use non-temporal stores for large arrays, where early copied elements are going to be flushed from your cache anyway.

chriselrod (May 14 2022 at 00:04):

ls corresponds to my_permute_tturbo!:

julia> LoopVectorization.choose_order(ls)
([:k, :i, :j], :i, :k, :i, 1, 1)

So it looks like it's putting the j loop as the inner-most loop, and then threading the k and i loop. The i loop is also what it is vectorizing, meaning it is going with contiguous loads + scatters, which also seems questionable. gathers + contiguous stores should be faster, but I think as a heuristic it prefers faster loads as you're more likely to be bottlenecked on latency.

Actually, a really clever thing I should try to get the rewrite to do is have a combined kernel that uses both gathers + contiguous stores and contiguous loads + scatters, to balance loads and stores (in proportion to the actual number of each the CPU is capable of)...

Michael Abbott (May 14 2022 at 00:26):

It's a pity Base's permutedims! is so much slower, even without multithreading. Maybe TO's algorithm should be built-in? BLAS doesn't do this if google is telling the truth, but do libraries like MKL have fast routines? copyto! is just tripping over the view somehow, costing 20x.

Michael Abbott (May 14 2022 at 00:27):

(IIRC there are actually 2 or 3 different blocking stories for permutedims! in different parts of Base.)

chriselrod (May 14 2022 at 00:35):

Locally:

julia> @time my_permute_tturbo!(B, A);
  0.145698 seconds

julia> @time my_permute_tturbo!(B, A);
  0.139114 seconds

chriselrod (May 14 2022 at 00:39):

I just pushed to master.
It is still far from optimal, but I at least made a minimal tweak to fix the really bad behavior seen currently.

I don't want to make more substantial changes, as my LV efforts are being focused on the rewrite.
But possibly handling something like this really well via balanced gathers/scatters is an interesting idea I'll get around to trying out eventually...
Of course, part of the trick there is that you need to divide up the iteration space, and then fuse overlapping pieces.
That could perhaps be guarded by a "I don't care how much code you're generating!" flag.

chriselrod (May 14 2022 at 02:03):

The @tturbo version's performance should be better with the newly released LoopVectorization v0.12.109.

Maarten (May 14 2022 at 18:57):

Do the Polyester threads already play nice with the "normal" Base Julia threads?

Mason Protter (May 14 2022 at 19:46):

AFAIK no, and there’s no plans for that.

Brenhin Keller (May 14 2022 at 20:27):

Polyester threads are better anyways :grinning:

chriselrod (May 14 2022 at 21:01):

Kiran was in Boston for a week a few weeks ago (and was the hecklee at a CAJUN "heckle a Julia developer" event).
Improving the performance of Base threads was a priority of his.

He also said he'd create a benchmark repo that contains Julia and OpenMP versions for comparison, and would accept a PR that also adds Polyester/LV versions for comparison as well.

Hopefully things will get better, but I do remain a bit skeptical.

Brenhin Keller (May 14 2022 at 21:31):

Oh that'd be cool!

Mason Protter (May 14 2022 at 21:34):

Brenhin Keller said:

Polyester threads are better anyways :grinning:

IMO they're not better, but they are very useful.

Mason Protter (May 14 2022 at 21:35):

If I were to have a language with only one threading system, I wouldn't choose Polyester's set of tradeoffs. But as one choice among ways to thread code, it's fantastic.

Brenhin Keller (May 14 2022 at 21:56):

Oh, not being too serious or anything -- and "better" is super subjective. That said, while it's great that there is multithreading in Base, and the async/spawn approach is cool and all, for me personally the cases where multithreading is super useful are kind of squeezed between the efficiency you can get from a single thread with SIMD on one side, and multiprocessing on the other. For the cases in between, where I want more than one thread but don't want to reach all the way over to message passing and such, low overhead is kind of the key priority for me.

chriselrod (May 14 2022 at 22:35):

I have never worked in/with HPC, but I've been told people who do really want composable task-based systems (i.e., base Julia, not Polyester).

chriselrod (May 14 2022 at 22:36):

Dynamic scheduling will also become increasingly important as architectures become increasingly heterogeneous.

chriselrod (May 14 2022 at 22:39):

In principle, the idea of a few big cores optimized for single threaded performance + a bunch of little cores optimized for performance/die area makes a lot of sense for maximizing a CPU's performance.
Many programs just aren't going to scale beyond a few cores, so have a few big cores to make them fast. For programs that do scale well with more cores, throw as many as you can at them.
Intel is going this route, with their next generation of CPUs being rumored to have 16 little cores.
I'm not sure about Apple, which tends to have far fewer little cores, aimed at background processes rather than doing real work?
I watched a presentation by Tenstorrent that advertised a similar concept (fast big core + tons of small ones) for ML work.

So maybe breadth-first or work stealing works well here; i.e., we want something dynamic to best take advantage of heterogeneous compute, where it would be difficult to know a priori how to actually divide the work up.
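In Base Julia, that kind of dynamic division of work is expressed with composable tasks, e.g. Threads.@spawn, where the scheduler decides at runtime which thread runs each chunk (dynamic_sum below is just an illustrative sketch):

```julia
# Split a reduction into more chunks than threads and let the
# scheduler balance them at runtime -- no a-priori division of work.
function dynamic_sum(xs; nchunks = 4 * Threads.nthreads())
    chunklen = cld(length(xs), nchunks)
    tasks = map(1:chunklen:length(xs)) do lo
        hi = min(lo + chunklen - 1, length(xs))
        Threads.@spawn sum(@view xs[lo:hi])
    end
    return sum(fetch, tasks)
end

xs = collect(1.0:1000.0)
@assert dynamic_sum(xs) == sum(xs)
```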

chriselrod (May 14 2022 at 22:40):

There's a decent chance Julia moves to work stealing in the future. With work stealing, we can't be better than the competition, but -- given it is what the competition uses -- at least we'll actually be able to match it.

Kiran maintains that depth first is the better algorithm in theory, but there's a lot of unsolved issues for a good implementation, while work stealing is much more of a known entity.

chriselrod (May 14 2022 at 22:46):

Julia probably wants heterogeneous architectures.
Compiling sucks on the A64FX, which is basically just a bunch of small cores (that have big vectors). Bolting a big core or two on there, at least to run the JIT, would make it much nicer for Julia.

Brenhin Keller (May 14 2022 at 22:50):

Yeah, that does make sense. I'm probably an outlier; real HPC folks have been all about "hierarchical parallelism" for a while now

Mosè Giordano (May 14 2022 at 22:50):

Man, I've been doing benchmarks on the A64FX, it's sooooo slow

Brenhin Keller (May 14 2022 at 22:52):

Haha, oh no

Mosè Giordano (May 14 2022 at 22:54):

Even with parallel precompilation in v1.7, I get Julia-1.0-level precompilation times. It's very common to stare at the screen for 5-10 minutes waiting for precompilation to complete

Brenhin Keller (May 14 2022 at 22:55):

A blast from the past :laughing:

Brenhin Keller (May 14 2022 at 22:56):

What's the clock speed on those again?

Mosè Giordano (May 14 2022 at 23:00):

2.0-2.2 GHz

Brenhin Keller (May 14 2022 at 23:03):

Oh huh, so slow, but not insanely slow... is it lack of cache and low instructions-per-cycle, then, that limits precompilation?

Mosè Giordano (May 14 2022 at 23:38):

I'm not sure about why it's so slow, maybe Chris has some ideas. In general Julia's compilation is very slow on some aarch64 boards with low memory, but I don't think this is the case for this CPU. In any case the experience is pretty bad.

Over the last few weeks I've been running some MPI performance benchmarks to measure the latency of MPI.jl and the Julia overhead compared to C. Some results are good, others less so, but I want to run more benchmarks in the coming weeks.

Mosè Giordano (May 15 2022 at 00:41):

chriselrod said:

but I do remain a bit skeptical.

skeptical about what specifically?

chriselrod (May 15 2022 at 01:57):

Mosè Giordano said:

chriselrod said:

but I do remain a bit skeptical.

skeptical about what specifically?

Skeptical may not have been the right word. I guess I could say "skeptical that the approach will be able to enable a small multithreaded BLAS that is competitive with MKL", but that's also not really the use case they're targeting, and other than looking good in benchmarks, I have to wonder how valuable that really is.

chriselrod (May 15 2022 at 01:57):

The A64FX really doesn't look that bad on paper: https://www.stonybrook.edu/commcms/ookami/support/_docs/A64FX_Microarchitecture_Manual_en_1.3.pdf

chriselrod (May 15 2022 at 01:58):

E.g., it has 128 floating point registers, vs 168 in Intel Skylake-X.

chriselrod (May 15 2022 at 02:00):

A "commit stack entry" of 128, which, if I'm reading correctly, should be compared with Skylake's reorder buffer (ROB) of 224 entries.

chriselrod (May 15 2022 at 02:00):

Both can decode 4 instructions/cycle.

chriselrod (May 15 2022 at 02:04):

I.e., it looks slower than Skylake on paper, but compare that with chips like the A53 and A55, which are only dual-issue (half the A64FX's 4, which basically matches Skylake).

Mark Kittisopikul (May 17 2022 at 16:20):

I'm just catching up here. Thank you for the great tips.


Last updated: Dec 28 2024 at 04:38 UTC