Is there an accelerated version of copyto!
and permutedims!
somewhere?
My initial experiments with SIMD.jl show it is possible to speed up these procedures. I'm wondering if someone has already done this?
Here is a concrete problem I have. I want to make the following faster:
julia> A = Array{UInt16}(undef, 4608, 2592, 128);
julia> B = Array{UInt16}(undef, 4608, 2688, 128);
julia> @time copyto!(@view(B[:, axes(A)[2], :]), A);
3.951746 seconds (13 allocations: 384 bytes)
I could do this, but is there a faster way?
julia> @time Threads.@threads for i in 1:128
copyto!(@view(B[:, axes(A)[2], :]), @view(A[:,:,i]))
end
0.647388 seconds (21.29 k allocations: 1.134 MiB, 6.40% compilation time)
Chris Elrod is your friend here.
julia> using LoopVectorization
julia> function my_copyto_tturbo!(A, B)
@tturbo for i ∈ eachindex(A, B)
A[i] = B[i]
end
A
end;
julia> let A = Array{UInt16}(undef, 4608, 2592, 128), B = Array{UInt16}(undef, 4608, 2688, 128)
@btime copyto!(@view($B[:, axes($A)[2], :]), $A);
@btime my_copyto_tturbo!(@view($B[:, axes($A)[2], :]), $A);
end;
4.636 s (0 allocations: 0 bytes)
200.800 ms (0 allocations: 0 bytes)
For permutedims!
, btw, TensorOperations.jl is often very fast:
julia> using TensorOperations, LoopVectorization
julia> function my_permute_tturbo!(A, B)
@tturbo for i in axes(A,1), j in axes(A,2), k in axes(A,3)
B[k,j,i] = A[i,j,k]
end
A
end;
julia> let A = rand(UInt16, 4608, 2592, 128), B = Array{UInt16}(undef, 128, 2592, 4608)
@btime permutedims!($B, $A, (3,2,1))
@btime my_permute_tturbo!($B, $A)
@btime @tensor $B[k,j,i] = $A[i,j,k]
end;
13.564 s (0 allocations: 0 bytes)
6.656 s (0 allocations: 0 bytes)
1.068 s (62 allocations: 4.89 KiB)
LoopVectorization's multithreading seems to be doing something bad here.
julia> function my_permute_polyester!(A, B)
@batch for j in axes(A,2)
@turbo for i in axes(A,1), k in axes(A,3)
B[k,j,i] = A[i,j,k]
end; end
A
end;
julia> let A = rand(UInt16, 4608, 2592, 128), B = Array{UInt16}(undef, 128, 2592, 4608)
@time permutedims!(B, A, (3,2,1))
@time my_permute_tturbo!(B, A)
@time my_permute_polyester!(B, A)
@time @tensor B[k,j,i] = A[i,j,k]
end;
14.993049 seconds
3.187791 seconds (3 allocations: 96 bytes)
0.495429 seconds
0.721222 seconds (515 allocations: 41.172 KiB)
The allocation is because I'm using @time
, and Polyester's threads have to wake up.
Anyway, as the Polyester
example shows, LV is obviously making a stupid threading decision. I doubt my @batch
on axis 2 is particularly smart, either, it was just the first thing I tried, and apparently 6x less stupid.
I noticed with the copyto!
example, there was no advantage to using @tturbo
vs @turbo
even though it was taking hundreds of miliseconds, so probably a bad heuristic there too
I think it isn't picking good loops to thread.
Also, copyto!
will likely be faster than @turbo
.
copyto!
forwards to memcpy which can have optimized implementations using non-temporal stores (as well as arch-based dispatch).
vmapnt!(identity, dst, src)
matched it in that example, but of course the advantage of the memcpy implementations is that they'll only use non-temporal stores for large arrays, where early copied elements are going to be flushed from your cache anyway.
ls
corresponds to my_permute_tturbo!
:
julia> LoopVectorization.choose_order(ls)
([:k, :i, :j], :i, :k, :i, 1, 1)
So it looks like it's putting the j
loop as the inner-most loop, and then threading the k
and i
loop. The i
loop is also what it is vectorizing, meaning it is going with contiguous loads + scatters, which also seems questionable. gathers + contiguous stores should be faster, but I think as a heuristic it prefers faster loads as you're more likely to be bottlenecked on latency.
Actually, a really clever thing I should try to get the rewrite to do is have a combined kernel that uses both gathers + contiguous stores and contiguous loads + scatters, to balance loads and store (in proportion to the actual number of each the CPU is capable of)...
It's a pity Base's permutedims!
is so much slower, even without multithreading. Maybe TO's algorithm should be built-in? BLAS doesn't do this if google is telling the truth, but do libraries like MKL have fast routines? copyto!
is just tripping over the view
somehow, costing 20x.
(IIRC there are actually at 2 or 3 different blocking stories for permutedims!
in different parts of Base.)
Locally:
julia> @time my_permute_tturbo!(B, A);
0.145698 seconds
julia> @time my_permute_tturbo!(B, A);
0.139114 seconds
I just pushed to master.
It is still far from optimal, but I at least made a minimal tweak to fix the really bad behavior seen currently.
I don't want to make more substantial changes, as my LV efforts are being focused on the rewrite.
But possibly handling something like this really well via balanced gather/scatters is an interesting idea I'll try and get around to trying out eventually...
Of course, part of the trick there is that you need to divide up the iteration space, and then fuse overlapping pieces.
That could perhaps be guarded by a "I don't care how much code you're generating!" flag.
The @tturbo
version's performance should be better with the newly released LoopVectorization v0.12.109.
do the polyester threads already play nice with the "normal" base julia threads?
AFAIK no, and there’s no plans for that.
Polyester threads are better anyways :grinning:
Kiran was in Boston for a week a few weeks ago (and was the hecklee at a CAJUN "heckle a Julia developer").
Improving performance of Base threads was a priority of his.
He also said he'd create a benchmark repo that contains Julia and OpenMP versions for comparison, and would accept a PR that also adds Polyester/LV versions for comparison as well.
Hopefully things will get better, but I do remain a bit skeptical.
Oh that'd be cool!
Brenhin Keller said:
Polyester threads are better anyways :grinning:
IMO they're not better, but they are very useful.
If I was to have a language with only one threading system, I wouldn't choose Polyester's set of tradeoffs. But as a choice of one way to thread code, it's fantastic
Oh not being too serious or anything -- and "better" is super subjective. That said, while it's great that there is multithreading in base, and the async / spawn approach is cool and all, for me personally the cases where multithreading is super useful are kind of squeezed between the efficiency you can get from a single thread with SIMD on one side, and multiprocessing on the other side. For the cases in-between where I want more than one thread, but don't want to reach all the way over to message passing and such, low overhead is kinda the key priority personally.
I have never worked in/with HPC, but I've been told people who do really want composable task-based systems (i.e., base Julia, not Polyester).
Dynamic scheduling will also become increasingly important as architectures becoming increasingly heterogenous.
In principle, the idea of a few big cores optimized for single threaded performance + a bunch of little cores optimized for performance/die area makes a lot of sense for maximizing a CPU's performance.
Many programs just aren't going to scale beyond a few cores, so have a few big cores to make them fast. For programs that do scale well with more cores, throw as many as you can at them.
Intel is going this route, with their next generation of CPUs being rumored to have 16 little cores.
I'm not sure about Apple, which tends to have much fewer little cores aimed at background processes rather than doing work?
I watched a presentation by Tenstorrent that advertised a similar concept (fast big core + tons of small ones) for ML work.
So maybe breadth first or work stealing work well here; i.e. we want something dynamic to best be able to take advantage of heterogenous compute, where it would be difficult to know how to actually divide work up a priori.
There's a decent chance Julia moves to work stealing in the future. With work stealing, we can't be better than the competition, but -- given it is what the competition uses -- at least we'll actually be able to match it.
Kiran maintains that depth first is the better algorithm in theory, but there's a lot of unsolved issues for a good implementation, while work stealing is much more of a known entity.
Julia probably wants heterogenous architectures.
Compiling sucks on the A64FX, which is basically just a bunch of small cores (that have big vectors). Bolting a big core or two on there to at least run the JIT would make it much nicer for Julia
Yeah, that does make sense. I'm probably an outlier; real HPC folks have been all about "hierarchical parallelism" for a while now
Man, I've been doing benchmarks on a64fx, it's sooooo slow
Haha, oh no
With parallel precompilation in v1.7, I have to go through Julia-1.0-level precompilation times. It is very common to stare the screen for 5-10 minutes for precompilation to complete
A blast from the past :laughing:
What's the clock speed on those again?
2.0-2.2 GHz
Oh huh, so slow but not insanely slow.. so then it's lack of cache and low instructions-per-cycle and such then that limits precompilation?
I'm not sure about why it's so slow, maybe Chris has some ideas. In general Julia's compilation is very slow on some aarch64 boards with low memory, but I don't think this is the case for this CPU. In any case the experience is pretty bad.
In the last weeks I'm doing some MPI performance to measure latency when using MPI.jl and Julia overhead compared to C. Some results are good, others less so, but I want to run more benchmarks in the next weeks.
chriselrod said:
but I do remain a bit skeptical.
skeptical about what specifically?
Mosè Giordano said:
chriselrod said:
but I do remain a bit skeptical.
skeptical about what specifically?
Skeptical may not have been the right word. I guess I could say "skeptical that the approach will be able enable small multithreaded BLAS that is competitive with MKL", but that's also not really the use case they're targeting, and other than looking good on benchmarks, I have to wonder how valuable that really is.
The A64FX really doesn't look that bad on paper: https://www.stonybrook.edu/commcms/ookami/support/_docs/A64FX_Microarchitecture_Manual_en_1.3.pdf
E.g., it has 128 floating point registers, vs 168 in Intel Skylake-X.
"Commit stack entry" of 128, which if I'm reading correctly, should be compared with Skylake's ROB of 224.
Both can decode 4 instructions/cycle.
I.e., it looks slower than Skylake on paper, but compare that with chips like the A53 and A55, which are only dual-issue (half the A64FX's 4, which basically matches Skylake).
I'm just catching up here. Thank you for the great tips.
Last updated: Dec 28 2024 at 04:38 UTC