can I run multiple different matrix multiplications in parallel, where I use multiple blas threads per matrix multiplication?
It seems that if I run 4 blas threads, and spawn different threads doing matrix multiplications, that actually the multiplications are done serially, over those 4 threads.
Maarten has marked this topic as resolved.
Maarten has marked this topic as unresolved.
I'm not sure if I understand your thread setup. What is the output of Threads.nthreads()
?
8 or so. I just want to verify that if I have 8 julia threads, where I spawn multiple different matrix contractions (with blas.set_num_threads(4)), that those mulitplications are done in parallel. It looks like they are done serially, and every individual one is done over those 4 blas threads.
perhaps to clarify. What can I do to make parallelworker outperform serialworker:
using LinearAlgebra,Base.Threads,BenchmarkTools;
function serialworker(As,Bs)
sum(map(zip(As,Bs)) do (a,b)
a*b
end)
end
function parallelworker(As,Bs)
temp = similar(As);
@sync for i in 1:length(As)
@Threads.spawn temp[i] = As[i]*Bs[i]
end
sum(temp);
end
let
BLAS.set_num_threads(4);
As = [rand(3000,3000) for a in 1:10];
Bs = [rand(3000,3000) for b in 1:10];
@btime serialworker($As,$Bs);
@btime parallelworker($As,$Bs);
end
That code is going to oversubscribe threads, so it will be slower.
julia> function parallelworker(As,Bs)
nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads();
temp = similar(As, nt)
@sync for i in 1:nt
Threads.@spawn begin
Al = $As; Bl = $Bs; n = $nt
start = 1+((i-1)*length(Al)) ÷ n
stop = ( i *length(Al)) ÷ n
C = Al[start] * Bl[start]
for j in start+1:stop
mul!(C, Al[j], Bl[j], 1.0, 1.0)
end
$temp[i] = C
end
end
sum(temp);
end
parallelworker (generic function with 1 method)
julia> @time parallelworker(As, Bs);
0.597088 seconds (87.75 k allocations: 485.472 MiB, 6.26% compilation time)
julia> @time parallelworker(As, Bs);
0.542520 seconds (53 allocations: 480.655 MiB)
julia> @time serialworker(As, Bs);
1.468851 seconds (39 allocations: 1.274 GiB, 0.96% gc time)
Note that I'm only using nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads()
.
julia> As = [rand(3000,3000) for a in 1:100];
julia> Bs = [rand(3000,3000) for b in 1:100];
julia> @time parallelworker(As, Bs);
3.940653 seconds (53 allocations: 480.655 MiB)
julia> @time serialworker(As, Bs);
15.268181 seconds (399 allocations: 13.344 GiB, 3.55% gc time)
julia> Threads.nthreads(), BLAS.get_num_threads()
(36, 4)
I cannot replicate this
julia> using LinearAlgebra,Base.Threads,BenchmarkTools;
julia> function parallelworker(As,Bs)
nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads();
temp = similar(As, nt)
@sync for i in 1:nt
Threads.@spawn begin
Al = $As; Bl = $Bs; n = $nt
start = 1+((i-1)*length(Al)) ÷ n
stop = ( i *length(Al)) ÷ n
C = Al[start] * Bl[start]
for j in start+1:stop
mul!(C, Al[j], Bl[j], 1.0, 1.0)
end
$temp[i] = C
end
end
sum(temp);
end
parallelworker (generic function with 1 method)
julia> function serialworker(As,Bs)
nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads();
temp = similar(As, nt)
for i in 1:nt
Al = As; Bl = Bs; n = nt
start = 1+((i-1)*length(Al)) ÷ n
stop = ( i *length(Al)) ÷ n
C = Al[start] * Bl[start]
for j in start+1:stop
mul!(C, Al[j], Bl[j], 1.0, 1.0)
end
temp[i] = C
end
sum(temp);
end
serialworker (generic function with 1 method)
julia> As = [rand(1000,1000) for a in 1:50];
julia> Bs = [rand(1000,1000) for b in 1:50];
julia> @time parallelworker(As,Bs);
1.790027 seconds (4.18 M allocations: 215.780 MiB, 61.53% compilation time)
julia> @time serialworker(As,Bs);
0.806896 seconds (29.02 k allocations: 9.217 MiB, 7.13% compilation time)
julia> @time parallelworker(As,Bs);
0.700317 seconds (22 allocations: 7.631 MiB)
julia> @time serialworker(As,Bs);
0.700381 seconds (3 allocations: 7.630 MiB)
julia> Threads.nthreads(), BLAS.get_num_threads()
(16, 8)
julia> BLAS.set_num_threads(4);
julia> @time parallelworker(As,Bs);
0.721815 seconds (32 allocations: 22.890 MiB)
julia> @time parallelworker(As,Bs);
0.691697 seconds (32 allocations: 22.890 MiB)
julia> @time serialworker(As,Bs);
0.654302 seconds (7 allocations: 22.888 MiB)
julia> @time serialworker(As,Bs);
0.663570 seconds (7 allocations: 22.888 MiB)
I also don't understand how it would work. I have difficulty finding documentation on how the blas and julia threads play together, but it is my understanding that they kind of don't. Currently I'm getting really good results for my particular problem by using Strided and blas_num_threads(1), where strided then divide&conquer's the matrix multiplications, and spawns appropriate mul! jobs.
The entire point behind the design of Julia's multithreading system was to avoid this flaw in the way BLAS and other systems use threads
The whole problem with most multithreaded programs is that they suffer catastrophic performance losses if you nest them
That was part of why I was playing around with Gaius.jl because it won't suffer this slowdown
@Takafumi Arakaki (tkf) has a branch of it that requires his TAPIR branch of Julia that is way faster
I looked at gaius, but it seems that the successor is octavian, which in turn builds on cheapthreads (polyester), which is then incompatible with Threads.@spawn :'(
If TAPIR progresses well, then Gaius.jl might live on
giving Gaius a LRU cache, Tapir, and some more tuning would probably put it in the same ballpark as Octavian
Out of curiousity I will benchmark with gaius as is. This tapir is a different kind of scheduler? Is it a serious possible successor of the current scheduler?
It's basically a way to make stronger guarantees to the compiler about what can possibly happen in multithreaded code https://github.com/JuliaLang/julia/pull/39773
I'll bump the compat bounds on Gaius.jl so that it doesn't downgrade your other packages. It'll take a little while to appear though
it looks very cool!
https://github.com/JuliaRegistries/General/pull/46981
thanks! it's crazy that it performs on par with openblas, which I assume is rather optimized for specific pc architectures
Maarten said:
I looked at gaius, but it seems that the successor is octavian, which in turn builds on cheapthreads (polyester), which is then incompatible with Threads.@spawn :'(
It is compatible with Polyester.@batch
, though.
The Tapir work definitely looks exciting, though.
I've just been playing with Gaius.jl a couple of days. I was trying to analyze the task DAG structure using https://github.com/tkf/TaskDAGAnalyzers.jl
image.png
If you look at a typical DAG like above, its width expands and shrinks a few times during execution. This is because GEMM is a complex mixture of parallel tasks and sequential dependencies. So, it's helpful if the scheduler helps you to run another GEMMs if your GEMM is in the middle of the sequential portion. But, at the same time, the scheduler shouldn't be destroying the memory locality of each library function invocation; i.e., the second GEMM should "back off" if the first GEMM starts executing the parallel portion in the DGA.
Last updated: Nov 06 2024 at 04:40 UTC