Stream: helpdesk (published)

Topic: multiple matrix multiplications in parallel


view this post on Zulip Maarten (Oct 16 2021 at 07:00):

Can I run multiple different matrix multiplications in parallel, where each multiplication uses several BLAS threads?

It seems that if I set 4 BLAS threads and spawn several tasks each doing a matrix multiplication, the multiplications actually run serially, with each one spread over those 4 threads.

view this post on Zulip Notification Bot (Oct 16 2021 at 07:08):

Maarten has marked this topic as resolved.

view this post on Zulip Notification Bot (Oct 16 2021 at 07:08):

Maarten has marked this topic as unresolved.

view this post on Zulip Mark Kittisopikul (Oct 16 2021 at 07:55):

I'm not sure if I understand your thread setup. What is the output of Threads.nthreads()?

view this post on Zulip Maarten (Oct 17 2021 at 08:36):

8 or so. I just want to verify that if I have 8 Julia threads and spawn multiple different matrix contractions (with BLAS.set_num_threads(4)), those multiplications are done in parallel. It looks like they are done serially, and every individual one is spread over those 4 BLAS threads.
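A quick way to check the setup being described here, as a minimal sketch (the -t 8 flag and the printed values are illustrative, not taken from this thread):

using LinearAlgebra

# start Julia with e.g. `julia -t 8`, then compare the two thread pools
Threads.nthreads()        # number of Julia threads, e.g. 8
BLAS.set_num_threads(4)   # threads OpenBLAS may use per multiplication
BLAS.get_num_threads()    # should now report 4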

view this post on Zulip Maarten (Oct 18 2021 at 08:05):

Perhaps to clarify: what can I do to make parallelworker outperform serialworker?

using LinearAlgebra, Base.Threads, BenchmarkTools

# multiply each pair in sequence and sum the results
function serialworker(As, Bs)
    sum(map(zip(As, Bs)) do (a, b)
        a * b
    end)
end

# spawn one task per pair, then sum the results afterwards
function parallelworker(As, Bs)
    temp = similar(As)
    @sync for i in 1:length(As)
        Threads.@spawn temp[i] = As[i] * Bs[i]
    end
    sum(temp)
end

let
    BLAS.set_num_threads(4)

    As = [rand(3000, 3000) for a in 1:10]
    Bs = [rand(3000, 3000) for b in 1:10]

    @btime serialworker($As, $Bs);
    @btime parallelworker($As, $Bs);
end

view this post on Zulip chriselrod (Oct 18 2021 at 13:06):

That code is going to oversubscribe threads, so it will be slower.

julia> function parallelworker(As,Bs)
           nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads();
           temp = similar(As, nt)
           @sync for i in 1:nt
               Threads.@spawn begin
                   Al = $As; Bl = $Bs; n = $nt
                   start = 1+((i-1)*length(Al)) ÷ n
                   stop =    ( i   *length(Al)) ÷ n
                   C = Al[start] * Bl[start]
                   for j in start+1:stop
                       mul!(C, Al[j], Bl[j], 1.0, 1.0)
                   end
                   $temp[i] = C
               end
           end
           sum(temp);
       end
parallelworker (generic function with 1 method)

julia> @time parallelworker(As, Bs);
  0.597088 seconds (87.75 k allocations: 485.472 MiB, 6.26% compilation time)

julia> @time parallelworker(As, Bs);
  0.542520 seconds (53 allocations: 480.655 MiB)

julia> @time serialworker(As, Bs);
  1.468851 seconds (39 allocations: 1.274 GiB, 0.96% gc time)

Note that I'm only using nt = (Threads.nthreads() ÷ 2) ÷ BLAS.get_num_threads().
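With the thread counts reported just below (36 Julia threads, 4 BLAS threads), that works out to nt = (36 ÷ 2) ÷ 4 = 4 tasks, each driving its own 4-thread GEMM. A common alternative to this chunking, shown here only as a hedged sketch (spawnedworker is a made-up name, not code from this thread), is to give BLAS a single thread and let Julia tasks supply all the parallelism:

using LinearAlgebra, Base.Threads

# Sketch: restrict BLAS to one thread and spawn one Julia task per product,
# assuming Julia was started with enough threads (e.g. `julia -t 8`).
function spawnedworker(As, Bs)
    BLAS.set_num_threads(1)   # note: this mutates a global setting
    tasks = map(As, Bs) do A, B
        Threads.@spawn A * B
    end
    sum(fetch.(tasks))
end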

view this post on Zulip chriselrod (Oct 18 2021 at 13:07):

julia> As = [rand(3000,3000) for a in 1:100];

julia> Bs = [rand(3000,3000) for b in 1:100];

julia> @time parallelworker(As, Bs);
  3.940653 seconds (53 allocations: 480.655 MiB)

julia> @time serialworker(As, Bs);
 15.268181 seconds (399 allocations: 13.344 GiB, 3.55% gc time)

julia> Threads.nthreads(), BLAS.get_num_threads()
(36, 4)

view this post on Zulip Maarten (Oct 18 2021 at 13:29):

I cannot replicate this

julia> using LinearAlgebra,Base.Threads,BenchmarkTools;

julia> function parallelworker(As,Bs)
          nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads();
          temp = similar(As, nt)
          @sync for i in 1:nt
              Threads.@spawn begin
                  Al = $As; Bl = $Bs; n = $nt
                  start = 1+((i-1)*length(Al)) ÷ n
                  stop =    ( i   *length(Al)) ÷ n
                  C = Al[start] * Bl[start]
                  for j in start+1:stop
                      mul!(C, Al[j], Bl[j], 1.0, 1.0)
                  end
                  $temp[i] = C
              end
          end
          sum(temp);
       end
parallelworker (generic function with 1 method)

julia> function serialworker(As,Bs)
          nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads();
          temp = similar(As, nt)
          for i in 1:nt

              Al = As; Bl = Bs; n = nt
              start = 1+((i-1)*length(Al)) ÷ n
              stop =    ( i   *length(Al)) ÷ n
              C = Al[start] * Bl[start]
              for j in start+1:stop
                  mul!(C, Al[j], Bl[j], 1.0, 1.0)
              end
              temp[i] = C
          end
          sum(temp);
       end
serialworker (generic function with 1 method)

julia> As = [rand(1000,1000) for a in 1:50];

julia> Bs = [rand(1000,1000) for b in 1:50];

julia> @time parallelworker(As,Bs);
  1.790027 seconds (4.18 M allocations: 215.780 MiB, 61.53% compilation time)

julia> @time serialworker(As,Bs);
  0.806896 seconds (29.02 k allocations: 9.217 MiB, 7.13% compilation time)

julia> @time parallelworker(As,Bs);
  0.700317 seconds (22 allocations: 7.631 MiB)

julia> @time serialworker(As,Bs);
  0.700381 seconds (3 allocations: 7.630 MiB)

julia> Threads.nthreads(), BLAS.get_num_threads()
(16, 8)

julia> BLAS.set_num_threads(4);

julia> @time parallelworker(As,Bs);
  0.721815 seconds (32 allocations: 22.890 MiB)

julia> @time parallelworker(As,Bs);
  0.691697 seconds (32 allocations: 22.890 MiB)

julia> @time serialworker(As,Bs);
  0.654302 seconds (7 allocations: 22.888 MiB)

julia> @time serialworker(As,Bs);
  0.663570 seconds (7 allocations: 22.888 MiB)

view this post on Zulip Maarten (Oct 18 2021 at 13:31):

I also don't understand how it would work. I have difficulty finding documentation on how the BLAS and Julia threads play together, but it is my understanding that they basically don't. Currently I'm getting really good results for my particular problem by using Strided.jl with BLAS.set_num_threads(1), where Strided then divides and conquers the matrix multiplications and spawns appropriate mul! jobs.
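As a rough picture of what that divide-and-conquer pattern looks like (a hand-rolled sketch, not Strided.jl's actual implementation; dcmul! and minsize are made-up names):

using LinearAlgebra, Base.Threads

# With BLAS.set_num_threads(1), recursively split the output into column
# blocks and spawn a single-threaded mul! per block; the tasks write to
# disjoint views of C, so no synchronization beyond wait() is needed.
function dcmul!(C, A, B; minsize = 512)
    n = size(C, 2)
    if n <= minsize
        mul!(C, A, B)
    else
        h = n ÷ 2
        t = Threads.@spawn dcmul!(view(C, :, 1:h), A, view(B, :, 1:h); minsize)
        dcmul!(view(C, :, h+1:n), A, view(B, :, h+1:n); minsize)
        wait(t)
    end
    return C
end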

view this post on Zulip Mason Protter (Oct 18 2021 at 19:17):

The entire point behind the design of Julia's multithreading system was to avoid this flaw in the way BLAS and other systems use threads

view this post on Zulip Mason Protter (Oct 18 2021 at 19:17):

The whole problem with most multithreaded programs is that they suffer catastrophic performance losses if you nest them

view this post on Zulip Mason Protter (Oct 18 2021 at 19:18):

That was part of why I was playing around with Gaius.jl: it won't suffer this slowdown

view this post on Zulip Mason Protter (Oct 18 2021 at 19:19):

@Takafumi Arakaki (tkf) has a branch of it, requiring his Tapir branch of Julia, that is way faster

view this post on Zulip Maarten (Oct 18 2021 at 19:39):

I looked at Gaius, but it seems that the successor is Octavian, which in turn builds on CheapThreads (now Polyester), which in turn is incompatible with Threads.@spawn :'(

view this post on Zulip Mason Protter (Oct 18 2021 at 19:41):

If Tapir progresses well, then Gaius.jl might live on

view this post on Zulip Mason Protter (Oct 18 2021 at 19:42):

Giving Gaius an LRU cache, Tapir, and some more tuning would probably put it in the same ballpark as Octavian

view this post on Zulip Maarten (Oct 18 2021 at 19:43):

Out of curiosity I will benchmark with Gaius as is. Is this Tapir a different kind of scheduler? Is it a serious possible successor to the current scheduler?

view this post on Zulip Mason Protter (Oct 18 2021 at 19:46):

It's basically a way to make stronger guarantees to the compiler about what can possibly happen in multithreaded code https://github.com/JuliaLang/julia/pull/39773

view this post on Zulip Mason Protter (Oct 18 2021 at 19:57):

I'll bump the compat bounds on Gaius.jl so that it doesn't downgrade your other packages. It'll take a little while to appear though

view this post on Zulip Maarten (Oct 18 2021 at 19:57):

it looks very cool!

view this post on Zulip Mason Protter (Oct 18 2021 at 20:06):

https://github.com/JuliaRegistries/General/pull/46981

view this post on Zulip Maarten (Oct 18 2021 at 20:08):

Thanks! It's crazy that it performs on par with OpenBLAS, which I assume is rather optimized for specific PC architectures.

view this post on Zulip chriselrod (Oct 18 2021 at 20:47):

Maarten said:

I looked at Gaius, but it seems that the successor is Octavian, which in turn builds on CheapThreads (now Polyester), which in turn is incompatible with Threads.@spawn :'(

It is compatible with Polyester.@batch, though.

The Tapir work definitely looks exciting, though.
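For a concrete picture of what "compatible with Polyester.@batch" might look like in use, a hedged sketch (batchedworker! and the preallocated Cs are assumptions, not code from this thread):

using LinearAlgebra, Polyester, Octavian

# Sketch: loop over the pairs with Polyester's @batch threading and call
# Octavian's matmul! into preallocated output matrices.
function batchedworker!(Cs, As, Bs)
    @batch for i in eachindex(Cs, As, Bs)
        Octavian.matmul!(Cs[i], As[i], Bs[i])
    end
    return Cs
end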

view this post on Zulip Takafumi Arakaki (tkf) (Oct 18 2021 at 23:16):

I've just been playing with Gaius.jl for a couple of days. I was trying to analyze the task DAG structure using https://github.com/tkf/TaskDAGAnalyzers.jl
[image.png: task DAG visualization from TaskDAGAnalyzers.jl]

view this post on Zulip Takafumi Arakaki (tkf) (Oct 18 2021 at 23:17):

If you look at a typical DAG like the one above, its width expands and shrinks a few times during execution. This is because GEMM is a complex mixture of parallel tasks and sequential dependencies. So it's helpful if the scheduler lets you run other GEMMs while your GEMM is in the middle of a sequential portion. But, at the same time, the scheduler shouldn't destroy the memory locality of each library function invocation; i.e., the second GEMM should "back off" once the first GEMM starts executing the parallel portion of the DAG.

