multiple matrix multiplications in parallal · helpdesk (published)

can I run multiple different matrix multiplications in parallel, where I use multiple blas threads per matrix multiplication?

It seems that if I run 4 blas threads, and spawn different threads doing matrix multiplications, that actually the multiplications are done serially, over those 4 threads.

Notification Bot (Oct 16 2021 at 07:08):

Mark Kittisopikul (Oct 16 2021 at 07:55):

I'm not sure if I understand your thread setup. What is the output of Threads.nthreads()?

Maarten (Oct 17 2021 at 08:36):

8 or so. I just want to verify that if I have 8 julia threads, where I spawn multiple different matrix contractions (with blas.set_num_threads(4)), that those mulitplications are done in parallel. It looks like they are done serially, and every individual one is done over those 4 blas threads.

Maarten (Oct 18 2021 at 08:05):

perhaps to clarify. What can I do to make parallelworker outperform serialworker:

using LinearAlgebra,Base.Threads,BenchmarkTools;

function serialworker(As,Bs)
    sum(map(zip(As,Bs)) do (a,b)
        a*b
    end)
end

function parallelworker(As,Bs)
    temp = similar(As);
    @sync for i in 1:length(As)
        @Threads.spawn temp[i] = As[i]*Bs[i]
    end
    sum(temp);
end

let
BLAS.set_num_threads(4);

As = [rand(3000,3000) for a in 1:10];
Bs = [rand(3000,3000) for b in 1:10];


@btime serialworker($As,$Bs);

@btime parallelworker($As,$Bs);

end

chriselrod (Oct 18 2021 at 13:06):

julia> function parallelworker(As,Bs)
           nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads();
           temp = similar(As, nt)
           @sync for i in 1:nt
               Threads.@spawn begin
                   Al = $As; Bl = $Bs; n = $nt
                   start = 1+((i-1)*length(Al)) ÷ n
                   stop =    ( i   *length(Al)) ÷ n
                   C = Al[start] * Bl[start]
                   for j in start+1:stop
                       mul!(C, Al[j], Bl[j], 1.0, 1.0)
                   end
                   $temp[i] = C
               end
           end
           sum(temp);
       end
parallelworker (generic function with 1 method)

julia> @time parallelworker(As, Bs);
  0.597088 seconds (87.75 k allocations: 485.472 MiB, 6.26% compilation time)

julia> @time parallelworker(As, Bs);
  0.542520 seconds (53 allocations: 480.655 MiB)

julia> @time serialworker(As, Bs);
  1.468851 seconds (39 allocations: 1.274 GiB, 0.96% gc time)

Note that I'm only using nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads().

chriselrod (Oct 18 2021 at 13:07):

julia> As = [rand(3000,3000) for a in 1:100];

julia> Bs = [rand(3000,3000) for b in 1:100];

julia> @time parallelworker(As, Bs);
  3.940653 seconds (53 allocations: 480.655 MiB)

julia> @time serialworker(As, Bs);
 15.268181 seconds (399 allocations: 13.344 GiB, 3.55% gc time)

julia> Threads.nthreads(), BLAS.get_num_threads()
(36, 4)

Maarten (Oct 18 2021 at 13:29):

julia> using LinearAlgebra,Base.Threads,BenchmarkTools;

julia> function parallelworker(As,Bs)
          nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads();
          temp = similar(As, nt)
          @sync for i in 1:nt
              Threads.@spawn begin
                  Al = $As; Bl = $Bs; n = $nt
                  start = 1+((i-1)*length(Al)) ÷ n
                  stop =    ( i   *length(Al)) ÷ n
                  C = Al[start] * Bl[start]
                  for j in start+1:stop
                      mul!(C, Al[j], Bl[j], 1.0, 1.0)
                  end
                  $temp[i] = C
              end
          end
          sum(temp);
       end
parallelworker (generic function with 1 method)

julia> function serialworker(As,Bs)
          nt = (Threads.nthreads()÷ 2) ÷ BLAS.get_num_threads();
          temp = similar(As, nt)
          for i in 1:nt

              Al = As; Bl = Bs; n = nt
              start = 1+((i-1)*length(Al)) ÷ n
              stop =    ( i   *length(Al)) ÷ n
              C = Al[start] * Bl[start]
              for j in start+1:stop
                  mul!(C, Al[j], Bl[j], 1.0, 1.0)
              end
              temp[i] = C
          end
          sum(temp);
       end
serialworker (generic function with 1 method)

julia> As = [rand(1000,1000) for a in 1:50];

julia> Bs = [rand(1000,1000) for b in 1:50];

julia> @time parallelworker(As,Bs);
  1.790027 seconds (4.18 M allocations: 215.780 MiB, 61.53% compilation time)

julia> @time serialworker(As,Bs);
  0.806896 seconds (29.02 k allocations: 9.217 MiB, 7.13% compilation time)

julia> @time parallelworker(As,Bs);
  0.700317 seconds (22 allocations: 7.631 MiB)

julia> @time serialworker(As,Bs);
  0.700381 seconds (3 allocations: 7.630 MiB)

julia> Threads.nthreads(), BLAS.get_num_threads()
(16, 8)

julia> BLAS.set_num_threads(4);

julia> @time parallelworker(As,Bs);
  0.721815 seconds (32 allocations: 22.890 MiB)

julia> @time parallelworker(As,Bs);
  0.691697 seconds (32 allocations: 22.890 MiB)

julia> @time serialworker(As,Bs);
  0.654302 seconds (7 allocations: 22.888 MiB)

julia> @time serialworker(As,Bs);
  0.663570 seconds (7 allocations: 22.888 MiB)

Maarten (Oct 18 2021 at 13:31):

I also don't understand how it would work. I have difficulty finding documentation on how the blas and julia threads play together, but it is my understanding that they kind of don't. Currently I'm getting really good results for my particular problem by using Strided and blas_num_threads(1), where strided then divide&conquer's the matrix multiplications, and spawns appropriate mul! jobs.

Mason Protter (Oct 18 2021 at 19:17):

The entire point behind the design of Julia's multithreading system was to avoid this flaw in the way BLAS and other systems use threads

Mason Protter (Oct 18 2021 at 19:17):

The whole problem with most multithreaded programs is that they suffer catastrophic performance losses if you nest them

Mason Protter (Oct 18 2021 at 19:18):

That was part of why I was playing around with Gaius.jl because it won't suffer this slowdown

Mason Protter (Oct 18 2021 at 19:19):

@Takafumi Arakaki (tkf) has a branch of it that requires his TAPIR branch of Julia that is way faster

Maarten (Oct 18 2021 at 19:39):

I looked at gaius, but it seems that the successor is octavian, which in turn builds on cheapthreads (polyester), which is then incompatible with Threads.@spawn :'(

Mason Protter (Oct 18 2021 at 19:41):

Mason Protter (Oct 18 2021 at 19:42):

giving Gaius a LRU cache, Tapir, and some more tuning would probably put it in the same ballpark as Octavian

Maarten (Oct 18 2021 at 19:43):

Out of curiousity I will benchmark with gaius as is. This tapir is a different kind of scheduler? Is it a serious possible successor of the current scheduler?

Mason Protter (Oct 18 2021 at 19:46):

Mason Protter (Oct 18 2021 at 19:57):

I'll bump the compat bounds on Gaius.jl so that it doesn't downgrade your other packages. It'll take a little while to appear though

Maarten (Oct 18 2021 at 19:57):

Mason Protter (Oct 18 2021 at 20:06):

Maarten (Oct 18 2021 at 20:08):

thanks! it's crazy that it performs on par with openblas, which I assume is rather optimized for specific pc architectures

chriselrod (Oct 18 2021 at 20:47):

Takafumi Arakaki (tkf) (Oct 18 2021 at 23:16):

Takafumi Arakaki (tkf) (Oct 18 2021 at 23:17):

If you look at a typical DAG like above, its width expands and shrinks a few times during execution. This is because GEMM is a complex mixture of parallel tasks and sequential dependencies. So, it's helpful if the scheduler helps you to run another GEMMs if your GEMM is in the middle of the sequential portion. But, at the same time, the scheduler shouldn't be destroying the memory locality of each library function invocation; i.e., the second GEMM should "back off" if the first GEMM starts executing the parallel portion in the DGA.

Stream: helpdesk (published)

Topic: multiple matrix multiplications in parallal

Maarten (Oct 16 2021 at 07:00):

Notification Bot (Oct 16 2021 at 07:08):

Notification Bot (Oct 16 2021 at 07:08):

Mark Kittisopikul (Oct 16 2021 at 07:55):

Maarten (Oct 17 2021 at 08:36):

Maarten (Oct 18 2021 at 08:05):

chriselrod (Oct 18 2021 at 13:06):

chriselrod (Oct 18 2021 at 13:07):

Maarten (Oct 18 2021 at 13:29):

Maarten (Oct 18 2021 at 13:31):

Mason Protter (Oct 18 2021 at 19:17):

Mason Protter (Oct 18 2021 at 19:17):

Mason Protter (Oct 18 2021 at 19:18):

Mason Protter (Oct 18 2021 at 19:19):

Maarten (Oct 18 2021 at 19:39):

Mason Protter (Oct 18 2021 at 19:41):

Mason Protter (Oct 18 2021 at 19:42):

Maarten (Oct 18 2021 at 19:43):

Mason Protter (Oct 18 2021 at 19:46):

Mason Protter (Oct 18 2021 at 19:57):

Maarten (Oct 18 2021 at 19:57):

Mason Protter (Oct 18 2021 at 20:06):

Maarten (Oct 18 2021 at 20:08):

chriselrod (Oct 18 2021 at 20:47):

Takafumi Arakaki (tkf) (Oct 18 2021 at 23:16):

Takafumi Arakaki (tkf) (Oct 18 2021 at 23:17):