Function not vectorizing properly for some array types · helpdesk (published)

I have a function that's intended to take a Array{UInt32} and shuffle the bytes of each element from AAA...BBB...CCC...DDD... to ABCDABCDABCD...:

function deshuffle(data, out)
    byte_stride = length(out) ÷ 4

    @inbounds for i in 1:byte_stride
        out_i = 1 + (i - 1) * 4

        out[out_i] = data[i]
        out[out_i + 1] = data[i + byte_stride]
        out[out_i + 2] = data[i + (2 * byte_stride)]
        out[out_i + 3] = data[i + (3 * byte_stride)]
    end
end

My problem is that if I pass in a reinterpreted/reshaped array as out instead of a plain Array, the performance drops:

julia> data = rand(UInt8, 512 * 128 * 4);

julia> out = similar(data);

julia> out2 = zeros(UInt32, 512, 128);

julia> @benchmark deshuffle($data, $out)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  20.130 μs …  51.287 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.459 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.552 μs ± 704.751 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▄█
  ▂▂▃██▅▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  20.1 μs         Histogram: frequency by time         24.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark deshuffle($data, reinterpret(UInt8, vec($out2)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  30.474 μs … 59.981 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.798 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.096 μs ±  1.953 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▇██▆▃▁                                                     ▂
  ███████▇▃▄▁▃▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▄▁▁▁▁▁▄▁▃▄▅▄▇▆▆▇▇▇▆▇█▇▇▇▇▇▆▆ █
  30.5 μs      Histogram: log(frequency) by time      37.2 μs <

 Memory estimate: 112 bytes, allocs estimate: 3.

It still vectorizes but not quite as effectively, and I don't know how to tell the compiler that these two array types are functionally the same. @code_llvm shows that the vectorization is quite a bit more complicated. One workaround is to unwrap the the reinterpreted array:

julia> out2_flat = reinterpret(UInt8, vec(out2));

julia> @benchmark deshuffle($data, unsafe_wrap(Array, pointer($out2_flat), size($out2_flat)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  20.101 μs …  51.699 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.541 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.636 μs ± 702.697 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂▆█▆▂                                                     ▁
  ▇█▇█████▇▆▅▄▃▁▁▁▁▁▁▁▃▁▁▃▃▃▃▁▁▄▃▁▃▁▁▁▁▃▃▃▁▁▁▃▄▃▄▅▆▆▆▆▆▅▆▅▆▆▅▅ █
  20.1 μs       Histogram: log(frequency) by time      24.9 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.

But that's quite an ugly hack... anyone have ideas on how to make the performance match that of Array?

James Wrigley (Sep 18 2022 at 19:38):

julia> versioninfo()
Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 40 × Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 1 on 40 virtual cores

Michael Fiano (Sep 18 2022 at 21:10):

This is likely due to the implicit arithmetic done on the view of the reinterpreted array. Every element access has the same type of strided offset computed on the fly for a view.

Stream: helpdesk (published)

Topic: Function not vectorizing properly for some array types

James Wrigley (Sep 18 2022 at 19:35):

James Wrigley (Sep 18 2022 at 19:38):

Michael Fiano (Sep 18 2022 at 21:10):

James Wrigley (Sep 19 2022 at 00:32):

James Wrigley (Sep 19 2022 at 01:01):