I have a function that's intended to take a Array{UInt32}
and shuffle the bytes of each element from AAA...BBB...CCC...DDD... to ABCDABCDABCD...:
function deshuffle(data, out)
byte_stride = length(out) ÷ 4
@inbounds for i in 1:byte_stride
out_i = 1 + (i - 1) * 4
out[out_i] = data[i]
out[out_i + 1] = data[i + byte_stride]
out[out_i + 2] = data[i + (2 * byte_stride)]
out[out_i + 3] = data[i + (3 * byte_stride)]
end
end
My problem is that if I pass in a reinterpreted/reshaped array as out
instead of a plain Array
, the performance drops:
julia> data = rand(UInt8, 512 * 128 * 4);
julia> out = similar(data);
julia> out2 = zeros(UInt32, 512, 128);
julia> @benchmark deshuffle($data, $out)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 20.130 μs … 51.287 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 20.459 μs ┊ GC (median): 0.00%
Time (mean ± σ): 20.552 μs ± 704.751 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█
▂▂▃██▅▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
20.1 μs Histogram: frequency by time 24.8 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark deshuffle($data, reinterpret(UInt8, vec($out2)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 30.474 μs … 59.981 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 30.798 μs ┊ GC (median): 0.00%
Time (mean ± σ): 31.096 μs ± 1.953 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▇██▆▃▁ ▂
███████▇▃▄▁▃▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▄▁▁▁▁▁▄▁▃▄▅▄▇▆▆▇▇▇▆▇█▇▇▇▇▇▆▆ █
30.5 μs Histogram: log(frequency) by time 37.2 μs <
Memory estimate: 112 bytes, allocs estimate: 3.
It still vectorizes but not quite as effectively, and I don't know how to tell the compiler that these two array types are functionally the same. @code_llvm
shows that the vectorization is quite a bit more complicated. One workaround is to unwrap the the reinterpreted array:
julia> out2_flat = reinterpret(UInt8, vec(out2));
julia> @benchmark deshuffle($data, unsafe_wrap(Array, pointer($out2_flat), size($out2_flat)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 20.101 μs … 51.699 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 20.541 μs ┊ GC (median): 0.00%
Time (mean ± σ): 20.636 μs ± 702.697 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▆█▆▂ ▁
▇█▇█████▇▆▅▄▃▁▁▁▁▁▁▁▃▁▁▃▃▃▃▁▁▄▃▁▃▁▁▁▁▃▃▃▁▁▁▃▄▃▄▅▆▆▆▆▆▅▆▅▆▆▅▅ █
20.1 μs Histogram: log(frequency) by time 24.9 μs <
Memory estimate: 64 bytes, allocs estimate: 2.
But that's quite an ugly hack... anyone have ideas on how to make the performance match that of Array
?
This is on 1.8.1 BTW:
julia> versioninfo()
Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 40 × Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
Threads: 1 on 40 virtual cores
This is likely due to the implicit arithmetic done on the view of the reinterpreted array. Every element access has the same type of strided offset computed on the fly for a view.
Hmmm, that's unfortunate...
Relevant post: https://discourse.julialang.org/t/reinterpretedarray-performance-even-worse-on-1-8/80102/26
Linked PR: https://github.com/JuliaLang/julia/pull/44186
Last updated: Nov 06 2024 at 04:40 UTC