I've run across an interesting performance problem I can't quite get my head around and I'm wondering if anyone has advice on diagnosing it.
Consider the following 4 functions that all do the same thing:
function f1!(src::Vector{T}, dst::Vector{T}) where {T}
    length(src) == length(dst) || error()
    @inbounds @simd for i ∈ 1:2:length(src)
        src_i = src[i]
        dst[i] = src_i
    end
    dst
end
function f2!(src::Vector{T}, dst::Vector{T}) where {T}
    @assert isbitstype(T)
    length(src) == length(dst) || error()
    @inbounds @simd for i ∈ 1:2:length(src)
        src_i = unsafe_load(pointer(src), i)
        dst[i] = src_i
    end
    dst
end
function f3!(src::Vector{T}, dst::Vector{T}) where {T}
    @assert isbitstype(T)
    length(src) == length(dst) || error()
    @inbounds @simd for i ∈ 1:2:length(src)
        src_i = src[i]
        unsafe_store!(pointer(dst), src_i, i)
    end
    dst
end
function f4!(src::Vector{T}, dst::Vector{T}) where {T}
    @assert isbitstype(T)
    length(src) == length(dst) || error()
    @simd for i ∈ 1:2:length(src)
        src_i = unsafe_load(pointer(src), i)
        unsafe_store!(pointer(dst), src_i, i)
    end
    dst
end
i.e. they all copy every other element of src into dst.
f1! uses regular getindex / setindex!, f4! uses unsafe_load and unsafe_store! on the raw pointers, and f2! and f3! each use a mixture of the two (unsafe_load with setindex!, and getindex with unsafe_store!, respectively). When I benchmark them, I find that the mixed functions f2! and f3! are significantly faster than the unmixed functions f1! and f4!. Any idea what could be causing this? My primary interest is in making f4! as fast as f2!/f3!.
Here's an example benchmark:
julia> let
           src = rand(800)
           dst = rand(800)
           print("f1!: "); @btime f1!($src, $dst)
           print("f2!: "); @btime f2!($src, $dst)
           print("f3!: "); @btime f3!($src, $dst)
           print("f4!: "); @btime f4!($src, $dst)
           nothing
       end;
f1!: 257.917 ns (0 allocations: 0 bytes)
f2!: 81.792 ns (0 allocations: 0 bytes)
f3!: 79.757 ns (0 allocations: 0 bytes)
f4!: 260.740 ns (0 allocations: 0 bytes)
for f4!: perhaps assumed pointer aliasing/provenance prevents vectorization?
What confuses me though is that it matches the perf of f1!.
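If aliasing assumptions are what's hurting f4!, one thing worth trying (just a sketch; I haven't verified it changes the codegen here, and f4_aliasscope! is a hypothetical name) is Base.Experimental.@aliasscope with Base.Experimental.Const, which promises the compiler that the wrapped array is not mutated through any alias inside the scope:

```julia
using Base.Experimental: @aliasscope, Const

# Hypothetical variant: same odd-index copy, but `src` is wrapped in
# Const so alias analysis may assume it is read-only within the scope.
function f4_aliasscope!(src::Vector{T}, dst::Vector{T}) where {T}
    length(src) == length(dst) || error()
    @aliasscope begin
        csrc = Const(src)  # read-only wrapper for alias analysis
        @inbounds @simd for i ∈ 1:2:length(src)
            dst[i] = csrc[i]
        end
    end
    dst
end
```

Whether this actually recovers the masked-load codegen would need checking with @code_llvm.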
Here's perhaps another piece of the weirdness puzzle:
This is slow:
julia> let
           src = rand(2, 400)
           dst = rand(2, 400)
           @btime $dst[1, :] .= @view $src[1, :]
       end;
255.112 ns (0 allocations: 0 bytes)
but this is fast:
julia> let
           src = rand(ComplexF64, 400)
           dst = rand(ComplexF64, 400)
           src_r = reinterpret(reshape, Float64, src)
           dst_r = reinterpret(reshape, Float64, dst)
           @btime $dst_r[1, :] .= @view $src_r[1, :]
       end;
88.521 ns (0 allocations: 0 bytes)
The other thing I was thinking is that the performance differential is smaller than I'd expect for SIMD versus no SIMD, but your comment makes me wonder if what's happening is that only the loads or only the stores are being vectorized in one case, whereas in the mixed cases perhaps it's tricking it into vectorizing both the loads and stores?
If I disable SIMD completely using Base.donotdelete, I see that the loops take ~480 ns in this benchmark.
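For reference, the no-SIMD baseline could look something like this (a sketch of what I assume was measured; f1_nosimd! is a hypothetical name, and the exact placement of Base.donotdelete may differ). The idea is that donotdelete introduces an opaque use of the loaded value that the optimizer cannot remove, which blocks vectorization of the loop:

```julia
# Hypothetical no-SIMD baseline: Base.donotdelete acts as an
# optimization barrier, preventing the loop from vectorizing.
function f1_nosimd!(src::Vector{T}, dst::Vector{T}) where {T}
    length(src) == length(dst) || error()
    @inbounds for i ∈ 1:2:length(src)
        src_i = src[i]
        Base.donotdelete(src_i)  # opaque use: defeats SIMD
        dst[i] = src_i
    end
    dst
end
```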
I feel like I've screwed up somewhere, but with a small change it gets faster:
julia> function f5!(src::Vector{T}, dst::Vector{T}) where {T}
           for i in eachindex(src, dst)
               if iseven(i)
                   continue
               end
               dst[i] = src[i]
           end
           dst
       end
f5! (generic function with 1 method)
julia> let N = 800
           src = rand(N)
           dst = zeros(N)
           print("f1!: "); @btime f1!($src, $dst)
           print("f2!: "); @btime f2!($src, $dst)
           print("f3!: "); @btime f3!($src, $dst)
           print("f4!: "); @btime f4!($src, $dst)
           print("f5!: "); @btime f5!($src, $dst)
           a = copy(f5!(src, dst))
           dst = zeros(N)
           b = copy(f1!(src, dst))
           a == b
       end
f1!: 190.867 ns (0 allocations: 0 bytes)
f2!: 74.897 ns (0 allocations: 0 bytes)
f3!: 75.000 ns (0 allocations: 0 bytes)
f4!: 190.867 ns (0 allocations: 0 bytes)
f5!: 45.399 ns (0 allocations: 0 bytes)
true
Bizarre. This does lend some credence to the idea that it's down to different ways that vectorization choices can be made.
Unfortunately I can't really use that style in my actual application (trying to improve copyto! performance of FieldViews.jl)
Yeah, many seemingly small things cause missed optimizations in Julia. But this one feels especially bad, because the code is super straightforward.
From what I gather (pun unintended), f5! uses @llvm.masked.load.v4f64.p0 while f1! (and I assume the rest) uses @llvm.masked.gather.v4f64.v4p0, so the use of gathers here seems to be the problem.
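For anyone wanting to check this on their own machine, one way (a sketch) is to capture the code_llvm output as a string and search it for the relevant intrinsics:

```julia
using InteractiveUtils  # provides code_llvm (auto-loaded in the REPL)

# Assumes the f5! definition from above is in scope.
# Capture the optimized LLVM IR and check which memory intrinsics
# the vectorizer chose.
io = IOBuffer()
code_llvm(io, f5!, Tuple{Vector{Float64},Vector{Float64}}; debuginfo=:none)
ir = String(take!(io))

contains(ir, "vector.body")        # did the loop vectorize at all?
contains(ir, "llvm.masked.load")   # contiguous masked loads (fast path)
contains(ir, "llvm.masked.gather") # strided gathers (slow path)
```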
maybe you should open an issue? similar performance bugs have surfaced before involving gathers
https://github.com/JuliaLang/julia/issues/60147
I think it makes sense that f5! is the best: constructing and iterating over 1:2:length(src) is probably pretty complicated at the LLVM IR level, and so gets optimized worse.
That's really weird, I get totally different timings:
f1!: 81.062 ns (0 allocations: 0 bytes)
f2!: 82.257 ns (0 allocations: 0 bytes)
f3!: 81.907 ns (0 allocations: 0 bytes)
f4!: 80.628 ns (0 allocations: 0 bytes)
f5!: 245.792 ns (0 allocations: 0 bytes)
And on the benchmark from the issue you opened, the timings are reversed! :
copyto_odd1!: 80.659 ns (0 allocations: 0 bytes)
copyto_odd2!: 246.441 ns (0 allocations: 0 bytes)
fwiw a straightforward translation of copyto_odd1! to Rust also doesn't vectorize, and a C++ version needs g++ -O3 to vectorize somewhat like copyto_odd2!, so the compiler is actually already quite smart in copyto_odd2! (not that it can't be better)
@Rafael Fourquet what's the output of versioninfo on your machine?
wheeheee said:
Rafael Fourquet what's the output of versioninfo on your machine?
This is on all recent julia versions, from v1.10 onwards. versioninfo() on v1.12 gives
5/6> versioninfo()
Julia Version 1.12.1
Commit d05709ca652 (2025-11-17 21:06 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver3)
  GC: Built with stock GC
Threads: 16 default, 1 interactive, 16 GC (on 32 virtual cores)
Environment:
  [...]
Interesting, does the vector.body section from code_llvm also show the gather/scatter and masked load/stores or is it reversed for you?
I guess the optimization passes are a tad brittle
wheeheee said:
Interesting, does the vector.body section from code_llvm also show the gather/scatter and masked load/stores or is it reversed for you?
I don't see any gather/scatter, only some @llvm.masked.store and @llvm.masked.load for f5!.
for f1!, what’s in vector.body? Would be a very curious situation if it also uses gather/scatters and is still faster…
wheeheee said:
for f1!, what’s in vector.body? Would be a very curious situation if it also uses gather/scatters and is still faster…
There's no vector.body in the other f_i!s for me, only for f5! !
weird, I see vector.body in f1!. what's your code_llvm output for copyto_odd1! and copyto_odd2!?
OK, I see in f2! and f3! there's loop unrolling but no gather/scatter, so there's a long main loop right after the preheader, but no vector.body
Last updated: Nov 27 2025 at 04:44 UTC