I've run across an interesting performance problem I can't quite get my head around and I'm wondering if anyone has advice on diagnosing it.
Consider the following 4 functions that all do the same thing:
function f1!(src::Vector{T}, dst::Vector{T}) where {T}
    length(src) == length(dst) || error()
    @inbounds @simd for i ∈ 1:2:length(src)
        src_i = src[i]
        dst[i] = src_i
    end
    dst
end
function f2!(src::Vector{T}, dst::Vector{T}) where {T}
    @assert isbitstype(T)
    length(src) == length(dst) || error()
    @inbounds @simd for i ∈ 1:2:length(src)
        src_i = unsafe_load(pointer(src), i)
        dst[i] = src_i
    end
    dst
end
function f3!(src::Vector{T}, dst::Vector{T}) where {T}
    @assert isbitstype(T)
    length(src) == length(dst) || error()
    @inbounds @simd for i ∈ 1:2:length(src)
        src_i = src[i]
        unsafe_store!(pointer(dst), src_i, i)
    end
    dst
end
function f4!(src::Vector{T}, dst::Vector{T}) where {T}
    @assert isbitstype(T)
    length(src) == length(dst) || error()
    @simd for i ∈ 1:2:length(src)
        src_i = unsafe_load(pointer(src), i)
        unsafe_store!(pointer(dst), src_i, i)
    end
    dst
end
i.e. they all copy every other element of src into dst.
f1! uses regular getindex / setindex!, f4! uses unsafe_load and unsafe_store! on the raw pointers, and f2! and f3! each use a mixture of the two (unsafe_load with setindex!, and getindex with unsafe_store!, respectively). When I benchmark them, I find that the mixed functions f2! and f3! are significantly faster than the unmixed functions f1! and f4!. Any idea what could be causing this? My primary interest is in making f4! as fast as f2!/f3!.
Here's an example benchmark:
julia> let
           src = rand(800)
           dst = rand(800)
           print("f1!: "); @btime f1!($src, $dst)
           print("f2!: "); @btime f2!($src, $dst)
           print("f3!: "); @btime f3!($src, $dst)
           print("f4!: "); @btime f4!($src, $dst)
           nothing
       end;
f1!: 257.917 ns (0 allocations: 0 bytes)
f2!: 81.792 ns (0 allocations: 0 bytes)
f3!: 79.757 ns (0 allocations: 0 bytes)
f4!: 260.740 ns (0 allocations: 0 bytes)
for f4!: perhaps assumed pointer aliasing/provenance prevents vectorization?
What confuses me though is that it matches the perf of f1!.
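If aliasing assumptions are what's hurting f4!, one thing worth trying (just a sketch; I haven't verified it changes the codegen here, and f4_aliasscope! is a hypothetical name) is Base.Experimental.@aliasscope with Base.Experimental.Const, which promises the compiler that the wrapped array is not mutated through any alias inside the scope:

```julia
using Base.Experimental: @aliasscope, Const

# Hypothetical variant: same odd-index copy, but `src` is wrapped in
# Const so alias analysis may assume it is read-only within the scope.
function f4_aliasscope!(src::Vector{T}, dst::Vector{T}) where {T}
    length(src) == length(dst) || error()
    @aliasscope begin
        csrc = Const(src)  # read-only wrapper for alias analysis
        @inbounds @simd for i ∈ 1:2:length(src)
            dst[i] = csrc[i]
        end
    end
    dst
end
```

Whether this actually recovers the masked-load codegen would need checking with @code_llvm.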
Here's perhaps another piece of the weirdness puzzle:
This is slow:
julia> let
           src = rand(2, 400)
           dst = rand(2, 400)
           @btime $dst[1, :] .= @view $src[1, :]
       end;
255.112 ns (0 allocations: 0 bytes)
but this is fast:
julia> let
           src = rand(ComplexF64, 400)
           dst = rand(ComplexF64, 400)
           src_r = reinterpret(reshape, Float64, src)
           dst_r = reinterpret(reshape, Float64, dst)
           @btime $dst_r[1, :] .= @view $src_r[1, :]
       end;
88.521 ns (0 allocations: 0 bytes)
The other thing I was thinking is that the performance differential is smaller than I'd expect for SIMD versus no SIMD, but your comment makes me wonder if what's happening is that only the loads or only the stores are being vectorized in one case, whereas in the mixed cases perhaps it's tricking it into vectorizing both the loads and stores?
If I disable SIMD completely using Base.donotdelete, I see that the loops take ~480 ns in this benchmark.
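For reference, the no-SIMD baseline could look something like this (a sketch of what I assume was measured; f1_nosimd! is a hypothetical name, and the exact placement of Base.donotdelete may differ). The idea is that donotdelete introduces an opaque use of the loaded value that the optimizer cannot remove, which blocks vectorization of the loop:

```julia
# Hypothetical no-SIMD baseline: Base.donotdelete acts as an
# optimization barrier, preventing the loop from vectorizing.
function f1_nosimd!(src::Vector{T}, dst::Vector{T}) where {T}
    length(src) == length(dst) || error()
    @inbounds for i ∈ 1:2:length(src)
        src_i = src[i]
        Base.donotdelete(src_i)  # opaque use: defeats SIMD
        dst[i] = src_i
    end
    dst
end
```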
I feel like I've screwed up somewhere, but with a small change it gets faster:
julia> function f5!(src::Vector{T}, dst::Vector{T}) where {T}
           for i in eachindex(src, dst)
               if iseven(i)
                   continue
               end
               dst[i] = src[i]
           end
           dst
       end
f5! (generic function with 1 method)
julia> let N = 800
           src = rand(N)
           dst = zeros(N)
           print("f1!: "); @btime f1!($src, $dst)
           print("f2!: "); @btime f2!($src, $dst)
           print("f3!: "); @btime f3!($src, $dst)
           print("f4!: "); @btime f4!($src, $dst)
           print("f5!: "); @btime f5!($src, $dst)
           a = copy(f5!(src, dst))
           dst = zeros(N)
           b = copy(f1!(src, dst))
           a == b
       end
f1!: 190.867 ns (0 allocations: 0 bytes)
f2!: 74.897 ns (0 allocations: 0 bytes)
f3!: 75.000 ns (0 allocations: 0 bytes)
f4!: 190.867 ns (0 allocations: 0 bytes)
f5!: 45.399 ns (0 allocations: 0 bytes)
true
Bizarre. This does lend some credence to the idea that it's down to different ways that vectorization choices can be made.
Unfortunately I can't really use that style in my actual application (trying to improve copyto! performance of FieldViews.jl)
Yeah, many seemingly small things cause missed optimizations in Julia. But this one feels especially bad, because the code is super straightforward.
From what I gather (pun unintended), f5! uses @llvm.masked.load.v4f64.p0 while f1! (and I assume the rest) uses @llvm.masked.gather.v4f64.v4p0, so the use of gathers here seems to be the problem.
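For anyone wanting to check this on their own machine, one way (a sketch) is to capture the code_llvm output as a string and search it for the relevant intrinsics:

```julia
using InteractiveUtils  # provides code_llvm (auto-loaded in the REPL)

# Assumes the f5! definition from above is in scope.
# Capture the optimized LLVM IR and check which memory intrinsics
# the vectorizer chose.
io = IOBuffer()
code_llvm(io, f5!, Tuple{Vector{Float64},Vector{Float64}}; debuginfo=:none)
ir = String(take!(io))

contains(ir, "vector.body")        # did the loop vectorize at all?
contains(ir, "llvm.masked.load")   # contiguous masked loads (fast path)
contains(ir, "llvm.masked.gather") # strided gathers (slow path)
```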
maybe you should open an issue? similar performance bugs have surfaced before involving gathers
https://github.com/JuliaLang/julia/issues/60147
I think it makes sense that f5! is the best: constructing and iterating over 1:2:length(src) is probably pretty complicated at the LLVM IR level, and so gets optimized worse.
That's really weird, I get totally different timings:
f1!: 81.062 ns (0 allocations: 0 bytes)
f2!: 82.257 ns (0 allocations: 0 bytes)
f3!: 81.907 ns (0 allocations: 0 bytes)
f4!: 80.628 ns (0 allocations: 0 bytes)
f5!: 245.792 ns (0 allocations: 0 bytes)
And on the benchmark from the issue you opened, the timings are reversed! :
copyto_odd1!: 80.659 ns (0 allocations: 0 bytes)
copyto_odd2!: 246.441 ns (0 allocations: 0 bytes)
fwiw a straightforward translation of copyto_odd1! to Rust also doesn't vectorize, and a C++ version needs g++ -O3 to vectorize somewhat like copyto_odd2!, so the compiler is actually already quite smart in copyto_odd2! (not that it can't be better)
@Rafael Fourquet what's the output of versioninfo on your machine?
wheeheee said:
Rafael Fourquet what's the output of versioninfo on your machine?
This is on all recent julia versions, from v1.10 onwards. versioninfo() on v1.12 gives
5/6> versioninfo()
Julia Version 1.12.1
Commit d05709ca652 (2025-11-17 21:06 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver3)
  GC: Built with stock GC
Threads: 16 default, 1 interactive, 16 GC (on 32 virtual cores)
Environment:
  [...]
Interesting, does the vector.body section from code_llvm also show the gather/scatter and masked load/stores or is it reversed for you?
I guess the optimization passes are a tad brittle
wheeheee said:
Interesting, does the vector.body section from code_llvm also show the gather/scatter and masked load/stores or is it reversed for you?
I don't see any gather/scatter, only some @llvm.masked.store and @llvm.masked.load for f5!.
for f1!, what’s in vector.body? Would be a very curious situation if it also uses gather/scatters and is still faster…
wheeheee said:
for f1!, what’s in vector.body? Would be a very curious situation if it also uses gather/scatters and is still faster…
There's no vector.body in the other f_i!s for me, only for f5! !
weird, I see vector.body in f1!. what's your code_llvm output for copyto_odd1! and copyto_odd2!?
OK, I see in f2! and f3! there's loop unrolling but no gather/scatter, so there's a long main loop right after the preheader, but no vector.body
Last updated: Nov 27 2025 at 04:44 UTC