Stream: helpdesk (published)

Topic: Performance mismatch copying every other element


view this post on Zulip Mason Protter (Nov 16 2025 at 11:05):

I've run across an interesting performance problem I can't quite get my head around and I'm wondering if anyone has advice on diagnosing it.

Consider the following 4 functions that all do the same thing:

function f1!(src::Vector{T}, dst::Vector{T}) where {T}
    length(src) == length(dst) || error()
    @inbounds @simd for i in 1:2:length(src)
        src_i = src[i]
        dst[i] = src_i
    end
    dst
end

function f2!(src::Vector{T}, dst::Vector{T}) where {T}
    @assert isbitstype(T)
    length(src) == length(dst) || error()
    @inbounds @simd for i in 1:2:length(src)
        src_i = unsafe_load(pointer(src), i)
        dst[i] = src_i
    end
    dst
end

function f3!(src::Vector{T}, dst::Vector{T}) where {T}
    @assert isbitstype(T)
    length(src) == length(dst) || error()
    @inbounds @simd for i in 1:2:length(src)
        src_i = src[i]
        unsafe_store!(pointer(dst), src_i, i)
    end
    dst
end

function f4!(src::Vector{T}, dst::Vector{T}) where {T}
    @assert isbitstype(T)
    length(src) == length(dst) || error()
    @simd for i in 1:2:length(src)
        src_i = unsafe_load(pointer(src), i)
        unsafe_store!(pointer(dst), src_i, i)
    end
    dst
end

i.e. they all copy every other element of src into dst.

When I benchmark them, I find that the mixed functions f2! and f3! are significantly faster than the unmixed functions f1! and f4!. Any idea what could be causing this? My primary interest is in making f4! as fast as f2!/f3!.

view this post on Zulip Mason Protter (Nov 16 2025 at 11:07):

Here's an example benchmark:

julia> let
           src = rand(800)
           dst = rand(800)
           print("f1!: "); @btime f1!($src, $dst)
           print("f2!: "); @btime f2!($src, $dst)
           print("f3!: "); @btime f3!($src, $dst)
           print("f4!: "); @btime f4!($src, $dst)
           nothing
       end;
f1!:   257.917 ns (0 allocations: 0 bytes)
f2!:   81.792 ns (0 allocations: 0 bytes)
f3!:   79.757 ns (0 allocations: 0 bytes)
f4!:   260.740 ns (0 allocations: 0 bytes)

view this post on Zulip Sukera (Nov 16 2025 at 11:09):

for f4!: perhaps assumed pointer aliasing/provenance prevents vectorization?

view this post on Zulip Mason Protter (Nov 16 2025 at 11:10):

What confuses me though is that it matches the perf of f1!.

view this post on Zulip Mason Protter (Nov 16 2025 at 11:16):

Here's perhaps another piece of the weirdness puzzle:

This is slow:

julia> let
           src = rand(2, 400)
           dst = rand(2, 400)
           @btime $dst[1, :] .= @view $src[1, :]
       end;
  255.112 ns (0 allocations: 0 bytes)

but this is fast:

julia> let
           src = rand(ComplexF64, 400)
           dst = rand(ComplexF64, 400)

           src_r = reinterpret(reshape, Float64, src)
           dst_r = reinterpret(reshape, Float64, dst)

           @btime $dst_r[1, :] .= @view $src_r[1, :]
       end;
  88.521 ns (0 allocations: 0 bytes)

view this post on Zulip Mason Protter (Nov 16 2025 at 11:22):

The other thing I was thinking is that the performance differential is smaller than I'd expect for SIMD versus no SIMD. Your comment makes me wonder whether only the loads or only the stores are being vectorized in the slow cases, whereas in the mixed cases it's perhaps tricked into vectorizing both the loads and the stores?

If I disable SIMD completely using Base.donotdelete, I see that the loops take ~480 ns in the benchmark.
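(The exact variant isn't shown in the thread; a plausible sketch of the Base.donotdelete barrier, with a hypothetical name, would be:)

```julia
# Sketch: Base.donotdelete acts as an opaque use the compiler can't
# eliminate or reorder across, which defeats @simd vectorization here.
# f1_scalar! is a hypothetical name, not from the thread.
function f1_scalar!(src::Vector{T}, dst::Vector{T}) where {T}
    length(src) == length(dst) || error()
    @inbounds for i in 1:2:length(src)
        src_i = src[i]
        Base.donotdelete(src_i)  # opaque side effect; blocks SIMD
        dst[i] = src_i
    end
    dst
end
```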

view this post on Zulip wheeheee (Nov 16 2025 at 11:29):

I feel like I've screwed up somewhere but with a little change it can get faster:

julia> function f5!(src::Vector{T}, dst::Vector{T}) where {T}
           for i in eachindex(src, dst)
               if iseven(i)
                   continue
               end
               dst[i] = src[i]
           end
           dst
       end
f5! (generic function with 1 method)

julia> let N = 800
           src = rand(N)
           dst = zeros(N)
           print("f1!: "); @btime f1!($src, $dst)
           print("f2!: "); @btime f2!($src, $dst)
           print("f3!: "); @btime f3!($src, $dst)
           print("f4!: "); @btime f4!($src, $dst)
           print("f5!: "); @btime f5!($src, $dst)
           a = copy(f5!(src, dst))
           dst = zeros(N)
           b = copy(f1!(src, dst))
           a == b
       end
f1!:   190.867 ns (0 allocations: 0 bytes)
f2!:   74.897 ns (0 allocations: 0 bytes)
f3!:   75.000 ns (0 allocations: 0 bytes)
f4!:   190.867 ns (0 allocations: 0 bytes)
f5!:   45.399 ns (0 allocations: 0 bytes)
true

view this post on Zulip Mason Protter (Nov 16 2025 at 11:50):

Bizarre. This does lend some credence to the idea that it's down to different ways that vectorization choices can be made.

Unfortunately I can't really use that style in my actual application (trying to improve copyto! performance of FieldViews.jl)

view this post on Zulip wheeheee (Nov 16 2025 at 13:16):

Yeah, many seemingly small things cause missed optimizations in Julia. But this one feels especially bad, because the code is super straightforward.

view this post on Zulip wheeheee (Nov 16 2025 at 13:17):

From what I gather (pun unintended), f5! uses @llvm.masked.load.v4f64.p0 while f1! (and I assume the rest) uses @llvm.masked.gather.v4f64.v4p0, so using gathers here seems to be the problem.
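(For reference, one can check which memory intrinsics LLVM picked with code_llvm; as the later replies show, the result is machine- and version-dependent. A self-contained sketch, with stride2! a hypothetical stand-in for f1!:)

```julia
using InteractiveUtils  # provides code_llvm

# Strided loop in the style of f1!; on the machines in this thread
# LLVM emits @llvm.masked.gather for it.
function stride2!(src::Vector{Float64}, dst::Vector{Float64})
    @inbounds @simd for i in 1:2:length(src)
        dst[i] = src[i]
    end
    dst
end

# Capture the optimized IR as a string and search it.
ir = sprint(io -> code_llvm(io, stride2!,
        Tuple{Vector{Float64},Vector{Float64}}; debuginfo=:none))
occursin("vector.body", ir)    # did the loop vectorize at all?
occursin("masked.gather", ir)  # strided loads as gathers? (machine-dependent)
```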

view this post on Zulip wheeheee (Nov 16 2025 at 13:21):

Maybe you should open an issue? Similar performance bugs involving gathers have surfaced before.

view this post on Zulip Mason Protter (Nov 16 2025 at 14:51):

https://github.com/JuliaLang/julia/issues/60147

view this post on Zulip Zentrik (Nov 17 2025 at 22:51):

I think it makes sense that f5! is best; constructing and iterating over 1:2:length is probably pretty complicated at the LLVM IR level, and so worse optimized.
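(One way to probe that hypothesis, a sketch not from the thread with the hypothetical name f6!: iterate a plain unit range and derive the odd index by hand, so LLVM only sees a simple induction variable:)

```julia
# Sketch: replace the strided range 1:2:length(src) with a unit range
# over the count of odd indices, computing the odd index explicitly.
function f6!(src::Vector{T}, dst::Vector{T}) where {T}
    length(src) == length(dst) || error()
    n = (length(src) + 1) >> 1      # number of odd indices
    @inbounds for j in 0:n-1
        i = 2j + 1                  # odd index: 1, 3, 5, …
        dst[i] = src[i]
    end
    dst
end
```

Whether this actually vectorizes better is machine- and version-dependent, as the timings later in the thread show.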

view this post on Zulip Rafael Fourquet (Nov 18 2025 at 08:26):

That's really weird, I get totally different timings:

f1!:   81.062 ns (0 allocations: 0 bytes)
f2!:   82.257 ns (0 allocations: 0 bytes)
f3!:   81.907 ns (0 allocations: 0 bytes)
f4!:   80.628 ns (0 allocations: 0 bytes)
f5!:   245.792 ns (0 allocations: 0 bytes)

And on the benchmark from the issue you opened, the timings are reversed:

copyto_odd1!:   80.659 ns (0 allocations: 0 bytes)
copyto_odd2!:   246.441 ns (0 allocations: 0 bytes)

view this post on Zulip wheeheee (Nov 18 2025 at 09:39):

fwiw a straightforward translation of copyto_odd1! to Rust also doesn't vectorize, and a C++ version needs g++ -O3 to vectorize somewhat like copyto_odd2!, so the compiler is actually already quite smart in copyto_odd2! (not that it can't be better)

view this post on Zulip wheeheee (Nov 18 2025 at 09:40):

@Rafael Fourquet what's the output of versioninfo on your machine?

view this post on Zulip Rafael Fourquet (Nov 18 2025 at 16:08):

wheeheee said:

Rafael Fourquet what's the output of versioninfo on your machine?

This is on all recent julia versions, from v1.10 onwards. versioninfo() on v1.12 gives

 5/6> versioninfo()
Julia Version 1.12.1
Commit d05709ca652 (2025-11-17 21:06 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver3)
  GC: Built with stock GC
Threads: 16 default, 1 interactive, 16 GC (on 32 virtual cores)
Environment:
[...]

view this post on Zulip wheeheee (Nov 18 2025 at 16:18):

Interesting, does the vector.body section from code_llvm also show the gather/scatter and masked load/stores or is it reversed for you?

view this post on Zulip wheeheee (Nov 18 2025 at 16:19):

I guess the optimization passes are a tad brittle.

view this post on Zulip Rafael Fourquet (Nov 18 2025 at 16:25):

wheeheee said:

Interesting, does the vector.body section from code_llvm also show the gather/scatter and masked load/stores or is it reversed for you?

I don't see any gather/scatter, only some @llvm.masked.store and @llvm.masked.load for f5!.

view this post on Zulip wheeheee (Nov 18 2025 at 16:27):

for f1!, what’s in vector.body? Would be a very curious situation if it also uses gather/scatters and is still faster…

view this post on Zulip Rafael Fourquet (Nov 18 2025 at 17:06):

wheeheee said:

for f1!, what’s in vector.body? Would be a very curious situation if it also uses gather/scatters and is still faster…

There's no vector.body in the other f_i!s for me, only for f5! !

view this post on Zulip wheeheee (Nov 19 2025 at 14:52):

Weird, I see vector.body in f1!. What's your code_llvm output for copyto_odd1! and copyto_odd2!?

view this post on Zulip wheeheee (Nov 19 2025 at 14:59):

OK, I see that in f2! and f3! there's loop unrolling but no gather/scatter, so there's a long main loop right after the preheader, but no vector.body.


Last updated: Nov 27 2025 at 04:44 UTC