LoopVec debugging · helpdesk (published)

Stream: helpdesk (published)

Topic: LoopVec debugging

Mason Protter (Sep 25 2021 at 02:41):

chriselrod said:

If you're willing to debug, maybe we can find out why.
But, --

Sure, lemme know what to run

Mason Protter (Sep 25 2021 at 02:45):

Here's one problem I've noticed:

julia> let A = rand(2048, 2408), B = similar(A)
           @btime $B .= $A
           @btime copyto_turbo!($B, $A)
           @btime copyto_tturbo!($B, $A)
       end;
  2.410 ms (0 allocations: 0 bytes)
  4.465 ms (0 allocations: 0 bytes)
  3.399 ms (0 allocations: 0 bytes)

chriselrod (Sep 25 2021 at 02:52):

Apparently __memcpy_avx_unaligned_erms is fast.
How does a simple loop compare?
Note that B .= A forwards to copyto!, which calls memcpy.

I don't see a difference as extreme as you (and tturbo is fastest), but it'd be interesting to find out what exactly memcpy is doing to be so fast. Seems like an obviously memory bound problem.

Mason Protter (Sep 25 2021 at 03:31):

julia> function copyto_loop!(B::Array{T}, A::Array{T}) where {T}
           for i in eachindex(A, B)
               B[i] = A[i]
           end
           B
       end
copyto_loop! (generic function with 1 method)

julia> let A = rand(2048, 2408), B = similar(A)
           @btime copyto_loop!($B, $A)
       end;
5.035 ms (0 allocations: 0 bytes)

chriselrod (Sep 25 2021 at 03:39):

With @inbounds?
Anyone know where I can find the code for Julia's memmove?

function unsafe_copyto!(dest::Array{T}, doffs, src::Array{T}, soffs, n) where T
    t1 = @_gc_preserve_begin dest
    t2 = @_gc_preserve_begin src
    destp = pointer(dest, doffs)
    srcp = pointer(src, soffs)
    if !allocatedinline(T)
        ccall(:jl_array_ptr_copy, Cvoid, (Any, Ptr{Cvoid}, Any, Ptr{Cvoid}, Int),
              dest, destp, src, srcp, n)
    elseif isbitstype(T)
        ccall(:memmove, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              destp, srcp, n * aligned_sizeof(T))
    elseif isbitsunion(T)
        ccall(:memmove, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              destp, srcp, n * aligned_sizeof(T))
        # copy selector bytes
        ccall(:memmove, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              ccall(:jl_array_typetagdata, Ptr{UInt8}, (Any,), dest) + doffs - 1,
              ccall(:jl_array_typetagdata, Ptr{UInt8}, (Any,), src) + soffs - 1,
              n)
    else
        _unsafe_copyto!(dest, doffs, src, soffs, n)
    end
    @_gc_preserve_end t2
    @_gc_preserve_end t1
    return dest
end

And, also which memmove is this/where does it come from?

Mason Protter (Sep 25 2021 at 03:47):

Oops, here it is with the inbounds

julia> function copyto_loop!(B::Array{T}, A::Array{T}) where {T}
           @inbounds for i in eachindex(A, B)
               B[i] = A[i]
           end
           B
       end
copyto_loop! (generic function with 1 method)

julia> let A = rand(2048, 2408), B = similar(A)
           @btime copyto_loop!($B, $A)
       end;
  4.588 ms (0 allocations: 0 bytes)

chriselrod (Sep 25 2021 at 03:56):

Thanks, so at least it isn't doing seem to do worse than LLVM, but it's obviously missing out on something pretty clever.

chriselrod (Sep 25 2021 at 04:00):

Maybe it's just using nontemporal stores?
Could you try @btime vmapnt!(identity, $B, $A)?

This is fast for me.

Mason Protter (Sep 25 2021 at 04:02):

chriselrod said:

Thanks, so at least it isn't doing seem to do worse than LLVM, but it's obviously missing out on something pretty clever.

Well, it's a lot slower than broadcast

chriselrod (Sep 25 2021 at 04:03):

julia> @btime $B .= $A;
  2.625 ms (0 allocations: 0 bytes)

julia> @btime @turbo $B .= $A;
  3.223 ms (0 allocations: 0 bytes)

julia> @btime vmapnt!(identity, $B, $A); # use non-temporal stores
  2.481 ms (0 allocations: 0 bytes)

chriselrod (Sep 25 2021 at 04:03):

Mason Protter said:

Well, it's a lot slower than broadcast

The broadcast is ccall-ing memmove.

Mason Protter (Sep 25 2021 at 04:03):

Gotcha

Mason Protter (Sep 25 2021 at 04:05):

julia> let A = rand(2048, 2408), B = similar(A)
           @btime vmapnt!(identity, $B, $A)
       end
  2.679 ms (0 allocations: 0 bytes)

Takafumi Arakaki (tkf) (Sep 25 2021 at 04:12):

chriselrod said:

Anyone know where I can find the code for Julia's memmove?

gdb tells me this

(gdb) b memmove
Breakpoint 2 at 0x7ffff765db70: file ../sysdeps/x86_64/multiarch/ifunc-memmove.h, line 44.

Takafumi Arakaki (tkf) (Sep 25 2021 at 04:14):

So somewhere in here? https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/multiarch/ifunc-memmove.h

chriselrod (Sep 25 2021 at 04:16):

Thanks. I need to learn gdb (and rr) at some point to find out things like that.
For now, I'm looking at:
https://squadrick.dev/journal/going-faster-than-memcpy.html
Somehow, I hadn't learned about nontemporal loads before. https://www.felixcloutier.com/x86/movntdqa

chriselrod (Sep 25 2021 at 04:18):

Also, I need to add a fence instruction to vmapnt(t)!.

chriselrod (Sep 25 2021 at 04:24):

That article used sfence, but I can only get mfence through the llvm fence instruction (and mfence is already available through Threads.atomic_fence()).

chriselrod (Sep 25 2021 at 04:27):

But this doesn't sound right:

Orders processor execution relative to all memory stores prior to the SFENCE instruction. The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible. The SFENCE instruction is ordered with respect to memory stores, other SFENCE instructions, MFENCE instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to memory loads or the LFENCE instruction.

https://www.felixcloutier.com/x86/sfence
Don't I care about the ordering of loads and stores? I.e., I'd want all loads after the sfence to correctly load from any stores before the sfence to that same memory address.
If this doesn't guarantee that, then I do need mfence?

Takafumi Arakaki (tkf) (Sep 25 2021 at 04:38):

Maybe the idea is that, if you are doing concurrent programming, there will be atomic release store of a flag after the sfence? The happens-before edge would be established through this flag, but you'd need that the store to this flag is visible after the stores of the big buffer you just copied.

Takafumi Arakaki (tkf) (Sep 25 2021 at 04:43):

I wonder if this is a peculiarity of x86's TSO guarantee. In x86, release store is free (normal mov) because the memory model is very strong. But then you'd need something extra for treating very weak operations like nontemporal store...?

Last updated: Nov 03 2025 at 04:46 UTC