chriselrod said:
If you're willing to debug, maybe we can find out why.
But, --
Sure, lemme know what to run
Here's one problem I've noticed:
julia> let A = rand(2048, 2408), B = similar(A)
           @btime $B .= $A
           @btime copyto_turbo!($B, $A)
           @btime copyto_tturbo!($B, $A)
       end;
2.410 ms (0 allocations: 0 bytes)
4.465 ms (0 allocations: 0 bytes)
3.399 ms (0 allocations: 0 bytes)
Apparently __memcpy_avx_unaligned_erms is fast.
How does a simple loop compare?
Note that B .= A forwards to copyto!, which calls memcpy.
I don't see a difference as extreme as you (and tturbo is fastest), but it'd be interesting to find out what exactly memcpy is doing to be so fast. Seems like an obviously memory bound problem.
julia> function copyto_loop!(B::Array{T}, A::Array{T}) where {T}
           for i in eachindex(A, B)
               B[i] = A[i]
           end
           B
       end
copyto_loop! (generic function with 1 method)
julia> let A = rand(2048, 2408), B = similar(A)
           @btime copyto_loop!($B, $A)
       end;
5.035 ms (0 allocations: 0 bytes)
With @inbounds?
Anyone know where I can find the code for Julia's memmove?
function unsafe_copyto!(dest::Array{T}, doffs, src::Array{T}, soffs, n) where T
    t1 = @_gc_preserve_begin dest
    t2 = @_gc_preserve_begin src
    destp = pointer(dest, doffs)
    srcp = pointer(src, soffs)
    if !allocatedinline(T)
        ccall(:jl_array_ptr_copy, Cvoid, (Any, Ptr{Cvoid}, Any, Ptr{Cvoid}, Int),
              dest, destp, src, srcp, n)
    elseif isbitstype(T)
        ccall(:memmove, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              destp, srcp, n * aligned_sizeof(T))
    elseif isbitsunion(T)
        ccall(:memmove, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              destp, srcp, n * aligned_sizeof(T))
        # copy selector bytes
        ccall(:memmove, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              ccall(:jl_array_typetagdata, Ptr{UInt8}, (Any,), dest) + doffs - 1,
              ccall(:jl_array_typetagdata, Ptr{UInt8}, (Any,), src) + soffs - 1,
              n)
    else
        _unsafe_copyto!(dest, doffs, src, soffs, n)
    end
    @_gc_preserve_end t2
    @_gc_preserve_end t1
    return dest
end
And also, which memmove is this, and where does it come from?
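For anyone who wants to poke at the isbitstype branch above in isolation, here is a minimal sketch that calls memmove directly through ccall. The name memmove_copy! is hypothetical (not part of Base), and it uses sizeof(T) instead of the internal aligned_sizeof, which I'm assuming is equivalent for plain element types like Float64.

```julia
# Hypothetical helper: copy `src` into `dest` with a direct call to libc's memmove.
# Only handles isbits element types, mirroring the `isbitstype(T)` branch above.
function memmove_copy!(dest::Array{T}, src::Array{T}) where {T}
    isbitstype(T) || throw(ArgumentError("this sketch only handles isbits element types"))
    length(dest) == length(src) || throw(DimensionMismatch("dest and src must have the same length"))
    nbytes = length(src) * sizeof(T)  # assumption: sizeof(T) matches the element stride
    # Keep both arrays rooted while we hold raw pointers into them.
    GC.@preserve dest src begin
        ccall(:memmove, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              pointer(dest), pointer(src), nbytes)
    end
    return dest
end

A = rand(64, 32)
B = similar(A)
memmove_copy!(B, A)
```

Benchmarking this against B .= A should show whether the broadcast has any overhead beyond the memmove call itself.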
Oops, here it is with @inbounds:
julia> function copyto_loop!(B::Array{T}, A::Array{T}) where {T}
           @inbounds for i in eachindex(A, B)
               B[i] = A[i]
           end
           B
       end
copyto_loop! (generic function with 1 method)
julia> let A = rand(2048, 2408), B = similar(A)
           @btime copyto_loop!($B, $A)
       end;
4.588 ms (0 allocations: 0 bytes)
Thanks, so at least it doesn't seem to do worse than LLVM, but it's obviously missing out on something pretty clever.
Maybe it's just using nontemporal stores?
Could you try @btime vmapnt!(identity, $B, $A)?
This is fast for me.
chriselrod said:
Thanks, so at least it doesn't seem to do worse than LLVM, but it's obviously missing out on something pretty clever.
Well, it's a lot slower than broadcast
julia> @btime $B .= $A;
2.625 ms (0 allocations: 0 bytes)
julia> @btime @turbo $B .= $A;
3.223 ms (0 allocations: 0 bytes)
julia> @btime vmapnt!(identity, $B, $A); # use non-temporal stores
2.481 ms (0 allocations: 0 bytes)
Mason Protter said:
Well, it's a lot slower than broadcast
The broadcast is ccall-ing memmove.
Gotcha
julia> let A = rand(2048, 2408), B = similar(A)
           @btime vmapnt!(identity, $B, $A)
       end
2.679 ms (0 allocations: 0 bytes)
chriselrod said:
Anyone know where I can find the code for Julia's memmove?
gdb tells me this:
(gdb) b memmove
Breakpoint 2 at 0x7ffff765db70: file ../sysdeps/x86_64/multiarch/ifunc-memmove.h, line 44.
So somewhere in here? https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/multiarch/ifunc-memmove.h
Thanks. I need to learn gdb (and rr) at some point to find out things like that.
For now, I'm looking at:
https://squadrick.dev/journal/going-faster-than-memcpy.html
Somehow, I hadn't learned about nontemporal loads before. https://www.felixcloutier.com/x86/movntdqa
Also, I need to add a fence instruction to vmapnt(t)!.
That article used sfence, but I can only get mfence through the LLVM fence instruction (and mfence is already available through Threads.atomic_fence()).
But this doesn't sound right:
Orders processor execution relative to all memory stores prior to the SFENCE instruction. The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible. The SFENCE instruction is ordered with respect to memory stores, other SFENCE instructions, MFENCE instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to memory loads or the LFENCE instruction.
https://www.felixcloutier.com/x86/sfence
Don't I care about the ordering of loads relative to stores? I.e., I'd want every load after the sfence to correctly see any store made before the sfence to that same memory address.
If sfence doesn't guarantee that, then do I need mfence?
Maybe the idea is that, if you are doing concurrent programming, there will be atomic release store of a flag after the sfence? The happens-before edge would be established through this flag, but you'd need that the store to this flag is visible after the stores of the big buffer you just copied.
I wonder if this is a peculiarity of x86's TSO guarantee. In x86, release store is free (normal mov) because the memory model is very strong. But then you'd need something extra for treating very weak operations like nontemporal store...?
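To make the flag idea concrete, here is a rough sketch of the publish pattern described above. The names publish! and consume are hypothetical, copyto! stands in for the nontemporal copy, and Threads.atomic_fence() is a full fence, so it is stronger than the sfence being discussed. Treat it as an illustration of the ordering, not as what vmapnt! does.

```julia
using Base.Threads: Atomic, atomic_fence, @spawn

# Producer: fill the buffer, fence, then set the flag.
function publish!(dest, src, ready::Atomic{Bool})
    copyto!(dest, src)  # stand-in for a copy done with nontemporal stores
    atomic_fence()      # full fence; assumption: an sfence would go here instead
    ready[] = true      # atomic store of the flag, after the buffer stores
    return nothing
end

# Consumer: wait for the flag, and only then read the buffer.
function consume(dest, ready::Atomic{Bool})
    while !ready[]      # atomic load of the flag
        yield()         # let other tasks run, also when there is one thread
    end
    return sum(dest)
end

src = collect(1.0:1000.0)
dest = zeros(1000)
ready = Atomic{Bool}(false)

consumer = @spawn consume(dest, ready)
publish!(dest, src, ready)
total = fetch(consumer)
```

The consumer never touches dest before it has seen the flag, so the only ordering it relies on is that the buffer stores become visible before the flag store does.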
Last updated: Nov 06 2024 at 04:40 UTC