Is there a way to get something as nice as
x[x .<= maximum(x)/4] .= 0.0
with the same performance as
cutoff = maximum(x)/4
for i in eachindex(x)
if x[i] <= cutoff
x[i] = 0.0
end
end
?
I see about a 2x performance hit from the first and 3 allocations (which surprised me, I was expecting one from the comparison)
I don't think the dots fuse across indexing like that
so might be resolved with a @views
the result of the comparison will still be allocated though
I'm actually kinda surprised that julia doesn't support a generator as an index
then you could write this as something like
x[(i for i in eachindex(x) if x[i] <= maximum(x)/4)] .= 0.0
I think the both cleanest(?) and most performant way to do this would be
x .= ifelse.(x .<= maximum(x)/4, 0.0, x)
julia> f1(x) = x[x .<= maximum(x)/4] .= 0.0;
julia> f2(x) = let cutoff=maximum(x)/4
for i in eachindex(x)
if x[i] <= cutoff
x[i] = 0.0
end
end
x
end;
julia> f3(x) = x .= ifelse.(x .<= maximum(x)/4, 0.0, x);
julia> for N ∈ (100, 1000, 10000)
@show N
x = rand(N)
print("f1 "); @btime f1(x) setup=(x = rand($N)) evals=1
print("f2 "); @btime f2(x) setup=(x = rand($N)) evals=1
print("f3 "); @btime f3(x) setup=(x = rand($N)) evals=1
end
N = 100
f1 290.000 ns (3 allocations: 256 bytes)
f2 130.000 ns (0 allocations: 0 bytes)
f3 110.000 ns (0 allocations: 0 bytes)
N = 1000
f1 1.012 μs (4 allocations: 6.17 KiB)
f2 671.000 ns (0 allocations: 0 bytes)
f3 401.000 ns (0 allocations: 0 bytes)
N = 10000
f1 6.723 μs (5 allocations: 23.53 KiB)
f2 4.799 μs (0 allocations: 0 bytes)
f3 2.174 μs (0 allocations: 0 bytes)
Oh wow, how is that ifelse
thing faster?
Fwiw I don't see a speedup, so might be CPU dependent - in any case it's at least as fast and shorter and cleaner, so wins all round, thank you!
yeah, the relative performance of ifelse
versus actual branching can be very CPU dependant
ifelse
tends to be easier to make work with SIMD, but the branch-predictor can make branching more efficient
could also be julia-version dependant, since it'll depend a lot on various low-level optimizations LLVM may or may not make
Mason Protter said:
could also be julia-version dependant, since it'll depend a lot on various low-level optimizations LLVM may or may not make
Yeah, for me f2
and f3
perform about the same, both of them SIMDing thanks to LLVM.
On my machine, if I am understanding correctly(which is a big if), f3 creates 2 main branches which both are heavily SIMDed, while f2 has one SIMDed branch with fewer instructions. So probably a better vectorization plus a better branch prediction pattern?
Last updated: Nov 06 2024 at 04:40 UTC