Stream: helpdesk (published)

Topic: Nicer performant syntax for replacing elements conditionally


view this post on Zulip Nils (Jan 05 2024 at 11:28):

Is there a way to get something as nice as

x[x .<= maximum(x)/4] .= 0.0

with the same performance as

cutoff = maximum(x)/4
for i in eachindex(x)
    if x[i] <= cutoff
        x[i] = 0.0
    end
end

?

I see about a 2x performance hit from the first and 3 allocations (which surprised me, I was expecting one from the comparison)

view this post on Zulip Sukera (Jan 05 2024 at 11:56):

I don't think the dots fuse across indexing like that

view this post on Zulip Sukera (Jan 05 2024 at 11:56):

so might be resolved with a @views

view this post on Zulip Sukera (Jan 05 2024 at 11:56):

the result of the comparison will still be allocated though

view this post on Zulip Mason Protter (Jan 05 2024 at 12:33):

I'm actually kinda surprised that julia doesn't support a generator as an index

view this post on Zulip Mason Protter (Jan 05 2024 at 12:34):

then you could write this as something like

x[(i for i in eachindex(x) if x[i] <= maximum(x)/4)] .= 0.0

view this post on Zulip Mason Protter (Jan 05 2024 at 12:46):

I think the both cleanest(?) and most performant way to do this would be

x .= ifelse.(x .<= maximum(x)/4, 0.0, x)

view this post on Zulip Mason Protter (Jan 05 2024 at 12:47):

julia> f1(x) = x[x .<= maximum(x)/4] .= 0.0;

julia> f2(x) = let cutoff=maximum(x)/4
           for i in eachindex(x)
               if x[i] <= cutoff
                   x[i] = 0.0
               end
           end
           x
       end;

julia> f3(x) = x .= ifelse.(x .<= maximum(x)/4, 0.0, x);
julia> for N  (100, 1000, 10000)
           @show N
           x = rand(N)
           print("f1  "); @btime f1(x) setup=(x = rand($N)) evals=1
           print("f2  "); @btime f2(x) setup=(x = rand($N)) evals=1
           print("f3  "); @btime f3(x) setup=(x = rand($N)) evals=1
       end
N = 100
f1    290.000 ns (3 allocations: 256 bytes)
f2    130.000 ns (0 allocations: 0 bytes)
f3    110.000 ns (0 allocations: 0 bytes)
N = 1000
f1    1.012 μs (4 allocations: 6.17 KiB)
f2    671.000 ns (0 allocations: 0 bytes)
f3    401.000 ns (0 allocations: 0 bytes)
N = 10000
f1    6.723 μs (5 allocations: 23.53 KiB)
f2    4.799 μs (0 allocations: 0 bytes)
f3    2.174 μs (0 allocations: 0 bytes)

view this post on Zulip Nils (Jan 05 2024 at 13:34):

Oh wow, how is that ifelse thing faster?

view this post on Zulip Nils (Jan 05 2024 at 13:37):

Fwiw I don't see a speedup, so might be CPU dependent - in any case it's at least as fast and shorter and cleaner, so wins all round, thank you!

view this post on Zulip Mason Protter (Jan 05 2024 at 13:46):

yeah, the relative performance of ifelse versus actual branching can be very CPU dependant

view this post on Zulip Mason Protter (Jan 05 2024 at 13:49):

ifelse tends to be easier to make work with SIMD, but the branch-predictor can make branching more efficient

view this post on Zulip Mason Protter (Jan 05 2024 at 13:49):

could also be julia-version dependant, since it'll depend a lot on various low-level optimizations LLVM may or may not make

view this post on Zulip chriselrod (Jan 05 2024 at 15:08):

Mason Protter said:

could also be julia-version dependant, since it'll depend a lot on various low-level optimizations LLVM may or may not make

Yeah, for me f2 and f3 perform about the same, both of them SIMDing thanks to LLVM.

view this post on Zulip AMJ (Jan 05 2024 at 16:14):

On my machine, if I am understanding correctly(which is a big if), f3 creates 2 main branches which both are heavily SIMDed, while f2 has one SIMDed branch with fewer instructions. So probably a better vectorization plus a better branch prediction pattern?


Last updated: Nov 22 2024 at 04:41 UTC