Should I refrain from calling @gensym in a hot loop?
Benchmarking suggests the overhead is non-negligible, which makes sense:
julia> function f1(x)
           total = 0
           for i in 1:x
               total += i
           end
           return total
       end
f1 (generic function with 1 method)
julia> @code_native f1(1000)
        .text
; ┌ @ REPL[37]:3 within `f1`
; │┌ @ range.jl:5 within `Colon`
; ││┌ @ range.jl:354 within `UnitRange`
; │││┌ @ range.jl:359 within `unitrange_last`
        testq %rdi, %rdi
; │└└└
        jle L43
; │ @ REPL[37] within `f1`
        movq %rdi, %rax
        sarq $63, %rax
        andnq %rdi, %rax, %rax
; │ @ REPL[37]:4 within `f1`
        leaq -1(%rax), %rdx
        leaq -2(%rax), %rcx
        mulxq %rcx, %rcx, %rdx
        shldq $63, %rcx, %rdx
        leaq (%rdx,%rax,2), %rax
        decq %rax
; │ @ REPL[37]:6 within `f1`
        retq
; │ @ REPL[37] within `f1`
L43:
        xorl %eax, %eax
; │ @ REPL[37]:6 within `f1`
        retq
        nop
; └
julia> function f2(x)
           total = 0
           for i in 1:x
               @gensym a b c d e f g
               total += i
           end
           return total
       end
f2 (generic function with 1 method)
julia> @code_native f2(1000)
        .text
; ┌ @ REPL[40]:3 within `f2`
; │┌ @ range.jl:5 within `Colon`
; ││┌ @ range.jl:354 within `UnitRange`
; │││┌ @ REPL[40]:1 within `unitrange_last`
        pushq %rbp
        pushq %r15
        pushq %r14
        pushq %r13
        pushq %r12
        pushq %rbx
        subq $40, %rsp
; │││└
; │││┌ @ range.jl:359 within `unitrange_last`
        testq %rdi, %rdi
; │└└└
        jle L253
; │ @ REPL[40] within `f2`
        movq %rdi, %rax
        sarq $63, %rax
        andnq %rdi, %rax, %rax
        movabsq $139821397918949, %r12  # imm = 0x7F2AB4C144E5
; │ @ REPL[40]:5 within `f2`
        negq %rax
        movq %rax, 32(%rsp)
        movl $1, %r13d
        xorl %ebx, %ebx
        leaq -178798797(%r12), %rax
        movq %rax, 24(%rsp)
        leaq -178798637(%r12), %rax
        movq %rax, 16(%rsp)
        leaq -178798477(%r12), %rax
        movq %rax, 8(%rsp)
        leaq -178798317(%r12), %rax
        movq %rax, (%rsp)
        leaq -178798157(%r12), %rbp
        leaq -178797997(%r12), %r14
        leaq -178797837(%r12), %r15
        nopl (%rax,%rax)
; │ @ REPL[40]:4 within `f2`
; │┌ @ expr.jl:12 within `gensym`
L144:
        movl $1, %esi
        movq 24(%rsp), %rdi
        callq *%r12
        movl $1, %esi
        movq 16(%rsp), %rdi
        callq *%r12
        movl $1, %esi
        movq 8(%rsp), %rdi
        callq *%r12
        movl $1, %esi
        movq (%rsp), %rdi
        callq *%r12
        movl $1, %esi
        movq %rbp, %rdi
        callq *%r12
        movl $1, %esi
        movq %r14, %rdi
        callq *%r12
        movl $1, %esi
        movq %r15, %rdi
        callq *%r12
; │└
; │ @ REPL[40]:5 within `f2`
; │┌ @ int.jl:87 within `+`
        addq %r13, %rbx
        movq 32(%rsp), %rax
; │└
; │┌ @ range.jl:837 within `iterate`
; ││┌ @ promotion.jl:468 within `==`
        addq %r13, %rax
        incq %rax
; ││└
        incq %r13
; ││┌ @ promotion.jl:468 within `==`
        cmpq $1, %rax
; │└└
        jne L144
        jmp L255
; │ @ REPL[40] within `f2`
L253:
        xorl %ebx, %ebx
; │ @ REPL[40]:7 within `f2`
L255:
        movq %rbx, %rax
        addq $40, %rsp
        popq %rbx
        popq %r12
        popq %r13
        popq %r14
        popq %r15
        popq %rbp
        retq
        nopw %cs:(%rax,%rax)
; └
julia> @benchmark f1(1000)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.152 ns … 69.760 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.160 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.171 ns ±  0.712 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                █
  ▂▁▁▆▁▁▁▄▁▁▁▄▁▁▁▃▁▁▁▂▁▁▁▂▁▁▁▃▁▁█▁▁▁▅▁▁▁▂▁▁▁▄▁▁▁█▁▁▁▆▁▁▁▇▁▁▂ ▂
  1.15 ns        Histogram: frequency by time        1.17 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark f2(1000)
BenchmarkTools.Trial: 641 samples with 1 evaluation.
 Range (min … max):  7.365 ms … 10.314 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.726 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.795 ms ± 346.021 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▁ ▁▂█▆▆▆▆▁▃▁▃▂▂
  ▅▆█▆▅█▅█████████████▆▅▃▃▃▃▂▃▁▃▃▃▃▁▄▄▃▃▃▃▃▃▂▁▁▂▁▂▂▁▁▁▂▂▁▁▂▁▃ ▄
  7.36 ms        Histogram: frequency by time         9.05 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
You've answered the question by benchmarking it both ways: the relative slowdown exceeds 5000x.
You will be happier reorganizing the code: lift the gensym out of the loop and pregenerate batches of these symbols.
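For illustration, that reorganization could look something like this (a minimal sketch with made-up names, pregen_symbols and hot_loop, not code from the thread):

# Hypothetical sketch: intern a batch of fresh symbols once, up front,
# then hand them out inside the hot loop without any further gensym calls.
function pregen_symbols(tag::AbstractString, n::Integer)
    return [gensym(tag) for _ in 1:n]
end

function hot_loop(x)
    syms = pregen_symbols("a", x)   # one interning pass, outside the loop
    total = 0
    for i in 1:x
        s = syms[i]                 # cheap array indexing instead of a ccall
        total += i
    end
    return total
end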
gensym comes from Common Lisp, where it is only supposed to be used at macro-expansion time. String-interning symbols, which is what Julia does on every gensym call, is not cheap.
I was trying to get a better sense of the absolute per-call overhead. Some more local benchmarking seemed to suggest 1-2 µs, but that's subject to noise and possibly other factors.
If it's anything like CL, there is more than string-interning involved too, since it has to keep a generation counter to ensure uniqueness.
Yup, this one specifically: https://github.com/JuliaLang/julia/blob/87ded5a9aa502cfc4e03cbf230cb9bba86c85cc1/src/symbol.c#L126
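For intuition, here is a rough Julia rendering of that mechanism: a generation counter for uniqueness plus string interning for the cost. This is an assumption based on the discussion above, not a literal port of symbol.c, and COUNTER/sketch_gensym are made-up names:

# Sketch: uniqueness comes from an atomic counter; the expensive part is
# the Symbol constructor interning the freshly built string.
const COUNTER = Threads.Atomic{Int}(0)

function sketch_gensym(tag::AbstractString = "")
    n = Threads.atomic_add!(COUNTER, 1)   # generation counter for uniqueness
    return Symbol("##", tag, "#", n)      # string interning happens here
end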
I think I can fulfill my use case using a custom macro (basically @static_gensym x, y, ...), but will have to try. Edit: it worked!
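The thread doesn't show the macro's definition, but a hypothetical reconstruction of @static_gensym could look like this:

# Do the gensym calls once, at macro-expansion time, and splice the
# resulting symbols in as literals, so nothing is left to run in the loop.
macro static_gensym(names...)
    assignments = [:($(esc(n)) = $(QuoteNode(gensym(string(n))))) for n in names]
    return Expr(:block, assignments..., nothing)
end

The trade-off is that each expansion site gets one fixed set of symbols, so every loop iteration sees the same names -- which is exactly what makes it cheap.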
Brian Chen has marked this topic as resolved.
oh boy, did not know that @gensym does a ccall
I would expect this particular benchmark to be fast, since the gensymmed variables are never used, but the ccall may have side effects the compiler can't see, so it can't be eliminated
Isn't side-effect inference for ccall a nightly-only feature?
I don't think there's any side-effect inference on ccalls
are you talking about the @assume_effects macro?
I was more thinking about a being a dead variable, since it's never used
I'm not sure the benchmarks were realistic: I think f1 was constant-propagated, no? Summing 1000 elements took 1 ns
oh yeah, 100% - it got folded to euler's formula, there isn't even a jump there anymore
to me at least the @gensym should not have made a difference for that case
did you mean to say Gauss, or am I missing a new formula? :sweat_smile:
ah, yes of course
always hard to remember which of the gazillion famous formulas is from which of the two
@Mosè Giordano the goal was just to have the function body not optimized away entirely. I wanted to use donotdelete but didn't want to switch versions :P
Ended up just testing a single iteration and explicit calls because those are kept around still
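My guess at what that measurement looks like (not the thread's exact code):

# Measure the per-call cost of gensym directly, so there is nothing
# for the compiler to elide.
using BenchmarkTools
@benchmark gensym("a")

# On Julia 1.8+, Base.donotdelete can pin otherwise-dead results instead:
# @benchmark Base.donotdelete(gensym("a"))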
Brian Chen said:
Mosè Giordano the goal was just to have the function body not optimized away entirely. I wanted to use donotdelete but didn't want to switch versions :P
on nightly I get
julia> @benchmark f1(1000)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.460 ns … 20.945 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.472 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.511 ns ±  0.566 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄ ▃  ▃                                                    ▁
  ██▃▃▁█▁▃▁▁█▇▄▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ █
  2.46 ns      Histogram: log(frequency) by time      3.26 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
which would also suggest it isn't doing much in practice. these damned compilers eliminating everything...
Mosè Giordano said:
I'm not sure the benchmarks were realistic, I think f1 was constant propagated, no? Summing 1000 elements took 1 ns
LLVM will turn that sum into binomial(x+1,2).
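For the record, that's the triangular-number identity behind the fold: the sum of 1:n is n(n+1)/2, i.e. binomial(n+1, 2), which is why f1 runs in constant time regardless of x:

julia> n = 1000;

julia> sum(1:n) == n * (n + 1) ÷ 2 == binomial(n + 1, 2)
true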