Stream: helpdesk (published)

Topic: ✔ What is the overhead of `gensym`?


Brian Chen (Jun 24 2022 at 23:21):

Should I refrain from calling it in a hot loop?

Brian Chen (Jun 24 2022 at 23:23):

Benchmarking suggests the overhead is non-negligible, which makes sense:

julia> function f1(x)
           total = 0
           for i in 1:x
               total += i
           end
           return total
       end
f1 (generic function with 1 method)

julia> @code_native f1(1000)
    .text
; ┌ @ REPL[37]:3 within `f1`
; │┌ @ range.jl:5 within `Colon`
; ││┌ @ range.jl:354 within `UnitRange`
; │││┌ @ range.jl:359 within `unitrange_last`
    testq   %rdi, %rdi
; │└└└
    jle L43
; │ @ REPL[37] within `f1`
    movq    %rdi, %rax
    sarq    $63, %rax
    andnq   %rdi, %rax, %rax
; │ @ REPL[37]:4 within `f1`
    leaq    -1(%rax), %rdx
    leaq    -2(%rax), %rcx
    mulxq   %rcx, %rcx, %rdx
    shldq   $63, %rcx, %rdx
    leaq    (%rdx,%rax,2), %rax
    decq    %rax
; │ @ REPL[37]:6 within `f1`
    retq
; │ @ REPL[37] within `f1`
L43:
    xorl    %eax, %eax
; │ @ REPL[37]:6 within `f1`
    retq
    nop
; └

julia> function f2(x)
           total = 0
           for i in 1:x
               @gensym a b c d e f g
               total += i
           end
           return total
       end
f2 (generic function with 1 method)

julia> @code_native f2(1000)
    .text
; ┌ @ REPL[40]:3 within `f2`
; │┌ @ range.jl:5 within `Colon`
; ││┌ @ range.jl:354 within `UnitRange`
; │││┌ @ REPL[40]:1 within `unitrange_last`
    pushq   %rbp
    pushq   %r15
    pushq   %r14
    pushq   %r13
    pushq   %r12
    pushq   %rbx
    subq    $40, %rsp
; │││└
; │││┌ @ range.jl:359 within `unitrange_last`
    testq   %rdi, %rdi
; │└└└
    jle L253
; │ @ REPL[40] within `f2`
    movq    %rdi, %rax
    sarq    $63, %rax
    andnq   %rdi, %rax, %rax
    movabsq $139821397918949, %r12          # imm = 0x7F2AB4C144E5
; │ @ REPL[40]:5 within `f2`
    negq    %rax
    movq    %rax, 32(%rsp)
    movl    $1, %r13d
    xorl    %ebx, %ebx
    leaq    -178798797(%r12), %rax
    movq    %rax, 24(%rsp)
    leaq    -178798637(%r12), %rax
    movq    %rax, 16(%rsp)
    leaq    -178798477(%r12), %rax
    movq    %rax, 8(%rsp)
    leaq    -178798317(%r12), %rax
    movq    %rax, (%rsp)
    leaq    -178798157(%r12), %rbp
    leaq    -178797997(%r12), %r14
    leaq    -178797837(%r12), %r15
    nopl    (%rax,%rax)
; │ @ REPL[40]:4 within `f2`
; │┌ @ expr.jl:12 within `gensym`
L144:
    movl    $1, %esi
    movq    24(%rsp), %rdi
    callq   *%r12
    movl    $1, %esi
    movq    16(%rsp), %rdi
    callq   *%r12
    movl    $1, %esi
    movq    8(%rsp), %rdi
    callq   *%r12
    movl    $1, %esi
    movq    (%rsp), %rdi
    callq   *%r12
    movl    $1, %esi
    movq    %rbp, %rdi
    callq   *%r12
    movl    $1, %esi
    movq    %r14, %rdi
    callq   *%r12
    movl    $1, %esi
    movq    %r15, %rdi
    callq   *%r12
; │└
; │ @ REPL[40]:5 within `f2`
; │┌ @ int.jl:87 within `+`
    addq    %r13, %rbx
    movq    32(%rsp), %rax
; │└
; │┌ @ range.jl:837 within `iterate`
; ││┌ @ promotion.jl:468 within `==`
    addq    %r13, %rax
    incq    %rax
; ││└
    incq    %r13
; ││┌ @ promotion.jl:468 within `==`
    cmpq    $1, %rax
; │└└
    jne L144
    jmp L255
; │ @ REPL[40] within `f2`
L253:
    xorl    %ebx, %ebx
; │ @ REPL[40]:7 within `f2`
L255:
    movq    %rbx, %rax
    addq    $40, %rsp
    popq    %rbx
    popq    %r12
    popq    %r13
    popq    %r14
    popq    %r15
    popq    %rbp
    retq
    nopw    %cs:(%rax,%rax)
; └

julia> @benchmark f1(1000)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.152 ns … 69.760 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.160 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.171 ns ±  0.712 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%


  ▂▁▁▆▁▁▁▄▁▁▁▄▁▁▁▃▁▁▁▂▁▁▁▂▁▁▁▃▁▁█▁▁▁▅▁▁▁▂▁▁▁▄▁▁▁█▁▁▁▆▁▁▁▇▁▁▂ ▂
  1.15 ns        Histogram: frequency by time        1.17 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f2(1000)
BenchmarkTools.Trial: 641 samples with 1 evaluation.
 Range (min … max):  7.365 ms …  10.314 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.726 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.795 ms ± 346.021 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁    ▁▂█▆▆▆▆▁▃▁▃▂▂
  ▅▆█▆▅█▅█████████████▆▅▃▃▃▃▂▃▁▃▃▃▃▁▄▄▃▃▃▃▃▃▂▁▁▂▁▂▂▁▁▁▂▂▁▁▂▁▃ ▄
  7.36 ms         Histogram: frequency by time        9.05 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Jeffrey Sarnoff (Jun 24 2022 at 23:48):

You answered the question by benchmarking both ways and noting the relative slowdown exceeds 5000x.
You will be happier reorganizing the code: lift the gensym out of the loop and pregenerate batches of these symbols.
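
For illustration, a minimal sketch of that reorganization (f3 and the batch size are hypothetical; it assumes the symbols need not be fresh on every iteration):

function f3(x)
    syms = [gensym() for _ in 1:7]  # one batch of symbols, generated before the loop
    total = 0
    for i in 1:x
        # reuse syms[1] ... syms[7] here instead of seven fresh @gensym names
        total += i
    end
    return total
end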

Michael Fiano (Jun 24 2022 at 23:51):

gensym comes from Common Lisp, where it is only supposed to be used at macro-expansion time. String-interning symbols, as Julia does, is not cheap.
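
For context, Julia Symbols are interned in a global table, so constructing one from a string involves a table lookup plus possible allocation; constructing the same name twice yields the identical object:

julia> Symbol("foo") === Symbol("foo")  # interned: both lookups return the same object
true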

Brian Chen (Jun 24 2022 at 23:51):

I was trying to get a better sense of the absolute per-call overhead. Some more local benchmarking seemed to suggest 1-2µs, but that's subject to noise and possibly other factors.
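
(Consistent with the benchmark above: roughly 7.8 ms per f2(1000) call, divided by 1000 iterations with 7 gensym calls each, comes to about 1.1 µs per gensym.)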

Michael Fiano (Jun 24 2022 at 23:56):

If it's anything like CL, there is more than string-interning involved too, since it has to keep a generation counter to ensure uniqueness.

Brian Chen (Jun 24 2022 at 23:56):

Yup, this one specifically: https://github.com/JuliaLang/julia/blob/87ded5a9aa502cfc4e03cbf230cb9bba86c85cc1/src/symbol.c#L126
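
Roughly, jl_gensym bumps a global counter and interns a name derived from it. An illustrative Julia analogue (not the actual C implementation):

# Hypothetical analogue of jl_gensym: bump a global counter, intern "##<n>".
const GENSYM_COUNTER = Threads.Atomic{Int}(0)
my_gensym() = Symbol("##", Threads.atomic_add!(GENSYM_COUNTER, 1))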

Brian Chen (Jun 24 2022 at 23:57):

I think I can fulfill my use case using a custom macro (basically @static_gensym x, y, ...), but will have to try. Edit: it worked!
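
The macro itself isn't shown in the thread; a plausible sketch, modeled on Base's @gensym but with the gensym calls moved to expansion time:

# Hypothetical @static_gensym: like Base.@gensym, but gensym() runs once,
# during macro expansion, so the hot loop pays nothing at runtime.
macro static_gensym(names...)
    blk = Expr(:block)
    for name in names
        # QuoteNode splices the already-generated Symbol in as a literal
        push!(blk.args, :($(esc(name)) = $(QuoteNode(gensym(string(name))))))
    end
    push!(blk.args, :(nothing))
    return blk
end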

Notification Bot (Jun 25 2022 at 00:18):

Brian Chen has marked this topic as resolved.

Sukera (Jun 25 2022 at 04:48):

oh boy, did not know that @gensym does a ccall
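
For reference, gensym in Base is a thin wrapper over the runtime; roughly, from base/expr.jl:

gensym() = ccall(:jl_gensym, Ref{Symbol}, ())
gensym(s::String) = ccall(:jl_tagged_gensym, Ref{Symbol}, (Ptr{UInt8}, Csize_t), s, sizeof(s))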

Sukera (Jun 25 2022 at 04:49):

I would expect this particular benchmark to be fast, since the gensymmed variables are never used, but the ccall may have side effects the compiler can't see, so it can't be eliminated.

Brian Chen (Jun 25 2022 at 05:35):

Isn't side-effect inference for ccall a nightly-only feature?

Sukera (Jun 25 2022 at 05:37):

I don't think there's any side-effect inference on ccalls

Sukera (Jun 25 2022 at 05:37):

are you talking about the @assume_effects macro?
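
For reference, Base.@assume_effects (new in 1.8, nightly-only at the time) lets you assert effects the compiler can't prove on its own; a minimal sketch:

# Asserting :total tells the compiler this call has no observable effects,
# so calls like fact(5) may be constant-folded away.
Base.@assume_effects :total function fact(n::Int)
    res = 1
    for i in 1:n
        res *= i
    end
    return res
end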

Sukera (Jun 25 2022 at 05:38):

I was thinking more about `a` being a dead variable, since it's never used
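
In other words, a pure, unused binding is trivially dead code; only the opaque ccall keeps @gensym alive. A sketch of the contrast (f4 is hypothetical):

function f4(x)
    total = 0
    for i in 1:x
        a = i + 1  # pure and never used: the compiler eliminates it entirely
        total += i
    end
    return total
end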

Mosè Giordano (Jun 25 2022 at 08:32):

I'm not sure the benchmarks were realistic; I think f1 was constant-propagated, no? Summing 1000 elements took 1 ns

Sukera (Jun 25 2022 at 08:37):

oh yeah, 100% - it got folded to ~~eulers~~ gauss formula, there isn't even a jump there anymore

Sukera (Jun 25 2022 at 08:37):

to me at least the @gensym should not have made a difference for that case

Mosè Giordano (Jun 25 2022 at 08:42):

did you mean to say Gauss, or am I missing a new formula? :sweat_smile:

Sukera (Jun 25 2022 at 08:43):

ah, yes of course

Sukera (Jun 25 2022 at 08:44):

always hard to remember which of the gazillion famous formulas is from which of the two

Brian Chen (Jun 25 2022 at 13:45):

@Mosè Giordano the goal was just to have the function body not optimized away entirely. I wanted to use `donotdelete` but didn't want to switch versions :P
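
For reference, the nightly-era version would look roughly like this sketch (f1_kept is hypothetical); Base.donotdelete (Julia 1.8+) forces the optimizer to keep the computation of its arguments:

function f1_kept(x)
    total = 0
    for i in 1:x
        total += i
        Base.donotdelete(total)  # `total` must now be treated as observed
    end
    return total
end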

Brian Chen (Jun 25 2022 at 13:46):

Ended up just testing a single iteration and explicit calls, because those are still kept around

Mosè Giordano (Jun 25 2022 at 14:52):

Brian Chen said:

Mosè Giordano the goal was just to have the function body not optimized away entirely. I wanted to use `donotdelete` but didn't want to switch versions :P

on nightly I get

julia> @benchmark f1(1000)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.460 ns … 20.945 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.472 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.511 ns ±  0.566 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄                                                       
  ██▃▃▁█▁▃▁▁█▇▄▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ 
  2.46 ns      Histogram: log(frequency) by time     3.26 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

which would also suggest it isn't doing much in practice. These damned compilers eliminating everything...

chriselrod (Jun 25 2022 at 20:35):

Mosè Giordano said:

I'm not sure the benchmarks were realistic; I think f1 was constant-propagated, no? Summing 1000 elements took 1 ns

LLVM will turn that sum into `binomial(x+1, 2)`.
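
That is the triangular-number closed form, binomial(x+1, 2) = x*(x+1)/2 = 1 + 2 + ... + x, which matches the branch-free arithmetic in the f1 listing above. A quick sanity check:

julia> binomial(1001, 2) == 1000 * 1001 ÷ 2 == 500500
true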

