Is there a quick and dirty way to see where compilation/inference time is being spent when running a particular script? I feel like I've done this before, but I can't remember how. Is the answer just using SnoopCompile?
I didn't realize y'all had created #helpdesk (published) in addition to this stream since I last visited zulip. Could a mod please move this message to that stream?
This topic was moved here from #helpdesk > Compilation/Inference time by Mason Protter
Welcome back @Adam non-jedi Beckmeyer! Yeah, unfortunately this is kinda a hard thing to profile right now. SnoopCompile.jl has the best tools for this as far as I'm aware
It won't necessarily help you find which functions the compiler spends the most time on, but it'll definitely help identify things like invalidations, which lead to excessive re-compilation
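For reference, a rough sketch of how that usually looks with SnoopCompile (these are the SnoopCompile 2.x names as far as I remember; newer releases may have renamed some of them, and the script/package names below are placeholders):
using SnoopCompileCore
tinf = @snoopi_deep begin
    include("myscript.jl")   # placeholder for whatever workload you want to time
end

using SnoopCompile
flatten(tinf)      # per-method inference timings
flamegraph(tinf)   # can be handed to ProfileView.view for a visual breakdown

# and to check whether loading a package invalidates methods your code relies on:
invalidations = @snoopr using SomePackage   # SomePackage is a stand-in
invalidation_trees(invalidations)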
Thanks. Guess I'll have to dig back into how SnoopCompile works. This script has just a handful of new functions (none are methods of existing functions, so no invalidations afaik), so it's very confusing that it's spending over 0.75 seconds on compilation/inference (compared to a 5.5 second total runtime).
Hmm, interesting. Are you loading packages? It could be that you are loading packages which invalidate Base methods which end up in the call chain of your code.
Nope. But I think I found the culprit; just a sec
A big chunk of the time wasn't actually compilation but evaluation of code I was using to define global consts (please don't judge the code below):
julia> @time begin
REVMASKS = Tuple(Tuple(reverse!(VecElement.(0%Int8:15%Int8), 1, i)) for i=1:16)
SHIFTMASKS = Tuple(Tuple(insert!(VecElement.(1%Int8:15%Int8), i, VecElement(0%Int8))) for i=1:16)
end;
0.273808 seconds (760.64 k allocations: 46.730 MiB, 1.84% gc time, 100.70% compilation time)
Replacing that with the equivalent using nested Base.Cartesian.@ntuple fixed things right up.
Moral of the story: the difference between /usr/bin/time and @btime shouldn't immediately be blamed on compilation/inference time.
Always fun to see "compilation time" > 100%
Code like Tuple(Tuple(reverse!(VecElement.(0%Int8:15%Int8), 1, i)) for i=1:16) defines some complicated closures, so I wouldn't be surprised if it takes time to compile. I wonder if sandwiching it with @nospecialize and @specialize helps. Another approach is to work with Any[].
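(A guess at what the Any[] approach could look like for REVMASKS; untested sketch, variable names made up:)
# build the entries into an untyped container with a plain top-level loop, so
# no fresh generator closure has to be inferred, then convert at the end
masks = Any[]
for i in 1:16
    push!(masks, Tuple(reverse!(VecElement.(0%Int8:15%Int8), 1, i)))
end
REVMASKS = Tuple(masks)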
Second time running it only gets you from 0.27 down to 0.17, so it's slow regardless of compilation fwiw.
I think it's more the dynamism in creating the tuples than compilation per se that's the problem. I.e. dynamically using Val types will be slow regardless of whether the code has been compiled already or not
Out of curiosity, how much faster was the @ntuple solution?
Quite a bit...
julia> using Base.Cartesian
julia> const V = VecElement
VecElement
julia> @time begin
REVMASKS = @ntuple 16 i->
@ntuple 16 k-> V{Int8}(k <= i ? i - k : k - 1)
SHIFTMASKS = @ntuple 16 i->
@ntuple 16 k-> V{Int8}(k == i ? 0 : k < i ? k : k - 1)
end;
0.000001 seconds
Adam non-jedi Beckmeyer said:
Second time running it only gets you from 0.27 down to 0.17, so it's slow regardless of compilation fwiw.
How do you run it a "second time"? I think evaluating GLOBAL = $RHS twice creates a different closure type if $RHS needs one, so the inference of the closure is always done from scratch (though the inference for functions called inside is cached). You need to write f() = $RHS; GLOBAL = f() if you want to exclude the compilation.
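(A small sketch of that pattern using the REVMASKS definition from above; the function name make_revmasks is made up:)
# the generator's closure now belongs to make_revmasks, so its type is defined
# once when the method is defined; re-evaluating the assignment reuses the
# cached inference instead of starting from scratch
make_revmasks() = Tuple(Tuple(reverse!(VecElement.(0%Int8:15%Int8), 1, i)) for i = 1:16)

@time REVMASKS = make_revmasks()   # first call pays the compilation cost
@time REVMASKS = make_revmasks()   # second call should report essentially no compilation time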
Takafumi Arakaki (tkf) said:
I think evaluating GLOBAL = $RHS twice creates a different closure type if $RHS needs one, so the inference of the closure is always done from scratch
Ah. I didn't think about the fact that the closure the second time around wouldn't have the same type; good point.
(On a related note, julia is about to be faster than both C and Rust on one of the benchmarks from the benchmarks game...)
Btw, if you don't want to do an explicit using Base.Cartesian, I believe that doing
@time begin
REVMASKS = ntuple(Val(16)) do i
ntuple(k -> V{Int8}(k <= i ? i - k : k - 1), Val(16))
end
SHIFTMASKS = ntuple(Val(16)) do i
ntuple(k-> V{Int8}(k == i ? 0 : k < i ? k : k - 1), Val(16))
end
end;
should be equivalent
Slower than the version with Tuple actually
julia> const V = VecElement
VecElement
julia> @time begin
REVMASKS = ntuple(Val(16)) do i
ntuple(k -> V{Int8}(k <= i ? i - k : k - 1), Val(16))
end
SHIFTMASKS = ntuple(Val(16)) do i
ntuple(k-> V{Int8}(k == i ? 0 : k < i ? k : k - 1), Val(16))
end
end;
0.303107 seconds (1.38 M allocations: 91.848 MiB, 3.30% gc time, 100.96% compilation time)
It has the same problem as the original version where it's creating closures. @ntuple doesn't actually treat the anonymous functions it operates on as functions, appearances to the contrary.
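(You can see that with @macroexpand; a toy example, and the expansion shown is just roughly what I'd expect:)
using Base.Cartesian

# @ntuple with a literal count splices the body with i replaced by 1, ..., N,
# so no anonymous function object is ever created or compiled
@macroexpand @ntuple 4 i -> i^2
# expands to something like :((1 ^ 2, 2 ^ 2, 3 ^ 2, 4 ^ 2))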
Ah interesting
ntuple(_, Val(n)) is a @generated function so the compiler tries very hard to optimize it
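(A toy example of what that unrolling buys you once the closure has a fixed type inside a function; the name is made up and the @code_typed claim is my expectation, not something I've verified here:)
# inside a function the closure k -> k^2 has a single, known type, so the
# generated ntuple body, essentially (f(1), f(2), ..., f(16)), can be fully
# unrolled and constant-folded by the compiler
squares16() = ntuple(k -> k^2, Val(16))

# @code_typed squares16() should show little more than returning a tuple constant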
I don't really like Base.Cartesian but I think it's the best option here
I think we should open an issue to track and hopefully fix why we get >100% compilation times. These are quite nice MWEs
I thought compilation was multi-threaded now, so seeing >100% compilation time was expected behavior?
Precompilation of multiple separate modules can be multithreaded, but not compilation itself (yet)
Ah. hrm
I think the >100% is due to some nesting of the timing blocks in src
Mason Protter said:
Precompilation of multiple separate modules can be multithreaded, but not compilation itself (yet)
Kinda a nitpick, but the parallelization of Pkg.precompile is process-based, so it's not "threaded" in the sense of Base.Threads.
Oh I didn't realize it used processes, that makes sense, thanks