Stream: helpdesk (published)

Topic: Help with BenchmarkTools.jl


Júlio Hoffimann (Jun 13 2025 at 19:16):

I always forget when we need to use $. The results below vary a lot:

julia> using LinearAlgebra

julia> using StaticArrays

julia> using Distances

julia> using BenchmarkTools

julia> x = SVector(1.0, 2.0, 3.0)
3-element SVector{3, Float64} with indices SOneTo(3):
 1.0
 2.0
 3.0

julia> y = SVector(4.0, 5.0, 6.0)
3-element SVector{3, Float64} with indices SOneTo(3):
 4.0
 5.0
 6.0

julia> w = SVector(1.0, 1.0/4.0, 1.0/9.0)
3-element SVector{3, Float64} with indices SOneTo(3):
 1.0
 0.25
 0.1111111111111111

julia> d1 = WeightedEuclidean(w)
WeightedEuclidean{SVector{3, Float64}}([1.0, 0.25, 0.1111111111111111])

julia> d2 = Mahalanobis(Diagonal(w))
Mahalanobis{Diagonal{Float64, SVector{3, Float64}}}([1.0 0.0 0.0; 0.0 0.25 0.0; 0.0 0.0 0.1111111111111111])

julia> @btime d1(x, y);
  18.138 ns (1 allocation: 16 bytes)

julia> @btime d1($x, $y);
  30.080 ns (3 allocations: 80 bytes)

julia> @btime $d1($x, $y);
  3.026 ns (0 allocations: 0 bytes)

julia> @btime d2(x, y);
  19.406 ns (1 allocation: 16 bytes)

julia> @btime d2($x, $y);
  30.923 ns (3 allocations: 80 bytes)

julia> @btime $d2($x, $y);
  3.093 ns (0 allocations: 0 bytes)

Apparently d1 wins in all scenarios, but how should I read these numbers?

Júlio Hoffimann (Jun 13 2025 at 19:30):

The results with Chairmarks.jl are different:

julia> @b d1(x, y)
18.923 ns (1 allocs: 16 bytes)

julia> @b d1($x, $y)
30.557 ns (3 allocs: 80 bytes)

julia> @b $d1($x, $y)
3.235 ns

julia> @b d2(x, y)
19.091 ns (1 allocs: 16 bytes)

julia> @b d2($x, $y)
30.215 ns (3 allocs: 80 bytes)

julia> @b $d2($x, $y)
3.198 ns

Júlio Hoffimann (Jun 13 2025 at 19:31):

Appreciate any help interpreting these results.

Eric Hanson (Jun 13 2025 at 20:53):

$ means treat it like a local variable. If the computation will be carried out in an inner loop somewhere where all the variables are local, then interpolate everything.
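
For illustration (a minimal sketch, not from the thread; kernel is a hypothetical helper): interpolating everything is roughly equivalent to timing the call from inside a function where every name is a local:

kernel(d, x, y) = d(x, y)    # every name is a local inside this function
@btime kernel($d1, $x, $y);  # behaves roughly like @btime $d1($x, $y)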

Can you explain what you see as different in the Chairmarks results? They look remarkably similar to me: less than 1 ns difference from their BenchmarkTools counterparts in each case.

Júlio Hoffimann (Jun 13 2025 at 20:54):

They are pretty much the same, and I guess a 1 ns difference is irrelevant in general. Thank you for clarifying the use of $.

Neven Sajko (Jun 13 2025 at 21:50):

Nowadays global variables may be typed, so it often makes sense to pass a typed global variable to the benchmarked function without interpolation, to prevent unwanted constant folding.

Júlio Hoffimann (Jun 13 2025 at 21:52):

Can you elaborate on that, @Neven Sajko?

Júlio Hoffimann (Jun 13 2025 at 21:52):

How is the example above affected by your comment?

Neven Sajko (Jun 13 2025 at 22:38):

When you do @btime $d1($x, $y) or @btime $d2($x, $y) above, you're more or less benchmarking a constant expression. This can be pointless and possibly not what you want to benchmark, because the compiler could potentially optimize away the whole thing, leaving you benchmarking nothing. The timing would still come out at around 1-3 ns due to measurement error.
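
To see the constant-folding risk concretely, here is a hedged sketch (the const binding dconst and the helper name folded are mine, not from the thread). When the metric and both arguments are compile-time constants, the compiler is free to compute the whole distance at compile time:

using LinearAlgebra, StaticArrays, Distances, InteractiveUtils

const dconst = WeightedEuclidean(SVector(1.0, 0.25, 1.0/9.0))

# Everything in this body is a compile-time constant, so the entire
# computation may be folded into a single literal:
folded() = dconst(SVector(1.0, 2.0, 3.0), SVector(4.0, 5.0, 6.0))

# If the lowered body is just 'return <number>', folding happened, and a
# fully interpolated @btime of the same call measures (almost) nothing:
@code_typed folded()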

My suggestion is to instead use global variables directly when benchmarking, but those global variables should have a declared type. The compiler doesn't know the value of a global variable, so it will not be constant-propagated (unless the type has only a single instance, as with Nothing, Missing, or Val{SomeParameter}). If some expression is intended to actually be constant during benchmarking, I personally still never use interpolation, preferring const bindings and auxiliary functions where some arguments are constant. So my style is to run @btime f(a, b, c), where f is a constant binding (usually just a global function constant), while a, b, etc. are typed global variables. That way it's clearer what's happening.

Here's how I would perhaps have done the benchmark, assuming the metric functions are supposed to be compile-time constants, while the values of x and y should not be known to the compiler:

using LinearAlgebra, StaticArrays, Distances, BenchmarkTools
# Typed globals: the compiler knows the types of x and y, but not their values.
x::typeof(SVector(1.0, 2.0, 3.0)) = SVector(1.0, 2.0, 3.0)
y::typeof(x) = SVector(4.0, 5.0, 6.0)
# Constant bindings: the metrics are compile-time constants.
const w = SVector(1.0, 1.0/4.0, 1.0/9.0)
const d1 = WeightedEuclidean(w)
const d2 = Mahalanobis(Diagonal(w))
# No interpolation needed: d1/d2 are consts, x/y are typed globals.
@btime d1(x, y);
@btime d2(x, y);

There are also some tips specifically relevant to the floating-point code that's being benchmarked here: benchmarking merely a single call of the metric doesn't seem relevant, because that's not what happens in programs in the real world. In the real world, usually what you want is for the code to be vectorized by the compiler, which may happen if you're running the metric function in each iteration of a loop. So, to benchmark something like this, a more relevant experiment might be to benchmark a function which calls the metric for each point in a vector of points.
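
A sketch of that last suggestion (the function name total_distance and the sizes are mine, chosen for illustration): benchmark the metric applied across a whole vector of points, which gives the compiler a loop to optimize and is closer to real-world usage:

using StaticArrays, Distances, BenchmarkTools

# Sum the metric over paired points; unlike a single isolated call,
# this gives the compiler a loop it can potentially vectorize.
function total_distance(d, xs, ys)
    s = 0.0
    @inbounds for i in eachindex(xs, ys)
        s += d(xs[i], ys[i])
    end
    return s
end

d1 = WeightedEuclidean(SVector(1.0, 0.25, 1.0/9.0))
xs = rand(SVector{3,Float64}, 10_000)
ys = rand(SVector{3,Float64}, 10_000)

@btime total_distance($d1, $xs, $ys);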

Júlio Hoffimann (Jun 13 2025 at 22:41):

Thank you @Neven Sajko . Very useful tips.

Daniel Wennberg (Jun 13 2025 at 22:54):

The usual trick to "interpolate the type but not the value" is @btime f(($(Ref(x)))[]), which isn't the most readable thing, so you may instead want to do xref = Ref(x); @btime f(($xref)[]).

This is similar to doing const xref = Ref(x) and not interpolating, but without limiting the rest of your Julia session by adding a const to your global namespace. (Admittedly less of an issue with the new binding partitioning in 1.12, but even there you can't un-const a name, only rebind it to a different const value).

It's also very similar to a typed global x::MyType = ... and no interpolation, but similarly avoids irreversibly attaching MyType to x for the rest of the Julia session. Also, the implementation of typed globals guarantees atomic reads and writes, which adds a tiny but nonzero extra cost to accessing them, compared to dereferencing a constant/interpolated Ref.
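
Applied to the example at the top of the thread, the Ref trick would read (a sketch, reusing d1, x, and y from above):

# The compiler sees only the type of each Ref's contents, not the values,
# so no constant folding; d1 is still interpolated as a constant callable:
@btime $d1($(Ref(x))[], $(Ref(y))[]);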

Júlio Hoffimann (Jun 13 2025 at 23:07):

I wonder if BenchmarkTools.jl or Chairmarks.jl could save users from these pitfalls. It is really tricky to get things right even as an experienced Julia programmer. You basically need to understand how the compiler works to do benchmarks correctly nowadays.

Daniel Wennberg (Jun 13 2025 at 23:22):

I don't think there's been any change over time except the introduction of typed globals, which gives you a third alternative to const Ref(...) or ($(Ref(...)))[] but otherwise hasn't changed anything. (And the only effect of the binding partitioning in 1.12 is that you can change the value of a const later, which may reduce your hesitation to use const for one-off benchmarking values. It changes nothing about how const and benchmarking interact.)

Daniel Wennberg (Jun 13 2025 at 23:24):

The point being, you've always needed some understanding of this to get microbenchmarks right.

Daniel Wennberg (Jun 13 2025 at 23:28):

But honestly, I think the best way of addressing most of these things is to use setup more actively. You never need to interpolate values defined in setup (and it's not possible anyway).

@btime f(x, y[], z) setup=begin
    x = rand(...)  # random value, compiler can only know the type
    y = Ref(...)  # fixed value but the compiler should only see the type, no constprop
    z = ...  # otherwise
end
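
Applied to this thread's example, that could look like the following sketch (rand keeps the values unknown to the compiler, so nothing can be constant-folded):

@btime d(x, y) setup=begin
    d = WeightedEuclidean(SVector(1.0, 0.25, 1.0/9.0))
    x = rand(SVector{3,Float64})  # fresh random value; type known, value not
    y = rand(SVector{3,Float64})
end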

Júlio Hoffimann (Jun 13 2025 at 23:30):

Setup is a good suggestion. Will try to use it more.

