Puzzling CPU behavior: less cache misses and more stalls? · helpdesk (published)

Stream: helpdesk (published)

Topic: Puzzling CPU behavior: less cache misses and more stalls?

Takafumi Arakaki (tkf) (Nov 30 2021 at 23:59):

I'm playing with a cache-oblivious algorithm (single-threaded) and seeing some puzzling behavior where smaller base case sizes (x-axis) produce:

slower run-time (first row) which is bad and unexpected
smaller L1 cache misses (second row) which is good and expected
more backend stalls (third row)

I guess the wall clock and the backend stall measurements agree but I don't know why.

Any idea how to debug/fix this?

image.png

Takafumi Arakaki (tkf) (Dec 01 2021 at 00:00):

The algorithm I'm playing with is a cache-oblivious implementation of a stencil. You can find the code here: https://github.com/tkf/ToyStencils.jl/blob/main/src/trapezoid.jl If you are curious, you can run the benchmark with make at the root of the repository.

Takafumi Arakaki (tkf) (Dec 01 2021 at 00:04):

More plots of perf output https://vega.github.io/editor/#/gist/b71b202a14e0b7abd1518ab3339673aa/plot.vegalite

Takafumi Arakaki (tkf) (Dec 01 2021 at 00:06):

@chriselrod @Mason Protter I wonder if you guys saw something like this?

chriselrod (Dec 01 2021 at 07:36):

I haven't run or looked at the code, but my assumption would be that smaller blocks at one level require more passes over the data at the next level. E.g., if you have 84 iterations, a block size of 7 would mean 12 passes, while a block size of 12 would mean only 7 repetitions.
Blocking == reuse, and larger blocks yield more reuse.
Of course, you'll hit a limit with cache sizes. When a block size gets large enough to start leading to cache evictions, then reuse (of the memory movement) starts to plummet.
So there's some balance involved, but typically you'd want blocks to make use of most of the cache size.

Although, I don't have much experience with cache oblivious algorithms, for which this probably rings less true.

Takafumi Arakaki (tkf) (Dec 02 2021 at 04:38):

When the block size (base case size) goes from 2^12 to 2^11, I see big a drop in the L1 miss. It's consistent with this machine's L1 size 32 KiB = 2^15 bytes = 2^12 Float64s. I need two blocks for one base case so it makes sense that 2^11 is just enough. So, I think it's consistent with your cache eviction argument? The puzzle is that the time is worse in base size = 2^11 than in 2^12.

Last updated: Aug 14 2025 at 04:51 UTC