I'm playing with a cache-oblivious algorithm (single-threaded) and seeing some puzzling behavior: smaller base case sizes (the x-axis in the plots linked below) produce worse wall-clock times and more backend stalls.
I guess the wall-clock and the backend-stall measurements agree with each other, but I don't know why this happens.
Any idea how to debug/fix this?
The algorithm I'm playing with is a cache-oblivious implementation of a stencil. You can find the code here: https://github.com/tkf/ToyStencils.jl/blob/main/src/trapezoid.jl
If you are curious, you can run the benchmark with make at the root of the repository.
More plots of the perf output: https://vega.github.io/editor/#/gist/b71b202a14e0b7abd1518ab3339673aa/plot.vegalite
@chriselrod @Mason Protter I wonder if you guys saw something like this?
I haven't run or looked at the code, but my assumption would be that smaller blocks at one level require more passes over the data at the next level. E.g., if you have 84 iterations, a block size of 7 would mean 12 passes, while a block size of 12 would mean only 7 passes.
Blocking == reuse, and larger blocks yield more reuse.
Of course, you'll hit a limit with cache sizes. When a block size gets large enough to start causing cache evictions, reuse (and hence the savings in memory movement) starts to plummet.
So there's some balance involved, but typically you'd want blocks to make use of most of the cache size.
Although I don't have much experience with cache-oblivious algorithms, for which this probably rings less true.
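To put the passes argument above in concrete terms, here is a tiny Julia sketch (the 84/7/12 numbers are just the example from that message; `cld` is ceiling division):

```julia
# Number of passes over the reused data at the next level, given `n` outer
# iterations split into blocks of size `b` (a partial block still costs a
# full pass, hence ceiling division).
passes(n, b) = cld(n, b)

passes(84, 7)   # 12 passes with small blocks
passes(84, 12)  #  7 passes with larger blocks, i.e. less traffic over the reused data
```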
When the block size (base case size) goes from 2^12 to 2^11, I see a big drop in the L1 misses. It's consistent with this machine's L1 size: 32 KiB = 2^15 bytes = 2^12 Float64s. I need two blocks for one base case, so it makes sense that 2^11 is just enough. So I think it's consistent with your cache-eviction argument? The puzzle is that the time is worse at base size = 2^11 than at 2^12.
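For reference, the arithmetic behind that L1 argument as a Julia sketch (assuming a 32 KiB L1d and that one base case touches two blocks of Float64s at once, as described above):

```julia
l1_bytes = 32 * 1024                   # 32 KiB L1d on this machine
l1_f64   = l1_bytes ÷ sizeof(Float64)  # 2^12 Float64s fit in L1

for b in (2^12, 2^11)
    ws = 2b                            # working set: two blocks per base case
    println("base size = 2^", Int(log2(b)), ": working set = ", ws,
            " Float64s -> ", ws <= l1_f64 ? "fits in L1" : "exceeds L1")
end
```

So 2^12 exceeds L1 while 2^11 just fits, which matches the drop in L1 misses but not the wall-clock time.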