
Add builtin data structure benchmark framework for booster #4139

Draft
Stevengre wants to merge 3 commits into master from benchmark-framework

Conversation

@Stevengre

Adds a benchmark framework for profiling booster's builtin K data structure operations (KMap, KSet, KList), including a pipeline benchmark for matchMaps.

What's included:

  • Booster.Benchmark.Data / Booster.Benchmark.Ops: generate synthetic K terms and benchmark operations (lookup, insert, update, remove, union, intersection, etc.) at various sizes (10–50000)
  • booster-bench executable using tasty-bench with CSV output
  • Pipeline benchmarks: matchMaps, term ordering (Ord vs hash-first), substitution application, and a full single-rule rewrite pipeline
  • Unit tests validating generated data structures
  • scripts/run-bench-profile.sh for profiling with GHC RTS options
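As an illustrative sketch of what such a profiling wrapper does, it essentially splices RTS flags into the cabal invocation. The flag values here (`-p` for a time profile, `-s` for runtime stats) and the CSV path are example choices, not necessarily the script's actual defaults:

```shell
# Example only: rts_flags and the CSV path are illustrative values,
# not necessarily what run-bench-profile.sh uses by default.
rts_flags=(-p -s)
benchmark_options="--csv results.csv"
cmd=(cabal bench booster-bench --flags profiling)
if [[ -n "$benchmark_options" ]]; then
  cmd+=("--benchmark-options=$benchmark_options")
fi
# Pass the RTS flags through to the benchmark executable.
cmd+=(-- +RTS "${rts_flags[@]}" -RTS)
printf '%s\n' "${cmd[*]}"
```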

@Stevengre
Author

Stevengre commented Feb 25, 2026

Benchmark Results (GHC 9.6.5)

KMap Operations

| Size | lookup-existing | lookup-missing | size | insert | update | remove | keys | values | in_keys | sortAndDedup |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 85.9 ns | 79.7 ns | 151 ns | 1.47 μs | 1.43 μs | 1.30 μs | 758 ns | 734 ns | 159 ns | 2.65 μs |
| 100 | 340 ns | 522 ns | 231 ns | 12.7 μs | 13.1 μs | 12.7 μs | 6.73 μs | 7.22 μs | 1.13 μs | 28.2 μs |
| 1000 | 3.85 μs | 4.85 μs | 1.32 μs | 346 μs | 145 μs | 150 μs | 69.0 μs | 79.2 μs | 15.0 μs | 488 μs |
| 5000 | 13.9 μs | 29.7 μs | 12.6 μs | 2.12 ms | 798 μs | 759 μs | 359 μs | 370 μs | 56.7 μs | 2.23 ms |
| 10000 | 20.2 μs | 51.4 μs | 34.0 μs | 7.88 ms | 2.18 ms | 1.95 ms | 818 μs | 1.13 ms | 93.3 μs | 7.53 ms |

KSet Operations

| Size | in | size | difference | union | intersection | sortAndDedup |
|---|---|---|---|---|---|---|
| 10 | 40.8 ns | 13.6 ns | 1.25 μs | 1.82 μs | 1.13 μs | 1.79 μs |
| 100 | 286 ns | 82.2 ns | 10.5 μs | 14.3 μs | 10.5 μs | 20.8 μs |
| 1000 | 1.55 μs | 2.07 μs | 122 μs | 154 μs | 110 μs | 270 μs |
| 5000 | 19.7 μs | 8.69 μs | 596 μs | 966 μs | 578 μs | 1.59 ms |
| 10000 | 37.1 μs | 20.3 μs | 1.58 ms | 2.80 ms | 1.88 ms | 3.94 ms |

KList Operations

| Size | get-0 | get-middle | get-last | size | range | concat |
|---|---|---|---|---|---|---|
| 10 | 635 ns | 670 ns | 597 ns | 133 ns | 1.71 μs | 1.33 μs |
| 100 | 708 ns | 827 ns | 873 ns | 216 ns | 4.73 μs | 10.4 μs |
| 1000 | 1.70 μs | 2.30 μs | 3.23 μs | 1.21 μs | 31.7 μs | 108 μs |
| 5000 | 5.44 μs | 8.20 μs | 10.5 μs | 5.01 μs | 154 μs | 572 μs |
| 10000 | 10.8 μs | 14.8 μs | 21.2 μs | 11.3 μs | 361 μs | 1.41 ms |
| 50000 | 48.4 μs | 72.6 μs | 123 μs | 59.3 μs | | |

Pipeline Benchmarks

| Benchmark | Mean |
|---|---|
| matchMaps (size 10) | 6.99 μs |
| matchMaps (size 100) | 175 μs |
| matchMaps (size 1000) | 24.3 ms |
| ord-term derived (all sizes) | ~14.5 ns |
| ord-term hash-first (all sizes) | ~6.6 ns |
| substitution unchanged-keys (size 1000) | 708 μs |
| substitution changed-keys (size 1000) | 881 μs |
| substitution unchanged-keys (size 5000) | 4.97 ms |
| substitution changed-keys (size 5000) | 7.55 ms |
| full-single-rule-pipeline | 10.9 μs |

Key observations:

  • matchMaps at size 1000 takes 24.3 ms — a potential optimization target
  • Hash-first term comparison is ~2x faster than derived Ord across all sizes
  • KMap insert shows super-linear growth (likely due to normalization), while lookup scales well
  • Substitution with changed keys is ~1.5x slower than unchanged keys at size 5000

@Stevengre Stevengre marked this pull request as draft February 25, 2026 13:37
@ehildenb
Member

@jberthold what do you think of this benchmark? It looks good to me, any changes/additions you would make to it? At least, having it as a test in the repo seems like a good thing.

@jberthold left a comment (Member)

I am generally not opposed to adding benchmarks like these to the code.
There are some details that should be corrected, specifically the test data should definitely not be part of the library code. Other comments are probably minor.

Overall, I am somehow not convinced that optimising the data structures will provide much performance gain. I would imagine the typical map at runtime to not have more than 100 elements (maximum). Maybe the effort is better spent on improving other areas where booster is known to be deficient, such as the implies endpoint.

Comment on lines +47 to +59
```bash
if command -v cabal >/dev/null 2>&1; then
    cmd=(cabal bench booster-bench --flags profiling)
    if [[ -n "$benchmark_options" ]]; then
        cmd+=("--benchmark-options=$benchmark_options")
    fi
    cmd+=(-- +RTS)
    cmd+=("${rts_flags[@]}")
    cmd+=(-RTS)
elif command -v stack >/dev/null 2>&1; then
    bench_args=()
    if [[ -n "$benchmark_options" ]]; then
        bench_args+=("$benchmark_options")
    fi
```
Personally I would give users a choice whether to use cabal or stack (and I would probably run with stack), or at least test for stack first (because I would prefer using stack) but that may be just me.
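A minimal sketch of that stack-first probing; the `BUILDER` override variable is a hypothetical addition for illustration, not something the PR's script provides:

```shell
# Sketch only: prefer an explicit $BUILDER override, then stack, then cabal.
# The BUILDER variable is hypothetical, not part of the PR's script.
choose_builder() {
  if [[ -n "${BUILDER:-}" ]]; then echo "$BUILDER"; return; fi
  if command -v stack >/dev/null 2>&1; then echo stack; return; fi
  if command -v cabal >/dev/null 2>&1; then echo cabal; return; fi
  echo none
}
BUILDER=stack
echo "builder: $(choose_builder)"
```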

Comment on lines +251 to +254
```yaml
ghc-options:
  - -rtsopts
  - -threaded
  - -with-rtsopts=-N
```
This is an unrelated change. The predicates-integration test suite does not need multi-threading.

I think it would be much better to keep these two modules Data and Ops outside of the library. They are all about testing and should not be part of the main code.

```haskell
import Booster.Pattern.Base

benchmarkSizes :: [Int]
benchmarkSizes = [10, 100, 1000, 5000, 10000, 50000]
```
Maybe these sizes should be made parameters (via environment variable or argument to the benchmark program). This would allow the benchmarks to be run as unit tests at a reasonable time/scale, so their code doesn't become stale without us noticing.
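For illustration, a driver honouring such a parameter might look like this in shell; the `BENCH_SIZES` variable name and its space-separated format are assumptions, not part of the PR:

```shell
# Hypothetical: BENCH_SIZES overrides the default size list so CI can run
# a fast subset. The variable name and format are assumptions, not in the PR.
unset BENCH_SIZES   # start from the defaults for this demo
sizes="${BENCH_SIZES:-10 100 1000 5000 10000 50000}"
count=0
for s in $sizes; do
  count=$((count + 1))
done
echo "running booster-bench over $count sizes: $sizes"
```

A CI job could then set, say, `BENCH_SIZES="10 100"` to keep the run short.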

Comment on lines +36 to +40
```haskell
[ bench "insert" $ nf (\(m, k, v) -> runMapUpdate m k v) (mapTerm, insertKey, insertValue)
, bench "update" $ nf (\(m, k, v) -> runMapUpdate m k v) (mapTerm, existingKey, updateValue)
, bench "remove" $ nf (\(m, k) -> runMapRemove m k) (mapTerm, existingKey)
, bench "keys" $ nf runMapKeys mapTerm
, bench "values" $ nf runMapValues mapTerm
```
Some of these benchmarks measure the generation of the large test data structures (such as mapTerm) in addition to the operation they are trying to measure. It might be better to evaluate the data structure in full beforehand (using force), which may require an NFData instance.

See the long comment on the nf function in Test.Tasty.Bench for more information.
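A self-contained sketch of that suggestion, using `Data.Map` as a stand-in for booster's KMap so it compiles without the booster library: `env` evaluates its argument to normal form before the benchmark runs, so `nf` times only the lookup, not the construction of the map.

```haskell
-- Sketch only: Data.Map stands in for booster's KMap here.
-- `env` forces the map to normal form up front, so the timed
-- function measures the lookup alone, not map construction.
import qualified Data.Map.Strict as Map
import Test.Tasty.Bench (bench, defaultMain, env, nf)

main :: IO ()
main =
  defaultMain
    [ env (pure bigMap) $ \m ->
        bench "lookup-existing" $ nf (Map.lookup (5000 :: Int)) m
    ]
  where
    bigMap :: Map.Map Int Int
    bigMap = Map.fromList [(i, i) | i <- [1 .. 10000]]
```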

@jberthold
Member

jberthold commented Mar 2, 2026

The hpack error in CI comes from using a newer hpack version locally. Probably best to re-run `hpack ./booster; hpack ./dev-tools` with the hpack provided by the nix shell (`nix develop .#cabal`).
(Or edit the comment near the top to claim it was hpack version 0.36.0.)

@Stevengre
Author

Stevengre commented Mar 3, 2026

> Overall, I am somehow not convinced that optimising the data structures will provide much performance gain. I would imagine the typical map at runtime to not have more than 100 elements (maximum). Maybe the effort is better spent on improving other areas where booster is known to be deficient, such as the implies endpoint.

Hi @jberthold, thanks for reviewing it! Could you tell me more about these areas? Which has the highest priority to improve, and are there any issues tracking these gaps?

@jberthold
Member

jberthold commented Mar 3, 2026

> Overall, I am somehow not convinced that optimising the data structures will provide much performance gain. I would imagine the typical map at runtime to not have more than 100 elements (maximum). Maybe the effort is better spent on improving other areas where booster is known to be deficient, such as the implies endpoint.

> Hi @jberthold, thanks for reviewing it! Could you tell me more about these areas? Which has the highest priority to improve, and are there any issues tracking these gaps?

We have a few GitHub issues around the implies endpoint in booster (and the code in Booster.Pattern.Implies could be refactored for readability, IMHO).
The issues are #3854 for improvements, plus #3857, #3605 and #3990 recording three specific issues to analyse (these may have been resolved in the meantime; they need reproducing first).

I also have a branch jb/incremental-improvements-for-implies where I started making changes but never got it ready for a PR.

It is probably also a good idea to go through the board to close stale issues and to find work.

@Stevengre
Author

> Probably also a good idea to go through the board and try to close stale issues as well as find work.

That makes a lot of sense. Maybe that's a good starting point, and it would make the status of the backend easier to track.

> We have a few github issues around the implies endpoint in booster (and the code Booster.Pattern.Implies could be refactored for readability IMHO).
> The issues are #3854 for improvements and #3857, #3605 and #3990 recording three specific issues to analyse (these specific issues may have been resolved meanwhile, they need reproducing first).
>
> I also have a branch jb/incremental-improvements-for-implies where I started making changes but never got to get it ready for a PR.

Thanks for sharing these! I'll read these issues first.

@ehildenb
Member

ehildenb commented Mar 3, 2026

I still would like to see this benchmark merged, because I think it is useful. So if you have time @Stevengre to address the changes that Jost pointed out, that would be great!

@Stevengre
Author

Sure, I'd like to refine this PR when I have time.
