perf(weave): env-gate calls_merged heavy-LIKE ngram path + add attributes_dump index#6891
Draft
gtarpenning wants to merge 3 commits into
Draft
Conversation
…ED_HEAVY_INDEXES Adds attributes_dump to migration 031 (so all four heavy dumps have the ngrambf_v1(5, 65536, 3, 0) GRANULARITY 8 skip index) and gates the bloom-eligible candidate-CTE wiring on a new env flag so the new path can roll out per-cluster without changing query SQL until the operator opts in. The migration is idempotent and safe to apply ahead of the flag; only the query builder behavior is flag-controlled. With the flag off the LIKE optimization emits `<col> LIKE ?` (matching pre-bloom behavior) and `_build_filter_candidate_ids_cte_sql` returns None. With the flag on the LIKE emits `ifNull(<col>, '') LIKE ?` so the index expression matches and the candidate CTE pre-narrows via the bloom. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…py; gate strict-form computation Two follow-ups from review of #6891: 1. The autouse fixture defaulting WF_CALLS_MERGED_HEAVY_INDEXES=true moves from tests/trace_server/query_builder/conftest.py to tests/trace_server/conftest.py. The narrower scope missed test_filter_candidate_cte_consistency_with_unmerged_parts in test_calls_complete.py, which would have silently fallen into the env-off (no-CTE) path. With the fixture hoisted, every trace_server test exercises the bloom-eligible candidate-CTE path uniformly. 2. process_query_to_optimization_sql now gates the strict-form computation on the same env flag as _build_filter_candidate_ids_cte_sql. When the flag is off the CTE builder returns None before consuming strict_result_sql, so computing it was wasted work. The two gates sit in different files; keeping them on the same flag means future readers reach the right conclusion without tracing the data flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=95e5ec239a20863bc5261274b41a56b9aa4e9e0d |
Per review feedback. Inline comments stay <= 2 lines; rationale belongs in the PR description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Builds on #6695 (the heavy-LIKE candidate-CTE wiring) and on the parameter sweep spec at
~/Desktop/specs/attributes-dump-ngram-index.md.Summary
idx_attributes_dump_ngram ifNull(attributes_dump, '') TYPE ngrambf_v1(5, 65536, 3, 0) GRANULARITY 8. Other three columns (inputs/output/summary) already covered by perf(weave): wrap heavy-field LIKE filters in ifNull() and add ngram index #6695.WF_CALLS_MERGED_HEAVY_INDEXES. When unset the LIKE optimization falls back to the pre-bloom shape (<col> LIKE ?plus outer OR-IS-NULL, no candidate CTE), so the migration is safe to apply ahead of the flag and operators can flip it per-cluster after the index materializes on enough parts to be useful.optimization_builder._create_like_conditionemitsifNull(<col>, '')only when the flag is on; otherwise emits raw<table>.<col>.calls_query_builder._build_filter_candidate_ids_cte_sqlreturnsNonewhen the flag is off, dropping the candidate-CTE entirely.Performance
Cold-from-disk numbers vs no-index baseline, measured on a realistic 2M-row LLM-shaped corpus (18.9 GiB compressed
attributes_dump, 8 GiB cgroup so the column does NOT fit in OS page cache). Bench harness atcore/.../scripts/skip_index_bench/(genv3 + verify_winners_v3); REPORT in the same dir.ngrambf_v1(5, 65536, 3, 0) GRANULARITY 8(shipped here)Projection to a representative prod shard (~2 TiB compressed
attributes_dump): ~4 GiB index per column, ~17 GiB across all four heavy dumps. Default mark_cache_size 5 GiB; tune up if needed.Testing
tests/trace_server/query_builder/conftest.pydefaultsWF_CALLS_MERGED_HEAVY_INDEXES=truefor query-builder tests so the SQL assertions pin the post-flag shape.test_calls_query_with_like_optimization_env_offexercises the off-path: no candidate CTE, raw<col> LIKE ?, OR-IS-NULL retained.nox --no-install -e "tests-3.12(shard='trace_server')" -- tests/trace_server/query_builder/green locally (442 passed).Rollout
MATERIALIZE INDEX idx_<col>_ngramper column on hot ranges.WF_CALLS_MERGED_HEAVY_INDEXES=trueper cluster. EXPLAIN INDEXES on a known-bad customer query to confirmGranules: X/YwithX << Yon the candidate-CTE inner.Spec:
Desktop/specs/attributes-dump-ngram-index.md§1.3 (Phase-1 realistic-corpus bench), §4.5 (Unified candidate-CTE composition with #6764).