feat(optimizations): Luce Spark — calibrated hot/cold expert residency#345
…ency)

Spark is the placement + caching layer on top of the merged hot/cold MoE offload engine. Standalone tooling to calibrate expert placement from real agent traffic, validate on held-out sessions, and (research) train a pre-gate predictor from routing traces.

Laguna-XS.2 Q4_K_M (33B total MoE) / RTX 3090, held-out Claude Code sessions: calibration lifts naive offload 66 -> 81 tok/s (cold-hit 36% -> 6.6%); the bounded expert cache reaches ~88 tok/s at ~0 cold, 14.6 GiB peak (vs 18.8 GiB to hold the full model).

Engine (cache, gpu_remap, trace) lives in server/.

Co-Authored-By: WOZCODE <contact@withwoz.com>
9 issues found across 10 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="optimizations/spark/spark/tokenizer.py">
<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>
<file name="optimizations/spark/spark/extract_sessions.py">
<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>
<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>
<file name="optimizations/spark/README.md">
<violation number="1" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
</file>
<file name="optimizations/spark/spark/validate.py">
<violation number="1" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>
</file>
<file name="optimizations/spark/spark/train_pregate.py">
<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.
(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.)</violation>
<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>
def list_str(fl):
    return [bytes(fl.parts[d]).decode("utf-8", errors="replace") for d in fl.data]

model = bytes(f["tokenizer.ggml.model"].parts[f["tokenizer.ggml.model"].data[0]]).decode()
P2: Unguarded GGUF field access on f["tokenizer.ggml.model"] and f["tokenizer.ggml.tokens"]. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/tokenizer.py, line 24:
<comment>Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</comment>
<file context>
@@ -0,0 +1,56 @@
+ def list_str(fl):
+ return [bytes(fl.parts[d]).decode("utf-8", errors="replace") for d in fl.data]
+
+ model = bytes(f["tokenizer.ggml.model"].parts[f["tokenizer.ggml.model"].data[0]]).decode()
+ if model != "gpt2":
+ raise SystemExit(f"only gpt2 byte-level BPE supported, gguf says model={model!r}")
</file context>
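One way to turn the unguarded lookups into a clear error is a small helper that checks for the key and for a non-empty data array before decoding. This is a minimal sketch, not the PR's fix: `read_str_field` is a hypothetical name, and it assumes `f` is a dict-like mapping whose values expose `.parts` and `.data` as in the snippet above. The `fake_fields` object is a stand-in so the sketch runs without a GGUF file.

```python
from types import SimpleNamespace


def read_str_field(fields, key):
    """Return the first string stored under `key`, or exit with a clear message."""
    field = fields.get(key)
    if field is None:
        raise SystemExit(f"GGUF is missing required field {key!r}")
    if len(field.data) == 0:
        raise SystemExit(f"GGUF field {key!r} is present but has no data")
    # Same decoding as the original one-liner, now behind the two checks.
    return bytes(field.parts[field.data[0]]).decode("utf-8", errors="replace")


# Hypothetical stand-in for the reader's field mapping (assumed shape).
fake_fields = {
    "tokenizer.ggml.model": SimpleNamespace(parts=[b"gpt2"], data=[0]),
    "tokenizer.ggml.tokens": SimpleNamespace(parts=[], data=[]),
}

model = read_str_field(fake_fields, "tokenizer.ggml.model")
```

With a helper like this, a missing or empty field ends the run with the field name in the message instead of a bare `KeyError` or `IndexError`.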
train, test = [], []
for f in files:
    # split by session-path hash so a whole session is train xor test
    bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train
P2: hashlib.md5() without usedforsecurity=False will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/extract_sessions.py, line 79:
<comment>`hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</comment>
<file context>
@@ -0,0 +1,115 @@
+ train, test = [], []
+ for f in files:
+ # split by session-path hash so a whole session is train xor test
+ bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train
+ sess = []
+ try:
</file context>
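Since the hash here only buckets sessions, the call can declare that it is not security-relevant. The sketch below assumes Python 3.9 or newer, where `hashlib.md5` accepts the `usedforsecurity` keyword. `session_bucket` is a hypothetical helper name, and it returns a label rather than the list objects used in the original loop.

```python
import hashlib


def session_bucket(path, test_frac):
    """Assign a whole session to 'test' or 'train' from a hash of its path."""
    # usedforsecurity=False marks this as a non-cryptographic use of MD5,
    # which is what FIPS-enforcing builds require before allowing the call.
    digest = hashlib.md5(str(path).encode(), usedforsecurity=False).hexdigest()
    return "test" if int(digest, 16) % test_frac == 0 else "train"


bucket = session_bucket("sessions/example.jsonl", 10)
```

The split stays deterministic per path, so a session never moves between train and test across runs.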
cd optimizations/spark
uv sync # tokenizers (+ gguf/torch optional extras)
P2: uv sync does not install optional dependencies (gguf, numpy) required by the tokenizer step. The comment on the uv sync line is misleading — it lists gguf/torch optional extras but the command does not install them. Step 0 (python -m spark.tokenizer) will fail with ImportError because gguf and numpy are in the tokenizer extra, not the base dependency set.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/README.md, line 104:
<comment>`uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</comment>
<file context>
@@ -0,0 +1,187 @@
+
+```bash
+cd optimizations/spark
+uv sync # tokenizers (+ gguf/torch optional extras)
+
+# 0. one tokenizer, extracted from the GGUF (gpt2 byte-level BPE)
</file context>
Suggested change:
- uv sync # tokenizers (+ gguf/torch optional extras)
+ uv sync --all-extras # tokenizers (+ gguf/torch optional extras)
train, test = [], []
for f in files:
    # split by session-path hash so a whole session is train xor test
    bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train
P2: --test-frac is not validated, so --test-frac 0 crashes with ZeroDivisionError during split assignment.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/extract_sessions.py, line 79:
<comment>`--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</comment>
<file context>
@@ -0,0 +1,115 @@
+ train, test = [], []
+ for f in files:
+ # split by session-path hash so a whole session is train xor test
+ bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train
+ sess = []
+ try:
</file context>
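The divisor can be rejected at argument-parsing time, before any session is hashed. A minimal sketch, assuming the script uses `argparse`; the `positive_int` helper and the default of 10 are illustrative and not taken from the PR.

```python
import argparse


def positive_int(text):
    """argparse type: accept only integers >= 1."""
    value = int(text)  # a ValueError here is reported by argparse as an invalid value
    if value < 1:
        raise argparse.ArgumentTypeError(f"must be >= 1, got {value}")
    return value


parser = argparse.ArgumentParser()
# Default is a placeholder for illustration only.
parser.add_argument("--test-frac", type=positive_int, default=10)

args = parser.parse_args(["--test-frac", "5"])
```

With this, `--test-frac 0` produces a usage error and exit status 2 instead of a `ZeroDivisionError` in the split loop.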
proc.kill()
raise SystemExit("ready timeout")

pp = Path("/tmp/spark_val_chunk.bin")
P2: Using fixed /tmp filenames causes cross-process collisions when multiple validation runs execute concurrently.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 95:
<comment>Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</comment>
<file context>
@@ -0,0 +1,142 @@
+ proc.kill()
+ raise SystemExit("ready timeout")
+
+ pp = Path("/tmp/spark_val_chunk.bin")
+ op = Path("/tmp/spark_val_out.bin")
+ toks = []
</file context>
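A per-run scratch directory avoids the collision without changing the file names the daemon sees. The sketch below uses `tempfile.TemporaryDirectory` from the standard library; `scratch_paths` is a hypothetical helper, and the file names are copied from the snippet above.

```python
import tempfile
from pathlib import Path


def scratch_paths(tmpdir):
    """Return (chunk_path, out_path) inside a run-private directory."""
    root = Path(tmpdir)
    return root / "spark_val_chunk.bin", root / "spark_val_out.bin"


# Each run gets its own directory, removed automatically on exit from the block.
with tempfile.TemporaryDirectory(prefix="spark_val_") as tmpdir:
    pp, op = scratch_paths(tmpdir)
    pp.write_bytes(b"\x00" * 4)  # placeholder payload
    chunk_exists = pp.exists()
```

Two concurrent validation runs then write to different directories, so neither can overwrite the other's chunk or output file.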
import torch.nn as nn
H, E = args.n_embd, args.n_expert
dt = np.dtype([("layer", "<i2"), ("nsel", "<i2"), ("sel", "<i4", (8,)), ("hid", "<f4", (H,))])
arr = np.fromfile(args.trace, dtype=dt)
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before np.fromfile, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.
(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/train_pregate.py, line 42:
<comment>The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.
(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.)</comment>
<file context>
@@ -0,0 +1,85 @@
+ import torch.nn as nn
+ H, E = args.n_embd, args.n_expert
+ dt = np.dtype([("layer", "<i2"), ("nsel", "<i2"), ("sel", "<i4", (8,)), ("hid", "<f4", (H,))])
+ arr = np.fromfile(args.trace, dtype=dt)
+ print(f"records={len(arr)}", flush=True)
+ dev = "cuda" if torch.cuda.is_available() else "cpu"
</file context>
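The size check can run before the file is handed to `np.fromfile`. The sketch below is standard-library only and computes the record size by hand from the dtype in the snippet: 2 + 2 + 8×4 + 4×H bytes, which assumes the structured dtype is packed (NumPy's default when `align` is not set). In the real script `dt.itemsize` would give the same number. The function names are hypothetical.

```python
import os


def trace_record_size(n_embd):
    """Bytes per record: layer(i2) + nsel(i2) + sel(8 x i4) + hid(n_embd x f4)."""
    return 2 + 2 + 8 * 4 + 4 * n_embd


def count_trace_records(path, n_embd):
    """Return the record count, or exit if the file is empty or truncated."""
    rec = trace_record_size(n_embd)
    size = os.path.getsize(path)
    if size == 0 or size % rec != 0:
        raise SystemExit(
            f"{path}: size {size} is not a positive multiple of record size {rec}; "
            "trace looks truncated or was written with a different n_embd"
        )
    return size // rec
```

A mismatch also catches the case where `--n-embd` does not match the model that produced the trace, since the record size depends on it.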
dev = "cuda" if torch.cuda.is_available() else "cpu"

agg = {8: [], 16: [], 24: []}
for L in range(1, args.n_layer):
P3: The per-layer training loop skips layer 0 (range(1, args.n_layer)), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/train_pregate.py, line 47:
<comment>The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</comment>
<file context>
@@ -0,0 +1,85 @@
+ dev = "cuda" if torch.cuda.is_available() else "cpu"
+
+ agg = {8: [], 16: [], 24: []}
+ for L in range(1, args.n_layer):
+ idx = np.where(arr["layer"] == L)[0][:args.max_per_layer]
+ if len(idx) < 400:
</file context>
A blocking proc.stdout.readline() returns only on a new line or EOF, so the time-based timeout check ran *after* it: a daemon that goes silent (stalled model load, deadlock, disk stall) blocked the caller forever and the timeout never fired. This broke unattended calibration/validation runs.

Move the protocol into spark/_daemon.py: a reader thread pumps stdout into a queue and callers wait with queue.get(timeout=...), so a read can never outlive its deadline. wait_ready() and the per-chunk generate loops in calibrate.py and validate.py now honor --ready-timeout / --gen-timeout and kill the daemon on stall. (select() + buffered readline would miss buffered data; the reader thread does the blocking readline itself.)

Verified: silent daemon times out in ~1s, dead daemon returns immediately, happy path unchanged.

Co-Authored-By: WOZCODE <contact@withwoz.com>
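The reader-thread pattern described in this commit can be sketched as follows. This is an illustration of the technique, not the contents of spark/_daemon.py: `LineReader` is a hypothetical class, and only the `DaemonTimeout` name and the `None` EOF sentinel are taken from the snippets quoted in this thread.

```python
import queue
import threading


class DaemonTimeout(Exception):
    """Raised when no line arrives before the deadline."""


class LineReader:
    def __init__(self, stream):
        self._q = queue.Queue()
        # The thread owns the blocking read; callers only ever wait on the queue.
        self._thread = threading.Thread(target=self._pump, args=(stream,), daemon=True)
        self._thread.start()

    def _pump(self, stream):
        for line in stream:
            self._q.put(line.rstrip("\n"))
        self._q.put(None)  # EOF sentinel

    def readline(self, timeout):
        """Return the next line, None on EOF, or raise DaemonTimeout."""
        try:
            return self._q.get(timeout=max(timeout, 0.0))
        except queue.Empty:
            raise DaemonTimeout(f"no output within {timeout}s") from None
```

A caller would wrap `proc.stdout` in a `LineReader` and, on `DaemonTimeout`, kill the process. Because the wait happens in `queue.get`, a silent daemon costs at most the timeout rather than blocking forever.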
5 issues found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="optimizations/spark/spark/validate.py">
<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>
<violation number="2" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>
<file name="optimizations/spark/spark/calibrate.py">
<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>
<file name="optimizations/spark/spark/_daemon.py">
<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>
<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.
(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.)</violation>
</file>
if len(ids) < 8:
    continue
write_counted_i32(pp, ids)
reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 91:
<comment>daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</comment>
<file context>
@@ -100,30 +88,16 @@ def drain():
- break
- if okl and okl.startswith("ok "):
- m = re.search(r"decode_tok_s=([0-9.]+)", okl)
+ reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)
+ if reply is None:
+ print("daemon stalled/closed; stopping early")
</file context>
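One way to surface the stall to wrappers is to exit with status 124, the code GNU `timeout` uses when a command times out. The sketch below assumes, as in the diff above, that a `None` reply means the daemon stalled or closed. `run_chunks` and its `request` callable are hypothetical stand-ins for the per-chunk loop and `daemon.request`.

```python
import sys

EXIT_TIMEOUT = 124  # status GNU timeout reports when the command timed out


def run_chunks(chunks, request):
    """Process chunks in order; exit non-zero if the daemon stops answering."""
    done = 0
    for chunk in chunks:
        reply = request(chunk)
        if reply is None:
            print(f"daemon stalled/closed after {done} chunks", file=sys.stderr)
            raise SystemExit(EXIT_TIMEOUT)
        done += 1
    return done


completed = run_chunks(["a", "b"], lambda chunk: "ok decode_tok_s=80.0")
```

If the caller still wants to print partial statistics, it can do so before raising, as long as the final status is non-zero.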
if toks:
    print(f"decode tok/s: mean={statistics.mean(toks):.1f} median={statistics.median(toks):.1f} "
          f"over {len(toks)} chunks")
colds = [float(m.group(1)) for l in daemon.stderr_lines
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 109:
<comment>daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</comment>
<file context>
@@ -132,7 +106,7 @@ def drain():
print(f"decode tok/s: mean={statistics.mean(toks):.1f} median={statistics.median(toks):.1f} "
f"over {len(toks)} chunks")
- colds = [float(m.group(1)) for l in prof
+ colds = [float(m.group(1)) for l in daemon.stderr_lines
for m in [re.search(r"cold_experts/tok=([0-9.]+)", l)] if m]
if colds:
</file context>
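Joining the pump thread before reading the collected lines closes the race described here. The sketch is a simplified, hypothetical collector, not the `_daemon.py` class. It assumes the daemon's stderr reaches EOF once the process has exited, so the join returns promptly; the regex is the one from the diff above.

```python
import io
import re
import threading


class StderrCollector:
    def __init__(self, stream):
        self.lines = []
        self._thread = threading.Thread(target=self._pump, args=(stream,), daemon=True)
        self._thread.start()

    def _pump(self, stream):
        for line in stream:
            self.lines.append(line.rstrip())

    def finish(self, timeout=5.0):
        """Wait for the pump to reach EOF (bounded), then return a snapshot."""
        self._thread.join(timeout)
        return list(self.lines)


collector = StderrCollector(io.StringIO("cold_experts/tok=1.5\ncold_experts/tok=2.5\n"))
colds = [float(m.group(1)) for l in collector.finish()
         for m in [re.search(r"cold_experts/tok=([0-9.]+)", l)] if m]
```

The bounded join keeps the caller from hanging if the stream never closes, at the cost of possibly missing lines in that case.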
if len(ids) < 8:
    continue
write_counted_i32(pp, ids)
reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/calibrate.py, line 78:
<comment>Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</comment>
<file context>
@@ -84,38 +75,18 @@ def main():
- if time.time() - ts > 120:
- break
- if not ok:
+ reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)
+ if reply is None:
+ print(f"[calib] daemon stalled/closed at chunk {i}; stopping", flush=True)
</file context>
    self._q.put(None)  # EOF sentinel

def _pump_stderr(self):
    for line in self.proc.stderr:
P2: _pump_stderr silently crashes on encoding errors — same UnicodeDecodeError risk as stdout, causing data loss for callers that read stderr_lines (e.g. validate.py extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/_daemon.py, line 44:
<comment>`_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</comment>
<file context>
@@ -0,0 +1,106 @@
+ self._q.put(None) # EOF sentinel
+
+ def _pump_stderr(self):
+ for line in self.proc.stderr:
+ self.stderr_lines.append(line.rstrip())
+
</file context>
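An alternative to wrapping the loop in try/except is to make decoding unable to fail in the first place. The sketch below assumes the pipe is read in binary mode and decodes each line with `errors="replace"`. Passing `errors="replace"` to `subprocess.Popen` in text mode should have the same effect. `pump_lines` is a hypothetical free function standing in for `_pump_stderr`.

```python
import io


def pump_lines(binary_stream, sink):
    """Append each line to `sink`, replacing undecodable bytes instead of raising."""
    for raw in binary_stream:
        sink.append(raw.decode("utf-8", errors="replace").rstrip())


lines = []
# The second line contains bytes that are not valid UTF-8.
pump_lines(io.BytesIO(b"loading\n\xff\xfe bad bytes\ncold_experts/tok=0.1\n"), lines)
```

The metric line after the malformed one is still collected, which is the property callers such as validate.py depend on.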
    line = self.readline(deadline - time.time())
except DaemonTimeout:
    self.kill()
    raise SystemExit(f"daemon did not become ready within {timeout}s (no output)")
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.
(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/_daemon.py, line 66:
<comment>Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.
(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.)</comment>
<file context>
@@ -0,0 +1,106 @@
+ line = self.readline(deadline - time.time())
+ except DaemonTimeout:
+ self.kill()
+ raise SystemExit(f"daemon did not become ready within {timeout}s (no output)")
+ if line is None:
+ raise SystemExit("daemon died before ready banner")
</file context>
Server side of Luce Spark, on top of the merged hybrid-offload engine:

- Bounded GPU expert cache: spare slots over-allocated on the hot expert stack + moe_hybrid_cache_swap_in (LRU). On a cold hit the expert is swapped into a spare slot and served through the unified GPU FFN, so cold-misses fall to ~0 in fixed VRAM after warmup. Gated by DFLASH_LAGUNA_EXPERT_CACHE / _CACHE_SLOTS / _GPU_REMAP.
- Swap-rebuild fix + clean profile flush: build_hybrid_storage_from_file() re-reads experts from the GGUF mmap (partial-load keeps no full expert tensors resident), used by both init and post-request swap (was asserting in ggml_backend_tensor_get). routing_stats_ now also allocates when swap is enabled, and the placement profile flushes after each hybrid generate so NEXT_PLACEMENT_OUT works without the swap path.
- Pre-gate trace capture (DFLASH_LAGUNA_PREGATE_TRACE), profile-gated, feeds the optimizations/spark pregate trainer.

Laguna-XS.2 Q4_K_M / RTX 3090, calibrated 60% + 32 cache slots: ~85-88 tok/s (cold ~0) at 14.6 GiB peak, vs 66 uniform / 111 all-GPU @18.8 GiB.

Built + smoke-tested against origin/main (3-way merge over the generate_impl rename).

Co-Authored-By: WOZCODE <contact@withwoz.com>
5 issues found across 6 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/moe_hybrid_storage.cpp">
<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>
<violation number="2" location="server/src/common/moe_hybrid_storage.cpp:507">
P0: `moe_hybrid_cache_swap_in` only handles non-fused gate/up/down tensors. For fused gate-up architectures (Laguna), `gate_hot`/`up_hot` are nullptr so the function always returns -1, making the bounded cache a complete no-op for the very architecture it targets.</violation>
</file>
<file name="server/src/common/moe_hybrid_ffn_eval.h">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>
<file name="server/src/laguna/laguna_backend.cpp">
<violation number="1" location="server/src/laguna/laguna_backend.cpp:169">
P3: Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</violation>
<violation number="2" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
</file>
const int cold_count = (int)dst.cold_expert_ids.size();
const int spare = (cold_count > 0 && cache_slots > 0)
    ? std::min(cache_slots, cold_count) : 0;
const int hot_alloc = hot_count + spare;
P2: Hot tensor allocation still depends on hot_count > 0, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_storage.cpp, line 377:
<comment>Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</comment>
<file context>
@@ -371,6 +372,13 @@ bool build_moe_hybrid_storage_from_file(
const int cold_count = (int)dst.cold_expert_ids.size();
+ const int spare = (cold_count > 0 && cache_slots > 0)
+ ? std::min(cache_slots, cold_count) : 0;
+ const int hot_alloc = hot_count + spare;
+ dst.hot_active = hot_count;
+ dst.cache_slots = spare;
</file context>
- int n_hot);
+ int n_hot,
+ bool gpu_remap = false,
+ int n_expert = 0);
P2: int n_expert = 0 default silently disables remapping when gpu_remap=true is set without n_expert. The n_expert > 0 guard in the implementation (moe_hybrid_ffn_eval.cpp:440) prevents a crash but silently produces a non-remapped graph, making gpu_remap=true a non-binding hint rather than a semantic guarantee. Future callers adding /*gpu_remap=*/true without also passing cfg.n_expert will get incorrect (non-remapped) behavior with no warning.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_ffn_eval.h, line 175:
<comment>`int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</comment>
<file context>
@@ -170,7 +170,9 @@ bool build_cached_hot_graph(
- int n_hot);
+ int n_hot,
+ bool gpu_remap = false,
+ int n_expert = 0);
// Build/rebuild cached cold FFN graph.
</file context>
for (int s = 0; s < req.n_gen; ++s) {
maybe_force_close(next_tok, s);
- if (next_tok == w_.eos_id || next_tok == w_.eos_chat_id) break;
+ if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
P2: DFLASH_IGNORE_EOS is interpreted by presence instead of value, so setting it to 0 still disables EOS stopping.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 297:
<comment>`DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</comment>
<file context>
@@ -283,7 +294,7 @@ GenerateResult LagunaBackend::generate_impl(const GenerateRequest & req,
for (int s = 0; s < req.n_gen; ++s) {
maybe_force_close(next_tok, s);
- if (next_tok == w_.eos_id || next_tok == w_.eos_chat_id) break;
+ if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
result.tokens.push_back(next_tok);
history.push_back(next_tok);
</file context>
Suggested change:
- if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
+ if (!([]{ const char * v = std::getenv("DFLASH_IGNORE_EOS"); return v && std::atoi(v) != 0; })() && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
dflash_server --spark: one self-tuning command for both MoE backends.

- Enables the bounded expert cache (auto working set) with sized slots (--spark-slots, default 32).
- Auto-loads <model>.gguf.spark.csv if present and keeps persisting it after each request from live routing (laguna via NEXT_PLACEMENT_OUT, qwen35moe via RUNTIME_STATS_OUT, which is the var that allocates its routing accumulator).
- Wires the expert cache into qwen35moe: cache_slots into its from-file storage + moe_hybrid_cache_swap_in before both host-partition loops in the pipelined decode (symmetric to laguna; swaps are auto-picked-up by the hot_local_by_global lookup, no graph surgery).

Verified end-to-end on RTX 3090: laguna and Qwen3.6-35B-A3B both write + reload the profile (source=hotness:...) and stay coherent under forced offload + cache, no crash.

dflash_server needs libcurl dev headers (find_package(CURL)). Built against origin/main.

Co-Authored-By: WOZCODE <contact@withwoz.com>
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="optimizations/spark/spark/tokenizer.py">
<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>
<file name="optimizations/spark/spark/extract_sessions.py">
<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>
<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>
<file name="optimizations/spark/README.md">
<violation number="1" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
</file>
<file name="optimizations/spark/spark/validate.py">
<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>
<violation number="2" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>
<violation number="3" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>
<file name="optimizations/spark/spark/train_pregate.py">
<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.
(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) [FEEDBACK_USED]</violation>
<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>
<file name="optimizations/spark/spark/calibrate.py">
<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>
<file name="optimizations/spark/spark/_daemon.py">
<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>
<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.
(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) [FEEDBACK_USED].</violation>
</file>
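A sketch of a stderr pump that cannot die on bad bytes, assuming the stream is opened in binary mode. It swaps the suggested try/except for lenient decoding (`errors="replace"`), and `pump_lines` / `start_pump` are hypothetical names rather than the helpers in `_daemon.py`.

```python
import threading


def pump_lines(stream, sink: list[str]) -> None:
    """Read a binary stream line by line, decoding leniently so bad bytes cannot raise."""
    for raw in iter(stream.readline, b""):
        sink.append(raw.decode("utf-8", errors="replace").rstrip("\r\n"))


def start_pump(stream, sink: list[str]) -> threading.Thread:
    """Start the pump on a daemon thread and return it so the caller can join()."""
    t = threading.Thread(target=pump_lines, args=(stream, sink), daemon=True)
    t.start()
    return t
```

Returning the thread also lets a caller `join()` it before reading the collected lines, which would cover the missed-trailing-lines concern raised for `validate.py`.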
<file name="server/src/common/moe_hybrid_storage.cpp">
<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>
</file>
<file name="server/src/common/moe_hybrid_ffn_eval.h">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>
<file name="server/src/laguna/laguna_backend.cpp">
<violation number="1" location="server/src/laguna/laguna_backend.cpp:169">
P3: Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</violation>
<violation number="2" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
</file>
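The presence-versus-value distinction behind the `DFLASH_IGNORE_EOS` finding, illustrated in Python even though the backend itself is C++. `env_flag` is a hypothetical helper, and the set of spellings treated as false is an assumption.

```python
import os

# Assumed spellings that mean "off"; anything else that is set counts as "on".
_FALSE = {"", "0", "false", "off", "no"}


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable by its value, not merely by its presence."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in _FALSE
```

With this rule, `DFLASH_IGNORE_EOS=0` leaves EOS stopping enabled, while `DFLASH_IGNORE_EOS=1` disables it.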
<file name="server/src/server/server_main.cpp">
<violation number="1" location="server/src/server/server_main.cpp:644">
P2: Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</violation>
</file>
    const std::string pfx = is_laguna ? "DFLASH_LAGUNA_" : "DFLASH_QWEN35MOE_";
    const std::string profile = std::string(bargs.model_path) + ".spark.csv";
    std::FILE * pf = std::fopen(profile.c_str(), "rb");
    const bool have_profile = (pf != nullptr);
P2: Stale profile sidecar not validated against the model file. The .spark.csv profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/server_main.cpp, line 644:
<comment>Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</comment>
<file context>
@@ -623,6 +629,43 @@ int main(int argc, char ** argv) {
+ const std::string pfx = is_laguna ? "DFLASH_LAGUNA_" : "DFLASH_QWEN35MOE_";
+ const std::string profile = std::string(bargs.model_path) + ".spark.csv";
+ std::FILE * pf = std::fopen(profile.c_str(), "rb");
+ const bool have_profile = (pf != nullptr);
+ if (pf) std::fclose(pf);
+ const std::string slots = std::to_string(spark_slots);
</file context>
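One possible freshness check for the sidecar, sketched in Python as an illustration only. The `# model=` header line, `model_fingerprint`, and `profile_is_fresh` are all hypothetical. The real `.spark.csv` layout is not shown in this thread.

```python
import os


def model_fingerprint(model_path: str) -> str:
    """Cheap identity for a model file: byte size plus modification time in nanoseconds."""
    st = os.stat(model_path)
    return f"{st.st_size}:{st.st_mtime_ns}"


def profile_is_fresh(model_path: str, profile_path: str) -> bool:
    """True only if the profile's first line records the current model fingerprint."""
    try:
        with open(profile_path, "r", encoding="utf-8") as fh:
            header = fh.readline().strip()
    except OSError:
        return False  # a missing or unreadable profile is treated as stale
    return header == "# model=" + model_fingerprint(model_path)
```

Size plus mtime is cheap enough to run at every boot. A content hash would be stricter but means reading a multi-gigabyte GGUF.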
moe_hybrid_cache_swap_in only validated/copied the separate gate/up/down tensors, so for a fused-gate-up MoE (gate_up_hot populated, gate_hot/up_hot null, a layout the storage supports via fused_gate_up) it returned -1 and the bounded cache silently no-op'd.
Branch on fused_gate_up: copy gate_up + down when fused, gate + up + down otherwise.
Current targets (laguna-xs2, Qwen3.6-35B-A3B) use SEPARATE tensors, so no behavior change for them (laguna cache verified unchanged: cold 38.5 -> 1.9 / 77 -> 85 tok/s at 32 slots). Fixes the latent no-op for fused-gate-up archs.
Co-Authored-By: WOZCODE <contact@withwoz.com>
extract_sessions.py now pulls both Claude Code (~/.claude/projects) and Codex (~/.codex/sessions/**/rollout-*.jsonl) by default (--source claude|codex|both).
Codex rollouts are parsed from response_item user+assistant content blocks (input_text/output_text/text), skipping the developer/system boilerplate so the corpus reflects real traffic, not instructions. Both sources merge into one corpus with the same per-session train/held-out split.
Verified locally: 157 sessions (claude + codex) -> 432 train chunks (~212K tok), held-out split by session.
Co-Authored-By: WOZCODE <contact@withwoz.com>
The only knob a user touches is now --spark-vram <GiB>: the total VRAM Spark may use.
spark_budget_split() (common) clamps the expert budget to fit the target (minus core + KV + safety), then carves an auto-sized cache ring (~12%, capped 1.5 GiB) out of it; the rest is pinned-hot. Replaces the fixed 32-slot default. Both backends store the computed slot count and the storage build uses it (explicit DFLASH_*_CACHE_SLOTS still overrides).
server_main: --spark-vram sets DFLASH_SPARK + DFLASH_SPARK_VRAM_MB; --spark with no target sizes to the card.
Verified RTX 3090:
- laguna --spark-vram 13 -> hot 9.21 GiB + 19 slots, peak 11.6 GiB
- qwen3.6-35B --spark-vram 15 -> hot 10.56 GiB + 21 slots, peak 13.8 GiB
Both under target, coherent.
Co-Authored-By: WOZCODE <contact@withwoz.com>
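The clamp-then-carve arithmetic described in this commit can be sketched as follows. This is an illustration, not `spark_budget_split()` itself: `budget_split`, `SparkBudget`, and the `slot_bytes` parameter are hypothetical, and the real function derives slot size from the model rather than taking it as an argument.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass
class SparkBudget:
    hot_bytes: int    # bytes left for pinned-hot experts
    cache_slots: int  # slots in the bounded cache ring


def budget_split(expert_budget: int, reserved: int, target: int, slot_bytes: int,
                 ring_frac: float = 0.12, ring_cap: int = 3 * GIB // 2) -> SparkBudget:
    """Clamp the expert budget to the VRAM target, then carve the cache ring out of it."""
    if slot_bytes <= 0:
        raise ValueError("slot_bytes must be positive")
    if target > 0:  # target == 0 is taken to mean "no cap requested"
        expert_budget = min(expert_budget, max(0, target - reserved))
    # The ring takes ~12% of the expert budget, capped at 1.5 GiB.
    ring = min(int(expert_budget * ring_frac), ring_cap)
    slots = ring // slot_bytes
    # Whatever the whole slots do not consume stays pinned-hot.
    return SparkBudget(hot_bytes=expert_budget - slots * slot_bytes, cache_slots=slots)
```

The clamp only holds if `reserved` covers every non-expert allocation. Any reserve left out of it lets the total exceed the requested target.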
4 issues found across 8 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="optimizations/spark/README.md">
<violation number="1" location="optimizations/spark/README.md:103">
P2: README now implies `--spark` works for any model, but server code ignores it on unsupported architectures (non-laguna/non-qwen35moe).</violation>
<violation number="2" location="optimizations/spark/README.md:107">
P3: The docs call `--spark-vram` the “only knob” even though `--spark-slots` is still a functional tuning flag in the same flow.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.cpp">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.cpp:1476">
P1: Spark VRAM target clamp ignores warm-cache and draft reserves, causing possible over-allocation beyond the requested `--spark-vram` budget.</violation>
</file>
<file name="server/src/laguna/laguna_backend.cpp">
<violation number="1" location="server/src/laguna/laguna_backend.cpp:656">
P2: Spark VRAM target clamp omits warm-cache reserve, so `--spark-vram` can allocate more expert memory than the requested total budget.</violation>
</file>
    uint64_t target = 0;
    if (const char * t = std::getenv("DFLASH_SPARK_VRAM_MB")) target = (uint64_t)std::atoll(t) << 20;
    auto sb = dflash::common::spark_budget_split(expert_budget, total_expert_bytes, w.n_expert,
        core_bytes + kv_total + safety_bytes, target);
P1: Spark VRAM target clamp ignores warm-cache and draft reserves, causing possible over-allocation beyond the requested --spark-vram budget.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_backend.cpp, line 1476:
<comment>Spark VRAM target clamp ignores warm-cache and draft reserves, causing possible over-allocation beyond the requested `--spark-vram` budget.</comment>
<file context>
@@ -1467,6 +1468,18 @@ bool Qwen35MoeBackend::load_dynamic_placement(const char * hotness_path,
+ uint64_t target = 0;
+ if (const char * t = std::getenv("DFLASH_SPARK_VRAM_MB")) target = (uint64_t)std::atoll(t) << 20;
+ auto sb = dflash::common::spark_budget_split(expert_budget, total_expert_bytes, w.n_expert,
+ core_bytes + kv_total + safety_bytes, target);
+ expert_budget = sb.hot_bytes;
+ cache_slots_ = sb.cache_slots;
</file context>
-    core_bytes + kv_total + safety_bytes, target);
+    core_bytes + kv_total + warm_cache_bytes + safety_bytes + draft_reserve_bytes, target);
| auto-tunes from its own traffic: | ||
|
|
||
| ```bash | ||
| dflash_server <model.gguf> --spark # use the card, auto-size everything |
P2: README now implies --spark works for any model, but server code ignores it on unsupported architectures (non-laguna/non-qwen35moe).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/README.md, line 103:
<comment>README now implies `--spark` works for any model, but server code ignores it on unsupported architectures (non-laguna/non-qwen35moe).</comment>
<file context>
@@ -100,19 +100,28 @@ For production, you don't run any of the offline pipeline. `dflash_server`
```bash
-dflash_server <model.gguf> --spark # laguna or qwen35moe MoE
+dflash_server <model.gguf> --spark # use the card, auto-size everything
+dflash_server <model.gguf> --spark --spark-vram 14 # cap total VRAM at 14 GiB
</file context>
</details>
```suggestion
dflash_server <model.gguf> --spark # laguna/qwen35moe only; auto-size on supported MoE backends
```
| dflash_server <model.gguf> --spark --spark-vram 14 # cap total VRAM at 14 GiB | ||
| ``` | ||
|
|
||
| The only knob is `--spark-vram <GiB>`: the total VRAM Spark may use. From that |
P3: The docs call --spark-vram the “only knob” even though --spark-slots is still a functional tuning flag in the same flow.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/README.md, line 107:
<comment>The docs call `--spark-vram` the “only knob” even though `--spark-slots` is still a functional tuning flag in the same flow.</comment>
<file context>
@@ -100,19 +100,28 @@ For production, you don't run any of the offline pipeline. `dflash_server`
---spark (optionally --spark-slots N, default 32):
+The only knob is --spark-vram <GiB>: the total VRAM Spark may use. From that
+target it sizes everything itself. It fits the non-expert weights, the KV cache
+and a safety margin, pins as many calibrated-hot experts as the rest allows, and
</file context>
</details>
```suggestion
The primary knob is `--spark-vram <GiB>`: the total VRAM Spark may use. From that
```
--spark already auto-calibrates from live traffic and persists the profile after each request, reloading it on restart. But the routing accumulator was zeroed each boot, so the stored profile only ever reflected the current run.
Seed the accumulator from the loaded profile (when one exists) so calibration strictly accumulates across restarts; a fresh first run still starts at zero and calibrates fast. Both backends.
Verified RTX 3090, laguna --spark, same 2 prompts/run: total routed count 65520 -> 131040 -> 196560 across three restarts (additive). Was flat before.
Co-Authored-By: WOZCODE <contact@withwoz.com>
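The additive behaviour reported here (65520 -> 131040 -> 196560) follows directly from seeding the accumulator with the stored profile. A toy model, assuming counts are keyed by (layer, expert). `boot_accumulator` and `run_session` are hypothetical names, not the backend's API.

```python
from collections import Counter
from typing import Optional


def boot_accumulator(profile: Optional[dict]) -> Counter:
    """Start from the stored profile when one exists, otherwise from zero."""
    return Counter(profile or {})


def run_session(acc: Counter, routed: dict) -> dict:
    """Add this run's routed counts and return the profile that would be persisted."""
    acc.update(routed)  # Counter.update adds counts rather than replacing them
    return dict(acc)
```

If every boot instead started from an empty `Counter`, each persisted profile would total only the current run's count, which is the flat behaviour the commit fixes.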
… boot
When --spark is on and no profile exists yet, the server warms placement from
local agent history (Claude Code + Codex) before it starts serving, so the very
first session is already calibrated. One-time; live traffic refines it after.
- spark_corpus.{h,cpp}: scrape ~/.claude/projects + ~/.codex/sessions into text
chunks (dirs overridable via DFLASH_SPARK_CLAUDE_DIR / _CODEX_DIR). Mirrors the
offline extractor, in-process, nlohmann/json + std::filesystem.
- ModelBackend: spark_wants_bootstrap() / spark_bootstrap_finalize() virtuals
(default no-op, so non-hybrid backends skip cleanly).
- server_main: feeds the corpus through generate() (prefill observes routing),
then finalize saves the profile and rebuilds placement.
- laguna: implements finalize via the existing build_from_stats +
build_hybrid_storage_from_file (reuses the same budget + cache as init).
Verified RTX 3090: laguna --spark, no profile, staged history -> 'bootstrapping
from 24 chunks' -> profile written -> restart loads source=file, bootstrap does
not re-run, coherent output. No history -> clean skip to live calibration.
qwen35moe day-one bootstrap is a follow-up (needs its mmap-retained rebuild);
it keeps calibrating from live traffic in the meantime.
Co-Authored-By: WOZCODE <contact@withwoz.com>
4 issues found across 7 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/spark_corpus.cpp">
<violation number="1" location="server/src/common/spark_corpus.cpp:35">
P2: Codex block parsing calls `get<std::string>()` on `text` without checking `is_string()`, which can throw on heterogeneous JSON and fail bootstrap processing.</violation>
<violation number="2" location="server/src/common/spark_corpus.cpp:37">
P1: Unguarded `b["thinking"].get<std::string>()` — checks `b.contains("thinking")` but not `b["thinking"].is_string()`. Same type_error risk if the field exists with a non-string type.</violation>
<violation number="3" location="server/src/common/spark_corpus.cpp:60">
P2: Using `o.value("type", "")` assumes `type` is string-typed; a non-string `type` can throw and stop corpus extraction during bootstrap.</violation>
</file>
<file name="server/src/laguna/laguna_backend.cpp">
<violation number="1" location="server/src/laguna/laguna_backend.cpp:1497">
P2: Unchecked `save_csv` return value in `spark_bootstrap_finalize` — a silent save failure loses the Spark profile CSV without logging any error. Check the return and log on failure, matching the existing pattern at line 171.</violation>
</file>
    if (t == "text" && b.contains("text")) {
        append(out, b["text"].get<std::string>());
    } else if (t == "thinking" && b.contains("thinking")) {
        append(out, b["thinking"].get<std::string>());
P1: Unguarded b["thinking"].get<std::string>() — checks b.contains("thinking") but not b["thinking"].is_string(). Same type_error risk if the field exists with a non-string type.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/spark_corpus.cpp, line 37:
<comment>Unguarded `b["thinking"].get<std::string>()` — checks `b.contains("thinking")` but not `b["thinking"].is_string()`. Same type_error risk if the field exists with a non-string type.</comment>
<file context>
@@ -0,0 +1,148 @@
+ if (t == "text" && b.contains("text")) {
+ append(out, b["text"].get<std::string>());
+ } else if (t == "thinking" && b.contains("thinking")) {
+ append(out, b["thinking"].get<std::string>());
+ } else if (t == "tool_use" && b.contains("input")) {
+ append(out, b["input"].dump().substr(0, 4000));
</file context>
    if (line.empty()) continue;
    json o;
    try { o = json::parse(line); } catch (...) { continue; }
    const std::string ty = o.value("type", "");
P2: Using o.value("type", "") assumes type is string-typed; a non-string type can throw and stop corpus extraction during bootstrap.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/spark_corpus.cpp, line 60:
<comment>Using `o.value("type", "")` assumes `type` is string-typed; a non-string `type` can throw and stop corpus extraction during bootstrap.</comment>
<file context>
@@ -0,0 +1,148 @@
+ if (line.empty()) continue;
+ json o;
+ try { o = json::parse(line); } catch (...) { continue; }
+ const std::string ty = o.value("type", "");
+ if ((ty == "user" || ty == "assistant") && o.contains("message") &&
+ o["message"].is_object() && o["message"].contains("content"))
</file context>
-    const std::string ty = o.value("type", "");
+    const std::string ty = (o.contains("type") && o["type"].is_string()) ? o["type"].get<std::string>() : "";
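The same type-before-use discipline, shown as a Python analogue of the kind an offline extractor could use. This is a sketch, not the code in `spark_corpus.cpp` or `extract_sessions.py`. `block_text` and `line_texts` are hypothetical names, and only the `text` and `thinking` block kinds are handled.

```python
import json


def block_text(block) -> str:
    """Return the text of a content block, or '' if its shape is not what we expect."""
    if not isinstance(block, dict):
        return ""
    kind = block.get("type")
    # Check the field's type, not just its presence, before using it as a string.
    if kind == "text" and isinstance(block.get("text"), str):
        return block["text"]
    if kind == "thinking" and isinstance(block.get("thinking"), str):
        return block["thinking"]
    return ""


def line_texts(line: str) -> list[str]:
    """Parse one JSONL line and collect block texts, skipping anything malformed."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return []
    if not isinstance(obj, dict) or obj.get("type") not in ("user", "assistant"):
        return []
    msg = obj.get("message")
    content = msg.get("content") if isinstance(msg, dict) else None
    if not isinstance(content, list):
        return []
    return [t for t in map(block_text, content) if t]
```

A record with an unexpected shape is skipped rather than raising, so one odd line cannot abort corpus extraction.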
    if (!b.is_object()) continue;
    const std::string t = b.value("type", "");
    if (t == "text" && b.contains("text")) {
        append(out, b["text"].get<std::string>());
P2: Codex block parsing calls get<std::string>() on text without checking is_string(), which can throw on heterogeneous JSON and fail bootstrap processing.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/spark_corpus.cpp, line 35:
<comment>Codex block parsing calls `get<std::string>()` on `text` without checking `is_string()`, which can throw on heterogeneous JSON and fail bootstrap processing.</comment>
<file context>
@@ -0,0 +1,148 @@
+ if (!b.is_object()) continue;
+ const std::string t = b.value("type", "");
+ if (t == "text" && b.contains("text")) {
+ append(out, b["text"].get<std::string>());
+ } else if (t == "thinking" && b.contains("thinking")) {
+ append(out, b["thinking"].get<std::string>());
</file context>
    bool LagunaBackend::spark_bootstrap_finalize(const std::string & profile_path) {
        if (!spark_wants_bootstrap()) return false;
        std::string err;
        routing_stats_->save_csv(profile_path, &err); // persist the observed routing
P2: Unchecked save_csv return value in spark_bootstrap_finalize — a silent save failure loses the Spark profile CSV without logging any error. Check the return and log on failure, matching the existing pattern at line 171.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 1497:
<comment>Unchecked `save_csv` return value in `spark_bootstrap_finalize` — a silent save failure loses the Spark profile CSV without logging any error. Check the return and log on failure, matching the existing pattern at line 171.</comment>
<file context>
@@ -1483,6 +1487,28 @@ GenerateResult LagunaBackend::generate_hybrid(const GenerateRequest & req,
+bool LagunaBackend::spark_bootstrap_finalize(const std::string & profile_path) {
+ if (!spark_wants_bootstrap()) return false;
+ std::string err;
+ routing_stats_->save_csv(profile_path, &err); // persist the observed routing
+ MoeHybridPlacement placement;
+ if (!MoeHybridPlacement::build_from_stats_with_layer_bytes(
</file context>
-    routing_stats_->save_csv(profile_path, &err); // persist the observed routing
+    if (!routing_stats_->save_csv(profile_path, &err))
+        std::fprintf(stderr, "[spark] profile save failed: %s\n", err.c_str());
…/cold expert residency
# Conflicts:
#   server/src/server/server_main.cpp
Mirror laguna's bootstrap on qwen: store the hot budget + per-layer expert bytes during placement, and add a self-contained rebuild_hybrid_from_placement() that re-mmaps the GGUF and rebuilds the hot/cold storage for the calibrated placement.
spark_bootstrap_finalize saves the observed profile and applies it in-process, so qwen's first session is calibrated from local history too. Frees the current storage before allocating the new one (no transient 2x VRAM; safe at startup with no in-flight requests, budget unchanged).
Verified RTX 3090: qwen3.6-35B --spark, no profile, staged history -> bootstrap from 24 chunks -> calibrated placement applied -> coherent output, no OOM.
Co-Authored-By: WOZCODE <contact@withwoz.com>
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/qwen35moe/qwen35moe_backend.cpp">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.cpp:221">
P1: Spark bootstrap runs even when the model is in full-fit (non-hybrid) mode, which can unintentionally flip the backend into hybrid mode and disable the normal spec-decode path.</violation>
</file>
    }

    bool Qwen35MoeBackend::spark_wants_bootstrap() const {
        return routing_stats_ && !layer_expert_bytes_.empty() && spark_expert_budget_ > 0;
P1: Spark bootstrap runs even when the model is in full-fit (non-hybrid) mode, which can unintentionally flip the backend into hybrid mode and disable the normal spec-decode path.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_backend.cpp, line 221:
<comment>Spark bootstrap runs even when the model is in full-fit (non-hybrid) mode, which can unintentionally flip the backend into hybrid mode and disable the normal spec-decode path.</comment>
<file context>
@@ -217,6 +217,90 @@ void Qwen35MoeBackend::after_target_compute(StepGraph & sg, int, int) {
}
+bool Qwen35MoeBackend::spark_wants_bootstrap() const {
+ return routing_stats_ && !layer_expert_bytes_.empty() && spark_expert_budget_ > 0;
+}
+
</file context>
-    return routing_stats_ && !layer_expert_bytes_.empty() && spark_expert_budget_ > 0;
+    return target_weights().moe_hybrid && routing_stats_ && !layer_expert_bytes_.empty() && spark_expert_budget_ > 0;
I suggest trying 8 GB, which is the most common VRAM size for gamers.
…ode)
Append the routed FFN into the per-layer attention graph, driven by the on-GPU selected tensor through a LUT (global->hot-slot), hot stack as expert source, drop-on-miss via valid_lut. One graph per layer, no host FFN dispatch, so ggml's CUDA-graph spans the whole layer.
With the expert cache wired in, the cache warms the working set so drops fall to zero after warmup (exact) while the FFN stays fused. Gated by DFLASH_LAGUNA_FUSE_FFN; default path unchanged.
Verified RTX 3090 laguna-xs2:
- budget 100 (no drop): 115.7 tok/s, 161/161 tokens IDENTICAL to all-GPU (exact).
- budget 60 + cache: 91.5 tok/s, cold/tok=0 after warmup (exact), coherent.
- budget 60 pure drop: 101.5 tok/s = 88% of all-GPU, coherent, approximate.
Follow-ups: cross-layer prefetch (removes the per-layer readback capping fused+cache; coverage validated 92-98%), GPU-side routing observe, qwen port.
Co-Authored-By: WOZCODE <contact@withwoz.com>
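A host-side toy model of the global->hot-slot LUT with drop-on-miss. It assumes the validity mask works by zeroing a cold expert's routing weight. The commit does not show how `valid_lut` is actually applied inside the graph, and `build_luts` / `route` are hypothetical names.

```python
def build_luts(n_expert: int, hot_experts: list[int]) -> tuple[list[int], list[float]]:
    """Map global expert id -> hot-stack slot, plus a validity mask for drop-on-miss."""
    lut = [0] * n_expert      # cold experts point at slot 0 as a harmless placeholder
    valid = [0.0] * n_expert  # 0.0 masks a cold expert's contribution out
    for slot, expert in enumerate(hot_experts):
        lut[expert] = slot
        valid[expert] = 1.0
    return lut, valid


def route(selected: list[int], weights: list[float], lut, valid):
    """Translate routed expert ids to hot slots; cold experts keep a slot but get weight 0."""
    slots = [lut[e] for e in selected]
    masked = [w * valid[e] for e, w in zip(selected, weights)]
    dropped = sum(1 for e in selected if valid[e] == 0.0)
    return slots, masked, dropped
```

Because every selected id maps to some slot, the lookup needs no host-side branch. Once the working set is resident, `dropped` is zero for every token and the result is exact.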
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="optimizations/spark/spark/tokenizer.py">
<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>
<file name="optimizations/spark/spark/extract_sessions.py">
<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>
<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>
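Both `extract_sessions.py` findings can be handled in one place. The sketch below is an alternative hash-bucket split, not the script's current logic. `is_held_out` is a hypothetical name, and `usedforsecurity` requires Python 3.9 or later.

```python
import hashlib


def is_held_out(session_id: str, test_frac: float) -> bool:
    """Deterministically assign a session to the held-out split."""
    if not 0.0 < test_frac < 1.0:
        raise ValueError(f"--test-frac must be in (0, 1), got {test_frac}")
    # usedforsecurity=False marks this as a non-cryptographic use of MD5,
    # which FIPS-enforcing builds require before they allow the call.
    digest = hashlib.md5(session_id.encode("utf-8"), usedforsecurity=False).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < test_frac
```

Comparing a hash bucket against the fraction avoids any division by `test_frac`, and the range check rejects `0` and `1` with a clear message instead of a traceback.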
<file name="optimizations/spark/README.md">
<violation number="1" location="optimizations/spark/README.md:103">
P2: README now implies `--spark` works for any model, but server code ignores it on unsupported architectures (non-laguna/non-qwen35moe).</violation>
<violation number="2" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
<violation number="3" location="optimizations/spark/README.md:107">
P3: The docs call `--spark-vram` the “only knob” even though `--spark-slots` is still a functional tuning flag in the same flow.</violation>
</file>
<file name="optimizations/spark/spark/validate.py">
<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>
<violation number="2" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>
<violation number="3" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>
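For the fixed `/tmp` filename finding, a minimal sketch using the standard library's unique temp files. `scratch_path` and the `spark-validate-` prefix are hypothetical, not names from `validate.py`.

```python
import os
import tempfile


def scratch_path(prefix: str, suffix: str = ".csv") -> str:
    """Create a unique, empty scratch file and return its path; the caller removes it."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix)
    os.close(fd)  # only the path is needed; the file itself stays on disk
    return path
```

Two concurrent validation runs calling `scratch_path("spark-validate-")` get distinct files, so neither can overwrite the other's intermediate output.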
<file name="optimizations/spark/spark/train_pregate.py">
<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.
(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) [FEEDBACK_USED]</violation>
<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>
<file name="optimizations/spark/spark/calibrate.py">
<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>
<file name="optimizations/spark/spark/_daemon.py">
<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>
<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.
(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) [FEEDBACK_USED].</violation>
</file>
<file name="server/src/common/moe_hybrid_storage.cpp">
<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>
</file>
<file name="server/src/common/moe_hybrid_ffn_eval.h">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>
<file name="server/src/laguna/laguna_backend.cpp">
<violation number="1" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
<violation number="2" location="server/src/laguna/laguna_backend.cpp:1093">
P0: `step_graph_free` does not reset `sg.hot_local_lut` or `sg.valid_lut` after freeing the ggml context, so a stale pointer from a previous fused-layer call persists into a subsequent non-fused-layer call. The guard `layer_sg.hot_local_lut` then passes on a dangling pointer, and the tensor-set writes that follow corrupt freed memory.</violation>
<violation number="3" location="server/src/laguna/laguna_backend.cpp:1497">
P2: Unchecked `save_csv` return value in `spark_bootstrap_finalize` — a silent save failure loses the Spark profile CSV without logging any error. Check the return and log on failure, matching the existing pattern at line 171.</violation>
</file>
<file name="server/src/server/server_main.cpp">
<violation number="1" location="server/src/server/server_main.cpp:644">
P2: Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.cpp">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.cpp:221">
P1: Spark bootstrap runs even when the model is in full-fit (non-hybrid) mode, which can unintentionally flip the backend into hybrid mode and disable the normal spec-decode path.</violation>
<violation number="2" location="server/src/qwen35moe/qwen35moe_backend.cpp:1476">
P1: Spark VRAM target clamp ignores warm-cache and draft reserves, causing possible over-allocation beyond the requested `--spark-vram` budget.</violation>
</file>
<file name="server/src/common/spark_corpus.cpp">
<violation number="1" location="server/src/common/spark_corpus.cpp:35">
P2: Codex block parsing calls `get<std::string>()` on `text` without checking `is_string()`, which can throw on heterogeneous JSON and fail bootstrap processing.</violation>
<violation number="2" location="server/src/common/spark_corpus.cpp:37">
P1: Unguarded `b["thinking"].get<std::string>()` — checks `b.contains("thinking")` but not `b["thinking"].is_string()`. Same type_error risk if the field exists with a non-string type.</violation>
<violation number="3" location="server/src/common/spark_corpus.cpp:60">
P2: Using `o.value("type", "")` assumes `type` is string-typed; a non-string `type` can throw and stop corpus extraction during bootstrap.</violation>
</file>
    }

    static const bool g_fuse_dec = (std::getenv("DFLASH_LAGUNA_FUSE_FFN") != nullptr);
    if (g_fuse_dec && !is_dense && layer_sg.hot_local_lut) {
P0: step_graph_free does not reset sg.hot_local_lut or sg.valid_lut after freeing the ggml context, so a stale pointer from a previous fused-layer call persists into a subsequent non-fused-layer call. The guard layer_sg.hot_local_lut then passes on a dangling pointer, and the tensor-set writes that follow corrupt freed memory.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 1093:
<comment>`step_graph_free` does not reset `sg.hot_local_lut` or `sg.valid_lut` after freeing the ggml context, so a stale pointer from a previous fused-layer call persists into a subsequent non-fused-layer call. The guard `layer_sg.hot_local_lut` then passes on a dangling pointer, and the tensor-set writes that follow corrupt freed memory.</comment>
<file context>
@@ -1048,6 +1089,18 @@ bool LagunaBackend::hybrid_forward_one_token(int32_t tok, int kv_pos,
}
+ static const bool g_fuse_dec = (std::getenv("DFLASH_LAGUNA_FUSE_FFN") != nullptr);
+ if (g_fuse_dec && !is_dense && layer_sg.hot_local_lut) {
+ auto & _st = moe_hybrid_->layers[(size_t)il];
+ std::vector<int32_t> _lut((size_t)w_.n_expert); std::vector<float> _vld((size_t)w_.n_expert);
</file context>
Update the integration snapshot for the latest open PR list, record the current heads for Luce-Org#345/Luce-Org#344/Luce-Org#343, and preserve the existing stack with no source changes.
…swap

Laguna MoE hybrid (offloaded) decode now runs the whole token in ONE fused graph (laguna_step_hybrid) with the routed FFN served from the hot expert stack via per-layer residency LUTs (build_laguna_moe_block_hybrid), instead of 40 separate per-layer graphs. That removes ~3.2 ms/token of per-layer host glue (the real cap on the multi-graph fused path). Per-layer LUT sets and the selected readback are batched into one H2D / one D2H. Default-on for the hybrid-offload path; set DFLASH_LAGUNA_NO_SINGLE_GRAPH=1 to fall back.

The bounded expert-cache swap is now async on cudaStreamPerThread with a pinned cold store (ggml_backend_cuda_host_buffer_type), so cold-tail re-fetch overlaps compute instead of stalling the host. Shared storage, so laguna and qwen35moe both benefit.

Measured (laguna-xs2 Q4_K_M, RTX 3090, single-token decode):
  all-resident: 118 tok/s, 96/96 bit-exact vs all-GPU
  true 60% residency: ~97-101 tok/s (was 76), coherent
  qwen3.6-35B-A3B offload: ~98-106 tok/s (~92-99% of all-resident)

Co-Authored-By: WOZCODE <contact@withwoz.com>
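The per-layer residency LUT idea described in this commit can be sketched as follows. This is a minimal illustration, not the engine's code: `LayerLut`, `build_layer_lut`, and `pack_hot_local` are hypothetical names, and the layout (an `int32_t` slot table plus a `float` validity mask, each sized to the expert count) is an assumption inferred from the `_lut` / `_vld` vectors visible in the review's file context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-layer residency tables (names are illustrative only).
struct LayerLut {
    std::vector<int32_t> hot_local; // expert id -> slot in the hot stack (0 if cold)
    std::vector<float>   valid;     // 1.0f if the expert is GPU-resident, else 0.0f
};

// Build the tables for one layer from the list of resident expert ids.
// The position of an id in hot_ids is taken as its slot in the hot stack.
inline LayerLut build_layer_lut(int n_expert, const std::vector<int32_t> & hot_ids) {
    LayerLut lut{std::vector<int32_t>((size_t)n_expert, 0),
                 std::vector<float>((size_t)n_expert, 0.0f)};
    for (size_t slot = 0; slot < hot_ids.size(); ++slot) {
        const int32_t e = hot_ids[slot];
        assert(e >= 0 && e < n_expert);
        lut.hot_local[(size_t)e] = (int32_t)slot;
        lut.valid[(size_t)e]     = 1.0f;
    }
    return lut;
}

// Concatenate every layer's slot table into one contiguous host buffer,
// so a caller could upload all layers with a single host-to-device copy.
inline std::vector<int32_t> pack_hot_local(const std::vector<LayerLut> & layers) {
    std::vector<int32_t> packed;
    for (const LayerLut & l : layers) {
        packed.insert(packed.end(), l.hot_local.begin(), l.hot_local.end());
    }
    return packed;
}
```

Packing the tables on the host first is what would let the per-layer sets travel in one transfer rather than one per layer; the actual upload call is left out because it depends on the backend.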
You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now. When you're ready for another review, comment:
@cubic-dev-ai review
@davide221 I have started the AI code review. It will take a few minutes to complete.
step_graph_free reset moe_weights/ffn_* but not hot_local_lut/valid_lut, so after a fused-MoE layer freed its graph the stale (dangling) tensor pointers persisted into the next layer; the `layer_sg.hot_local_lut` guard then passed on freed memory and the following tensor_set corrupted it. Only reachable on fused-gate-up archs under DFLASH_LAGUNA_FUSE_FFN (laguna-xs2 uses separate tensors, so our shipped targets are unaffected), but a real latent use-after-free. Flagged by cubic. Co-Authored-By: WOZCODE <contact@withwoz.com>
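The shape of that fix can be sketched in isolation. Everything below is a stand-in under stated assumptions: `Tensor`, `Context`, `StepGraph`, and `step_graph_free_sketch` are hypothetical types and names that only mimic the ownership pattern (tensors owned by a context, raw pointers cached on the step graph); the real engine uses ggml contexts and tensors.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Tensor { int placeholder = 0; }; // stand-in for a graph tensor

// Stand-in for a context that owns every tensor allocated from it.
struct Context {
    std::vector<std::unique_ptr<Tensor>> tensors;
    Tensor * new_tensor() {
        tensors.push_back(std::make_unique<Tensor>());
        return tensors.back().get();
    }
};

// Cached raw pointers into the context; they dangle once the context is freed.
struct StepGraph {
    std::unique_ptr<Context> ctx;
    Tensor * moe_weights   = nullptr;
    Tensor * hot_local_lut = nullptr;
    Tensor * valid_lut     = nullptr;
};

// Free the context and clear every cached pointer, including the two LUT
// pointers, so a later `if (sg.hot_local_lut)` guard cannot pass on freed memory.
inline void step_graph_free_sketch(StepGraph & sg) {
    sg.ctx.reset();
    sg.moe_weights   = nullptr;
    sg.hot_local_lut = nullptr;
    sg.valid_lut     = nullptr;
    assert(sg.hot_local_lut == nullptr && sg.valid_lut == nullptr);
}
```

Resetting the pointers in the same function that frees the context keeps the null check at the call site meaningful without any extra bookkeeping there.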
Thanks @cubic-dev-ai. Triage of the engine-code findings:
@davide221 Clear triage, that all tracks. A couple of callouts on the items you pushed back on:
P0
P2s: all reasonable to defer alongside the hybrid-descriptor refactor. If you end up tracking those anywhere visible (an issue or a follow-up doc), linking from the
One thing I'd nudge from the conversation: @howard0su's suggestion about 8 GB testing is worth a nod if you have a minute. Running a quick
# Conflicts:
#	server/src/laguna/laguna_target_graph.cpp
…builders

laguna_step_hybrid no longer duplicates build_laguna_graph / build_laguna_layer: a LagunaHybridMoe descriptor (hot storage + per-layer LUT tensors) is threaded through build_laguna_layer -> the MoE dispatcher, and laguna_step_hybrid now reuses build_laguna_graph for the attention/norm/layer loop + logits head. The all-GPU MoE block (build_laguna_moe_block_full) is untouched; the dispatcher routes to build_laguna_moe_block_hybrid only when a hybrid descriptor with storage is set.

Removes ~100 lines of duplication and fixes a latent drift introduced by the main merge: the old inline loop missed `graph_tensor_f32` precision conversion that build_laguna_layer now applies on the shared path.

Verified: sg_b100 (offload path, all experts resident) = 96/96 bit-exact vs all-GPU; offload 60% = 99.6 tok/s coherent; all-GPU = 119 tok/s.

Co-Authored-By: WOZCODE <contact@withwoz.com>
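The routing rule in this commit (take the hybrid block only when a descriptor with storage is set, otherwise the untouched all-GPU block) can be sketched as a small selector. `HybridMoeSketch`, `MoeBlock`, and `select_moe_block` are hypothetical names for illustration; the real `LagunaHybridMoe` descriptor and builder signatures are not shown in this conversation and are not reproduced here.

```cpp
#include <cassert>

// Hypothetical stand-in for the hybrid descriptor: only the field the
// routing decision depends on is modelled.
struct HybridMoeSketch {
    const void * hot_storage = nullptr; // hot expert stack; null means "no storage set"
};

enum class MoeBlock { Full, Hybrid };

// Choose which MoE block a shared layer builder would call.
// A missing descriptor, or one without storage, keeps the all-GPU path.
inline MoeBlock select_moe_block(const HybridMoeSketch * hybrid) {
    const bool use_hybrid = (hybrid != nullptr) && (hybrid->hot_storage != nullptr);
    return use_hybrid ? MoeBlock::Hybrid : MoeBlock::Full;
}
```

Passing the descriptor as a nullable pointer means existing all-GPU callers need no changes, which is one way to keep a single shared builder for both paths.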
The origin/main merge resolved the server/deps/llama.cpp gitlink to the PR's older b896cf69 (llama #12 hip-shim), rewinding it from main's 570d9785 and dropping llama #13 (crash), #14 (pflash-sparse stream-sync / BSA fix) and #15 (fattn-sparse hip cudaStreamDefault). Point the gitlink back to 570d9785 (a strict superset — b896cf69 is its ancestor) so the PR no longer downgrades llama.cpp / drops the BSA fix. Rebuilt + smoke-tested against 570d9785: sg_b100 96/96 bit-exact vs all-GPU, offload 60% 99.3 tok/s, all-GPU 119.5. Co-Authored-By: WOZCODE <contact@withwoz.com>
…mbers

Drives the test_dflash daemon over a fixed prompt and reports steady-state decode tok/s for: all-GPU (full residency), single-graph @100% (asserts token-for-token match vs all-GPU), and Spark offload (single-graph hybrid). decode_tok_s = generated_tokens / decode_s, the same metric and the same LagunaBackend decode path that `dflash_server --spark` runs.

Reproduces on RTX 3090, laguna-xs2 Q4_K_M:
  all-GPU: 119.0 tok/s
  single-graph @100%: 118.8 tok/s (128/128 EXACT vs all-GPU)
  Spark offload @~60%: 99.1 tok/s (83% of all-GPU)

README: added a "Reproduce the decode numbers" section.

Co-Authored-By: WOZCODE <contact@withwoz.com>
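For reference, the metric and the "83% of all-GPU" figure work out as below. This is a small arithmetic sketch, not the benchmark script: the function names are hypothetical, and only the formula `decode_tok_s = generated_tokens / decode_s` and the reported tok/s values come from the commit message.

```cpp
#include <cassert>
#include <cmath>

// Steady-state decode throughput: generated tokens divided by decode seconds.
inline double decode_tok_s(long generated_tokens, double decode_s) {
    assert(decode_s > 0.0);
    return (double)generated_tokens / decode_s;
}

// Throughput as a whole-number percentage of a baseline run, rounded to nearest.
inline long percent_of_baseline(double tok_s, double baseline_tok_s) {
    assert(baseline_tok_s > 0.0);
    return std::lround(100.0 * tok_s / baseline_tok_s);
}
```

With the reported numbers, `percent_of_baseline(99.1, 119.0)` rounds to 83, matching the offload line above.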
Adds the Luce Spark card (-> optimizations/spark/) next to PFlash in the "Inference Engine Optimizations" grid, and lists calibrated MoE expert offload in the mission narrative. The card image (assets/cards/spark_card.png) is a Git-LFS asset to be dropped in, matching the other cards. Co-Authored-By: WOZCODE <contact@withwoz.com>
Integrates Luce Spark calibrated hot/cold expert residency, bounded cache support, and benchmark tooling.
Card for the main README optimizations grid (assets/cards/spark_card.png) and the optimizations/spark hero, replacing the placeholder. 1672x941, Git LFS. Co-Authored-By: WOZCODE <contact@withwoz.com>
…id-MoE section Co-Authored-By: WOZCODE <contact@withwoz.com>
…mbers Adds optimizations/spark/demo.gif (Git LFS): a two-terminal race showing naive expert offload (66 tok/s) vs `dflash_server --spark` (100 tok/s) on the same RTX 3090 at the same 60% GPU residency, generating the same output, with a shared wall-clock so the 1.5x lands as "same seconds, more tokens". Inserted it under the README intro. Reconciles the README to the shipped fused-decode story (66/81/100/119), consistent with spark/bench.py and the blog: the single-graph hybrid decode is default-on and bit-identical to all-GPU at full residency, so point 3 now documents it as shipped (not open research) and the headline/table use the reproducible decode-bench numbers; the held-out real-session figures stay in the Results section. The GIF builder (demo.py) is a one-off asset script and is intentionally not committed. Co-Authored-By: WOZCODE <contact@withwoz.com>
Adds `optimizations/spark/`, the placement + caching product layer on top of the merged hot/cold MoE offload engine (`server/src/common/moe_hybrid_*`), plus the single-graph hybrid-decode engine (commit 31ce93c) that makes offloaded laguna decode run at near all-GPU speed.

What Spark is
A 33B-total MoE only fires ~8 of 256 experts/token, but naive hot/cold offload still hits the CPU tier ~36% of the time if it keeps the wrong experts resident. Spark:

Single-graph hybrid decode engine (commit 31ce93c)
The laguna hybrid-offload decode was rebuilt to run the whole token in one fused graph (`laguna_step_hybrid` + `build_laguna_moe_block_hybrid`): the routed FFN is served from the hot expert stack via per-layer residency LUTs (batched into one H2D / one D2H), instead of 40 separate per-layer graphs. That removes ~3.2 ms/token of per-layer host glue, the real cap on the multi-graph fused path. Default-on under offload (`DFLASH_LAGUNA_NO_SINGLE_GRAPH=1` to fall back). The bounded expert-cache swap is now async on `cudaStreamPerThread` with a pinned cold store (`ggml_backend_cuda_host_buffer_type`) so cold-tail re-fetch overlaps compute; shared storage, so qwen35moe benefits too.

Measured (RTX 3090, single-token decode)
Calibration results (RTX 3090, Laguna-XS.2 Q4_K_M, held-out Claude Code sessions)
Calibration: split by session (no leakage). Full tables in `RESULTS.md`.

Honest notes
6–8%, capacity-bound: recency (LRU) is already at the Belady optimum, so eviction tuning can't beat it. The single-graph + async-pinned overlap hides that re-fetch under compute (that's the 76→100 gain), it doesn't eliminate it.

Known follow-up (refactor debt)
`laguna_step_hybrid` / `build_laguna_moe_block_hybrid` currently duplicate ~150 lines of `build_laguna_graph` / `build_laguna_layer` / `build_laguna_moe_block_full` (path isolation while iterating). Unifying via a hybrid descriptor threaded through the existing builders is a clean follow-up to remove the drift risk.

🧙 Built with WOZCODE