feat(optimizations): Luce Spark — calibrated hot/cold expert residency by davide221 · Pull Request #345 · Luce-Org/lucebox-hub

davide221 · 2026-06-05T14:58:54Z

Adds optimizations/spark/, the placement + caching product layer on top of the merged hot/cold MoE offload engine (server/src/common/moe_hybrid_*), plus the single-graph hybrid-decode engine (commit 31ce93c) that makes offloaded laguna decode run at near all-GPU speed.

What Spark is

A 33B-total MoE only fires ~8 of 256 experts/token, but naive hot/cold offload still hits the CPU tier ~36% of the time if it keeps the wrong experts resident. Spark:

- Calibrates the hot set from the traffic you actually serve (replays real agent sessions, accumulates per-(layer,expert) routing → placement profile).
- Caches the per-session tail in a bounded ring of spare GPU slots (LRU swap), driving cold-misses down in fixed VRAM.
- Single-graph decode (new): runs the whole token in one fused graph so the offloaded path keeps near all-GPU throughput.
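The calibration step can be pictured with a small sketch. It assumes routing events arrive as `(layer, expert_id)` pairs and that the placement profile is simply the most frequently routed experts per layer; the function names and the profile shape are hypothetical and are not taken from the Spark code.

```python
from collections import Counter, defaultdict


def build_placement(routing_events, hot_per_layer):
    """Pick the most frequently routed experts per layer (hypothetical profile shape).

    routing_events: iterable of (layer, expert_id) pairs seen while replaying sessions.
    Returns {layer: sorted list of expert ids to keep resident on the GPU}.
    """
    counts = defaultdict(Counter)
    for layer, expert in routing_events:
        counts[layer][expert] += 1
    return {
        layer: sorted(e for e, _ in ctr.most_common(hot_per_layer))
        for layer, ctr in counts.items()
    }


def cold_hit_rate(routing_events, placement):
    """Fraction of routed experts that are NOT in the resident (hot) set."""
    events = list(routing_events)
    if not events:
        return 0.0
    cold = sum(1 for layer, e in events if e not in placement.get(layer, ()))
    return cold / len(events)
```

Building the placement on one set of sessions and measuring `cold_hit_rate` on a disjoint set mirrors the held-out evaluation described in this PR.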

Single-graph hybrid decode engine (commit `31ce93c`)

The laguna hybrid-offload decode was rebuilt to run the whole token in one fused graph (laguna_step_hybrid + build_laguna_moe_block_hybrid): the routed FFN is served from the hot expert stack via per-layer residency LUTs (batched into one H2D / one D2H), instead of 40 separate per-layer graphs. That removes ~3.2 ms/token of per-layer host glue — the real cap on the multi-graph fused path. Default-on under offload (DFLASH_LAGUNA_NO_SINGLE_GRAPH=1 to fall back). The bounded expert-cache swap is now async on cudaStreamPerThread with a pinned cold store (ggml_backend_cuda_host_buffer_type) so cold-tail re-fetch overlaps compute; shared storage, so qwen35moe benefits too.
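As a rough model of what a per-layer residency LUT does, the sketch below maps a global expert id to a slot in the hot stack, or to -1 when the expert is cold. The real LUT layout inside `build_laguna_moe_block_hybrid` is not shown in this PR, so the names and the -1 convention here are assumptions.

```python
def build_residency_lut(n_expert, hot_expert_ids):
    """Return a list where lut[global_id] is the hot-stack slot, or -1 if cold."""
    lut = [-1] * n_expert
    for slot, expert in enumerate(hot_expert_ids):
        if not 0 <= expert < n_expert:
            raise ValueError(f"expert id {expert} out of range")
        if lut[expert] != -1:
            raise ValueError(f"expert id {expert} listed twice")
        lut[expert] = slot
    return lut


def split_selection(lut, selected):
    """Split the router's selected experts into (expert, slot) hot pairs and cold ids."""
    hot = [(e, lut[e]) for e in selected if lut[e] >= 0]
    cold = [e for e in selected if lut[e] < 0]
    return hot, cold
```

A single table lookup per selected expert is what lets the routed FFN stay inside one graph: only the ids in the cold list need host-side handling.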

Measured (RTX 3090, single-token decode)

| Config | tok/s | vs all-GPU |
|---|---|---|
| laguna all-resident | 118 | 100% (96/96 bit-exact) |
| laguna true-60% residency | 97–101 | ~85–92% (was 76) |
| qwen3.6-35B-A3B 60% offload (AR) | 98–106 | ~92–99% |

Calibration results (RTX 3090, Laguna-XS.2 Q4_K_M, held-out Claude Code sessions)

| Config | tok/s | % all-GPU | VRAM |
|---|---|---|---|
| All-GPU | 118 | 100% | 18.8 GiB |
| Uniform 60% | 66 | 56% | 10.6 GiB |
| Spark calibrated 60% | 81 | 69% | 10.6 GiB |
| Spark + cache + single-graph | 97–101 | ~85–92% | 14.6 GiB |

Calibration: split by session (no leakage). Full tables in RESULTS.md.

Honest notes

The earlier "~0% cold-hit" figure was a counter bug (the fused branch never incremented the cold counter). Real cold-hit at true-60% is 6–8%, and it is capacity-bound: recency (LRU) is already at the Belady optimum, so eviction tuning can't beat it. The single-graph + async-pinned overlap hides that re-fetch under compute (that is the 76 → ~100 tok/s gain); it does not eliminate it.
Token-level expert prediction (transition / router-score / pre-gate) caps at ~53% recall@8 — the experts that newly fire are driven by the current token's hidden evolution, invisible to any prior-token signal. Draft-as-predictor (MoE-SpeQ style) is the open lever, documented as a follow-up.
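For readers unfamiliar with the metric, one common definition of recall@k is sketched below: the fraction of experts that actually fired that also appear among the predictor's k highest-scored experts. Whether the pre-gate trainer computes it exactly this way is an assumption; the function name is hypothetical.

```python
def recall_at_k(scores, fired, k):
    """Fraction of fired experts found in the top-k predicted experts.

    scores: per-expert predicted scores, indexed by expert id.
    fired:  set of expert ids the router actually selected for this token.
    """
    fired = set(fired)
    if not fired:
        raise ValueError("fired must contain at least one expert id")
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    return len(fired.intersection(top)) / len(fired)
```

With 8 experts fired per token, a recall@8 of about 53% means roughly four of the eight routed experts were anticipated by the predictor.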

Known follow-up (refactor debt)

laguna_step_hybrid / build_laguna_moe_block_hybrid currently duplicate ~150 lines of build_laguna_graph / build_laguna_layer / build_laguna_moe_block_full (path isolation while iterating). Unifying via a hybrid descriptor threaded through the existing builders is a clean follow-up to remove the drift risk.

🧙 Built with WOZCODE

…ency) Spark is the placement + caching layer on top of the merged hot/cold MoE offload engine. Standalone tooling to calibrate expert placement from real agent traffic, validate on held-out sessions, and (research) train a pre-gate predictor from routing traces. Laguna-XS.2 Q4_K_M (33B total MoE) / RTX 3090, held-out Claude Code sessions: calibration lifts naive offload 66 -> 81 tok/s (cold-hit 36% -> 6.6%); the bounded expert cache reaches ~88 tok/s at ~0 cold, 14.6 GiB peak (vs 18.8 GiB to hold the full model). Engine (cache, gpu_remap, trace) lives in server/. Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

9 issues found across 10 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="optimizations/spark/spark/tokenizer.py">

<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>

<file name="optimizations/spark/spark/extract_sessions.py">

<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>

<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>

<file name="optimizations/spark/README.md">

<violation number="1" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
</file>

<file name="optimizations/spark/spark/validate.py">

<violation number="1" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>
</file>

<file name="optimizations/spark/spark/train_pregate.py">

<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.)</violation>

<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>


cubic-dev-ai · 2026-06-05T15:08:18Z

+    def list_str(fl):
+        return [bytes(fl.parts[d]).decode("utf-8", errors="replace") for d in fl.data]
+
+    model = bytes(f["tokenizer.ggml.model"].parts[f["tokenizer.ggml.model"].data[0]]).decode()


P2: Unguarded GGUF field access on f["tokenizer.ggml.model"] and f["tokenizer.ggml.tokens"]. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/tokenizer.py, line 24: <comment>Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</comment> <file context> @@ -0,0 +1,56 @@ + def list_str(fl): + return [bytes(fl.parts[d]).decode("utf-8", errors="replace") for d in fl.data] + + model = bytes(f["tokenizer.ggml.model"].parts[f["tokenizer.ggml.model"].data[0]]).decode() + if model != "gpt2": + raise SystemExit(f"only gpt2 byte-level BPE supported, gguf says model={model!r}") </file context>
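One way to address this, sketched under the assumption that `f` behaves like a mapping from key to a field object with `.parts` and `.data` (as the quoted line suggests), is a small accessor that turns missing or empty fields into a readable error. The helper name `gguf_string` is hypothetical.

```python
def gguf_string(fields, key):
    """Read a single string-valued GGUF field, failing with a clear message."""
    try:
        field = fields[key]
    except KeyError:
        raise SystemExit(f"GGUF is missing required field {key!r}") from None
    if len(field.data) == 0:
        raise SystemExit(f"GGUF field {key!r} is present but empty")
    # Same decoding as the original line: data[0] indexes into parts.
    return bytes(field.parts[field.data[0]]).decode("utf-8", errors="replace")
```

The tokenizer check would then read `model = gguf_string(f, "tokenizer.ggml.model")`, and the same guard pattern could wrap the token-list field.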

cubic-dev-ai · 2026-06-05T15:08:18Z

+    train, test = [], []
+    for f in files:
+        # split by session-path hash so a whole session is train xor test
+        bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train


P2: hashlib.md5() without usedforsecurity=False will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/extract_sessions.py, line 79: <comment>`hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</comment> <file context> @@ -0,0 +1,115 @@ + train, test = [], [] + for f in files: + # split by session-path hash so a whole session is train xor test + bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train + sess = [] + try: </file context>
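A minimal sketch of the non-security hash call, assuming Python 3.9 or newer (where `hashlib.md5` accepts `usedforsecurity`). The helper name `session_bucket` is hypothetical; the modulo logic follows the quoted line.

```python
import hashlib


def session_bucket(path, test_frac):
    """Assign a whole session to 'test' or 'train' from a hash of its path."""
    # usedforsecurity=False marks this as a non-cryptographic use, which
    # FIPS-restricted builds permit.
    digest = hashlib.md5(str(path).encode(), usedforsecurity=False).hexdigest()
    return "test" if int(digest, 16) % test_frac == 0 else "train"
```

Because the bucket depends only on the path, every chunk of a session lands on the same side of the split.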

cubic-dev-ai · 2026-06-05T15:08:18Z

+
+```bash
+cd optimizations/spark
+uv sync                                   # tokenizers (+ gguf/torch optional extras)


P2: uv sync does not install optional dependencies (gguf, numpy) required by the tokenizer step. The comment on the uv sync line is misleading — it lists gguf/torch optional extras but the command does not install them. Step 0 (python -m spark.tokenizer) will fail with ImportError because gguf and numpy are in the tokenizer extra, not the base dependency set.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/README.md, line 104: <comment>`uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</comment> <file context> @@ -0,0 +1,187 @@ + +```bash +cd optimizations/spark +uv sync # tokenizers (+ gguf/torch optional extras) + +# 0. one tokenizer, extracted from the GGUF (gpt2 byte-level BPE) </file context>

Suggested change

-uv sync                                   # tokenizers (+ gguf/torch optional extras)
+uv sync --all-extras                      # tokenizers (+ gguf/torch optional extras)

cubic-dev-ai · 2026-06-05T15:08:19Z

+    train, test = [], []
+    for f in files:
+        # split by session-path hash so a whole session is train xor test
+        bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train


P2: --test-frac is not validated, so --test-frac 0 crashes with ZeroDivisionError during split assignment.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/extract_sessions.py, line 79: <comment>`--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</comment> <file context> @@ -0,0 +1,115 @@ + train, test = [], [] + for f in files: + # split by session-path hash so a whole session is train xor test + bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train + sess = [] + try: </file context>
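Validation can happen at argument-parsing time. The sketch below assumes `--test-frac` is an integer modulus, as the quoted line implies; the default of 10 and the helper name are placeholders, not values from the PR.

```python
import argparse


def positive_int(text):
    """argparse type: accept integers >= 1, reject everything else."""
    value = int(text)  # a ValueError here is reported by argparse as an invalid value
    if value < 1:
        raise argparse.ArgumentTypeError(f"must be >= 1, got {value}")
    return value


parser = argparse.ArgumentParser()
# default=10 is a placeholder for illustration only
parser.add_argument("--test-frac", type=positive_int, default=10)
```

With this in place, `--test-frac 0` produces a usage error and exit status 2 instead of a `ZeroDivisionError` deep in the split loop.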

cubic-dev-ai · 2026-06-05T15:08:19Z

+            proc.kill()
+            raise SystemExit("ready timeout")
+
+    pp = Path("/tmp/spark_val_chunk.bin")


P2: Using fixed /tmp filenames causes cross-process collisions when multiple validation runs execute concurrently.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 95: <comment>Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</comment> <file context> @@ -0,0 +1,142 @@ + proc.kill() + raise SystemExit("ready timeout") + + pp = Path("/tmp/spark_val_chunk.bin") + op = Path("/tmp/spark_val_out.bin") + toks = [] </file context>
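A per-run scratch directory avoids the collision. The sketch below keeps the file names from the quoted code but places them under a directory created by `tempfile`; the helper name `make_scratch_paths` is hypothetical.

```python
import tempfile
from pathlib import Path


def make_scratch_paths(workdir):
    """Return (prompt_path, output_path) inside a run-private directory."""
    workdir = Path(workdir)
    return workdir / "spark_val_chunk.bin", workdir / "spark_val_out.bin"


# Each run gets its own directory, removed automatically on exit from the block.
with tempfile.TemporaryDirectory(prefix="spark_val_") as tmp:
    pp, op = make_scratch_paths(tmp)
    pp.write_bytes(b"")  # stand-in for writing the real prompt chunk
```

The validation loop would run inside the `with` block so that both files live and die with the run.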

cubic-dev-ai · 2026-06-05T15:08:19Z

+    import torch.nn as nn
+    H, E = args.n_embd, args.n_expert
+    dt = np.dtype([("layer", "<i2"), ("nsel", "<i2"), ("sel", "<i4", (8,)), ("hid", "<f4", (H,))])
+    arr = np.fromfile(args.trace, dtype=dt)


P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before np.fromfile, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.)

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/train_pregate.py, line 42: <comment>The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early. (Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) </comment> <file context> @@ -0,0 +1,85 @@ + import torch.nn as nn + H, E = args.n_embd, args.n_expert + dt = np.dtype([("layer", "<i2"), ("nsel", "<i2"), ("sel", "<i4", (8,)), ("hid", "<f4", (H,))]) + arr = np.fromfile(args.trace, dtype=dt) + print(f"records={len(arr)}", flush=True) + dev = "cuda" if torch.cuda.is_available() else "cpu" </file context>
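The size check needs only the record layout. This sketch derives the record size from the dtype in the quoted code (two little-endian int16 fields, eight int32 selections, `n_embd` float32 values, packed with no padding) and uses only the standard library. Treating an empty file as an error is an added assumption, and the function names are hypothetical.

```python
import os


def trace_record_size(n_embd, top_k=8):
    """Bytes per record: layer <i2, nsel <i2, sel <i4 x top_k, hid <f4 x n_embd."""
    return 2 + 2 + 4 * top_k + 4 * n_embd


def count_trace_records(path, n_embd):
    """Return the record count, or exit if the file is empty or truncated."""
    size = os.path.getsize(path)
    rec = trace_record_size(n_embd)
    if size == 0 or size % rec != 0:
        raise SystemExit(
            f"{path}: size {size} is not a positive multiple of record size {rec}; "
            "trace is empty, truncated, or was written with a different n_embd"
        )
    return size // rec
```

Calling `count_trace_records(args.trace, args.n_embd)` before `np.fromfile` would also catch a mismatched `--n-embd`, which otherwise yields silently misaligned records.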

cubic-dev-ai · 2026-06-05T15:08:19Z

+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+
+    agg = {8: [], 16: [], 24: []}
+    for L in range(1, args.n_layer):


P3: The per-layer training loop skips layer 0 (range(1, args.n_layer)), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/train_pregate.py, line 47: <comment>The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</comment> <file context> @@ -0,0 +1,85 @@ + dev = "cuda" if torch.cuda.is_available() else "cpu" + + agg = {8: [], 16: [], 24: []} + for L in range(1, args.n_layer): + idx = np.where(arr["layer"] == L)[0][:args.max_per_layer] + if len(idx) < 400: </file context>

A blocking proc.stdout.readline() returns only on a new line or EOF, so the time-based timeout check ran *after* it: a daemon that goes silent (stalled model load, deadlock, disk stall) blocked the caller forever and the timeout never fired. This broke unattended calibration/validation runs.

Move the protocol into spark/_daemon.py: a reader thread pumps stdout into a queue and callers wait with queue.get(timeout=...), so a read can never outlive its deadline. wait_ready() and the per-chunk generate loops in calibrate.py and validate.py now honor --ready-timeout / --gen-timeout and kill the daemon on stall. (select() + buffered readline would miss buffered data; the reader thread does the blocking readline itself.)

Verified: silent daemon times out in ~1s, dead daemon returns immediately, happy path unchanged.

Co-Authored-By: WOZCODE <contact@withwoz.com>
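The reader-thread pattern described in this commit message can be sketched as follows. This is not the contents of spark/_daemon.py; the class name `LineReader` and its use of the built-in `TimeoutError` are assumptions made for illustration.

```python
import queue
import threading


class LineReader:
    """Pump a text stream into a queue so reads can honor a deadline."""

    def __init__(self, stream):
        self._q = queue.Queue()
        self._t = threading.Thread(target=self._pump, args=(stream,), daemon=True)
        self._t.start()

    def _pump(self, stream):
        try:
            # The blocking readline happens here, off the caller's thread.
            for line in iter(stream.readline, ""):
                self._q.put(line.rstrip("\n"))
        finally:
            self._q.put(None)  # EOF sentinel

    def readline(self, timeout):
        """Return the next line, None at EOF, or raise TimeoutError on silence."""
        try:
            return self._q.get(timeout=max(timeout, 0.0))
        except queue.Empty:
            raise TimeoutError(f"no output within {timeout}s") from None
```

In use, `stream` would be the `stdout` of a `subprocess.Popen` opened in text mode; on `TimeoutError` the caller kills the process, which matches the stall handling described above.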

cubic-dev-ai

5 issues found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="optimizations/spark/spark/validate.py">

<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>

<violation number="2" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>

<file name="optimizations/spark/spark/calibrate.py">

<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>

<file name="optimizations/spark/spark/_daemon.py">

<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>

<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.)</violation>
</file>


cubic-dev-ai · 2026-06-05T15:34:02Z

+        if len(ids) < 8:
+            continue
+        write_counted_i32(pp, ids)
+        reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)


P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 91: <comment>daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</comment> <file context> @@ -100,30 +88,16 @@ def drain(): - break - if okl and okl.startswith("ok "): - m = re.search(r"decode_tok_s=([0-9.]+)", okl) + reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout) + if reply is None: + print("daemon stalled/closed; stopping early") </file context>
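A sketch of a non-zero exit on stall. It assumes, as the quoted context shows, that `daemon.request` returns `None` when the daemon stalls or closes; since that return value does not distinguish the two cases, mapping both to 124 is a simplification. The helper name is hypothetical; 124 is the status GNU `timeout` uses when a command times out.

```python
import sys

EXIT_TIMEOUT = 124  # status GNU timeout reports when the command timed out


def require_reply(reply, chunk_index):
    """Return the reply, or abort with exit code 124 if the daemon gave none."""
    if reply is None:
        print(f"daemon stalled/closed at chunk {chunk_index}; aborting", file=sys.stderr)
        raise SystemExit(EXIT_TIMEOUT)
    return reply
```

If partial results should still be printed, the loop could instead record the stall, finish its summary, and call `sys.exit(EXIT_TIMEOUT)` at the very end.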

cubic-dev-ai · 2026-06-05T15:34:02Z

+    if toks:
+        print(f"decode tok/s: mean={statistics.mean(toks):.1f} median={statistics.median(toks):.1f} "
+              f"over {len(toks)} chunks")
+    colds = [float(m.group(1)) for l in daemon.stderr_lines


P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 109: <comment>daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</comment> <file context> @@ -132,7 +106,7 @@ def drain(): print(f"decode tok/s: mean={statistics.mean(toks):.1f} median={statistics.median(toks):.1f} " f"over {len(toks)} chunks") - colds = [float(m.group(1)) for l in prof + colds = [float(m.group(1)) for l in daemon.stderr_lines for m in [re.search(r"cold_experts/tok=([0-9.]+)", l)] if m] if colds: </file context>
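Joining the pump thread before reading closes the race. The sketch below is a stand-alone collector, not the `_daemon.py` class; the name `StderrCollector` and the `finish` method are assumptions.

```python
import threading


class StderrCollector:
    """Accumulate lines from a text stream on a background thread."""

    def __init__(self, stream):
        self.lines = []
        self._t = threading.Thread(target=self._pump, args=(stream,), daemon=True)
        self._t.start()

    def _pump(self, stream):
        for line in iter(stream.readline, ""):
            self.lines.append(line.rstrip("\n"))

    def finish(self, timeout=5.0):
        """Wait for the pump to hit EOF; return True if it drained in time."""
        self._t.join(timeout)
        return not self._t.is_alive()
```

EOF only arrives once the daemon has exited or closed its stderr, so `finish()` belongs after the process has been terminated and before the `cold_experts/tok` lines are parsed.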

cubic-dev-ai · 2026-06-05T15:34:02Z

+        if len(ids) < 8:
+            continue
+        write_counted_i32(pp, ids)
+        reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)


P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/calibrate.py, line 78: <comment>Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</comment> <file context> @@ -84,38 +75,18 @@ def main(): - if time.time() - ts > 120: - break - if not ok: + reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout) + if reply is None: + print(f"[calib] daemon stalled/closed at chunk {i}; stopping", flush=True) </file context>

cubic-dev-ai · 2026-06-05T15:34:02Z

+        self._q.put(None)  # EOF sentinel
+
+    def _pump_stderr(self):
+        for line in self.proc.stderr:


P2: _pump_stderr silently crashes on encoding errors — same UnicodeDecodeError risk as stdout, causing data loss for callers that read stderr_lines (e.g. validate.py extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/_daemon.py, line 44: <comment>`_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</comment> <file context> @@ -0,0 +1,106 @@ + self._q.put(None) # EOF sentinel + + def _pump_stderr(self): + for line in self.proc.stderr: + self.stderr_lines.append(line.rstrip()) + </file context>
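Instead of catching the exception, the pump can avoid it by decoding leniently. This sketch assumes the pipe is opened in binary mode; the function name is hypothetical.

```python
def pump_bytes(stream, sink):
    """Append each line from a binary stream to sink, replacing undecodable bytes."""
    for raw in iter(stream.readline, b""):
        # errors="replace" substitutes U+FFFD rather than raising UnicodeDecodeError,
        # so one bad byte cannot end the thread and drop later metric lines.
        sink.append(raw.decode("utf-8", errors="replace").rstrip())
```

For a pipe already opened in text mode, passing `errors="replace"` to `subprocess.Popen` has a similar effect without changing the pump loop.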

cubic-dev-ai · 2026-06-05T15:34:02Z

+                line = self.readline(deadline - time.time())
+            except DaemonTimeout:
+                self.kill()
+                raise SystemExit(f"daemon did not become ready within {timeout}s (no output)")


P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.)

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/_daemon.py, line 66: <comment>Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts. (Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) .</comment> <file context> @@ -0,0 +1,106 @@ + line = self.readline(deadline - time.time()) + except DaemonTimeout: + self.kill() + raise SystemExit(f"daemon did not become ready within {timeout}s (no output)") + if line is None: + raise SystemExit("daemon died before ready banner") </file context>

Server side of Luce Spark, on top of the merged hybrid-offload engine:

- Bounded GPU expert cache: spare slots over-allocated on the hot expert stack + moe_hybrid_cache_swap_in (LRU). On a cold hit the expert is swapped into a spare slot and served through the unified GPU FFN, so cold-misses fall to ~0 in fixed VRAM after warmup. Gated by DFLASH_LAGUNA_EXPERT_CACHE / _CACHE_SLOTS / _GPU_REMAP.
- Swap-rebuild fix + clean profile flush: build_hybrid_storage_from_file() re-reads experts from the GGUF mmap (partial-load keeps no full expert tensors resident), used by both init and post-request swap (was asserting in ggml_backend_tensor_get). routing_stats_ now also allocates when swap is enabled, and the placement profile flushes after each hybrid generate so NEXT_PLACEMENT_OUT works without the swap path.
- Pre-gate trace capture (DFLASH_LAGUNA_PREGATE_TRACE), profile-gated, feeds the optimizations/spark pregate trainer.

Laguna-XS.2 Q4_K_M / RTX 3090, calibrated 60% + 32 cache slots: ~85-88 tok/s (cold ~0) at 14.6 GiB peak, vs 66 uniform / 111 all-GPU @18.8 GiB.

Built + smoke-tested against origin/main (3-way merge over the generate_impl rename).

Co-Authored-By: WOZCODE <contact@withwoz.com>
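The bounded expert cache described in this commit can be modeled for a single layer as a pinned hot set plus an LRU ring of spare slots. The sketch only tracks which expert ids are resident and counts cold hits; it is not `moe_hybrid_cache_swap_in`, and the class name and return labels are hypothetical.

```python
from collections import OrderedDict


class ExpertSlotCache:
    """Single-layer model: pinned hot experts plus n_slots LRU-managed spares."""

    def __init__(self, hot, n_slots):
        self.hot = frozenset(hot)
        self.n_slots = n_slots
        self.slots = OrderedDict()  # cached cold experts, least recent first
        self.cold_hits = 0
        self.lookups = 0

    def access(self, expert):
        """Return 'hot', 'cached', or 'cold' (a miss that triggers a swap-in)."""
        self.lookups += 1
        if expert in self.hot:
            return "hot"
        if expert in self.slots:
            self.slots.move_to_end(expert)  # refresh recency
            return "cached"
        self.cold_hits += 1
        if self.n_slots > 0:
            if len(self.slots) >= self.n_slots:
                self.slots.popitem(last=False)  # evict least recently used
            self.slots[expert] = True
        return "cold"
```

Replaying a routing trace through `access` and reading `cold_hits / lookups` gives the cold-hit rate for a given slot count, which is one way to see the capacity-bound behavior for different values of `n_slots`.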

cubic-dev-ai

5 issues found across 6 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/moe_hybrid_storage.cpp">

<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>

<violation number="2" location="server/src/common/moe_hybrid_storage.cpp:507">
P0: `moe_hybrid_cache_swap_in` only handles non-fused gate/up/down tensors. For fused gate-up architectures (Laguna), `gate_hot`/`up_hot` are nullptr so the function always returns -1, making the bounded cache a complete no-op for the very architecture it targets.</violation>
</file>

<file name="server/src/common/moe_hybrid_ffn_eval.h">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>

<file name="server/src/laguna/laguna_backend.cpp">

<violation number="1" location="server/src/laguna/laguna_backend.cpp:169">
P3: Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</violation>

<violation number="2" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
</file>


cubic-dev-ai · 2026-06-05T16:09:12Z

        const int cold_count = (int)dst.cold_expert_ids.size();
+        const int spare = (cold_count > 0 && cache_slots > 0)
+                          ? std::min(cache_slots, cold_count) : 0;
+        const int hot_alloc = hot_count + spare;


P2: Hot tensor allocation still depends on hot_count > 0, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_storage.cpp, line 377: <comment>Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</comment> <file context> @@ -371,6 +372,13 @@ bool build_moe_hybrid_storage_from_file( const int cold_count = (int)dst.cold_expert_ids.size(); + const int spare = (cold_count > 0 && cache_slots > 0) + ? std::min(cache_slots, cold_count) : 0; + const int hot_alloc = hot_count + spare; + dst.hot_active = hot_count; + dst.cache_slots = spare; </file context>

cubic-dev-ai · 2026-06-05T16:09:12Z

-    int n_hot);
+    int n_hot,
+    bool gpu_remap = false,
+    int n_expert = 0);


P2: int n_expert = 0 default silently disables remapping when gpu_remap=true is set without n_expert. The n_expert > 0 guard in the implementation (moe_hybrid_ffn_eval.cpp:440) prevents a crash but silently produces a non-remapped graph, making gpu_remap=true a non-binding hint rather than a semantic guarantee. Future callers adding /*gpu_remap=*/true without also passing cfg.n_expert will get incorrect (non-remapped) behavior with no warning.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_ffn_eval.h, line 175: <comment>`int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</comment> <file context> @@ -170,7 +170,9 @@ bool build_cached_hot_graph( - int n_hot); + int n_hot, + bool gpu_remap = false, + int n_expert = 0); // Build/rebuild cached cold FFN graph. </file context>

cubic-dev-ai · 2026-06-05T16:09:12Z

    for (int s = 0; s < req.n_gen; ++s) {
        maybe_force_close(next_tok, s);
-        if (next_tok == w_.eos_id || next_tok == w_.eos_chat_id) break;
+        if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;


P2: DFLASH_IGNORE_EOS is interpreted by presence instead of value, so setting it to 0 still disables EOS stopping.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 297: <comment>`DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</comment> <file context> @@ -283,7 +294,7 @@ GenerateResult LagunaBackend::generate_impl(const GenerateRequest & req, for (int s = 0; s < req.n_gen; ++s) { maybe_force_close(next_tok, s); - if (next_tok == w_.eos_id || next_tok == w_.eos_chat_id) break; + if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break; result.tokens.push_back(next_tok); history.push_back(next_tok); </file context>

Suggested change

-        if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
+        if (!([]{ const char * v = std::getenv("DFLASH_IGNORE_EOS"); return v && std::atoi(v) != 0; })() && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;

dflash_server --spark: one self-tuning command for both MoE backends.

- Enables the bounded expert cache (auto working set) with sized slots (--spark-slots, default 32).
- Auto-loads <model>.gguf.spark.csv if present and keeps persisting it after each request from live routing (laguna via NEXT_PLACEMENT_OUT, qwen35moe via RUNTIME_STATS_OUT, which is the var that allocates its routing accumulator).
- Wires the expert cache into qwen35moe: cache_slots into its from-file storage + moe_hybrid_cache_swap_in before both host-partition loops in the pipelined decode (symmetric to laguna; swaps are auto-picked-up by the hot_local_by_global lookup, no graph surgery).

Verified end-to-end on RTX 3090: laguna and Qwen3.6-35B-A3B both write + reload the profile (source=hotness:...) and stay coherent under forced offload + cache, no crash.

dflash_server needs libcurl dev headers (find_package(CURL)). Built against origin/main.

Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="optimizations/spark/spark/tokenizer.py">

<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>

<file name="optimizations/spark/spark/extract_sessions.py">

<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>

<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>
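Both extract_sessions.py findings can be addressed together. The sketch below assumes the split is decided by hashing a session id into a bucket; the function names are hypothetical and the real script may assign splits differently. `usedforsecurity=False` is a standard `hashlib` keyword on Python 3.9+.

```python
import hashlib


def validate_test_frac(value: float) -> float:
    """Reject fractions that would make the split degenerate."""
    if not (0.0 < value < 1.0):
        raise ValueError(f"--test-frac must be in (0, 1), got {value}")
    return value


def is_held_out(session_id: str, test_frac: float) -> bool:
    """Deterministically assign a session to the held-out split."""
    validate_test_frac(test_frac)
    # MD5 is used only as a stable bucketing hash, so mark it non-security;
    # this keeps the call working on FIPS-enforcing systems.
    digest = hashlib.md5(session_id.encode("utf-8"), usedforsecurity=False).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < test_frac
```

Comparing a bucket against the fraction avoids any division by `test_frac`, so an invalid value is reported by the explicit check rather than by an arithmetic error.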

<file name="optimizations/spark/README.md">

<violation number="1" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
</file>

<file name="optimizations/spark/spark/validate.py">

<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>

<violation number="2" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>

<violation number="3" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>
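Two of the validate.py findings (timeout exit status and fixed /tmp names) have small standard-library fixes. This is a sketch under assumptions: `run_with_deadline` and `unique_scratch_path` are hypothetical helpers, not functions from the PR, and the real script talks to a long-lived daemon rather than a one-shot subprocess.

```python
import os
import subprocess
import tempfile

TIMEOUT_EXIT_CODE = 124  # the status GNU timeout returns when the deadline is hit


def run_with_deadline(cmd, timeout_s):
    """Run cmd; return its exit code, or 124 if it exceeds timeout_s."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return TIMEOUT_EXIT_CODE


def unique_scratch_path(prefix="spark_validate_", suffix=".csv"):
    """Create a per-run scratch file so concurrent runs cannot collide."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix)
    os.close(fd)
    return path
```

A caller would then pass the result to `sys.exit(...)`, so a wrapper can tell a timed-out validation apart from a completed one.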

<file name="optimizations/spark/spark/train_pregate.py">

<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) [FEEDBACK_USED]</violation>

<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>
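The file-size check requested for the trace reader can run before any array is loaded. A minimal sketch, assuming a fixed record size known to the caller; `count_trace_records` is a hypothetical helper and the actual record layout is not shown in this review.

```python
import os


def count_trace_records(path, record_bytes):
    """Return the number of whole records, failing early on a ragged file."""
    if record_bytes <= 0:
        raise ValueError("record_bytes must be positive")
    size = os.path.getsize(path)
    if size % record_bytes != 0:
        raise ValueError(
            f"{path}: size {size} is not a multiple of record size "
            f"{record_bytes}; the trace looks truncated or corrupt"
        )
    return size // record_bytes
```

Calling this before `np.fromfile` turns a silently shortened dataset into an explicit error.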

<file name="optimizations/spark/spark/calibrate.py">

<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>

<file name="optimizations/spark/spark/_daemon.py">

<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>

<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) [FEEDBACK_USED].</violation>
</file>
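The stderr findings (thread death on bad bytes here, and the missing join noted for validate.py) can be illustrated with one small pump. This is a sketch, not the `_daemon.py` code: `collect_stderr` is a hypothetical function that reads raw bytes, decodes with `errors="replace"`, and joins the pump thread before the lines are read.

```python
import subprocess
import threading


def collect_stderr(cmd):
    """Run cmd and return all of its stderr lines, tolerating bad bytes."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
    lines = []

    def pump():
        # Read bytes and decode leniently so one invalid byte cannot
        # raise UnicodeDecodeError and kill the thread.
        for raw in proc.stderr:
            lines.append(raw.decode("utf-8", errors="replace").rstrip("\r\n"))

    thread = threading.Thread(target=pump, daemon=True)
    thread.start()
    proc.wait()
    thread.join()  # ensure the final lines are captured before reading
    proc.stderr.close()
    return lines
```

Joining before reading matters for metrics parsed from the tail of stderr, since the last lines are often the summary.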

<file name="server/src/common/moe_hybrid_storage.cpp">

<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>
</file>

<file name="server/src/common/moe_hybrid_ffn_eval.h">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>

<file name="server/src/laguna/laguna_backend.cpp">

<violation number="1" location="server/src/laguna/laguna_backend.cpp:169">
P3: Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</violation>

<violation number="2" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
</file>

<file name="server/src/server/server_main.cpp">

<violation number="1" location="server/src/server/server_main.cpp:644">
P2: Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</violation>
</file>

cubic-dev-ai · 2026-06-05T16:37:25Z

+            const std::string pfx = is_laguna ? "DFLASH_LAGUNA_" : "DFLASH_QWEN35MOE_";
+            const std::string profile = std::string(bargs.model_path) + ".spark.csv";
+            std::FILE * pf = std::fopen(profile.c_str(), "rb");
+            const bool have_profile = (pf != nullptr);


P2: Stale profile sidecar not validated against the model file. The .spark.csv profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.
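One way to detect a stale sidecar is to store a fingerprint of the model file next to the profile and compare it on load. The sketch below is an illustration in Python rather than the server's C++; the `.meta` file, the size plus mtime fingerprint, and all function names are assumptions, not part of the PR.

```python
import json
import os


def model_fingerprint(model_path):
    """Cheap identity for a model file: byte size and modification time."""
    st = os.stat(model_path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns}


def write_fingerprint(model_path, profile_path):
    """Record which model the profile was calibrated against."""
    with open(profile_path + ".meta", "w", encoding="utf-8") as fh:
        json.dump(model_fingerprint(model_path), fh)


def profile_is_fresh(model_path, profile_path):
    """True only if the profile exists and matches the current model file."""
    meta_path = profile_path + ".meta"
    if not (os.path.exists(profile_path) and os.path.exists(meta_path)):
        return False
    try:
        with open(meta_path, encoding="utf-8") as fh:
            saved = json.load(fh)
    except (OSError, ValueError):
        return False
    return saved == model_fingerprint(model_path)
```

A content hash would be stricter but costly for multi-gigabyte GGUF files; size plus mtime is the cheaper trade-off chosen here.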

moe_hybrid_cache_swap_in only validated/copied the separate gate/up/down tensors, so for a fused-gate-up MoE (gate_up_hot populated, gate_hot/up_hot null, a layout the storage supports via fused_gate_up) it returned -1 and the bounded cache silently no-op'd. Branch on fused_gate_up: copy gate_up + down when fused, gate + up + down otherwise.

Current targets (laguna-xs2, Qwen3.6-35B-A3B) use SEPARATE tensors, so no behavior change for them (laguna cache verified unchanged: cold 38.5 -> 1.9 / 77 -> 85 tok/s at 32 slots). Fixes the latent no-op for fused-gate-up archs.

Co-Authored-By: WOZCODE <contact@withwoz.com>

extract_sessions.py now pulls both Claude Code (~/.claude/projects) and Codex (~/.codex/sessions/**/rollout-*.jsonl) by default (--source claude|codex|both). Codex rollouts are parsed from response_item user+assistant content blocks (input_text/output_text/text), skipping the developer/system boilerplate so the corpus reflects real traffic, not instructions. Both sources merge into one corpus with the same per-session train/held-out split.

Verified locally: 157 sessions (claude + codex) -> 432 train chunks (~212K tok), held-out split by session.

Co-Authored-By: WOZCODE <contact@withwoz.com>
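The Codex parsing rule in this commit (keep user and assistant text blocks, skip developer and system records) can be sketched as a per-line filter. The block types `input_text`, `output_text`, and `text` come from the commit message; the field names `type`, `payload`, `role`, and `content` are assumptions about the rollout layout, and `rollout_line_to_text` is a hypothetical function. Every field is type-checked before use, which is the same class of guard the review asks for in the C++ bootstrap parser.

```python
import json

TEXT_BLOCK_TYPES = {"input_text", "output_text", "text"}
KEPT_ROLES = {"user", "assistant"}


def rollout_line_to_text(line):
    """Return the text blocks of one JSONL record, or [] if it is skipped."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        return []
    if not isinstance(rec, dict) or rec.get("type") != "response_item":
        return []
    payload = rec.get("payload")  # assumed field name
    if not isinstance(payload, dict):
        return []
    role = payload.get("role")
    if not isinstance(role, str) or role not in KEPT_ROLES:
        return []  # drops developer/system boilerplate
    content = payload.get("content")
    if not isinstance(content, list):
        return []
    out = []
    for block in content:
        if not isinstance(block, dict):
            continue
        btype, text = block.get("type"), block.get("text")
        # Guard both fields: heterogeneous JSON may carry non-string values.
        if isinstance(btype, str) and btype in TEXT_BLOCK_TYPES and isinstance(text, str):
            out.append(text)
    return out
```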

The only knob a user touches is now --spark-vram <GiB>: the total VRAM Spark may use. spark_budget_split() (common) clamps the expert budget to fit the target (minus core + KV + safety), then carves an auto-sized cache ring (~12%, capped 1.5 GiB) out of it; the rest is pinned-hot. Replaces the fixed 32-slot default. Both backends store the computed slot count and the storage build uses it (explicit DFLASH_*_CACHE_SLOTS still overrides).

server_main: --spark-vram sets DFLASH_SPARK + DFLASH_SPARK_VRAM_MB; --spark with no target sizes to the card.

Verified RTX 3090: laguna --spark-vram 13 -> hot 9.21 GiB + 19 slots, peak 11.6 GiB; qwen3.6-35B --spark-vram 15 -> hot 10.56 GiB + 21 slots, peak 13.8 GiB. Both under target, coherent.

Co-Authored-By: WOZCODE <contact@withwoz.com>
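The arithmetic of the split can be sketched in a few lines. This is not the C++ `spark_budget_split()`; it is a Python approximation that reuses the name and argument order seen in the diff, and it assumes one cache slot costs `total_expert_bytes / n_expert`. The 12% fraction and 1.5 GiB cap come from the commit message; everything else is illustrative.

```python
GIB = 1 << 30


def spark_budget_split(expert_budget, total_expert_bytes, n_expert,
                       reserved_bytes, target_bytes,
                       cache_frac=0.12, cache_cap=int(1.5 * GIB)):
    """Return (hot_bytes, cache_slots) for a given VRAM target."""
    if target_bytes > 0:
        # Clamp the expert budget to what the target leaves after reserves.
        expert_budget = min(expert_budget, max(0, target_bytes - reserved_bytes))
    slot_bytes = total_expert_bytes // n_expert  # assumption: one slot per expert index
    cache_bytes = min(int(expert_budget * cache_frac), cache_cap)
    cache_slots = cache_bytes // slot_bytes if slot_bytes > 0 else 0
    hot_bytes = expert_budget - cache_slots * slot_bytes  # the rest is pinned-hot
    return hot_bytes, cache_slots
```

Whatever is passed as `reserved_bytes` must cover every non-expert allocation; any reserve left out of that sum lets the total exceed the requested target.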

cubic-dev-ai

4 issues found across 8 files (changes from recent commits).


<file name="optimizations/spark/README.md">

<violation number="1" location="optimizations/spark/README.md:103">
P2: README now implies `--spark` works for any model, but server code ignores it on unsupported architectures (non-laguna/non-qwen35moe).</violation>

<violation number="2" location="optimizations/spark/README.md:107">
P3: The docs call `--spark-vram` the “only knob” even though `--spark-slots` is still a functional tuning flag in the same flow.</violation>
</file>

<file name="server/src/qwen35moe/qwen35moe_backend.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.cpp:1476">
P1: Spark VRAM target clamp ignores warm-cache and draft reserves, causing possible over-allocation beyond the requested `--spark-vram` budget.</violation>
</file>

<file name="server/src/laguna/laguna_backend.cpp">

<violation number="1" location="server/src/laguna/laguna_backend.cpp:656">
P2: Spark VRAM target clamp omits warm-cache reserve, so `--spark-vram` can allocate more expert memory than the requested total budget.</violation>
</file>

cubic-dev-ai · 2026-06-06T11:25:12Z

+        uint64_t target = 0;
+        if (const char * t = std::getenv("DFLASH_SPARK_VRAM_MB")) target = (uint64_t)std::atoll(t) << 20;
+        auto sb = dflash::common::spark_budget_split(expert_budget, total_expert_bytes, w.n_expert,
+                                                     core_bytes + kv_total + safety_bytes, target);


P1: Spark VRAM target clamp ignores warm-cache and draft reserves, causing possible over-allocation beyond the requested --spark-vram budget.

Suggested change

core_bytes + kv_total + safety_bytes, target);

core_bytes + kv_total + warm_cache_bytes + safety_bytes + draft_reserve_bytes, target);

cubic-dev-ai · 2026-06-06T11:25:12Z

+auto-tunes from its own traffic:
+
+```bash
+dflash_server <model.gguf> --spark                   # use the card, auto-size everything


P2: README now implies --spark works for any model, but server code ignores it on unsupported architectures (non-laguna/non-qwen35moe).

Suggested change

dflash_server <model.gguf> --spark                   # use the card, auto-size everything

dflash_server <model.gguf> --spark # laguna/qwen35moe only; auto-size on supported MoE backends

cubic-dev-ai · 2026-06-06T11:25:12Z

+dflash_server <model.gguf> --spark --spark-vram 14   # cap total VRAM at 14 GiB
+```
+
+The only knob is `--spark-vram <GiB>`: the total VRAM Spark may use. From that


P3: The docs call --spark-vram the “only knob” even though --spark-slots is still a functional tuning flag in the same flow.

Suggested change

The only knob is `--spark-vram <GiB>`: the total VRAM Spark may use. From that

The primary knob is `--spark-vram <GiB>`: the total VRAM Spark may use. From that

--spark already auto-calibrates from live traffic and persists the profile after each request, reloading it on restart. But the routing accumulator was zeroed each boot, so the stored profile only ever reflected the current run. Seed the accumulator from the loaded profile (when one exists) so calibration strictly accumulates across restarts; a fresh first run still starts at zero and calibrates fast. Both backends.

Verified RTX 3090, laguna --spark, same 2 prompts/run: total routed count 65520 -> 131040 -> 196560 across three restarts (additive). Was flat before.

Co-Authored-By: WOZCODE <contact@withwoz.com>
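The additive behavior described here can be modeled with a counter that is seeded from the saved profile before new routing is added. This is a sketch only: the three-column `layer,expert,count` CSV layout and the function names are assumptions, not the real `.spark.csv` format.

```python
import csv
from collections import Counter


def load_profile(path):
    """Load per-(layer, expert) counts; a missing file means a fresh start."""
    counts = Counter()
    try:
        with open(path, newline="", encoding="utf-8") as fh:
            for layer, expert, count in csv.reader(fh):
                counts[(int(layer), int(expert))] += int(count)
    except FileNotFoundError:
        pass
    return counts


def save_profile(path, counts):
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for (layer, expert), count in sorted(counts.items()):
            writer.writerow([layer, expert, count])


def serve_one_run(path, observed):
    """Seed from the stored profile, add this run's routing, persist."""
    counts = load_profile(path)  # seeding step: without it each run restarts at zero
    counts.update(observed)      # Counter.update adds to existing counts
    save_profile(path, counts)
    return sum(counts.values())
```

Replaying the same observed routing three times grows the total linearly, matching the additive pattern the commit reports.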

… boot

When --spark is on and no profile exists yet, the server warms placement from local agent history (Claude Code + Codex) before it starts serving, so the very first session is already calibrated. One-time; live traffic refines it after.

- spark_corpus.{h,cpp}: scrape ~/.claude/projects + ~/.codex/sessions into text chunks (dirs overridable via DFLASH_SPARK_CLAUDE_DIR / _CODEX_DIR). Mirrors the offline extractor, in-process, nlohmann/json + std::filesystem.
- ModelBackend: spark_wants_bootstrap() / spark_bootstrap_finalize() virtuals (default no-op, so non-hybrid backends skip cleanly).
- server_main: feeds the corpus through generate() (prefill observes routing), then finalize saves the profile and rebuilds placement.
- laguna: implements finalize via the existing build_from_stats + build_hybrid_storage_from_file (reuses the same budget + cache as init).

Verified RTX 3090: laguna --spark, no profile, staged history -> 'bootstrapping from 24 chunks' -> profile written -> restart loads source=file, bootstrap does not re-run, coherent output. No history -> clean skip to live calibration.

qwen35moe day-one bootstrap is a follow-up (needs its mmap-retained rebuild); it keeps calibrating from live traffic in the meantime.

Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

4 issues found across 7 files (changes from recent commits).


<file name="server/src/common/spark_corpus.cpp">

<violation number="1" location="server/src/common/spark_corpus.cpp:35">
P2: Codex block parsing calls `get<std::string>()` on `text` without checking `is_string()`, which can throw on heterogeneous JSON and fail bootstrap processing.</violation>

<violation number="2" location="server/src/common/spark_corpus.cpp:37">
P1: Unguarded `b["thinking"].get<std::string>()` — checks `b.contains("thinking")` but not `b["thinking"].is_string()`. Same type_error risk if the field exists with a non-string type.</violation>

<violation number="3" location="server/src/common/spark_corpus.cpp:60">
P2: Using `o.value("type", "")` assumes `type` is string-typed; a non-string `type` can throw and stop corpus extraction during bootstrap.</violation>
</file>

<file name="server/src/laguna/laguna_backend.cpp">

<violation number="1" location="server/src/laguna/laguna_backend.cpp:1497">
P2: Unchecked `save_csv` return value in `spark_bootstrap_finalize` — a silent save failure loses the Spark profile CSV without logging any error. Check the return and log on failure, matching the existing pattern at line 171.</violation>
</file>

cubic-dev-ai · 2026-06-06T12:33:33Z

+        if (t == "text" && b.contains("text")) {
+            append(out, b["text"].get<std::string>());
+        } else if (t == "thinking" && b.contains("thinking")) {
+            append(out, b["thinking"].get<std::string>());


P1: Unguarded b["thinking"].get<std::string>() — checks b.contains("thinking") but not b["thinking"].is_string(). Same type_error risk if the field exists with a non-string type.

cubic-dev-ai · 2026-06-06T12:33:33Z

+        if (line.empty()) continue;
+        json o;
+        try { o = json::parse(line); } catch (...) { continue; }
+        const std::string ty = o.value("type", "");


P2: Using o.value("type", "") assumes type is string-typed; a non-string type can throw and stop corpus extraction during bootstrap.

Suggested change

const std::string ty = o.value("type", "");

const std::string ty = (o.contains("type") && o["type"].is_string()) ? o["type"].get<std::string>() : "";

cubic-dev-ai · 2026-06-06T12:33:33Z

+        if (!b.is_object()) continue;
+        const std::string t = b.value("type", "");
+        if (t == "text" && b.contains("text")) {
+            append(out, b["text"].get<std::string>());


P2: Codex block parsing calls get<std::string>() on text without checking is_string(), which can throw on heterogeneous JSON and fail bootstrap processing.

cubic-dev-ai · 2026-06-06T12:33:33Z

+bool LagunaBackend::spark_bootstrap_finalize(const std::string & profile_path) {
+    if (!spark_wants_bootstrap()) return false;
+    std::string err;
+    routing_stats_->save_csv(profile_path, &err);  // persist the observed routing


P2: Unchecked save_csv return value in spark_bootstrap_finalize — a silent save failure loses the Spark profile CSV without logging any error. Check the return and log on failure, matching the existing pattern at line 171.

Suggested change

routing_stats_->save_csv(profile_path, &err); // persist the observed routing

if (!routing_stats_->save_csv(profile_path, &err))

std::fprintf(stderr, "[spark] profile save failed: %s\n", err.c_str());

…/cold expert residency

# Conflicts:
# server/src/server/server_main.cpp

Mirror laguna's bootstrap on qwen: store the hot budget + per-layer expert bytes during placement, and add a self-contained rebuild_hybrid_from_placement() that re-mmaps the GGUF and rebuilds the hot/cold storage for the calibrated placement. spark_bootstrap_finalize saves the observed profile and applies it in-process, so qwen's first session is calibrated from local history too. Frees the current storage before allocating the new one (no transient 2x VRAM; safe at startup with no in-flight requests, budget unchanged).

Verified RTX 3090: qwen3.6-35B --spark, no profile, staged history -> bootstrap from 24 chunks -> calibrated placement applied -> coherent output, no OOM.

Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).


<file name="server/src/qwen35moe/qwen35moe_backend.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.cpp:221">
P1: Spark bootstrap runs even when the model is in full-fit (non-hybrid) mode, which can unintentionally flip the backend into hybrid mode and disable the normal spec-decode path.</violation>
</file>

cubic-dev-ai · 2026-06-06T13:11:56Z

 }

+bool Qwen35MoeBackend::spark_wants_bootstrap() const {
+    return routing_stats_ && !layer_expert_bytes_.empty() && spark_expert_budget_ > 0;


P1: Spark bootstrap runs even when the model is in full-fit (non-hybrid) mode, which can unintentionally flip the backend into hybrid mode and disable the normal spec-decode path.

Suggested change

return routing_stats_ && !layer_expert_bytes_.empty() && spark_expert_budget_ > 0;

return target_weights().moe_hybrid && routing_stats_ && !layer_expert_bytes_.empty() && spark_expert_budget_ > 0;

howard0su · 2026-06-06T14:27:34Z

I suggest trying 8 GB, which is the most common VRAM size for gamers.

…ode)

Append the routed FFN into the per-layer attention graph, driven by the on-GPU selected tensor through a LUT (global->hot-slot), hot stack as expert source, drop-on-miss via valid_lut. One graph per layer, no host FFN dispatch, so ggml's CUDA-graph spans the whole layer. With the expert cache wired in, the cache warms the working set so drops fall to zero after warmup (exact) while the FFN stays fused. Gated by DFLASH_LAGUNA_FUSE_FFN; default path unchanged.

Verified RTX 3090 laguna-xs2:
- budget 100 (no drop): 115.7 tok/s, 161/161 tokens IDENTICAL to all-GPU (exact).
- budget 60 + cache: 91.5 tok/s, cold/tok=0 after warmup (exact), coherent.
- budget 60 pure drop: 101.5 tok/s = 88% of all-GPU, coherent, approximate.

Follow-ups: cross-layer prefetch (removes the per-layer readback capping fused+cache; coverage validated 92-98%), GPU-side routing observe, qwen port.

Co-Authored-By: WOZCODE <contact@withwoz.com>
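The LUT-driven routing with drop-on-miss can be illustrated on plain lists. This sketch borrows the names `hot_local_lut` and `valid_lut` from the commit, but the functions are hypothetical and the masking-by-multiplication is an assumption about how a fixed-shape graph would drop a miss; the real kernel operates on ggml tensors.

```python
def build_luts(n_expert, hot_experts):
    """Map global expert ids to hot-stack slots; mark which ids are resident."""
    hot_local_lut = [0] * n_expert  # slot 0 is a harmless default for misses
    valid_lut = [0] * n_expert
    for slot, expert in enumerate(hot_experts):
        hot_local_lut[expert] = slot
        valid_lut[expert] = 1
    return hot_local_lut, valid_lut


def route(selected, weights, hot_local_lut, valid_lut):
    """Return (slot, weight) per routed expert; misses keep shape but weigh 0."""
    return [(hot_local_lut[e], w * valid_lut[e]) for e, w in zip(selected, weights)]
```

Because a miss contributes zero weight instead of being removed, the output has the same length for every token, which is what lets the whole layer stay in one captured graph. Once the cache holds the working set, no weights are zeroed and the result is exact.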

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).


<file name="optimizations/spark/spark/tokenizer.py">

<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>

<file name="optimizations/spark/spark/extract_sessions.py">

<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>

<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>

<file name="optimizations/spark/README.md">

<violation number="1" location="optimizations/spark/README.md:103">
P2: README now implies `--spark` works for any model, but server code ignores it on unsupported architectures (non-laguna/non-qwen35moe).</violation>

<violation number="2" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>

<violation number="3" location="optimizations/spark/README.md:107">
P3: The docs call `--spark-vram` the “only knob” even though `--spark-slots` is still a functional tuning flag in the same flow.</violation>
</file>

<file name="optimizations/spark/spark/validate.py">

<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>

<violation number="2" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>

<violation number="3" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>

<file name="optimizations/spark/spark/train_pregate.py">

<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) [FEEDBACK_USED]</violation>

<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>

<file name="optimizations/spark/spark/calibrate.py">

<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>

<file name="optimizations/spark/spark/_daemon.py">

<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>

<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) [FEEDBACK_USED].</violation>
</file>

<file name="server/src/common/moe_hybrid_storage.cpp">

<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>
</file>

<file name="server/src/common/moe_hybrid_ffn_eval.h">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>

<file name="server/src/laguna/laguna_backend.cpp">

<violation number="1" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>

<violation number="2" location="server/src/laguna/laguna_backend.cpp:1093">
P0: `step_graph_free` does not reset `sg.hot_local_lut` or `sg.valid_lut` after freeing the ggml context, so a stale pointer from a previous fused-layer call persists into a subsequent non-fused-layer call. The guard `layer_sg.hot_local_lut` then passes on a dangling pointer, and the tensor-set writes that follow corrupt freed memory.</violation>

<violation number="3" location="server/src/laguna/laguna_backend.cpp:1497">
P2: Unchecked `save_csv` return value in `spark_bootstrap_finalize` — a silent save failure loses the Spark profile CSV without logging any error. Check the return and log on failure, matching the existing pattern at line 171.</violation>
</file>

<file name="server/src/server/server_main.cpp">

<violation number="1" location="server/src/server/server_main.cpp:644">
P2: Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</violation>
</file>

<file name="server/src/qwen35moe/qwen35moe_backend.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.cpp:221">
P1: Spark bootstrap runs even when the model is in full-fit (non-hybrid) mode, which can unintentionally flip the backend into hybrid mode and disable the normal spec-decode path.</violation>

<violation number="2" location="server/src/qwen35moe/qwen35moe_backend.cpp:1476">
P1: Spark VRAM target clamp ignores warm-cache and draft reserves, causing possible over-allocation beyond the requested `--spark-vram` budget.</violation>
</file>

<file name="server/src/common/spark_corpus.cpp">

<violation number="1" location="server/src/common/spark_corpus.cpp:35">
P2: Codex block parsing calls `get<std::string>()` on `text` without checking `is_string()`, which can throw on heterogeneous JSON and fail bootstrap processing.</violation>

<violation number="2" location="server/src/common/spark_corpus.cpp:37">
P1: Unguarded `b["thinking"].get<std::string>()` — checks `b.contains("thinking")` but not `b["thinking"].is_string()`. Same type_error risk if the field exists with a non-string type.</violation>

<violation number="3" location="server/src/common/spark_corpus.cpp:60">
P2: Using `o.value("type", "")` assumes `type` is string-typed; a non-string `type` can throw and stop corpus extraction during bootstrap.</violation>
</file>


cubic-dev-ai · 2026-06-06T14:44:08Z

        }

+        static const bool g_fuse_dec = (std::getenv("DFLASH_LAGUNA_FUSE_FFN") != nullptr);
+        if (g_fuse_dec && !is_dense && layer_sg.hot_local_lut) {


P0: step_graph_free does not reset sg.hot_local_lut or sg.valid_lut after freeing the ggml context, so a stale pointer from a previous fused-layer call persists into a subsequent non-fused-layer call. The guard layer_sg.hot_local_lut then passes on a dangling pointer, and the tensor-set writes that follow corrupt freed memory.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 1093:

<comment>`step_graph_free` does not reset `sg.hot_local_lut` or `sg.valid_lut` after freeing the ggml context, so a stale pointer from a previous fused-layer call persists into a subsequent non-fused-layer call. The guard `layer_sg.hot_local_lut` then passes on a dangling pointer, and the tensor-set writes that follow corrupt freed memory.</comment>

<file context>
@@ -1048,6 +1089,18 @@ bool LagunaBackend::hybrid_forward_one_token(int32_t tok, int kv_pos,
         }
+        static const bool g_fuse_dec = (std::getenv("DFLASH_LAGUNA_FUSE_FFN") != nullptr);
+        if (g_fuse_dec && !is_dense && layer_sg.hot_local_lut) {
+            auto & _st = moe_hybrid_->layers[(size_t)il];
+            std::vector<int32_t> _lut((size_t)w_.n_expert); std::vector<float> _vld((size_t)w_.n_expert);
</file context>

Update the integration snapshot for the latest open PR list, record the current heads for Luce-Org#345/Luce-Org#344/Luce-Org#343, and preserve the existing stack with no source changes.

…swap

Laguna MoE hybrid (offloaded) decode now runs the whole token in ONE fused graph (laguna_step_hybrid) with the routed FFN served from the hot expert stack via per-layer residency LUTs (build_laguna_moe_block_hybrid), instead of 40 separate per-layer graphs. That removes ~3.2 ms/token of per-layer host glue (the real cap on the multi-graph fused path). Per-layer LUT sets and the selected readback are batched into one H2D / one D2H. Default-on for the hybrid-offload path; set DFLASH_LAGUNA_NO_SINGLE_GRAPH=1 to fall back.

The bounded expert-cache swap is now async on cudaStreamPerThread with a pinned cold store (ggml_backend_cuda_host_buffer_type), so cold-tail re-fetch overlaps compute instead of stalling the host. Shared storage, so laguna and qwen35moe both benefit.

Measured (laguna-xs2 Q4_K_M, RTX 3090, single-token decode):
- all-resident 118 tok/s, 96/96 bit-exact vs all-GPU
- true 60% residency ~97-101 tok/s (was 76), coherent
- qwen3.6-35B-A3B offload ~98-106 tok/s (~92-99% of all-resident)

Co-Authored-By: WOZCODE <contact@withwoz.com>
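To make the "per-layer residency LUT" idea concrete, here is a minimal sketch of how such a table could be built on the host before the batched upload. The element types mirror the `_lut` (int32) and `_vld` (float) vectors visible in the diff context; the meaning assigned to them here (hot-slot index plus a 0/1 mask) and the names `ResidencyLut` and `build_residency_lut` are assumptions, not the engine's actual layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-layer lookup tables indexed by global expert id.
struct ResidencyLut {
    std::vector<int32_t> hot_local; // slot in the hot stack (0 when not resident)
    std::vector<float>   valid;     // 1.0f if the expert is resident, else 0.0f
};

// hot_ids[slot] is the global expert id stored in that hot slot.
inline ResidencyLut build_residency_lut(int n_expert, const std::vector<int32_t> & hot_ids) {
    ResidencyLut lut;
    lut.hot_local.assign(static_cast<std::size_t>(n_expert), 0);
    lut.valid.assign(static_cast<std::size_t>(n_expert), 0.0f);
    for (std::size_t slot = 0; slot < hot_ids.size(); ++slot) {
        const int32_t e = hot_ids[slot];
        if (e < 0 || e >= n_expert) {
            continue; // ignore out-of-range ids rather than write past the table
        }
        lut.hot_local[static_cast<std::size_t>(e)] = static_cast<int32_t>(slot);
        lut.valid[static_cast<std::size_t>(e)]     = 1.0f;
    }
    return lut;
}
```

With one such pair per layer, the tables for all layers can be concatenated and copied in a single transfer, which is the batching the commit message describes.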

cubic-dev-ai · 2026-06-07T22:42:08Z

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

davide221 · 2026-06-07T23:14:22Z

@cubic-dev-ai review

cubic-dev-ai · 2026-06-07T23:14:27Z

@davide221 I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

0 issues found across 4 files (changes from recent commits).


step_graph_free reset moe_weights/ffn_* but not hot_local_lut/valid_lut, so after a fused-MoE layer freed its graph the stale (dangling) tensor pointers persisted into the next layer; the `layer_sg.hot_local_lut` guard then passed on freed memory and the following tensor_set corrupted it. Only reachable on fused-gate-up archs under DFLASH_LAGUNA_FUSE_FFN (laguna-xs2 uses separate tensors, so our shipped targets are unaffected), but a real latent use-after-free. Flagged by cubic. Co-Authored-By: WOZCODE <contact@withwoz.com>
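The pattern behind this fix can be sketched in isolation: when a free routine releases the memory that other pointers are derived from, it should null those pointers too, so a later truthiness guard cannot pass on freed memory. The struct below is a hypothetical, simplified stand-in (a malloc'd arena in place of a ggml context); only the field names `hot_local_lut` and `valid_lut` come from the PR.

```cpp
#include <cstddef>
#include <cstdlib>

// Simplified stand-in for the per-layer step graph.
struct StepGraph {
    void *  ctx           = nullptr; // owns the memory the pointers below point into
    float * hot_local_lut = nullptr;
    float * valid_lut     = nullptr;
};

// Allocate one arena and carve both tables out of it.
inline void step_graph_alloc(StepGraph & sg, std::size_t n) {
    sg.ctx = std::malloc(2 * n * sizeof(float));
    if (sg.ctx == nullptr) {
        return; // allocation failed: leave every pointer null
    }
    sg.hot_local_lut = static_cast<float *>(sg.ctx);
    sg.valid_lut     = sg.hot_local_lut + n;
}

// Free the owning arena and reset every pointer derived from it, so a later
// `if (sg.hot_local_lut)` guard cannot succeed on a dangling pointer.
inline void step_graph_free(StepGraph & sg) {
    std::free(sg.ctx);
    sg.ctx           = nullptr;
    sg.hot_local_lut = nullptr;
    sg.valid_lut     = nullptr;
}
```

Because `std::free(nullptr)` is a no-op, calling `step_graph_free` twice is safe in this sketch.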

davide221 · 2026-06-07T23:24:47Z

Thanks @cubic-dev-ai. Triage of the engine-code findings:

P0 step_graph_free stale hot_local_lut/valid_lut — valid, fixed in 44be2d2 (reset both on free). Only reachable on fused-gate-up archs under DFLASH_LAGUNA_FUSE_FFN; shipped targets (laguna-xs2, qwen35moe) use separate tensors so were unaffected, but it was a real latent use-after-free.
P0 moe_hybrid_cache_swap_in "no-op for fused gate-up" — false positive. The full function handles both layouts (if (st.fused_gate_up) {...} else {...} + copy_slice for each); the flagged hunk was just my async/pinned change to copy_slice. Verified the cache works on laguna-xs2 (cold-hit drops with cache enabled).
P2s (n_expert default, hot_count>0 spare gating, --spark-vram warm-reserve, unchecked save_csv, double profile save, DFLASH_IGNORE_EOS by-presence) — acknowledged; mostly pre-existing in the merged offload engine, low-impact. Tracking as cleanup alongside the documented refactor (unify laguna_step_hybrid with build_laguna_graph).

cubic-dev-ai · 2026-06-07T23:25:06Z


@davide221 Clear triage, that all tracks. A couple of callouts on the items you pushed back on:

P0 moe_hybrid_cache_swap_in "no-op for fused gate-up" — agreed, that was a miss on my end. You're right that the function already branches on fused_gate_up with the full copy_slice for both layouts. The async/pinned change didn't alter that logic. Sorry for the noise there.

P2s — all reasonable to defer alongside the hybrid-descriptor refactor. If you end up tracking those anywhere visible (an issue or a follow-up doc), linking from the RESULTS.md refactor note would make them easy to find later. Entirely optional.

One thing I'd nudge from the conversation — @howard0su's suggestion about 8 GB testing is worth a nod if you have a minute. Running a quick --spark-vram 8 sanity check to see where the numbers land (even if just a cold-start tok/s) would make the README's "smaller GPU" claim concrete for the largest common VRAM tier. That said, I know you're deep in iteration and the current target is clearly 24 GB.


# Conflicts: # server/src/laguna/laguna_target_graph.cpp

…builders laguna_step_hybrid no longer duplicates build_laguna_graph / build_laguna_layer: a LagunaHybridMoe descriptor (hot storage + per-layer LUT tensors) is threaded through build_laguna_layer -> the MoE dispatcher, and laguna_step_hybrid now reuses build_laguna_graph for the attention/norm/layer loop + logits head. The all-GPU MoE block (build_laguna_moe_block_full) is untouched; the dispatcher routes to build_laguna_moe_block_hybrid only when a hybrid descriptor with storage is set. Removes ~100 lines of duplication and fixes a latent drift introduced by the main merge: the old inline loop missed `graph_tensor_f32` precision conversion that build_laguna_layer now applies on the shared path. Verified: sg_b100 (offload path, all experts resident) = 96/96 bit-exact vs all-GPU; offload 60% = 99.6 tok/s coherent; all-GPU = 119 tok/s. Co-Authored-By: WOZCODE <contact@withwoz.com>

The origin/main merge resolved the server/deps/llama.cpp gitlink to the PR's older b896cf69 (llama #12 hip-shim), rewinding it from main's 570d9785 and dropping llama #13 (crash), #14 (pflash-sparse stream-sync / BSA fix) and #15 (fattn-sparse hip cudaStreamDefault). Point the gitlink back to 570d9785 (a strict superset — b896cf69 is its ancestor) so the PR no longer downgrades llama.cpp / drops the BSA fix. Rebuilt + smoke-tested against 570d9785: sg_b100 96/96 bit-exact vs all-GPU, offload 60% 99.3 tok/s, all-GPU 119.5. Co-Authored-By: WOZCODE <contact@withwoz.com>

…mbers

Drives the test_dflash daemon over a fixed prompt and reports steady-state decode tok/s for: all-GPU (full residency), single-graph @100% (asserts token-for-token match vs all-GPU), and Spark offload (single-graph hybrid). decode_tok_s = generated_tokens / decode_s, the same metric and the same LagunaBackend decode path that `dflash_server --spark` runs.

Reproduces on RTX 3090, laguna-xs2 Q4_K_M:
- all-GPU 119.0 tok/s
- single-graph @100% 118.8 tok/s (128/128 EXACT vs all-GPU)
- Spark offload @~60% 99.1 tok/s (83% of all-GPU)

README: added a "Reproduce the decode numbers" section.

Co-Authored-By: WOZCODE <contact@withwoz.com>
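The two quantities the benchmark reports can be written down directly. The sketch below shows only the arithmetic, in C++ rather than the Python of `spark/bench.py`; the function names are hypothetical and the zero-duration guard is an assumption added to keep the helper total.

```cpp
#include <cstddef>
#include <vector>

// decode_tok_s = generated_tokens / decode_s, as defined in the commit message.
// Returns 0 for a non-positive duration instead of dividing by zero.
inline double decode_tok_s(std::size_t generated_tokens, double decode_s) {
    return decode_s > 0.0 ? static_cast<double>(generated_tokens) / decode_s : 0.0;
}

// Number of leading positions at which two token streams agree. A run is
// "EXACT" when this equals the length of both streams.
inline std::size_t matching_prefix(const std::vector<int> & a, const std::vector<int> & b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) {
        ++i;
    }
    return i;
}
```

Counting the matching prefix rather than returning a bare boolean gives output in the same "128/128" form the results above use, and shows where a divergence starts if there is one.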

Adds the Luce Spark card (-> optimizations/spark/) next to PFlash in the "Inference Engine Optimizations" grid, and lists calibrated MoE expert offload in the mission narrative. The card image (assets/cards/spark_card.png) is a Git-LFS asset to be dropped in, matching the other cards. Co-Authored-By: WOZCODE <contact@withwoz.com>

Integrates Luce Spark calibrated hot/cold expert residency, bounded cache support, and benchmark tooling.

Card for the main README optimizations grid (assets/cards/spark_card.png) and the optimizations/spark hero, replacing the placeholder. 1672x941, Git LFS. Co-Authored-By: WOZCODE <contact@withwoz.com>

…id-MoE section Co-Authored-By: WOZCODE <contact@withwoz.com>

…mbers Adds optimizations/spark/demo.gif (Git LFS): a two-terminal race showing naive expert offload (66 tok/s) vs `dflash_server --spark` (100 tok/s) on the same RTX 3090 at the same 60% GPU residency, generating the same output, with a shared wall-clock so the 1.5x lands as "same seconds, more tokens". Inserted it under the README intro. Reconciles the README to the shipped fused-decode story (66/81/100/119), consistent with spark/bench.py and the blog: the single-graph hybrid decode is default-on and bit-identical to all-GPU at full residency, so point 3 now documents it as shipped (not open research) and the headline/table use the reproducible decode-bench numbers; the held-out real-session figures stay in the Results section. The GIF builder (demo.py) is a one-off asset script and is intentionally not committed. Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai Bot reviewed Jun 5, 2026


easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026

docs: refresh auto-integration manifest for PR Luce-Org#345

6f81085

cubic-dev-ai Bot reviewed Jun 5, 2026


Davide Cifarelli and others added 3 commits June 5, 2026 18:45

cubic-dev-ai Bot reviewed Jun 6, 2026


davide221 and others added 2 commits June 6, 2026 13:33

cubic-dev-ai Bot reviewed Jun 6, 2026


easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 6, 2026

Merge PR Luce-Org#345: feat(optimizations): Luce Spark calibrated hot…

eb9bb4e

…/cold expert residency # Conflicts: # server/src/server/server_main.cpp

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 6, 2026

docs: refresh auto-integration manifest after PR Luce-Org#345 merge

96a287c

cubic-dev-ai Bot reviewed Jun 6, 2026


cubic-dev-ai Bot reviewed Jun 7, 2026


davide221 and others added 5 commits June 7, 2026 23:27

Merge remote-tracking branch 'origin/main' into feat/optimizations-spark

55d486c

# Conflicts: # server/src/laguna/laguna_target_graph.cpp

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 8, 2026

Merge pull request Luce-Org#345 from davide221/feat/optimizations-spark

c8baf8e

Integrates Luce Spark calibrated hot/cold expert residency, bounded cache support, and benchmark tooling.

davide221 and others added 3 commits June 8, 2026 02:57

assets(spark): add Luce Spark card + hero image

bcbea6d

Card for the main README optimizations grid (assets/cards/spark_card.png) and the optimizations/spark hero, replacing the placeholder. 1672x941, Git LFS. Co-Authored-By: WOZCODE <contact@withwoz.com>

docs(server): document --spark + cache/single-graph knobs in the hybr…

fc79c0d

…id-MoE section Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 merged commit f55780b into main Jun 8, 2026
2 checks passed

davide221 deleted the feat/optimizations-spark branch June 8, 2026 22:21

- uv sync # tokenizers (+ gguf/torch optional extras)
+ uv sync --all-extras # tokenizers (+ gguf/torch optional extras)

- if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
+ if (!([]{ const char * v = std::getenv("DFLASH_IGNORE_EOS"); return v && std::atoi(v) != 0; })() && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
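The suggested change above switches the flag from "set at all" to "set to a non-zero value". The same check can be factored into a small helper instead of an inline lambda; `env_flag_enabled` is a hypothetical name, and treating non-numeric values as off follows from `std::atoi` returning 0 for them.

```cpp
#include <cstdlib>

// True only when the variable is set to a non-zero integer, so NAME=0 and an
// unset NAME both leave the feature off. Non-numeric values parse as 0 (off).
inline bool env_flag_enabled(const char * name) {
    const char * v = std::getenv(name);
    return v != nullptr && std::atoi(v) != 0;
}
```

A call site would then read `if (!env_flag_enabled("DFLASH_IGNORE_EOS") && ...) break;`, and the same helper could cover other by-presence flags if they are meant to accept `0`.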

- core_bytes + kv_total + safety_bytes, target);
+ core_bytes + kv_total + warm_cache_bytes + safety_bytes + draft_reserve_bytes, target);
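The hunk above only shows the sum being extended, not the call it is passed to. As a sketch of why the extra terms matter, assume the sum is the fixed overhead that must fit under the `--spark-vram` target, with whatever remains going to resident experts. `VramReserves` and `expert_budget_bytes` are hypothetical names; the five field names are taken from the diff.

```cpp
#include <cstdint>

// Fixed VRAM consumers that are not expert weights.
struct VramReserves {
    uint64_t core_bytes;
    uint64_t kv_total;
    uint64_t warm_cache_bytes;
    uint64_t safety_bytes;
    uint64_t draft_reserve_bytes;
};

// Bytes left for resident experts under the target, clamped at zero so an
// over-committed configuration cannot underflow the unsigned subtraction.
inline uint64_t expert_budget_bytes(uint64_t target, const VramReserves & r) {
    const uint64_t reserved = r.core_bytes + r.kv_total + r.warm_cache_bytes +
                              r.safety_bytes + r.draft_reserve_bytes;
    return target > reserved ? target - reserved : 0;
}
```

Leaving `warm_cache_bytes` and `draft_reserve_bytes` out of `reserved` makes the budget look larger than it is, which is the over-allocation the P1 finding describes.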

- const std::string ty = o.value("type", "");
+ const std::string ty = (o.contains("type") && o["type"].is_string()) ? o["type"].get<std::string>() : "";

- routing_stats_->save_csv(profile_path, &err); // persist the observed routing
+ if (!routing_stats_->save_csv(profile_path, &err))
+     std::fprintf(stderr, "[spark] profile save failed: %s\n", err.c_str());

- return routing_stats_ && !layer_expert_bytes_.empty() && spark_expert_budget_ > 0;
+ return target_weights().moe_hybrid && routing_stats_ && !layer_expert_bytes_.empty() && spark_expert_budget_ > 0;
