fix(evaluate): don't mark primary sessions as evaluated on failure#7
Conversation
…ilure
When the scorer returns a fallback card (e.g. user picked a model that
doesn't exist, model provider is down, LLM call timed out), the
runEvaluation() fallback path was unconditionally adding
pending.sessionID to ctx.sessionsEvaluated. The runEvaluation() guard
at line 96 then short-circuited any future runEvaluation() call for
the same session id, silently disabling re-attempts for the rest of
the plugin lifetime.
This is wrong: a failed scoring is transient. The user might fix the
model config in kasper.jsonc and expect the next polling cycle (or
the next /kasper score session invocation) to pick up the now-fixable
session. With the previous code, the user would have to restart
opencode to retry. The success path of runEvaluation() is the only
path that may mark a session as evaluated.
The fix removes ctx.sessionsEvaluated.add(pending.sessionID) from the
fallback branch. The persistent state.evaluated_sessions was never
updated in this path, so the in-memory set is the only thing that
changed; on plugin restart the set is rebuilt from the persistent
array (state.ts:981-983) which is unchanged, so the failed session is
correctly retryable after a restart either way.
Notes for the reviewer:
- This does NOT change the kasper-side ephemeral scoring sessions
(kasper-scoring-*, kasper-merge-*, kasper-diag-*). Their cleanup
is in scorer.ts:268-285 (merge) and scorer.ts:1185-1204
(tryEvaluate), in finally blocks guarded by
'if (scoringSessionId)'. When session.create() throws (the model
doesn't exist case), scoringSessionId is never assigned and the
delete is correctly skipped — there's no session to delete. The
primary session being scored is a separate concern, addressed by
this fix.
- Startup time is unaffected. The Set.add() was a single
synchronous operation on a path that fires at most once per
failed scoring attempt, never at plugin init. Measured cold
startup on the current branch: ~2 ms (5 runs, median 1.8 ms).
- The new regression test in tests/evaluate.test.ts asserts the
fix at the runEvaluation() level. It uses the existing mockCtx()
helper and a scorer that returns a fallback card, then asserts
that ctx.sessionsEvaluated does NOT contain the failed session
and that a subsequent runEvaluation() call actually re-invokes
the scorer. Confirmed via mutation check: the test fails on the
unfixed code with 'Expected: false, Received: true' on
sessionsEvaluated.has().
- Trade-off: removing the Set.add() means a session whose model is
broken will be re-polled on every evaluation_poll_interval_ms
(default 60s) for as long as the model is broken. Each
re-attempt re-fires the LLM call's retry budget (maxRetries+1
attempts). The toast warning at evaluate.ts:295-302 still tells
the user, and each attempt creates a 'scoring_session_created'
or 'scoring_attempt_error' log line which is greppable.
PR Summary by QodoFix evaluation retries by not marking fallback-scored sessions as evaluated Description
Diagram
High-Level Assessment
Files changed (2)
|
Code Review by Qodo
1. Zero-score retries forever
|
| if (card.fallback || card.overall_score <= 0) { | ||
| ctx.sessionsEvaluated.add(pending.sessionID) | ||
| // Failure path: do NOT add pending.sessionID to | ||
| // ctx.sessionsEvaluated or to the persistent | ||
| // ctx.stateStore.addEvaluatedSession list. Adding it would mark | ||
| // the primary session as "already evaluated" forever, blocking | ||
| // any future re-attempt (e.g. after the user fixes the model | ||
| // config in kasper.jsonc). Scorer errors are transient: the | ||
| // model may be unavailable right now but become available | ||
| // later, the user may edit kasper.jsonc, or the user may | ||
| // restart opencode. The session will be retried on the next | ||
| // poll cycle (or after restart) when SESSION_DEBOUNCE_MS elapses | ||
| // — and a fresh scoring attempt produces a fresh card. The | ||
| // trade-off is some 6s-cycle log noise while the model is | ||
| // unavailable, but the toast warning already tells the user. | ||
| // The successful-recording path (below) is the only path that | ||
| // is allowed to mark a session as evaluated. | ||
| await ctx.logger.log("evaluation_skipped", { | ||
| sessionID: pending.sessionID, | ||
| score: card.overall_score, |
There was a problem hiding this comment.
1. Zero-score retries forever 🐞 Bug ≡ Correctness
runEvaluation() currently treats overall_score <= 0 the same as a transient failure (fallback) and returns without recording; after this PR it also no longer marks such sessions as evaluated, so a legitimate 0.0 score or a persistent fallback will be re-scored every poll cycle indefinitely and never persisted. This can additionally spam users with repeated fallback toasts (when not quiet) and repeatedly invoke expensive scoring until the model/config recovers.
Agent Prompt
## Issue description
`runEvaluation()` conflates valid low scores (including `overall_score: 0.0`) with transient scorer failures (fallback) by treating `overall_score <= 0` as a failure path; after this PR, that path also stops adding the session to `sessionsEvaluated`, so affected sessions are never persisted/marked evaluated and are retried indefinitely by the polling loop. Additionally, because fallback evaluations are now retryable and `runEvaluation()` emits a toast on fallback (when not quiet), persistent fallback conditions can spam users with repeated warnings and repeatedly hit the scoring backend every poll tick.
## Issue Context
- The scorer clamps `overall_score` into `[0, 1]`, and the scoring prompt documents scoring ranges as `0.0 - 1.0`, so `0` is a valid representable outcome.
- `pollAndEvaluate()` repeatedly selects sessions that are “old enough” (past debounce) and not in `sessionsEvaluated`, then calls `runEvaluation()`.
- `runEvaluation()` shows a toast for fallback cards when `quiet` is false.
- There is no per-session backoff/cooldown for repeated fallback outcomes, so once a session starts falling back it can be retried and toasted every poll cycle.
## Fix Focus Areas
- src/evaluate.ts[269-305]
- src/index.ts[837-959]
- src/scorer.ts[33-41]
- src/scorer.ts[1117-1178]
- src/utils.ts[142-154]
## Suggested fix
1. Separate “transient scorer failure” (`card.fallback === true`) from “valid but low score”, ensuring `overall_score: 0` (with `fallback: false`) is recorded and the session is added to `sessionsEvaluated`.
2. Change the failure condition from `card.fallback || card.overall_score <= 0` to something that only captures true failures, e.g. `if (card.fallback)` or `if (card.fallback || !Number.isFinite(card.overall_score))`.
3. Add/adjust a unit test asserting that `overall_score: 0` (with `fallback: false`) still records the session and adds it to `sessionsEvaluated`.
4. Implement a per-session cooldown/backoff for fallback retries and/or fallback warning toasts:
- Track `lastFallbackAt` and/or `lastFallbackToastAt` (e.g., `Map<string, number>` on `ctx` or via a helper/state store).
- In the fallback branch, only call `showToast()` if the last toast was more than N seconds/minutes ago.
- Optionally, in `pollAndEvaluate()`, skip re-evaluating sessions that had a fallback within a recent window (exponential backoff preferred), while still allowing quick retries when model/config changes are detected.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
When the scorer returns a fallback card (e.g. user picked a model that doesn't exist, model provider is down, LLM call timed out), the runEvaluation() fallback path was unconditionally adding pending.sessionID to ctx.sessionsEvaluated. The runEvaluation() guard at line 96 then short-circuited any future runEvaluation() call for the same session id, silently disabling re-attempts for the rest of the plugin lifetime.
This is wrong: a failed scoring is transient. The user might fix the model config in kasper.jsonc and expect the next polling cycle (or the next /kasper score session invocation) to pick up the now-fixable session. With the previous code, the user would have to restart opencode to retry. The success path of runEvaluation() is the only path that may mark a session as evaluated.
The fix removes ctx.sessionsEvaluated.add(pending.sessionID) from the fallback branch. The persistent state.evaluated_sessions was never updated in this path, so the in-memory set is the only thing that changed; on plugin restart the set is rebuilt from the persistent array (state.ts:981-983) which is unchanged, so the failed session is correctly retryable after a restart either way.