fix(evaluate): don't mark primary sessions as evaluated on failure by andrejtonev · Pull Request #7 · andrejtonev/opencode-kasper

andrejtonev · 2026-06-18T18:23:51Z

When the scorer returns a fallback card (e.g. user picked a model that doesn't exist, model provider is down, LLM call timed out), the runEvaluation() fallback path was unconditionally adding pending.sessionID to ctx.sessionsEvaluated. The runEvaluation() guard at line 96 then short-circuited any future runEvaluation() call for the same session id, silently disabling re-attempts for the rest of the plugin lifetime.

This is wrong: a failed scoring is transient. The user might fix the model config in kasper.jsonc and expect the next polling cycle (or the next /kasper score session invocation) to pick up the now-fixable session. With the previous code, the user would have to restart opencode to retry. The success path of runEvaluation() is the only path that may mark a session as evaluated.

The fix removes ctx.sessionsEvaluated.add(pending.sessionID) from the fallback branch. The persistent state.evaluated_sessions was never updated in this path, so the in-memory set is the only thing that changed; on plugin restart the set is rebuilt from the persistent array (state.ts:981-983) which is unchanged, so the failed session is correctly retryable after a restart either way.

…ilure When the scorer returns a fallback card (e.g. user picked a model that doesn't exist, model provider is down, LLM call timed out), the runEvaluation() fallback path was unconditionally adding pending.sessionID to ctx.sessionsEvaluated. The runEvaluation() guard at line 96 then short-circuited any future runEvaluation() call for the same session id, silently disabling re-attempts for the rest of the plugin lifetime. This is wrong: a failed scoring is transient. The user might fix the model config in kasper.jsonc and expect the next polling cycle (or the next /kasper score session invocation) to pick up the now-fixable session. With the previous code, the user would have to restart opencode to retry. The success path of runEvaluation() is the only path that may mark a session as evaluated. The fix removes ctx.sessionsEvaluated.add(pending.sessionID) from the fallback branch. The persistent state.evaluated_sessions was never updated in this path, so the in-memory set is the only thing that changed; on plugin restart the set is rebuilt from the persistent array (state.ts:981-983) which is unchanged, so the failed session is correctly retryable after a restart either way. Notes for the reviewer: - This does NOT change the kasper-side ephemeral scoring sessions (kasper-scoring-*, kasper-merge-*, kasper-diag-*). Their cleanup is in scorer.ts:268-285 (merge) and scorer.ts:1185-1204 (tryEvaluate), in finally blocks guarded by 'if (scoringSessionId)'. When session.create() throws (the model doesn't exist case), scoringSessionId is never assigned and the delete is correctly skipped — there's no session to delete. The primary session being scored is a separate concern, addressed by this fix. - Startup time is unaffected. The Set.add() was a single synchronous operation on a path that fires at most once per failed scoring attempt, never at plugin init. Measured cold startup on the current branch: ~2 ms (5 runs, median 1.8 ms). - The new regression test in tests/evaluate.test.ts asserts the fix at the runEvaluation() level. It uses the existing mockCtx() helper and a scorer that returns a fallback card, then asserts that ctx.sessionsEvaluated does NOT contain the failed session and that a subsequent runEvaluation() call actually re-invokes the scorer. Confirmed via mutation check: the test fails on the unfixed code with 'Expected: false, Received: true' on sessionsEvaluated.has(). - Trade-off: removing the Set.add() means a session whose model is broken will be re-polled on every evaluation_poll_interval_ms (default 60s) for as long as the model is broken. Each re-attempt re-fires the LLM call's retry budget (maxRetries+1 attempts). The toast warning at evaluate.ts:295-302 still tells the user, and each attempt creates a 'scoring_session_created' or 'scoring_attempt_error' log line which is greppable.

qodo-code-review · 2026-06-18T18:25:02Z

PR Summary by Qodo

Fix evaluation retries by not marking fallback-scored sessions as evaluated
🐞 Bug fix 🧪 Tests 🕐 10-20 Minutes

Description

• Allow sessions to be re-scored after transient scorer/model failures.
• Only mark sessions as evaluated on successful scoring + persistence.
• Add regression test ensuring fallback scoring does not short-circuit retries.

Diagram

graph TD
  A["runEvaluation()"] --> B["Check sessionsEvaluated"] --> C["scorer.evaluate()"] --> D{"Fallback or score<=0?"}
  D -->|"Yes"| E["Skip (no mark)"] --> G["Return false"]
  D -->|"No"| F["Record + mark evaluated"] --> H["Return true"]
  F --> B

High-Level Assessment

The following are alternative approaches to this PR:

1. Track failures separately with backoff

➕ Avoids repeated re-attempts/log noise while the provider/model is down
➕ Can show clearer user-facing status (e.g., 'retrying in 60s')
➖ More state and complexity (timestamps, per-session retry budgets)
➖ Risk of delaying legitimate fast recovery after user fixes config

2. Mark evaluated with a short TTL instead of permanent set membership

➕ Prevents tight retry loops without requiring restart
➕ Keeps evaluated-set semantics closer to 'recently attempted'
➖ Changes meaning of sessionsEvaluated (attempted vs successfully evaluated)
➖ Adds time-based behavior that can be harder to reason about/test

Recommendation: The PR’s minimal change is the correct immediate fix: only successful evaluations should mark a primary session as evaluated, restoring expected retry behavior after transient scorer failures. If retry noise becomes a real issue, consider a separate failure/backoff tracker rather than weakening the meaning of “evaluated.”

Files changed (2) +58 / -1

Bug fix (1) +15 / -1

evaluate.tsStop marking sessions evaluated on fallback/failed scoring +15/-1

Stop marking sessions evaluated on fallback/failed scoring

• Removes the sessionsEvaluated update from the fallback/score<=0 branch so transient scorer failures remain retryable. Adds detailed rationale comments clarifying that only the success-recording path may mark a session as evaluated/persisted.

src/evaluate.ts

Tests (1) +43 / -0

evaluate.test.tsAdd regression test for retrying fallback-scored sessions +43/-0

Add regression test for retrying fallback-scored sessions

• Adds a test asserting that fallback score cards do not add the session ID to ctx.sessionsEvaluated. Verifies a second runEvaluation call re-attempts scoring instead of being short-circuited by the evaluated-set guard.

tests/evaluate.test.ts

qodo-code-review · 2026-06-18T18:30:47Z

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📜 Skill insights (0)

1. Zero-score retries forever 🐞 Bug ≡ Correctness

Description

runEvaluation() currently treats overall_score <= 0 the same as a transient failure (fallback)
and returns without recording; after this PR it also no longer marks such sessions as evaluated, so
a legitimate 0.0 score or a persistent fallback will be re-scored every poll cycle indefinitely
and never persisted. This can additionally spam users with repeated fallback toasts (when not quiet)
and repeatedly invoke expensive scoring until the model/config recovers.

Code

src/evaluate.ts[R272-290]

      if (card.fallback || card.overall_score <= 0) {
-        ctx.sessionsEvaluated.add(pending.sessionID)
+        // Failure path: do NOT add pending.sessionID to
+        // ctx.sessionsEvaluated or to the persistent
+        // ctx.stateStore.addEvaluatedSession list. Adding it would mark
+        // the primary session as "already evaluated" forever, blocking
+        // any future re-attempt (e.g. after the user fixes the model
+        // config in kasper.jsonc). Scorer errors are transient: the
+        // model may be unavailable right now but become available
+        // later, the user may edit kasper.jsonc, or the user may
+        // restart opencode. The session will be retried on the next
+        // poll cycle (or after restart) when SESSION_DEBOUNCE_MS elapses
+        // — and a fresh scoring attempt produces a fresh card. The
+        // trade-off is some 6s-cycle log noise while the model is
+        // unavailable, but the toast warning already tells the user.
+        // The successful-recording path (below) is the only path that
+        // is allowed to mark a session as evaluated.
        await ctx.logger.log("evaluation_skipped", {
          sessionID: pending.sessionID,
          score: card.overall_score,

Evidence

The scoring rubric/prompt and clamping logic allow overall_score values in [0,1], making 0.0 a
valid outcome rather than an error. Despite that, runEvaluation() gates
recording/evaluated-marking on a condition that treats overall_score <= 0 as failure and (after
this PR) does not add the session to sessionsEvaluated; meanwhile pollAndEvaluate() keeps
selecting sessions not present in sessionsEvaluated once they pass the debounce/age filter,
causing the same session to be re-evaluated every polling tick. In the fallback case,
runEvaluation() also calls showToast() when quiet is false, so a persistent fallback condition
leads to repeated toasts and repeated scoring attempts each poll cycle.

src/evaluate.ts[90-103]
src/evaluate.ts[269-304]
src/scorer.ts[33-41]
src/scorer.ts[1117-1178]
src/index.ts[837-897]
src/index.ts[951-959]
src/index.ts[837-860]
src/index.ts[869-897]
src/utils.ts[142-154]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`runEvaluation()` conflates valid low scores (including `overall_score: 0.0`) with transient scorer failures (fallback) by treating `overall_score <= 0` as a failure path; after this PR, that path also stops adding the session to `sessionsEvaluated`, so affected sessions are never persisted/marked evaluated and are retried indefinitely by the polling loop. Additionally, because fallback evaluations are now retryable and `runEvaluation()` emits a toast on fallback (when not quiet), persistent fallback conditions can spam users with repeated warnings and repeatedly hit the scoring backend every poll tick.

## Issue Context
- The scorer clamps `overall_score` into `[0, 1]`, and the scoring prompt documents scoring ranges as `0.0 - 1.0`, so `0` is a valid representable outcome.
- `pollAndEvaluate()` repeatedly selects sessions that are “old enough” (past debounce) and not in `sessionsEvaluated`, then calls `runEvaluation()`.
- `runEvaluation()` shows a toast for fallback cards when `quiet` is false.
- There is no per-session backoff/cooldown for repeated fallback outcomes, so once a session starts falling back it can be retried and toasted every poll cycle.

## Fix Focus Areas
- src/evaluate.ts[269-305]
- src/index.ts[837-959]
- src/scorer.ts[33-41]
- src/scorer.ts[1117-1178]
- src/utils.ts[142-154]

## Suggested fix
1. Separate “transient scorer failure” (`card.fallback === true`) from “valid but low score”, ensuring `overall_score: 0` (with `fallback: false`) is recorded and the session is added to `sessionsEvaluated`.
2. Change the failure condition from `card.fallback || card.overall_score <= 0` to something that only captures true failures, e.g. `if (card.fallback)` or `if (card.fallback || !Number.isFinite(card.overall_score))`.
3. Add/adjust a unit test asserting that `overall_score: 0` (with `fallback: false`) still records the session and adds it to `sessionsEvaluated`.
4. Implement a per-session cooldown/backoff for fallback retries and/or fallback warning toasts:
  - Track `lastFallbackAt` and/or `lastFallbackToastAt` (e.g., `Map<string, number>` on `ctx` or via a helper/state store).
  - In the fallback branch, only call `showToast()` if the last toast was more than N seconds/minutes ago.
  - Optionally, in `pollAndEvaluate()`, skip re-evaluating sessions that had a fallback within a recent window (exponential backoff preferred), while still allowing quick retries when model/config changes are detected.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-06-18T18:30:47Z

      if (card.fallback || card.overall_score <= 0) {
-        ctx.sessionsEvaluated.add(pending.sessionID)
+        // Failure path: do NOT add pending.sessionID to
+        // ctx.sessionsEvaluated or to the persistent
+        // ctx.stateStore.addEvaluatedSession list. Adding it would mark
+        // the primary session as "already evaluated" forever, blocking
+        // any future re-attempt (e.g. after the user fixes the model
+        // config in kasper.jsonc). Scorer errors are transient: the
+        // model may be unavailable right now but become available
+        // later, the user may edit kasper.jsonc, or the user may
+        // restart opencode. The session will be retried on the next
+        // poll cycle (or after restart) when SESSION_DEBOUNCE_MS elapses
+        // — and a fresh scoring attempt produces a fresh card. The
+        // trade-off is some 6s-cycle log noise while the model is
+        // unavailable, but the toast warning already tells the user.
+        // The successful-recording path (below) is the only path that
+        // is allowed to mark a session as evaluated.
        await ctx.logger.log("evaluation_skipped", {
          sessionID: pending.sessionID,
          score: card.overall_score,


1. Zero-score retries forever 🐞 Bug ≡ Correctness

runEvaluation() currently treats overall_score <= 0 the same as a transient failure (fallback) and returns without recording; after this PR it also no longer marks such sessions as evaluated, so a legitimate 0.0 score or a persistent fallback will be re-scored every poll cycle indefinitely and never persisted. This can additionally spam users with repeated fallback toasts (when not quiet) and repeatedly invoke expensive scoring until the model/config recovers.

Agent Prompt

## Issue description `runEvaluation()` conflates valid low scores (including `overall_score: 0.0`) with transient scorer failures (fallback) by treating `overall_score <= 0` as a failure path; after this PR, that path also stops adding the session to `sessionsEvaluated`, so affected sessions are never persisted/marked evaluated and are retried indefinitely by the polling loop. Additionally, because fallback evaluations are now retryable and `runEvaluation()` emits a toast on fallback (when not quiet), persistent fallback conditions can spam users with repeated warnings and repeatedly hit the scoring backend every poll tick. ## Issue Context - The scorer clamps `overall_score` into `[0, 1]`, and the scoring prompt documents scoring ranges as `0.0 - 1.0`, so `0` is a valid representable outcome. - `pollAndEvaluate()` repeatedly selects sessions that are “old enough” (past debounce) and not in `sessionsEvaluated`, then calls `runEvaluation()`. - `runEvaluation()` shows a toast for fallback cards when `quiet` is false. - There is no per-session backoff/cooldown for repeated fallback outcomes, so once a session starts falling back it can be retried and toasted every poll cycle. ## Fix Focus Areas - src/evaluate.ts[269-305] - src/index.ts[837-959] - src/scorer.ts[33-41] - src/scorer.ts[1117-1178] - src/utils.ts[142-154] ## Suggested fix 1. Separate “transient scorer failure” (`card.fallback === true`) from “valid but low score”, ensuring `overall_score: 0` (with `fallback: false`) is recorded and the session is added to `sessionsEvaluated`. 2. Change the failure condition from `card.fallback || card.overall_score <= 0` to something that only captures true failures, e.g. `if (card.fallback)` or `if (card.fallback || !Number.isFinite(card.overall_score))`. 3. Add/adjust a unit test asserting that `overall_score: 0` (with `fallback: false`) still records the session and adds it to `sessionsEvaluated`. 4. Implement a per-session cooldown/backoff for fallback retries and/or fallback warning toasts: - Track `lastFallbackAt` and/or `lastFallbackToastAt` (e.g., `Map<string, number>` on `ctx` or via a helper/state store). - In the fallback branch, only call `showToast()` if the last toast was more than N seconds/minutes ago. - Optionally, in `pollAndEvaluate()`, skip re-evaluating sessions that had a fallback within a recent window (exponential backoff preferred), while still allowing quick retries when model/config changes are detected.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

andrejtonev self-assigned this Jun 18, 2026

andrejtonev added the bug Something isn't working label Jun 18, 2026

andrejtonev merged commit b47bea7 into main Jun 18, 2026
1 of 2 checks passed

qodo-code-review Bot reviewed Jun 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(evaluate): don't mark primary sessions as evaluated on failure#7

fix(evaluate): don't mark primary sessions as evaluated on failure#7
andrejtonev merged 1 commit into
mainfrom
investigate/session-cleanup-failures

andrejtonev commented Jun 18, 2026

Uh oh!

Uh oh!

qodo-code-review Bot commented Jun 18, 2026

Uh oh!

qodo-code-review Bot commented Jun 18, 2026

Uh oh!

qodo-code-review Bot Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andrejtonev commented Jun 18, 2026

Uh oh!

Uh oh!

qodo-code-review Bot commented Jun 18, 2026

PR Summary by Qodo

Uh oh!

qodo-code-review Bot commented Jun 18, 2026

Code Review by Qodo

Uh oh!

qodo-code-review Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant