Skip to content

fix(evaluate): don't mark primary sessions as evaluated on failure#7

Merged
andrejtonev merged 1 commit into
mainfrom
investigate/session-cleanup-failures
Jun 18, 2026
Merged

fix(evaluate): don't mark primary sessions as evaluated on failure#7
andrejtonev merged 1 commit into
mainfrom
investigate/session-cleanup-failures

Conversation

@andrejtonev

Copy link
Copy Markdown
Owner

When the scorer returns a fallback card (e.g. user picked a model that doesn't exist, model provider is down, LLM call timed out), the runEvaluation() fallback path was unconditionally adding pending.sessionID to ctx.sessionsEvaluated. The runEvaluation() guard at line 96 then short-circuited any future runEvaluation() call for the same session id, silently disabling re-attempts for the rest of the plugin lifetime.

This is wrong: a failed scoring is transient. The user might fix the model config in kasper.jsonc and expect the next polling cycle (or the next /kasper score session invocation) to pick up the now-fixable session. With the previous code, the user would have to restart opencode to retry. The success path of runEvaluation() is the only path that may mark a session as evaluated.

The fix removes ctx.sessionsEvaluated.add(pending.sessionID) from the fallback branch. The persistent state.evaluated_sessions was never updated in this path, so the in-memory set is the only thing that changed; on plugin restart the set is rebuilt from the persistent array (state.ts:981-983) which is unchanged, so the failed session is correctly retryable after a restart either way.

…ilure

When the scorer returns a fallback card (e.g. user picked a model that
doesn't exist, model provider is down, LLM call timed out), the
runEvaluation() fallback path was unconditionally adding
pending.sessionID to ctx.sessionsEvaluated. The runEvaluation() guard
at line 96 then short-circuited any future runEvaluation() call for
the same session id, silently disabling re-attempts for the rest of
the plugin lifetime.

This is wrong: a failed scoring is transient. The user might fix the
model config in kasper.jsonc and expect the next polling cycle (or
the next /kasper score session invocation) to pick up the now-fixable
session. With the previous code, the user would have to restart
opencode to retry. The success path of runEvaluation() is the only
path that may mark a session as evaluated.

The fix removes ctx.sessionsEvaluated.add(pending.sessionID) from the
fallback branch. The persistent state.evaluated_sessions was never
updated in this path, so the in-memory set is the only thing that
changed; on plugin restart the set is rebuilt from the persistent
array (state.ts:981-983) which is unchanged, so the failed session is
correctly retryable after a restart either way.

Notes for the reviewer:

  - This does NOT change the kasper-side ephemeral scoring sessions
    (kasper-scoring-*, kasper-merge-*, kasper-diag-*). Their cleanup
    is in scorer.ts:268-285 (merge) and scorer.ts:1185-1204
    (tryEvaluate), in finally blocks guarded by
    'if (scoringSessionId)'. When session.create() throws (the model
    doesn't exist case), scoringSessionId is never assigned and the
    delete is correctly skipped — there's no session to delete. The
    primary session being scored is a separate concern, addressed by
    this fix.

  - Startup time is unaffected. The Set.add() was a single
    synchronous operation on a path that fires at most once per
    failed scoring attempt, never at plugin init. Measured cold
    startup on the current branch: ~2 ms (5 runs, median 1.8 ms).

  - The new regression test in tests/evaluate.test.ts asserts the
    fix at the runEvaluation() level. It uses the existing mockCtx()
    helper and a scorer that returns a fallback card, then asserts
    that ctx.sessionsEvaluated does NOT contain the failed session
    and that a subsequent runEvaluation() call actually re-invokes
    the scorer. Confirmed via mutation check: the test fails on the
    unfixed code with 'Expected: false, Received: true' on
    sessionsEvaluated.has().

  - Trade-off: removing the Set.add() means a session whose model is
    broken will be re-polled on every evaluation_poll_interval_ms
    (default 60s) for as long as the model is broken. Each
    re-attempt re-fires the LLM call's retry budget (maxRetries+1
    attempts). The toast warning at evaluate.ts:295-302 still tells
    the user, and each attempt creates a 'scoring_session_created'
    or 'scoring_attempt_error' log line which is greppable.
@andrejtonev andrejtonev self-assigned this Jun 18, 2026
@andrejtonev andrejtonev added the bug Something isn't working label Jun 18, 2026
@andrejtonev andrejtonev merged commit b47bea7 into main Jun 18, 2026
1 of 2 checks passed
@qodo-code-review

Copy link
Copy Markdown

PR Summary by Qodo

Fix evaluation retries by not marking fallback-scored sessions as evaluated
🐞 Bug fix 🧪 Tests 🕐 10-20 Minutes

Grey Divider

Description

• Allow sessions to be re-scored after transient scorer/model failures.
• Only mark sessions as evaluated on successful scoring + persistence.
• Add regression test ensuring fallback scoring does not short-circuit retries.
Diagram

graph TD
  A["runEvaluation()"] --> B["Check sessionsEvaluated"] --> C["scorer.evaluate()"] --> D{"Fallback or score<=0?"}
  D -->|"Yes"| E["Skip (no mark)"] --> G["Return false"]
  D -->|"No"| F["Record + mark evaluated"] --> H["Return true"]
  F --> B
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Track failures separately with backoff
  • ➕ Avoids repeated re-attempts/log noise while the provider/model is down
  • ➕ Can show clearer user-facing status (e.g., 'retrying in 60s')
  • ➖ More state and complexity (timestamps, per-session retry budgets)
  • ➖ Risk of delaying legitimate fast recovery after user fixes config
2. Mark evaluated with a short TTL instead of permanent set membership
  • ➕ Prevents tight retry loops without requiring restart
  • ➕ Keeps evaluated-set semantics closer to 'recently attempted'
  • ➖ Changes meaning of sessionsEvaluated (attempted vs successfully evaluated)
  • ➖ Adds time-based behavior that can be harder to reason about/test

Recommendation: The PR’s minimal change is the correct immediate fix: only successful evaluations should mark a primary session as evaluated, restoring expected retry behavior after transient scorer failures. If retry noise becomes a real issue, consider a separate failure/backoff tracker rather than weakening the meaning of “evaluated.”

Files changed (2) +58 / -1

Bug fix (1) +15 / -1
evaluate.tsStop marking sessions evaluated on fallback/failed scoring +15/-1

Stop marking sessions evaluated on fallback/failed scoring

• Removes the sessionsEvaluated update from the fallback/score<=0 branch so transient scorer failures remain retryable. Adds detailed rationale comments clarifying that only the success-recording path may mark a session as evaluated/persisted.

src/evaluate.ts

Tests (1) +43 / -0
evaluate.test.tsAdd regression test for retrying fallback-scored sessions +43/-0

Add regression test for retrying fallback-scored sessions

• Adds a test asserting that fallback score cards do not add the session ID to ctx.sessionsEvaluated. Verifies a second runEvaluation call re-attempts scoring instead of being short-circuited by the evaluated-set guard.

tests/evaluate.test.ts

@qodo-code-review

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📜 Skill insights (0)

Grey Divider


Action required

1. Zero-score retries forever 🐞 Bug ≡ Correctness
Description
runEvaluation() currently treats overall_score <= 0 the same as a transient failure (fallback)
and returns without recording; after this PR it also no longer marks such sessions as evaluated, so
a legitimate 0.0 score or a persistent fallback will be re-scored every poll cycle indefinitely
and never persisted. This can additionally spam users with repeated fallback toasts (when not quiet)
and repeatedly invoke expensive scoring until the model/config recovers.
Code

src/evaluate.ts[R272-290]

      if (card.fallback || card.overall_score <= 0) {
-        ctx.sessionsEvaluated.add(pending.sessionID)
+        // Failure path: do NOT add pending.sessionID to
+        // ctx.sessionsEvaluated or to the persistent
+        // ctx.stateStore.addEvaluatedSession list. Adding it would mark
+        // the primary session as "already evaluated" forever, blocking
+        // any future re-attempt (e.g. after the user fixes the model
+        // config in kasper.jsonc). Scorer errors are transient: the
+        // model may be unavailable right now but become available
+        // later, the user may edit kasper.jsonc, or the user may
+        // restart opencode. The session will be retried on the next
+        // poll cycle (or after restart) when SESSION_DEBOUNCE_MS elapses
+        // — and a fresh scoring attempt produces a fresh card. The
+        // trade-off is some 6s-cycle log noise while the model is
+        // unavailable, but the toast warning already tells the user.
+        // The successful-recording path (below) is the only path that
+        // is allowed to mark a session as evaluated.
        await ctx.logger.log("evaluation_skipped", {
          sessionID: pending.sessionID,
          score: card.overall_score,
Evidence
The scoring rubric/prompt and clamping logic allow overall_score values in [0,1], making 0.0 a
valid outcome rather than an error. Despite that, runEvaluation() gates
recording/evaluated-marking on a condition that treats overall_score <= 0 as failure and (after
this PR) does not add the session to sessionsEvaluated; meanwhile pollAndEvaluate() keeps
selecting sessions not present in sessionsEvaluated once they pass the debounce/age filter,
causing the same session to be re-evaluated every polling tick. In the fallback case,
runEvaluation() also calls showToast() when quiet is false, so a persistent fallback condition
leads to repeated toasts and repeated scoring attempts each poll cycle.

src/evaluate.ts[90-103]
src/evaluate.ts[269-304]
src/scorer.ts[33-41]
src/scorer.ts[1117-1178]
src/index.ts[837-897]
src/index.ts[951-959]
src/index.ts[837-860]
src/index.ts[869-897]
src/utils.ts[142-154]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`runEvaluation()` conflates valid low scores (including `overall_score: 0.0`) with transient scorer failures (fallback) by treating `overall_score <= 0` as a failure path; after this PR, that path also stops adding the session to `sessionsEvaluated`, so affected sessions are never persisted/marked evaluated and are retried indefinitely by the polling loop. Additionally, because fallback evaluations are now retryable and `runEvaluation()` emits a toast on fallback (when not quiet), persistent fallback conditions can spam users with repeated warnings and repeatedly hit the scoring backend every poll tick.

## Issue Context
- The scorer clamps `overall_score` into `[0, 1]`, and the scoring prompt documents scoring ranges as `0.0 - 1.0`, so `0` is a valid representable outcome.
- `pollAndEvaluate()` repeatedly selects sessions that are “old enough” (past debounce) and not in `sessionsEvaluated`, then calls `runEvaluation()`.
- `runEvaluation()` shows a toast for fallback cards when `quiet` is false.
- There is no per-session backoff/cooldown for repeated fallback outcomes, so once a session starts falling back it can be retried and toasted every poll cycle.

## Fix Focus Areas
- src/evaluate.ts[269-305]
- src/index.ts[837-959]
- src/scorer.ts[33-41]
- src/scorer.ts[1117-1178]
- src/utils.ts[142-154]

## Suggested fix
1. Separate “transient scorer failure” (`card.fallback === true`) from “valid but low score”, ensuring `overall_score: 0` (with `fallback: false`) is recorded and the session is added to `sessionsEvaluated`.
2. Change the failure condition from `card.fallback || card.overall_score <= 0` to something that only captures true failures, e.g. `if (card.fallback)` or `if (card.fallback || !Number.isFinite(card.overall_score))`.
3. Add/adjust a unit test asserting that `overall_score: 0` (with `fallback: false`) still records the session and adds it to `sessionsEvaluated`.
4. Implement a per-session cooldown/backoff for fallback retries and/or fallback warning toasts:
  - Track `lastFallbackAt` and/or `lastFallbackToastAt` (e.g., `Map<string, number>` on `ctx` or via a helper/state store).
  - In the fallback branch, only call `showToast()` if the last toast was more than N seconds/minutes ago.
  - Optionally, in `pollAndEvaluate()`, skip re-evaluating sessions that had a fallback within a recent window (exponential backoff preferred), while still allowing quick retries when model/config changes are detected.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread src/evaluate.ts
Comment on lines 272 to 290
if (card.fallback || card.overall_score <= 0) {
ctx.sessionsEvaluated.add(pending.sessionID)
// Failure path: do NOT add pending.sessionID to
// ctx.sessionsEvaluated or to the persistent
// ctx.stateStore.addEvaluatedSession list. Adding it would mark
// the primary session as "already evaluated" forever, blocking
// any future re-attempt (e.g. after the user fixes the model
// config in kasper.jsonc). Scorer errors are transient: the
// model may be unavailable right now but become available
// later, the user may edit kasper.jsonc, or the user may
// restart opencode. The session will be retried on the next
// poll cycle (or after restart) when SESSION_DEBOUNCE_MS elapses
// — and a fresh scoring attempt produces a fresh card. The
// trade-off is some 6s-cycle log noise while the model is
// unavailable, but the toast warning already tells the user.
// The successful-recording path (below) is the only path that
// is allowed to mark a session as evaluated.
await ctx.logger.log("evaluation_skipped", {
sessionID: pending.sessionID,
score: card.overall_score,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Zero-score retries forever 🐞 Bug ≡ Correctness

runEvaluation() currently treats overall_score <= 0 the same as a transient failure (fallback)
and returns without recording; after this PR it also no longer marks such sessions as evaluated, so
a legitimate 0.0 score or a persistent fallback will be re-scored every poll cycle indefinitely
and never persisted. This can additionally spam users with repeated fallback toasts (when not quiet)
and repeatedly invoke expensive scoring until the model/config recovers.
Agent Prompt
## Issue description
`runEvaluation()` conflates valid low scores (including `overall_score: 0.0`) with transient scorer failures (fallback) by treating `overall_score <= 0` as a failure path; after this PR, that path also stops adding the session to `sessionsEvaluated`, so affected sessions are never persisted/marked evaluated and are retried indefinitely by the polling loop. Additionally, because fallback evaluations are now retryable and `runEvaluation()` emits a toast on fallback (when not quiet), persistent fallback conditions can spam users with repeated warnings and repeatedly hit the scoring backend every poll tick.

## Issue Context
- The scorer clamps `overall_score` into `[0, 1]`, and the scoring prompt documents scoring ranges as `0.0 - 1.0`, so `0` is a valid representable outcome.
- `pollAndEvaluate()` repeatedly selects sessions that are “old enough” (past debounce) and not in `sessionsEvaluated`, then calls `runEvaluation()`.
- `runEvaluation()` shows a toast for fallback cards when `quiet` is false.
- There is no per-session backoff/cooldown for repeated fallback outcomes, so once a session starts falling back it can be retried and toasted every poll cycle.

## Fix Focus Areas
- src/evaluate.ts[269-305]
- src/index.ts[837-959]
- src/scorer.ts[33-41]
- src/scorer.ts[1117-1178]
- src/utils.ts[142-154]

## Suggested fix
1. Separate “transient scorer failure” (`card.fallback === true`) from “valid but low score”, ensuring `overall_score: 0` (with `fallback: false`) is recorded and the session is added to `sessionsEvaluated`.
2. Change the failure condition from `card.fallback || card.overall_score <= 0` to something that only captures true failures, e.g. `if (card.fallback)` or `if (card.fallback || !Number.isFinite(card.overall_score))`.
3. Add/adjust a unit test asserting that `overall_score: 0` (with `fallback: false`) still records the session and adds it to `sessionsEvaluated`.
4. Implement a per-session cooldown/backoff for fallback retries and/or fallback warning toasts:
   - Track `lastFallbackAt` and/or `lastFallbackToastAt` (e.g., `Map<string, number>` on `ctx` or via a helper/state store).
   - In the fallback branch, only call `showToast()` if the last toast was more than N seconds/minutes ago.
   - Optionally, in `pollAndEvaluate()`, skip re-evaluating sessions that had a fallback within a recent window (exponential backoff preferred), while still allowing quick retries when model/config changes are detected.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant