Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 15 additions & 1 deletion src/evaluate.ts
Original file line number Diff line number Diff line change
Expand Up @@ -270,7 +270,21 @@ export async function runEvaluation(
const { emoji: _scoreEmoji } = formatScore(card.overall_score)

if (card.fallback || card.overall_score <= 0) {
ctx.sessionsEvaluated.add(pending.sessionID)
// Failure path: do NOT add pending.sessionID to
// ctx.sessionsEvaluated or to the persistent
// ctx.stateStore.addEvaluatedSession list. Adding it would mark
// the primary session as "already evaluated" forever, blocking
// any future re-attempt (e.g. after the user fixes the model
// config in kasper.jsonc). Scorer errors are transient: the
// model may be unavailable right now but become available
// later, the user may edit kasper.jsonc, or the user may
// restart opencode. The session will be retried on the next
// poll cycle (or after restart) when SESSION_DEBOUNCE_MS elapses
// — and a fresh scoring attempt produces a fresh card. The
// trade-off is some 6s-cycle log noise while the model is
// unavailable, but the toast warning already tells the user.
// The successful-recording path (below) is the only path that
// is allowed to mark a session as evaluated.
await ctx.logger.log("evaluation_skipped", {
sessionID: pending.sessionID,
score: card.overall_score,
Comment on lines 272 to 290

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Zero-score retries forever 🐞 Bug ≡ Correctness

runEvaluation() currently treats overall_score <= 0 the same as a transient failure (fallback)
and returns without recording; after this PR it also no longer marks such sessions as evaluated, so
a legitimate 0.0 score or a persistent fallback will be re-scored every poll cycle indefinitely
and never persisted. This can additionally spam users with repeated fallback toasts (when not quiet)
and repeatedly invoke expensive scoring until the model/config recovers.
Agent Prompt
## Issue description
`runEvaluation()` conflates valid low scores (including `overall_score: 0.0`) with transient scorer failures (fallback) by treating `overall_score <= 0` as a failure path; after this PR, that path also stops adding the session to `sessionsEvaluated`, so affected sessions are never persisted/marked evaluated and are retried indefinitely by the polling loop. Additionally, because fallback evaluations are now retryable and `runEvaluation()` emits a toast on fallback (when not quiet), persistent fallback conditions can spam users with repeated warnings and repeatedly hit the scoring backend every poll tick.

## Issue Context
- The scorer clamps `overall_score` into `[0, 1]`, and the scoring prompt documents scoring ranges as `0.0 - 1.0`, so `0` is a valid representable outcome.
- `pollAndEvaluate()` repeatedly selects sessions that are “old enough” (past debounce) and not in `sessionsEvaluated`, then calls `runEvaluation()`.
- `runEvaluation()` shows a toast for fallback cards when `quiet` is false.
- There is no per-session backoff/cooldown for repeated fallback outcomes, so once a session starts falling back it can be retried and toasted every poll cycle.

## Fix Focus Areas
- src/evaluate.ts[269-305]
- src/index.ts[837-959]
- src/scorer.ts[33-41]
- src/scorer.ts[1117-1178]
- src/utils.ts[142-154]

## Suggested fix
1. Separate “transient scorer failure” (`card.fallback === true`) from “valid but low score”, ensuring `overall_score: 0` (with `fallback: false`) is recorded and the session is added to `sessionsEvaluated`.
2. Change the failure condition from `card.fallback || card.overall_score <= 0` to something that only captures true failures, e.g. `if (card.fallback)` or `if (card.fallback || !Number.isFinite(card.overall_score))`.
3. Add/adjust a unit test asserting that `overall_score: 0` (with `fallback: false`) still records the session and adds it to `sessionsEvaluated`.
4. Implement a per-session cooldown/backoff for fallback retries and/or fallback warning toasts:
   - Track `lastFallbackAt` and/or `lastFallbackToastAt` (e.g., `Map<string, number>` on `ctx` or via a helper/state store).
   - In the fallback branch, only call `showToast()` if the last toast was more than N seconds/minutes ago.
   - Optionally, in `pollAndEvaluate()`, skip re-evaluating sessions that had a fallback within a recent window (exponential backoff preferred), while still allowing quick retries when model/config changes are detected.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Expand Down
43 changes: 43 additions & 0 deletions tests/evaluate.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -323,6 +323,49 @@ describe("runEvaluation", () => {
expect(result).toBe(false)
})

test("fallback card (e.g. model unavailable) does NOT mark session as evaluated", async () => {
// Regression: a primary session whose scorer returns a fallback
// card (because session.create threw — e.g. user picked a model
// that doesn't exist, or the model provider is down) used to be
// added to ctx.sessionsEvaluated unconditionally. The
// runEvaluation() short-circuit at line 96 then skipped the
// session for the rest of the plugin lifetime, silently disabling
// any future re-attempt. The user would have to restart opencode
// to retry. The fix: the fallback path records nothing. Only the
// successful-recording path (recordSession + addEvaluatedSession)
// is allowed to mark a session as evaluated.
const ctx = mockCtx({
scorer: {
evaluate: async (_input: any) =>
makeScoreCard({ fallback: true, overall_score: 0.5 }),
} as any,
})
const result = await runEvaluation(
makePendingEval({ sessionID: "broken-model-session" }),
ctx,
)
expect(result).toBe(false)
// The fallback path must NOT add the session to sessionsEvaluated.
// If it does, the next runEvaluation() call for the same session
// id will short-circuit at the line-96 guard and never call the
// LLM again.
expect(ctx.sessionsEvaluated.has("broken-model-session")).toBe(false)

// And the next call must actually re-attempt the scoring.
let secondCallCount = 0
ctx.scorer = {
evaluate: async (_input: any) => {
secondCallCount++
return makeScoreCard({ fallback: true, overall_score: 0.5 })
},
} as any
await runEvaluation(
makePendingEval({ sessionID: "broken-model-session" }),
ctx,
)
expect(secondCallCount).toBe(1)
})

test("truncates long agent response before scoring", async () => {
let receivedResponse = ""
const ctx = mockCtx({
Expand Down
Loading