Skip to content

Conversation

@estsauver
Copy link

@estsauver estsauver commented Feb 6, 2026

Description

This PR consolidates three separate callbacks (on_start, on_progress, on_log) into a unified event-based system, addressing issue #755. The new system uses TypedDict events with Literal discriminators for type-safe pattern matching, making it cleaner, more extensible, and easier to maintain.

Context

I'm new to this codebase and the broader Prime Intellect ecosystem, so I focused on:

  1. Understanding the existing patterns (TypedDict, maybe_await, etc.)
  2. Following established conventions
  3. Thorough testing to ensure no regressions
  4. Clear documentation for future maintainers

Feedback welcome, especially on areas where I might have missed broader integration concerns!

Motivation

The previous callback system had grown brittle with three separate callback parameters that felt "tailor-made" for specific use cases. This PR:

Before / After

Before:

await env.evaluate(
    client=client,
    model="gpt-4",
    on_start=lambda total: ...,
    on_progress=lambda all_outs, new_outs: ...,
    on_log=lambda msg: ...
)

After:

async def on_event(event: EvalEvent):
    match event["type"]:
        case "start":
            print(f"Starting: {event['num_examples']} examples")
        case "progress":
            print(f"Progress: {event['completed_count']}/{event['total_count']}")
        case "complete":
            print(f"Done! Avg reward: {event['avg_reward']}")

await env.evaluate(
    client=client,
    model="gpt-4",
    on_event=on_event  # Single unified handler!
)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Note: This is a breaking change that removes on_start, on_progress, and on_log parameters. All internal usages have been migrated to the new event system.

Changes

New Event Types (verifiers/types.py)

  • StartEvent - Emitted once at start with resolved num_examples and rollouts_per_example
  • ProgressEvent - Emitted after each rollout/group completes with all_outputs and new_outputs
  • GroupCompleteEvent - For grouped scoring, includes full State objects (addresses PR Feature: Add on_group_complete callback for streaming eval results. #632)
  • LogEvent - For log messages with level, source, and timestamp
  • LogStreamEvent - Infrastructure for streaming logs to files (stream logs of envs into eval tui #753)
  • SaveEvent - When results are saved (intermediate or final)
  • CompleteEvent - When generation finishes with timing and metrics

Event Emission (verifiers/envs/environment.py)

  • Updated generate() and evaluate() signatures to use on_event parameter
  • Added _emit_event() helper using maybe_await() for sync/async handlers
  • Added _run_group_with_states() internal method to preserve State objects for GroupCompleteEvent
  • Events emitted at all key points: start, progress, group complete, save, log, complete

Event Consumption (verifiers/utils/eval_utils.py)

  • Migrated run_evaluations_tui() to use match/case pattern for event handling
  • All metric accumulation logic preserved
  • Progress bar, display updates, and logging all driven by events

Supporting Infrastructure

  • New file: verifiers/utils/event_utils.py - LogStreamFileWriter for log tailing
  • Updated: verifiers/utils/eval_display.py - Comments updated for new event system

Testing

Comprehensive test coverage demonstrates the system works correctly:

Unit Tests (tests/test_event_system.py - 10 tests)

  • ✅ Event type structure validation
  • ✅ LogStreamFileWriter functionality (file creation, appending, custom paths)
  • ✅ All tests pass

E2E Scenarios (tests/test_event_system_e2e.py - 4 scenarios)

Standalone executable script with realistic integration scenarios:

  • ✅ Scenario 1: Simple independent scoring
  • ✅ Scenario 2: Grouped scoring with multiple rollouts (tests GroupCompleteEvent)
  • ✅ Scenario 3: Intermediate saves (tests SaveEvent emission)
  • ✅ Scenario 4: Progress tracking with metrics
  • ✅ All scenarios validate event order, data completeness, and counts
  • Run with: uv run python tests/test_event_system_e2e.py

Integration Testing

  • All existing tests pass when running uv run pytest locally (514 tests pass, 4 skipped external env tests)
  • New tests have been added to cover the changes (10 unit + 4 e2e scenarios)
  • Verified with real vf-eval command - progress bar and TUI work correctly

Manual Testing

Ran actual evaluation with vf-eval using a test environment:

$ uv run vf-eval test_config.toml
Processing 2 groups (2 total rollouts): 100%|██████████| 2/2 [00:00<00:00]
Evaluation completed in 1.94 seconds

✅ Progress bar updates correctly (requires ProgressEvent)
✅ Results display properly (requires CompleteEvent)

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (MEMORY.md added)
  • My changes generate no new warnings (only pre-existing experimental uv warnings)
  • Any dependent changes have been merged and published (N/A)

Additional Notes

Design Decisions

Why TypedDict over dataclasses?

  • Matches existing patterns in the codebase
  • JSON-serializable by default
  • Works well with Literal discriminators for type-safe pattern matching

Why break backward compatibility?

  • The previous callback system was acknowledged as "tailor-made/brittle" (issue consolidate callbacks for eval tui #755)
  • Clean break is simpler than maintaining an adapter layer
  • All internal usages migrated in this PR
  • Better to do it now than carry technical debt

State Preservation Strategy

  • Created internal _run_group_with_states() method to return both State objects and outputs
  • Public run_group() API remains unchanged (returns only outputs)
  • GroupCompleteEvent receives State objects without breaking existing callers

Future Work

  • Full subprocess log streaming implementation (infrastructure in place)
  • Additional event types as needed (e.g., ErrorEvent for failures)
  • TUI features that leverage State objects from GroupCompleteEvent

Related Issues


Note

Medium Risk
Breaking API change in core evaluation/generation paths; regressions could affect progress reporting, checkpointing, or grouped/server-mode execution despite added tests and guards.

Overview
Switches the evaluation/generation callback API from three separate hooks (on_start, on_progress, on_log) to a single on_event handler carrying typed EvalEvent payloads (start, progress, group_complete, save, log, log_stream, complete), and updates Environment.generate()/evaluate() to emit these events at key lifecycle points.

Adds grouped-scoring state preservation via _run_group_with_states() to support GroupCompleteEvent (while preserving server-mode routing through run_group()), emits intermediate SaveEvents during incremental checkpointing, and avoids unnecessary event construction when no handler is registered (plus copies list fields to prevent post-emission mutation).

Migrates the CLI/TUI progress display to consume events, introduces LogStreamFileWriter utility for log_stream events, and updates docs (docs/reference.md, docs/evaluation.md) and adds focused unit/e2e tests covering event structure, ordering, server-mode behavior, save events, and immutability.

Written by Cursor Bugbot for commit cc8f0cb. This will update automatically on new commits. Configure here.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Earl St Sauver seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Earl St Sauver added 8 commits February 6, 2026 17:09
…ai#755)

Replace three separate callbacks (on_start, on_progress, on_log) with a
single event-based system using TypedDict events with Literal discriminators.

- Remove: StartCallback, ProgressCallback, LogCallback
- Add 7 new event TypedDicts with type discriminator:
  - StartEvent: emitted at start with resolved counts
  - ProgressEvent: emitted after each rollout/group completes
  - GroupCompleteEvent: for PR PrimeIntellect-ai#632, includes State objects
  - LogEvent: for log messages with level/source/timestamp
  - LogStreamEvent: infrastructure for PrimeIntellect-ai#753 log streaming
  - SaveEvent: when results are saved (intermediate or final)
  - CompleteEvent: when generation finishes
- Add EvalEvent union type and EventHandler callback type

- LogStreamFileWriter: writes log_stream events to files for tailing

- Update generate() and evaluate() signatures: single on_event parameter
- Add _emit_event() helper using maybe_await()
- Add _run_group_with_states() to preserve State objects
- Emit events at all key points:
  - StartEvent with accurate num_examples calculation
  - ProgressEvent after each completion
  - GroupCompleteEvent for non-independent scoring (with States)
  - SaveEvent for intermediate and final saves
  - LogEvent for important messages
  - CompleteEvent at the end

- Update run_evaluation() signature to use on_event
- Rewrite run_evaluations_tui() with match/case event handler
- Preserve all metric accumulation logic

- Update comments to reflect new event system
- Add TODO for full log_streams implementation

- Event type structure tests
- LogStreamFileWriter tests (creation, appending, custom paths)
- All 35 eval-related tests pass

Removed callback parameters from:
- Environment.generate()
- Environment.evaluate()
- run_evaluation()

Replaced with: on_event: EventHandler

- Issue PrimeIntellect-ai#755: Consolidate callbacks for eval TUI
- PR PrimeIntellect-ai#632: GroupCompleteEvent with State objects (infrastructure)
- Issue PrimeIntellect-ai#753: Log streaming infrastructure (LogStreamEvent + writer)
Created comprehensive e2e test suite with 4 realistic scenarios:
1. Simple independent scoring evaluation
2. Grouped scoring with multiple rollouts (tests GroupCompleteEvent)
3. Intermediate saves (tests SaveEvent)
4. Progress tracking with metrics

Each scenario validates:
- Event emission order (start -> progress -> complete)
- Event data completeness
- Correct event counts
- StartEvent with resolved num_examples/rollouts
- CompleteEvent with timing and metrics

All 4 scenarios pass, demonstrating the event system works correctly
in realistic usage patterns.

Related to PrimeIntellect-ai#755
Addresses three bugs found in code review:

1. Server mode bypass (HIGH): Fixed grouped scoring to check for server
   mode before calling _run_group_with_states(). In server mode, properly
   routes through run_group() instead of bypassing server dispatch.

2. Incorrect num_examples calculation (MEDIUM): Fixed StartEvent to
   report correct num_examples when independent_scoring=True with
   rollouts_per_example > 1. Added configured_rollouts_per_example
   parameter to generate() for accurate calculation.

3. Missing documentation (LOW): Added comprehensive documentation for
   event types and EventHandler to docs/reference.md and usage examples
   to docs/evaluation.md.

Tests: Added test_bugfix_event_system.py with 5 tests covering all fixes.
All 19 event system tests now pass (10 unit + 4 e2e + 5 bugfix tests).
Problem: Events were storing direct references to builder.outputs and
new_outputs lists, which would mutate as more results were added. This
made stored events (e.g., in EventCollector) misleading - all_outputs
would grow even after the event was emitted.

Fix: Copy lists when creating events to ensure immutability.

Tests: Added test_event_immutability.py with 2 tests verifying that
ProgressEvent and GroupCompleteEvent data doesn't mutate after emission.
Problem: Changed 'elif on_progress is not None:' to bare 'else:' which
unconditionally creates ProgressEvent objects (including list copies)
even when on_event=None. This causes O(N²) allocations across N
completions, affecting production code like GEPA that calls generate()
with no event handler.

Fix: Use 'elif on_event is not None:' to skip event construction when
no handler is registered, matching the original callback pattern.

Found by: Cursor Bugbot
name: metrics_accum[name] / completed
for name in metrics_accum
}
error_rate = error_accum / completed if completed > 0 else 0
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TUI metrics wrong when resuming evaluations

Medium Severity

When resuming a partially completed evaluation, completed_count from ProgressEvent includes resumed outputs (via len(builder.outputs) after builder.add_outputs(resumed_outputs)), but the TUI handler's rolling accumulators (reward_accum, metrics_accum, error_accum) only accumulate from new_outputs in each event. Dividing by completed therefore produces artificially low averages since the denominator counts resumed outputs that were never added to the numerator.

Additional Locations (1)

Fix in Cursor Fix in Web

The configured_rollouts_per_example parameter was added but never used.
The existing logic using len(set([example_id])) already correctly
calculates num_examples by counting unique example IDs.

- Remove unused parameter from generate() signature
- Remove parameter passing in evaluate()
- Update test documentation to clarify it tests correct existing behavior
- Update PR description to document this fix as Bug 5
The generate() method performed incremental saves after each task
completion but never emitted SaveEvent with is_intermediate=True,
making the TUI save handler dead code.

- Emit SaveEvent(is_intermediate=True) after each incremental save
- Add test coverage for intermediate save events
- Update PR description to document this fix as Bug 3
- TUI handler now correctly displays checkpoint messages

Fixes bugbot feedback about dead code in TUI save handler.
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

else:
remaining_rollouts = len(filtered_inputs)
saved_rollouts = total_rollouts - remaining_rollouts

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-TUI path lost default progress bar entirely

Medium Severity

The old generate() method provided default callbacks (default_on_start, default_on_progress, default_on_log) that created a tqdm progress bar when no custom callbacks were given. These were removed without a replacement. The non-TUI evaluation path (run_evaluations()) calls run_evaluation() without an on_event handler, so evaluations now run completely silently with no progress indication — a regression from the prior behavior.

Fix in Cursor Fix in Web

on_start: StartCallback | None = None,
on_progress: ProgressCallback | None = None,
on_log: LogCallback | None = None,
on_event: EventHandler | None = None,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GEPA adapter passes removed callback parameters to generate

High Severity

The GEPA adapter at verifiers/gepa/adapter.py still passes on_start=do_nothing and on_progress=do_nothing to env.generate(), but this commit removed those parameters from generate() (replaced with on_event). This will cause a TypeError at runtime, crashing any GEPA-based evaluation. The PR states "All internal usages have been migrated" but this caller was missed.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

consolidate callbacks for eval tui

2 participants