Consolidate eval callbacks into unified event system (#755) #837
Conversation
Earl St Sauver seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Consolidate eval callbacks into unified event system (PrimeIntellect-ai#755)

Replace three separate callbacks (on_start, on_progress, on_log) with a single event-based system using TypedDict events with Literal discriminators.

- Remove: StartCallback, ProgressCallback, LogCallback
- Add 7 new event TypedDicts with type discriminator:
  - StartEvent: emitted at start with resolved counts
  - ProgressEvent: emitted after each rollout/group completes
  - GroupCompleteEvent: for PR PrimeIntellect-ai#632, includes State objects
  - LogEvent: for log messages with level/source/timestamp
  - LogStreamEvent: infrastructure for PrimeIntellect-ai#753 log streaming
  - SaveEvent: when results are saved (intermediate or final)
  - CompleteEvent: when generation finishes
- Add EvalEvent union type and EventHandler callback type
- LogStreamFileWriter: writes log_stream events to files for tailing
- Update generate() and evaluate() signatures: single on_event parameter
- Add _emit_event() helper using maybe_await()
- Add _run_group_with_states() to preserve State objects
- Emit events at all key points:
  - StartEvent with accurate num_examples calculation
  - ProgressEvent after each completion
  - GroupCompleteEvent for non-independent scoring (with States)
  - SaveEvent for intermediate and final saves
  - LogEvent for important messages
  - CompleteEvent at the end
- Update run_evaluation() signature to use on_event
- Rewrite run_evaluations_tui() with match/case event handler
- Preserve all metric accumulation logic
- Update comments to reflect new event system
- Add TODO for full log_streams implementation
- Event type structure tests
- LogStreamFileWriter tests (creation, appending, custom paths)
- All 35 eval-related tests pass

Removed callback parameters from:
- Environment.generate()
- Environment.evaluate()
- run_evaluation()

Replaced with: on_event: EventHandler

- Issue PrimeIntellect-ai#755: Consolidate callbacks for eval TUI
- PR PrimeIntellect-ai#632: GroupCompleteEvent with State objects (infrastructure)
- Issue PrimeIntellect-ai#753: Log streaming infrastructure (LogStreamEvent + writer)
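For context, a minimal sketch of the sync/async dispatch this commit describes; the actual `maybe_await()` helper in the codebase may differ:

```python
# Hedged sketch of _emit_event(): accept both sync and async handlers.
import inspect

async def _emit_event(on_event, event):
    if on_event is None:
        return
    result = on_event(event)
    if inspect.isawaitable(result):  # async handler returned a coroutine
        await result
```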
Created comprehensive e2e test suite with 4 realistic scenarios:
1. Simple independent scoring evaluation
2. Grouped scoring with multiple rollouts (tests GroupCompleteEvent)
3. Intermediate saves (tests SaveEvent)
4. Progress tracking with metrics

Each scenario validates:
- Event emission order (start -> progress -> complete)
- Event data completeness
- Correct event counts
- StartEvent with resolved num_examples/rollouts
- CompleteEvent with timing and metrics

All 4 scenarios pass, demonstrating the event system works correctly in realistic usage patterns.

Related to PrimeIntellect-ai#755
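A small sketch of the kind of ordering assertion these scenarios describe (helper name and exact checks are illustrative, not the test suite's actual code):

```python
def check_event_order(events, expected_progress_count):
    """Assert the start -> progress -> complete ordering described above."""
    types = [e["type"] for e in events]
    assert types[0] == "start", "StartEvent must be emitted first"
    assert types[-1] == "complete", "CompleteEvent must be emitted last"
    assert types.count("progress") == expected_progress_count
```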
Addresses three bugs found in code review:

1. Server mode bypass (HIGH): Fixed grouped scoring to check for server mode before calling _run_group_with_states(). In server mode, properly routes through run_group() instead of bypassing server dispatch.
2. Incorrect num_examples calculation (MEDIUM): Fixed StartEvent to report correct num_examples when independent_scoring=True with rollouts_per_example > 1. Added configured_rollouts_per_example parameter to generate() for accurate calculation.
3. Missing documentation (LOW): Added comprehensive documentation for event types and EventHandler to docs/reference.md and usage examples to docs/evaluation.md.

Tests: Added test_bugfix_event_system.py with 5 tests covering all fixes. All 19 event system tests now pass (10 unit + 4 e2e + 5 bugfix tests).
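A hedged sketch of the routing check in fix 1; the wrapper, flag, and return shapes are assumptions, and only the two method names come from the PR:

```python
async def _score_group(self, group_inputs):
    # Hypothetical wrapper: server mode must keep routing through run_group()
    # so server dispatch is not bypassed.
    if self.server_mode:                                    # assumed flag
        outputs = await self.run_group(group_inputs)        # server dispatch
        states = None                                       # no States on this path
    else:
        outputs, states = await self._run_group_with_states(group_inputs)
    return outputs, states
```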
Problem: Events were storing direct references to builder.outputs and new_outputs lists, which would mutate as more results were added. This made stored events (e.g., in EventCollector) misleading: all_outputs would grow even after the event was emitted.

Fix: Copy lists when creating events to ensure immutability.

Tests: Added test_event_immutability.py with 2 tests verifying that ProgressEvent and GroupCompleteEvent data doesn't mutate after emission.
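A minimal self-contained demonstration of the aliasing problem and the copy fix (illustrative values only):

```python
outputs = ["rollout-1", "rollout-2"]
event_aliased = {"type": "progress", "all_outputs": outputs}       # bug: live reference
event_copied = {"type": "progress", "all_outputs": list(outputs)}  # fix: snapshot

outputs.append("rollout-3")  # generation continues after emission
assert event_aliased["all_outputs"] == ["rollout-1", "rollout-2", "rollout-3"]  # mutated
assert event_copied["all_outputs"] == ["rollout-1", "rollout-2"]                # stable
```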
Problem: Changed 'elif on_progress is not None:' to a bare 'else:', which unconditionally creates ProgressEvent objects (including list copies) even when on_event=None. This causes O(N²) allocations across N completions, affecting production code like GEPA that calls generate() with no event handler.

Fix: Use 'elif on_event is not None:' to skip event construction when no handler is registered, matching the original callback pattern.

Found by: Cursor Bugbot
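A sketch of the guarded-construction pattern the fix restores (names other than on_event are hypothetical):

```python
def maybe_emit_progress(on_event, all_outputs, new_outputs, completed_count):
    if on_event is None:
        return  # skip event construction (and the list copies) entirely
    on_event({
        "type": "progress",
        "completed_count": completed_count,
        "all_outputs": list(all_outputs),  # copies are O(N) per event,
        "new_outputs": list(new_outputs),  # so only pay when a handler exists
    })
```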
```python
    name: metrics_accum[name] / completed
    for name in metrics_accum
}
error_rate = error_accum / completed if completed > 0 else 0
```
TUI metrics wrong when resuming evaluations
Medium Severity
When resuming a partially completed evaluation, completed_count from ProgressEvent includes resumed outputs (via len(builder.outputs) after builder.add_outputs(resumed_outputs)), but the TUI handler's rolling accumulators (reward_accum, metrics_accum, error_accum) only accumulate from new_outputs in each event. Dividing by completed therefore produces artificially low averages since the denominator counts resumed outputs that were never added to the numerator.
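To make the dilution concrete, a runnable sketch (the averaging-over-all_outputs fix shown at the end is one possible resolution, not a patch from this thread):

```python
resumed = [{"reward": 1.0}] * 50           # outputs restored from a previous run
new = [{"reward": 1.0}] * 10               # outputs that arrived via ProgressEvents
completed_count = len(resumed) + len(new)  # 60, as reported by the event

reward_accum = sum(o["reward"] for o in new)  # TUI accumulators only saw new outputs
print(reward_accum / completed_count)         # 0.166..., not the true 1.0

# One consistent alternative: average over all outputs carried by the event.
all_outputs = resumed + new
print(sum(o["reward"] for o in all_outputs) / completed_count)  # 1.0
```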
Additional Locations (1)
The configured_rollouts_per_example parameter was added but never used. The existing logic using len(set([example_id])) already correctly calculates num_examples by counting unique example IDs.

- Remove unused parameter from generate() signature
- Remove parameter passing in evaluate()
- Update test documentation to clarify it tests correct existing behavior
- Update PR description to document this fix as Bug 5
The generate() method performed incremental saves after each task completion but never emitted SaveEvent with is_intermediate=True, making the TUI save handler dead code.

- Emit SaveEvent(is_intermediate=True) after each incremental save
- Add test coverage for intermediate save events
- Update PR description to document this fix as Bug 3
- TUI handler now correctly displays checkpoint messages

Fixes bugbot feedback about dead code in TUI save handler.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
else:
    remaining_rollouts = len(filtered_inputs)
    saved_rollouts = total_rollouts - remaining_rollouts
```
Non-TUI path lost default progress bar entirely
Medium Severity
The old generate() method provided default callbacks (default_on_start, default_on_progress, default_on_log) that created a tqdm progress bar when no custom callbacks were given. These were removed without a replacement. The non-TUI evaluation path (run_evaluations()) calls run_evaluation() without an on_event handler, so evaluations now run completely silently with no progress indication — a regression from the prior behavior.
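As a hedged sketch, a default handler could restore a tqdm bar under the new API (event field names follow this PR's event descriptions; nothing here is the repo's actual code):

```python
from tqdm import tqdm

def make_default_on_event():
    pbar = None
    def on_event(event):
        nonlocal pbar
        match event["type"]:
            case "start":
                total = event["num_examples"] * event["rollouts_per_example"]
                pbar = tqdm(total=total, desc="evaluating")
            case "progress":
                if pbar is not None:
                    pbar.n = event["completed_count"]
                    pbar.refresh()
            case "complete":
                if pbar is not None:
                    pbar.close()
    return on_event
```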
```diff
-    on_start: StartCallback | None = None,
-    on_progress: ProgressCallback | None = None,
-    on_log: LogCallback | None = None,
+    on_event: EventHandler | None = None,
```
GEPA adapter passes removed callback parameters to generate
High Severity
The GEPA adapter at verifiers/gepa/adapter.py still passes on_start=do_nothing and on_progress=do_nothing to env.generate(), but this commit removed those parameters from generate() (replaced with on_event). This will cause a TypeError at runtime, crashing any GEPA-based evaluation. The PR states "All internal usages have been migrated" but this caller was missed.
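A hedged sketch of the call-site migration (argument list is illustrative; only the parameter rename comes from this PR):

```python
async def run_gepa_eval(env, inputs, client):
    # Old call shape (now a TypeError):
    #   env.generate(inputs, client=client, on_start=do_nothing, on_progress=do_nothing)
    # New: a single no-op handler, mirroring the old do_nothing callbacks.
    return await env.generate(inputs, client=client, on_event=lambda event: None)
```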


Description
This PR consolidates three separate callbacks (on_start, on_progress, on_log) into a unified event-based system, addressing issue #755. The new system uses TypedDict events with Literal discriminators for type-safe pattern matching, making it cleaner, more extensible, and easier to maintain.
Context
I'm new to this codebase and the broader Prime Intellect ecosystem, so I focused on:
Feedback welcome, especially on areas where I might have missed broader integration concerns!
Motivation
The previous callback system had grown brittle, with three separate callback parameters that felt "tailor-made" for specific use cases. This PR replaces them with a single on_event handler.
Before / After
Before:
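A hedged reconstruction of the old three-callback style (exact signatures are assumptions drawn from the callback names above):

```python
async def watch_eval_old(env, client):
    # Hypothetical call shape; the real callback signatures may have differed.
    return await env.evaluate(
        client=client,
        on_start=lambda total: print(f"starting {total} rollouts"),
        on_progress=lambda completed, total: print(f"{completed}/{total} done"),
        on_log=lambda message: print(message),
    )
```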
After:
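And a hedged sketch of the unified replacement (field names follow the event descriptions in this PR):

```python
async def watch_eval_new(env, client):
    def on_event(event):
        match event["type"]:
            case "start":
                print(f"starting {event['num_examples']} examples "
                      f"x {event['rollouts_per_example']} rollouts")
            case "progress":
                print(f"{event['completed_count']} completed")
            case "complete":
                print("done")
    return await env.evaluate(client=client, on_event=on_event)
```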
Type of Change
Note: This is a breaking change that removes the on_start, on_progress, and on_log parameters. All internal usages have been migrated to the new event system.
Changes
New Event Types (verifiers/types.py)
- StartEvent with resolved num_examples and rollouts_per_example
- ProgressEvent with all_outputs and new_outputs
- GroupCompleteEvent with State objects (addresses PR #632, "Feature: Add on_group_complete callback for streaming eval results")
Event Emission (verifiers/envs/environment.py)
- Updated generate() and evaluate() signatures to use the on_event parameter
- Added _emit_event() helper using maybe_await() for sync/async handlers
- Added _run_group_with_states() internal method to preserve State objects for GroupCompleteEvent
Event Consumption (verifiers/utils/eval_utils.py)
- Rewrote run_evaluations_tui() to use a match/case pattern for event handling
Supporting Infrastructure
- verifiers/utils/event_utils.py - LogStreamFileWriter for log tailing
- verifiers/utils/eval_display.py - comments updated for the new event system
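To illustrate the shape of these types, a minimal sketch (field sets beyond those named above are assumptions, and the union is abbreviated; the real EvalEvent has seven members):

```python
from typing import Any, Awaitable, Callable, Literal, TypedDict, Union

class StartEvent(TypedDict):
    type: Literal["start"]
    num_examples: int
    rollouts_per_example: int

class ProgressEvent(TypedDict):
    type: Literal["progress"]
    completed_count: int
    all_outputs: list[Any]
    new_outputs: list[Any]

EvalEvent = Union[StartEvent, ProgressEvent]  # abbreviated for illustration
EventHandler = Callable[[EvalEvent], Union[None, Awaitable[None]]]  # sync or async
```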
Testing
Comprehensive test coverage demonstrates the system works correctly:
Unit Tests (tests/test_event_system.py - 10 tests)
E2E Scenarios (tests/test_event_system_e2e.py - 4 scenarios)
Standalone executable script with realistic integration scenarios:
uv run python tests/test_event_system_e2e.py
Integration Testing
- Ran uv run pytest locally (514 tests pass, 4 skipped external env tests)
- Tested the vf-eval command - progress bar and TUI work correctly
Manual Testing
Ran an actual evaluation with vf-eval using a test environment:
✅ Progress bar updates correctly (requires ProgressEvent)
✅ Results display properly (requires CompleteEvent)
Checklist
Additional Notes
Design Decisions
Why TypedDict over dataclasses?
Why break backward compatibility?
State Preservation Strategy
- Added _run_group_with_states() method to return both State objects and outputs
- Public run_group() API remains unchanged (returns only outputs)
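A hedged sketch of the shape of this strategy (body and helper names are invented for illustration; only the method name comes from the PR):

```python
async def _run_group_with_states(self, group_inputs):
    """Internal: like run_group(), but also returns the State objects so
    GroupCompleteEvent can carry them (sketch; not the repo's code)."""
    states = [await self._rollout(x) for x in group_inputs]   # hypothetical helper
    outputs = [self._state_to_output(s) for s in states]      # hypothetical helper
    return outputs, states
```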
Future Work
Related Issues
Note
Medium Risk
Breaking API change in core evaluation/generation paths; regressions could affect progress reporting, checkpointing, or grouped/server-mode execution despite added tests and guards.
Overview
Switches the evaluation/generation callback API from three separate hooks (on_start, on_progress, on_log) to a single on_event handler carrying typed EvalEvent payloads (start, progress, group_complete, save, log, log_stream, complete), and updates Environment.generate()/evaluate() to emit these events at key lifecycle points.
Adds grouped-scoring state preservation via _run_group_with_states() to support GroupCompleteEvent (while preserving server-mode routing through run_group()), emits intermediate SaveEvents during incremental checkpointing, and avoids unnecessary event construction when no handler is registered (plus copies list fields to prevent post-emission mutation).
Migrates the CLI/TUI progress display to consume events, introduces the LogStreamFileWriter utility for log_stream events, updates docs (docs/reference.md, docs/evaluation.md), and adds focused unit/e2e tests covering event structure, ordering, server-mode behavior, save events, and immutability.
Written by Cursor Bugbot for commit cc8f0cb. This will update automatically on new commits.