@jverre
Collaborator

@jverre jverre commented Dec 5, 2025

Details

This PR adds a comprehensive Opik integration for Harbor, a benchmark evaluation framework for autonomous LLM agents. The integration enables real-time observability for agent benchmark evaluations (SWE-bench, LiveCodeBench, Terminal-Bench, etc.).

[Image: Screenshot 2025-12-05 at 15 56 03]

Key features:

Python SDK Integration (opik.integrations.harbor):

  • track_harbor(job) - Wraps Harbor Job instances with Opik tracking
  • enable_tracking() - Global tracking enablement for Harbor Trial and Verifier classes
  • Automatic patching of Harbor's Step class for real-time trajectory step tracking
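
To illustrate the patching approach, here is a minimal sketch of wrapping a class method so that every call is recorded. The `Trial` class, the `recorded` list, and `enable_tracking_sketch` are hypothetical stand-ins for illustration only; they are not Harbor's classes or the code in `opik_tracker.py`.

```python
import functools


class Trial:
    """Stand-in for a Harbor trial; NOT the real Harbor class."""

    def __init__(self, name):
        self.name = name

    def run(self):
        return {"trial": self.name, "reward": 1.0}


# Stand-in for a tracing backend (assumption: real tracking would send to Opik).
recorded = []


def enable_tracking_sketch(cls, method_name="run"):
    """Patch cls.<method_name> in place so each call is recorded once."""
    original = getattr(cls, method_name)
    if getattr(original, "_tracked", False):
        return  # already patched; keep the operation idempotent

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        recorded.append(
            {"name": f"{cls.__name__}.{method_name}", "output": result}
        )
        return result

    wrapper._tracked = True
    setattr(cls, method_name, wrapper)


enable_tracking_sketch(Trial)
enable_tracking_sketch(Trial)  # second call is a no-op
Trial("hello-world").run()
```

Marking the wrapper with `_tracked` is one way to avoid double-wrapping if tracking is enabled more than once; the actual integration may guard against this differently.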

CLI Integration (opik harbor):

  • New CLI command that wraps Harbor CLI with automatic Opik tracking
  • Supports all Harbor subcommands (run, jobs start, trials start, etc.)
  • Usage: opik harbor run -d terminal-bench@head -a terminus_2 -m gpt-4.1
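
As a rough sketch of the wrapper idea (enable tracking first, then hand the untouched arguments to the wrapped CLI), assuming both steps are injected as callables. `harbor_passthrough`, `fake_enable`, and `fake_harbor` are hypothetical names for illustration and do not reflect the contents of `sdks/python/src/opik/cli/harbor.py`.

```python
from typing import Callable, List, Sequence


def harbor_passthrough(
    argv: Sequence[str],
    enable_tracking: Callable[[], None],
    harbor_main: Callable[[List[str]], int],
) -> int:
    """Turn tracking on, then forward the arguments unchanged."""
    enable_tracking()
    return harbor_main(list(argv))


# Stand-ins so the sketch runs without Harbor or Opik installed.
calls = []


def fake_enable() -> None:
    calls.append("enable")


def fake_harbor(args: List[str]) -> int:
    calls.append(("harbor", args))
    return 0


exit_code = harbor_passthrough(
    ["run", "-d", "terminal-bench@head", "-a", "terminus_2", "-m", "gpt-4.1"],
    fake_enable,
    fake_harbor,
)
```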

Data Mapping:

  • Trial results → Opik traces with timing, metadata, and agent/model info
  • ATIF trajectory steps → Nested spans with tool calls, observations, token usage, and costs
  • Verifier rewards → Feedback scores (e.g., pass/fail, tests_passed)
  • Automatic experiment creation linking trials to datasets per benchmark source
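
The mapping above can be pictured with a small sketch that turns one trial into a trace-shaped dictionary with nested spans and feedback scores. `StepSketch`, `TrialSketch`, `to_trace_payload`, and all field names are assumptions made for illustration; they are not the real Harbor, ATIF, or Opik schemas.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StepSketch:
    """Hypothetical stand-in for one trajectory step."""

    source: str
    message: str
    prompt_tokens: int = 0
    completion_tokens: int = 0


@dataclass
class TrialSketch:
    """Hypothetical stand-in for a trial result."""

    trial_name: str
    agent: str
    model: str
    steps: List[StepSketch] = field(default_factory=list)
    rewards: Dict[str, float] = field(default_factory=dict)


def to_trace_payload(trial: TrialSketch) -> Dict[str, Any]:
    """Map a trial to one trace, its steps to spans, its rewards to scores."""
    spans = [
        {
            "name": f"step-{index}",
            "input": {"source": step.source},
            "output": {"message": step.message},
            "usage": {
                "prompt_tokens": step.prompt_tokens,
                "completion_tokens": step.completion_tokens,
            },
        }
        for index, step in enumerate(trial.steps, start=1)
    ]
    feedback_scores = [
        {"name": name, "value": value} for name, value in trial.rewards.items()
    ]
    return {
        "name": trial.trial_name,
        "metadata": {"agent": trial.agent, "model": trial.model},
        "spans": spans,
        "feedback_scores": feedback_scores,
    }


trial = TrialSketch(
    trial_name="hello-world__abc",
    agent="terminus_2",
    model="gpt-4.1",
    steps=[
        StepSketch("user", "Create hello.txt"),
        StepSketch("agent", "Done", prompt_tokens=120, completion_tokens=8),
    ],
    rewards={"tests_passed": 1.0},
)
payload = to_trace_payload(trial)
```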

Change checklist

  • User facing
  • Documentation update

Issues

  • Resolves #
  • OPIK-3358

Testing

Tested manually and created library integration tests

Documentation

Documentation was updated

Copilot AI review requested due to automatic review settings December 5, 2025 17:58
@jverre jverre requested review from a team as code owners December 5, 2025 17:58
@github-actions github-actions bot added the documentation, dependencies, python, Python SDK, and tests labels Dec 5, 2025
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a comprehensive integration between Opik and Harbor, a benchmark evaluation framework for autonomous LLM agents. The integration enables real-time observability for agent benchmark evaluations (SWE-bench, LiveCodeBench, Terminal-Bench, etc.) by tracking trials as traces, trajectory steps as spans, and verifier rewards as feedback scores.

Key changes:

  • Added Python SDK integration with track_harbor() and enable_tracking() functions
  • Added opik harbor CLI command that wraps Harbor CLI with automatic tracking
  • Added comprehensive documentation and examples

Reviewed changes

Copilot reviewed 17 out of 19 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| sdks/python/src/opik/integrations/harbor/opik_tracker.py | Core integration logic - patches Harbor's Trial/Verifier classes and Step class for real-time tracking |
| sdks/python/src/opik/integrations/harbor/experiment_service.py | Manages datasets and experiments for Harbor jobs, links trial traces to experiments |
| sdks/python/src/opik/cli/harbor.py | CLI command that wraps Harbor CLI with automatic Opik tracking |
| sdks/python/tests/e2e_library_integration/harbor/test_harbor_e2e.py | E2E tests for both SDK and CLI integration |
| sdks/python/examples/harbor_integration_example.py | Usage example showing how to track Harbor jobs |
| apps/opik-documentation/documentation/fern/docs/tracing/integrations/harbor.mdx | Complete integration documentation with API reference and examples |
| README.md and localized READMEs | Added Harbor to integrations table |

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

🌿 Preview your docs: https://opik-preview-46bb02b1-87bc-4e70-824c-9ff9228c1bf7.docs.buildwithfern.com/docs/opik

No broken links found

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

Images automagically compressed by Calibre's image-actions

Compression reduced images by 40.2%, saving 127.85 KB.

| Filename | Before | After | Improvement | Visual comparison |
| --- | --- | --- | --- | --- |
| apps/opik-documentation/documentation/fern/img/tracing/harbor_integration.png | 317.85 KB | 190.00 KB | -40.2% | View diff |

332 images did not require optimisation.

Update required: Update image-actions configuration to the latest version before 1/1/21. See README for instructions.

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

🌿 Preview your docs: https://opik-preview-dfc28fc9-d106-4976-9100-81bffa000e22.docs.buildwithfern.com/docs/opik

No broken links found

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

🌿 Preview your docs: https://opik-preview-a17f443f-0127-4a04-8df5-5f903b60ef95.docs.buildwithfern.com/docs/opik

The following broken links were found:

Page:
❌ Broken link: {} ()

Page:
❌ Broken link: ] ()

@github-actions
Contributor

github-actions bot commented Dec 8, 2025

🌿 Preview your docs: https://opik-preview-25135266-e318-46e2-af7d-2256e68043d6.docs.buildwithfern.com/docs/opik

No broken links found

@Lothiraldan
Contributor

It would be useful to also save the following information from the dataset:

  • The version of the dataset used, though I'm not sure where to save it.
  • Additional task data such as git_url, git_commit, and path, which could help catch whether a task definition has changed between two runs.
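
One possible shape for this, sketched under assumptions: the helper names, the metadata keys, and the example URL and commit values below are hypothetical and are not part of this PR.

```python
from typing import Dict, Optional


def build_task_metadata(
    dataset_name: str,
    dataset_version: Optional[str] = None,
    git_url: Optional[str] = None,
    git_commit: Optional[str] = None,
    path: Optional[str] = None,
) -> Dict[str, str]:
    """Collect task provenance fields, skipping any that are unknown."""
    candidates = {
        "dataset_name": dataset_name,
        "dataset_version": dataset_version,
        "git_url": git_url,
        "git_commit": git_commit,
        "path": path,
    }
    return {key: value for key, value in candidates.items() if value is not None}


def task_definition_changed(previous: Dict[str, str], current: Dict[str, str]) -> bool:
    """Report whether the task source differs between two runs."""
    keys = ("git_url", "git_commit", "path")
    return any(previous.get(key) != current.get(key) for key in keys)


# Example values are made up for illustration.
first_run = build_task_metadata(
    "terminal-bench", "head", "https://example.com/tasks.git", "aaa111", "tasks/hello"
)
second_run = build_task_metadata(
    "terminal-bench", "head", "https://example.com/tasks.git", "bbb222", "tasks/hello"
)
changed = task_definition_changed(first_run, second_run)
```

Storing these fields as trace or experiment metadata would make such a comparison possible after the fact; where exactly to store them is left open, as noted above.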

Lothiraldan previously approved these changes Dec 8, 2025
@github-actions
Contributor

github-actions bot commented Dec 9, 2025

SDK Unit Tests Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit e3419e0.

♻️ This comment has been updated with latest results.

@github-actions
Contributor

github-actions bot commented Dec 9, 2025

🌿 Preview your docs: https://opik-preview-f794d068-78b5-49b2-a52b-a1d643c50ff8.docs.buildwithfern.com/docs/opik

No broken links found


📌 Results for commit 85fc76e

@github-actions
Contributor

github-actions bot commented Dec 9, 2025

SDK E2E Tests Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit e3419e0.

♻️ This comment has been updated with latest results.
