Description
What component(s) are affected?
- Opik Python SDK
- Opik Typescript SDK
- Opik Agent Optimizer SDK
- Opik UI
- Opik Server
- Documentation
Opik version
- Opik version: 1.9.32
Describe the problem
Summary
When using GEval with LiteLLMChatModel and DashScope Qwen as the judge
model, some samples fail with MetricComputationError("Failed to calculate g-eval score").
This seems to happen when the logprobs-based scoring path is used.
Related: #4229
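For context, the "logprobs-based scoring path" refers to the G-Eval technique of deriving the final score from the token probabilities reported for the numeric score token, rather than from the sampled text alone (which is also why the per-sample scores in the logs below are fractional). A minimal sketch of that idea, assuming a plain dict of candidate score tokens to log probabilities and a 0-10 scoring scale; this is an illustration, not the actual Opik parser code:

```python
import math


def weighted_geval_score(top_logprobs: dict[str, float]) -> float:
    """Probability-weighted G-Eval score (illustration only, not Opik's parser).

    `top_logprobs` maps candidate score tokens (e.g. "0".."10") to log
    probabilities, as reported for the position where the judge emits its score.
    """
    weights: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        if token.strip().isdigit():
            score = int(token)
            weights[score] = weights.get(score, 0.0) + math.exp(logprob)

    total = sum(weights.values())
    if total == 0:
        # Nothing numeric at the expected position -> the logprobs path has
        # nothing to work with, which is roughly the failure mode reported here.
        raise ValueError("no numeric score candidates in top_logprobs")

    # Weighted average, normalized to 0-1 assuming a 0-10 scoring scale.
    return sum(s * p for s, p in weights.items()) / total / 10
```

If a provider's response does not expose the score token's top_logprobs where the parser looks for them, this computation has nothing to work with and the sample cannot be scored on this path.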
Reproduction steps and code snippets
Steps to reproduce
- Create a LiteLLMChatModel:

  ```python
  judge_model = models.LiteLLMChatModel(
      model_name="dashscope/qwen-flash",
      api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
      api_key=os.getenv("DASHSCOPE_API_KEY"),
  )
  ```

- Use it in a GEval metric (for example translation_quality) and run an evaluation on multiple samples (a fuller sketch follows after these steps).
- Observe that some samples succeed, but one or more fail with:

  Failed to parse model output: Failed to calculate g-eval score
  opik.exceptions.MetricComputationError: Failed to calculate g-eval score
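For completeness, a minimal end-to-end sketch of the setup described above. The GEval arguments and the evaluate call follow the Opik Python SDK docs, but the prompt text, dataset name, and sample data are placeholders rather than the exact code that produced the logs below:

```python
import os

import opik
from opik.evaluation import evaluate, models
from opik.evaluation.metrics import GEval

# Judge model from the step above.
judge_model = models.LiteLLMChatModel(
    model_name="dashscope/qwen-flash",
    api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
)

# Illustrative prompt text; the real prompts were longer, but their wording
# is not relevant to the failure.
translation_quality = GEval(
    name="translation_quality",
    task_introduction="You are an expert judge of translation quality.",
    evaluation_criteria="Rate how faithfully OUTPUT translates INPUT.",
    model=judge_model,
)

client = opik.Opik()
dataset = client.get_or_create_dataset(name="test dataset")
dataset.insert(
    # 4 placeholder items, matching the sample count in the logs below.
    [{"input": f"source sentence {i}", "reference": f"reference {i}"} for i in range(4)]
)


def task(item: dict) -> dict:
    # Placeholder application under test; GEval reads the "output" field.
    return {"output": f"candidate translation of: {item['input']}"}


evaluate(dataset=dataset, task=task, scoring_metrics=[translation_quality])
```

With four items, the run intermittently fails on one or more samples, as in the logs below.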
Error logs or stack trace
Evaluation: 0%| | 0/4 [00:00<?, ?it/s]OPIK: Started logging traces to the "Default Project" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=019acb86-b6c9-7853-b089-3ac8b5318cd4&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.
Evaluation: 25%|██▌ | 1/4 [00:08<00:24, 8.02s/it]
Evaluation: 50%|█████ | 2/4 [00:09<00:07, 3.99s/it]
Evaluation: 75%|███████▌ | 3/4 [00:10<00:02, 2.79s/it]OPIK: Failed to parse model output: Failed to calculate g-eval score
Traceback (most recent call last):
File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\parser.py", line 96, in parse_litellm_model_output
raise exceptions.MetricComputationError(GEVAL_SCORE_CALC_FAILED)
opik.exceptions.MetricComputationError: Failed to calculate g-eval score
OPIK: Failed to call LLM provider, reason: Failed to calculate g-eval score
OPIK: Failed to compute metric translation_quality. Score result will be marked as failed.
Traceback (most recent call last):
File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\parser.py", line 96, in parse_litellm_model_output
raise exceptions.MetricComputationError(GEVAL_SCORE_CALC_FAILED)
opik.exceptions.MetricComputationError: Failed to calculate g-eval score
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "(my local project path)\opik\sdks\python\src\opik\evaluation\models\base_model.py", line 132, in get_provider_response
yield model_provider.generate_provider_response(messages, **kwargs)
File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\metric.py", line 231, in score
return parser.parse_litellm_model_output(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\parser.py", line 109, in parse_litellm_model_output
raise exceptions.MetricComputationError(GEVAL_SCORE_CALC_FAILED) from exception
opik.exceptions.MetricComputationError: Failed to calculate g-eval score
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "(my local project path)\opik\sdks\python\src\opik\evaluation\engine\metrics_evaluator.py", line 83, in _compute_metric_scores
result = metric.score(**mapped_scoring_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "(my local project path)\opik\sdks\python\src\opik\decorator\base_track_decorator.py", line 355, in wrapper
raise func_exception
File "(my local project path)\opik\sdks\python\src\opik\decorator\base_track_decorator.py", line 328, in wrapper
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\metric.py", line 226, in score
with base_model.get_provider_response(
File "C:\Program Files\Python311\Lib\contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "(my local project path)\opik\sdks\python\src\opik\evaluation\models\base_model.py", line 135, in get_provider_response
raise exceptions.BaseLLMError(str(e))
opik.exceptions.BaseLLMError: LLM infrastructure error: Failed to calculate g-eval score
OPIK: No valid traces feedback scores to log from provided ones: []
Evaluation: 100%|██████████| 4/4 [00:10<00:00, 1.76s/it]
Evaluation: 100%|██████████| 4/4 [00:10<00:00, 2.68s/it]
┌─ Evaluation test dataset (4 samples) ────────┐
│ │
│ Total time: 00:00:10 │
│ Number of samples: 4 │
│ │
│ translation_quality: 0.2859 (avg) - 1 failed │
│ │
└──────────────────────────────────────────────┘
Uploading results to Opik ...
View the results in your Opik dashboard.
translation_quality ScoreStatistics(mean=0.28594319453714584, max=0.6433721530377606, min=0.10000388221784111, values=[0.10000388221784111, 0.11445354835583574, 0.6433721530377606], std=0.30962686171261133)
Notes
From reading parse_litellm_model_output in sdks/python/src/opik/evaluation/metrics/llm_judges/g_eval/parser.py, it looks like the logprobs-aware path assumes an OpenAI-style logprobs structure and a specific token position for the numeric score. Qwen responses may not always match this shape, which might explain the intermittent failures.
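If that is the case, a defensive fallback in the parser could keep such samples scoreable. The sketch below is hypothetical (parse_score_with_fallback is not an existing function, and the attributes assume an OpenAI-style litellm response object): it uses the logprobs-weighted score when the expected shape is present and otherwise falls back to the literal score in the message text:

```python
import math
import re


def parse_score_with_fallback(response) -> float:
    """Hypothetical sketch, not the existing parser: prefer the logprobs-weighted
    score, but fall back to the literal score in the message text when the
    logprobs payload is missing or does not follow the OpenAI-style shape
    (choices[0].logprobs.content[i].top_logprobs)."""
    choice = response.choices[0]
    logprobs = getattr(choice, "logprobs", None)
    token_infos = getattr(logprobs, "content", None) or []

    for token_info in token_infos:
        if not (token_info.token or "").strip().isdigit():
            continue
        # Numeric score token with alternatives: weight candidates by probability.
        weights: dict[int, float] = {}
        for alt in getattr(token_info, "top_logprobs", None) or []:
            if alt.token.strip().isdigit():
                score = int(alt.token)
                weights[score] = weights.get(score, 0.0) + math.exp(alt.logprob)
        if weights:
            total = sum(weights.values())
            return sum(s * p for s, p in weights.items()) / total

    # Fallback: no usable logprobs shape -> take the score from the text itself.
    match = re.search(r'"?score"?\s*[:=]\s*(\d+(?:\.\d+)?)', choice.message.content)
    if match is None:
        raise ValueError("no score found in model output")
    return float(match.group(1))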
As a local workaround I disabled the logprobs path for DashScope/Qwen, and the
evaluation then runs reliably.
The earlier top_logprobs clamping to 0–5 for DashScope Qwen was based on the
Qwen API reference: https://www.alibabacloud.com/help/en/model-studio/qwen-api-reference
Healthcheck results
No response