[Bug]: GEval LiteLLMChatModel with DashScope Qwen sometimes fails to calculate g-eval score #4270

@Susan9001

Description

What component(s) are affected?

  • [x] Opik Python SDK
  • [ ] Opik Typescript SDK
  • [ ] Opik Agent Optimizer SDK
  • [ ] Opik UI
  • [ ] Opik Server
  • [ ] Documentation

Opik version

  • Opik version: 1.9.32

Describe the problem

Summary

When using GEval with LiteLLMChatModel and DashScope Qwen as the judge
model, some samples fail with MetricComputationError("Failed to calculate g-eval score").
This seems to happen when the logprobs-based scoring path is used.

Related: #4229

Reproduction steps and code snippets

Steps to reproduce

  1. Create a LiteLLMChatModel:

    import os

    from opik.evaluation import models

    judge_model = models.LiteLLMChatModel(
        model_name="dashscope/qwen-flash",
        api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
    )
  2. Use it as the judge model in a GEval metric (for example, translation_quality) and run
    an evaluation over multiple samples (a sketch of this setup is shown after these steps).

  3. Observe that some samples succeed, but one or more fail with:

    Failed to parse model output: Failed to calculate g-eval score
    opik.exceptions.MetricComputationError: Failed to calculate g-eval score
    
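For context, here is a minimal sketch of the setup from steps 2–3. The dataset name, criteria text, task body, and field names are placeholders standing in for my actual project code, not something copied from the SDK:

    import os

    from opik import Opik
    from opik.evaluation import evaluate, models
    from opik.evaluation.metrics import GEval

    # Judge model as in step 1 (repeated so the sketch is self-contained).
    judge_model = models.LiteLLMChatModel(
        model_name="dashscope/qwen-flash",
        api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
    )

    # GEval metric using the DashScope Qwen judge (prompt text is illustrative).
    translation_quality = GEval(
        name="translation_quality",
        task_introduction="You are an expert judge of translation quality.",
        evaluation_criteria=(
            "Rate how faithfully OUTPUT translates INPUT, penalizing omissions, "
            "additions, and mistranslations."
        ),
        model=judge_model,
    )

    client = Opik()
    dataset = client.get_dataset(name="test dataset")  # 4 samples in the run below

    def evaluation_task(item: dict) -> dict:
        # Placeholder task: the real run calls the translation pipeline here.
        return {"output": f"INPUT: {item['source_text']}\nOUTPUT: {item['translated_text']}"}

    evaluate(
        dataset=dataset,
        task=evaluation_task,
        scoring_metrics=[translation_quality],
    )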

Error logs or stack trace

Evaluation:   0%|          | 0/4 [00:00<?, ?it/s]OPIK: Started logging traces to the "Default Project" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=019acb86-b6c9-7853-b089-3ac8b5318cd4&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.

Evaluation:  25%|██▌       | 1/4 [00:08<00:24,  8.02s/it]
Evaluation:  50%|█████     | 2/4 [00:09<00:07,  3.99s/it]
Evaluation:  75%|███████▌  | 3/4 [00:10<00:02,  2.79s/it]OPIK: Failed to parse model output: Failed to calculate g-eval score
Traceback (most recent call last):
  File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\parser.py", line 96, in parse_litellm_model_output
    raise exceptions.MetricComputationError(GEVAL_SCORE_CALC_FAILED)
opik.exceptions.MetricComputationError: Failed to calculate g-eval score
OPIK: Failed to call LLM provider, reason: Failed to calculate g-eval score
OPIK: Failed to compute metric translation_quality. Score result will be marked as failed.
Traceback (most recent call last):
  File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\parser.py", line 96, in parse_litellm_model_output
    raise exceptions.MetricComputationError(GEVAL_SCORE_CALC_FAILED)
opik.exceptions.MetricComputationError: Failed to calculate g-eval score

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "(my local project path)\opik\sdks\python\src\opik\evaluation\models\base_model.py", line 132, in get_provider_response
    yield model_provider.generate_provider_response(messages, **kwargs)
  File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\metric.py", line 231, in score
    return parser.parse_litellm_model_output(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\parser.py", line 109, in parse_litellm_model_output
    raise exceptions.MetricComputationError(GEVAL_SCORE_CALC_FAILED) from exception
opik.exceptions.MetricComputationError: Failed to calculate g-eval score

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "(my local project path)\opik\sdks\python\src\opik\evaluation\engine\metrics_evaluator.py", line 83, in _compute_metric_scores
    result = metric.score(**mapped_scoring_inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "(my local project path)\opik\sdks\python\src\opik\decorator\base_track_decorator.py", line 355, in wrapper
    raise func_exception
  File "(my local project path)\opik\sdks\python\src\opik\decorator\base_track_decorator.py", line 328, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\metric.py", line 226, in score
    with base_model.get_provider_response(
  File "C:\Program Files\Python311\Lib\contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "(my local project path)\opik\sdks\python\src\opik\evaluation\models\base_model.py", line 135, in get_provider_response
    raise exceptions.BaseLLMError(str(e))
opik.exceptions.BaseLLMError: LLM infrastructure error: Failed to calculate g-eval score
OPIK: No valid traces feedback scores to log from provided ones: []

Evaluation: 100%|██████████| 4/4 [00:10<00:00,  1.76s/it]
Evaluation: 100%|██████████| 4/4 [00:10<00:00,  2.68s/it]
┌─ Evaluation test dataset (4 samples) ────────┐
│                                              │
│ Total time:        00:00:10                  │
│ Number of samples: 4                         │
│                                              │
│ translation_quality: 0.2859 (avg) - 1 failed │
│                                              │
└──────────────────────────────────────────────┘
Uploading results to Opik ... 
View the results in your Opik dashboard.
translation_quality ScoreStatistics(mean=0.28594319453714584, max=0.6433721530377606, min=0.10000388221784111, values=[0.10000388221784111, 0.11445354835583574, 0.6433721530377606], std=0.30962686171261133)

Notes

From reading parse_litellm_model_output in sdks/python/src/opik/evaluation/metrics/llm_judges/g_eval/parser.py, it looks like the logprobs-aware
path assumes an OpenAI-style logprobs structure and a specific token position
for the numeric score; Qwen responses sometimes may not match this shape,
which would explain the intermittent failures.
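
To make the suspected failure mode concrete, the sketch below shows how a G-Eval style score is typically derived from an OpenAI-style logprobs payload. This illustrates the general technique, not Opik's actual parser; if a provider returns `logprobs` as null, nests it differently, or puts the numeric score at a different token position, code shaped like this raises and the metric computation fails:

    # Illustrative sketch of logprobs-weighted G-Eval scoring (not Opik internals).
    import math

    def weighted_score_from_logprobs(choice: dict) -> float:
        # OpenAI-style shape: choice["logprobs"]["content"] is a list of token
        # entries, each carrying "top_logprobs": [{"token": ..., "logprob": ...}, ...].
        token_entries = choice["logprobs"]["content"]  # TypeError/KeyError if absent or None
        # Assume the first generated token is the numeric score (e.g. "7").
        top = token_entries[0]["top_logprobs"]
        scores, weights = [], []
        for candidate in top:
            token = candidate["token"].strip()
            if token.isdigit():
                scores.append(int(token))
                weights.append(math.exp(candidate["logprob"]))
        if not weights:
            raise ValueError("no numeric tokens among top_logprobs candidates")
        # Probability-weighted average of the candidate scores.
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)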

As a local workaround I disabled the logprobs path for DashScope/Qwen, and the
evaluation then runs reliably.
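
For anyone hitting the same issue before a proper fix lands, one provider-agnostic alternative is to bypass the logprobs path entirely and score with a plain JSON-answering judge through Opik's custom metric API. The prompt, JSON shape, and class name below are mine (illustrative), not part of the SDK:

    # Hedged sketch of a logprobs-free judge metric built on Opik's custom metric API.
    import json

    from opik.evaluation.metrics import base_metric, score_result


    class SimpleTranslationJudge(base_metric.BaseMetric):
        def __init__(self, model, name: str = "translation_quality_no_logprobs"):
            super().__init__(name=name)
            self._model = model  # e.g. the LiteLLMChatModel judge from the repro

        def score(self, output: str, **ignored_kwargs) -> score_result.ScoreResult:
            prompt = (
                "Rate the translation quality of the text below on a scale from 0.0 to 1.0.\n"
                'Respond with JSON only: {"score": <float>, "reason": "<short reason>"}\n\n'
                f"{output}"
            )
            raw = self._model.generate_string(input=prompt)
            parsed = json.loads(raw)  # assumes the judge returns bare JSON
            return score_result.ScoreResult(
                name=self.name,
                value=float(parsed["score"]),
                reason=parsed.get("reason", ""),
            )

This avoids logprobs parsing entirely, at the cost of losing the probability-weighted score that G-Eval normally produces.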

The earlier top_logprobs clamping to 0–5 for DashScope Qwen was based on the
Qwen API reference: https://www.alibabacloud.com/help/en/model-studio/qwen-api-reference
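
For reference, that clamp amounts to something like the following (variable names are illustrative):

    # DashScope Qwen documents top_logprobs in the range 0-5, so a request tuned
    # for OpenAI (e.g. top_logprobs=20) has to be clamped before sending.
    requested_top_logprobs = 20
    top_logprobs = max(0, min(requested_top_logprobs, 5))  # -> 5 for DashScope Qwen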

Healthcheck results

No response
