Description
What component(s) are affected?
- Opik Python SDK
- Opik Typescript SDK
- Opik Agent Optimizer SDK
- Opik UI
- Opik Server
- Documentation
Opik version
- Opik version: 1.9.32
Describe the problem
Summary
When using GEval with LiteLLMChatModel and DashScope Qwen as the judge
model, some samples fail with MetricComputationError("Failed to calculate g-eval score").
This seems to happen when the logprobs-based scoring path is used.
Related: #4229
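For context, the "logprobs-based scoring path" refers to the G-Eval technique of deriving the final score from the token probabilities reported for the numeric score token, rather than from the sampled text alone (which is also why the per-sample scores in the logs below are fractional). A minimal sketch of that idea, assuming a plain dict of candidate score tokens to log probabilities and a 0-10 scoring scale; this is an illustration, not the actual Opik parser code:

```python
import math


def weighted_geval_score(top_logprobs: dict[str, float]) -> float:
    """Probability-weighted G-Eval score (illustration only, not Opik's parser).

    `top_logprobs` maps candidate score tokens (e.g. "0".."10") to log
    probabilities, as reported for the position where the judge emits its score.
    """
    weights: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        if token.strip().isdigit():
            score = int(token)
            weights[score] = weights.get(score, 0.0) + math.exp(logprob)

    total = sum(weights.values())
    if total == 0:
        # Nothing numeric at the expected position -> the logprobs path has
        # nothing to work with, which is roughly the failure mode reported here.
        raise ValueError("no numeric score candidates in top_logprobs")

    # Weighted average, normalized to 0-1 assuming a 0-10 scoring scale.
    return sum(s * p for s, p in weights.items()) / total / 10
```

If a provider's response does not expose the score token's top_logprobs where the parser looks for them, this computation has nothing to work with and the sample cannot be scored on this path.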
Reproduction steps and code snippets
Steps to reproduce
- Create a LiteLLMChatModel:

  ```python
  judge_model = models.LiteLLMChatModel(
      model_name="dashscope/qwen-flash",
      api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
      api_key=os.getenv("DASHSCOPE_API_KEY"),
  )
  ```

- Use it in a GEval metric (for example translation_quality) and run an evaluation on multiple samples (a fuller sketch follows after these steps).
- Observe that some samples succeed, but one or more fail with:

  Failed to parse model output: Failed to calculate g-eval score
  opik.exceptions.MetricComputationError: Failed to calculate g-eval score
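For completeness, a minimal end-to-end sketch of the setup described above. The GEval arguments and the evaluate call follow the Opik Python SDK docs, but the prompt text, dataset name, and sample data are placeholders rather than the exact code that produced the logs below:

```python
import os

import opik
from opik.evaluation import evaluate, models
from opik.evaluation.metrics import GEval

# Judge model from the step above.
judge_model = models.LiteLLMChatModel(
    model_name="dashscope/qwen-flash",
    api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
)

# Illustrative prompt text; the real prompts were longer, but their wording
# is not relevant to the failure.
translation_quality = GEval(
    name="translation_quality",
    task_introduction="You are an expert judge of translation quality.",
    evaluation_criteria="Rate how faithfully OUTPUT translates INPUT.",
    model=judge_model,
)

client = opik.Opik()
dataset = client.get_or_create_dataset(name="test dataset")
dataset.insert(
    # 4 placeholder items, matching the sample count in the logs below.
    [{"input": f"source sentence {i}", "reference": f"reference {i}"} for i in range(4)]
)


def task(item: dict) -> dict:
    # Placeholder application under test; GEval reads the "output" field.
    return {"output": f"candidate translation of: {item['input']}"}


evaluate(dataset=dataset, task=task, scoring_metrics=[translation_quality])
```

With four items, the run intermittently fails on one or more samples, as in the logs below.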
Error logs or stack trace
Evaluation: 0%| | 0/4 [00:00<?, ?it/s]OPIK: Started logging traces to the "Default Project" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=019acb86-b6c9-7853-b089-3ac8b5318cd4&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.
Evaluation: 25%|██▌ | 1/4 [00:08<00:24, 8.02s/it]
Evaluation: 50%|█████ | 2/4 [00:09<00:07, 3.99s/it]
Evaluation: 75%|███████▌ | 3/4 [00:10<00:02, 2.79s/it]OPIK: Failed to parse model output: Failed to calculate g-eval score
Traceback (most recent call last):
File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\parser.py", line 96, in parse_litellm_model_output
raise exceptions.MetricComputationError(GEVAL_SCORE_CALC_FAILED)
opik.exceptions.MetricComputationError: Failed to calculate g-eval score
OPIK: Failed to call LLM provider, reason: Failed to calculate g-eval score
OPIK: Failed to compute metric translation_quality. Score result will be marked as failed.
Traceback (most recent call last):
File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\parser.py", line 96, in parse_litellm_model_output
raise exceptions.MetricComputationError(GEVAL_SCORE_CALC_FAILED)
opik.exceptions.MetricComputationError: Failed to calculate g-eval score
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "(my local project path)\opik\sdks\python\src\opik\evaluation\models\base_model.py", line 132, in get_provider_response
yield model_provider.generate_provider_response(messages, **kwargs)
File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\metric.py", line 231, in score
return parser.parse_litellm_model_output(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\parser.py", line 109, in parse_litellm_model_output
raise exceptions.MetricComputationError(GEVAL_SCORE_CALC_FAILED) from exception
opik.exceptions.MetricComputationError: Failed to calculate g-eval score
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "(my local project path)\opik\sdks\python\src\opik\evaluation\engine\metrics_evaluator.py", line 83, in _compute_metric_scores
result = metric.score(**mapped_scoring_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "(my local project path)\opik\sdks\python\src\opik\decorator\base_track_decorator.py", line 355, in wrapper
raise func_exception
File "(my local project path)\opik\sdks\python\src\opik\decorator\base_track_decorator.py", line 328, in wrapper
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "(my local project path)\opik\sdks\python\src\opik\evaluation\metrics\llm_judges\g_eval\metric.py", line 226, in score
with base_model.get_provider_response(
File "C:\Program Files\Python311\Lib\contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "(my local project path)\opik\sdks\python\src\opik\evaluation\models\base_model.py", line 135, in get_provider_response
raise exceptions.BaseLLMError(str(e))
opik.exceptions.BaseLLMError: LLM infrastructure error: Failed to calculate g-eval score
OPIK: No valid traces feedback scores to log from provided ones: []
Evaluation: 100%|██████████| 4/4 [00:10<00:00, 1.76s/it]
Evaluation: 100%|██████████| 4/4 [00:10<00:00, 2.68s/it]
┌─ Evaluation test dataset (4 samples) ────────┐
│ │
│ Total time: 00:00:10 │
│ Number of samples: 4 │
│ │
│ translation_quality: 0.2859 (avg) - 1 failed │
│ │
└──────────────────────────────────────────────┘
Uploading results to Opik ...
View the results in your Opik dashboard.
translation_quality ScoreStatistics(mean=0.28594319453714584, max=0.6433721530377606, min=0.10000388221784111, values=[0.10000388221784111, 0.11445354835583574, 0.6433721530377606], std=0.30962686171261133)
Notes
From reading parse_litellm_model_output in sdks/python/src/opik/evaluation/metrics/llm_judges/g_eval/parser.py, it looks like the logprobs-aware path assumes an OpenAI-style logprobs structure and a specific token position for the numeric score. Qwen responses may not always match this shape, which might explain the intermittent failures.
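If that is the case, a defensive fallback in the parser could keep such samples scoreable. The sketch below is hypothetical (parse_score_with_fallback is not an existing function, and the attributes assume an OpenAI-style litellm response object): it uses the logprobs-weighted score when the expected shape is present and otherwise falls back to the literal score in the message text:

```python
import math
import re


def parse_score_with_fallback(response) -> float:
    """Hypothetical sketch, not the existing parser: prefer the logprobs-weighted
    score, but fall back to the literal score in the message text when the
    logprobs payload is missing or does not follow the OpenAI-style shape
    (choices[0].logprobs.content[i].top_logprobs)."""
    choice = response.choices[0]
    logprobs = getattr(choice, "logprobs", None)
    token_infos = getattr(logprobs, "content", None) or []

    for token_info in token_infos:
        if not (token_info.token or "").strip().isdigit():
            continue
        # Numeric score token with alternatives: weight candidates by probability.
        weights: dict[int, float] = {}
        for alt in getattr(token_info, "top_logprobs", None) or []:
            if alt.token.strip().isdigit():
                score = int(alt.token)
                weights[score] = weights.get(score, 0.0) + math.exp(alt.logprob)
        if weights:
            total = sum(weights.values())
            return sum(s * p for s, p in weights.items()) / total

    # Fallback: no usable logprobs shape -> take the score from the text itself.
    match = re.search(r'"?score"?\s*[:=]\s*(\d+(?:\.\d+)?)', choice.message.content)
    if match is None:
        raise ValueError("no score found in model output")
    return float(match.group(1))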
As a local workaround I disabled the logprobs path for DashScope/Qwen, and the
evaluation then runs reliably.
The earlier top_logprobs clamping to 0–5 for DashScope Qwen was based on the
Qwen API reference: https://www.alibabacloud.com/help/en/model-studio/qwen-api-reference
Healthcheck results
No response