fix(hf): pass return_dict=True to apply_chat_template by ajbozarth · Pull Request #1253 · generative-computing/mellea

ajbozarth · 2026-06-10T21:55:09Z

Issue

Fixes #

Description

Fixes a latent bug introduced in #418 (transformers 5 bump) where LocalHFBackend.generate_from_context indexes a raw torch.Tensor with a string key.

apply_chat_template(..., return_tensors="pt") returns a 2-D torch.Tensor, not a dict. The standard-generation call site at huggingface.py:1119-1120 then does:

inputs=input_ids["input_ids"],
attention_mask=input_ids["attention_mask"],

Per the warning at pytorch/csrc/autograd/python_variable_indexing.cpp:353, in torch ≥ 2.9 this expression is interpreted as tensor[torch.tensor(seq)] — fancy indexing with the codepoints of "input_ids" against a 2-D tensor — which raises:

IndexError: too many indices for tensor of dimension 2
mellea/backends/huggingface.py:1119: IndexError

Reproduces locally on macOS Apple Silicon (torch 2.11.0, both CPU and MPS).

The fix is one line: pass return_dict=True to apply_chat_template so the dict access (and the .to(self._device) chain) operate on a BatchEncoding.

Why `return_dict=True` and not "pass the bare tensor"

Two producers feed the local input_ids variable into the same downstream processing() and post_processing() consumers:

The merged-cache helper (_compute_input_ids_with_kv_cache_smashing, line 831) — returns a bare torch.Tensor.
The standard-generation site (line 1054) — apply_chat_template(...) — should return a BatchEncoding.

The consumers already handle both shapes via isinstance(input_ids, torch.Tensor) ternaries at lines 1200, 1273, 1305 (Tensor branch for the merged-cache producer, else input_ids["input_ids"] for the dict producer). The dict branch is only reachable if the standard-generation producer returns a dict, so adding return_dict=True matches what consumers were coded for. The alternative would also require deleting all three dict branches and synthesizing an attention_mask.

Why this slipped through

PR CI doesn't catch it. test_telemetry/test_metrics_backend.py::test_huggingface_token_metrics_integration is gated by @require_gpu(min_vram_gb=8) (test/predicates.py:91). GitHub Actions runners have neither CUDA nor MPS, so the test is skipped on every PR run.
GPU nightly runs do not fail on it. Confirmed against the most recent main-branch nightly log (Linux + CUDA, same torch==2.11.0 per uv.lock): both parametrizations of the test ran end-to-end (the transformers max_length warning fires from inside model.generate(), past the broken site) and neither appears in the failure list. The pytorch deprecation warning that fires every run locally is completely absent from the cluster log.

Same mellea code, same indexing expression, same torch version string, same Python, same filterwarnings config — but different platform wheels (uv.lock ships ~530MB Linux x86_64 with CUDA vs. ~80MB macOS ARM64; distinct compiled binaries). Canonical pytorch source for treatSequenceAsTuple is platform-agnostic and should warn on every platform; the cluster wheel evidently diverges from that. Mechanism uninvestigated — not blocking.

What didn't change

Public API, signatures, behavior: unchanged.
Merged-cache path: still returns a raw Tensor; the three downstream ternaries continue to handle both producers.
Other apply_chat_template call sites: only line 1054 was buggy. The KV-cache call at line 774 uses tokenize=False (returns a string) and the alora batch path at 1511-1550 already operates on a dict-shaped result.

Testing

Tests added to the respective file if code was changed
New code has 100% coverage if code was added
Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Locally on Apple Silicon (torch 2.11.0, transformers 5.x, MPS):

$ uv run pytest test/telemetry/test_metrics_backend.py::test_huggingface_token_metrics_integration -v
test_huggingface_token_metrics_integration[non-streaming] PASSED
test_huggingface_token_metrics_integration[streaming]     PASSED

The python_variable_indexing.cpp:353 warning is gone. No new test added — test_huggingface_token_metrics_integration already exercises this path; it's just GPU-gated. Will validate on the GPU path post-merge.

Attribution

AI coding assistants used

Adding a new component, requirement, sampling strategy, or tool?

If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.

Component
Requirement
Sampling Strategy
Tool

NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.

apply_chat_template(return_tensors="pt") returns a 2-D torch.Tensor, not a dict. The standard-generation call site at huggingface.py:1119- 1120 then indexes it with string keys ("input_ids", "attention_mask"), which on torch >= 2.9 is interpreted as fancy indexing with the codepoints of the string and raises IndexError on macOS (CPU and MPS). Adding return_dict=True makes apply_chat_template return a BatchEncoding, so dict access works as the surrounding code already expects. The downstream isinstance(input_ids, torch.Tensor) ternaries in processing()/post_processing() were already coded to handle both the dict producer and the bare-Tensor producer from the merged-cache path, so they continue to work for both shapes. Latent since generative-computing#418 (transformers 5 bump). Not caught by PR CI because the integration test is gated by @require_gpu, and not caught by the GPU nightly because the Linux CUDA wheel of torch 2.11.0 doesn't trigger the deprecated indexing path that fails locally on macOS. Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>

avinash2692

@ajbozarth : LGTM and thanks for catching this, seems like I missed this in my migration. I'll also try and add a test in a followup PR to catch this leak in the future.

ajbozarth requested a review from a team as a code owner June 10, 2026 21:55

ajbozarth requested review from AngeloDanducci, markstur and planetf1 June 10, 2026 21:55

github-actions Bot added the bug Something isn't working label Jun 10, 2026

ajbozarth self-assigned this Jun 10, 2026

ajbozarth requested a review from nrfulton June 10, 2026 21:55

ajbozarth mentioned this pull request Jun 10, 2026

refactor(telemetry): align metrics/pricing init with tracing pattern #1252

Open

8 tasks

ajbozarth requested a review from avinash2692 June 11, 2026 15:07

avinash2692 approved these changes Jun 11, 2026

View reviewed changes

ajbozarth added this pull request to the merge queue Jun 11, 2026

Merged via the queue into generative-computing:main with commit b9b5315 Jun 11, 2026
10 checks passed

ajbozarth deleted the fix/hf-input-ids-tensor-indexing branch June 11, 2026 15:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(hf): pass return_dict=True to apply_chat_template#1253

fix(hf): pass return_dict=True to apply_chat_template#1253
ajbozarth merged 1 commit into
generative-computing:mainfrom
ajbozarth:fix/hf-input-ids-tensor-indexing

ajbozarth commented Jun 10, 2026

Uh oh!

avinash2692 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ajbozarth commented Jun 10, 2026

Issue

Description

Why return_dict=True and not "pass the bare tensor"

Why this slipped through

What didn't change

Testing

Attribution

Adding a new component, requirement, sampling strategy, or tool?

Uh oh!

avinash2692 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Why `return_dict=True` and not "pass the bare tensor"