Skip to content

fix(hf): pass return_dict=True to apply_chat_template#1253

Merged
ajbozarth merged 1 commit into
generative-computing:mainfrom
ajbozarth:fix/hf-input-ids-tensor-indexing
Jun 11, 2026
Merged

fix(hf): pass return_dict=True to apply_chat_template#1253
ajbozarth merged 1 commit into
generative-computing:mainfrom
ajbozarth:fix/hf-input-ids-tensor-indexing

Conversation

@ajbozarth

Copy link
Copy Markdown
Contributor

Issue

Fixes #

Description

Fixes a latent bug introduced in #418 (transformers 5 bump) where LocalHFBackend.generate_from_context indexes a raw torch.Tensor with a string key.

apply_chat_template(..., return_tensors="pt") returns a 2-D torch.Tensor, not a dict. The standard-generation call site at huggingface.py:1119-1120 then does:

inputs=input_ids["input_ids"],
attention_mask=input_ids["attention_mask"],

Per the warning at pytorch/csrc/autograd/python_variable_indexing.cpp:353, in torch ≥ 2.9 this expression is interpreted as tensor[torch.tensor(seq)] — fancy indexing with the codepoints of "input_ids" against a 2-D tensor — which raises:

IndexError: too many indices for tensor of dimension 2
mellea/backends/huggingface.py:1119: IndexError

Reproduces locally on macOS Apple Silicon (torch 2.11.0, both CPU and MPS).

The fix is one line: pass return_dict=True to apply_chat_template so the dict access (and the .to(self._device) chain) operate on a BatchEncoding.

Why return_dict=True and not "pass the bare tensor"

Two producers feed the local input_ids variable into the same downstream processing() and post_processing() consumers:

  1. The merged-cache helper (_compute_input_ids_with_kv_cache_smashing, line 831) — returns a bare torch.Tensor.
  2. The standard-generation site (line 1054) — apply_chat_template(...) — should return a BatchEncoding.

The consumers already handle both shapes via isinstance(input_ids, torch.Tensor) ternaries at lines 1200, 1273, 1305 (Tensor branch for the merged-cache producer, else input_ids["input_ids"] for the dict producer). The dict branch is only reachable if the standard-generation producer returns a dict, so adding return_dict=True matches what consumers were coded for. The alternative would also require deleting all three dict branches and synthesizing an attention_mask.

Why this slipped through

  • PR CI doesn't catch it. test_telemetry/test_metrics_backend.py::test_huggingface_token_metrics_integration is gated by @require_gpu(min_vram_gb=8) (test/predicates.py:91). GitHub Actions runners have neither CUDA nor MPS, so the test is skipped on every PR run.

  • GPU nightly runs do not fail on it. Confirmed against the most recent main-branch nightly log (Linux + CUDA, same torch==2.11.0 per uv.lock): both parametrizations of the test ran end-to-end (the transformers max_length warning fires from inside model.generate(), past the broken site) and neither appears in the failure list. The pytorch deprecation warning that fires every run locally is completely absent from the cluster log.

    Same mellea code, same indexing expression, same torch version string, same Python, same filterwarnings config — but different platform wheels (uv.lock ships ~530MB Linux x86_64 with CUDA vs. ~80MB macOS ARM64; distinct compiled binaries). Canonical pytorch source for treatSequenceAsTuple is platform-agnostic and should warn on every platform; the cluster wheel evidently diverges from that. Mechanism uninvestigated — not blocking.

What didn't change

  • Public API, signatures, behavior: unchanged.
  • Merged-cache path: still returns a raw Tensor; the three downstream ternaries continue to handle both producers.
  • Other apply_chat_template call sites: only line 1054 was buggy. The KV-cache call at line 774 uses tokenize=False (returns a string) and the alora batch path at 1511-1550 already operates on a dict-shaped result.

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Locally on Apple Silicon (torch 2.11.0, transformers 5.x, MPS):

$ uv run pytest test/telemetry/test_metrics_backend.py::test_huggingface_token_metrics_integration -v
test_huggingface_token_metrics_integration[non-streaming] PASSED
test_huggingface_token_metrics_integration[streaming]     PASSED

The python_variable_indexing.cpp:353 warning is gone. No new test added — test_huggingface_token_metrics_integration already exercises this path; it's just GPU-gated. Will validate on the GPU path post-merge.

Attribution

  • AI coding assistants used

Adding a new component, requirement, sampling strategy, or tool?

If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.

  • Component
  • Requirement
  • Sampling Strategy
  • Tool

NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.

apply_chat_template(return_tensors="pt") returns a 2-D torch.Tensor,
not a dict. The standard-generation call site at huggingface.py:1119-
1120 then indexes it with string keys ("input_ids", "attention_mask"),
which on torch >= 2.9 is interpreted as fancy indexing with the
codepoints of the string and raises IndexError on macOS (CPU and MPS).

Adding return_dict=True makes apply_chat_template return a
BatchEncoding, so dict access works as the surrounding code already
expects. The downstream isinstance(input_ids, torch.Tensor) ternaries
in processing()/post_processing() were already coded to handle both
the dict producer and the bare-Tensor producer from the merged-cache
path, so they continue to work for both shapes.

Latent since generative-computing#418 (transformers 5 bump). Not caught by PR CI because
the integration test is gated by @require_gpu, and not caught by the
GPU nightly because the Linux CUDA wheel of torch 2.11.0 doesn't
trigger the deprecated indexing path that fails locally on macOS.

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth requested a review from a team as a code owner June 10, 2026 21:55
@github-actions github-actions Bot added the bug Something isn't working label Jun 10, 2026
@ajbozarth ajbozarth self-assigned this Jun 10, 2026
@ajbozarth ajbozarth requested a review from nrfulton June 10, 2026 21:55
@ajbozarth ajbozarth requested a review from avinash2692 June 11, 2026 15:07

@avinash2692 avinash2692 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajbozarth : LGTM and thanks for catching this, seems like I missed this in my migration. I'll also try and add a test in a followup PR to catch this leak in the future.

@ajbozarth ajbozarth added this pull request to the merge queue Jun 11, 2026
Merged via the queue into generative-computing:main with commit b9b5315 Jun 11, 2026
10 checks passed
@ajbozarth ajbozarth deleted the fix/hf-input-ids-tensor-indexing branch June 11, 2026 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants