feat(vllm_model): chunk long single-clip audio in chat_completions by gwarmstrong · Pull Request #1358 · NVIDIA-NeMo/Gym

gwarmstrong · 2026-05-18T21:04:22Z

feat(vllm_model): chunk long single-clip audio in chat_completions

Audio rows whose duration exceeds the model's input limit (e.g. Qwen2-Audio at 30s) currently fail outright in vllm_model: the audio is inlined whole on metadata.audio_data / metadata.audio_path and the request bounces. This adds a config-gated chunking pipeline (on by default) that splits long single-clip audio into fixed-duration windows, runs each window as its own chat completion through the existing preprocess → vLLM → reasoning-parser path, then space-joins the textual outputs and sums the usage tokens. It mirrors the chunking already shipping in NeMo Skills' VLLMMultimodalModel, so audio benchmarks ported from Skills retain their long-audio behaviour on Gym.

Config

Three new fields on VLLMModelConfig:

enable_audio_chunking: bool = True. Master switch. Chunking fires only when the inbound audio is also a chunkable single-clip source (see Behaviour); audio shorter than the threshold passes through unchanged regardless.
chunk_audio_threshold_sec: float = 30.0. Serves as both the trigger threshold (audio under this duration is passed through whole) and the per-chunk window size.
min_audio_chunk_duration_sec: float = 0.5. A trailing chunk shorter than this duration is merged into the previous chunk, which avoids emitting near-empty audio that some audio models reject.
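
The interaction between the window size and the tail-merge minimum can be sketched as follows. This is a minimal illustration, not the PR's chunk_audio_array: the name chunk_bounds is hypothetical, it returns sample-index ranges rather than audio arrays, and it assumes a tail is merged only when it is strictly shorter than the minimum.

```python
def chunk_bounds(num_samples, sample_rate, chunk_sec=30.0, min_chunk_sec=0.5):
    """Return (start, end) sample ranges for fixed-duration windows.

    Hypothetical helper for illustration only. A trailing window strictly
    shorter than min_chunk_sec is folded into the previous window.
    """
    chunk_len = int(chunk_sec * sample_rate)
    if chunk_len <= 0:
        raise ValueError("chunk_sec * sample_rate must be at least one sample")
    min_len = min_chunk_sec * sample_rate

    bounds = []
    for start in range(0, num_samples, chunk_len):
        end = min(start + chunk_len, num_samples)
        if bounds and (end - start) < min_len:
            # Tail is too short: extend the previous window to cover it.
            bounds[-1] = (bounds[-1][0], end)
        else:
            bounds.append((start, end))
    return bounds


# 60.25 s at 16 kHz: the 0.25 s tail is merged into the second window.
print(chunk_bounds(964_000, 16_000))
```

With the defaults, a 65 s clip yields three windows (30 s, 30 s, 5 s), while a 60.25 s clip yields two because the 0.25 s remainder falls below the 0.5 s minimum.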

When chunking does fire and return_token_id_information: true is also set, the request raises rather than producing a malformed training row — per-chunk prompts have different token streams that can't be meaningfully aggregated. Disable chunking or constrain audio length if you need exact token IDs.
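
A guard of this kind might look like the sketch below. The function name ensure_token_ids_compatible and the choice of ValueError are assumptions for illustration; the PR text only states that the request raises.

```python
def ensure_token_ids_compatible(will_chunk, return_token_id_information):
    """Refuse to chunk when exact token IDs were requested.

    Hypothetical helper: the exception type and message are assumptions.
    """
    if will_chunk and return_token_id_information:
        raise ValueError(
            "Audio chunking cannot be combined with return_token_id_information: "
            "per-chunk token streams cannot be meaningfully aggregated. "
            "Disable chunking or constrain audio length."
        )


# Short audio (no chunking) is unaffected even when token IDs are requested.
ensure_token_ids_compatible(will_chunk=False, return_token_id_information=True)
```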

Behaviour

Chunking only owns the single-clip audio sources:

metadata.audio_data — decoded from the data:audio/<fmt>;base64,... URI.
metadata.audio_path — resolved through the existing audio_root rules.
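
For reference, decoding the data:audio/<fmt>;base64,... form can be sketched with the standard library alone. The helper name decode_audio_data_uri is hypothetical, and the strict prefix and suffix checks are an assumption about what input is accepted, not a description of the PR's parser.

```python
import base64


def decode_audio_data_uri(uri):
    """Split a data:audio/<fmt>;base64,<payload> URI into (fmt, raw_bytes).

    Hypothetical helper for illustration only.
    """
    header, sep, payload = uri.partition(",")
    prefix, suffix = "data:audio/", ";base64"
    if not sep or not header.startswith(prefix) or not header.endswith(suffix):
        raise ValueError("expected data:audio/<fmt>;base64,<payload>")
    fmt = header[len(prefix):-len(suffix)]
    return fmt, base64.b64decode(payload)


uri = "data:audio/wav;base64," + base64.b64encode(b"RIFF").decode("ascii")
print(decode_audio_data_uri(uri))
```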

metadata.audio_paths (multi-clip) is left alone — chunking semantics across multiple clips are ambiguous and no current benchmark exercises long multi-clip audio. Existing splice behaviour for audio_paths is unchanged.

Each chunk runs through _run_single_chat_completion (the existing chat_completions body, extracted as a helper) with its metadata.audio_data replaced by a fresh data:audio/wav;base64,... URI for that window. The aggregator then:
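
Building the per-window URI could look roughly like the sketch below, which uses only the standard library. It assumes mono 16-bit PCM samples; the function name pcm16_to_wav_data_uri is hypothetical, and the PR's actual encoder (which operates on numpy arrays) may differ.

```python
import base64
import io
import struct
import wave


def pcm16_to_wav_data_uri(samples, sample_rate):
    """Encode mono int16 samples as a data:audio/wav;base64,... URI.

    Hypothetical helper for illustration; assumes values fit in int16.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    encoded = base64.b64encode(buf.getvalue()).decode("ascii")
    return "data:audio/wav;base64," + encoded


window_uri = pcm16_to_wav_data_uri([0, 1000, -1000, 32767], 16_000)
print(window_uri[:22])
```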

Space-joins per-chunk choices[0].message.content after .strip() (matches Skills).
Sums usage.prompt_tokens / completion_tokens / total_tokens across chunks.
Propagates finish_reason="length" if any chunk truncated — downstream code (verifier, training data filter) should treat aggregated output as incomplete when any window hit a hard limit.

Parity with NeMo Skills

The Skills implementation of audio chunking lives in nemo_skills/inference/model/{audio_utils.py, vllm_multimodal.py} and is the reference for this port. Two layers were verified before this PR:

Chunk function parity. Gym's chunk_audio_array and Skills' chunk_audio produce identical output across 40 cases (9 durations × 2 thresholds × 2 minimums, plus 4 explicit tail-merge edge cases including the boundary at exactly min_chunk_duration_sec).
Aggregation parity. Skills' VLLMMultimodalModel._generate_with_chunking and Gym's VLLMModel._run_chunked_chat_completion produce identical aggregated text on the same 5-second WAV + same per-chunk LLM outputs across 3 cases (clean text, whitespace-padded chunks, whitespace-only middle chunk — verifies both sides .strip() per chunk and that the whitespace-only artifact is preserved identically).

Audio rows whose duration exceeds the model's input limit (e.g. Qwen2-Audio at 30s) currently fail outright in the vllm_model server: the audio is inlined whole on metadata.audio_data / audio_path and the request bounces.

Mirror NeMo Skills' VLLMMultimodalModel: split long audio into fixed-duration windows, run each window as its own chat completion, space-join the textual outputs and sum the usage tokens. Multi-clip audio_paths is intentionally left alone because chunking semantics across clips are ambiguous.

Controlled by three new VLLMModelConfig fields (defaulted on, threshold 30s). return_token_id_information=True with a long-audio row raises rather than silently producing a malformed training row, since concatenating per-chunk token streams loses the alignment trainers rely on.

Pure-function chunker (numpy) is byte-identical with Skills' chunk_audio across the threshold-boundary and tail-merge edges; aggregation matches Skills' _generate_with_chunking text output on clean, whitespace-padded, and whitespace-only-middle cases.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

copy-pr-bot · 2026-05-18T21:04:26Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

gwarmstrong requested a review from a team as a code owner May 18, 2026 21:04

gwarmstrong wants to merge 1 commit into NVIDIA-NeMo:main from gwarmstrong:georgea/migrate-gym-audio-chunking
