Skip to content

feat(vllm_model): chunk long single-clip audio in chat_completions#1358

Open
gwarmstrong wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
gwarmstrong:georgea/migrate-gym-audio-chunking
Open

feat(vllm_model): chunk long single-clip audio in chat_completions#1358
gwarmstrong wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
gwarmstrong:georgea/migrate-gym-audio-chunking

Conversation

@gwarmstrong
Copy link
Copy Markdown
Contributor

feat(vllm_model): chunk long single-clip audio in chat_completions

Audio rows whose duration exceeds the model's input limit (e.g. Qwen2-Audio at 30s) currently fail outright in vllm_model — the audio is inlined whole on metadata.audio_data / metadata.audio_path and the request bounces. This adds an opt-in chunking pipeline that splits long single-clip audio into fixed-duration windows, runs each window as its own chat completion through the existing preprocess → vLLM → reasoning-parser path, then space-joins the textual outputs and sums the usage tokens. Mirrors the chunking already shipping in NeMo Skills' VLLMMultimodalModel, so audio benchmarks ported from Skills retain their long-audio behaviour on Gym.

Config

Three new fields on VLLMModelConfig:

  • enable_audio_chunking: bool = True — master switch. Chunking only fires if it's also possible to chunk the inbound audio (see below); short audio is a no-op regardless.
  • chunk_audio_threshold_sec: float = 30.0 — both the trigger threshold (audio under this duration is passed through whole) AND the per-chunk window size.
  • min_audio_chunk_duration_sec: float = 0.5 — the trailing chunk is merged into the previous chunk if it falls below this duration. Avoids emitting near-empty audio that some audio models reject.

When chunking does fire and return_token_id_information: true is also set, the request raises rather than producing a malformed training row — per-chunk prompts have different token streams that can't be meaningfully aggregated. Disable chunking or constrain audio length if you need exact token IDs.

Behaviour

Chunking only owns the single-clip audio sources:

  • metadata.audio_data — decoded from the data:audio/<fmt>;base64,... URI.
  • metadata.audio_path — resolved through the existing audio_root rules.

metadata.audio_paths (multi-clip) is left alone — chunking semantics across multiple clips are ambiguous and no current benchmark exercises long multi-clip audio. Existing splice behaviour for audio_paths is unchanged.

Each chunk runs through _run_single_chat_completion (the existing chat_completions body, extracted as a helper) with its metadata.audio_data replaced by a fresh data:audio/wav;base64,... URI for that window. The aggregator then:

  • Space-joins per-chunk choices[0].message.content after .strip() (matches Skills).
  • Sums usage.prompt_tokens / completion_tokens / total_tokens across chunks.
  • Propagates finish_reason="length" if any chunk truncated — downstream code (verifier, training data filter) should treat aggregated output as incomplete when any window hit a hard limit.

Parity with NeMo Skills

The Skills implementation of audio chunking lives in nemo_skills/inference/model/{audio_utils.py, vllm_multimodal.py} and is the reference for this port. Two layers were verified before this PR:

  1. Chunk function parity. Gym's chunk_audio_array and Skills' chunk_audio produce identical output across 40 cases (9 durations × 2 thresholds × 2 minimums, plus 4 explicit tail-merge edge cases including the boundary at exactly min_chunk_duration_sec).
  2. Aggregation parity. Skills' VLLMMultimodalModel._generate_with_chunking and Gym's VLLMModel._run_chunked_chat_completion produce identical aggregated text on the same 5-second WAV + same per-chunk LLM outputs across 3 cases (clean text, whitespace-padded chunks, whitespace-only middle chunk — verifies both sides .strip() per chunk and that the whitespace-only artifact is preserved identically).

Audio rows whose duration exceeds the model's input limit (e.g. Qwen2-Audio
at 30s) currently fail outright in the vllm_model server — the audio is
inlined whole on metadata.audio_data / audio_path and the request bounces.
Mirror NeMo Skills' VLLMMultimodalModel: split long audio into fixed-duration
windows, run each window as its own chat completion, space-join the textual
outputs and sum the usage tokens. Multi-clip audio_paths is intentionally
left alone — chunking semantics across clips are ambiguous.

Opt-in via three new VLLMModelConfig fields (defaulted on, threshold 30s).
return_token_id_information=True with a long-audio row raises rather than
silently producing a malformed training row, since concatenating per-chunk
token streams loses the alignment trainers rely on.

Pure-function chunker (numpy) is byte-identical with Skills' chunk_audio
across the threshold-boundary and tail-merge edges; aggregation matches
Skills' _generate_with_chunking text output on clean, whitespace-padded,
and whitespace-only-middle cases.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
@gwarmstrong gwarmstrong requested a review from a team as a code owner May 18, 2026 21:04
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant