feat(vllm_model): chunk long single-clip audio in chat_completions #1358
Open
gwarmstrong wants to merge 1 commit into
Audio rows whose duration exceeds the model's input limit (e.g. Qwen2-Audio at 30s) currently fail outright in the vllm_model server: the audio is inlined whole on metadata.audio_data / audio_path and the request bounces.

Mirror NeMo Skills' VLLMMultimodalModel: split long audio into fixed-duration windows, run each window as its own chat completion, space-join the textual outputs and sum the usage tokens. Multi-clip audio_paths is intentionally left alone because chunking semantics across clips are ambiguous.

Configurable via three new VLLMModelConfig fields (enabled by default, threshold 30s). return_token_id_information=True with a long-audio row raises rather than silently producing a malformed training row, since concatenating per-chunk token streams loses the alignment trainers rely on.

Pure-function chunker (numpy) is byte-identical with Skills' chunk_audio across the threshold-boundary and tail-merge edges; aggregation matches Skills' _generate_with_chunking text output on clean, whitespace-padded, and whitespace-only-middle cases.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
feat(vllm_model): chunk long single-clip audio in chat_completions
Audio rows whose duration exceeds the model's input limit (e.g. Qwen2-Audio at 30s) currently fail outright in vllm_model: the audio is inlined whole on metadata.audio_data / metadata.audio_path and the request bounces. This adds a configurable chunking pipeline (on by default) that splits long single-clip audio into fixed-duration windows, runs each window as its own chat completion through the existing preprocess → vLLM → reasoning-parser path, then space-joins the textual outputs and sums the usage tokens. It mirrors the chunking already shipping in NeMo Skills' VLLMMultimodalModel, so audio benchmarks ported from Skills retain their long-audio behaviour on Gym.

Config
Three new fields on VLLMModelConfig:

- enable_audio_chunking: bool = True (master switch). Chunking only fires if the inbound audio can actually be chunked (see below); short audio is a no-op regardless.
- chunk_audio_threshold_sec: float = 30.0. Serves as both the trigger threshold (audio under this duration is passed through whole) and the per-chunk window size.
- min_audio_chunk_duration_sec: float = 0.5. The trailing chunk is merged into the previous chunk if it falls below this duration, which avoids emitting near-empty audio that some audio models reject.

When chunking does fire and return_token_id_information: true is also set, the request raises rather than producing a malformed training row, because per-chunk prompts have different token streams that can't be meaningfully aggregated. Disable chunking or constrain audio length if you need exact token IDs.

Behaviour
Chunking only owns the single-clip audio sources:

- metadata.audio_data: decoded from the data:audio/<fmt>;base64,... URI.
- metadata.audio_path: resolved through the existing audio_root rules.

metadata.audio_paths (multi-clip) is left alone: chunking semantics across multiple clips are ambiguous and no current benchmark exercises long multi-clip audio. Existing splice behaviour for audio_paths is unchanged.

Each chunk runs through _run_single_chat_completion (the existing chat_completions body, extracted as a helper) with its metadata.audio_data replaced by a fresh data:audio/wav;base64,... URI for that window. The aggregator then:

- Space-joins each chunk's choices[0].message.content after .strip() (matches Skills).
- Sums usage.prompt_tokens / completion_tokens / total_tokens across chunks.
- Sets finish_reason="length" if any chunk truncated. Downstream code (verifier, training data filter) should treat aggregated output as incomplete when any window hit a hard limit.

Parity with NeMo Skills
The Skills implementation of audio chunking lives in nemo_skills/inference/model/{audio_utils.py, vllm_multimodal.py} and is the reference for this port. Two layers were verified before this PR:

- Gym's chunk_audio_array and Skills' chunk_audio produce identical output across 40 cases (9 durations × 2 thresholds × 2 minimums, plus 4 explicit tail-merge edge cases including the boundary at exactly min_chunk_duration_sec).
- Skills' VLLMMultimodalModel._generate_with_chunking and Gym's VLLMModel._run_chunked_chat_completion produce identical aggregated text on the same 5-second WAV + same per-chunk LLM outputs across 3 cases (clean text, whitespace-padded chunks, whitespace-only middle chunk; this verifies that both sides .strip() per chunk and that the whitespace-only artifact is preserved identically).
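
For readers who want a concrete picture of the two layers compared above, here is a minimal sketch of fixed-window chunking with tail merge, followed by per-chunk aggregation. It is not the code in this PR or in Skills: the function names sketch_chunk_audio and sketch_aggregate, the dict-shaped responses, and the boundary handling (a tail strictly shorter than the minimum is merged; audio no longer than one window passes through whole) are assumptions made for illustration.

```python
import numpy as np


def sketch_chunk_audio(samples, sample_rate, window_sec=30.0, min_tail_sec=0.5):
    """Split a 1-D sample array into fixed-duration windows (hypothetical helper)."""
    window = int(window_sec * sample_rate)
    if len(samples) <= window:
        # Short audio is passed through whole.
        return [samples]
    chunks = [samples[i:i + window] for i in range(0, len(samples), window)]
    # Assumption: a trailing chunk strictly shorter than min_tail_sec is
    # merged into the previous chunk instead of being sent on its own.
    if len(chunks) > 1 and len(chunks[-1]) < int(min_tail_sec * sample_rate):
        tail = chunks.pop()
        chunks[-1] = np.concatenate([chunks[-1], tail])
    return chunks


def sketch_aggregate(chunk_responses):
    """Combine per-chunk chat-completion dicts (hypothetical response shape)."""
    texts = []
    usage = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    finish_reason = "stop"
    for resp in chunk_responses:
        choice = resp["choices"][0]
        # Strip each chunk's text; an all-whitespace chunk becomes "" and is
        # still joined, so it shows up as a doubled space in the result.
        texts.append(choice["message"]["content"].strip())
        for key in usage:
            usage[key] += resp["usage"][key]
        # One truncated window marks the whole aggregate as truncated.
        if choice["finish_reason"] == "length":
            finish_reason = "length"
    return {"content": " ".join(texts), "usage": usage, "finish_reason": finish_reason}


# Usage: 65 s of 16 kHz silence splits into 30 s + 30 s + 5 s windows.
example_chunks = sketch_chunk_audio(np.zeros(65 * 16000), 16000)
```

Keeping the chunker a pure function over arrays, as the PR describes, is what makes this kind of case-by-case comparison against a reference implementation straightforward.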