Audio refactor by jpc · Pull Request #55 · HumeAI/wsds

jpc · 2026-04-01T12:11:18Z

Focus:

- consistent handling of torchaudio and humecodec (torchcodec does not seem to bring anything here)
- accurate fast seeking in AAC and MP3 (99.99% of our data) and a correct slow fallback for everything else
- tests for both of the above

* WSSink: support forcing a fixed schema
* ws_decode: added encode_sample, handle extensions more uniformly
* WSSink: truncate long inputs when printing schema conflicts

SUMMARY: pyarrow's RecordBatchFileWriter.close() doesn't release the fd; only GC does. Force gc.collect() after close and add explicit close() methods to WSShard/WSDataset so file handles are released between fused pipeline stages.

Fused stages run 5-7 stages back-to-back in a single Modal container. Each stage opens pyarrow memory-mapped files for reading shards and IPC writers for output. Before this fix, none of these file handles were explicitly released:

- RecordBatchFileWriter.close() flushes metadata but keeps the fd open until the Python object is garbage-collected. On container reuse, volume.reload() fails with "there are open files preventing the operation" because the previous invocation's writer fds are still held.
- WSDataset caches WSShard objects in _open_shards, each holding a memory-mapped file via pa.memory_map(). No close() method existed, so these accumulated across fused stages.
- _build_key_iter created a temporary WSDataset (with open shard handles) via a lazy generator that was never closed, leaking handles until GC.

Changes:

- ws_sink.py: gc.collect() after RecordBatchFileWriter.close() to force fd release; close the writer on the error path too; _build_key_iter eagerly collects keys and closes the temp dataset; catch KeyError for the race condition where a sibling column dir exists but has no committed shards yet
- ws_shard.py: add a close() method; keep a reference to _source_file so we can explicitly close the pyarrow NativeFile (memory_map/OSFile)
- ws_dataset.py: add a close() that closes all cached shards and linked datasets

Co-authored-by: Theo <theo@hume.ai>
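
The explicit-close pattern described above can be sketched roughly as follows. This is a minimal illustration and not the wsds code: `Shard` and `Dataset` are hypothetical stand-ins for WSShard and WSDataset, and the standard-library `mmap` module is swapped in for `pa.memory_map()` so the example runs without pyarrow. The `gc.collect()` workaround for the writer is not reproduced here.

```python
import mmap
import os
import tempfile


class Shard:
    """Hypothetical stand-in for WSShard: owns one memory-mapped file."""

    def __init__(self, path):
        # Keep a reference to the underlying file so close() can release it.
        self._source_file = open(path, "rb")
        self._map = mmap.mmap(self._source_file.fileno(), 0, access=mmap.ACCESS_READ)

    def read(self):
        return self._map[:]

    @property
    def closed(self):
        return self._map is None

    def close(self):
        # Idempotent: a second call is a no-op.
        if self._map is not None:
            self._map.close()
            self._source_file.close()
            self._map = None


class Dataset:
    """Hypothetical stand-in for WSDataset: caches open shards by path."""

    def __init__(self):
        self._open_shards = {}

    def get_shard(self, path):
        if path not in self._open_shards:
            self._open_shards[path] = Shard(path)
        return self._open_shards[path]

    def close(self):
        # Release every cached handle so nothing stays open between stages.
        for shard in self._open_shards.values():
            shard.close()
        self._open_shards.clear()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shard-0.bin")
        with open(path, "wb") as f:
            f.write(b"payload")
        ds = Dataset()
        shard = ds.get_shard(path)
        data = shard.read()
        ds.close()  # without this, the handle would live until GC
        return data, shard.closed, len(ds._open_shards)
```

Calling `close()` at the end of each stage, rather than relying on garbage collection, is what lets the next stage (or a volume reload) see no lingering handles.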

- Use start_skip_samples from demuxer for seek adjustment (mp3 encoder delay, MP4 edit lists). The demuxer applies this skip at pts=0 but not after seeking.
- Read codec_delay from decoder's output stream info for codecs where the delay is set by the codec init (wmav2, opus).
- Switch read loop to PTS-based termination instead of sample counting, fixing truncation when seek lands far from target.
- Detect when seek lands at stream start (chunk0.pts < margin) and use tstart=0 timeline to match ffmpeg CLI behavior.
- Handle negative first-chunk PTS (some AAC files) by clamping to 0.

Test results: 100/100 full load, 93/100 seek (±5 sample tolerance).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
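
To illustrate why PTS-based termination avoids truncation, here is a small sketch under stated assumptions. It is not the ws_audio implementation: `read_window` is a hypothetical helper, chunks are modeled as `(pts_seconds, samples)` pairs that are contiguous in time, and the skip-samples and codec-delay adjustments are left out.

```python
def read_window(chunks, tstart, tend, sample_rate):
    """Return the samples in [tstart, tend) from decoded chunks.

    chunks: iterable of (pts_seconds, list_of_samples) in decode order,
    starting wherever the seek happened to land (assumed at or before tstart).
    """
    out = []
    first_pts = None
    for pts, samples in chunks:
        pts = max(pts, 0.0)  # clamp a negative first-chunk PTS to 0
        if pts >= tend:
            break  # stop on timestamps, not on how many samples were decoded
        if first_pts is None:
            first_pts = pts
        out.extend(samples)
    if first_pts is None:
        return []
    # Drop whatever was decoded before tstart, then keep the requested length.
    skip = max(round((tstart - first_pts) * sample_rate), 0)
    want = round((tend - tstart) * sample_rate)
    return out[skip:skip + want]
```

If the seek lands at pts=0 while the request is for 1.5 s to 2.5 s, a loop that stops after decoding one second's worth of samples would return audio from the wrong place or come up short after trimming. Terminating on PTS keeps reading until the window is actually covered.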

- For short seeks (tstart < 5s) and unreliable codecs (wmav2/wmapro), always read from the start instead of seeking. This avoids the seek_landed_at_start heuristic entirely for short seeks and costs almost nothing for small files.
- Only apply start_skip_samples seek_adj when actually seeking, not when reading from start (where the decoder handles it automatically).
- Initialize seek_adj to 0.0 to avoid UnboundLocalError.

Comprehensive test: 1494/1580 pass (94.6%), up from ~85% before.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
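
The decision logic in this commit might look roughly like the sketch below. `plan_read` and its constants are hypothetical names, and expressing seek_adj as start_skip_samples divided by the sample rate (and its sign) is an assumption made for illustration; the commit message does not spell out the units.

```python
SHORT_SEEK_SECONDS = 5.0                 # threshold taken from the commit message
UNRELIABLE_CODECS = {"wmav2", "wmapro"}  # codecs named in the commit message


def plan_read(tstart, codec, start_skip_samples, sample_rate):
    """Decide whether to seek and which seek adjustment (in seconds) to apply."""
    seek_adj = 0.0  # defined on every path, so it can never be unbound
    if tstart < SHORT_SEEK_SECONDS or codec in UNRELIABLE_CODECS:
        # Read from the start: the decoder applies the initial skip itself.
        return {"seek": False, "seek_adj": seek_adj}
    # Assumption: the adjustment is the demuxer's initial skip in seconds.
    seek_adj = start_skip_samples / sample_rate
    return {"seek": True, "seek_adj": seek_adj}
```

Initializing `seek_adj` before any branch is the simplest way to rule out the UnboundLocalError mentioned above.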

Build a packet index (128KB resolution) on first seek and use seek_to_byte_offset for raw MPEG audio formats. This avoids the slow sequential scan that ffmpeg's mp3 demuxer does for timestamp seeks. The index PTS is used directly for trimming since the demuxer doesn't update PTS after byte seek.

Also: no seek_adj for indexed seeks (index PTS is in raw timeline, not the skip_samples-adjusted timeline).

Comprehensive test: 1556/1580 pass (98.5%) across 8 seek positions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
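
A sparse packet index of this kind can be sketched in a few lines. This is only an illustration of the idea: `build_index` and `lookup` are hypothetical helpers, packets are modeled as `(byte_offset, pts_seconds)` pairs in file order, and the index is assumed to be non-empty. How the real code walks packets and calls seek_to_byte_offset is not shown in the commit message.

```python
import bisect

INDEX_RESOLUTION = 128 * 1024  # one index entry per ~128KB of file


def build_index(packets, resolution=INDEX_RESOLUTION):
    """Keep one (pts, byte_offset) entry roughly every `resolution` bytes."""
    index = []
    next_mark = 0
    for offset, pts in packets:
        if offset >= next_mark:
            index.append((pts, offset))
            next_mark = offset + resolution
    return index


def lookup(index, tstart):
    """Return the last (pts, byte_offset) entry with pts <= tstart."""
    i = bisect.bisect_right(index, (tstart, float("inf"))) - 1
    return index[max(i, 0)]  # fall back to the first entry for early targets
```

A caller would byte-seek to the returned offset, decode from there, and trim `tstart - pts` seconds from the front, using the index PTS because the demuxer's own PTS is not updated after a byte seek.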

jpc and others added 4 commits April 9, 2026 11:30

WSSink improvements (#54)

f9f17b3

ws_sink: validate __key__ alignment when writing to existing dataset

39af76e

Returning None for null pyarrow scalar values and a bug fix (#61)

3f6966b

jpc force-pushed the jpc/audio-refactor branch from f47f143 to 8309fc6 on April 13, 2026 15:57

rashishhume and others added 3 commits April 13, 2026 11:01

Restoring lazy key iteration with dataset cleanup in the end (#65)

11d0355

pupyarrow: bugfix: null-typed columns don't consume offset buffers

afb602b

convplayer: added mel_scaling option, let the column width grow to window width

c7f2c25

jpc force-pushed the jpc/audio-refactor branch from 8309fc6 to bead1d3 on April 14, 2026 08:25

jpc and others added 7 commits April 14, 2026 08:43

Fix file-descriptor leakage in RecordBatchFileWriter

db0b636

ws_audio: rename the class names to WSAudioEpisode and WSAudioSegment

9ef3755

ws_audio: added a general encode_audio method that works with all backends

72426dc

Audio decoding rework (NFY)

47938e1

jpc force-pushed the jpc/audio-refactor branch from bead1d3 to b5cbbc8 on April 14, 2026 08:45

jpc wants to merge 14 commits into jpc/wssink-improvements from jpc/audio-refactor

3 participants
