Stream tar extraction to disk and add file-based unpack #167

Merged: ericmj merged 7 commits into main from ericmj/stream-extract-to-disk on Mar 8, 2026

ericmj (Member) commented Mar 8, 2026

When extracting tar entries to disk, hex_erl_tar previously read each file entry fully into memory before writing it to disk. This change makes the disk extraction path stream file entries in chunks (default 64KB) directly to the output file.
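The chunked approach described above can be illustrated with a minimal sketch that uses only the OTP `file` module. This is not the hex_erl_tar implementation; the module name `chunk_copy` and the function `copy/3` are hypothetical and exist only to show how a fixed chunk size bounds memory use while copying.

```erlang
-module(chunk_copy).
-export([copy/3]).

%% Copy Src to Dest in pieces of at most ChunkSize bytes, so only one
%% chunk is held in memory at a time. Returns {ok, BytesCopied}.
copy(Src, Dest, ChunkSize) when is_integer(ChunkSize), ChunkSize > 0 ->
    {ok, In} = file:open(Src, [read, raw, binary]),
    try
        {ok, Out} = file:open(Dest, [write, raw, binary]),
        try
            loop(In, Out, ChunkSize, 0)
        after
            file:close(Out)
        end
    after
        file:close(In)
    end.

loop(In, Out, ChunkSize, Acc) ->
    case file:read(In, ChunkSize) of
        {ok, Chunk} ->
            %% Write the chunk immediately instead of accumulating it.
            ok = file:write(Out, Chunk),
            loop(In, Out, ChunkSize, Acc + byte_size(Chunk));
        eof ->
            {ok, Acc};
        {error, Reason} ->
            {error, Reason}
    end.
```

Calling `chunk_copy:copy(Src, Dest, 65536)` mirrors the 64KB default mentioned above: peak memory stays near one chunk regardless of file size.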

  • hex_tarball:unpack/2,3 - Added {file, Path} as first argument to read tarballs from disk without loading into memory
  • hex_tarball:unpack/2,3 - Added none as output mode to extract only metadata and checksums, skipping contents
  • hex_tarball:unpack/2,3 - Refactored so Output drives the outer extraction strategy: memory extracts to memory, path/none extracts to a temp dir
  • hex_tarball:unpack_docs/2,3 - Added {file, Path} as first argument to read doc tarballs from disk without loading into memory
  • hex_erl_tar:extract/2 - Added {chunks, N} option to control chunk size for streaming extraction to disk

ericmj added 2 commits March 8, 2026 20:52
When extracting tar archives to disk, stream file entries in chunks
instead of reading them fully into memory. Also replace hand-rolled
tar parsing in hex_tarball with hex_erl_tar:extract.

Backport the streamed_extract test from erlang/otp#10818 to verify
that files of various sizes (empty, small, chunk-boundary, and large)
are correctly extracted when streaming to disk.
ericmj changed the title from "Ericmj/stream extract to disk" to "Stream tar file entries to disk instead of loading into memory" on Mar 8, 2026
The compressed_one option for file:open is not available on OTP 24,
causing file-based extraction of compressed tar entries to silently
open files without decompression. Use compressed instead, which has
the same behavior for single-member gzip files like contents.tar.gz.
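A minimal sketch of reading a single-member gzip file through `file:open` with the `compressed` option, as the commit message above describes. The module name `gzip_read` and the function `read_gzip/1` are hypothetical; this is not the code from the commit.

```erlang
-module(gzip_read).
-export([read_gzip/1]).

%% Open a gzip file with the compressed option, so reads return
%% decompressed data, and collect it in 64KB reads.
read_gzip(Path) ->
    {ok, Fd} = file:open(Path, [read, binary, compressed]),
    try
        read_all(Fd, [])
    after
        file:close(Fd)
    end.

read_all(Fd, Acc) ->
    case file:read(Fd, 65536) of
        {ok, Data} -> read_all(Fd, [Data | Acc]);
        eof -> {ok, iolist_to_binary(lists:reverse(Acc))};
        {error, _} = Error -> Error
    end.
```

A file written with `zlib:gzip/1` can be read back this way.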
ericmj changed the title from "Stream tar file entries to disk instead of loading into memory" to "Stream tar extraction to disk and add file-based unpack" on Mar 8, 2026
ericmj (Member, Author) commented Mar 8, 2026

Also see erlang/otp#10818 for the backport of streaming file extract.

ericmj added 3 commits March 8, 2026 22:07
…dd {file, Path} support to unpack_docs

Previously the outer tarball extraction mode was tied to the input format.
Now Output drives the strategy: memory extracts to memory, path/none extracts
to a temp dir. Also adds {file, Path} input support to unpack_docs/2,3.
ericmj merged commit 13bb9fb into main on Mar 8, 2026
10 checks passed
ericmj deleted the ericmj/stream-extract-to-disk branch on March 8, 2026 23:21