Skip to content

Replace 'aws s3 ls' shell-out in dataset.py with boto3#239

Open
goanpeca wants to merge 1 commit into
Stability-AI:mainfrom
goanpeca:feat/boto3-s3-loader
Open

Replace 'aws s3 ls' shell-out in dataset.py with boto3#239
goanpeca wants to merge 1 commit into
Stability-AI:mainfrom
goanpeca:feat/boto3-s3-loader

Conversation

@goanpeca
Copy link
Copy Markdown

@goanpeca goanpeca commented May 4, 2026

Replaces the aws s3 ls subprocess + pipe:aws s3 cp URLs in stable_audio_tools/data/dataset.py with direct boto3. Honors AWS_ENDPOINT_URL so the same code path works against any S3-compatible endpoint: AWS S3 by default, or Backblaze B2 when AWS_ENDPOINT_URL points at a B2 endpoint. Backward compatible: when AWS_ENDPOINT_URL is unset, behavior matches the prior AWS-only path.

What changed

  • New stable_audio_tools/data/s3_utils.py: an import-light module (boto3 only, no torch/webdataset) that holds the S3 helpers and doubles as a pipe: subprocess entry point. dataset.py re-exports the public get_s3_contents / get_all_s3_urls for backwards compatibility.
  • _get_s3_client(...) builds a boto3 client that honors AWS_ENDPOINT_URL and forces SigV4 (signature_version="s3v4"), which Backblaze B2 requires. The User-Agent carries a stable-audio-tools/<version> token, and the client sets botocore retries for transient-failure resilience.
  • get_s3_contents(...) uses a boto3 paginator instead of subprocess.run(['aws','s3','ls',...]), returning keys relative to dataset_path (the same output shape as the previous implementation).
  • get_all_s3_urls(...) emits a streaming pipe: command per shard: pipe:<python> -m stable_audio_tools.data.s3_utils s3://bucket/key. The subprocess streams the object with boto3, authenticating at open time, so shards re-auth on every open (no presigned-URL expiry mid-run, even across epochs with ResampledShards) and no credentials appear on the command line.
  • boto3 is declared in the train extra in pyproject.toml, and uv.lock is regenerated.

Tests

tests/test_dataset_s3.py (first tests in the repo): 26 offline tests, no network. Covers client/endpoint/profile handling, SigV4 (including a real-boto3 check that the signed URL is AWS4-HMAC-SHA256), versioned User-Agent + override semantics, get_s3_contents pagination/filter/recursive, _parse_s3_url, shard_pipe_command, stream_object, the module CLI entry, and a graceful ImportError when boto3 is absent.

python -m pytest tests/test_dataset_s3.py -q
# 26 passed

README + .gitignore

README note pointing at AWS_ENDPOINT_URL + AWS_DEFAULT_REGION for non-AWS endpoints (and that S3 use needs boto3, in the train extra); .DS_Store added to .gitignore, plus !tests/test_*.py so test modules aren't caught by the existing test_* rule.

Note on the download path: the original pipe:aws s3 cp required the aws CLI installed + configured. This streams via boto3 in a pipe: subprocess instead: no CLI dependency, authenticates per open, and keeps credentials off the process command line.

Copilot AI review requested due to automatic review settings May 4, 2026 23:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the AWS CLI shell-outs in the S3 dataset loader with direct boto3 usage, and switches shard fetching to presigned-URL downloads (via curl) so the same code path can work with AWS S3 or other S3-compatible endpoints (via AWS_ENDPOINT_URL).

Changes:

  • Added _get_s3_client() and updated get_s3_contents() to list objects via boto3 pagination instead of aws s3 ls.
  • Updated get_all_s3_urls() to emit pipe:curl -fsSL "<presigned>" instead of pipe:aws s3 cp ... -.
  • Added offline unit tests for the new S3 behavior; documented AWS_ENDPOINT_URL usage in the README and updated .gitignore.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File Description
stable_audio_tools/data/dataset.py Replaces AWS CLI calls with boto3 listing + presigned-URL generation for shard download.
tests/test_dataset_s3.py Adds offline unit tests covering endpoint env handling, pagination/listing behavior, filtering, and presigned URL emission.
README.md Documents use of AWS_ENDPOINT_URL for S3-compatible storage (Backblaze B2 example).
.gitignore Adds .DS_Store ignore and normalizes wandb/* entry.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 6be8c78 to 6671c21 Compare May 4, 2026 23:41
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 6671c21 to 3211c59 Compare May 15, 2026 02:42
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 3211c59 to 9aa1b9c Compare June 2, 2026 01:37
@goanpeca goanpeca requested a review from Copilot June 2, 2026 01:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread tests/test_dataset_s3.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 9aa1b9c to 5dc6b0f Compare June 2, 2026 01:58
@goanpeca goanpeca requested a review from Copilot June 2, 2026 01:59
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread README.md Outdated
Comment thread tests/test_dataset_s3.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 5dc6b0f to 9463b19 Compare June 2, 2026 02:17
@goanpeca goanpeca requested a review from Copilot June 2, 2026 02:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

stable_audio_tools/data/dataset.py:1

  • The comment states SigV4 max is 7 days, but overrides (arg/env) are not range-validated. If a user sets presign_expiry_seconds > 604800 (or <= 0), generate_presigned_url may raise at runtime, causing failures mid-training with a less actionable stack trace. Consider validating presign_expiry_seconds is within an allowed range (e.g., 1.._DEFAULT_PRESIGN_EXPIRY_SECONDS) and raising a clear ValueError when out of bounds.
import importlib

Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 9463b19 to 5e2b6bd Compare June 2, 2026 02:22
@goanpeca goanpeca requested a review from Copilot June 2, 2026 02:22
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Comment thread tests/test_dataset_s3.py
Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread pyproject.toml Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 5e2b6bd to 567f358 Compare June 2, 2026 02:28
@goanpeca goanpeca requested a review from Copilot June 2, 2026 02:28
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch 2 times, most recently from 6cef2e2 to 831d2b3 Compare June 2, 2026 02:34
@goanpeca goanpeca requested a review from Copilot June 2, 2026 02:34
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca requested a review from Copilot June 2, 2026 02:39
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 831d2b3 to d860a20 Compare June 2, 2026 02:47
@goanpeca goanpeca requested a review from Copilot June 2, 2026 02:47
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from d860a20 to d731dd7 Compare June 2, 2026 02:56
@goanpeca goanpeca requested a review from Copilot June 2, 2026 03:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from d7ddd48 to 48dbb30 Compare June 2, 2026 03:26
@goanpeca goanpeca requested a review from Copilot June 2, 2026 03:26
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 48dbb30 to 4ea29bd Compare June 2, 2026 03:31
@goanpeca goanpeca requested a review from Copilot June 2, 2026 03:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

Comment thread pyproject.toml Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 4ea29bd to 64aca56 Compare June 2, 2026 03:40
@goanpeca goanpeca requested a review from Copilot June 2, 2026 03:40
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

Comment thread stable_audio_tools/data/dataset.py Outdated
Comment thread tests/test_dataset_s3.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 64aca56 to 3a109ee Compare June 2, 2026 03:51
@goanpeca goanpeca requested a review from Copilot June 2, 2026 03:51
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 3a109ee to 44ebfd2 Compare June 2, 2026 04:03
@goanpeca goanpeca requested a review from Copilot June 2, 2026 04:03
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated no new comments.

@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch 2 times, most recently from 6020e30 to e73bdf4 Compare June 2, 2026 04:49
@goanpeca goanpeca requested a review from Copilot June 2, 2026 04:56
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 7 changed files in this pull request and generated 3 comments.

Comment thread README.md Outdated
Comment thread stable_audio_tools/data/s3_utils.py
Comment thread stable_audio_tools/data/dataset.py Outdated
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch 2 times, most recently from 26d5fde to 049e886 Compare June 2, 2026 05:05
@goanpeca goanpeca requested a review from Copilot June 2, 2026 05:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 7 changed files in this pull request and generated 1 comment.

Comment thread stable_audio_tools/data/s3_utils.py
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 049e886 to 48d881f Compare June 2, 2026 05:12
@goanpeca goanpeca requested a review from Copilot June 2, 2026 05:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 7 changed files in this pull request and generated 1 comment.

Comment on lines +93 to +100
def get_s3_contents(
dataset_path,
s3_url_prefix=None,
filter_str='', # only keep keys containing this substring
recursive=True,
debug=False,
profile=None,
):
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter shadows python keyword.

@goanpeca
Copy link
Copy Markdown
Author

goanpeca commented Jun 2, 2026

Hi @zqevans (cc @nateraw)! Whenever you get a chance, the PR is ready for a look. It swaps the aws s3 ls shell-out in dataset.py for boto3, keeps the existing behavior, and adds a few tests. I just rebased it onto the latest main, so it should be clean to review or merge.

No rush at all, and I am happy to rebase again, split it up, or make any changes you'd like to help get it across the line.

Thanks for the great project! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants