Replace 'aws s3 ls' shell-out in dataset.py with boto3#239
Conversation
There was a problem hiding this comment.
Pull request overview
This PR replaces the AWS CLI shell-outs in the S3 dataset loader with direct boto3 usage, and switches shard fetching to presigned-URL downloads (via curl) so the same code path can work with AWS S3 or other S3-compatible endpoints (via AWS_ENDPOINT_URL).
Changes:
- Added
_get_s3_client()and updatedget_s3_contents()to list objects via boto3 pagination instead ofaws s3 ls. - Updated
get_all_s3_urls()to emitpipe:curl -fsSL "<presigned>"instead ofpipe:aws s3 cp ... -. - Added offline unit tests for the new S3 behavior; documented
AWS_ENDPOINT_URLusage in the README and updated.gitignore.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
stable_audio_tools/data/dataset.py |
Replaces AWS CLI calls with boto3 listing + presigned-URL generation for shard download. |
tests/test_dataset_s3.py |
Adds offline unit tests covering endpoint env handling, pagination/listing behavior, filtering, and presigned URL emission. |
README.md |
Documents use of AWS_ENDPOINT_URL for S3-compatible storage (Backblaze B2 example). |
.gitignore |
Adds .DS_Store ignore and normalizes wandb/* entry. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6be8c78 to
6671c21
Compare
6671c21 to
3211c59
Compare
3211c59 to
9aa1b9c
Compare
9aa1b9c to
5dc6b0f
Compare
5dc6b0f to
9463b19
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
stable_audio_tools/data/dataset.py:1
- The comment states SigV4 max is 7 days, but overrides (arg/env) are not range-validated. If a user sets
presign_expiry_seconds> 604800 (or <= 0),generate_presigned_urlmay raise at runtime, causing failures mid-training with a less actionable stack trace. Consider validatingpresign_expiry_secondsis within an allowed range (e.g.,1.._DEFAULT_PRESIGN_EXPIRY_SECONDS) and raising a clearValueErrorwhen out of bounds.
import importlib
9463b19 to
5e2b6bd
Compare
5e2b6bd to
567f358
Compare
6cef2e2 to
831d2b3
Compare
831d2b3 to
d860a20
Compare
d860a20 to
d731dd7
Compare
d7ddd48 to
48dbb30
Compare
48dbb30 to
4ea29bd
Compare
4ea29bd to
64aca56
Compare
64aca56 to
3a109ee
Compare
3a109ee to
44ebfd2
Compare
6020e30 to
e73bdf4
Compare
26d5fde to
049e886
Compare
049e886 to
48d881f
Compare
| def get_s3_contents( | ||
| dataset_path, | ||
| s3_url_prefix=None, | ||
| filter_str='', # only keep keys containing this substring | ||
| recursive=True, | ||
| debug=False, | ||
| profile=None, | ||
| ): |
|
Hi @zqevans (cc @nateraw)! Whenever you get a chance, the PR is ready for a look. It swaps the aws s3 ls shell-out in dataset.py for boto3, keeps the existing behavior, and adds a few tests. I just rebased it onto the latest main, so it should be clean to review or merge. No rush at all, and I am happy to rebase again, split it up, or make any changes you'd like to help get it across the line. Thanks for the great project! 🚀 |
Replaces the
aws s3 lssubprocess +pipe:aws s3 cpURLs instable_audio_tools/data/dataset.pywith direct boto3. HonorsAWS_ENDPOINT_URLso the same code path works against any S3-compatible endpoint: AWS S3 by default, or Backblaze B2 whenAWS_ENDPOINT_URLpoints at a B2 endpoint. Backward compatible: whenAWS_ENDPOINT_URLis unset, behavior matches the prior AWS-only path.What changed
stable_audio_tools/data/s3_utils.py: an import-light module (boto3 only, no torch/webdataset) that holds the S3 helpers and doubles as apipe:subprocess entry point.dataset.pyre-exports the publicget_s3_contents/get_all_s3_urlsfor backwards compatibility._get_s3_client(...)builds a boto3 client that honorsAWS_ENDPOINT_URLand forces SigV4 (signature_version="s3v4"), which Backblaze B2 requires. The User-Agent carries astable-audio-tools/<version>token, and the client sets botocoreretriesfor transient-failure resilience.get_s3_contents(...)uses a boto3 paginator instead ofsubprocess.run(['aws','s3','ls',...]), returning keys relative todataset_path(the same output shape as the previous implementation).get_all_s3_urls(...)emits a streamingpipe:command per shard:pipe:<python> -m stable_audio_tools.data.s3_utils s3://bucket/key. The subprocess streams the object with boto3, authenticating at open time, so shards re-auth on every open (no presigned-URL expiry mid-run, even across epochs withResampledShards) and no credentials appear on the command line.boto3is declared in thetrainextra inpyproject.toml, anduv.lockis regenerated.Tests
tests/test_dataset_s3.py(first tests in the repo): 26 offline tests, no network. Covers client/endpoint/profile handling, SigV4 (including a real-boto3 check that the signed URL isAWS4-HMAC-SHA256), versioned User-Agent + override semantics,get_s3_contentspagination/filter/recursive,_parse_s3_url,shard_pipe_command,stream_object, the module CLI entry, and a gracefulImportErrorwhen boto3 is absent.python -m pytest tests/test_dataset_s3.py -q # 26 passedREADME + .gitignore
README note pointing at
AWS_ENDPOINT_URL+AWS_DEFAULT_REGIONfor non-AWS endpoints (and that S3 use needsboto3, in thetrainextra);.DS_Storeadded to.gitignore, plus!tests/test_*.pyso test modules aren't caught by the existingtest_*rule.Note on the download path: the original
pipe:aws s3 cprequired theawsCLI installed + configured. This streams via boto3 in apipe:subprocess instead: no CLI dependency, authenticates per open, and keeps credentials off the process command line.