Unify remote on-demand sequence retrieval across gtars bindings#268
Merged
Conversation
Expose the same three retrieval flows through a consistent interface in every binding (Python, R, Node, Wasm): - get_substring / get_substrings (flow 1): lean partial read, now with a remote HTTP byte-range fallback (resident -> local .seq -> remote), persisting nothing. New &self get_substring_from_remote() reuses the existing open_remote_range machinery. - stream_sequence (flow 2): O(1)-memory region stream, now exposed in Python (SequenceStream iterator) and R in addition to Node. - load_sequence / load_all_sequences (flow 3): explicit whole-sequence download + cache, now exposed as explicit methods in Node and R. Behavioral change (no backwards compat): Node's get_substring no longer implicitly downloads whole chromosomes; whole-sequence caching is now opt-in via loadSequence. All mutable bindings lazily load the cheap sequence index so a bare remote get_substring "just works". Wasm gains encoded_byte_range + decode_encoded_range primitives so JS can fetch an HTTP Range and decode just those bytes (flow-2 equivalent). Tests: Range-honoring mock-HTTP core test asserting cold remote get_substring/get_substrings decode correctly and persist nothing; network-gated pytest covering all three flows. Verified live against the refgenie S3 pangenome store: byte-identical bases across Python and Node.
When a remote-only sequence is read with >=REMOTE_BULK_FETCH_THRESHOLD (16) ranges and a local cache dir exists, download+cache the whole .seq once via fetch_file(persist=true) and serve all ranges from disk, instead of N HTTP byte-range round-trips. Single/small-batch reads stay pure byte-range. Adds test_get_substrings_bulk_remote_promotes_to_whole_download.
Adds gtars-wasm/js/remote-refget-store.{js,d.ts}: an HTTP byte-range +
OPFS-caching refget client built on the wasm decode primitives, shipped via
@databio/gtars/remote. getSubstring fetches only covering bytes (Range
request) and decodes via decodeEncodedRange; prefetchSequence/getSequenceBytes
download-once + cache whole .seq. Export the decode primitives as camelCase
(encodedByteRange/decodeEncodedRange) via js_name. Wire the JS/d.ts into the
pkg via Makefile + build/publish workflows (files + exports map).
ccc3b85 added stream_sequence/load_sequence/load_all_sequences to gtars-r but omitted their generated .Rd docs and left NAMESPACE export ordering non-canonical. Regenerated via rextendr::document() (extendr-wrappers.R was byte-identical, confirming no drift).
Patch bumps for the products changed by the unified remote on-demand sequence-retrieval work (all additive, no breaking changes): gtars-refget 0.9.0 -> 0.9.1 (crates.io) gtars-python 0.9.1 -> 0.9.2 (PyPI) gtars-node 0.7.0 -> 0.7.1 (npm; Cargo.toml + package.json) gtars-wasm 0.9.0 -> 0.9.1 (npm) gtars-r 0.9.1 -> 0.9.2 (Bioconductor; DESCRIPTION + src/rust/Cargo.toml) gtars-refget stays within the meta gtars crate's ^0.9.0 range, so no cascade re-release of gtars / gtars-cli is required.
node --test __test__/ misresolves the dir as a module on Node 22; quote a glob so Node's test runner expands it: node --test "__test__/*.test.mjs".
get_substring resolved by sha512t24u only, while get_substrings/get_sequence/ stream_sequence/load_sequence also accept md5. Apply the same md5_lookup -> sha512t24u fallback so get_substring is consistent. Adds test_get_substring_by_md5. Bindings call core get_substring and inherit the fix.
get_substring / get_substrings errored on start == end while JS getSubstring and core stream_sequence already treated it as a valid empty interval. Return "" for start == end across resident/disk/remote paths (handled early in get_substring so all paths agree); start > end and out-of-bounds still error. Adds zero-length-range tests on resident, disk-stub, and remote paths.
stream_sequence_store read_to_string's the region into one String (O(n)), so the 'decoding on the fly with O(1) memory ... single string' rustdoc was self- contradictory. Reword to 'bounded intermediate buffers ... peak memory is O(region length)'. Doc-only; rextendr::document() confirms no wrapper drift.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #268 +/- ##
==========================================
+ Coverage 82.50% 82.67% +0.16%
==========================================
Files 95 95
Lines 26786 27022 +236
==========================================
+ Hits 22100 22340 +240
+ Misses 4686 4682 -4 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
The quoted glob node --test "__test__/*.test.mjs" relies on Node's built-in --test glob expansion, which only exists on Node 21+. CI runs Node 20, so it treated the pattern as a literal path (Could not find ...). Drop the quotes so the shell expands the glob to explicit file args, which works on all Node versions.
The remote byte-range path (get_substring_from_remote) now uses checked String::from_utf8 — bytes come from an untrusted HTTP server and the validation cost is nil against the network fetch. The 5 whole-sequence from_utf8_unchecked sites are consolidated into one audited helper, decoded_sequence_to_string, documenting the ASCII invariant and citing the benchmark (~6-10% / ~10-20ms per 250Mbp chromosome) that justifies skipping validation on the hot path. A debug_assert re-checks the invariant in debug/test builds. Crate goes from 6 -> 1 from_utf8_unchecked.
…-chgr-c6px-7xpp Dependabot flagged two pyo3 advisories, both fixed only in 0.29.0 (nothing backported to 0.27/0.28): - GHSA-36hh-v3qg-5jq4 (high): OOB read in nth/nth_back for PyList/PyTuple iters - GHSA-chgr-c6px-7xpp (moderate): missing Sync bound on PyCFunction::new_closure pyo3 0.29 API migration (mechanical, no behavior change): - Python::allow_threads -> Python::detach (15 sites) - Bound::downcast / Py::downcast_bound -> cast / cast_bound (4 sites) - PyObject type alias removed -> Py<PyAny> Verified: cargo build -p gtars-py clean; 234 Python tests pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Unifies remote on-demand reference-sequence retrieval across all gtars bindings around three documented flows, and ships the previously-missing browser (wasm/JS) layer.
The three flows:
.seq→ remoteRange:request); no whole-file download.What's in here
gtars-refget):get_substring/get_substringsnow fall through resident → local → remote byte-range viaget_substring_from_remote. Plus a bulk-fetch promotion: a remote-only sequence read with ≥16 ranges (and a local cache dir) downloads+caches the whole.seqonce instead of N round-trips. Single/small reads stay pure byte-range.stream_sequence+load_sequence+load_all_sequences).encodedByteRange/decodeEncodedRange); newRemoteRefgetStoreJS layer (HTTP byte-range + OPFS caching) shipped via@databio/gtars/remote.man/*.Rdfor the new functions (were omitted earlier).Verification
Verified live against the public pangenome store (
refgenie.s3.us-east-1.amazonaws.com/pangenome_refget_store, collection0qveCdMlbF_kYn6XWb7YBy-FtRZ6gSAL) — all three flows return byte-identical real DNA, with flow-1 caching nothing and flow-3 caching one.seq:databio/cran)rextendr::document()regenerated a byte-identicalextendr-wrappers.R, confirming no drift in the R bindings.Release (per RELEASING.md — cut a
<package>-v<version>tag per product)gtars-refgetgtars-pythongtars-node@databio/gtars-node)gtars-wasm@databio/gtars)gtars-rAll additive (patch) bumps;
gtars-refgetstays inside the metagtarscrate's^0.9.0range, so no cascade re-release ofgtars/gtars-cli.Follow-ups (not in this PR)
refgetfrontend repo:vrsWorker.jswas refactored to useRemoteRefgetStore— lives in that separate repo, handle alongside the wasm release.await fetch()mid-loop); true byte-range compute needs a futureadd_encoded_windowRust API.npm testscript broken on Node 22 (needs a glob);get_substringresolves by sha512t24u only while sibling methods also accept md5.