Skip to content

Unify remote on-demand sequence retrieval across gtars bindings#268

Merged
nsheff merged 12 commits into
masterfrom
feat/remote-seq-fetch-unified-flows
Jun 17, 2026
Merged

Unify remote on-demand sequence retrieval across gtars bindings#268
nsheff merged 12 commits into
masterfrom
feat/remote-seq-fetch-unified-flows

Conversation

@nsheff

@nsheff nsheff commented Jun 17, 2026

Copy link
Copy Markdown
Member

Summary

Unifies remote on-demand reference-sequence retrieval across all gtars bindings around three documented flows, and ships the previously-missing browser (wasm/JS) layer.

The three flows:

  • get_substring — lean HTTP byte-range read (resident → local .seq → remote Range: request); no whole-file download.
  • stream_sequence — O(1)-memory streaming reader over a remote region.
  • load_sequence — explicit whole-sequence download + cache + resident reuse.

What's in here

  • core (gtars-refget): get_substring/get_substrings now fall through resident → local → remote byte-range via get_substring_from_remote. Plus a bulk-fetch promotion: a remote-only sequence read with ≥16 ranges (and a local cache dir) downloads+caches the whole .seq once instead of N round-trips. Single/small reads stay pure byte-range.
  • Python / Node / R: consistent exposure of the three flows (Node no longer auto-downloads whole chromosomes; Python/R gain stream_sequence + load_sequence + load_all_sequences).
  • Wasm: decode primitives exported as camelCase (encodedByteRange/decodeEncodedRange); new RemoteRefgetStore JS layer (HTTP byte-range + OPFS caching) shipped via @databio/gtars/remote.
  • R docs: generated man/*.Rd for the new functions (were omitted earlier).
  • Release: patch version bumps for the 5 changed products (see below).

Verification

Verified live against the public pangenome store (refgenie.s3.us-east-1.amazonaws.com/pangenome_refget_store, collection 0qveCdMlbF_kYn6XWb7YBy-FtRZ6gSAL) — all three flows return byte-identical real DNA, with flow-1 caching nothing and flow-3 caching one .seq:

Binding Build Tests Live 3-flow
core ✅ all + doctests
Python ✅ 61
Node ✅ 17
R (via bulker databio/cran) ✅ 130 refget
Wasm ✅ decode + E2E (mocked fetch/OPFS); eslint + vite build browser-only run pending

rextendr::document() regenerated a byte-identical extendr-wrappers.R, confirming no drift in the R bindings.

Release (per RELEASING.md — cut a <package>-v<version> tag per product)

Product Bump Publishes to
gtars-refget 0.9.0 → 0.9.1 crates.io
gtars-python 0.9.1 → 0.9.2 PyPI
gtars-node 0.7.0 → 0.7.1 npm (@databio/gtars-node)
gtars-wasm 0.9.0 → 0.9.1 npm (@databio/gtars)
gtars-r 0.9.1 → 0.9.2 Bioconductor

All additive (patch) bumps; gtars-refget stays inside the meta gtars crate's ^0.9.0 range, so no cascade re-release of gtars / gtars-cli.

Follow-ups (not in this PR)

  • refget frontend repo: vrsWorker.js was refactored to use RemoteRefgetStore — lives in that separate repo, handle alongside the wasm release.
  • VRS compute path still ingests whole chromosomes (wasm can't await fetch() mid-loop); true byte-range compute needs a future add_encoded_window Rust API.
  • Minor: Node npm test script broken on Node 22 (needs a glob); get_substring resolves by sha512t24u only while sibling methods also accept md5.

nsheff added 9 commits June 16, 2026 22:24
Expose the same three retrieval flows through a consistent interface in
every binding (Python, R, Node, Wasm):

- get_substring / get_substrings (flow 1): lean partial read, now with a
  remote HTTP byte-range fallback (resident -> local .seq -> remote),
  persisting nothing. New &self get_substring_from_remote() reuses the
  existing open_remote_range machinery.
- stream_sequence (flow 2): O(1)-memory region stream, now exposed in
  Python (SequenceStream iterator) and R in addition to Node.
- load_sequence / load_all_sequences (flow 3): explicit whole-sequence
  download + cache, now exposed as explicit methods in Node and R.

Behavioral change (no backwards compat): Node's get_substring no longer
implicitly downloads whole chromosomes; whole-sequence caching is now
opt-in via loadSequence. All mutable bindings lazily load the cheap
sequence index so a bare remote get_substring "just works".

Wasm gains encoded_byte_range + decode_encoded_range primitives so JS can
fetch an HTTP Range and decode just those bytes (flow-2 equivalent).

Tests: Range-honoring mock-HTTP core test asserting cold remote
get_substring/get_substrings decode correctly and persist nothing;
network-gated pytest covering all three flows. Verified live against the
refgenie S3 pangenome store: byte-identical bases across Python and Node.
When a remote-only sequence is read with >=REMOTE_BULK_FETCH_THRESHOLD (16)
ranges and a local cache dir exists, download+cache the whole .seq once via
fetch_file(persist=true) and serve all ranges from disk, instead of N HTTP
byte-range round-trips. Single/small-batch reads stay pure byte-range.
Adds test_get_substrings_bulk_remote_promotes_to_whole_download.
Adds gtars-wasm/js/remote-refget-store.{js,d.ts}: an HTTP byte-range +
OPFS-caching refget client built on the wasm decode primitives, shipped via
@databio/gtars/remote. getSubstring fetches only covering bytes (Range
request) and decodes via decodeEncodedRange; prefetchSequence/getSequenceBytes
download-once + cache whole .seq. Export the decode primitives as camelCase
(encodedByteRange/decodeEncodedRange) via js_name. Wire the JS/d.ts into the
pkg via Makefile + build/publish workflows (files + exports map).
ccc3b85 added stream_sequence/load_sequence/load_all_sequences to gtars-r
but omitted their generated .Rd docs and left NAMESPACE export ordering
non-canonical. Regenerated via rextendr::document() (extendr-wrappers.R was
byte-identical, confirming no drift).
Patch bumps for the products changed by the unified remote on-demand
sequence-retrieval work (all additive, no breaking changes):

  gtars-refget  0.9.0 -> 0.9.1  (crates.io)
  gtars-python  0.9.1 -> 0.9.2  (PyPI)
  gtars-node    0.7.0 -> 0.7.1  (npm; Cargo.toml + package.json)
  gtars-wasm    0.9.0 -> 0.9.1  (npm)
  gtars-r       0.9.1 -> 0.9.2  (Bioconductor; DESCRIPTION + src/rust/Cargo.toml)

gtars-refget stays within the meta gtars crate's ^0.9.0 range, so no
cascade re-release of gtars / gtars-cli is required.
node --test __test__/ misresolves the dir as a module on Node 22; quote a
glob so Node's test runner expands it: node --test "__test__/*.test.mjs".
get_substring resolved by sha512t24u only, while get_substrings/get_sequence/
stream_sequence/load_sequence also accept md5. Apply the same md5_lookup ->
sha512t24u fallback so get_substring is consistent. Adds test_get_substring_by_md5.
Bindings call core get_substring and inherit the fix.
get_substring / get_substrings errored on start == end while JS getSubstring
and core stream_sequence already treated it as a valid empty interval. Return
"" for start == end across resident/disk/remote paths (handled early in
get_substring so all paths agree); start > end and out-of-bounds still error.
Adds zero-length-range tests on resident, disk-stub, and remote paths.
stream_sequence_store read_to_string's the region into one String (O(n)), so
the 'decoding on the fly with O(1) memory ... single string' rustdoc was self-
contradictory. Reword to 'bounded intermediate buffers ... peak memory is
O(region length)'. Doc-only; rextendr::document() confirms no wrapper drift.
@codecov

codecov Bot commented Jun 17, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.67925% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.67%. Comparing base (be70f69) to head (10cd698).

Files with missing lines Patch % Lines
gtars-node/src/lib.rs 0.00% 22 Missing ⚠️
gtars-refget/src/store/readonly.rs 96.70% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #268      +/-   ##
==========================================
+ Coverage   82.50%   82.67%   +0.16%     
==========================================
  Files          95       95              
  Lines       26786    27022     +236     
==========================================
+ Hits        22100    22340     +240     
+ Misses       4686     4682       -4     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

nsheff added 3 commits June 17, 2026 09:55
The quoted glob node --test "__test__/*.test.mjs" relies on Node's built-in
--test glob expansion, which only exists on Node 21+. CI runs Node 20, so it
treated the pattern as a literal path (Could not find ...). Drop the quotes so
the shell expands the glob to explicit file args, which works on all Node
versions.
The remote byte-range path (get_substring_from_remote) now uses checked
String::from_utf8 — bytes come from an untrusted HTTP server and the
validation cost is nil against the network fetch.

The 5 whole-sequence from_utf8_unchecked sites are consolidated into one
audited helper, decoded_sequence_to_string, documenting the ASCII
invariant and citing the benchmark (~6-10% / ~10-20ms per 250Mbp
chromosome) that justifies skipping validation on the hot path. A
debug_assert re-checks the invariant in debug/test builds.

Crate goes from 6 -> 1 from_utf8_unchecked.
…-chgr-c6px-7xpp

Dependabot flagged two pyo3 advisories, both fixed only in 0.29.0 (nothing
backported to 0.27/0.28):
- GHSA-36hh-v3qg-5jq4 (high): OOB read in nth/nth_back for PyList/PyTuple iters
- GHSA-chgr-c6px-7xpp (moderate): missing Sync bound on PyCFunction::new_closure

pyo3 0.29 API migration (mechanical, no behavior change):
- Python::allow_threads -> Python::detach (15 sites)
- Bound::downcast / Py::downcast_bound -> cast / cast_bound (4 sites)
- PyObject type alias removed -> Py<PyAny>

Verified: cargo build -p gtars-py clean; 234 Python tests pass.
@nsheff nsheff merged commit c79ad79 into master Jun 17, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant