feature added: --validate checksum flag #44
base: main
Conversation
📝 Walkthrough

Adds optional SHA256 checksum validation to the download flow: extract 64-character hex checksums from JSON-LD metadata, propagate validate flags and per-file expected checksums through the download functions, compute actual checksums after each file fetch, and raise on mismatches. A CLI flag exposes the feature.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CLI as CLI/Download
    participant API as Download API
    participant Meta as Databus Metadata
    participant WebDAV as WebDAV Fetch
    participant SHA as compute_sha256_and_length
    User->>CLI: run download --validate-checksum
    CLI->>API: download(validate_checksum=True, ...)
    API->>Meta: fetch version/artifact metadata (if needed)
    Meta-->>API: JSON-LD graph (Parts with checksum nodes)
    API->>API: _extract_checksum_from_node() -> build per-file checksums
    API->>WebDAV: _download_file(url, expected_checksum)
    WebDAV->>WebDAV: download file to disk
    WebDAV->>SHA: compute_sha256_and_length(file)
    SHA-->>WebDAV: actual_checksum
    WebDAV-->>API: return success + actual_checksum
    alt expected checksum present
        API->>API: compare actual vs expected
        alt match
            API-->>CLI: success
        else mismatch
            API-->>CLI: raise checksum mismatch error
        end
    else expected checksum missing
        API-->>CLI: warn and continue
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes

Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)
791-802: Remove the `expected_checksum` argument on line 801.

Line 801 passes `expected_checksum=expected`, but:

- The variable `expected` is only defined in the `elif file is not None:` block (lines 764-790), not in this `elif version is not None:` block, causing a `NameError`.
- The `_download_version` function does not accept an `expected_checksum` parameter (see line 473), which would cause a `TypeError`.

Version downloads extract checksums internally from their JSON-LD metadata (lines 488-502), so there's no need to pass an external expected checksum.
🔎 Proposed fix

```diff
 elif version is not None:
     print(f"Downloading version: {databusURI}")
     _download_version(
         databusURI,
         localDir,
         vault_token_file=token,
         databus_key=databus_key,
         auth_url=auth_url,
         client_id=client_id,
         validate_checksum=validate_checksum,
-        expected_checksum=expected,
     )
```
🧹 Nitpick comments (2)
databusclient/cli.py (1)
177-177: Remove extra blank line.

PEP 8 style guidelines recommend no blank line between the function signature and the docstring.

🔎 Proposed fix

```diff
 def download(
     databusuris: List[str],
     localdir,
     databus,
     vault_token,
     databus_key,
     all_versions,
     authurl,
     clientid,
     validate_checksum,
 ):
-
     """
     Download datasets from databus, optionally using vault access if vault options are provided.
     """
```

databusclient/api/download.py (1)
235-254: Consider moving the import to module level.

The checksum validation logic is correct (skip if no expected checksum, warn on computation failure, raise on mismatch). However, importing `compute_sha256_and_length` inside the try block is unconventional. Unless there's a specific reason (e.g., optional dependency or avoiding circular imports), consider importing at module level for clarity.

🔎 Proposed refactor to move import to module level

Add to the top of the file (after line 13):

```diff
 from databusclient.api.utils import (
     fetch_databus_jsonld,
     get_databus_id_parts_from_file_url,
 )
+from databusclient.extensions.webdav import compute_sha256_and_length
```

Then simplify the validation block:

```diff
     # --- 6. Optional checksum validation ---
     if validate_checksum:
-        # reuse compute_sha256_and_length from webdav extension
-        try:
-            from databusclient.extensions.webdav import compute_sha256_and_length
-
-            actual, _ = compute_sha256_and_length(filename)
-        except Exception:
-            actual = None
+        try:
+            actual, _ = compute_sha256_and_length(filename)
+        except Exception:
+            actual = None
         if expected_checksum is None:
             print(f"WARNING: no expected checksum available for (unknown); skipping validation")
         elif actual is None:
             print(f"WARNING: could not compute checksum for (unknown); skipping validation")
         else:
             if actual.lower() != expected_checksum.lower():
                 raise IOError(
                     f"Checksum mismatch for (unknown): expected {expected_checksum}, got {actual}"
                 )
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
databusclient/api/download.py
databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (2)
databusclient/cli.py (1)
databusclient/api/download.py (1)
download(710-856)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
compute_sha256_and_length (8-18)

databusclient/api/utils.py (1)

fetch_databus_jsonld (33-50)
🔇 Additional comments (11)
databusclient/cli.py (2)
161-175: LGTM: CLI option and parameter addition.

The `--validate-checksum` flag is correctly defined and threaded through to the API call. The parameter follows Click conventions and integrates cleanly with the download command.

191-192: LGTM: Parameter forwarding.

The `validate_checksum` parameter is correctly passed through to the API download function.

databusclient/api/download.py (9)

16-60: LGTM: Checksum extraction handles diverse JSON-LD structures.

The function correctly extracts SHA-256 checksums from various JSON-LD formats (plain strings, `@value` objects, nested structures). The 64-character hex validation is appropriate, and the fallback recursive search provides flexibility for unpredictable metadata structures.

263-290: LGTM: Per-URL checksum mapping.

The function correctly extracts expected checksums from the provided mapping and forwards them to `_download_file` for validation.

437-462: LGTM: Checksum flag propagated through collection downloads.

The `validate_checksum` flag is correctly threaded through. Note that collection downloads won't have per-file expected checksums available (since collections are query-based), so validation warnings will appear for each file. This is acceptable given the architecture.

473-513: LGTM: Version downloads extract per-file checksums from metadata.

The function correctly parses the version's JSON-LD to build a per-file checksum mapping. The fallback to an empty dict on parsing errors ensures downloads continue with warnings rather than failing completely.

524-571: LGTM: Artifact downloads extract checksums per version.

The function correctly iterates through versions and builds checksum mappings for each. The implementation is consistent with the version download path.

646-672: LGTM: Group downloads propagate checksum validation.

The flag is correctly passed through to artifact downloads, which handle checksum extraction.

719-760: LGTM: Checksum flag added to main download dispatcher.

The `validate_checksum` parameter is correctly added and propagated through collection downloads.

764-790: LGTM: Single-file downloads fetch checksums from parent version metadata.

The logic to construct the version URI and extract the expected checksum from the parent version's JSON-LD is sound. The fallback warning on errors ensures downloads continue gracefully.

803-856: LGTM: Checksum validation threaded through remaining download paths.

The `validate_checksum` flag is correctly propagated to artifact, group, and SPARQL query download paths.
Actionable comments posted: 1
♻️ Duplicate comments (1)
databusclient/api/download.py (1)
545-559: Same duplication as in `_download_version`.

This checksum extraction logic is identical to lines 487-501 in `_download_version`. Please apply the same refactoring suggested above to eliminate this duplication.
🧹 Nitpick comments (2)
databusclient/api/download.py (2)
83-84: Checksum validation logic is correct and user-friendly.

The implementation properly computes and compares checksums, provides clear warnings when validation cannot be performed, and raises appropriate errors on mismatches. The case-insensitive comparison at line 249 is correct for hex strings.

💡 Optional: Consider more specific exception handling

At line 241, the broad `except Exception` catches all errors during checksum computation. While this ensures robustness, consider catching specific exceptions for clarity:

```diff
 try:
     actual, _ = compute_sha256_and_length(filename)
-except Exception:
+except (OSError, IOError) as e:
+    print(f"WARNING: error computing checksum for (unknown): {e}")
     actual = None
```

This makes the error handling more explicit while still providing graceful degradation.
Also applies to: 236-252
487-501: Extract duplicated checksum-building logic into a helper function.

The code at lines 487-501 that builds the checksums dictionary from JSON-LD metadata is duplicated in `_download_artifact` (lines 545-559). This creates a maintenance burden.

🔎 Proposed refactor to eliminate duplication

Extract the common logic into a helper function:

```python
def _extract_checksums_from_jsonld(json_str: str) -> dict:
    """
    Extract file URL -> checksum mapping from artifact version JSON-LD.

    Returns:
        Dictionary mapping file URIs to their expected SHA256 checksums.
    """
    checksums: dict = {}
    try:
        json_dict = json.loads(json_str)
        graph = json_dict.get("@graph", [])
        for node in graph:
            if node.get("@type") == "Part":
                file_uri = node.get("file")
                if not isinstance(file_uri, str):
                    continue
                expected = _extract_checksum_from_node(node)
                if expected:
                    checksums[file_uri] = expected
    except Exception:
        return {}
    return checksums
```

Then use it in both functions:

```diff
 def _download_version(...):
     json_str = fetch_databus_jsonld(uri, databus_key=databus_key)
     file_urls = _get_file_download_urls_from_artifact_jsonld(json_str)
-    # build url -> checksum mapping from JSON-LD when available
-    checksums: dict = {}
-    try:
-        json_dict = json.loads(json_str)
-        graph = json_dict.get("@graph", [])
-        for node in graph:
-            if node.get("@type") == "Part":
-                file_uri = node.get("file")
-                if not isinstance(file_uri, str):
-                    continue
-                expected = _extract_checksum_from_node(node)
-                if expected:
-                    checksums[file_uri] = expected
-    except Exception:
-        checksums = {}
+    checksums = _extract_checksums_from_jsonld(json_str)
```

Apply the same change in `_download_artifact`.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
databusclient/api/download.py
databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
compute_sha256_and_length (8-18)

databusclient/api/utils.py (1)

fetch_databus_jsonld (33-50)
🔇 Additional comments (8)
databusclient/api/download.py (6)
15-15: LGTM! Appropriate import for checksum functionality.

The import of `compute_sha256_and_length` is correctly used later for computing file checksums during validation.

17-60: Well-structured checksum extraction with comprehensive JSON-LD handling.

The function correctly handles multiple JSON-LD node shapes and uses appropriate validation for 64-character hex strings. The recursive approach with early returns is efficient.

262-263: LGTM! Clean per-file checksum resolution.

The logic correctly extracts expected checksums from the mapping and forwards them to `_download_file`. The defensive `isinstance` check at line 278 is good practice.

Also applies to: 277-289

645-645: LGTM! Correct propagation of checksum validation flag.

The parameter is properly threaded through to `_download_artifact`.

Also applies to: 671-671

763-789: Single-file checksum extraction is well-implemented.

The logic to fetch the parent version's metadata and extract the expected checksum is sound. The try/except block with a clear warning (line 777) provides good user experience when metadata cannot be retrieved.

718-718: LGTM! Checksum validation flag properly propagated throughout download paths.

The `validate_checksum` parameter is correctly threaded through all download scenarios (collections, versions, artifacts, groups, and custom queries).

Also applies to: 759-759, 799-799, 813-813, 827-827, 853-853

databusclient/cli.py (2)

161-165: LGTM! CLI option properly configured.

The `--validate-checksum` flag follows Click conventions and has clear help text. The parameter is correctly added to the function signature.

Also applies to: 175-175

190-191: LGTM! Checksum validation flag correctly forwarded to API.

The parameter is properly passed through to `api_download`.
Actionable comments posted: 1
🧹 Nitpick comments (2)
databusclient/api/download.py (2)
15-60: Well-implemented checksum extraction with broad compatibility.

The import and `_extract_checksum_from_node` function are well-structured. The function correctly validates SHA256 checksums (64 hex characters) and handles diverse JSON-LD node shapes by trying known keys first, then falling back to recursive search.

The recursive fallback (lines 55-59) searches all node values for any 64-char hex string. While this provides excellent compatibility with varied JSON-LD formats, it could theoretically match unrelated data. Consider limiting the fallback search to a specific depth or removing it entirely if all Databus metadata uses the known keys.

Optional: Restrict to known keys only

```diff
-    # fallback: search all values recursively for a 64-char hex string
-    for v in node.values():
-        res = find_in_value(v)
-        if res:
-            return res
     return None
```
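To make the key-first / recursive-fallback strategy concrete, here is an illustrative standalone re-implementation of the extraction approach described in this comment (the function name, key list, and helper are assumptions based on the review text, not the project's exact source):

```python
import re

HEX64 = re.compile(r"^[0-9a-fA-F]{64}$")

def extract_checksum(node):
    """Best-effort SHA-256 lookup: try known keys first, then any 64-char hex value."""
    def find(value):
        # plain string checksum
        if isinstance(value, str) and HEX64.match(value):
            return value
        # JSON-LD @value object or other nested dict
        if isinstance(value, dict):
            for inner in value.values():
                res = find(inner)
                if res:
                    return res
        # list of candidate values
        if isinstance(value, list):
            for item in value:
                res = find(item)
                if res:
                    return res
        return None

    # 1. targeted lookup under keys commonly used for checksums (assumed list)
    for key in ("checksum", "sha256sum", "sha256", "databus:checksum"):
        if key in node:
            res = find(node[key])
            if res:
                return res
    # 2. fallback: search every value recursively (the behavior flagged above)
    for v in node.values():
        res = find(v)
        if res:
            return res
    return None
```

The reviewer's concern is visible here: step 2 will happily return a 64-char hex string stored under any unrelated key.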
476-489: Consider more specific exception handling for checksum extraction.

The broad `except Exception` clauses provide best-effort behavior for the optional checksum feature, which is appropriate. However, more specific exception handling would improve debuggability while maintaining robustness.

🔎 More specific exception types

Replace broad `except Exception:` with specific types:

```diff
-    except Exception:
+    except (requests.RequestException, json.JSONDecodeError, KeyError, ValueError, TypeError):
         # Best-effort: if fetching a version fails, skip it
         continue
```

Apply similar specificity to the other exception handlers at lines 538, 596, and 814. This catches the expected failure modes (network errors, JSON parsing, missing keys, type mismatches) while allowing truly unexpected exceptions to surface for investigation.
Also applies to: 538-539, 596-597, 814-815
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
databusclient/api/download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
compute_sha256_and_length (8-18)

databusclient/api/utils.py (2)

get_databus_id_parts_from_file_url (6-30)
fetch_databus_jsonld (33-50)
🔇 Additional comments (7)
databusclient/api/download.py (7)
236-253: Checksum validation logic is correct and well-structured.

The validation implementation correctly:
- Computes the actual SHA256 checksum using the imported function
- Handles missing expected checksums and computation errors with warnings (allowing the download to complete)
- Performs case-insensitive comparison (appropriate for hex strings)
- Raises a descriptive error on mismatch
This aligns perfectly with the PR objectives to skip validation when checksums are unavailable but fail on mismatches.
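The warn-versus-raise behavior summarized in these bullets can be sketched as a self-contained helper (names and message strings are paraphrased for illustration, not the project's exact code):

```python
def validate_checksum_result(filename, expected_checksum, actual_checksum):
    """Warn and continue when validation cannot run; raise only on a definite mismatch.

    Returns True if the checksum was verified, False if validation was skipped.
    """
    if expected_checksum is None:
        # metadata carried no checksum for this file: skip, don't fail
        print(f"WARNING: no expected checksum available for {filename}; skipping validation")
        return False
    if actual_checksum is None:
        # hashing the downloaded file failed: skip, don't fail
        print(f"WARNING: could not compute checksum for {filename}; skipping validation")
        return False
    # hex digests compare case-insensitively
    if actual_checksum.lower() != expected_checksum.lower():
        raise IOError(
            f"Checksum mismatch for {filename}: "
            f"expected {expected_checksum}, got {actual_checksum}"
        )
    return True
```

Only a verified mismatch aborts the download; every other degraded case surfaces as a warning.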
256-290: Correct propagation of checksums to individual file downloads.

The function properly extracts expected checksums from the provided mapping and forwards them along with the validation flag to `_download_file`.

525-550: Version-level checksum extraction is correctly implemented.

The function properly extracts checksums from the version's JSON-LD metadata by scanning Part nodes in the @graph. The best-effort exception handling ensures that failures in checksum extraction don't break the download.

583-608: Artifact-level checksum extraction correctly handles multiple versions.

The implementation properly extracts and propagates checksums for each version within an artifact, maintaining consistency with the `_download_version` logic.

683-710: Group-level function correctly propagates validation flag.

The function properly passes through the `validate_checksum` flag to child artifact downloads, where checksum extraction is performed.

801-827: Single-file checksum validation correctly fetches parent metadata.

For single-file downloads with validation enabled, the code appropriately fetches the parent version's JSON-LD metadata to extract the expected checksum, since file URLs don't contain checksum information directly. The matching logic (line 810) checks both `file` and `@id` fields to locate the correct Part node.

756-892: Validation flag is consistently propagated through all download paths.

The `validate_checksum` parameter is correctly added to the main `download` function signature and properly forwarded to all download path functions (collection, version, artifact, group, and query-based downloads).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)
886-894: Checksum validation is ineffective for user-provided SPARQL queries.

When a user provides a raw SPARQL query (the `else` branch), `validate_checksum` is passed to `_download_files` but no `checksums` mapping is provided. This means every file will trigger the "no expected checksum available" warning at line 246, silently degrading to no validation.

Consider either:

- Applying the same version-grouping approach used in `_download_collection` to extract checksums, or
- Warning the user upfront that checksum validation is not supported for raw SPARQL queries.
🔎 Option 2: Warn user upfront

```diff
 else:
     print("QUERY {}", databusURI.replace("\n", " "))
+    if validate_checksum:
+        print("WARNING: Checksum validation is not supported for user-defined SPARQL queries")
     if uri_endpoint is None:
         # endpoint is required for queries (--databus)
         raise ValueError("No endpoint given for query")
```
🧹 Nitpick comments (4)
databusclient/api/download.py (4)
236-254: Consider removing the file on checksum mismatch.

When checksum validation fails, the corrupted/incomplete file remains on disk at `filename`. A user might inadvertently use this file later if they miss the error. Consider deleting the file before raising the `IOError`, or at least document this behavior.

🔎 Proposed fix to clean up on mismatch

```diff
 else:
     if actual.lower() != expected_checksum.lower():
+        try:
+            os.remove(filename)
+        except OSError:
+            pass  # best-effort cleanup
         raise IOError(
             f"Checksum mismatch for (unknown): expected {expected_checksum}, got {actual}"
         )
```
585-600: Consider extracting checksum parsing into a helper function.

The checksum extraction logic (lines 585-600) is nearly identical to lines 527-542 in `_download_version`. Extracting this into a shared helper like `_extract_checksums_from_jsonld(json_str) -> dict` would reduce duplication and centralize the parsing logic.
803-818: Add defensive None-check before constructing version_uri.

If the URL parsing yields `None` for `host`, `account`, `group`, or `artifact`, line 807 will produce a malformed URI like `"https://None/None/..."`. While the exception handler catches this, adding an explicit check makes the failure mode clearer.

🔎 Proposed defensive check

```diff
 if validate_checksum:
+    if not all([host, account, group, artifact, version]):
+        print(f"WARNING: Cannot construct version URI for checksum lookup from {databusURI}")
+    else:
         try:
             version_uri = f"https://{host}/{account}/{group}/{artifact}/{version}"
             # ... rest of the block
         except Exception as e:
             print(f"WARNING: Could not fetch checksum for single file: {e}")
```
17-61: Fallback search may match non-checksum 64-char hex strings.

The fallback (lines 55-59) searches all node values for any 64-character hex string. While the primary key-based lookup targets specific checksum fields (`checksum`, `sha256sum`, `sha256`, `databus:checksum`), the fallback could match other hex identifiers unintentionally. If this is a concern, consider restricting the fallback to specific key patterns or adding logging to identify when the fallback is triggered during debugging.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
databusclient/api/download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
compute_sha256_and_length (8-18)

databusclient/api/utils.py (2)

get_databus_id_parts_from_file_url (6-30)
fetch_databus_jsonld (33-50)
🔇 Additional comments (5)
databusclient/api/download.py (5)
15-16: LGTM!

Import correctly reuses the existing `compute_sha256_and_length` utility from the webdav extension rather than duplicating the SHA256 computation logic.

277-290: LGTM!

The checksum propagation logic correctly resolves per-URL checksums from the mapping and passes them to `_download_file`. The defensive type check on line 279 handles edge cases safely.

455-502: LGTM - Previous review feedback addressed.

The defensive None-checks at lines 468-471 properly handle cases where URL parsing fails partially, addressing the previous review comment. The version-grouping approach efficiently minimizes HTTP requests for checksum metadata.

527-552: LGTM!

Efficiently reuses the already-fetched `json_str` to extract checksums, avoiding an extra HTTP request. The checksum mapping is correctly built and passed to `_download_files`.

685-712: LGTM!

The `validate_checksum` parameter is correctly propagated to `_download_artifact`.
Pull Request
Description
This PR implements the requested --validate-checksum flag for the download command to ensure data integrity.
Key Changes:
CLI: Added the --validate-checksum boolean flag to the download command in cli.py.
Logic: Implemented SHA256 verification in _download_file. It uses the existing compute_sha256_and_length utility from webdav.py to avoid code duplication.
Metadata Extraction: Updated _download_version and _download_artifact to extract expected checksums from the JSON-LD metadata and pass them down to the downloader.
Single File Fix: Added specific logic in download() to fetch the parent Version's JSON-LD when downloading a single file, ensuring checksum validation works for individual file downloads as well.
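The metadata-extraction change above can be sketched as follows. This is a simplified standalone version that assumes Part nodes carry a plain `sha256sum` string; the PR's actual code handles additional node shapes via `_extract_checksum_from_node`:

```python
import json

def build_checksum_map(json_str):
    """Map each Part node's file URI to its expected SHA-256 checksum."""
    checksums = {}
    try:
        graph = json.loads(json_str).get("@graph", [])
    except json.JSONDecodeError:
        # best-effort: unparseable metadata means no validation, not a failure
        return {}
    for node in graph:
        if node.get("@type") != "Part":
            continue
        file_uri = node.get("file")
        expected = node.get("sha256sum")
        # SHA-256 hex digests are exactly 64 characters
        if isinstance(file_uri, str) and isinstance(expected, str) and len(expected) == 64:
            checksums[file_uri] = expected
    return checksums
```

The downloader can then look up each file URL in this map and skip validation (with a warning) for files that have no entry.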
Related Issues
Closes #28
Type of change
Checklist:
`poetry run pytest` - all tests passed
`poetry run ruff check` - no linting errors

Summary by CodeRabbit
New Features
Bug Fixes