feat: implement model downloads from huggingface#18
Conversation
rogercloud
left a comment
There was a problem hiding this comment.
Thanks for the PR — the multi-provider abstraction (_candidate_remote_providers / _resolve_remote_location / _download_remote_repo) is clean and the auto-fallback logic is correct.
I'm requesting changes on a few issues:
Critical:
- The
"remote"alias silently remaps from ModelScope to HuggingFace. Users withDEEPDOC_MODEL_PROVIDER=remotewill be rerouted without warning. Please either preserve the old mapping or document this as a breaking change.
Major:
_normalize_provideris duplicated inconfig.pyandmodel_store.py. This PR extends both copies, making the duplication riskier. Consider exporting from one location.autoprovider now prefers HuggingFace over ModelScope (both scan and download). Previouslyautoonly used ModelScope. This should be called out explicitly.- Error messages lost diagnostic detail — the old message listed expected files, env var names, and search paths. The new generic message makes debugging harder.
- CLI default changed from
modelscopetohuggingfacein both--provideranddownload_all().
Minor (test gaps):
- No test for
autodiscovering pre-downloaded HF artifacts (only MS pre-download is tested). - No test for HF per-bundle env override (e.g.
DEEPDOC_HUGGINGFACE_VISION_REPO). huggingface_repo_env: str = ""default silently skips per-bundle HF env vars viaos.getenv("") → Nonefor custom BundleSpecs.- The backward-compat
repo_env/repo_default/ etc. properties delegate to modelscope fields. Consider updating the one test reference (ms.BUNDLES["vision"].repo_env) directly and removing the properties.
| GLOBAL_HUGGINGFACE_REPO_ENV = "DEEPDOC_HUGGINGFACE_REPO" | ||
| GLOBAL_HUGGINGFACE_REVISION_ENV = "DEEPDOC_HUGGINGFACE_REVISION" | ||
| TOKENIZER_MODEL_DIR_ENV = "DEEPDOC_TOKENIZER_MODEL_DIR" | ||
|
|
There was a problem hiding this comment.
"remote" alias silently remapped (critical)
"remote" was previously mapped to "modelscope". Changing it to "huggingface" means any user with DEEPDOC_MODEL_PROVIDER=remote will be silently rerouted to HuggingFace after upgrading.
This is particularly impactful for users in regions where HF access is restricted.
Suggestion: Either preserve "remote" → "modelscope" for backward compatibility, or document this as an explicit breaking change.
| revision_env: str | ||
| revision_default: str = "master" | ||
| modelscope_repo_env: str | ||
| modelscope_repo_default: str |
There was a problem hiding this comment.
auto provider priority changed from ModelScope-first to HF-first (major)
Previously auto only tried ModelScope. Now it tries HuggingFace first for both scanning pre-downloaded artifacts and downloading. This means:
- Pre-downloaded MS files that previously satisfied
autowill be tried after HF files - New downloads go to the HF cache path
This is a bigger behavioral change than just "adding HF support" and should be called out explicitly.
| @@ -92,18 +115,24 @@ class BundleSpec: | |||
| "tsr.onnx", | |||
There was a problem hiding this comment.
Error message lost diagnostic detail (major)
The old error listed: expected files, relevant env var names (per-bundle and global), and search paths. This was very helpful for debugging misconfigurations.
The new aggregated error is generic: "Failed to resolve bundle 'X'. ...". Consider including the expected files list and relevant env var names in the aggregated error, e.g.:
raise FileNotFoundError(
"Failed to resolve bundle '{}' (expected files: {}). {}".format(
spec.name,
", ".join(spec.required_files),
" ".join(failures),
)
)|
|
||
| GLOBAL_MODELSCOPE_REPO_ENV = "DEEPDOC_MODELSCOPE_REPO" | ||
| GLOBAL_MODELSCOPE_REVISION_ENV = "DEEPDOC_MODELSCOPE_REVISION" | ||
| GLOBAL_HUGGINGFACE_REPO_ENV = "DEEPDOC_HUGGINGFACE_REPO" |
There was a problem hiding this comment.
huggingface_repo_env: str = "" is a minor footgun
When huggingface_repo_env is "", os.getenv("") returns None, silently skipping per-bundle HF env var resolution. All current BUNDLES entries override this, so it works today — but anyone adding a custom BundleSpec without specifying huggingface_repo_env would get silent behavior.
Consider using a sentinel value or documenting the empty-string default more clearly.
|
|
||
|
|
||
| def _normalize_provider(provider: str) -> ProviderType: | ||
| normalized = provider.strip().lower() |
There was a problem hiding this comment.
_normalize_provider is duplicated (major)
The same function (with the same aliases) exists in both config.py and model_store.py. This PR extends both copies identically, which works but increases the risk of them diverging in the future.
Consider exporting _normalize_provider from model_store.py and importing it here, or vice versa.
| default="modelscope", | ||
| choices=("auto", "local", "modelscope"), | ||
| default="huggingface", | ||
| choices=("auto", "local", "huggingface", "modelscope"), |
There was a problem hiding this comment.
Default provider changed from modelscope to huggingface (major)
Both the CLI --provider default and the download_all() function default changed. Existing users running deepdoc-download-models with no args will download from HuggingFace instead of ModelScope.
The README does note the new default, which is good. Just flagging this as a breaking default change for awareness.
No description provided.