Skip to content

feat: implement model downloads from huggingface#18

Open
bsbds wants to merge 1 commit into
xorbitsai:masterfrom
bsbds:feat/model-download-hf
Open

feat: implement model downloads from huggingface#18
bsbds wants to merge 1 commit into
xorbitsai:masterfrom
bsbds:feat/model-download-hf

Conversation

@bsbds
Copy link
Copy Markdown
Collaborator

@bsbds bsbds commented Apr 13, 2026

No description provided.

Copy link
Copy Markdown

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR — the multi-provider abstraction (_candidate_remote_providers / _resolve_remote_location / _download_remote_repo) is clean and the auto-fallback logic is correct.

I'm requesting changes on a few issues:

Critical:

  • The "remote" alias silently remaps from ModelScope to HuggingFace. Users with DEEPDOC_MODEL_PROVIDER=remote will be rerouted without warning. Please either preserve the old mapping or document this as a breaking change.

Major:

  • _normalize_provider is duplicated in config.py and model_store.py. This PR extends both copies, making the duplication riskier. Consider exporting from one location.
  • auto provider now prefers HuggingFace over ModelScope (both scan and download). Previously auto only used ModelScope. This should be called out explicitly.
  • Error messages lost diagnostic detail — the old message listed expected files, env var names, and search paths. The new generic message makes debugging harder.
  • CLI default changed from modelscope to huggingface in both --provider and download_all().

Minor (test gaps):

  • No test for auto discovering pre-downloaded HF artifacts (only MS pre-download is tested).
  • No test for HF per-bundle env override (e.g. DEEPDOC_HUGGINGFACE_VISION_REPO).
  • huggingface_repo_env: str = "" default silently skips per-bundle HF env vars via os.getenv("") → None for custom BundleSpecs.
  • The backward-compat repo_env / repo_default / etc. properties delegate to modelscope fields. Consider updating the one test reference (ms.BUNDLES["vision"].repo_env) directly and removing the properties.

GLOBAL_HUGGINGFACE_REPO_ENV = "DEEPDOC_HUGGINGFACE_REPO"
GLOBAL_HUGGINGFACE_REVISION_ENV = "DEEPDOC_HUGGINGFACE_REVISION"
TOKENIZER_MODEL_DIR_ENV = "DEEPDOC_TOKENIZER_MODEL_DIR"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"remote" alias silently remapped (critical)

"remote" was previously mapped to "modelscope". Changing it to "huggingface" means any user with DEEPDOC_MODEL_PROVIDER=remote will be silently rerouted to HuggingFace after upgrading.

This is particularly impactful for users in regions where HF access is restricted.

Suggestion: Either preserve "remote" → "modelscope" for backward compatibility, or document this as an explicit breaking change.

revision_env: str
revision_default: str = "master"
modelscope_repo_env: str
modelscope_repo_default: str
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto provider priority changed from ModelScope-first to HF-first (major)

Previously auto only tried ModelScope. Now it tries HuggingFace first for both scanning pre-downloaded artifacts and downloading. This means:

  1. Pre-downloaded MS files that previously satisfied auto will be tried after HF files
  2. New downloads go to the HF cache path

This is a bigger behavioral change than just "adding HF support" and should be called out explicitly.

@@ -92,18 +115,24 @@ class BundleSpec:
"tsr.onnx",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error message lost diagnostic detail (major)

The old error listed: expected files, relevant env var names (per-bundle and global), and search paths. This was very helpful for debugging misconfigurations.

The new aggregated error is generic: "Failed to resolve bundle 'X'. ...". Consider including the expected files list and relevant env var names in the aggregated error, e.g.:

raise FileNotFoundError(
    "Failed to resolve bundle '{}' (expected files: {}). {}".format(
        spec.name,
        ", ".join(spec.required_files),
        " ".join(failures),
    )
)


GLOBAL_MODELSCOPE_REPO_ENV = "DEEPDOC_MODELSCOPE_REPO"
GLOBAL_MODELSCOPE_REVISION_ENV = "DEEPDOC_MODELSCOPE_REVISION"
GLOBAL_HUGGINGFACE_REPO_ENV = "DEEPDOC_HUGGINGFACE_REPO"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huggingface_repo_env: str = "" is a minor footgun

When huggingface_repo_env is "", os.getenv("") returns None, silently skipping per-bundle HF env var resolution. All current BUNDLES entries override this, so it works today — but anyone adding a custom BundleSpec without specifying huggingface_repo_env would get silent behavior.

Consider using a sentinel value or documenting the empty-string default more clearly.

Comment thread deepdoc/config.py


def _normalize_provider(provider: str) -> ProviderType:
normalized = provider.strip().lower()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_normalize_provider is duplicated (major)

The same function (with the same aliases) exists in both config.py and model_store.py. This PR extends both copies identically, which works but increases the risk of them diverging in the future.

Consider exporting _normalize_provider from model_store.py and importing it here, or vice versa.

default="modelscope",
choices=("auto", "local", "modelscope"),
default="huggingface",
choices=("auto", "local", "huggingface", "modelscope"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default provider changed from modelscope to huggingface (major)

Both the CLI --provider default and the download_all() function default changed. Existing users running deepdoc-download-models with no args will download from HuggingFace instead of ModelScope.

The README does note the new default, which is good. Just flagging this as a breaking default change for awareness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants