
fix: qnn-aot: correct block size for qwen-4b and clean code#661

Merged
chenghuaWang merged 3 commits into UbiquitousLearning:main from LXY1226:main
Mar 29, 2026

Conversation

Contributor

@LXY1226 LXY1226 commented Mar 25, 2026

Changes:

The linear block size used during quantization was hardcoded to 16; it is now read from the config file.
Added a configurable maximum token count (8) for the post-quantization inference test stage, to avoid wasting compute on generating too many tokens (on machines without enough VRAM, CPU inference runs at roughly 50 s/token [AMD Ryzen 8845H / 48 GB RAM]).
While vibe-cleaning the diff into a tidy state, the model also extracted some duplicated code segments into shared helpers.

From my perspective the code looks fine, and it works on my machine, but since I'm new to LLMs, I'd appreciate pointers on any remaining issues.

resolves #659
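The "read block size from the config file" change can be sketched as a small loader (hypothetical helper; only the `linear_block_size` field name and the 16/32 per-model values come from this PR, the file handling is illustrative):

```python
import json

def load_linear_block_size(config_path: str) -> int:
    """Read the LPBQ linear block size from a model config JSON instead of hardcoding 16."""
    with open(config_path) as f:
        config = json.load(f)
    block_size = config["linear_block_size"]  # e.g. 16 for the 1.7B config, 32 for 4B
    if not isinstance(block_size, int) or block_size < 1:
        raise ValueError("linear_block_size must be a positive integer")
    return block_size
```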

Summary by CodeRabbit

  • New Features

    • Configs gain a configurable linear_block_size for block-based linear ops (validated at model load).
    • Inference supports max_new_tokens (default 8) to control generation length.
    • CLI adds --output_context_name to specify the output context file (now required).
  • Refactoring

    • Compilation utilities extracted into a shared module and trace/input construction consolidated for reuse.

@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

📝 Walkthrough


Refactors Qwen3 QNN-AOT example to extract causal-mask and trace-input helpers, consolidate MIR trace/dump logic, require an output context name, add configurable LPBQ linear block sizes, and centralize model input construction with token-budget validation for inference and calibration.

Changes

  • QNN‑AOT compile callers (examples/qwen3_qnn_aot/compile.cpp, examples/qwen3_qnn_aot/compile_sha.cpp): Removed inline causal-mask and per-seq trace/input construction; added --output_context_name/-o enforcement; replaced duplicated tracing blocks with a shared trace_and_dump lambda and a kContextLength constant; now call shared helpers.
  • QNN‑AOT common helpers (examples/qwen3_qnn_aot/compile_common.hpp): New header providing qwen3_qnn_aot::addCausalMaskParams(params) and qwen3_qnn_aot::makeTraceInputs(seq_len, context_len, model_cfg, params) to create causal-mask params and per-layer KV trace tensors with fake-quant attachments.
  • Configs, LPBQ block size (examples/qwen3_qnn_aot/config_1.7B.json, examples/qwen3_qnn_aot/config_4B.json): Added a linear_block_size field (1.7B: 16, 4B: 32) to the model config JSONs.
  • Model wiring, LPBQ block size validation and usage (pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py): Added normalize_qwen3_lpbq_block_size(config) and updated the LPBQ linear layers (MLP, Attention, LM head) to use the validated config.linear_block_size, storing it as self.block_size.
  • Inference and calibration input handling (pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py, pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py): Introduced _build_model_inputs() to centralize prompt rendering/tokenization and device placement; infer() accepts max_new_tokens with token-budget validation and uses the computed input_length for slicing; the training CLI adds --infer_max_new_tokens and calls infer accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • liang1232018
  • oreomaker

Poem

🐰 I nibble code, I hop and prune,
Helpers sprout where dupe blocks were strewn.
Block sizes set, the traces run,
Name your context, then we’re done—
Hooray for carrots, compile won! 🥕

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 7.14%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check ❓ Inconclusive: The PR description is written in Chinese and covers the main changes, but lacks the structured sections expected by the repository template. Resolution: reformat the description to match the template, with clear sections for changes, testing, and related issues.

✅ Passed checks (3 passed)

  • Title check ✅: The title 'fix: qnn-aot: correct block size for qwen-4b and clean code' directly reflects the main changes: configuring block size from config files and code refactoring.
  • Linked Issues check ✅: The PR addresses #659 by making block size configurable (fixing the 4B model) and adds token limiting for inference testing to manage resource usage on CPU.
  • Out of Scope Changes check ✅: All changes directly support the objectives: block size configuration for Qwen3-4B support, configurable inference tokens, and code deduplication to improve maintainability.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
examples/qwen3_qnn_aot/compile.cpp (1)

27-27: Consider deriving the trace cache length from config.

makeTraceInputs(...) sizes every past_key_* and past_value_* tensor from this constant, while Qwen3Config already exposes max_cache_length. Freezing 1024 here means a JSON config change can silently compile the wrong cache shape. The same constant is duplicated in examples/qwen3_qnn_aot/compile_sha.cpp.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3_qnn_aot/compile.cpp` at line 27, Replace the hardcoded
constexpr kContextLength = 1024 with a value derived from the model config: read
Qwen3Config::max_cache_length (or equivalent accessor) and use that to size the
trace cache in makeTraceInputs and any other places; update both compile.cpp and
compile_sha.cpp to use the config-derived length so changes to max_cache_length
in the JSON produce matching past_key_*/past_value_* tensor shapes at compile
time (ensure the symbol kContextLength is removed or replaced and references are
switched to the config-based variable).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03d11f60-1583-4d0d-b104-e82d493b43c5

📥 Commits

Reviewing files that changed from the base of the PR and between 05c75f8 and 453bb0a.

📒 Files selected for processing (8)
  • examples/qwen3_qnn_aot/compile.cpp
  • examples/qwen3_qnn_aot/compile_common.hpp
  • examples/qwen3_qnn_aot/compile_sha.cpp
  • examples/qwen3_qnn_aot/config_1.7B.json
  • examples/qwen3_qnn_aot/config_4B.json
  • pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py
  • pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py
  • pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py

auto& qnn_env_path = Argparse::add<std::string>("-qnn_env|--qnn_env_path")
                         .def("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/")
                         .help("QNN AOT Environment path.");
auto& output_context_path = Argparse::add<std::string>("-o|--output_context_name").help("Output QNN context path.");
Contributor


⚠️ Potential issue | 🟡 Minor

Update the source usage example for the new required flag.

The banner at Lines 9-10 still omits -o, so copying the documented command now hits the new "No output context path provided" exit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3_qnn_aot/compile_sha.cpp` at line 31, The example banner/usage
text in examples/qwen3_qnn_aot/compile_sha.cpp does not include the new required
-o/--output_context_name flag, so users copying the command hit the "No output
context path provided" exit; update the usage/banner string (the printed help
message displayed at program start—where you currently show the input flags) to
include "-o|--output_context_name <path>" (or its short form -o) so it matches
the Argparse::add<std::string>("-o|--output_context_name") declaration and
prevents confusion.

trace_and_dump(1, "qwen3_qnn_aot_1.mir");

qnn_aot_env.saveContext("context.0", "qwen3-1.7B-lpbq.bin");
qnn_aot_env.saveContext("context.0", output_context_path.get());
Contributor


⚠️ Potential issue | 🟠 Major

Propagate saveContext failures to the CLI.

QnnAOTEnv::saveContext only logs and returns on missing contexts or file-open failures, so this command can still exit successfully with no context file when -o points at an invalid destination. Please verify the file was written, or change saveContext to return status, before returning here. The identical call in examples/qwen3_qnn_aot/compile_sha.cpp needs the same guard.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3_qnn_aot/compile.cpp` at line 67, qnn_aot_env.saveContext
currently only logs failures and doesn't propagate errors, so update the compile
flow to detect failure and return a non-zero exit code: either modify
QnnAOTEnv::saveContext to return a bool/status and check that result after
calling qnn_aot_env.saveContext("context.0", output_context_path.get()), or
after the call verify the file was actually written (e.g., attempt to open the
output path) and if it fails log an error and exit non-zero; apply the same
change/guard to the identical call in examples/qwen3_qnn_aot/compile_sha.cpp so
the CLI fails when the context file cannot be created.

Comment on lines +61 to +69
def normalize_qwen3_lpbq_block_size(config: Qwen3Config) -> int:
    block_size = getattr(config, "linear_block_size", None)
    if isinstance(block_size, int) and block_size > 0:
        config.linear_block_size = block_size
        return block_size

    raise ValueError(
        "Qwen3 LPBQ requires a positive `linear_block_size` in model config"
    )
Contributor


⚠️ Potential issue | 🟠 Major

Validate LPBQ divisibility constraints at config-normalization time.

Line 61 currently validates only positivity. LPBQ layers also require each relevant in_features dimension to be divisible by linear_block_size; otherwise failure appears later (e.g., during deploy conversion) instead of failing fast with a clear config error.

🔧 Proposed fix
 def normalize_qwen3_lpbq_block_size(config: Qwen3Config) -> int:
     block_size = getattr(config, "linear_block_size", None)
     if isinstance(block_size, int) and block_size > 0:
+        head_dim = getattr(
+            config, "head_dim", config.hidden_size // config.num_attention_heads
+        )
+        in_feature_dims = {
+            "hidden_size": config.hidden_size,  # q/k/v/lm_head in_features
+            "intermediate_size": config.intermediate_size,  # mlp down_proj in_features
+            "attn_output_size": config.num_attention_heads * head_dim,  # o_proj in_features
+        }
+        for name, dim in in_feature_dims.items():
+            if dim % block_size != 0:
+                raise ValueError(
+                    f"`linear_block_size` ({block_size}) must divide {name} ({dim})"
+                )
         config.linear_block_size = block_size
         return block_size
 
     raise ValueError(
         "Qwen3 LPBQ requires a positive `linear_block_size` in model config"
     )

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around
lines 61 - 69, normalize_qwen3_lpbq_block_size currently only checks positivity
of config.linear_block_size; update it to also fail fast when any relevant
feature dimension in the Qwen3Config is not divisible by linear_block_size. In
the normalize_qwen3_lpbq_block_size function, after computing/validating
block_size, iterate over config attributes (e.g., inspect config.__dict__) and
check integer attributes whose names indicate feature dimensions (for example
names containing "size", "dim", "features", or "in_features" such as hidden_size
or intermediate_size) and collect any that are not divisible by block_size; if
any are found, raise a ValueError listing the offending attribute names and
values and the expected divisibility by linear_block_size. Ensure the raised
error message mentions normalize_qwen3_lpbq_block_size, Qwen3Config, and
linear_block_size for clear debugging.
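The reviewer's divisibility check can be sketched as a standalone function (the dimension names follow the proposed fix; the concrete sizes in the comments are illustrative, not taken from the shipped configs):

```python
def check_lpbq_block_size(block_size, hidden_size, intermediate_size,
                          num_attention_heads, head_dim=None):
    """Validate that every LPBQ in_features dimension is divisible by block_size."""
    if not isinstance(block_size, int) or block_size < 1:
        raise ValueError("LPBQ requires a positive `linear_block_size`")
    if head_dim is None:
        head_dim = hidden_size // num_attention_heads
    in_feature_dims = {
        "hidden_size": hidden_size,                          # q/k/v/lm_head in_features
        "intermediate_size": intermediate_size,              # mlp down_proj in_features
        "attn_output_size": num_attention_heads * head_dim,  # o_proj in_features
    }
    for name, dim in in_feature_dims.items():
        if dim % block_size != 0:
            raise ValueError(
                f"`linear_block_size` ({block_size}) must divide {name} ({dim})"
            )
    return block_size
```

Failing fast here surfaces a clear config error at load time instead of a cryptic failure during deploy conversion.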

Comment on lines +254 to +260
def infer(self, prompt: str, max_new_tokens: int = 8):
    model_inputs = self._build_model_inputs(prompt)
    input_length = model_inputs.input_ids.shape[1]
    available_tokens = self.mllm_qualcomm_max_length - input_length - 1
    if available_tokens < 1:
        raise ValueError("Prompt exceeds configured mllm_qualcomm_max_length")
    max_new_tokens = min(max_new_tokens, available_tokens)
Contributor


⚠️ Potential issue | 🟠 Major

Reject non-positive max_new_tokens before budgeting.

Line 260 only clamps the upper bound; non-positive inputs still pass through. Add an explicit lower-bound check.

Proposed fix
 def infer(self, prompt: str, max_new_tokens: int = 8):
+    if max_new_tokens < 1:
+        raise ValueError("max_new_tokens must be >= 1")
     model_inputs = self._build_model_inputs(prompt)
     input_length = model_inputs.input_ids.shape[1]
     available_tokens = self.mllm_qualcomm_max_length - input_length - 1
     if available_tokens < 1:
         raise ValueError("Prompt exceeds configured mllm_qualcomm_max_length")
     max_new_tokens = min(max_new_tokens, available_tokens)

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
254 - 260, The infer method currently only clamps max_new_tokens to the
available budget but doesn't reject non-positive values; before computing
available_tokens (and after building model_inputs), validate the incoming
max_new_tokens parameter and raise ValueError if max_new_tokens < 1, ensuring
you still compute input_length via self._build_model_inputs(prompt) and respect
self.mllm_qualcomm_max_length when later clamping to available_tokens.
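The budgeting logic with the suggested lower-bound check can be sketched as a pure function (names mirror the snippet; mllm_qualcomm_max_length is passed in as max_length here):

```python
def token_budget(max_new_tokens: int, input_length: int, max_length: int) -> int:
    """Clamp max_new_tokens to the remaining context budget, rejecting bad inputs."""
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be >= 1")
    available_tokens = max_length - input_length - 1
    if available_tokens < 1:
        raise ValueError("Prompt exceeds configured max length")
    return min(max_new_tokens, available_tokens)
```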

Comment on lines +326 to 328
model_inputs = self._build_model_inputs(
    entry["text"], max_length=max_seq_length
)
Contributor


⚠️ Potential issue | 🟠 Major

Validate calibration max_seq_length against configured model max length.

Line 326 forwards max_seq_length directly without enforcing the documented constraint (not exceeding mllm_qualcomm_max_length). Add a guard at calibration entry.

Proposed fix
 def calibrate(self, num_samples=64, max_seq_length=512):
+    if max_seq_length < 1 or max_seq_length > self.mllm_qualcomm_max_length:
+        raise ValueError(
+            f"max_seq_length must be in [1, {self.mllm_qualcomm_max_length}]"
+        )

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
326 - 328, The calibration code currently passes max_seq_length directly into
_build_model_inputs; add a guard in the calibration entry path to validate and
clamp/raise if max_seq_length exceeds the configured model max
(mllm_qualcomm_max_length) before calling _build_model_inputs (reference
symbols: max_seq_length, _build_model_inputs, mllm_qualcomm_max_length,
model_inputs); implement: retrieve the configured max, if provided
max_seq_length > mllm_qualcomm_max_length either set max_seq_length =
mllm_qualcomm_max_length or raise a clear ValueError, then call
_build_model_inputs with the validated value.

Comment on lines +30 to +35
parser.add_argument(
    "--infer_max_new_tokens",
    type=int,
    default=8,
    help="Maximum new tokens for post-calibration inference sanity check",
)
Contributor


⚠️ Potential issue | 🟠 Major

Validate --infer_max_new_tokens as a positive integer.

Line 30 currently accepts any integer; 0 or negative values can propagate and break inference. Guard this at parse time.

Proposed fix
+def positive_int(value: str) -> int:
+    ivalue = int(value)
+    if ivalue < 1:
+        raise argparse.ArgumentTypeError("must be >= 1")
+    return ivalue
+
 ...
     parser.add_argument(
         "--infer_max_new_tokens",
-        type=int,
+        type=positive_int,
         default=8,
         help="Maximum new tokens for post-calibration inference sanity check",
     )

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py` around lines 30
- 35, The CLI currently allows non-positive values for --infer_max_new_tokens
which can break inference; change the parser to enforce a positive integer by
either supplying a custom argparse type (e.g., positive_int that raises
ValueError for <=0) to parser.add_argument("--infer_max_new_tokens", ...) or,
after parsing, check args.infer_max_new_tokens <= 0 and call
parser.error("infer_max_new_tokens must be a positive integer"); reference
parser.add_argument and the --infer_max_new_tokens flag when making the change.
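The custom argparse type the review suggests can be sketched in isolation (the flag name matches the PR; the parser setup is a minimal stand-in for train.py):

```python
import argparse

def positive_int(value: str) -> int:
    """argparse type that accepts only integers >= 1."""
    ivalue = int(value)
    if ivalue < 1:
        raise argparse.ArgumentTypeError(f"{value} is not a positive integer")
    return ivalue

parser = argparse.ArgumentParser()
parser.add_argument(
    "--infer_max_new_tokens",
    type=positive_int,
    default=8,
    help="Maximum new tokens for post-calibration inference sanity check",
)
args = parser.parse_args(["--infer_max_new_tokens", "16"])
```

With this type, argparse rejects 0 or negative values at parse time with a usage error instead of letting them reach infer().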

@chenghuaWang
Collaborator

chenghuaWang commented Mar 27, 2026

@LXY1226

  1. Please change the maximum token count (8) to 1024. Calibration is still recommended on a GPU.
  2. Which QNN version and Qualcomm chip did you test with?

@LXY1226
Contributor Author

LXY1226 commented Mar 27, 2026

@LXY1226

  1. Please change the maximum token count (8) to 1024. Calibration is still recommended on a GPU.
  2. Which QNN version and Qualcomm chip did you test with?
  1. Done. Out of curiosity though: on a normal GPU that doesn't run out of VRAM, roughly how fast are quantization (per sample) and inference (per token)? I'd like to see how big the gap is 🤕
  2. QNN: 2.44.0.260225; chip: SM8850 (Snapdragon 8 Gen 5 Elite)
    One more note: the 4B model fails to convert on my older device (SM7450, Snapdragon 7 Gen 2+ / V69), which reports a 2 GB size limit. Codex dug around and concluded that V81 (apparently only SM8850) and above can support 2 GB+ (conclusion uncertain). Hope this is useful.

/ping @chenghuaWang

@chenghuaWang
Collaborator

Thanks for your contribution.

  1. Done. Out of curiosity though: on a normal GPU that doesn't run out of VRAM, roughly how fast are quantization (per sample) and inference (per token)? I'd like to see how big the gap is 🤕

On an H20 GPU, 128 samples take only a few minutes. I haven't measured GPU inference speed (per token), but since QNN inference has to use eager attention, it is much slower than a flash-attention implementation.

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (2)

270-276: ⚠️ Potential issue | 🟡 Minor

Non-positive max_new_tokens values are not rejected.

The current implementation only clamps the upper bound but does not validate that max_new_tokens >= 1. A value of 0 or negative would pass through and could cause unexpected behavior in generate().

Proposed fix
     def infer(self, prompt: str, max_new_tokens: int = 8):
+        if max_new_tokens < 1:
+            raise ValueError("max_new_tokens must be >= 1")
         model_inputs = self._build_model_inputs(prompt)
         input_length = model_inputs.input_ids.shape[1]

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
270 - 276, The infer method allows zero or negative max_new_tokens to proceed;
add input validation in infer (function infer) to reject non-positive values by
raising ValueError if the caller passes max_new_tokens < 1, and also ensure
after computing available_tokens and applying max_new_tokens =
min(max_new_tokens, available_tokens) that the resulting max_new_tokens is still
>= 1 (otherwise raise the same ValueError mentioning mllm_qualcomm_max_length
and available_tokens); reference model_inputs, input_length and
self.mllm_qualcomm_max_length when performing these checks so generate() never
receives a non-positive token count.

306-347: ⚠️ Potential issue | 🟡 Minor

Validate max_seq_length against configured model limits.

The docstring states max_seq_length should not exceed mllm_qualcomm_max_length, but this constraint is not enforced. Passing a value larger than the configured limit could cause incorrect model behavior.

Proposed fix at method entry
     def calibrate(self, num_samples=64, max_seq_length=512):
         """
         Perform calibration using Wikipedia dataset (PTQ)
         :param num_samples: Number of samples for calibration
         :param max_seq_length: Maximum length for each sample (not exceeding mllm_qualcomm_max_length)
         """
+        if max_seq_length < 1 or max_seq_length > self.mllm_qualcomm_max_length:
+            raise ValueError(
+                f"max_seq_length must be in [1, {self.mllm_qualcomm_max_length}]"
+            )
         print(
             f"Starting calibration, samples: {num_samples}, max length: {max_seq_length}"
         )

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
306 - 347, In calibrate, validate the incoming max_seq_length against the
model's configured limit (mllm_qualcomm_max_length) at the start of the method:
fetch the configured limit from the instance (e.g.,
self.mllm_qualcomm_max_length or self.config.mllm_qualcomm_max_length), and if
max_seq_length is greater than that limit either clamp it to the limit or raise
a ValueError (choose team convention); then continue using the validated/clamped
value when calling _build_model_inputs and elsewhere in the method to ensure the
calibration never uses a longer sequence than the model supports.
🧹 Nitpick comments (1)
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (1)

234-249: Add docstring and validate max_length when provided.

This new helper method centralizes tokenization logic well. Consider adding:

  1. A brief docstring explaining purpose, parameters, and return type
  2. Validation that max_length is positive when provided to fail fast with a clear error
Proposed enhancement
     def _build_model_inputs(self, prompt: str, max_length: int | None = None):
+        """
+        Build model inputs from a prompt string.
+
+        Args:
+            prompt: The user prompt text.
+            max_length: Optional maximum sequence length for truncation.
+
+        Returns:
+            Tokenized inputs on the model device.
+        """
+        if max_length is not None and max_length < 1:
+            raise ValueError("max_length must be >= 1")
         messages = [{"role": "user", "content": prompt}]

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
234 - 249, Add a concise docstring to _build_model_inputs describing its
purpose, parameters (prompt: str, max_length: int | None), return type
(tokenized tensors on self.model.device) and possible errors; also validate
max_length when provided (ensure it's an int > 0) and raise a ValueError with a
clear message if invalid, then proceed with the existing tokenization logic
(references: _build_model_inputs, self.tokenizer, tokenizer_kwargs,
self.model.device).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: be328b84-2fdb-4144-98b4-d63c283b964d

📥 Commits

Reviewing files that changed from the base of the PR and between efa029c and 4222759.

📒 Files selected for processing (1)
  • pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py

Collaborator

@chenghuaWang chenghuaWang left a comment


LGTM

@chenghuaWang chenghuaWang merged commit 690b859 into UbiquitousLearning:main Mar 29, 2026
1 check passed


Development

Successfully merging this pull request may close these issues.

QNN-AOT Qwen3-4B not working

2 participants