fix: qnn-aot: correct block size for qwen-4b and clean code #661

chenghuaWang merged 3 commits into UbiquitousLearning:main from
Conversation
📝 Walkthrough

Refactors the Qwen3 QNN-AOT example to extract causal-mask and trace-input helpers, consolidate MIR trace/dump logic, require an output context name, add configurable LPBQ linear block sizes, and centralize model input construction with token-budget validation for inference and calibration.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 6
🧹 Nitpick comments (1)
examples/qwen3_qnn_aot/compile.cpp (1)
27-27: Consider deriving the trace cache length from config.
`makeTraceInputs(...)` sizes every `past_key_*` and `past_value_*` tensor from this constant, while `Qwen3Config` already exposes `max_cache_length`. Freezing 1024 here means a JSON config change can silently compile the wrong cache shape. The same constant is duplicated in `examples/qwen3_qnn_aot/compile_sha.cpp`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `examples/qwen3_qnn_aot/compile.cpp` at line 27, replace the hardcoded `constexpr kContextLength = 1024` with a value derived from the model config: read `Qwen3Config::max_cache_length` (or an equivalent accessor) and use it to size the trace cache in `makeTraceInputs` and any other places; update both `compile.cpp` and `compile_sha.cpp` to use the config-derived length so changes to `max_cache_length` in the JSON produce matching `past_key_*`/`past_value_*` tensor shapes at compile time (ensure the symbol `kContextLength` is removed or replaced and references are switched to the config-based variable).
📒 Files selected for processing (8)

- examples/qwen3_qnn_aot/compile.cpp
- examples/qwen3_qnn_aot/compile_common.hpp
- examples/qwen3_qnn_aot/compile_sha.cpp
- examples/qwen3_qnn_aot/config_1.7B.json
- examples/qwen3_qnn_aot/config_4B.json
- pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py
- pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py
- pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py
```cpp
auto& qnn_env_path = Argparse::add<std::string>("-qnn_env|--qnn_env_path")
                         .def("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/")
                         .help("QNN AOT Environment path.");
auto& output_context_path = Argparse::add<std::string>("-o|--output_context_name").help("Output QNN context path.");
```
Update the source usage example for the new required flag.

The banner at lines 9-10 still omits `-o`, so copying the documented command now hits the new "No output context path provided" exit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen3_qnn_aot/compile_sha.cpp` at line 31, The example banner/usage
text in examples/qwen3_qnn_aot/compile_sha.cpp does not include the new required
-o/--output_context_name flag, so users copying the command hit the "No output
context path provided" exit; update the usage/banner string (the printed help
message displayed at program start—where you currently show the input flags) to
include "-o|--output_context_name <path>" (or its short form -o) so it matches
the Argparse::add<std::string>("-o|--output_context_name") declaration and
prevents confusion.
```diff
  trace_and_dump(1, "qwen3_qnn_aot_1.mir");

- qnn_aot_env.saveContext("context.0", "qwen3-1.7B-lpbq.bin");
+ qnn_aot_env.saveContext("context.0", output_context_path.get());
```
Propagate saveContext failures to the CLI.

`QnnAOTEnv::saveContext` only logs and returns on missing contexts or file-open failures, so this command can still exit successfully with no context file when `-o` points at an invalid destination. Please verify the file was written, or change `saveContext` to return a status, before returning here. The identical call in `examples/qwen3_qnn_aot/compile_sha.cpp` needs the same guard.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen3_qnn_aot/compile.cpp` at line 67, qnn_aot_env.saveContext
currently only logs failures and doesn't propagate errors, so update the compile
flow to detect failure and return a non-zero exit code: either modify
QnnAOTEnv::saveContext to return a bool/status and check that result after
calling qnn_aot_env.saveContext("context.0", output_context_path.get()), or
after the call verify the file was actually written (e.g., attempt to open the
output path) and if it fails log an error and exit non-zero; apply the same
change/guard to the identical call in examples/qwen3_qnn_aot/compile_sha.cpp so
the CLI fails when the context file cannot be created.
```python
def normalize_qwen3_lpbq_block_size(config: Qwen3Config) -> int:
    block_size = getattr(config, "linear_block_size", None)
    if isinstance(block_size, int) and block_size > 0:
        config.linear_block_size = block_size
        return block_size

    raise ValueError(
        "Qwen3 LPBQ requires a positive `linear_block_size` in model config"
    )
```
Validate LPBQ divisibility constraints at config-normalization time.

Line 61 currently validates only positivity. LPBQ layers also require each relevant `in_features` dimension to be divisible by `linear_block_size`; otherwise the failure appears later (e.g., during deploy conversion) instead of failing fast with a clear config error.
🔧 Proposed fix

```diff
 def normalize_qwen3_lpbq_block_size(config: Qwen3Config) -> int:
     block_size = getattr(config, "linear_block_size", None)
     if isinstance(block_size, int) and block_size > 0:
+        head_dim = getattr(
+            config, "head_dim", config.hidden_size // config.num_attention_heads
+        )
+        in_feature_dims = {
+            "hidden_size": config.hidden_size,  # q/k/v/lm_head in_features
+            "intermediate_size": config.intermediate_size,  # mlp down_proj in_features
+            "attn_output_size": config.num_attention_heads * head_dim,  # o_proj in_features
+        }
+        for name, dim in in_feature_dims.items():
+            if dim % block_size != 0:
+                raise ValueError(
+                    f"`linear_block_size` ({block_size}) must divide {name} ({dim})"
+                )
         config.linear_block_size = block_size
         return block_size
     raise ValueError(
         "Qwen3 LPBQ requires a positive `linear_block_size` in model config"
     )
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around
lines 61 - 69, normalize_qwen3_lpbq_block_size currently only checks positivity
of config.linear_block_size; update it to also fail fast when any relevant
feature dimension in the Qwen3Config is not divisible by linear_block_size. In
the normalize_qwen3_lpbq_block_size function, after computing/validating
block_size, iterate over config attributes (e.g., inspect config.__dict__) and
check integer attributes whose names indicate feature dimensions (for example
names containing "size", "dim", "features", or "in_features" such as hidden_size
or intermediate_size) and collect any that are not divisible by block_size; if
any are found, raise a ValueError listing the offending attribute names and
values and the expected divisibility by linear_block_size. Ensure the raised
error message mentions normalize_qwen3_lpbq_block_size, Qwen3Config, and
linear_block_size for clear debugging.
```python
def infer(self, prompt: str, max_new_tokens: int = 8):
    model_inputs = self._build_model_inputs(prompt)
    input_length = model_inputs.input_ids.shape[1]
    available_tokens = self.mllm_qualcomm_max_length - input_length - 1
    if available_tokens < 1:
        raise ValueError("Prompt exceeds configured mllm_qualcomm_max_length")
    max_new_tokens = min(max_new_tokens, available_tokens)
```
Reject non-positive `max_new_tokens` before budgeting.

Line 260 only clamps the upper bound; non-positive inputs still pass through. Add an explicit lower-bound check.
Proposed fix

```diff
 def infer(self, prompt: str, max_new_tokens: int = 8):
+    if max_new_tokens < 1:
+        raise ValueError("max_new_tokens must be >= 1")
     model_inputs = self._build_model_inputs(prompt)
     input_length = model_inputs.input_ids.shape[1]
     available_tokens = self.mllm_qualcomm_max_length - input_length - 1
     if available_tokens < 1:
         raise ValueError("Prompt exceeds configured mllm_qualcomm_max_length")
     max_new_tokens = min(max_new_tokens, available_tokens)
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
254 - 260, The infer method currently only clamps max_new_tokens to the
available budget but doesn't reject non-positive values; before computing
available_tokens (and after building model_inputs), validate the incoming
max_new_tokens parameter and raise ValueError if max_new_tokens < 1, ensuring
you still compute input_length via self._build_model_inputs(prompt) and respect
self.mllm_qualcomm_max_length when later clamping to available_tokens.
```python
model_inputs = self._build_model_inputs(
    entry["text"], max_length=max_seq_length
)
```
Validate calibration `max_seq_length` against the configured model max length.

Line 326 forwards `max_seq_length` directly without enforcing the documented constraint (not exceeding `mllm_qualcomm_max_length`). Add a guard at calibration entry.
Proposed fix

```diff
 def calibrate(self, num_samples=64, max_seq_length=512):
+    if max_seq_length < 1 or max_seq_length > self.mllm_qualcomm_max_length:
+        raise ValueError(
+            f"max_seq_length must be in [1, {self.mllm_qualcomm_max_length}]"
+        )
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
326 - 328, The calibration code currently passes max_seq_length directly into
_build_model_inputs; add a guard in the calibration entry path to validate and
clamp/raise if max_seq_length exceeds the configured model max
(mllm_qualcomm_max_length) before calling _build_model_inputs (reference
symbols: max_seq_length, _build_model_inputs, mllm_qualcomm_max_length,
model_inputs); implement: retrieve the configured max, if provided
max_seq_length > mllm_qualcomm_max_length either set max_seq_length =
mllm_qualcomm_max_length or raise a clear ValueError, then call
_build_model_inputs with the validated value.
```python
parser.add_argument(
    "--infer_max_new_tokens",
    type=int,
    default=8,
    help="Maximum new tokens for post-calibration inference sanity check",
)
```
Validate `--infer_max_new_tokens` as a positive integer.

Line 30 currently accepts any integer; 0 or negative values can propagate and break inference. Guard this at parse time.
Proposed fix

```diff
+def positive_int(value: str) -> int:
+    ivalue = int(value)
+    if ivalue < 1:
+        raise argparse.ArgumentTypeError("must be >= 1")
+    return ivalue
+
 ...
 parser.add_argument(
     "--infer_max_new_tokens",
-    type=int,
+    type=positive_int,
     default=8,
     help="Maximum new tokens for post-calibration inference sanity check",
 )
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py` around lines 30
- 35, The CLI currently allows non-positive values for --infer_max_new_tokens
which can break inference; change the parser to enforce a positive integer by
either supplying a custom argparse type (e.g., positive_int that raises
ValueError for <=0) to parser.add_argument("--infer_max_new_tokens", ...) or,
after parsing, check args.infer_max_new_tokens <= 0 and call
parser.error("infer_max_new_tokens must be a positive integer"); reference
parser.add_argument and the --infer_max_new_tokens flag when making the change.
/ping @chenghuaWang |
Thanks for your contribution.

On an H20 GPU, 128 samples finish in just a few minutes. I haven't measured per-token inference speed on the GPU, but since QNN inference requires eager attention, it will be much slower than a flash-attention implementation.
♻️ Duplicate comments (2)
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (2)
270-276: ⚠️ Potential issue | 🟡 Minor

Non-positive `max_new_tokens` values are not rejected. The current implementation only clamps the upper bound but does not validate that `max_new_tokens >= 1`. A value of 0 or negative would pass through and could cause unexpected behavior in `generate()`.

Proposed fix

```diff
 def infer(self, prompt: str, max_new_tokens: int = 8):
+    if max_new_tokens < 1:
+        raise ValueError("max_new_tokens must be >= 1")
     model_inputs = self._build_model_inputs(prompt)
     input_length = model_inputs.input_ids.shape[1]
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 270 - 276, The infer method allows zero or negative max_new_tokens to proceed; add input validation in infer (function infer) to reject non-positive values by raising ValueError if the caller passes max_new_tokens < 1, and also ensure after computing available_tokens and applying max_new_tokens = min(max_new_tokens, available_tokens) that the resulting max_new_tokens is still >= 1 (otherwise raise the same ValueError mentioning mllm_qualcomm_max_length and available_tokens); reference model_inputs, input_length and self.mllm_qualcomm_max_length when performing these checks so generate() never receives a non-positive token count.
306-347: ⚠️ Potential issue | 🟡 Minor

Validate `max_seq_length` against configured model limits. The docstring states `max_seq_length` should not exceed `mllm_qualcomm_max_length`, but this constraint is not enforced. Passing a value larger than the configured limit could cause incorrect model behavior.

Proposed fix at method entry

```diff
 def calibrate(self, num_samples=64, max_seq_length=512):
     """
     Perform calibration using Wikipedia dataset (PTQ)
     :param num_samples: Number of samples for calibration
     :param max_seq_length: Maximum length for each sample (not exceeding mllm_qualcomm_max_length)
     """
+    if max_seq_length < 1 or max_seq_length > self.mllm_qualcomm_max_length:
+        raise ValueError(
+            f"max_seq_length must be in [1, {self.mllm_qualcomm_max_length}]"
+        )
     print(
         f"Starting calibration, samples: {num_samples}, max length: {max_seq_length}"
     )
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 306 - 347, In calibrate, validate the incoming max_seq_length against the model's configured limit (mllm_qualcomm_max_length) at the start of the method: fetch the configured limit from the instance (e.g., self.mllm_qualcomm_max_length or self.config.mllm_qualcomm_max_length), and if max_seq_length is greater than that limit either clamp it to the limit or raise a ValueError (choose team convention); then continue using the validated/clamped value when calling _build_model_inputs and elsewhere in the method to ensure the calibration never uses a longer sequence than the model supports.
🧹 Nitpick comments (1)
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (1)
234-249: Add docstring and validate `max_length` when provided.

This new helper method centralizes tokenization logic well. Consider adding:

- A brief docstring explaining purpose, parameters, and return type
- Validation that `max_length` is positive when provided, to fail fast with a clear error

Proposed enhancement

```diff
 def _build_model_inputs(self, prompt: str, max_length: int | None = None):
+    """
+    Build model inputs from a prompt string.
+
+    Args:
+        prompt: The user prompt text.
+        max_length: Optional maximum sequence length for truncation.
+
+    Returns:
+        Tokenized inputs on the model device.
+    """
+    if max_length is not None and max_length < 1:
+        raise ValueError("max_length must be >= 1")
     messages = [{"role": "user", "content": prompt}]
```

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 234 - 249, Add a concise docstring to _build_model_inputs describing its purpose, parameters (prompt: str, max_length: int | None), return type (tokenized tensors on self.model.device) and possible errors; also validate max_length when provided (ensure it's an int > 0) and raise a ValueError with a clear message if invalid, then proceed with the existing tokenization logic (references: _build_model_inputs, self.tokenizer, tokenizer_kwargs, self.model.device).
📒 Files selected for processing (1)
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py
Changes:

- The linear block size used during quantization was hardcoded to 16; it is now read from the config file.
- Added a configurable maximum new-token count (default 8) for the post-quantization inference sanity check, to avoid wasting compute on generating too many tokens (on a machine without enough VRAM, CPU inference runs at roughly 50 s/token [AMD Ryzen 8845H / 48 GB RAM]).
- While vibe-cleaning the diff, the model also factored out some duplicated code segments.

From my perspective the code looks fine, and it works on my machine, but since I'm new to LLMs, I'd appreciate further pointers on any remaining problems.

resolves #659