fix: qnn-aot: correct block size for qwen-4b and clean code #661

chenghuaWang merged 3 commits into UbiquitousLearning:main from
Conversation
📝 Walkthrough

Refactors the Qwen3 QNN-AOT example to extract causal-mask and trace-input helpers, consolidate MIR trace/dump logic, require an output context name, add configurable LPBQ linear block sizes, and centralize model input construction with token-budget validation for inference and calibration.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 6
🧹 Nitpick comments (1)
examples/qwen3_qnn_aot/compile.cpp (1)
27-27: Consider deriving the trace cache length from config.
`makeTraceInputs(...)` sizes every `past_key_*` and `past_value_*` tensor from this constant, while `Qwen3Config` already exposes `max_cache_length`. Freezing 1024 here means a JSON config change can silently compile the wrong cache shape. The same constant is duplicated in `examples/qwen3_qnn_aot/compile_sha.cpp`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `examples/qwen3_qnn_aot/compile.cpp` at line 27, replace the hardcoded `constexpr kContextLength = 1024` with a value derived from the model config: read `Qwen3Config::max_cache_length` (or an equivalent accessor) and use it to size the trace cache in `makeTraceInputs` and any other places; update both `compile.cpp` and `compile_sha.cpp` to use the config-derived length so changes to `max_cache_length` in the JSON produce matching `past_key_*`/`past_value_*` tensor shapes at compile time (ensure the symbol `kContextLength` is removed or replaced and references are switched to the config-based variable).
📒 Files selected for processing (8)

- examples/qwen3_qnn_aot/compile.cpp
- examples/qwen3_qnn_aot/compile_common.hpp
- examples/qwen3_qnn_aot/compile_sha.cpp
- examples/qwen3_qnn_aot/config_1.7B.json
- examples/qwen3_qnn_aot/config_4B.json
- pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py
- pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py
- pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py
```cpp
auto& qnn_env_path = Argparse::add<std::string>("-qnn_env|--qnn_env_path")
                         .def("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/")
                         .help("QNN AOT Environment path.");
auto& output_context_path = Argparse::add<std::string>("-o|--output_context_name").help("Output QNN context path.");
```
Update the source usage example for the new required flag.

The banner at lines 9-10 still omits `-o`, so copying the documented command now hits the new "No output context path provided" exit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen3_qnn_aot/compile_sha.cpp` at line 31, The example banner/usage
text in examples/qwen3_qnn_aot/compile_sha.cpp does not include the new required
-o/--output_context_name flag, so users copying the command hit the "No output
context path provided" exit; update the usage/banner string (the printed help
message displayed at program start—where you currently show the input flags) to
include "-o|--output_context_name <path>" (or its short form -o) so it matches
the Argparse::add<std::string>("-o|--output_context_name") declaration and
prevents confusion.
```diff
  trace_and_dump(1, "qwen3_qnn_aot_1.mir");

- qnn_aot_env.saveContext("context.0", "qwen3-1.7B-lpbq.bin");
+ qnn_aot_env.saveContext("context.0", output_context_path.get());
```
Propagate saveContext failures to the CLI.

`QnnAOTEnv::saveContext` only logs and returns on missing contexts or file-open failures, so this command can still exit successfully with no context file when `-o` points at an invalid destination. Please verify the file was written, or change `saveContext` to return a status, before returning here. The identical call in `examples/qwen3_qnn_aot/compile_sha.cpp` needs the same guard.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen3_qnn_aot/compile.cpp` at line 67, qnn_aot_env.saveContext
currently only logs failures and doesn't propagate errors, so update the compile
flow to detect failure and return a non-zero exit code: either modify
QnnAOTEnv::saveContext to return a bool/status and check that result after
calling qnn_aot_env.saveContext("context.0", output_context_path.get()), or
after the call verify the file was actually written (e.g., attempt to open the
output path) and if it fails log an error and exit non-zero; apply the same
change/guard to the identical call in examples/qwen3_qnn_aot/compile_sha.cpp so
the CLI fails when the context file cannot be created.
```python
def normalize_qwen3_lpbq_block_size(config: Qwen3Config) -> int:
    block_size = getattr(config, "linear_block_size", None)
    if isinstance(block_size, int) and block_size > 0:
        config.linear_block_size = block_size
        return block_size

    raise ValueError(
        "Qwen3 LPBQ requires a positive `linear_block_size` in model config"
    )
```
Validate LPBQ divisibility constraints at config-normalization time.

Line 61 currently validates only positivity. LPBQ layers also require each relevant `in_features` dimension to be divisible by `linear_block_size`; otherwise the failure appears later (e.g., during deploy conversion) instead of failing fast with a clear config error.
🔧 Proposed fix

```diff
 def normalize_qwen3_lpbq_block_size(config: Qwen3Config) -> int:
     block_size = getattr(config, "linear_block_size", None)
     if isinstance(block_size, int) and block_size > 0:
+        head_dim = getattr(
+            config, "head_dim", config.hidden_size // config.num_attention_heads
+        )
+        in_feature_dims = {
+            "hidden_size": config.hidden_size,  # q/k/v/lm_head in_features
+            "intermediate_size": config.intermediate_size,  # mlp down_proj in_features
+            "attn_output_size": config.num_attention_heads * head_dim,  # o_proj in_features
+        }
+        for name, dim in in_feature_dims.items():
+            if dim % block_size != 0:
+                raise ValueError(
+                    f"`linear_block_size` ({block_size}) must divide {name} ({dim})"
+                )
         config.linear_block_size = block_size
         return block_size
     raise ValueError(
         "Qwen3 LPBQ requires a positive `linear_block_size` in model config"
     )
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around
lines 61 - 69, normalize_qwen3_lpbq_block_size currently only checks positivity
of config.linear_block_size; update it to also fail fast when any relevant
feature dimension in the Qwen3Config is not divisible by linear_block_size. In
the normalize_qwen3_lpbq_block_size function, after computing/validating
block_size, iterate over config attributes (e.g., inspect config.__dict__) and
check integer attributes whose names indicate feature dimensions (for example
names containing "size", "dim", "features", or "in_features" such as hidden_size
or intermediate_size) and collect any that are not divisible by block_size; if
any are found, raise a ValueError listing the offending attribute names and
values and the expected divisibility by linear_block_size. Ensure the raised
error message mentions normalize_qwen3_lpbq_block_size, Qwen3Config, and
linear_block_size for clear debugging.
```python
def infer(self, prompt: str, max_new_tokens: int = 8):
    model_inputs = self._build_model_inputs(prompt)
    input_length = model_inputs.input_ids.shape[1]
    available_tokens = self.mllm_qualcomm_max_length - input_length - 1
    if available_tokens < 1:
        raise ValueError("Prompt exceeds configured mllm_qualcomm_max_length")
    max_new_tokens = min(max_new_tokens, available_tokens)
```
Reject non-positive `max_new_tokens` before budgeting.

Line 260 only clamps the upper bound; non-positive inputs still pass through. Add an explicit lower-bound check.
Proposed fix

```diff
 def infer(self, prompt: str, max_new_tokens: int = 8):
+    if max_new_tokens < 1:
+        raise ValueError("max_new_tokens must be >= 1")
     model_inputs = self._build_model_inputs(prompt)
     input_length = model_inputs.input_ids.shape[1]
     available_tokens = self.mllm_qualcomm_max_length - input_length - 1
     if available_tokens < 1:
         raise ValueError("Prompt exceeds configured mllm_qualcomm_max_length")
     max_new_tokens = min(max_new_tokens, available_tokens)
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
254 - 260, The infer method currently only clamps max_new_tokens to the
available budget but doesn't reject non-positive values; before computing
available_tokens (and after building model_inputs), validate the incoming
max_new_tokens parameter and raise ValueError if max_new_tokens < 1, ensuring
you still compute input_length via self._build_model_inputs(prompt) and respect
self.mllm_qualcomm_max_length when later clamping to available_tokens.
```python
model_inputs = self._build_model_inputs(
    entry["text"], max_length=max_seq_length
)
```
Validate calibration `max_seq_length` against the configured model max length.

Line 326 forwards `max_seq_length` directly without enforcing the documented constraint (not exceeding `mllm_qualcomm_max_length`). Add a guard at calibration entry.
Proposed fix

```diff
 def calibrate(self, num_samples=64, max_seq_length=512):
+    if max_seq_length < 1 or max_seq_length > self.mllm_qualcomm_max_length:
+        raise ValueError(
+            f"max_seq_length must be in [1, {self.mllm_qualcomm_max_length}]"
+        )
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
326 - 328, The calibration code currently passes max_seq_length directly into
_build_model_inputs; add a guard in the calibration entry path to validate and
clamp/raise if max_seq_length exceeds the configured model max
(mllm_qualcomm_max_length) before calling _build_model_inputs (reference
symbols: max_seq_length, _build_model_inputs, mllm_qualcomm_max_length,
model_inputs); implement: retrieve the configured max, if provided
max_seq_length > mllm_qualcomm_max_length either set max_seq_length =
mllm_qualcomm_max_length or raise a clear ValueError, then call
_build_model_inputs with the validated value.
```python
parser.add_argument(
    "--infer_max_new_tokens",
    type=int,
    default=8,
    help="Maximum new tokens for post-calibration inference sanity check",
)
```
Validate `--infer_max_new_tokens` as a positive integer.

Line 30 currently accepts any integer; 0 or negative values can propagate and break inference. Guard this at parse time.
Proposed fix

```diff
+def positive_int(value: str) -> int:
+    ivalue = int(value)
+    if ivalue < 1:
+        raise argparse.ArgumentTypeError("must be >= 1")
+    return ivalue
+
 ...
 parser.add_argument(
     "--infer_max_new_tokens",
-    type=int,
+    type=positive_int,
     default=8,
     help="Maximum new tokens for post-calibration inference sanity check",
 )
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py` around lines 30
- 35, The CLI currently allows non-positive values for --infer_max_new_tokens
which can break inference; change the parser to enforce a positive integer by
either supplying a custom argparse type (e.g., positive_int that raises
ValueError for <=0) to parser.add_argument("--infer_max_new_tokens", ...) or,
after parsing, check args.infer_max_new_tokens <= 0 and call
parser.error("infer_max_new_tokens must be a positive integer"); reference
parser.add_argument and the --infer_max_new_tokens flag when making the change.
/ping @chenghuaWang |
Thanks for your contribution.

On an H20 GPU, 128 samples finish in just a few minutes. I haven't measured per-token inference speed on the GPU, but since QNN inference requires eager attention, it will be much slower than a flash-attention implementation.
♻️ Duplicate comments (2)
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (2)
270-276: ⚠️ Potential issue | 🟡 Minor

Non-positive `max_new_tokens` values are not rejected. The current implementation only clamps the upper bound but does not validate that `max_new_tokens >= 1`. A value of 0 or negative would pass through and could cause unexpected behavior in `generate()`.

Proposed fix

```diff
 def infer(self, prompt: str, max_new_tokens: int = 8):
+    if max_new_tokens < 1:
+        raise ValueError("max_new_tokens must be >= 1")
     model_inputs = self._build_model_inputs(prompt)
     input_length = model_inputs.input_ids.shape[1]
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 270 - 276, The infer method allows zero or negative max_new_tokens to proceed; add input validation in infer (function infer) to reject non-positive values by raising ValueError if the caller passes max_new_tokens < 1, and also ensure after computing available_tokens and applying max_new_tokens = min(max_new_tokens, available_tokens) that the resulting max_new_tokens is still >= 1 (otherwise raise the same ValueError mentioning mllm_qualcomm_max_length and available_tokens); reference model_inputs, input_length and self.mllm_qualcomm_max_length when performing these checks so generate() never receives a non-positive token count.
306-347: ⚠️ Potential issue | 🟡 Minor

Validate `max_seq_length` against configured model limits. The docstring states `max_seq_length` should not exceed `mllm_qualcomm_max_length`, but this constraint is not enforced. Passing a value larger than the configured limit could cause incorrect model behavior.

Proposed fix at method entry

```diff
 def calibrate(self, num_samples=64, max_seq_length=512):
     """
     Perform calibration using Wikipedia dataset (PTQ)
     :param num_samples: Number of samples for calibration
     :param max_seq_length: Maximum length for each sample (not exceeding mllm_qualcomm_max_length)
     """
+    if max_seq_length < 1 or max_seq_length > self.mllm_qualcomm_max_length:
+        raise ValueError(
+            f"max_seq_length must be in [1, {self.mllm_qualcomm_max_length}]"
+        )
     print(
         f"Starting calibration, samples: {num_samples}, max length: {max_seq_length}"
     )
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 306 - 347, In calibrate, validate the incoming max_seq_length against the model's configured limit (mllm_qualcomm_max_length) at the start of the method: fetch the configured limit from the instance (e.g., self.mllm_qualcomm_max_length or self.config.mllm_qualcomm_max_length), and if max_seq_length is greater than that limit either clamp it to the limit or raise a ValueError (choose team convention); then continue using the validated/clamped value when calling _build_model_inputs and elsewhere in the method to ensure the calibration never uses a longer sequence than the model supports.
🧹 Nitpick comments (1)
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (1)
234-249: Add docstring and validate `max_length` when provided.

This new helper method centralizes tokenization logic well. Consider adding:

- A brief docstring explaining purpose, parameters, and return type
- Validation that `max_length` is positive when provided, to fail fast with a clear error

Proposed enhancement

```diff
 def _build_model_inputs(self, prompt: str, max_length: int | None = None):
+    """
+    Build model inputs from a prompt string.
+
+    Args:
+        prompt: The user prompt text.
+        max_length: Optional maximum sequence length for truncation.
+
+    Returns:
+        Tokenized inputs on the model device.
+    """
+    if max_length is not None and max_length < 1:
+        raise ValueError("max_length must be >= 1")
     messages = [{"role": "user", "content": prompt}]
```

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 234 - 249, Add a concise docstring to _build_model_inputs describing its purpose, parameters (prompt: str, max_length: int | None), return type (tokenized tensors on self.model.device) and possible errors; also validate max_length when provided (ensure it's an int > 0) and raise a ValueError with a clear message if invalid, then proceed with the existing tokenization logic (references: _build_model_inputs, self.tokenizer, tokenizer_kwargs, self.model.device).
📒 Files selected for processing (1)
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py
Changes:

- The linear block size used during quantization was hardcoded to 16; it is now read from the config file.
- Added a configurable maximum new-token count (default 8) for the post-quantization inference sanity check, to avoid wasting compute on generating too many tokens (on a machine without enough VRAM, CPU inference runs at roughly 50 s/token [AMD Ryzen 8845H / 48 GB RAM]).
- While vibe-cleaning the diff, the model also factored out some duplicated code segments.

From my perspective the code looks fine, and it works on my machine, but since I'm new to LLMs, I'd appreciate further pointers on any remaining problems.

resolves #659