
RuntimeError: CUDA error: device-side assert triggered #9578

@jiangxinufo

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3.1-8B-Instruct
2025-12-06 05:35:15,826 - modelscope - INFO - Target directory already exists, skipping creation.
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,831 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,831 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,831 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,832 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,832 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,832 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-12-06 05:35:16,194 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:696] 2025-12-06 05:35:16,196 >> loading configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/config.json
[INFO|configuration_utils.py:770] 2025-12-06 05:35:16,199 >> Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.52.4",
"use_cache": true,
"vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-12-06 05:35:16,547 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-12-06 05:35:16] llamafactory.data.template:143 >> Add pad token: <|eot_id|>
[INFO|2025-12-06 05:35:16] llamafactory.data.template:143 >> Add <|eom_id|> to stop words.
[INFO|configuration_utils.py:696] 2025-12-06 05:35:16,574 >> loading configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/config.json
[INFO|configuration_utils.py:770] 2025-12-06 05:35:16,575 >> Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.52.4",
"use_cache": true,
"vocab_size": 128256
}

[INFO|2025-12-06 05:35:16] llamafactory.model.model_utils.kv_cache:143 >> KV cache is enabled for faster generation.
[INFO|modeling_utils.py:1148] 2025-12-06 05:35:16,576 >> loading weights file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:2241] 2025-12-06 05:35:16,576 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1135] 2025-12-06 05:35:16,577 >> Generate config GenerationConfig {
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
]
}

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 101.70it/s]
[INFO|modeling_utils.py:5131] 2025-12-06 05:35:16,656 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:5139] 2025-12-06 05:35:16,656 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1088] 2025-12-06 05:35:16,658 >> loading configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1135] 2025-12-06 05:35:16,658 >> Generate config GenerationConfig {
"bos_token_id": 128000,
"do_sample": true,
"eos_token_id": [
128001,
128008,
128009
],
"temperature": 0.6,
"top_p": 0.9
}

[INFO|2025-12-06 05:35:16] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/gradio/queueing.py", line 715, in process_events
response = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/route_utils.py", line 322, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/blocks.py", line 2191, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/blocks.py", line 1714, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 739, in async_iteration
return await anext(iterator)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 733, in anext
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/anyio/to_thread.py", line 61, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2525, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 986, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 716, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 877, in gen_wrapper
response = next(iterator)
^^^^^^^^^^^^^^
File "/app/src/llamafactory/webui/chatter.py", line 158, in load_model
super().__init__(args)
File "/app/src/llamafactory/chat/chat_model.py", line 53, in __init__
self.engine: BaseEngine = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/src/llamafactory/chat/hf_engine.py", line 59, in __init__
self.model = load_model(
^^^^^^^^^^^
File "/app/src/llamafactory/model/loader.py", line 189, in load_model
model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/src/llamafactory/model/adapter.py", line 300, in init_adapter
model = _setup_lora_tuning(
^^^^^^^^^^^^^^^^^^^
File "/app/src/llamafactory/model/adapter.py", line 183, in _setup_lora_tuning
model: LoraModel = PeftModel.from_pretrained(model, adapter, **init_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/peft/peft_model.py", line 541, in from_pretrained
load_result = model.load_adapter(
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/peft/peft_model.py", line 1272, in load_adapter
adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/peft/utils/save_and_load.py", line 567, in load_peft_weights
adapters_weights = safe_load_file(filename, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/safetensors/torch.py", line 338, in load_file
result[k] = f.get_tensor(k)
^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
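The traceback ends inside `safetensors.torch.load_file` while PEFT copies the LoRA adapter tensors onto the GPU, and the message itself warns that device-side asserts are reported asynchronously, so the Python frame above may not be the real culprit. A minimal sketch for narrowing this down, assuming the adapter path below is replaced with the real one (it is a placeholder, not taken from this log):

```python
# Debugging sketch only -- the adapter path is a placeholder.
import os

# Must be set before the first CUDA call so the failing kernel is reported
# synchronously, giving an accurate stack trace.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from safetensors.torch import load_file

ADAPTER_FILE = "/path/to/lora_adapter/adapter_model.safetensors"  # placeholder

# Step 1: load on CPU. If this raises, the safetensors file itself is damaged.
cpu_weights = load_file(ADAPTER_FILE, device="cpu")
print(f"CPU load OK: {len(cpu_weights)} tensors")

# Step 2: copy the tensors to each visible GPU in turn. If only this step
# triggers the device-side assert, the problem is on the CUDA side (driver or
# device issue, or an earlier failed kernel), not in the adapter file.
for idx in range(torch.cuda.device_count()):
    moved = {name: t.to(f"cuda:{idx}") for name, t in cpu_weights.items()}
    torch.cuda.synchronize(idx)
    print(f"cuda:{idx} OK: {len(moved)} tensors")
```

If the CPU load succeeds and both GPUs accept the tensors, the assert was more likely raised by an earlier kernel in the same process and is only surfacing at this CUDA call.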

Reproduction

After clicking Chat to load the model, this error is raised every time. The GPUs are two 24 GB L4s.
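To take the Gradio front end out of the picture, the same model-plus-adapter load can be attempted through LLaMA-Factory's Python chat API. This is a hedged sketch: the adapter path is a placeholder and the argument names are assumed to match what the web UI passes (the traceback shows the web UI constructing a ChatModel in chat_model.py):

```python
# Hypothetical reproduction outside the web UI; the adapter path is a placeholder.
from llamafactory.chat import ChatModel

args = {
    "model_name_or_path": "/root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct",
    "adapter_name_or_path": "/path/to/lora_adapter",  # placeholder
    "template": "llama3",
    "finetuning_type": "lora",
}

# The failure in the traceback happens while the adapter weights are loaded,
# so constructing the ChatModel should already reproduce it if the web UI is
# not the cause.
chat_model = ChatModel(args)
print("model and adapter loaded")
```

If this loads cleanly on the same two L4 GPUs, the problem is specific to how the web UI launches the load; if it fails the same way, the adapter files or the CUDA setup are the place to look.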

Others

No response

Labels

bug (Something isn't working), pending (This problem is yet to be addressed)
