
RuntimeError: CUDA error: device-side assert triggered #9578

@jiangxinufo

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3.1-8B-Instruct
2025-12-06 05:35:15,826 - modelscope - INFO - Target directory already exists, skipping creation.
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,831 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,831 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,831 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,832 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,832 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,832 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-12-06 05:35:16,194 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:696] 2025-12-06 05:35:16,196 >> loading configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/config.json
[INFO|configuration_utils.py:770] 2025-12-06 05:35:16,199 >> Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.52.4",
"use_cache": true,
"vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-12-06 05:35:16,547 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-12-06 05:35:16] llamafactory.data.template:143 >> Add pad token: <|eot_id|>
[INFO|2025-12-06 05:35:16] llamafactory.data.template:143 >> Add <|eom_id|> to stop words.
[INFO|configuration_utils.py:696] 2025-12-06 05:35:16,574 >> loading configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/config.json
[INFO|configuration_utils.py:770] 2025-12-06 05:35:16,575 >> Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.52.4",
"use_cache": true,
"vocab_size": 128256
}

[INFO|2025-12-06 05:35:16] llamafactory.model.model_utils.kv_cache:143 >> KV cache is enabled for faster generation.
[INFO|modeling_utils.py:1148] 2025-12-06 05:35:16,576 >> loading weights file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:2241] 2025-12-06 05:35:16,576 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1135] 2025-12-06 05:35:16,577 >> Generate config GenerationConfig {
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
]
}

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 101.70it/s]
[INFO|modeling_utils.py:5131] 2025-12-06 05:35:16,656 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:5139] 2025-12-06 05:35:16,656 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1088] 2025-12-06 05:35:16,658 >> loading configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1135] 2025-12-06 05:35:16,658 >> Generate config GenerationConfig {
"bos_token_id": 128000,
"do_sample": true,
"eos_token_id": [
128001,
128008,
128009
],
"temperature": 0.6,
"top_p": 0.9
}

[INFO|2025-12-06 05:35:16] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/gradio/queueing.py", line 715, in process_events
response = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/route_utils.py", line 322, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/blocks.py", line 2191, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/blocks.py", line 1714, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 739, in async_iteration
return await anext(iterator)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 733, in anext
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/anyio/to_thread.py", line 61, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2525, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 986, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 716, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 877, in gen_wrapper
response = next(iterator)
^^^^^^^^^^^^^^
File "/app/src/llamafactory/webui/chatter.py", line 158, in load_model
super().__init__(args)
File "/app/src/llamafactory/chat/chat_model.py", line 53, in __init__
self.engine: BaseEngine = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/src/llamafactory/chat/hf_engine.py", line 59, in __init__
self.model = load_model(
^^^^^^^^^^^
File "/app/src/llamafactory/model/loader.py", line 189, in load_model
model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/src/llamafactory/model/adapter.py", line 300, in init_adapter
model = _setup_lora_tuning(
^^^^^^^^^^^^^^^^^^^
File "/app/src/llamafactory/model/adapter.py", line 183, in _setup_lora_tuning
model: LoraModel = PeftModel.from_pretrained(model, adapter, **init_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/peft/peft_model.py", line 541, in from_pretrained
load_result = model.load_adapter(
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/peft/peft_model.py", line 1272, in load_adapter
adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/peft/utils/save_and_load.py", line 567, in load_peft_weights
adapters_weights = safe_load_file(filename, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/safetensors/torch.py", line 338, in load_file
result[k] = f.get_tensor(k)
^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
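The traceback ends inside `safetensors.torch.load_file` while PEFT copies the LoRA adapter tensors onto the GPU, and the message itself warns that device-side asserts are reported asynchronously, so the Python frame above may not be the real culprit. A minimal sketch for narrowing this down, assuming the adapter path below is replaced with the real one (it is a placeholder, not taken from this log):

```python
# Debugging sketch only -- the adapter path is a placeholder.
import os

# Must be set before the first CUDA call so the failing kernel is reported
# synchronously, giving an accurate stack trace.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from safetensors.torch import load_file

ADAPTER_FILE = "/path/to/lora_adapter/adapter_model.safetensors"  # placeholder

# Step 1: load on CPU. If this raises, the safetensors file itself is damaged.
cpu_weights = load_file(ADAPTER_FILE, device="cpu")
print(f"CPU load OK: {len(cpu_weights)} tensors")

# Step 2: copy the tensors to each visible GPU in turn. If only this step
# triggers the device-side assert, the problem is on the CUDA side (driver or
# device issue, or an earlier failed kernel), not in the adapter file.
for idx in range(torch.cuda.device_count()):
    moved = {name: t.to(f"cuda:{idx}") for name, t in cpu_weights.items()}
    torch.cuda.synchronize(idx)
    print(f"cuda:{idx} OK: {len(moved)} tensors")
```

If the CPU load succeeds and both GPUs accept the tensors, the assert was more likely raised by an earlier kernel in the same process and is only surfacing at this CUDA call.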

Reproduction

After clicking Chat to load the model, this error is raised every time. The GPUs are two 24 GB L4s.
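To take the Gradio front end out of the picture, the same model-plus-adapter load can be attempted through LLaMA-Factory's Python chat API. This is a hedged sketch: the adapter path is a placeholder and the argument names are assumed to match what the web UI passes (the traceback shows the web UI constructing a ChatModel in chat_model.py):

```python
# Hypothetical reproduction outside the web UI; the adapter path is a placeholder.
from llamafactory.chat import ChatModel

args = {
    "model_name_or_path": "/root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct",
    "adapter_name_or_path": "/path/to/lora_adapter",  # placeholder
    "template": "llama3",
    "finetuning_type": "lora",
}

# The failure in the traceback happens while the adapter weights are loaded,
# so constructing the ChatModel should already reproduce it if the web UI is
# not the cause.
chat_model = ChatModel(args)
print("model and adapter loaded")
```

If this loads cleanly on the same two L4 GPUs, the problem is specific to how the web UI launches the load; if it fails the same way, the adapter files or the CUDA setup are the place to look.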

Others

No response

Labels

bug (Something isn't working), pending (This problem is yet to be addressed)
