Reminder
- I have read the above rules and searched the existing issues.
System Info
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3.1-8B-Instruct
2025-12-06 05:35:15,826 - modelscope - INFO - Target directory already exists, skipping creation.
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,831 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,831 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,831 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,832 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,832 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:15,832 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-12-06 05:35:16,194 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:696] 2025-12-06 05:35:16,196 >> loading configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/config.json
[INFO|configuration_utils.py:770] 2025-12-06 05:35:16,199 >> Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.52.4",
"use_cache": true,
"vocab_size": 128256
}
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-12-06 05:35:16,201 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-12-06 05:35:16,547 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-12-06 05:35:16] llamafactory.data.template:143 >> Add pad token: <|eot_id|>
[INFO|2025-12-06 05:35:16] llamafactory.data.template:143 >> Add <|eom_id|> to stop words.
[INFO|configuration_utils.py:696] 2025-12-06 05:35:16,574 >> loading configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/config.json
[INFO|configuration_utils.py:770] 2025-12-06 05:35:16,575 >> Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.52.4",
"use_cache": true,
"vocab_size": 128256
}
[INFO|2025-12-06 05:35:16] llamafactory.model.model_utils.kv_cache:143 >> KV cache is enabled for faster generation.
[INFO|modeling_utils.py:1148] 2025-12-06 05:35:16,576 >> loading weights file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:2241] 2025-12-06 05:35:16,576 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1135] 2025-12-06 05:35:16,577 >> Generate config GenerationConfig {
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
]
}
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 101.70it/s]
[INFO|modeling_utils.py:5131] 2025-12-06 05:35:16,656 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:5139] 2025-12-06 05:35:16,656 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1088] 2025-12-06 05:35:16,658 >> loading configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1135] 2025-12-06 05:35:16,658 >> Generate config GenerationConfig {
"bos_token_id": 128000,
"do_sample": true,
"eos_token_id": [
128001,
128008,
128009
],
"temperature": 0.6,
"top_p": 0.9
}
[INFO|2025-12-06 05:35:16] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/gradio/queueing.py", line 715, in process_events
response = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/route_utils.py", line 322, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/blocks.py", line 2191, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/blocks.py", line 1714, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 739, in async_iteration
return await anext(iterator)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 733, in anext
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/anyio/to_thread.py", line 61, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2525, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 986, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 716, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gradio/utils.py", line 877, in gen_wrapper
response = next(iterator)
^^^^^^^^^^^^^^
File "/app/src/llamafactory/webui/chatter.py", line 158, in load_model
super().__init__(args)
File "/app/src/llamafactory/chat/chat_model.py", line 53, in __init__
self.engine: BaseEngine = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/src/llamafactory/chat/hf_engine.py", line 59, in __init__
self.model = load_model(
^^^^^^^^^^^
File "/app/src/llamafactory/model/loader.py", line 189, in load_model
model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/src/llamafactory/model/adapter.py", line 300, in init_adapter
model = _setup_lora_tuning(
^^^^^^^^^^^^^^^^^^^
File "/app/src/llamafactory/model/adapter.py", line 183, in _setup_lora_tuning
model: LoraModel = PeftModel.from_pretrained(model, adapter, **init_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/peft/peft_model.py", line 541, in from_pretrained
load_result = model.load_adapter(
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/peft/peft_model.py", line 1272, in load_adapter
adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/peft/utils/save_and_load.py", line 567, in load_peft_weights
adapters_weights = safe_load_file(filename, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/safetensors/torch.py", line 338, in load_file
result[k] = f.get_tensor(k)
^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
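As the error text notes, the stack trace above may point at the wrong call because CUDA reports the assert asynchronously. Below is a minimal sketch for relaunching with synchronous error reporting; it assumes the web UI is normally started with `llamafactory-cli webui`, so adjust the command to however the container actually launches it.

```python
# Sketch: relaunch the web UI with CUDA_LAUNCH_BLOCKING=1 so the device-side assert
# is reported at the CUDA call that actually triggers it. The launch command below is
# an assumption; substitute the one used inside the container.
import os
import subprocess

env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
subprocess.run(["llamafactory-cli", "webui"], env=env, check=True)
```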
Reproduction
After clicking "Load model" on the Chat tab, this error is raised every time. The GPUs are two 24 GB NVIDIA L4 cards.
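Since the traceback fails inside safetensors while reading the LoRA adapter, one way to narrow this down is to load the adapter file on CPU, outside the web UI. This is only a sketch, assuming the adapter was saved as `adapter_model.safetensors` in the checkpoint directory selected in the UI; the path below is a placeholder.

```python
# Sketch: load the adapter weights on CPU to check whether the file itself is readable.
# If this succeeds, the device-side assert likely comes from an earlier GPU operation
# and is only surfacing during the adapter load; if it fails, the adapter file itself
# is the problem.
import os

from safetensors.torch import load_file

adapter_dir = "/path/to/lora/checkpoint"  # placeholder: the adapter directory chosen in the web UI
state_dict = load_file(os.path.join(adapter_dir, "adapter_model.safetensors"), device="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```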
Others
No response