Open
Labels: bug (Something isn't working)
Description
System Info
optimum 1.21.4
optimum-habana 1.14.0.dev0
transformers 4.45.2
HL-SMI Version: hl-1.18.0-fw-53.1.1.1
Driver Version: 1.18.0-ee698fb

Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
1. Download the bigscience/bloomz-7b1 weights from https://huggingface.co/bigscience/bloomz-7b1.
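One way to fetch the weights, as a sketch (assumes a recent huggingface_hub; the target directory is taken from the --model_name_or_path used in step 3):

# Download the model snapshot into the local path the training command expects.
# The target directory is an assumption matching the command in step 3.
huggingface-cli download bigscience/bloomz-7b1 --local-dir /ai_workdir/models/bloomz-7b1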
2. Install the example requirements:
cd optimum-habana/examples/language-modeling
pip install -r requirements.txt
3. Launch the fine-tuning run:
PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 \
python ../gaudi_spawn.py \
--use_deepspeed --world_size 8 run_clm.py \
--model_name_or_path /ai_workdir/models/bloomz-7b1 \
--dataset_name tatsu-lab/alpaca \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 1 \
--do_train \
--do_eval \
--output_dir /ai_workdir/models/bloomz-7b1-clm \
--use_habana \
--use_lazy_mode \
--gradient_checkpointing \
--throughput_warmup_steps 3 \
--deepspeed ./llama2_ds_zero3_config.json \
--gaudi_config_name gaudi_config.json \
--trust_remote_code True \
--overwrite_output_dir \
--block_size 4096 \
--save_strategy epoch
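For reference, the two config files the command points at look roughly like this in the optimum-habana example (a sketch; the exact values are assumptions, so check the files shipped in examples/language-modeling):

llama2_ds_zero3_config.json (DeepSpeed ZeRO stage 3 with bf16):
{
  "steps_per_print": 64,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {"enabled": true},
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": false,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

gaudi_config.json (GaudiConfig options; these values are assumptions):
{
  "use_fused_adam": true,
  "use_fused_clip_norm": true,
  "use_torch_autocast": true
}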
4. The run fails with the following error log:
[rank4]: Traceback (most recent call last):
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank4]: return func(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3208, in all_gather_into_tensor
[rank4]: work = group._allgather_base(output_tensor, input_tensor, opts)
[rank4]: RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure].
[rank4]: During handling of the above exception, another exception occurred:
[rank4]: Traceback (most recent call last):
[rank4]: File "/ai_workdir/optimum-habana/examples/language-modeling/run_clm.py", line 695, in <module>
[rank4]: main()
[rank4]: File "/ai_workdir/optimum-habana/examples/language-modeling/run_clm.py", line 662, in main
[rank4]: metrics = trainer.evaluate()
[rank4]: File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1748, in evaluate
[rank4]: output = eval_loop(
[rank4]: File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1904, in evaluation_loop
[rank4]: losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 2110, in prediction_step
[rank4]: raise error
[rank4]: File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 2087, in prediction_step
[rank4]: loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3532, in compute_loss
[rank4]: outputs = model(**inputs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
[rank4]: return self._call_impl(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
[rank4]: result = forward_call(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 559, in forward
[rank4]: transformer_outputs = self.transformer(
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
[rank4]: return self._call_impl(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
[rank4]: result = forward_call(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 438, in gaudi_bloom_model_forward
[rank4]: outputs = block(
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
[rank4]: return self._call_impl(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
[rank4]: result = forward_call(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 246, in gaudi_bloom_block_forward
[rank4]: attn_outputs = self.self_attention(
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
[rank4]: return self._call_impl(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
[rank4]: result = forward_call(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 205, in gaudi_bloom_attention_forward
[rank4]: output_tensor = self.dense(context_layer)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
[rank4]: return self._call_impl(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1595, in _call_impl
[rank4]: args_result = hook(self, args)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]: ret_val = func(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
[rank4]: self.pre_sub_module_forward_function(module)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]: return func(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
[rank4]: param_coordinator.fetch_sub_module(sub_module, forward=True)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank4]: return fn(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]: ret_val = func(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]: return func(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 291, in fetch_sub_module
[rank4]: self.__all_gather_params(params_to_fetch, forward)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]: ret_val = func(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 435, in __all_gather_params
[rank4]: self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 464, in __all_gather_params_
[rank4]: handle = param_group[0].all_gather_coalesced(param_group, quantize=quantize)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]: ret_val = func(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1242, in all_gather_coalesced
[rank4]: handles = _dist_allgather_fn(
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 95, in _dist_allgather_fn
[rank4]: return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]: ret_val = func(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 320, in allgather_fn
[rank4]: return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank4]: return func(*args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
[rank4]: return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/torch.py", line 205, in all_gather_into_tensor
[rank4]: return self.all_gather_function(output_tensor=output_tensor,
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank4]: msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 54, in _get_msg_dict
[rank4]: "args": f"{args}, {kwargs}",
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 473, in __repr__
[rank4]: return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 698, in _str
[rank4]: return _str_intern(self, tensor_contents=tensor_contents)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 618, in _str_intern
[rank4]: tensor_str = _tensor_str(self, indent)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 350, in _tensor_str
[rank4]: formatter = _Formatter(get_summarized_data(self) if summarize else self)
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 138, in __init__
[rank4]: nonzero_finite_vals = torch.masked_select(
[rank4]: RuntimeError: [Rank:4] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
[rank4]: Check $HABANA_LOGS/ for details
[rank4]: Graph compile failed. synStatus=synStatus 26 [Generic failure].
[rank4]: [Rank:4] Habana exception raised from compile at graph.cpp:599
[rank5]: (same traceback as rank 4, ending in the same synStatus 26 graph-compile failure)
[rank6]: (same traceback as rank 4)
[rank2]: (same traceback as rank 4)
......
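The error says to check $HABANA_LOGS. A quick way to pull the compile-failure details out of those logs (a sketch; HABANA_LOGS is set inside the Habana container, commonly /var/log/habana_logs):

# Search the Habana runtime logs for the graph compile failure reported above.
grep -rn "Graph compile failed" "$HABANA_LOGS" | tail -n 20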
Expected behavior
Fine-tuning of bigscience/bloomz-7b1 should run successfully.