Hi, I am having a problem while trying to replicate the pretraining process of the model. I am running on an Ubuntu 18.04.5 LTS (GNU/Linux 5.9.11-3-MANJARO x86_64) machine with one GeForce RTX3090 GPU (NVIDIA-SMI 455.45.01, Driver Version: 455.45.01, CUDA Version: 11.1).
I first ran ./scripts/pretrain/preprocess-pretrain-all.sh to process the data provided in your repo under data-src/pretrain_all, and then ran ./scripts/pretrain/pretrain-all.sh. The second script failed with "UnboundLocalError: local variable 'num_updates' referenced before assignment". The error appeared after multiple "overflow detected, setting loss scale to: XX" messages. The full log is in the txt file below.
pretrain_log.txt
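To make my guess concrete, here is a minimal sketch of how I imagine this kind of error could arise. It assumes the training loop assigns num_updates only after a step that did not overflow. This is a toy reproduction and not the repo's actual code; toy_train and describe are hypothetical names I made up for illustration.

```python
def toy_train(overflow_flags):
    """Toy loop: one boolean per batch, True meaning the step overflowed."""
    count = 0
    for overflowed in overflow_flags:
        if overflowed:
            # Assumption: an overflowed step is skipped, so num_updates
            # is never assigned on this path.
            continue
        count += 1
        num_updates = count
    # Raises UnboundLocalError if every step above was skipped.
    return num_updates


def describe(overflow_flags):
    """Return the update count, or the error name if none was ever set."""
    try:
        return toy_train(overflow_flags)
    except UnboundLocalError as exc:
        return f"UnboundLocalError: {exc}"


print(describe([True, False, False]))
print(describe([True, True, True]))
```

If this reading is right, the error would be a symptom of every step overflowing rather than the root cause, but I cannot confirm that from the log alone.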
Does this mean I need to tweak the parameters in scripts/pretrain/pretrain-all.sh to get it running? Or do I need to use data other than what is provided in data-src/pretrain_all to run the model?
I am new to all of this, so please allow me to apologize in advance if this is not a good question. Thank you!