This folder contains scripts to train ResNet50, ResNet152 & ResNext101 model on Habana GaudiTM device to achieve state-of-the-art accuracy. Please visit this page for performance information.
The ResNet50 demos included in this release are Eager mode and Lazy mode training for BS128 with FP32 and BS256 with BF16 mixed precision. The ResNet152 demos included in this release are Eager mode and Lazy mode training for BS128 with BF16 mixed precision. The ResNext101 demos included in this release are Eager mode and Lazy mode training for BS64 with FP32 and BS128 with BF16 mixed precision.
The base model used is from GitHub: PyTorch-Vision. A copy of the model file resnet.py is used to make certain model changes functionally equivalent to the original model. Please refer to a below section for a summary of model changes, changes to training script and the original files.
Please follow the instructions given in the following link for setting up the
environment including the $PYTHON environment variable: Gaudi Setup and
Installation Guide. Please
answer the questions in the guide according to your preferences. This guide will
walk you through the process of setting up your system to run the model on
Gaudi.
Set the MPI_ROOT environment variable to the directory where OpenMPI is installed.
For example, in Habana containers, use
export MPI_ROOT=/usr/local/openmpi/The base training and modelling scripts for training are based on a clone of https://github.com/pytorch/vision.git with certain changes for modeling and training script. Please refer to later sections on training script and model modifications for a summary of modifications to the original files.
In the docker container, clone this repository and switch to the branch that
matches your SynapseAI version. (Run the
hl-smi
utility to determine the SynapseAI version.)
git clone -b [SynapseAI version] https://github.com/HabanaAI/Model-ReferencesGo to PyTorch ResNet directory:
cd Model-References/PyTorch/computer_vision/ImageClassification/ResNetNote: If the repository is not in the PYTHONPATH, make sure you update it.
export PYTHONPATH=/path/to/Model-References:$PYTHONPATHImagenet 2012 dataset needs to be organized as per PyTorch requirement. PyTorch requirements are specified in the link below which contains scripts to organize Imagenet data. https://github.com/soumith/imagenet-multiGPU.
The following commands assume that Imagenet dataset is available at /root/software/data/pytorch/imagenet/ILSVRC2012/ directory.
Please refer to the following command for available training parameters:
cd Model-References/PyTorch/computer_vision/ImageClassification/ResNet
i. Example: ResNet50, lazy mode, bf16 mixed precision, Batch Size 256, Custom learning rate
$PYTHON -u demo_resnet.py --world-size 1 --dl-worker-type HABANA --batch-size 256 --model resnet50 --device hpu --workers 8 --print-freq 20 --channels-last True --dl-time-exclude False --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16 --custom-lr-values 0.1,0.01,0.001,0.0001 --custom-lr-milestones 0,30,60,80ii. Example: ResNext101 lazy mode, bf16 mixed precision, Batch size 128, Custom learning rate
$PYTHON -u demo_resnet.py --world-size 1 --dl-worker-type HABANA --batch-size 128 --model resnext101_32x4d --device hpu --workers 8 --print-freq 20 --channels-last True --dl-time-exclude False --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 100 --data-type bf16 --custom-lr-values 0.1,0.01,0.001,0.0001 --custom-lr-milestones 0,30,60,80iii. Example: ResNet152, lazy mode, bf16 mixed precision, Batch Size 128, Custom learning rate
$PYTHON -u demo_resnet.py --world-size 1 --dl-worker-type HABANA --batch-size 128 --model resnet152 --device hpu --workers 8 --print-freq 20 --channels-last True --dl-time-exclude False --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16 --custom-lr-values 0.1,0.01,0.001,0.0001 --custom-lr-milestones 0,30,60,80To run multi-node demo, make sure the host machine has 512 GB of RAM installed. Also ensure you followed the Gaudi Setup and Installation Guide to install and set up docker, so that the docker has access to all the 8 cards required for multi-node demo. Multinode configuration for ResNet 50 training up to 2 servers, each with 8 Gaudi cards, have been verified.
Before execution of the multi-node demo scripts, make sure all server network interfaces are up. You can change the state of each network interface managed by the habanalabs driver using the following command:
sudo ip link set <interface_name> up
To identify if a specific network interface is managed by the habanalabs driver type, run:
sudo ethtool -i <interface_name>
Multi-server training works by setting these environment variables:
MULTI_HLS_IPS: set this to a comma-separated list of host IP addresses. Network id is derived from first entry of this variable
This example shows how to setup for a 2-server training configuration. The IP addresses used are only examples:
export MULTI_HLS_IPS="10.10.100.101,10.10.100.102"
# When using demo script, export the following variable to the required port number (default 3022)
export DOCKER_SSHD_PORT=3022By default, the Habana docker uses port 22 for ssh. The default port configured in the demo script is port 3022. Run the following commands to configure the selected port number , port 3022 in example below.
sed -i 's/#Port 22/Port 3022/g' /etc/ssh/sshd_config
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
service ssh restart-
Configure password-less ssh between all nodes:
Do the following in all the nodes' docker sessions:
mkdir ~/.ssh cd ~/.ssh ssh-keygen -t rsa -b 4096
Copy id_rsa.pub contents from every node's docker to every other node's docker's ~/.ssh/authorized_keys (all public keys need to be in all hosts' authorized_keys):
cat id_rsa.pub > authorized_keys vi authorized_keysCopy the contents from inside to other systems. Paste all hosts' public keys in all hosts' “authorized_keys” file.
-
On each system: Add all hosts (including itself) to known_hosts. The IP addresses used below are just for illustration:
ssh-keyscan -p 3022 -H 10.10.100.101 >> ~/.ssh/known_hosts ssh-keyscan -p 3022 -H 10.10.100.102 >> ~/.ssh/known_hosts
i. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 8, custom learning rate, 8x on single server, include dataloading time in throughput computation
$PYTHON -u demo_resnet.py --world-size 8 --batch-size 256 --model resnet50 --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16 --custom-lr-values 0.275,0.45,0.625,0.8,0.08,0.008,0.0008 --custom-lr-milestones 1,2,3,4,30,60,80 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8ii. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 8, custom learning rate, 8x on single server, exclude dataloading time in throughput computation
$PYTHON -u demo_resnet.py --world-size 8 --batch-size 256 --model resnet50 --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16 --custom-lr-values 0.275,0.45,0.625,0.8,0.08,0.008,0.0008 --custom-lr-milestones 1,2,3,4,30,60,80 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8iii. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 8, custom learning rate, 8x on single server, include dataloading time in throughput computation, use habana_dataloader, worker(decoder) instances 8
$PYTHON -u demo_resnet.py --world-size 8 --batch-size 256 --model resnet50 --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16 --custom-lr-values 0.275,0.45,0.625,0.8,0.08,0.008,0.0008 --custom-lr-milestones 1,2,3,4,30,60,80 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8iv. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 8, custom learning rate, 8x on single server, include dataloading time in throughput computation, print-frequency 10
mpirun -n 8 --bind-to core --map-by slot:PE=7 --rank-by core --report-bindings --allow-run-as-root $PYTHON train.py --data-path=/root/software/data/pytorch/imagenet/ILSVRC2012 --model=resnet50 --device=hpu --batch-size=256 --epochs=90 --print-freq=10 --output-dir=. --channels-last=True --seed=123 --run-lazy-mode --hmp --hmp-bf16 ./ops_bf16_Resnet.txt --hmp-fp32 ./ops_fp32_Resnet.txt --custom-lr-values 0.275 0.45 0.625 0.8 0.08 0.008 0.0008 --custom-lr-milestones 1 2 3 4 30 60 80 --deterministic --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8v. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 16, custom learning rate, 16x on multiple servers, include dataloading time in throughput computation
MULTI_HLS_IPS="<node1 ipaddr>,<node2 ipaddr>" $PYTHON -u demo_resnet.py --world-size 16 --batch-size 256 --model resnet50 --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16 --custom-lr-values 0.475,0.85,1.225,1.6,0.16,0.016,0.0016 --custom-lr-milestones 1,2,3,4,30,60,80 --process-per-node 8 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8vi. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 16, custom learning rate, 16x on multiple servers using Host-NIC, include dataloading time in throughput computation
export HCCL_OVER_TCP=1
export HCCL_COMM_ID=<rank0 host NIC ip address>:9696
export NSOCK_PERTHREAD=3
export HCCL_SOCKET_IFNAME="eth1,eth2,eth3,eth4" # host nic interfaces used
export HCCL_BOX_SIZE=8
export SOCKET_NTHREADS=2
$PYTHON -u demo_resnet.py --world-size 16 --batch-size 256 --model resnet50 --device hpu --workers 8 --print-freq 10 --channels-last True --dl-time-exclude False --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16 --dl-worker-type HABANA --custom-lr-values 0.475,0.85,1.225,1.6,0.16,0.016,0.0016 --custom-lr-milestones 1,2,3,4,30,60,80 --process-per-node 8vii. Example: ResNext101, lazy mode: bf16 mixed precision, Batch size 128, world-size 8, single server, include dataloading time in throughput computation
$PYTHON -u demo_resnet.py --world-size 8 --batch-size 128 --model resnext101_32x4d --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 100 --data-type bf16 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8viii. Example: ResNet152, lazy mode: bf16 mixed precision, Batch size 128, world-size 8, custom learning rate, 8x on single server, include dataloading time in throughput computation
$PYTHON -u demo_resnet.py --world-size 8 --batch-size 128 --model resnet152 --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16 --custom-lr-values 0.275,0.45,0.625,0.8,0.08,0.008,0.0008 --custom-lr-milestones 1,2,3,4,30,60,80 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8- channels-last = false is not supported. Train ResNet with channels-last=true. This will be fixed in a subsequent release.
- Placing mark_step() arbitrarily may lead to undefined behaviour. Recommend to keep mark_step() as shown in provided scripts.
- Habana_dataloader only supports Imagenet (JPEG) and 8 worker instances
- Only scripts & configurations mentioned in this README are supported and verified.
The following are the changes added to the training script (train.py) and utilities(utils.py):
-
Added support for Habana devices
a. Load Habana specific library.
b. Certain environment variables are defined for habana device.
c. Added support to run Resnet training in lazy mode in addition to the eager mode. mark_step() is performed to trigger execution of the graph.
d. Added multi-node (including multiple server) and mixed precision support.
c. Modified training script to use mpirun for distributed training.
f. Changes for dynamic loading of HCCL library.
g. Changed start_method for torch multiprocessing.
-
Dataloader related changes
a. Added –deterministic flag to make data loading deterministic with Pytorch dataloader.
b. Added support for including dataloader time in performance computation. To match behavior of base script, it is disabled by default.
c. Added CPU based habana_dataloader with faster image preprocessing (only JPEG) support. It is disabled by default.
-
To improve performance
a. Enhance with channels last data format (NHWC) support.
b. Permute convolution weight tensors & and any other dependent tensors like 'momentum' for better performance.
c. Checkpoint saving involves getting trainable params and other state variables to CPU and permuting the weight tensors. Hence, checkpoint saving is by default disabled. It can be enabled using –save-checkpoint option.
d. Optimized FusedSGD operator is used in place of torch.optim.SGD for lazy mode.
e. Added support for lowering print frequency of loss.
f. Modified script to avoid broadcast at the beginning of iteration.
g. Added additional parameter to DistributedDataParallel to increase the size of the bucket to combine all the all-reduce calls to a single call.
h. Use CPU based habana_dataloader.
i. Enable pinned memory for data loader.
j. Accuracy calculation is performed on the CPU as lower performance observed with "top-k" on device
-
Skip printing eval phase stats if eval steps is zero.
-
Added support in script to override default LR scheduler with custom schedule passed as parameters to script.
The model changes are listed below:
- Added support for resnext101_32x4d.