Name	Name	Last commit message	Last commit date
parent directory ..
model	model
LICENSE	LICENSE
README.md	README.md
demo_resnet.py	demo_resnet.py
ops_bf16_Resnet.txt	ops_bf16_Resnet.txt
ops_fp32_Resnet.txt	ops_fp32_Resnet.txt
train.py	train.py
utils.py	utils.py

ResNet for PyTorch

This folder contains scripts to train ResNet50, ResNet152 & ResNext101 model on Habana Gaudi^TM device to achieve state-of-the-art accuracy. Please visit this page for performance information.

The ResNet50 demos included in this release are Eager mode and Lazy mode training for BS128 with FP32 and BS256 with BF16 mixed precision. The ResNet152 demos included in this release are Eager mode and Lazy mode training for BS128 with BF16 mixed precision. The ResNext101 demos included in this release are Eager mode and Lazy mode training for BS64 with FP32 and BS128 with BF16 mixed precision.

ResNet for PyTorch
Known Issues
Training Script Modifications

Model Overview

The base model used is from GitHub: PyTorch-Vision. A copy of the model file resnet.py is used to make certain model changes functionally equivalent to the original model. Please refer to a below section for a summary of model changes, changes to training script and the original files.

Setup

Please follow the instructions given in the following link for setting up the environment including the $PYTHON environment variable: Gaudi Setup and Installation Guide. Please answer the questions in the guide according to your preferences. This guide will walk you through the process of setting up your system to run the model on Gaudi.

Set the MPI_ROOT environment variable to the directory where OpenMPI is installed.

For example, in Habana containers, use

export MPI_ROOT=/usr/local/openmpi/

The base training and modelling scripts for training are based on a clone of https://github.com/pytorch/vision.git with certain changes for modeling and training script. Please refer to later sections on training script and model modifications for a summary of modifications to the original files.

In the docker container, clone this repository and switch to the branch that matches your SynapseAI version. (Run the hl-smi utility to determine the SynapseAI version.)

git clone -b [SynapseAI version] https://github.com/HabanaAI/Model-References

Go to PyTorch ResNet directory:

cd Model-References/PyTorch/computer_vision/ImageClassification/ResNet

Note: If the repository is not in the PYTHONPATH, make sure you update it.

export PYTHONPATH=/path/to/Model-References:$PYTHONPATH

Set up dataset

Imagenet 2012 dataset needs to be organized as per PyTorch requirement. PyTorch requirements are specified in the link below which contains scripts to organize Imagenet data. https://github.com/soumith/imagenet-multiGPU.

Training the Model

The following commands assume that Imagenet dataset is available at /root/software/data/pytorch/imagenet/ILSVRC2012/ directory.

Please refer to the following command for available training parameters:

cd Model-References/PyTorch/computer_vision/ImageClassification/ResNet

i. Example: ResNet50, lazy mode, bf16 mixed precision, Batch Size 256, Custom learning rate

$PYTHON -u demo_resnet.py  --world-size 1 --dl-worker-type HABANA --batch-size 256 --model resnet50 --device hpu --workers 8 --print-freq 20 --channels-last True --dl-time-exclude False --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16  --custom-lr-values 0.1,0.01,0.001,0.0001 --custom-lr-milestones 0,30,60,80

ii. Example: ResNext101 lazy mode, bf16 mixed precision, Batch size 128, Custom learning rate

$PYTHON -u demo_resnet.py  --world-size 1 --dl-worker-type HABANA --batch-size 128 --model resnext101_32x4d --device hpu --workers 8 --print-freq 20 --channels-last True --dl-time-exclude False --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 100 --data-type bf16  --custom-lr-values 0.1,0.01,0.001,0.0001 --custom-lr-milestones 0,30,60,80

iii. Example: ResNet152, lazy mode, bf16 mixed precision, Batch Size 128, Custom learning rate

$PYTHON -u demo_resnet.py  --world-size 1 --dl-worker-type HABANA --batch-size 128 --model resnet152 --device hpu --workers 8 --print-freq 20 --channels-last True --dl-time-exclude False --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16  --custom-lr-values 0.1,0.01,0.001,0.0001 --custom-lr-milestones 0,30,60,80

Multinode Training

To run multi-node demo, make sure the host machine has 512 GB of RAM installed. Also ensure you followed the Gaudi Setup and Installation Guide to install and set up docker, so that the docker has access to all the 8 cards required for multi-node demo. Multinode configuration for ResNet 50 training up to 2 servers, each with 8 Gaudi cards, have been verified.

Before execution of the multi-node demo scripts, make sure all server network interfaces are up. You can change the state of each network interface managed by the habanalabs driver using the following command:

sudo ip link set <interface_name> up

To identify if a specific network interface is managed by the habanalabs driver type, run:

sudo ethtool -i <interface_name>

Docker ssh port setup for multi-server training

Multi-server training works by setting these environment variables:

MULTI_HLS_IPS: set this to a comma-separated list of host IP addresses. Network id is derived from first entry of this variable

This example shows how to setup for a 2-server training configuration. The IP addresses used are only examples:

export MULTI_HLS_IPS="10.10.100.101,10.10.100.102"

# When using demo script, export the following variable to the required port number (default 3022)
export DOCKER_SSHD_PORT=3022

By default, the Habana docker uses port 22 for ssh. The default port configured in the demo script is port 3022. Run the following commands to configure the selected port number , port 3022 in example below.

sed -i 's/#Port 22/Port 3022/g' /etc/ssh/sshd_config
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
service ssh restart

Setup password-less ssh between all connected servers used in the scale-out training

Configure password-less ssh between all nodes:

Do the following in all the nodes' docker sessions:
```
mkdir ~/.ssh
cd ~/.ssh
ssh-keygen -t rsa -b 4096
```
Copy id_rsa.pub contents from every node's docker to every other node's docker's ~/.ssh/authorized_keys (all public keys need to be in all hosts' authorized_keys):
```
cat id_rsa.pub > authorized_keys
vi authorized_keys
```
Copy the contents from inside to other systems. Paste all hosts' public keys in all hosts' “authorized_keys” file.

On each system: Add all hosts (including itself) to known_hosts. The IP addresses used below are just for illustration:

ssh-keyscan -p 3022 -H 10.10.100.101 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H 10.10.100.102 >> ~/.ssh/known_hosts

i. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 8, custom learning rate, 8x on single server, include dataloading time in throughput computation

$PYTHON -u demo_resnet.py  --world-size 8 --batch-size 256 --model resnet50 --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16  --custom-lr-values 0.275,0.45,0.625,0.8,0.08,0.008,0.0008 --custom-lr-milestones 1,2,3,4,30,60,80 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8

ii. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 8, custom learning rate, 8x on single server, exclude dataloading time in throughput computation

$PYTHON -u demo_resnet.py  --world-size 8 --batch-size 256 --model resnet50 --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16  --custom-lr-values 0.275,0.45,0.625,0.8,0.08,0.008,0.0008 --custom-lr-milestones 1,2,3,4,30,60,80 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8

iii. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 8, custom learning rate, 8x on single server, include dataloading time in throughput computation, use habana_dataloader, worker(decoder) instances 8

$PYTHON -u demo_resnet.py  --world-size 8 --batch-size 256 --model resnet50 --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16  --custom-lr-values 0.275,0.45,0.625,0.8,0.08,0.008,0.0008 --custom-lr-milestones 1,2,3,4,30,60,80 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8

iv. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 8, custom learning rate, 8x on single server, include dataloading time in throughput computation, print-frequency 10

mpirun -n 8 --bind-to core --map-by slot:PE=7 --rank-by core --report-bindings --allow-run-as-root $PYTHON train.py --data-path=/root/software/data/pytorch/imagenet/ILSVRC2012 --model=resnet50 --device=hpu --batch-size=256 --epochs=90 --print-freq=10 --output-dir=. --channels-last=True --seed=123 --run-lazy-mode --hmp --hmp-bf16 ./ops_bf16_Resnet.txt --hmp-fp32 ./ops_fp32_Resnet.txt --custom-lr-values 0.275 0.45 0.625 0.8 0.08 0.008 0.0008 --custom-lr-milestones 1 2 3 4 30 60 80 --deterministic --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8

v. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 16, custom learning rate, 16x on multiple servers, include dataloading time in throughput computation

MULTI_HLS_IPS="<node1 ipaddr>,<node2 ipaddr>" $PYTHON -u demo_resnet.py  --world-size 16 --batch-size 256 --model resnet50 --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16  --custom-lr-values 0.475,0.85,1.225,1.6,0.16,0.016,0.0016 --custom-lr-milestones 1,2,3,4,30,60,80 --process-per-node 8 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8

vi. Example: ResNet50, lazy mode: bf16 mixed precision, Batch size 256, world-size 16, custom learning rate, 16x on multiple servers using Host-NIC, include dataloading time in throughput computation

export HCCL_OVER_TCP=1
export HCCL_COMM_ID=<rank0 host NIC ip address>:9696
export NSOCK_PERTHREAD=3
export HCCL_SOCKET_IFNAME="eth1,eth2,eth3,eth4" # host nic interfaces used
export HCCL_BOX_SIZE=8
export SOCKET_NTHREADS=2

$PYTHON -u demo_resnet.py --world-size 16 --batch-size 256 --model resnet50 --device hpu --workers 8 --print-freq 10 --channels-last True --dl-time-exclude False --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16 --dl-worker-type HABANA --custom-lr-values 0.475,0.85,1.225,1.6,0.16,0.016,0.0016 --custom-lr-milestones 1,2,3,4,30,60,80 --process-per-node 8

vii. Example: ResNext101, lazy mode: bf16 mixed precision, Batch size 128, world-size 8, single server, include dataloading time in throughput computation

$PYTHON -u demo_resnet.py --world-size 8 --batch-size 128 --model resnext101_32x4d --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 100 --data-type bf16  --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8

viii. Example: ResNet152, lazy mode: bf16 mixed precision, Batch size 128, world-size 8, custom learning rate, 8x on single server, include dataloading time in throughput computation

$PYTHON -u demo_resnet.py  --world-size 8 --batch-size 128 --model resnet152 --device hpu --print-freq 10 --channels-last True --deterministic --data-path /root/software/data/pytorch/imagenet/ILSVRC2012 --mode lazy --epochs 90 --data-type bf16  --custom-lr-values 0.275,0.45,0.625,0.8,0.08,0.008,0.0008 --custom-lr-milestones 1,2,3,4,30,60,80 --dl-time-exclude=False --dl-worker-type "HABANA" --workers 8

Known Issues

channels-last = false is not supported. Train ResNet with channels-last=true. This will be fixed in a subsequent release.
Placing mark_step() arbitrarily may lead to undefined behaviour. Recommend to keep mark_step() as shown in provided scripts.
Habana_dataloader only supports Imagenet (JPEG) and 8 worker instances
Only scripts & configurations mentioned in this README are supported and verified.

Training Script Modifications

The following are the changes added to the training script (train.py) and utilities(utils.py):

Added support for Habana devices

a. Load Habana specific library.

b. Certain environment variables are defined for habana device.

c. Added support to run Resnet training in lazy mode in addition to the eager mode. mark_step() is performed to trigger execution of the graph.

d. Added multi-node (including multiple server) and mixed precision support.

c. Modified training script to use mpirun for distributed training.

f. Changes for dynamic loading of HCCL library.

g. Changed start_method for torch multiprocessing.
Dataloader related changes

a. Added –deterministic flag to make data loading deterministic with Pytorch dataloader.

b. Added support for including dataloader time in performance computation. To match behavior of base script, it is disabled by default.

c. Added CPU based habana_dataloader with faster image preprocessing (only JPEG) support. It is disabled by default.
To improve performance

a. Enhance with channels last data format (NHWC) support.

b. Permute convolution weight tensors & and any other dependent tensors like 'momentum' for better performance.

c. Checkpoint saving involves getting trainable params and other state variables to CPU and permuting the weight tensors. Hence, checkpoint saving is by default disabled. It can be enabled using –save-checkpoint option.

d. Optimized FusedSGD operator is used in place of torch.optim.SGD for lazy mode.

e. Added support for lowering print frequency of loss.

f. Modified script to avoid broadcast at the beginning of iteration.

g. Added additional parameter to DistributedDataParallel to increase the size of the bucket to combine all the all-reduce calls to a single call.

h. Use CPU based habana_dataloader.

i. Enable pinned memory for data loader.

j. Accuracy calculation is performed on the CPU as lower performance observed with "top-k" on device
Skip printing eval phase stats if eval steps is zero.
Added support in script to override default LR scheduler with custom schedule passed as parameters to script.

The model changes are listed below:

Added support for resnext101_32x4d.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

ResNet for PyTorch

Table of Contents

Model Overview

Setup

Set up dataset

Training the Model

Multinode Training

Docker ssh port setup for multi-server training

Setup password-less ssh between all connected servers used in the scale-out training

Known Issues

Training Script Modifications

FilesExpand file tree

ResNet

Directory actions

More options

Directory actions

More options

Latest commit

History

ResNet

Folders and files

parent directory

README.md

ResNet for PyTorch

Table of Contents

Model Overview

Setup

Set up dataset

Training the Model

Multinode Training

Docker ssh port setup for multi-server training

Setup password-less ssh between all connected servers used in the scale-out training

Known Issues

Training Script Modifications