Distributed PyTorch on Great Lakes

Distributed Data Parallel (DDP) Guide for Great Lakes

A step-by-step guide to setting up multi-node, multi-GPU training with PyTorch's DistributedDataParallel (DDP) framework. The focus is on configuring DDP to run efficiently on Great Lakes, a SLURM cluster, with attention to scalability and resource utilization, plus debugging techniques and GPU monitoring tips for diagnosing and optimizing performance. The guide includes reusable code to streamline distributed training across multiple nodes.
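The core of any such setup is initializing one process per GPU and forming a process group from SLURM's environment variables. Below is a minimal sketch of that initialization; it assumes the job is launched with one SLURM task per GPU (e.g., via srun) and that MASTER_ADDR and MASTER_PORT are exported in the job script. It is illustrative only, not the exact code used on Great Lakes.

# Minimal DDP initialization sketch for a SLURM cluster such as Great Lakes.
# Assumes one SLURM task per GPU and that MASTER_ADDR/MASTER_PORT point to
# the first node of the job (exported in the sbatch script).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
    local_rank = int(os.environ["SLURM_LOCALID"])  # task index on this node

    # Bind this process to a single GPU before initializing NCCL.
    torch.cuda.set_device(local_rank)

    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
    return rank, local_rank, world_size

if __name__ == "__main__":
    rank, local_rank, world_size = setup_distributed()

    # Wrap the model so gradients are synchronized across all ranks.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... build a DataLoader with DistributedSampler and run the training loop ...

    dist.destroy_process_group()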

PyTorch Build History

The following table lists the PyTorch versions that have been built and verified on the Great Lakes HPC cluster and that are used throughout this guide to ensure compatibility and good performance. Each entry records the build time, compiler, CUDA, cuSPARSELt, cuDSS, and cuDNN versions, NCCL, Gloo, and MPI support, and the CXX11 ABI setting. Build instructions, including the specific compiler flags, dependencies, and configuration settings, can be found in the installation section; these details ensure reproducibility in the Great Lakes environment.

PyTorch Version | Build Time | Compiler   | CUDA   | cuSPARSELt | cuDSS | cuDNN | NCCL Support | GLOO Support | MPI Support | CXX11 ABI
2.6.0           | 1.1 h      | GCC/10.3.0 | 12.6.3 | 0.5.2      | 0.4.0 | 9.6.0 | Yes          | Yes          | Yes         | Yes
2.4.0           | 1.2 h      | GCC/10.3.0 | 12.6.3 | 0.5.2      | 0.4.0 | 9.6.0 | Yes          | Yes          | No          | No
2.4.0           | 54 m       | GCC/10.3.0 | 12.3.0 | 0.5.2      | 0.4.0 | 9.6.0 | Yes          | Yes          | No          | No
2.4.0           | 55 m       | GCC/10.3.0 | 11.8.0 | 0.5.2      | 0.4.0 | 9.6.0 | Yes          | Yes          | No          | No
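A quick way to confirm which of these capabilities your loaded environment actually provides is to query PyTorch directly. The check below is a generic sketch using standard PyTorch APIs, not a Great Lakes-specific tool; run it inside the module or virtual environment you intend to use.

# Report the distributed capabilities of the active PyTorch build,
# matching the columns in the table above.
import torch
import torch.distributed as dist

print("PyTorch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL support:", dist.is_nccl_available())
print("GLOO support:", dist.is_gloo_available())
print("MPI support:", dist.is_mpi_available())
print("CXX11 ABI:", torch.compiled_with_cxx11_abi())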

"PyTorch, the PyTorch logo and any related marks are trademarks of The Linux Foundation."