PyTorch Example
This directory contains the source code, definition files, and submission scripts needed to train a ResNet50 model on synthetic data in a containerized environment.
By default, the SLURM scripts run a small distributed training workload on 2 nodes. They can be modified to run on a different number of nodes or on a single device.
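As a rough illustration of how a training script can adapt to the node count chosen in a SLURM script, the sketch below reads the standard environment variables that `srun` sets for each task (`SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NTASKS`) and falls back to a single-process setup when they are absent. The names `DistributedConfig` and `config_from_slurm` are hypothetical and used only for this example; the scripts in this directory may determine ranks differently (for instance via a launcher), so treat this as a sketch under those assumptions rather than a description of the actual code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistributedConfig:
    """Hypothetical container for the values a distributed setup needs."""

    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node
    world_size: int  # total number of processes in the job

    @property
    def is_distributed(self) -> bool:
        # A single process needs no process group at all.
        return self.world_size > 1


def config_from_slurm(env=None) -> DistributedConfig:
    """Build a config from SLURM task variables.

    Assumes one task per device, started with srun. When the variables
    are missing (e.g. a plain `python train.py`), the defaults describe
    a single-process run.
    """
    env = os.environ if env is None else env
    return DistributedConfig(
        rank=int(env.get("SLURM_PROCID", 0)),
        local_rank=int(env.get("SLURM_LOCALID", 0)),
        world_size=int(env.get("SLURM_NTASKS", 1)),
    )


if __name__ == "__main__":
    cfg = config_from_slurm()
    mode = "distributed" if cfg.is_distributed else "single-process"
    print(f"{mode}: rank {cfg.rank} of {cfg.world_size}, local rank {cfg.local_rank}")
```

With a helper like this, changing the number of nodes in the submission script requires no change to the Python code: `rank` and `world_size` could be passed to `torch.distributed.init_process_group`, and `local_rank` could be used to select the device on each node.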
For instructions on setting up the containers and running the example on MPCDF HPC systems, please refer to the raven and viper directories.