This directory contains the source code, definition files, and submission scripts needed to train a ResNet50 model on synthetic data in a containerized environment.
By default, the SLURM scripts run a small distributed training workload on 2 nodes. They can easily be modified to run on a different number of nodes or on a single device.
For instructions on setting up the containers and running the example on MPCDF HPC systems, please refer to the [raven](raven/) and [viper](viper/) directories.
## BUILD
> HINT: Build on a compute node, and export `APPTAINER_TMPDIR` and `APPTAINER_CACHEDIR` so that they point to locations under `${JOB_SHMTMPDIR}/`. This will **drastically** reduce the time needed to create the SIF file, and therefore the overall build time.
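
As a minimal sketch of the hint above, the two variables could be set as follows before building. The subdirectory names (`apptainer_tmp`, `apptainer_cache`), the file names `example.def` and `example.sif`, and the fallback to a temporary directory when `JOB_SHMTMPDIR` is unset are assumptions made for illustration only; use the definition files and instructions from the system-specific directories for the actual build.

```shell
#!/bin/bash
# Assumption: JOB_SHMTMPDIR is provided inside a job on the compute node.
# Fall back to a fresh temporary directory so the sketch also runs elsewhere.
BUILD_ROOT="${JOB_SHMTMPDIR:-$(mktemp -d)}"

# Point Apptainer's temporary and cache directories at the fast location.
# The subdirectory names are hypothetical.
export APPTAINER_TMPDIR="${BUILD_ROOT}/apptainer_tmp"
export APPTAINER_CACHEDIR="${BUILD_ROOT}/apptainer_cache"
mkdir -p "${APPTAINER_TMPDIR}" "${APPTAINER_CACHEDIR}"

# Hypothetical file names; replace them with the real definition file.
DEF_FILE="example.def"
SIF_FILE="example.sif"

# Only attempt the build when apptainer and the definition file are available.
if command -v apptainer >/dev/null 2>&1 && [ -f "${DEF_FILE}" ]; then
    apptainer build "${SIF_FILE}" "${DEF_FILE}"
else
    echo "Skipping build: apptainer or ${DEF_FILE} not found"
fi
```

Creating the directories up front avoids a failed build caused by a missing path, and guarding the `apptainer build` call keeps the snippet safe to run on a machine without Apptainer installed.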