To run a multi-node, multi-GPU training on our SLURM cluster, we just need to specify the resources we want to use in the SLURM script:
```shell
#SBATCH --nodes=2 # <- Number of nodes we want to use for the training
#SBATCH --ntasks-per-node=1 # <- Has to be 1, since torchrun takes care of spawning the processes for each GPU
#SBATCH --gres=gpu:a100:4 # <- Number of GPUs we want to use per node, here 4
```
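For reference, the remainder of such a script launches one task per node with `srun`, and each task starts `torchrun`, which in turn spawns one worker process per GPU. The following is only a rough sketch, assuming a Singularity/Apptainer container (matching the `-B .:"$HOME"` bind mentioned below); the image name `my_image.sif` and the training script `train.py` are placeholders:
```shell
# Pick the first node of the allocation as the rendezvous host for torchrun.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500  # any free port works

# One srun task per node; torchrun spawns one process per GPU on that node.
srun singularity exec --nv -B .:"$HOME" my_image.sif \
    torchrun \
        --nnodes="$SLURM_NNODES" \
        --nproc_per_node="$SLURM_GPUS_ON_NODE" \
        --rdzv_id="$SLURM_JOB_ID" \
        --rdzv_backend=c10d \
        --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
        train.py
```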
Submit the job to the queue:
```shell
sbatch example.slurm
```
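Once the job is accepted, the usual SLURM tools can be used to keep an eye on it, for example:
```shell
# List your pending/running jobs and note the job ID that sbatch printed.
squeue -u "$USER"

# Without an --output directive, SLURM writes the job's stdout/stderr to
# slurm-<jobid>.out in the submission directory.
tail -f slurm-123456.out  # replace 123456 with your job ID
```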
Since we bind the current working directory to the container's home folder (`-B .:"$HOME"`), the pretrained model and the preprocessed dataset will be cached in the `./.cache` folder.
After the training, you can find the fine-tuned model in the `./model` folder.
## Notes
- Be aware of HuggingFace's caching mechanisms when running a multi-GPU training! Consider downloading the models/datasets beforehand and instantiating the respective classes from the local files directly, to avoid concurrent downloads of the same files by multiple processes; see the sketch below.
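
A minimal sketch of such a warm-up, run once from a login node (or a single-task job) before submitting the training; the model name `bert-base-uncased`, the dataset `imdb`, and the target folders are placeholders for whatever your training script actually uses, and a reasonably recent `huggingface_hub`/`datasets` is assumed:
```shell
# Download the pretrained model snapshot into a local folder once ...
python -c "from huggingface_hub import snapshot_download; \
           snapshot_download('bert-base-uncased', local_dir='./bert-base-uncased')"

# ... and materialize the dataset on disk as well.
python -c "from datasets import load_dataset; \
           load_dataset('imdb').save_to_disk('./imdb')"
```
The training script can then point the respective classes at the local copies, e.g. `AutoModel.from_pretrained('./bert-base-uncased')` and `datasets.load_from_disk('./imdb')`, so none of the workers has to hit the Hub at startup.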