This container uses NVIDIA's PyTorch container as its base layer and adds Hugging Face's [transformers](https://github.com/huggingface/transformers), [accelerate](https://github.com/huggingface/accelerate), and [datasets](https://github.com/huggingface/datasets) libraries.
## Build
This container is built on top of the `nvidia_pytorch` container.
The `example.py` and `example.slurm` scripts showcase how to perform a multi-node, multi-GPU training with Hugging Face's [transformers](https://github.com/huggingface/transformers) library.
Here we will fine-tune a distilled version of GPT-2 on a small Shakespeare dataset, using 2 nodes with 4 GPUs each.
The transformers [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) (together with [accelerate](https://github.com/huggingface/accelerate)) and the [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) command automate almost everything.
We can use the same Python script to run a single-device training on our laptop/machine, or a multi-device training on our SLURM cluster.
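Concretely, `torchrun` exports the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for every process it spawns; a script that reads them with single-process defaults runs unchanged on a laptop. A minimal sketch of that detection (the helper name is ours, not from `example.py`):

```python
import os


def get_dist_config() -> dict:
    """Read torchrun's standard env variables, defaulting to a single process.

    RANK/LOCAL_RANK/WORLD_SIZE are set by torchrun for each spawned worker;
    when they are absent, we are running as a plain single-device script.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }
```

The transformers `Trainer` performs this kind of detection internally; the sketch only illustrates why the same script works in both settings.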
To run the example locally on our machine:
...
...
To run a multi-node, multi-GPU training on our SLURM cluster, we just need to specify the resources we want to use in the SLURM script:
```shell
#SBATCH --nodes=2 # <- Number of nodes we want to use for the training
#SBATCH --ntasks-per-node=1 # <- Has to be 1, since torchrun takes care of spawning one process per GPU
...
```
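Beyond the resource directives, the batch script must also launch `torchrun` on every node with a common rendezvous host. A hypothetical sketch of that logic (not the repository's actual `example.slurm`; the GPU count and port are assumptions) might look like:

```shell
# Hypothetical sketch of the launch logic inside a SLURM batch script.
# Falls back to single-node defaults so it also runs outside the cluster.
NNODES=${SLURM_NNODES:-1}
NODE_RANK=${SLURM_NODEID:-0}

# Use the first host in the allocation as torchrun's rendezvous address.
if [ -n "${SLURM_NODELIST:-}" ]; then
    MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
else
    MASTER_ADDR=localhost
fi

# Assembled command (echoed here; a real script would execute it on each node):
echo torchrun \
    --nnodes="$NNODES" \
    --nproc_per_node=4 \
    --node_rank="$NODE_RANK" \
    --master_addr="$MASTER_ADDR" \
    --master_port=29500 \
    example.py
```

With `--tasks-per-node=1`, each node runs this launcher exactly once, and `torchrun` itself spawns one worker process per GPU.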
Since we bind the container's home folder to the current working directory (`-B .:"$HOME"`), you can find the fine-tuned model in the `./model` folder after the training.
## Notes
- Be aware of Hugging Face's caching mechanisms when running a multi-GPU training! Consider downloading the models/datasets first and instantiating the respective classes from the local files directly, to avoid concurrent downloads of the same files.
- Data should be read from the faster `/ptmp` partition. Make sure the datasets cached by Hugging Face's [datasets](https://github.com/huggingface/datasets) library end up on this partition.
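One way to do this is to point the library's cache at `/ptmp` before the training starts; `HF_DATASETS_CACHE` is the environment variable the `datasets` library reads for its cache location. The subdirectory layout below is just an assumption, so adapt it to your site's conventions:

```shell
# Redirect the datasets cache to the fast /ptmp partition.
# The subdirectory layout is an assumption; adapt it to your site's conventions.
export HF_DATASETS_CACHE="/ptmp/$USER/huggingface/datasets"
```

Export this in the SLURM script (or your shell) before launching the training so every spawned worker picks it up.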