Commit 4f66c812 authored by David Carreto Fidalgo

Improve readme

parent 1a36af41
# transformers
This container uses Nvidia's PyTorch container as its base layer and adds Hugging Face's [transformers](https://github.com/huggingface/transformers), [accelerate](https://github.com/huggingface/accelerate), and [datasets](https://github.com/huggingface/datasets).
## Build
This container is built on top of the `nvidia_pytorch` container:
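As a minimal sketch of the build (assuming a definition file with the hypothetical name `transformers.def` that starts from a local `nvidia_pytorch.sif` image and pip-installs the three libraries), the build could look like this:

```shell
# Hypothetical sketch: the definition file name, the local base image name,
# and the file contents are assumptions, not taken from the original instructions.
cat > transformers.def <<'EOF'
Bootstrap: localimage
From: nvidia_pytorch.sif

%post
    pip install transformers accelerate datasets
EOF

apptainer build transformers.sif transformers.def
```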
To check that the libraries can be imported inside the container:

```shell
apptainer exec --nv transformers.sif python -c "import transformers"
```
## Examples
The `example.py` and `example.slurm` scripts showcase how to perform a multi-node multi-GPU training with [Hugging Face's transformers](https://github.com/huggingface/transformers) library.
Here we will fine-tune a distilled version of GPT-2 on a small Shakespeare dataset, using 2 nodes with 4 GPUs each.
First, get the Shakespeare data:
```shell
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```
The transformers [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) (together with [accelerate](https://github.com/huggingface/accelerate)) and the [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) command automate almost everything.
We can use the same Python script to run a single-device training on our laptop/machine, or a multi-device training on our SLURM cluster.
To run the example locally on our machine:
```shell
apptainer exec --nv transformers.sif python example.py
```
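If your local machine has several GPUs, the same script can also be launched through `torchrun`; a sketch, assuming 4 local GPUs (the GPU count is an assumption):

```shell
# Single-node, multi-GPU run; --standalone lets torchrun handle the rendezvous locally.
apptainer exec --nv transformers.sif torchrun --standalone --nproc_per_node=4 example.py
```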
To run a multi-node multi-GPU training on our SLURM cluster, we just need to specify the resources in the SLURM script:
```shell
#SBATCH --nodes=2 # <- Number of nodes we want to use for the training
#SBATCH --tasks-per-node=1 # <- Has to be 1, since torchrun takes care of spawning the processes for each GPU
```

Since we bind the container's home folder to the current working dir (`-B .:"$HOME"`), you can find the fine-tuned model in the `./model` folder after the training.
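The SLURM script then launches one `torchrun` per node, which in turn spawns one process per GPU. A rough sketch of what that launch command could look like (the rendezvous port and the exact flags are assumptions, not taken from `example.slurm`):

```shell
# Hedged sketch: one torchrun per node (matching --tasks-per-node=1).
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun apptainer exec --nv -B .:"$HOME" transformers.sif \
    torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=4 \
    --rdzv_backend=c10d --rdzv_endpoint="${MASTER_ADDR}:29500" \
    --rdzv_id="$SLURM_JOB_ID" example.py
```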
## Notes
- Be aware of Hugging Face's caching mechanisms when running a multi-GPU training: consider downloading the models/datasets first and instantiating the respective classes directly from the local files, to avoid concurrent downloads of the same files (see the sketch after this list).
- Data should be read from the faster `/ptmp` partition. Make sure the cached datasets from Hugging Face's [datasets](https://github.com/huggingface/datasets) use this partition.
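A minimal sketch addressing both notes, assuming the Hugging Face cache is redirected to `/ptmp` and the model is fetched once before the job starts (the cache path and the model name `distilgpt2` are assumptions):

```shell
# Point the Hugging Face cache (models and datasets) to the faster /ptmp partition.
export HF_HOME=/ptmp/$USER/huggingface

# Download the model once before launching the multi-GPU job to avoid
# concurrent downloads from several processes.
apptainer exec transformers.sif python -c \
    "from huggingface_hub import snapshot_download; snapshot_download('distilgpt2')"
```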