From 4f66c812254a0a06fc64dbffe14daef3db28aecd Mon Sep 17 00:00:00 2001
From: David Carreto Fidalgo <david.carreto.fidalgo@mpcdf.mpg.de>
Date: Fri, 20 Oct 2023 16:05:12 +0200
Subject: [PATCH] Improve readme

---
 transformers/README.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/transformers/README.md b/transformers/README.md
index 240a10f..b410d22 100644
--- a/transformers/README.md
+++ b/transformers/README.md
@@ -1,3 +1,7 @@
+# transformers
+
+This container uses Nvidia's PyTorch container as its base layer and adds Hugging Face's [transformers](https://github.com/huggingface/transformers), [accelerate](https://github.com/huggingface/accelerate), and [datasets](https://github.com/huggingface/datasets) libraries.
+
 ## Build

 This container is built on top of the `nvidia_pytorch` container:
@@ -12,7 +16,7 @@ apptainer exec --nv transformers.sif python -c "import transformers"
 ```

 ## Examples
-The `example.py` and `example.slurm` scripts showcase how to perform a multi-node multi-gpu training with Hugging Face's transformers library.
+The `example.py` and `example.slurm` scripts showcase how to perform a multi-node multi-GPU training with Hugging Face's [transformers](https://github.com/huggingface/transformers) library.
 Here we will fine-tune a distilled version of GPT2 on a small Shakespear dataset, using 2 nodes with 4 GPUs each.

 First, get the Shakespear data:
@@ -20,7 +24,7 @@ First, get the Shakespear data:
 wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt .
 ```

-The transformers `Trainer` (together with Accelerate) and the `torchrun` command basically automates everything.
+The transformers [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) (together with [accelerate](https://github.com/huggingface/accelerate)) and the [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) command automate almost everything.
 We can use the same python script to run a single-device training on our laptop/machine, or a multi-device training on our SLURM cluster.

 To run the example locally on our machine:
@@ -28,7 +32,7 @@ To run the example locally on our machine:
 apptainer exec --nv transformers.sif python example.py
 ```

-To run a multi-node multi-GPU training on our SLURM cluster we just need to specify the resources we want to use in the SLURM script:
+To run a multi-node multi-GPU training on our SLURM cluster, we just need to specify the resources in the SLURM script:
 ```shell
 #SBATCH --nodes=2              # <- Number of nodes we want to use for the training
 #SBATCH --tasks-per-node=1     # <- Has to be 1, since torchrun takes care of spawning the processes for each GPU
@@ -44,4 +48,5 @@ Since we bind the container's home folder to the current working dir (`-B .:"$HO
 After the training, you can find the fine-tuned model in the `./model` folder.

 ## Notes
-- Be aware of HuggingFace's caching mechanisms when running a multi-gpu training! Maybe download the models/datasets first and instantiate the respective classes with the files directly to avoid concurrent downloads of the same files.
+- Be aware of Hugging Face's caching mechanisms when running a multi-GPU training. Consider downloading the models/datasets first and instantiating the respective classes directly from the local files, to avoid concurrent downloads of the same files.
+- Data should be read from the faster `/ptmp` partition. Make sure the datasets cached by Hugging Face's [datasets](https://github.com/huggingface/datasets) library also live on this partition.
--
GitLab
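To make the `Trainer`/`torchrun` workflow described in the patched README more concrete, here is a minimal sketch of what a Trainer-based fine-tuning script for this setup could look like. It is an illustration only, not the repository's actual `example.py`; the model name `distilgpt2`, the sequence length, and the batch size are assumptions.

```python
# Minimal, hypothetical sketch of a Trainer-based fine-tuning script for
# distilgpt2 on the Shakespeare text; the repository's example.py may differ.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Load the Shakespeare text downloaded with wget, drop empty lines, tokenize.
dataset = load_dataset("text", data_files="input.txt")["train"]
dataset = dataset.filter(lambda example: len(example["text"]) > 0)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./model", per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Launched with `torchrun`, the same script can run on several GPUs without code changes, since the `Trainer` picks up the distributed environment variables that `torchrun` sets.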
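Regarding the first note on caching: one way to avoid concurrent downloads in a multi-GPU job is to fetch the model once beforehand (for example on a login node) and point the classes at the local copy. This is only a sketch; the local folder name is an arbitrary example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

local_dir = "./distilgpt2-local"  # example path; pick any shared location

# Step 1 (run once, before submitting the job): download and save locally.
AutoTokenizer.from_pretrained("distilgpt2").save_pretrained(local_dir)
AutoModelForCausalLM.from_pretrained("distilgpt2").save_pretrained(local_dir)

# Step 2 (inside the training script): load from the local files only,
# so none of the spawned processes hits the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(local_dir)
model = AutoModelForCausalLM.from_pretrained(local_dir)
```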
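Regarding the second note: the datasets cache can be redirected to `/ptmp` either via the `HF_DATASETS_CACHE` environment variable or via the `cache_dir` argument of `load_dataset`. The path below is a placeholder; adjust it to your own `/ptmp` directory.

```python
import os

from datasets import load_dataset

# Placeholder path; adjust to your own directory on the /ptmp partition.
cache_dir = os.path.join("/ptmp", os.environ["USER"], "hf_datasets_cache")

# The cached Arrow files for this dataset will then live on /ptmp.
dataset = load_dataset("text", data_files="input.txt", cache_dir=cache_dir)
```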