This container uses NVIDIA's PyTorch container as its base layer and adds Hugging Face's [transformers](https://github.com/huggingface/transformers), [accelerate](https://github.com/huggingface/accelerate), and [datasets](https://github.com/huggingface/datasets) libraries.
## Build
This container is built on top of the `nvidia_pytorch` container.
The `example.py` and `example.slurm` scripts showcase how to perform a multi-node, multi-GPU training with Hugging Face's [transformers](https://github.com/huggingface/transformers) library.
Here we will fine-tune a distilled version of GPT-2 on a small Shakespeare dataset, using 2 nodes with 4 GPUs each.
The transformers [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) (together with [accelerate](https://github.com/huggingface/accelerate)) and the [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) command automate almost everything.
We can use the same Python script to run a single-device training on our laptop/machine, or a multi-device training on our SLURM cluster.
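Concretely, `torchrun` exports the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for every process it spawns; a script that reads them with single-process defaults runs unchanged on a laptop. A minimal sketch of that detection (the helper name is ours, not from `example.py`):

```python
import os


def get_dist_config() -> dict:
    """Read torchrun's standard env variables, defaulting to a single process.

    RANK/LOCAL_RANK/WORLD_SIZE are set by torchrun for each spawned worker;
    when they are absent, we are running as a plain single-device script.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }
```

The transformers `Trainer` performs this kind of detection internally; the sketch only illustrates why the same script works in both settings.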
To run the example locally on our machine:
...
...
To run a multi-node, multi-GPU training on our SLURM cluster, we just need to specify the resources we want to use in the SLURM script:
```shell
#SBATCH --nodes=2 # <- Number of nodes we want to use for the training
#SBATCH --ntasks-per-node=1 # <- Has to be 1, since torchrun takes care of spawning one process per GPU
...
```
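Beyond the resource directives, the batch script must also launch `torchrun` on every node with a common rendezvous host. A hypothetical sketch of that logic (not the repository's actual `example.slurm`; the GPU count and port are assumptions) might look like:

```shell
# Hypothetical sketch of the launch logic inside a SLURM batch script.
# Falls back to single-node defaults so it also runs outside the cluster.
NNODES=${SLURM_NNODES:-1}
NODE_RANK=${SLURM_NODEID:-0}

# Use the first host in the allocation as torchrun's rendezvous address.
if [ -n "${SLURM_NODELIST:-}" ]; then
    MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
else
    MASTER_ADDR=localhost
fi

# Assembled command (echoed here; a real script would execute it on each node):
echo torchrun \
    --nnodes="$NNODES" \
    --nproc_per_node=4 \
    --node_rank="$NODE_RANK" \
    --master_addr="$MASTER_ADDR" \
    --master_port=29500 \
    example.py
```

With `--tasks-per-node=1`, each node runs this launcher exactly once, and `torchrun` itself spawns one worker process per GPU.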
Since we bind the container's home folder to the current working directory (`-B .:"$HOME"`), you can find the fine-tuned model in the `./model` folder after the training.
## Notes
- Be aware of Hugging Face's caching mechanisms when running a multi-GPU training! Consider downloading the models/datasets first and instantiating the respective classes from the local files directly, to avoid concurrent downloads of the same files.
- Data should be read from the faster `/ptmp` partition. Make sure the datasets cached by Hugging Face's [datasets](https://github.com/huggingface/datasets) library end up on this partition.
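One way to do this is to point the library's cache at `/ptmp` before the training starts; `HF_DATASETS_CACHE` is the environment variable the `datasets` library reads for its cache location. The subdirectory layout below is just an assumption, so adapt it to your site's conventions:

```shell
# Redirect the datasets cache to the fast /ptmp partition.
# The subdirectory layout is an assumption; adapt it to your site's conventions.
export HF_DATASETS_CACHE="/ptmp/$USER/huggingface/datasets"
```

Export this in the SLURM script (or your shell) before launching the training so every spawned worker picks it up.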