From 4f66c812254a0a06fc64dbffe14daef3db28aecd Mon Sep 17 00:00:00 2001
From: David Carreto Fidalgo <david.carreto.fidalgo@mpcdf.mpg.de>
Date: Fri, 20 Oct 2023 16:05:12 +0200
Subject: [PATCH] Improve readme

---
 transformers/README.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/transformers/README.md b/transformers/README.md
index 240a10f..b410d22 100644
--- a/transformers/README.md
+++ b/transformers/README.md
@@ -1,3 +1,7 @@
+# transformers
+
+This container uses Nvidia's PyTorch container as a base layer and adds Hugging Face's [transformers](https://github.com/huggingface/transformers), [accelerate](https://github.com/huggingface/accelerate), and [datasets](https://github.com/huggingface/datasets) libraries.
+
 ## Build
 This container is built on top of the `nvidia_pytorch` container:
 
@@ -12,7 +16,7 @@ apptainer exec --nv transformers.sif python -c "import transformers"
 ```
 
 ## Examples
-The `example.py` and `example.slurm` scripts showcase how to perform a multi-node multi-gpu training with Hugging Face's transformers library.
+The `example.py` and `example.slurm` scripts showcase how to run a multi-node, multi-GPU training with Hugging Face's [transformers](https://github.com/huggingface/transformers) library.
-Here we will fine-tune a distilled version of GPT2 on a small Shakespear dataset, using 2 nodes with 4 GPUs each.
+Here we will fine-tune a distilled version of GPT2 on a small Shakespeare dataset, using 2 nodes with 4 GPUs each.
 
-First, get the Shakespear data:
+First, get the Shakespeare data:
@@ -20,7 +24,7 @@ First, get the Shakespear data:
-wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt .
+wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
 ```
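+
+If you want to take a quick look at the data before training, a minimal sketch (not part of `example.py`) using the [datasets](https://github.com/huggingface/datasets) library could look like this:
+```python
+# Hedged sketch: load the downloaded text file with the datasets library
+# and print the first example, just to check that everything is in place.
+from datasets import load_dataset
+
+dataset = load_dataset("text", data_files="input.txt")  # one example per line
+print(dataset["train"][0])
+```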
 
-The transformers `Trainer` (together with Accelerate) and the `torchrun` command basically automates everything.
+The transformers [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) (together with [accelerate](https://github.com/huggingface/accelerate)) and the [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) command automate almost everything.
 We can use the same python script to run a single-device training on our laptop/machine, or a multi-device training on our SLURM cluster.
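+
+The actual training logic lives in `example.py`; purely as a hedged illustration of what such a `Trainer`-based script can look like (the `distilgpt2` checkpoint, the sequence length and the hyperparameters below are assumptions, not necessarily what `example.py` uses):
+```python
+# Hedged sketch of a Trainer-based fine-tuning script, not the actual example.py.
+from datasets import load_dataset
+from transformers import (AutoModelForCausalLM, AutoTokenizer,
+                          DataCollatorForLanguageModeling, Trainer,
+                          TrainingArguments)
+
+tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
+tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
+
+def tokenize(batch):
+    return tokenizer(batch["text"], truncation=True, max_length=128)
+
+dataset = load_dataset("text", data_files="input.txt")
+tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
+tokenized = tokenized.filter(lambda x: len(x["input_ids"]) > 0)  # drop empty lines
+
+model = AutoModelForCausalLM.from_pretrained("distilgpt2")
+training_args = TrainingArguments(output_dir="model",
+                                  per_device_train_batch_size=8,
+                                  num_train_epochs=1)
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=tokenized,
+    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+)
+trainer.train()
+trainer.save_model("model")
+```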
 
 To run the example locally on our machine:
@@ -28,7 +32,7 @@ To run the example locally on our machine:
 apptainer exec --nv transformers.sif python example.py
 ```
 
-To run a multi-node multi-GPU training on our SLURM cluster we just need to specify the resources we want to use in the SLURM script:
+To run a multi-node multi-GPU training on our SLURM cluster, we just need to specify the resources in the SLURM script:
 ```shell
 #SBATCH --nodes=2  # <- Number of nodes we want to use for the training
 #SBATCH --tasks-per-node=1  # <- Has to be 1, since torchrun takes care of spawning the processes for each GPU
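+
+# What follows is only a hedged sketch of how the training could then be
+# launched (the actual command lives in example.slurm and may differ):
+# srun starts torchrun once per node, and torchrun spawns one worker per GPU.
+head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
+srun apptainer exec --nv -B .:"$HOME" transformers.sif \
+    torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=4 --rdzv_id="$SLURM_JOB_ID" \
+    --rdzv_backend=c10d --rdzv_endpoint="${head_node}:29500" example.py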
@@ -44,4 +48,5 @@ Since we bind the container's home folder to the current working dir (`-B .:"$HO
 After the training, you can find the fine-tuned model in the `./model` folder.
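+
+For a quick test of the fine-tuned model inside the container, here is a hedged sketch (it assumes the tokenizer was saved to `./model` as well; if not, pass `tokenizer=` to the pipeline explicitly):
+```python
+# Hedged sketch: load the fine-tuned model from ./model and generate a sample.
+from transformers import pipeline
+
+generator = pipeline("text-generation", model="./model")
+print(generator("ROMEO:", max_new_tokens=40)[0]["generated_text"])
+```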
 
 ## Notes
-- Be aware of HuggingFace's caching mechanisms when running a multi-gpu training! Maybe download the models/datasets first and instantiate the respective classes with the files directly to avoid concurrent downloads of the same files.
+- Be aware of Hugging Face's caching mechanisms when running a multi-GPU training. Consider downloading the models/datasets first and instantiating the respective classes directly from the local files to avoid concurrent downloads of the same files.
+- Data should be read from the faster `/ptmp` partition. Make sure the cached datasets from Hugging Face's [datasets](https://github.com/huggingface/datasets) library also use this partition; see the sketch below.
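+
+One way to do this (a hedged sketch; the `/ptmp/<user>/hf_cache` path below is only a placeholder, and setting the `HF_DATASETS_CACHE` environment variable achieves the same) is to pass an explicit `cache_dir`:
+```python
+# Hedged sketch: keep the datasets cache on /ptmp instead of the home folder.
+import getpass
+from datasets import load_dataset
+
+cache_dir = f"/ptmp/{getpass.getuser()}/hf_cache"  # placeholder path, adapt it
+dataset = load_dataset("text", data_files="input.txt", cache_dir=cache_dir)
+```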
-- 
GitLab