Commit 9d94eac8 authored by David Carreto Fidalgo

save working examples

parent ad1477c7
@@ -12,4 +12,37 @@ apptainer exec --nv transformers.sif python -c "import transformers"
```
## Examples
**TODO:** Provide an example of how to run a distributed transformer training (example python and slurm script)
The `example.py` and `example.slurm` scripts showcase how to perform a multi-node multi-GPU training with Hugging Face's transformers library.
Here we will fine-tune a distilled version of GPT2 on a small Shakespeare dataset, using 2 nodes with 4 GPUs per node.
First, get the Shakespeare data:
```shell
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```
The transformers `Trainer` (together with Accelerate) and the `torchrun` command basically automate everything:
you can use the same Python script to run a single-device training on your laptop or a multi-device training on our SLURM cluster.
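For orientation, this is roughly what such a script looks like. It is a sketch only, not the repository's `example.py` verbatim; the `distilgpt2` checkpoint name and the preprocessing choices are assumptions:
```python
# Sketch of a Trainer-based fine-tuning script (assumed details, not example.py verbatim).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)


def main():
    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")

    # Load the plain-text Shakespeare file and tokenize it.
    dataset = load_dataset("text", data_files="input.txt", split="train")
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )
    dataset = dataset.train_test_split(test_size=0.1)

    training_args = TrainingArguments(
        output_dir="model",
        evaluation_strategy="epoch",
        per_device_train_batch_size=2,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()


if __name__ == "__main__":
    main()
```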
To run the example locally on your machine:
```shell
apptainer exec --nv transformers.sif python example.py
```
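If your local machine has several GPUs, you can launch the same script with `torchrun` in standalone mode (this mirrors the single-node command the SLURM script used before the multi-node setup):
```shell
apptainer exec --nv transformers.sif torchrun --standalone --nproc-per-node=gpu example.py
```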
To run a multi-node multi-GPU training on our SLURM cluster, you just need to specify the resources you want to use in your SLURM script:
```shell
#SBATCH --nodes=2 # <- Number of nodes we want to use for the training
#SBATCH --tasks-per-node=1 # <- Has to be 1, since torchrun takes care of spawning the processes for each GPU
#SBATCH --gres=gpu:a100:4 # <- Number of GPUs you want to use per node, here 4
```
Send the job to the queue:
```shell
sbatch example.slurm
```
Since we bind the current working directory to the container's home folder (`-B .:"$HOME"`), the pretrained model and the dataset preprocessing will be cached in the `./.cache` folder.
After the training, you can find the fine-tuned model in the `./model` folder.
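The result can then be loaded like any other pretrained checkpoint. A small sketch, assuming the weights end up directly in `./model` (depending on the `Trainer` settings they may instead sit in a `./model/checkpoint-*` subfolder):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("model")    # the fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # assumed base checkpoint
prompt = tokenizer("To be, or not to be", return_tensors="pt")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=30)[0]))
```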
## Notes
- Be aware of Hugging Face's caching mechanisms when running a multi-GPU training! Consider downloading the models/datasets first and instantiating the respective classes from the local files directly, to avoid concurrent downloads of the same files (see the sketch below).
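One way to do this is to fetch the checkpoint once before submitting the job and point `from_pretrained` at the local copy. A minimal sketch, assuming the `distilgpt2` checkpoint and the `huggingface_hub` package inside the container:
```python
# Pre-download the checkpoint once (e.g. on the login node), then let every rank
# load it from the local path instead of hitting the Hub concurrently.
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

local_path = snapshot_download(repo_id="distilgpt2")  # lands in the cache dir (./.cache with the bind mount)
tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForCausalLM.from_pretrained(local_path)
```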
@@ -45,10 +45,12 @@ def main():
training_args = TrainingArguments(
output_dir="model",
evaluation_strategy="epoch",
learning_rate=2e-5,
learning_rate=4e-5,
weight_decay=0.01,
log_level="info",
log_level_replica="warning",
ddp_find_unused_parameters=False,
per_device_train_batch_size=2,
)
trainer = Trainer(
......
@@ -5,13 +5,13 @@
#SBATCH -D ./
#SBATCH -J transformers
#
#SBATCH --nodes=1
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=36
#SBATCH --cpus-per-task=72
#SBATCH --mem=0
#
#SBATCH --constraint="gpu"
#SBATCH --gres=gpu:a100:2
#SBATCH --gres=gpu:a100:4
#
#SBATCH --mail-type=none
#SBATCH --mail-user=david.carreto.fidalgo@gmail.com
@@ -22,16 +22,24 @@
source /etc/profile.d/modules.sh
module purge
module load apptainer
#export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Avoid hyper-threading (in this case cpus-per-task // nr_of_gpus):
export OMP_NUM_THREADS=18
# For pinning threads correctly:
export OMP_PLACES=cores
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv -l 1 > nvidia_smi_monitoring.csv &
NVIDIASMI_PID=$!
# Useful for debugging:
# export NCCL_DEBUG=INFO
srun apptainer exec --nv -B .:"$HOME" transformers.sif torchrun --standalone --nnodes=1 --nproc-per-node=gpu example.py
kill $NVIDIASMI_PID
srun apptainer exec \
--nv -B .:"$HOME" \
transformers.sif torchrun \
--nnodes="$SLURM_NNODES" \
--nproc-per-node=gpu \
--rdzv-id="$SLURM_JOBID" \
--rdzv-endpoint=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1) \
--rdzv-backend="c10d" \
example.py
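With `--tasks-per-node=1`, `srun` starts exactly one `torchrun` launcher per node; the `--rdzv-*` flags let these launchers find each other through a c10d rendezvous keyed by the SLURM job ID, and each launcher then spawns one worker process per GPU on its node. The rendezvous endpoint is simply the first host of the allocation:
```shell
# First node of the allocation, used as the rendezvous endpoint by all launchers:
scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1
```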