Commit 9d94eac8 authored by David Carreto Fidalgo

save working examples

parent ad1477c7
@@ -12,4 +12,37 @@ apptainer exec --nv transformers.sif python -c "import transformers"
```
## Examples
**TODO:** Provide an example of how to run a distributed transformer training (example python and slurm script)
The `example.py` and `example.slurm` scripts showcase how to perform a multi-node multi-GPU training with Hugging Face's transformers library.
Here we will fine-tune a distilled version of GPT2 on a small Shakespeare dataset, using 2 nodes with 4 GPUs per node.
First, get the Shakespeare data:
```shell
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```
The transformers `Trainer` (together with Accelerate) and the `torchrun` command basically automate everything:
you can use the same Python script to run a single-device training on your laptop or a multi-device training on our SLURM cluster.
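For orientation, this is roughly what such a script looks like. It is a sketch only, not the repository's `example.py` verbatim; the `distilgpt2` checkpoint name and the preprocessing choices are assumptions:
```python
# Sketch of a Trainer-based fine-tuning script (assumed details, not example.py verbatim).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)


def main():
    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")

    # Load the plain-text Shakespeare file and tokenize it.
    dataset = load_dataset("text", data_files="input.txt", split="train")
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )
    dataset = dataset.train_test_split(test_size=0.1)

    training_args = TrainingArguments(
        output_dir="model",
        evaluation_strategy="epoch",
        per_device_train_batch_size=2,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()


if __name__ == "__main__":
    main()
```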
To run the example locally on your machine:
```shell
apptainer exec --nv transformers.sif python example.py
```
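If your local machine has several GPUs, you can launch the same script with `torchrun` in standalone mode (this mirrors the single-node command the SLURM script used before the multi-node setup):
```shell
apptainer exec --nv transformers.sif torchrun --standalone --nproc-per-node=gpu example.py
```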
To run a multi-node multi-GPU training on our SLURM cluster, you just need to specify the resources you want to use in your SLURM script:
```shell
#SBATCH --nodes=2 # <- Number of nodes we want to use for the training
#SBATCH --tasks-per-node=1 # <- Has to be 1, since torchrun takes care of spawning the processes for each GPU
#SBATCH --gres=gpu:a100:4 # <- Number of GPUs you want to use per node, here 4
```
Send the job to the queue:
```shell
sbatch example.slurm
```
Since we bind the current working directory to the container's home folder (`-B .:"$HOME"`), the pretrained model and the dataset preprocessing will be cached in the `./.cache` folder.
After the training, you can find the fine-tuned model in the `./model` folder.
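The result can then be loaded like any other pretrained checkpoint. A small sketch, assuming the weights end up directly in `./model` (depending on the `Trainer` settings they may instead sit in a `./model/checkpoint-*` subfolder):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("model")    # the fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # assumed base checkpoint
prompt = tokenizer("To be, or not to be", return_tensors="pt")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=30)[0]))
```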
## Notes
- Be aware of Hugging Face's caching mechanisms when running a multi-GPU training! Consider downloading the models/datasets first and instantiating the respective classes from the local files directly, to avoid concurrent downloads of the same files (see the sketch below).
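One way to do this is to fetch the checkpoint once before submitting the job and point `from_pretrained` at the local copy. A minimal sketch, assuming the `distilgpt2` checkpoint and the `huggingface_hub` package inside the container:
```python
# Pre-download the checkpoint once (e.g. on the login node), then let every rank
# load it from the local path instead of hitting the Hub concurrently.
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

local_path = snapshot_download(repo_id="distilgpt2")  # lands in the cache dir (./.cache with the bind mount)
tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForCausalLM.from_pretrained(local_path)
```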
@@ -45,10 +45,12 @@ def main():
training_args = TrainingArguments(
output_dir="model",
evaluation_strategy="epoch",
learning_rate=2e-5,
learning_rate=4e-5,
weight_decay=0.01,
log_level="info",
log_level_replica="warning",
ddp_find_unused_parameters=False,
per_device_train_batch_size=2,
)
trainer = Trainer(
......
@@ -5,13 +5,13 @@
#SBATCH -D ./
#SBATCH -J transformers
#
#SBATCH --nodes=1
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=36
#SBATCH --cpus-per-task=72
#SBATCH --mem=0
#
#SBATCH --constraint="gpu"
#SBATCH --gres=gpu:a100:2
#SBATCH --gres=gpu:a100:4
#
#SBATCH --mail-type=none
#SBATCH --mail-user=david.carreto.fidalgo@gmail.com
@@ -22,16 +22,24 @@
source /etc/profile.d/modules.sh
module purge
module load apptainer
#export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Avoid hyper-threading (in this case cpus-per-task // nr_of_gpus):
export OMP_NUM_THREADS=18
# For pinning threads correctly:
export OMP_PLACES=cores
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv -l 1 > nvidia_smi_monitoring.csv &
NVIDIASMI_PID=$!
# Useful for debugging:
# export NCCL_DEBUG=INFO
srun apptainer exec --nv -B .:"$HOME" transformers.sif torchrun --standalone --nnodes=1 --nproc-per-node=gpu example.py
kill $NVIDIASMI_PID
srun apptainer exec \
--nv -B .:"$HOME" \
transformers.sif torchrun \
--nnodes="$SLURM_NNODES" \
--nproc-per-node=gpu \
--rdzv-id="$SLURM_JOBID" \
--rdzv-endpoint=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1) \
--rdzv-backend="c10d" \
example.py
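With `--tasks-per-node=1`, `srun` starts exactly one `torchrun` launcher per node; the `--rdzv-*` flags let these launchers find each other through a c10d rendezvous keyed by the SLURM job ID, and each launcher then spawns one worker process per GPU on its node. The rendezvous endpoint is simply the first host of the allocation:
```shell
# First node of the allocation, used as the rendezvous endpoint by all launchers:
scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1
```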