Commit 94460219 authored by Nastassya Horlava

viper readme and fixes

parent a61f2bf4
1 merge request: !4 Docs tensorflow
# TensorFlow on Viper
Use the ROCm TensorFlow containers: https://hub.docker.com/r/rocm/tensorflow
## BUILD
1. Load the latest Apptainer module, e.g.:
```bash
module load apptainer/1.4.1
```
2. Build your container:
```bash
apptainer build amd-tensorflow.sif amd-tensorflow.def
```
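The definition file `amd-tensorflow.def` is not shown in this commit. As a minimal sketch only, assuming it does nothing more than pull the container tag listed under Results from Docker Hub, it could be generated like this (the real file may well contain additional sections such as `%post`):

```bash
# Hypothetical minimal definition file; the contents are an assumption,
# not the file that ships with this repository.
cat > amd-tensorflow.def <<'EOF'
Bootstrap: docker
From: rocm/tensorflow:rocm6.3.3-py3.12-tf2.16-dev
EOF
```

With that file in place, the `apptainer build` command from step 2 produces `amd-tensorflow.sif`.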
## RUN
* To run the training script on synthetic data using ResNet50 in an **undistributed** fashion, execute:
```bash
sbatch run_undistributed.slurm
```
* To run it in a **distributed** fashion on a **single node with multiple GPUs**, execute:
```bash
sbatch run_distributed_1_node_multi_gpu.slurm
```
* To run it in a **distributed** fashion across **multiple nodes**, execute:
```bash
sbatch run_distributed_multi_node_multi_gpu.slurm
```
### Results
Container: rocm/tensorflow:rocm6.3.3-py3.12-tf2.16-dev
#### (local) batch_size=256
> `CONF` = 1.96 * std (the half-width of a 95% interval, assuming normally distributed measurements)

|NNODES|NGPUS|IPS|CONF|
|-|-|-|-|
|1|1|1360.7|34.5|
|1|2|1204.8|32.7|
|2|4|2263.4|100.7|
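
How the `IPS` and `CONF` columns were computed is not shown here. The following is a rough sketch of the stated formula, assuming one throughput sample per line on standard input and the sample standard deviation (n - 1 in the denominator). The function name `conf95` is hypothetical.

```bash
# conf95: read one numeric sample per line from stdin and print
# "<mean> <1.96 * sample std>", each rounded to one decimal place.
# Assumption: sample standard deviation (divides by n - 1).
conf95() {
  awk '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n < 2) { print "need at least 2 samples" > "/dev/stderr"; exit 1 }
      mean = sum / n
      var = (sumsq - n * mean * mean) / (n - 1)
      if (var < 0) var = 0   # guard against tiny negative rounding errors
      printf "%.1f %.1f\n", mean, 1.96 * sqrt(var)
    }'
}
```

For example, `printf '%s\n' 10 12 14 | conf95` prints `12.0 3.9`.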
@@ -18,7 +18,7 @@ module load apptainer/1.4.1
 CONTAINER="amd-tensorflow.sif"
 export TF_FORCE_GPU_ALLOW_GROWTH=true
-srun apptainer exec --rocm -B ../src/:/workspace/ ${CONTAINER} bash -c """
+srun apptainer exec -B ../src/:/workspace/ ${CONTAINER} bash -c """
 export RANK=\${SLURM_PROCID}
 python /workspace/train_synthetic.py train
...
@@ -18,7 +18,7 @@ module load apptainer/1.4.1
 CONTAINER="amd-tensorflow.sif"
 export TF_FORCE_GPU_ALLOW_GROWTH=true
-PRE_RUN="source /workspace/set_tf_config_multiple_nodes.sh && echo \$TF_CONFIG && export RANK=\${SLURM_PROCID}"
+PRE_RUN="source ../src/set_tf_config_multiple_nodes.sh && echo \$TF_CONFIG && export RANK=\${SLURM_PROCID}"
 srun bash -c """
 ${PRE_RUN} &&
...
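
The hunk above sources `set_tf_config_multiple_nodes.sh` and echoes `TF_CONFIG`, but the script itself is not part of this view. As a rough sketch only, a helper that builds the `TF_CONFIG` JSON expected by TensorFlow's multi-worker strategies might look like the following. The function name `make_tf_config`, the port, and the one-worker-per-node layout are all assumptions.

```bash
# make_tf_config PORT INDEX HOST [HOST...]
# Prints a TF_CONFIG JSON string with one worker per host.
# Hypothetical helper; not the contents of set_tf_config_multiple_nodes.sh.
# The body runs in a subshell so its variables do not leak into the caller.
make_tf_config() (
  port="$1"
  index="$2"
  shift 2
  workers=""
  for host in "$@"; do
    # Append "host:port", comma-separated after the first entry.
    workers="${workers}${workers:+, }\"${host}:${port}\""
  done
  printf '{"cluster": {"worker": [%s]}, "task": {"type": "worker", "index": %s}}\n' \
    "$workers" "$index"
)

# Possible use inside a Slurm job step (assumes one task per node; port is arbitrary):
# export TF_CONFIG="$(make_tf_config 12345 "${SLURM_NODEID}" $(scontrol show hostnames "${SLURM_JOB_NODELIST}"))"
```

Each worker must see the same host list and its own `index`, which is why the sketch takes the index as a separate argument.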
@@ -19,7 +19,7 @@ module load apptainer/1.4.1
 CONTAINER="amd-tensorflow.sif"
 export TF_FORCE_GPU_ALLOW_GROWTH=true
-srun apptainer exec --rocm -B ../src/:/workspace/ ${CONTAINER} bash -c """
+srun apptainer exec -B ../src/:/workspace/ ${CONTAINER} bash -c """
 export RANK=\${SLURM_PROCID}
 python /workspace/train_synthetic.py train
...