Commit 94460219 authored by Nastassya Horlava

viper readme and fixes

parent a61f2bf4
1 merge request: !4 Docs tensorflow
# TensorFlow on Viper
Use the ROCm TensorFlow containers: https://hub.docker.com/r/rocm/tensorflow
## BUILD
1. Load the latest Apptainer module, e.g.:
```bash
module load apptainer/1.4.1
```
2. Build your container:
```bash
apptainer build amd-tensorflow.sif amd-tensorflow.def
```
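The contents of `amd-tensorflow.def` are not shown here. As a minimal sketch, assuming the definition file only bootstraps from the Docker Hub image listed under Results, it could look like the file written below. The file name `example-tensorflow.def` is hypothetical and chosen so that the real definition file is not overwritten.

```bash
# Hypothetical definition file; the real amd-tensorflow.def may contain more sections.
cat > example-tensorflow.def <<'EOF'
Bootstrap: docker
From: rocm/tensorflow:rocm6.3.3-py3.12-tf2.16-dev
EOF

# The image would then be built the same way as above:
#   apptainer build example-tensorflow.sif example-tensorflow.def
```

Extra Python packages, if the training script needs any, would typically be installed in a `%post` section of the definition file.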
## RUN
* To run the training script on synthetic data using ResNet50 in an **undistributed** fashion, execute:
```bash
sbatch run_undistributed.slurm
```
* To run it in a **distributed** fashion on a **single node with multiple GPUs**, execute:
```bash
sbatch run_distributed_1_node_multi_gpu.slurm
```
* To run it in a **distributed** fashion across **multiple nodes**, execute:
```bash
sbatch run_distributed_multi_node_multi_gpu.slurm
```
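For the multi-node case, the Slurm script sources `set_tf_config_multiple_nodes.sh` and echoes `TF_CONFIG`, the environment variable that TensorFlow's multi-worker strategies read to discover the cluster. That script is not shown, so the following is only a sketch of how such a value could be assembled. The function name `make_tf_config`, the port `12345`, and the one-worker-per-node layout are assumptions.

```bash
# Hypothetical helper: print a TF_CONFIG JSON string for a worker-only cluster.
# Usage: make_tf_config <task_index> <port> <host1> [<host2> ...]
make_tf_config() {
    local index="$1" port="$2"
    shift 2
    local workers="" host
    for host in "$@"; do
        # Append "host:port", comma-separated after the first entry.
        workers+="${workers:+, }\"${host}:${port}\""
    done
    printf '{"cluster": {"worker": [%s]}, "task": {"type": "worker", "index": %s}}\n' \
        "$workers" "$index"
}

# Inside a Slurm job step this could be used as follows
# (assumes one worker per node and that port 12345 is free):
#   export TF_CONFIG="$(make_tf_config "$SLURM_NODEID" 12345 \
#       $(scontrol show hostnames "$SLURM_JOB_NODELIST"))"
```

Each worker must see the same host list in the same order and a unique `index`, which is why the sketch derives the index from a Slurm-provided ID rather than hard-coding it.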
### Results
Container: `rocm/tensorflow:rocm6.3.3-py3.12-tf2.16-dev`
#### (local) batch_size=256
> `CONF` = 1.96 * std, i.e. the half-width of the interval that covers about 95% of values under a normal distribution.

|NNODES|NGPUS|IPS|CONF|
|-|-|-|-|
|1|1|1360.7|34.5|
|1|2|1204.8|32.7|
|2|4|2263.4|100.7|
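
The `CONF` column is defined as 1.96 * std. As a sketch of that arithmetic, the helper below computes the mean and `CONF` from a list of per-step throughput values. The function name `mean_and_conf` and the input values are made up for illustration, and using the sample standard deviation (dividing by n - 1) is an assumption, since the benchmark's exact estimator is not stated.

```bash
# Hypothetical helper: print "<mean> <1.96 * sample std>" for the given numbers.
mean_and_conf() {
    printf '%s\n' "$@" | awk '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            mean = sum / n
            # Sample variance; fall back to 0 for a single value.
            var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
            if (var < 0) var = 0   # guard against rounding error
            printf "%.1f %.1f\n", mean, 1.96 * sqrt(var)
        }'
}

# Illustrative values only, not measurements from Viper.
mean_and_conf 10 12 14
```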
@@ -18,7 +18,7 @@ module load apptainer/1.4.1
 CONTAINER="amd-tensorflow.sif"
 export TF_FORCE_GPU_ALLOW_GROWTH=true
-srun apptainer exec --rocm -B ../src/:/workspace/ ${CONTAINER} bash -c """
+srun apptainer exec -B ../src/:/workspace/ ${CONTAINER} bash -c """
 export RANK=\${SLURM_PROCID}
 python /workspace/train_synthetic.py train
@@ -18,7 +18,7 @@ module load apptainer/1.4.1
 CONTAINER="amd-tensorflow.sif"
 export TF_FORCE_GPU_ALLOW_GROWTH=true
-PRE_RUN="source /workspace/set_tf_config_multiple_nodes.sh && echo \$TF_CONFIG && export RANK=\${SLURM_PROCID}"
+PRE_RUN="source ../src/set_tf_config_multiple_nodes.sh && echo \$TF_CONFIG && export RANK=\${SLURM_PROCID}"
 srun bash -c """
 ${PRE_RUN} &&
@@ -19,7 +19,7 @@ module load apptainer/1.4.1
 CONTAINER="amd-tensorflow.sif"
 export TF_FORCE_GPU_ALLOW_GROWTH=true
-srun apptainer exec --rocm -B ../src/:/workspace/ ${CONTAINER} bash -c """
+srun apptainer exec -B ../src/:/workspace/ ${CONTAINER} bash -c """
 export RANK=\${SLURM_PROCID}
 python /workspace/train_synthetic.py train