Goal: compile definition files of containers for AI use cases.
Also provide documentation on how to use them on our HPC systems.
<!---
[TODO]:ADD example of distributed training with containers
-->
## Getting started
We use [Apptainer](https://apptainer.org/docs/user/main/index.html) to build/run containers on our HPC systems.
You will need a Linux system to run Apptainer natively on your machine, and it’s easiest to [install](https://apptainer.org/docs/user/main/quick_start.html) if you have root access.
### Convert from Docker Daemon or Docker Archive files
<!---
Piero: I would stress LOCALLY here. Docker is not, and will never be, available on our systems.
In any case this is useful. What about adding it to the "Local-to-HPC Workflow" section at the bottom?
-->
On your local machine, you can also [convert images/containers](https://apptainer.org/docs/user/latest/docker_and_oci.html#containers-from-docker-hub) that are stored in your Docker daemon or saved as Docker archive files.
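A minimal sketch of such a local conversion, assuming both Docker and Apptainer are installed on your own machine. The image name `my_image:latest` and all file names are hypothetical placeholders. The `docker-daemon://` and `docker-archive://` URI forms are the ones we are aware of, so check the documentation linked above for the exact form your Apptainer version expects.

```shell
# Hypothetical names, replace them with your own image and file names.
DOCKER_IMAGE="my_image:latest"
DOCKER_ARCHIVE="my_image.tar"
SIF_FROM_DAEMON="my_image.sif"
SIF_FROM_ARCHIVE="my_image_from_archive.sif"

# Run the conversion only where both tools exist (your local machine, not the HPC systems).
if command -v docker >/dev/null 2>&1 && command -v apptainer >/dev/null 2>&1; then
    # Option 1: build a SIF file directly from an image held by the local Docker daemon.
    apptainer build "$SIF_FROM_DAEMON" "docker-daemon://$DOCKER_IMAGE"

    # Option 2: export the image to a Docker archive first, then build from the archive.
    docker save -o "$DOCKER_ARCHIVE" "$DOCKER_IMAGE"
    apptainer build "$SIF_FROM_ARCHIVE" "docker-archive://$DOCKER_ARCHIVE"
else
    echo "docker and apptainer are both required for this step" >&2
fi
```

The resulting SIF file can then be copied to the HPC system and used in place of `$IMAGE_SIF` in the commands below.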
- mention important flags, like `--nv` for example
- how to run the containers on our SLURM cluster
> **_NOTE:_** The following code snippets assume that you have loaded an Apptainer image using environment modules (see [TODO](link/to/docs/images/modules)). For example:
>
> `module load image_pytorch`
>
> Replace `$IMAGE_SIF` with the path to a SIF file or a reference to an OCI registry to use the same commands with your own images.
### Interactive Shell
To run an interactive shell in a container, use:
```shell
$ apptainer shell $IMAGE_SIF
Apptainer>
```
The prompt will change from `$` to `Apptainer>`, indicating that you are now running commands inside the container. The shell command is useful for interactively inspecting the content of an image.
### Executing commands
To execute a single program inside the container, use the `exec` command:
```shell
$ apptainer exec $IMAGE_SIF echo "Hallo Welt!"
Hallo Welt!
```
For more details, refer to the [Apptainer documentation](https://apptainer.org/docs/user/latest/quick_start.html#interacting-with-images).
### GPU support
Apptainer natively supports running application containers that use NVIDIA’s CUDA or AMD’s ROCm GPU frameworks. To use these accelerators on a GPU compute node, add the `--nv` flag (NVIDIA) or the `--rocm` flag (AMD) to the `apptainer` command. For example, on Raven:
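A minimal sketch of such a call, assuming you are already on a compute node with NVIDIA GPUs and that `$IMAGE_SIF` is set as described in the note above. `nvidia-smi` is used here only as a quick check that the GPUs are visible inside the container. The `GPU_FLAG` variable is a hypothetical helper for this example.

```shell
# Flag that exposes the GPUs to the container; use --rocm on nodes with AMD GPUs.
GPU_FLAG="--nv"

# Assumption: IMAGE_SIF was set by `module load image_pytorch` or points to your own SIF file.
if command -v apptainer >/dev/null 2>&1 && [ -n "${IMAGE_SIF:-}" ]; then
    # With --nv, the host NVIDIA driver libraries and tools are made available
    # in the container, so nvidia-smi should list the GPUs of the node.
    apptainer exec "$GPU_FLAG" "$IMAGE_SIF" nvidia-smi
else
    echo "apptainer not found or IMAGE_SIF not set" >&2
fi
```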
For more details, refer to the Apptainer documentation on [GPU support](https://apptainer.org/docs/user/latest/gpu.html).
<!---
[TODO]:How to run containers on our SLURM cluster
Piero: personally I would remove this subsection. Focus always on HPC systems
-->
## Example: submitting a multi-node distributed training job with PyTorch Lightning
<!---
[TODO]:
-->
### Python script
<!---
[TODO]:Add the most simple training script (Lightning?) - Or reference to scripts from somewhere else!
-->
### Slurm batch script
<!---
[TODO]:Add multinode, multi GPU batch script
-->
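Until the final script is added here, the following is only a rough sketch of what a multi-node, multi-GPU batch script could look like. Every resource value (number of nodes, tasks, GPUs, CPUs, time limit), the `--gres` syntax, and the script name `train.py` are hypothetical placeholders that need to be adapted to the target system. It assumes that `module load image_pytorch` sets `$IMAGE_SIF` as described in the note above, and that `num_nodes` and `devices` in the Lightning `Trainer` match the Slurm settings.

```shell
# Write a hypothetical batch script to disk; all resource values are placeholders.
cat > train_job.sh <<'EOF'
#!/bin/bash -l
#SBATCH --job-name=lightning_ddp
#SBATCH --nodes=2                 # number of nodes (placeholder)
#SBATCH --ntasks-per-node=4       # one task per GPU (placeholder)
#SBATCH --gres=gpu:4              # GPUs per node (placeholder, syntax is site-specific)
#SBATCH --cpus-per-task=8         # CPU cores per task (placeholder)
#SBATCH --time=01:00:00

module purge
# Assumed to set $IMAGE_SIF; depending on the system, apptainer itself
# may need to be loaded as a module as well.
module load image_pytorch

# srun starts one container process per task; --nv exposes the NVIDIA GPUs.
# train.py is a hypothetical training script.
srun apptainer exec --nv "$IMAGE_SIF" python train.py
EOF

# Submit the job only where Slurm is available.
if command -v sbatch >/dev/null 2>&1; then
    sbatch train_job.sh
fi
```

Using one Slurm task per GPU lets `srun` launch all processes, so no separate launcher is needed inside the container.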
## Using containers with RVS
...
...
For example, in the kernel spec file above we bind your `ptmp` folder.
**TODO: The sandbox option does not work 100% correctly for VSCode or PyCharm, use docker images instead! Need to update this guide!**
A nice workflow to develop a Python library locally and deploy it on our HPC systems (sharing exactly the same environment) is to use the [*sandbox* feature](https://apptainer.org/docs/user/main/build_a_container.html#sandbox) of Apptainer.
We are still investigating if something similar is possible with `Docker` (please let us know if you find a way :) ).
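The sandbox workflow can be sketched roughly as follows. The definition file `my_image.def`, the sandbox directory `my_sandbox`, and the output file `my_image.sif` are hypothetical names. Depending on your local setup, the build and `--writable` steps may require root privileges or the `--fakeroot` option.

```shell
# Hypothetical names, replace them with your own files.
DEF_FILE="my_image.def"
SANDBOX_DIR="my_sandbox"
SIF_FILE="my_image.sif"

if command -v apptainer >/dev/null 2>&1 && [ -f "$DEF_FILE" ]; then
    # 1. Build a writable directory tree (sandbox) instead of a read-only SIF file.
    apptainer build --sandbox "$SANDBOX_DIR" "$DEF_FILE"

    # 2. Develop locally: run commands in the sandbox; --writable keeps changes made inside it.
    apptainer exec --writable "$SANDBOX_DIR" python --version

    # 3. Convert the sandbox into a SIF file that can be copied to the HPC systems.
    apptainer build "$SIF_FILE" "$SANDBOX_DIR"
else
    echo "apptainer or $DEF_FILE not found, skipping" >&2
fi
```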