diff --git a/README.md b/README.md
index cee6bcd5a8db6b19849c49d21f09edc3cc222526..41f71ba5657be666e012c4e391baa22952a68505 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,10 @@ Goal: compile definition files of containers for AI use cases.
 Also provide documentation on how to use them on our HPC systems.
 
 
+<!---
+[TODO]: Add an example of distributed training with containers
+-->
+
 ## Getting started
 We use [Apptainer](https://apptainer.org/docs/user/main/index.html) to build/run containers on our HPC systems.
 You will need a Linux system to run Apptainer natively on your machine, and it’s easiest to [install](https://apptainer.org/docs/user/main/quick_start.html) if you have root access.
@@ -27,7 +31,10 @@ $ apptainer pull my_apptainer.sif docker://sylabsio/lolcow:latest
 ```
 
 ### Convert from Docker Daemon or Docker Archive files
-
+<!---
+Piero: I would stress LOCALLY here. Docker is not, and will never be, available on our systems.
+ In any case, this is useful. What about adding it to the "Local-to-HPC Workflow" section at the bottom?
+-->
 You can also [convert images/containers](https://apptainer.org/docs/user/latest/docker_and_oci.html#containers-from-docker-hub) running in your Docker Daemon:
 ```shell
 $ apptainer build my_apptainer.sif docker-daemon:sylabsio/lolcow:latest
@@ -45,9 +52,59 @@ $ apptainer build my_apptainer.sif docker-archive:lolcow.tar
 
 ## Running containers
 
-**TODO:**
-- mention important flags, like `--nv` for example
-- how to run the containers on our SLURM cluster
+> **_NOTE:_** The following code snippets assume that you have loaded an Apptainer image using environment modules (see [TODO](link/to/docs/images/modules)). For example:
+>
+> `module load image_pytorch`
+>
+> Replace `$IMAGE_SIF` with the path to a SIF file or a reference to an image in an OCI registry to use the same commands with your own images.
+
+### Interactive Shell
+
+To run an interactive shell in a container, use:
+```shell
+$ apptainer shell $IMAGE_SIF
+Apptainer>
+```
+The prompt will change from `$` to `Apptainer>`, indicating that you are now running commands inside the container. The `shell` command is useful for interactively inspecting the contents of an image.
+
+### Executing commands
+
+To execute a single program, use the `exec` command:
+
+```shell
+$ apptainer exec $IMAGE_SIF echo "Hello World!"
+Hello World!
+```
+
+For more details, refer to the [Apptainer documentation](https://apptainer.org/docs/user/latest/quick_start.html#interacting-with-images).
+
+### GPU support
+
+Apptainer natively supports running application containers that use NVIDIA’s CUDA or AMD’s ROCm GPU frameworks. To use these accelerators on a GPU compute node, add the `--nv` flag (NVIDIA GPUs) or the `--rocm` flag (AMD GPUs) to the `apptainer` command. For example, on Raven:
+
+```shell
+$ apptainer exec --nv $IMAGE_SIF nvidia-smi --list-gpus
+GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX)
+```
+
+For more details, refer to the Apptainer documentation on [GPU support](https://apptainer.org/docs/user/latest/gpu.html).
+<!---
+[TODO]: How to run containers on our SLURM cluster
+Piero: Personally, I would remove this subsection. Always focus on HPC systems.
+-->
+
+## Example: submitting a multi-node distributed training job with PyTorch Lightning
+<!---
+[TODO]:
+-->
+### Python script
+<!---
+[TODO]: Add the simplest possible training script (Lightning?), or reference existing scripts from elsewhere!
+-->
+### Slurm batch script
+<!---
+[TODO]: Add a multi-node, multi-GPU batch script
+-->
 
 ## Using containers with RVS
 
@@ -107,7 +164,7 @@ For example, in the kernel spec file above we bind your `ptmp` folder.
 **TODO: The sandbox option does not work 100% correctly for VSCode or PyCharm, use docker images instead!
 Need to update this guide!**
 
-A nice workflow to develop a python library locally and deploy it on our HPO systems (sharing exactly the same environment) is to use the [*sandbox* feature](https://apptainer.org/docs/user/main/build_a_container.html#sandbox) of Apptainer.
+A nice workflow to develop a Python library locally and deploy it on our HPC systems (sharing exactly the same environment) is to use the [*sandbox* feature](https://apptainer.org/docs/user/main/build_a_container.html#sandbox) of Apptainer.
 
 We are still investigating if something similar is possible with `Docker` (please let us know if you find a way :) ).
 