# AI Containers

👋 Welcome to the "AI Containers" repository of MPCDF's AI Group.
Here we give you a short introduction to using containers on our HPC systems, and provide example scripts you can use as blueprints for your projects.
 
 
[[_TOC_]]
 
 
# Getting started
 
 
Let's go through a minimal example of how to run a Python script using a container on our [Raven system](link/to/official/docs).
 
 
A fun little experiment is to benchmark a matrix multiplication on the CPU and the GPU.
 
To perform the computation on the GPU, we can use the popular framework [JAX](link).
 
 
So let's start by creating an `example.py` file with the following little Python script:
 
 
```python
"""Content of the example.py script"""
import time

import jax
import numpy as np


def benchmark(m):
    (m @ m).block_until_ready()  # warmup run (also triggers compilation)
    start = time.time()
    # parentheses matter: without them, .block_until_ready() would bind to the
    # second operand and the timing would not wait for the matmul to finish
    (m @ m).block_until_ready()
    return time.time() - start


m = np.random.randn(1000, 1000)
m_cpu = jax.device_put(m, jax.devices("cpu")[0])
m_gpu = jax.device_put(m, jax.devices("gpu")[0])

print("CPU:", benchmark(m_cpu))
print("GPU:", benchmark(m_gpu))
```
 
 
Our goal is to run this script inside a container on a compute node with a GPU.
 
Since the Raven system has [Nvidia GPUs](link/to/mpcdf/raven/docs), we will use Nvidia's official JAX container, where Python and JAX are already installed and which ships with all the necessary software to optimally utilize their GPUs.
 
 
> [!tip]
 
> See the [build section]() below to learn how to customize containers.
 
 
To download (or *pull*) the container from Nvidia's container registry, we first need to load the container system [Apptainer]() via our [module system]().
 
Pulling and creating this container locally can take up to 15 minutes.
 
 
```shell
module load apptainer/1.3.6
apptainer pull docker://nvcr.io/nvidia/jax:24.04-py3
```
 
 
Now you should have a file called `jax_24.04-py3.sif` in your folder.
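You can quickly check that the image is there (it is a single file, typically several gigabytes in size):

```shell
ls -lh jax_24.04-py3.sif
```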
 
 
Let's try to run our little script inside the container on the login node.
 
For this, we will first *shell* inside the container.
 
 
```shell
apptainer shell jax_24.04-py3.sif
```
 
 
Once inside the container (notice the `Apptainer>` prompt in your terminal), we run our script, which throws an error since no GPU is available on the login node:
 
 
```shell
Apptainer> python example.py
...
RuntimeError: Unknown backend: 'gpu' requested, but no platforms that are instances of gpu are present. Platforms present are: cpu
```
 
 
Ok, so let's exit the container with `exit` and send a short [Slurm job]() to the cluster via `srun` requesting a single A100 GPU.
 
Instead of shelling into the container, this time we only *execute* the `python` command inside it.
 
Depending on the current workload of the system, it can take a couple of minutes until the requested compute resources are allocated for our job.
 
 
```shell
srun --time=00:01:00 --gres=gpu:a100:1 --mem=16GB apptainer exec jax_24.04-py3.sif python example.py
```
 
 
You should see the output of the script in your terminal and, to no one's surprise, the GPU performed the computation much faster than the CPU.
 
 
We can also send the Slurm job via `sbatch` and write a little [job script](https://docs.mpcdf.mpg.de/doc/computing/raven-user-guide.html#batch-jobs-using-gpus)
 
 
```bash
#!/bin/bash -l
# content of the 'example.slurm' job script

#SBATCH --gres=gpu:a100:1
#SBATCH --mem=16GB
#SBATCH --time=00:01:00

module purge
module load apptainer/1.3.6

srun apptainer exec jax_24.04-py3.sif python example.py
```
 
 
And submit it to the cluster with `sbatch example.slurm`.
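You can check the state of your job with `squeue`; once it has finished, the script's output ends up in the Slurm output file, by default `slurm-<jobid>.out` in the submission directory:

```shell
squeue -u $USER         # state of your pending and running jobs
cat slurm-<jobid>.out   # output of the script once the job has finished
```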
 
 
That's it, congratulations, you ran your first Python script inside a container using a GPU! 🎉
 
 
 
# Examples and blueprints
 
 
In this repository you will find the following examples and blueprints that you can adapt and use for your projects:
 
 
- [transformers](transformers): ...
 
- [pytorch](nvidia_pytorch): ...
 
- [tensorflow](): ...
 
...
 
 
 
# Official base containers
 
 
You probably want to build your own container with a custom software stack and environment.
We recommend, however, that you always build on top of certain *base containers*, depending on your application and the kind of GPUs you want to use.
 
 
Nvidia and AMD both provide containers for common AI frameworks that come with all the necessary software to optimally use their hardware.
 
Below is a list of links to these containers sorted by vendor and application.
 
 
To use these in your custom container you have to specify their path and tag in the [Apptainer definition file]().
 
For example, to build on top of Nvidia's PyTorch container, your definition file should start like this (see also the [build section]() below):
 
 
```bash
BootStrap: docker
From: nvcr.io/nvidia/pytorch:25.04-py3

...
```
 
> [!warning]
> Most of these base containers are quite large, and it will take some time to download and build them.
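For illustration, a complete minimal definition file based on this container could look as follows; the packages in `%post` are placeholders for your own software stack:

```bash
BootStrap: docker
From: nvcr.io/nvidia/pytorch:25.04-py3

%post
    # install additional Python packages on top of the base container
    pip install --no-cache-dir lightning transformers

%environment
    # keep user site-packages from leaking into the container environment
    export PYTHONNOUSERSITE=1
```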
 
 
 
## Nvidia
 
 
> [!tip]
 
> Use the [release notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) to match Nvidia's container tag to the actual PyTorch version installed in the container.
 
 
- [PyTorch](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
 
- [TensorFlow](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
 
- [JAX](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/jax)
 
 
 
## AMD
 
 
- [PyTorch](https://hub.docker.com/r/rocm/pytorch/tags)
 
- [TensorFlow](https://hub.docker.com/r/rocm/tensorflow/tags)
 
- [JAX](https://hub.docker.com/r/rocm/jax/tags)
 
 
 
# Working with Apptainer
## Getting started
We use [Apptainer](https://apptainer.org/docs/user/main/index.html) to build/run containers on our HPC systems.
You will need a Linux system to run Apptainer natively on your machine, and it’s easiest to [install](https://apptainer.org/docs/user/main/quick_start.html) if you have root access.
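For example, on Ubuntu it can typically be installed from the Apptainer PPA (a sketch; check the install docs for your distribution and the current package source):

```shell
sudo add-apt-repository -y ppa:apptainer/ppa
sudo apt update && sudo apt install -y apptainer
```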
Containers are built via a [definition file](https://apptainer.org/docs/user/latest/definition_files.html).
In each folder of this repo you will find a definition `.def` file and a `README.md` that describes the exact build command.
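A typical build command then looks like this (file names are placeholders):

```shell
apptainer build my_container.sif my_container.def
```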
### Pull from Docker Hub
You can easily [pull containers](https://apptainer.org/docs/user/latest/docker_and_oci.html#containers-from-docker-hub)
from the [Docker Hub](https://hub.docker.com/) or other OCI registries:
```shell
$ apptainer pull my_apptainer.sif docker://sylabsio/lolcow:latest
```
### Convert from Docker Daemon or Docker Archive files
You can also [convert images/containers](https://apptainer.org/docs/user/latest/docker_and_oci.html#containers-from-docker-hub) running in your Docker Daemon:
```shell
# a sketch following the Apptainer docs; adjust the image name and tag
$ apptainer pull docker-daemon:sylabsio/lolcow:latest
```
## Overlays

You can apply overlays with the `run`, `exec`, `shell` and `instance start` commands.
## Jupyter kernels via RVS

The [Remote Visualisation Service (RVS)](https://docs.mpcdf.mpg.de/doc/visualization/index.html) allows you to run Jupyter sessions on the HPC systems.
You can use your container as a kernel within such a session by providing a `kernel.json` spec file.
### 1. Setting up the container
Make sure you install ipython and ipykernel in your container:
```
pip install ipython ipykernel
```
### 2. Setting up RVS
Load the apptainer module when initializing your RVS session.
### 3. Creating the kernel
Create a kernel spec file, for example like this (a sketch: the kernel name, container image, and bind paths are placeholders to adapt to your setup):
```bash
# write the kernel spec into Jupyter's kernel directory
mkdir -p ~/.local/share/jupyter/kernels/my_container
cat > ~/.local/share/jupyter/kernels/my_container/kernel.json << EOF
{
  "argv": [
    "apptainer", "exec", "--bind", "/ptmp",
    "$HOME/my_container.sif",
    "python", "-m", "ipykernel_launcher", "-f", "{connection_file}"
  ],
  "display_name": "my_container",
  "language": "python"
}
EOF
```
The next time you request a Jupyter session, you can choose the generic Jupyter kernel defined by your spec file.
Keep in mind that you are inside the container.
If you want to access files outside your home directory, you have to bind them explicitly in the kernel spec file when calling the apptainer command.
For example, in the kernel spec file above we bind your `ptmp` folder.
## Local-to-HPC Workflow
**TODO: The sandbox option does not work 100% correctly for VSCode or PyCharm, use docker images instead! Need to update this guide!**
A nice workflow to develop a Python library locally and deploy it on our HPC systems (sharing exactly the same environment) is to use the [*sandbox* feature](https://apptainer.org/docs/user/main/build_a_container.html#sandbox) of Apptainer.
We are still investigating if something similar is possible with `Docker` (please let us know if you find a way :) ).
### 1. Create a definition file
In the root directory of your library (repository) create a *definition* `*.def` file.
This definition file should reflect the environment in which you want to develop and use your library.
You can leverage base environments, such as Docker images on Docker Hub, or existing Apptainer containers.
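A minimal sketch of such a definition file (the base image and packages are placeholders; use whatever your library needs):

```bash
BootStrap: docker
From: python:3.12-slim

%post
    # tooling needed to develop, test, and build the library
    pip install --no-cache-dir pytest build
```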
### 2. Build the sandbox
Build the sandbox (container in a directory) instead of the default SIF format:
```shell
apptainer build --fakeroot --sandbox my_container my_container.def
```
### 3. Install your library in the sandbox
Now we can add the library we are developing to the sandbox environment and install it in [`editable`](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) mode:
```shell
apptainer exec --writable my_container python -m pip install -e .
```
### 4. Point your IDE's interpreter to the sandbox
You should be able to point the interpreter of your IDE (VS Code, PyCharm, etc.) to the Python executable inside the sandbox folder.
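Since the sandbox is a plain directory tree, the interpreter is just a file inside it, for example (the exact path depends on your base image):

```shell
ls my_container/usr/bin/python3   # point your IDE to this executable
```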
### 5. Add your developed library to the my_container.def file
While in principle you could build a SIF container directly from your sandbox, it is better to modify your *definition* `*.def` file to include your library/package.
In this way, your container is fully reproducible using only the definition file.
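One way to do this, assuming your package lives in the repository root, is to copy the sources into the container in `%files` and install them in `%post` (paths are placeholders):

```bash
%files
    # copy the repository root into the container at build time
    . /opt/my_library

%post
    pip install /opt/my_library
```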
### 6. Build your `.sif` container and deploy it on our HPC systems
Once you have built the SIF container, you can copy it to our HPC systems and use it there.
```shell
apptainer build --fakeroot my_container.sif my_container.def
```