# AI Containers

👋 Welcome to the "AI Containers" repository of MPCDF's AI Group.
Here we give you a short introduction to using containers on our HPC systems, and provide example scripts you can use as blueprints for your projects.
 
 
[[_TOC_]]
 
 
# Getting started
 
 
Let's go through a minimal example of how to run a Python script using a container on our [Raven system](link/to/official/docs).
 
 
A fun little experiment is to benchmark a matrix multiplication on the CPU and the GPU.
 
To perform the computation on the GPU, we can use the popular framework [JAX](link).
 
 
So let's start by creating an `example.py` file with the following little Python script:
 
 
```python
"""Content of the example.py script"""
import time

import jax
import numpy as np


def benchmark(m):
    (m @ m).block_until_ready()  # warmup run (also triggers compilation)
    start = time.time()
    # parentheses matter: without them, .block_until_ready() would bind to the
    # second operand and the timing would not wait for the matmul to finish
    (m @ m).block_until_ready()
    return time.time() - start


m = np.random.randn(1000, 1000)
m_cpu = jax.device_put(m, jax.devices("cpu")[0])
m_gpu = jax.device_put(m, jax.devices("gpu")[0])

print("CPU:", benchmark(m_cpu))
print("GPU:", benchmark(m_gpu))
```
 
 
Our goal is to run this script inside a container on a compute node with a GPU.
 
Since the Raven system has [Nvidia GPUs](link/to/mpcdf/raven/docs), we will use Nvidia's official JAX container, where Python and JAX are already installed and which ships with all the necessary software to optimally utilize their GPUs.
 
 
> [!tip]
 
> See the [build section]() below to learn how to customize containers.
 
 
To download (or *pull*) the container from Nvidia's container registry, we first need to load the container system [Apptainer]() via our [module system]().
 
Pulling and creating this container locally can take up to 15 minutes.
 
 
```shell
module load apptainer/1.3.6
apptainer pull docker://nvcr.io/nvidia/jax:24.04-py3
```
 
 
Now you should have a file called `jax_24.04-py3.sif` in your folder.
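You can quickly check that the image is there (it is a single file, typically several gigabytes in size):

```shell
ls -lh jax_24.04-py3.sif
```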
 
 
Let's try to run our little script inside the container on the login node.
 
For this, we will first *shell* inside the container.
 
 
```shell
apptainer shell jax_24.04-py3.sif
```
 
 
Once inside the container (notice the `Apptainer>` prompt in your terminal), we run our script, which throws an error since no GPU is available on the login node:
 
 
```shell
Apptainer> python example.py
...
RuntimeError: Unknown backend: 'gpu' requested, but no platforms that are instances of gpu are present. Platforms present are: cpu
```
 
 
Ok, so let's exit the container with `exit` and send a short [Slurm job]() to the cluster via `srun` requesting a single A100 GPU.
 
Instead of shelling into the container, this time we only *execute* the `python` command inside it.
 
Depending on the current workload of the system, it can take a couple of minutes until the requested compute resources are allocated for our job.
 
 
```shell
srun --time=00:01:00 --gres=gpu:a100:1 --mem=16GB apptainer exec jax_24.04-py3.sif python example.py
```
 
 
You should see the output of the script in your terminal and, to no one's surprise, the GPU performed the computation much faster than the CPU.
 
 
We can also send the Slurm job via `sbatch` and write a little [job script](https://docs.mpcdf.mpg.de/doc/computing/raven-user-guide.html#batch-jobs-using-gpus)
 
 
```bash
#!/bin/bash -l
# content of the 'example.slurm' job script

#SBATCH --gres=gpu:a100:1
#SBATCH --mem=16GB
#SBATCH --time=00:01:00

module purge
module load apptainer/1.3.6

srun apptainer exec jax_24.04-py3.sif python example.py
```
 
 
And submit it to the cluster with `sbatch example.slurm`.
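You can check the state of your job with `squeue`; once it has finished, the script's output ends up in the Slurm output file, by default `slurm-<jobid>.out` in the submission directory:

```shell
squeue -u $USER         # state of your pending and running jobs
cat slurm-<jobid>.out   # output of the script once the job has finished
```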
 
 
That's it, congratulations, you ran your first Python script inside a container using a GPU! 🎉
 
 
 
# Examples and blueprints
 
 
In this repository you will find the following examples and blueprints that you can adapt and use for your projects:
 
 
- [transformers](transformers): ...
 
- [pytorch](nvidia_pytorch): ...
 
- [tensorflow](): ...
 
...
 
 
 
# Official base containers
 
 
You probably want to build your own container with a custom software stack and environment.
We recommend, however, that you always build on top of certain *base containers*, depending on your application and the kind of GPUs you want to use.
 
 
Nvidia and AMD both provide containers for common AI frameworks that come with all the necessary software to optimally use their hardware.
 
Below is a list of links to these containers sorted by vendor and application.
 
 
To use these in your custom container you have to specify their path and tag in the [Apptainer definition file]().
 
For example, to build on top of Nvidia's PyTorch container, your definition file should start like this (see also the [build section]() below):
 
 
```bash
BootStrap: docker
From: nvcr.io/nvidia/pytorch:25.04-py3

...
```
 
> [!warning]
> Most of these base containers are quite large, and it will take some time to download and build them.
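For illustration, a complete minimal definition file based on this container could look as follows; the packages in `%post` are placeholders for your own software stack:

```bash
BootStrap: docker
From: nvcr.io/nvidia/pytorch:25.04-py3

%post
    # install additional Python packages on top of the base container
    pip install --no-cache-dir lightning transformers

%environment
    # keep user site-packages from leaking into the container environment
    export PYTHONNOUSERSITE=1
```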
 
 
 
## Nvidia
 
 
> [!tip]
 
> Use the [release notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) to match Nvidia's container tag to the actual PyTorch version installed in the container.
 
 
- [PyTorch](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
 
- [TensorFlow](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
 
- [JAX](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/jax)
 
 
 
## AMD
 
 
- [PyTorch](https://hub.docker.com/r/rocm/pytorch/tags)
 
- [TensorFlow](https://hub.docker.com/r/rocm/tensorflow/tags)
 
- [JAX](https://hub.docker.com/r/rocm/jax/tags)
 
 
 
# Working with Apptainer
## Getting started
We use [Apptainer](https://apptainer.org/docs/user/main/index.html) to build/run containers on our HPC systems.
You will need a Linux system to run Apptainer natively on your machine, and it’s easiest to [install](https://apptainer.org/docs/user/main/quick_start.html) if you have root access.
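For example, on Ubuntu it can typically be installed from the Apptainer PPA (a sketch; check the install docs for your distribution and the current package source):

```shell
sudo add-apt-repository -y ppa:apptainer/ppa
sudo apt update && sudo apt install -y apptainer
```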
Containers are built via a [definition file](https://apptainer.org/docs/user/latest/definition_files.html).
In each folder of this repo you will find a definition `.def` file and a `README.md` that describes the exact build command.
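A typical build command then looks like this (file names are placeholders):

```shell
apptainer build my_container.sif my_container.def
```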
### Pull from Docker Hub
You can easily [pull containers](https://apptainer.org/docs/user/latest/docker_and_oci.html#containers-from-docker-hub)
from the [Docker Hub](https://hub.docker.com/) or other OCI registries:
```shell
$ apptainer pull my_apptainer.sif docker://sylabsio/lolcow:latest
```
### Convert from Docker Daemon or Docker Archive files
You can also [convert images/containers](https://apptainer.org/docs/user/latest/docker_and_oci.html#containers-from-docker-hub) running in your Docker Daemon:
```shell
# a sketch following the Apptainer docs; adjust the image name and tag
$ apptainer pull docker-daemon:sylabsio/lolcow:latest
```
## Overlays

You can apply overlays with the `run`, `exec`, `shell` and `instance start` commands.
## Jupyter kernels via RVS

The [Remote Visualisation Service (RVS)](https://docs.mpcdf.mpg.de/doc/visualization/index.html) allows you to run Jupyter sessions on the HPC systems.
You can use your container as a kernel within such a session by providing a `kernel.json` spec file.
### 1. Setting up the container
Make sure you install ipython and ipykernel in your container:
```
pip install ipython ipykernel
```
### 2. Setting up RVS
Load the apptainer module when initializing your RVS session.
### 3. Creating the kernel
Create a kernel spec file, for example like this (a sketch: the kernel name, container image, and bind paths are placeholders to adapt to your setup):
```bash
# write the kernel spec into Jupyter's kernel directory
mkdir -p ~/.local/share/jupyter/kernels/my_container
cat > ~/.local/share/jupyter/kernels/my_container/kernel.json << EOF
{
  "argv": [
    "apptainer", "exec", "--bind", "/ptmp",
    "$HOME/my_container.sif",
    "python", "-m", "ipykernel_launcher", "-f", "{connection_file}"
  ],
  "display_name": "my_container",
  "language": "python"
}
EOF
```
The next time you request a Jupyter session, you can choose the generic Jupyter kernel defined by your spec file.
Keep in mind that you are inside the container.
If you want to access files outside your home directory, you have to bind them explicitly in the kernel spec file when calling the apptainer command.
For example, in the kernel spec file above we bind your `ptmp` folder.
## Local-to-HPC Workflow
**TODO: The sandbox option does not work 100% correctly for VSCode or PyCharm, use docker images instead! Need to update this guide!**
A nice workflow to develop a Python library locally and deploy it on our HPC systems (sharing exactly the same environment) is to use the [*sandbox* feature](https://apptainer.org/docs/user/main/build_a_container.html#sandbox) of Apptainer.
We are still investigating if something similar is possible with `Docker` (please let us know if you find a way :) ).
### 1. Create a definition file
In the root directory of your library (repository) create a *definition* `*.def` file.
This definition file should reflect the environment in which you want to develop and use your library.
You can leverage base environments, such as Docker images on Docker Hub, or existing Apptainer containers.
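A minimal sketch of such a definition file (the base image and packages are placeholders; use whatever your library needs):

```bash
BootStrap: docker
From: python:3.12-slim

%post
    # tooling needed to develop, test, and build the library
    pip install --no-cache-dir pytest build
```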
### 2. Build the sandbox
Build the sandbox (container in a directory) instead of the default SIF format:
```shell
apptainer build --fakeroot --sandbox my_container my_container.def
```
### 3. Install your library in the sandbox
Now we can add the library we are developing to the sandbox environment and install it in [`editable`](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) mode:
```shell
apptainer exec --writable my_container python -m pip install -e .
```
### 4. Point your IDE's interpreter to the sandbox
You should be able to point the interpreter of your IDE (VS Code, PyCharm, etc.) to the Python executable inside the sandbox folder.
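Since the sandbox is a plain directory tree, the interpreter is just a file inside it, for example (the exact path depends on your base image):

```shell
ls my_container/usr/bin/python3   # point your IDE to this executable
```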
### 5. Add your developed library to the my_container.def file
While in principle you could build a SIF container directly from your sandbox, it is better to modify your *definition* `*.def` file to include your library/package.
In this way, your container is fully reproducible using only the definition file.
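One way to do this, assuming your package lives in the repository root, is to copy the sources into the container in `%files` and install them in `%post` (paths are placeholders):

```bash
%files
    # copy the repository root into the container at build time
    . /opt/my_library

%post
    pip install /opt/my_library
```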
### 6. Build your `.sif` container and deploy it on our HPC systems
Once you have built the SIF container, you can copy it to our HPC systems and use it there.
```shell
apptainer build --fakeroot my_container.sif my_container.def
```