    AI Containers

    👋 Welcome to the "AI Containers" repository of MPCDF's AI Group. This guide provides a brief introduction to using containers on our HPC systems, along with example scripts to serve as blueprints for your projects.

    Getting started

    Let's go through a minimal example of running a Python script using a container on our Raven system.

    A fun little experiment is to benchmark matrix multiplication on both the CPU and the GPU. To perform the computation on the GPU, we use the popular framework JAX.

So let's start by creating the example.py file with the following Python script:

    """Content of the example.py script"""
    import numpy as np
    import jax
    import time
    
def benchmark(m):
    (m @ m).block_until_ready()  # warmup: triggers compilation before timing
    start = time.time()
    (m @ m).block_until_ready()  # block until the result is ready, then stop the clock
    return time.time() - start
    
    m = np.random.randn(1000, 1000)
    m_cpu = jax.device_put(m, jax.devices("cpu")[0])
    m_gpu = jax.device_put(m, jax.devices("gpu")[0])
    print("CPU:", benchmark(m_cpu))
    print("GPU:", benchmark(m_gpu))

    Our goal is to run this script inside a container on a GPU-equipped compute node. Since the Raven system uses NVIDIA GPUs, we'll use NVIDIA's official JAX container, which includes Python and JAX, and ships with all the necessary software to optimally utilize their GPUs.

    Tip

    See the build section below to learn how to customize containers.

    To download (or pull) the container, first load Apptainer via our module system. Pulling and creating this container locally can take up to 15 minutes.

    module load apptainer/1.3.6
    apptainer pull docker://nvcr.io/nvidia/jax:24.04-py3

    You should now have a file called jax_24.04-py3.sif in your directory.
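You can take a quick look at the image's metadata with apptainer inspect:

apptainer inspect jax_24.04-py3.sif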

Let's try to run our little script inside the container on the login node. For this, we'll open a shell inside the container.

    apptainer shell jax_24.04-py3.sif

    Once inside the container (indicated by the Apptainer> prompt), run the script. It will throw an error due to the absence of a GPU:

    Apptainer> python example.py
    ...
    RuntimeError: Unknown backend: 'gpu' requested, but no platforms that are instances of gpu are present. Platforms present are: cpu

Ok, so let's exit the container with exit and submit a short Slurm job to the cluster via srun, requesting a single A100 GPU. This time, instead of shelling into the container, we execute the python command inside it directly. Depending on the current system workload, it can take a couple of minutes until the requested compute resources are allocated for our job.

    srun --time=00:01:00 --gres=gpu:a100:1 --mem=16GB apptainer exec jax_24.04-py3.sif python example.py

You should see the output of the script in your terminal and, to no one's surprise, the GPU outperforms the CPU.

    Alternatively, we can also send the Slurm job via sbatch using a job script:

#!/bin/bash -l
# content of the 'example.slurm' job script

#SBATCH --gres=gpu:a100:1
#SBATCH --mem=16GB
#SBATCH --time=00:01:00
    
    module purge
    module load apptainer/1.3.6
    
    srun apptainer exec jax_24.04-py3.sif python example.py

    Submit it with sbatch example.slurm
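After submitting, you can watch the job's state with squeue and, once it has finished, find the script's output in the Slurm output file (named slurm-<jobid>.out by default):

squeue --me                # check the state of your jobs
cat slurm-<jobid>.out      # the script's output, once the job has run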

    That's it. Congratulations, you ran your first Python script inside a container using a GPU! 🎉

    Examples and blueprints

    This repository includes the following examples and blueprints for adaptation in your projects:

• transformers: Use Hugging Face's transformers library with PyTorch to run a multi-node fine-tuning of an LLM.
• pytorch: Distribute the training of a ResNet50 model on synthetic data with PyTorch and report the performance metrics.
• tensorflow: Train a ResNet50 model on synthetic data with TensorFlow and report the performance metrics, covering both single-GPU (undistributed) and multi-GPU/multi-node (distributed) setups.
• accelerate: Apply minimal changes to the PyTorch example to use Accelerate for distributed training of a ResNet50 model on synthetic data and report the performance metrics.
• lightning: Train a ResNet50 model on synthetic data with Lightning and report the performance metrics.

    Official base containers

    To build your own container with a custom software stack, we recommend starting with specific base containers based on your application and the GPUs you plan to use.

    NVIDIA and AMD provide containers for common AI frameworks, optimized for their hardware. Below are links to these containers, sorted by vendor and application.

    To use these in your custom container, specify their path and tag in the Apptainer definition file. For example, to build on NVIDIA's PyTorch container, start your definition file like this (see also the build section below)

    BootStrap: docker
    From: nvcr.io/nvidia/pytorch:25.04-py3
    
    ...

    Warning

    Most of these base containers are large and may take time to download and build.

    NVIDIA

    Tip

    Use the release notes to match NVIDIA's container tag to the actual PyTorch version installed in the container.

    AMD

    Working with Apptainer

    We use the container platform Apptainer to build and run containers on our HPC systems. Unlike Docker, Apptainer was built specifically for HPC cluster environments with a fitting security model and a focus on reproducibility.

    On our HPC systems you can load Apptainer via our module system:

    find-module apptainer
    module load apptainer/1.4.1

To run Apptainer natively on your machine, you will need a Linux system. Installation is straightforward if you have root access; see the official quick start guide for details.

Apptainer aims for maximum compatibility with Docker, making it easy to convert and use Docker images.

Apptainer's documentation is extensive and well structured, and we highly recommend it. This section is merely intended as a cheat sheet for common commands and flags you will likely use when working with AI workloads at the MPCDF.

    For a nice introduction to Apptainer on our HPC systems, have a look at the awesome presentation by Michele. You can also browse our technical documentation.

    Pull containers from Docker Hub

    You can easily pull containers from the Docker Hub or other OCI registries, for example:

    $ apptainer pull docker://sylabsio/lolcow:latest
    $ apptainer run lolcow_latest.sif

    Convert containers from Docker Daemon or Docker Archive

    You can also convert images/containers running in your Docker Daemon:

    $ apptainer build my_apptainer.sif docker-daemon:sylabsio/lolcow:latest

    Or convert a Docker Archive file:

    $ sudo docker images
    REPOSITORY                        TAG               IMAGE ID       CREATED          SIZE
    sylabsio/lolcow                   latest            5a15b484bc65   2 hours ago      188MB
    
    $ docker save 5a15b484bc65 -o lolcow.tar
    $ apptainer build my_apptainer.sif docker-archive:lolcow.tar

    Tip

Check out the transformers example for a local-to-HPC workflow with containers.

    Build containers

    Containers are built via a definition file and the apptainer build command.

    In the definition file you usually start with a base container and build your environment on top of it in the post section.

    The following simple example defines a container in which WandB is installed on top of NVIDIA's JAX container.

    # example.def
    BootStrap: docker
    From: nvcr.io/nvidia/jax:24.04-py3
    
    %post
    pip install wandb

    Now you can use the build command to create a custom container which we will call example.sif:

    apptainer build example.sif example.def
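To check that the build succeeded, you can run a quick import test inside the new container:

apptainer exec example.sif python -c "import wandb; print(wandb.__version__)"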

    Tip

    Use the --verbose flag to see a progress bar when creating the SIF file (Apptainer version >1.4.1): apptainer -v build ...

    Tip

If you encounter the error "... not found (required by /.singularity.d/libs/faked) fakeroot: error while starting the 'faked' daemon.", you can try running the build with the --ignore-fakeroot-command flag: apptainer build --ignore-fakeroot-command ...

    Have a look at the examples and blueprints in this repository for more advanced examples of definition files.

    For a deep dive into definition files and building containers we highly recommend the official Apptainer documentation.

    Running containers

    There are basically three commands you can use to run something inside a container:

    • apptainer shell: Opens a shell within the container. This is the most interactive option and allows you to explore the environment of the container, try out bind mounts and run tests.
• apptainer exec: Executes a specific command inside the container, e.g. apptainer exec my_container.sif python my_script.py
    • apptainer run: Runs the commands defined in the definition file of the container.
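A minimal session illustrating all three (my_container.sif is a placeholder for your container):

apptainer shell my_container.sif                   # explore the container interactively
apptainer exec my_container.sif python --version   # run a single command and exit
apptainer run my_container.sif                     # execute the container's runscript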

    Note

When converting a Docker image with apptainer pull, the run command executes the CMD instruction of the Dockerfile.

    Read and write data on the host system

    By default, Apptainer bind mounts your $HOME, $PWD and several other system directories when running a container via shell, exec or run. If you want to bind mount additional paths you can use the -B option:

    apptainer shell -B /ptmp/$USER/my_data:/mnt/my_data my_container.sif
    Apptainer> ls /mnt/my_data
    Apptainer> touch /mnt/my_data/mock
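You can bind several paths at once by passing a comma-separated list to -B (the paths below are placeholders):

apptainer exec -B /ptmp/$USER/my_data:/mnt/my_data,/ptmp/$USER/results:/mnt/results my_container.sif ls /mnt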

    Environment variables

    By default, all env variables of the host are exposed inside the container.

    Note

    An env variable set on the host will be overwritten by a variable of the same name set inside the container.

    If you want to set additional env variables you can use the --env flag when running the container via shell, exec or run.

    apptainer shell --env MYVAR=Hello my_container.sif
    Apptainer> echo $MYVAR
    Hello
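The flag can also be repeated to set several variables at once (a small sketch; the variable names are arbitrary):

apptainer exec --env GREETING=Hello --env TARGET=World my_container.sif sh -c 'echo $GREETING $TARGET'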

    Docker-like behavior

    By default, Apptainer exposes much more of the host system within a container than Docker:

    • it automatically bind mounts your $HOME, $PWD, /tmp etc. directories
    • it exposes most of the host devices in /dev
    • it includes all the environment variables of your host system

    If you aim for a more Docker-like behavior when running SIF containers, you can use the --compat flag.

    GPU support

    Apptainer natively supports NVIDIA and AMD GPUs.

    There are two important options you need to be aware of when running containers with GPU support. However, since Apptainer by default exposes much more of the host system, they may not be necessary.

    Important

In general, we recommend NOT SETTING the following options, especially if you use one of the official base containers. Only if you encounter issues with GPU support should you try setting them.

    --nv

    This option is for NVIDIA GPUs and ensures that the /dev/nvidiaX device entries are available inside the container, and locates and binds the basic CUDA libraries from the host into the container.

    --rocm

    This option is for AMD GPUs and ensures that the /dev/dri device entries are available inside the container, and locates and binds the basic ROCm libraries from the host into the container.

    Warning

    When using the --compat flag, you must use these options for GPU support.
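For example, to run the getting-started JAX script with Docker-like isolation on an NVIDIA node:

apptainer exec --compat --nv jax_24.04-py3.sif python example.py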

    Patching containers

    By default, your SIF container is read-only. If you want to make small temporary changes to try things out, you can use the --writable-tmpfs flag when shelling into it.

    apptainer shell --writable-tmpfs my_container.sif
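Changes made this way live in memory only and are discarded when you exit the shell, which makes this handy for quick tests (the package here is just an example):

Apptainer> pip install rich   # available for this session only; gone after exit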

    If you want to permanently modify files in your container, you can use persistent "overlays". These are writable file system images that sit on top of your immutable SIF container.

Here, we provide a quick overview of how to use overlays. Please have a look at the dedicated section of the official Apptainer documentation for more details on using overlays or making your SIF container writable and persisting changes.

    Warning

    Using overlays only works on the Viper system!

    Create an overlay

    $ apptainer overlay create --size 1024 ./overlay.img

    Apply an overlay

    $ apptainer shell --overlay overlay.img my_container.sif

    Now you can modify files inside the container and the modifications will be stored in the overlay.img file.

    You can apply overlays with the run, exec, shell and instance start commands.

    Developing with containers

In this section we provide convenient workflows for developing your code with containers and running your AI workloads on our HPC systems.

    Local-to-HPC workflow

Our general recommendation when developing your code is to stay on your local machine as long as possible. It's normally easier to modify your environment and iterate on your code locally. For AI workloads this often means that your code should include a CPU/GPU switch.
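Such a switch can be as simple as asking the framework which devices are available. A minimal sketch with JAX (jax.devices raises a RuntimeError when the requested backend is absent, as seen in the getting-started example):

import jax

try:
    device = jax.devices("gpu")[0]   # use the first GPU if one is present
except RuntimeError:
    device = jax.devices("cpu")[0]   # otherwise fall back to the CPU
print("Running on:", device)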

    Once you need to move to our HPC systems, replicate your local environment in a container. Write a definition file that starts with the right base container and installs all your required dependencies in the post section. You can build the container on our HPC systems, or locally if you have Apptainer available on your machine.
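Such a definition file might look like this (a sketch; the base image and the requirements.txt file are placeholders for your own choices):

# my_project.def
BootStrap: docker
From: nvcr.io/nvidia/pytorch:25.04-py3

%files
    requirements.txt /opt/requirements.txt

%post
    pip install -r /opt/requirements.txt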

Since some IDEs also have good Docker support, you could start with a dockerized environment right from the start and convert the Docker container to an Apptainer container when moving to our HPC systems. Check out the transformers example for more detailed steps.

    Developing a Python package

The easiest way to make your Python package available within the container is to bind mount its directory and modify the PYTHONPATH environment variable. So let's assume you are developing a Python package my_package and you have a container my_container.sif that includes all of your package's dependencies. To run a script that uses your package, you can do:

    apptainer exec -B /path/to/my_package:/tmp/my_package --env PYTHONPATH=/tmp/my_package my_container.sif python my_script.py

When the development of your package is done, you can ship it with the container by copying and installing it during the build.
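A definition file for this could copy the package in a %files section and install it in %post (a sketch; the paths are placeholders):

# my_package.def
BootStrap: docker
From: nvcr.io/nvidia/jax:24.04-py3

%files
    /path/to/my_package /opt/my_package

%post
    pip install /opt/my_package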

    Using containers with RVS

    The Remote Visualisation Service (RVS) allows you to run Jupyter sessions on our HPC systems. You can use your container as a kernel within such a session by providing a kernel.json spec file.

    1. Setting up the container

    Make sure you install ipykernel in your container:

    pip install ipykernel

    2. Setting up RVS

Load an Apptainer module when initializing your RVS session.

    If you prefer to use the CLI, you can just append the load command to the $HOME/.jupyter/modules.conf file. From a login node execute:

    echo "module load apptainer/1.4.1" >> $HOME/.jupyter/modules.conf

    3. Creating the kernel

    Create a new directory in $HOME/.local/share/jupyter/kernels/:

    mkdir $HOME/.local/share/jupyter/kernels/my-container

The name of the directory is only used internally by Jupyter; you can name it whatever you want.

Add a kernel spec file to it:

    vim ~/.local/share/jupyter/kernels/my-container/kernel.json

that should look something like this:

    {
     "argv": [
      "apptainer",
      "exec",
      "--nv",
      "--bind",
      "{connection_file}:/tmp/connection_spec,/ptmp/<your user name>",
      "/absolute/path/to/your/container.sif",
      "python",
      "-m",
      "ipykernel_launcher",
      "-f",
      "/tmp/connection_spec"
     ],
     "display_name": "Name of your kernel",
     "language": "python",
     "metadata": {
      "debugger": true
     },
     "env": {
      "PYTHONPATH": "/add/custom/packages/here/if/you/want"
     }
    }

The next time you request a Jupyter session, you can choose the generic Jupyter version and use your custom kernel. Keep in mind that you are inside the container: if you want to access files outside your home directory, you have to bind them explicitly in the kernel spec file when calling the apptainer command. For example, the kernel spec file above binds your ptmp folder.