diff --git a/README.md b/README.md
index c10f8587f22cee93e5a7bfa674156e9edea931f9..552b4b7654e522d4b7168c21adc7c4d85f4f1440 100644
--- a/README.md
+++ b/README.md
@@ -7,12 +7,12 @@

# Getting started

-Let's go through a minimal example of how to run a Python script using a container on our [Raven system](link/to/official/docs).
+Let's go through a minimal example of running a Python script using a container on our [Raven system](link/to/official/docs).

-A fun little experiment is to benchmark a matrix multiplication on the CPU and the GPU.
-To perform the computation on the GPU, we can use the popular framework [JAX](link).
+A fun little experiment is to benchmark matrix multiplication on both the CPU and the GPU.
+To perform the computation on the GPU, we use the popular framework [JAX](link).

-So let's start by creating the `example.py` file and write following little Python script
+So let's start by creating the `example.py` file with the following Python script:

```python
"""Content of the example.py script"""
@@ -33,13 +33,13 @@ print("CPU:", benchmark(m_cpu))
print("GPU:", benchmark(m_gpu))
```

-Our goal is to run this script inside a container on a compute node with a GPU.
-Since the Raven system has [Nvidia GPUs](link/to/mpcdf/raven/docs), we will use Nvidia's official JAX container where Python and JAX is already installed, and that ships with all the necessary software to optimally utilize their GPUs.
+Our goal is to run this script inside a container on a GPU-equipped compute node.
+Since the Raven system uses [Nvidia GPUs](link/to/mpcdf/raven/docs), we'll use Nvidia's official JAX container, which includes Python and JAX and ships with all the software needed to optimally utilize their GPUs.

> [!tip]
> See the [build section]() below to learn how to customize containers.

-To download (or *pull*) the container from their website, we first need to load the container system [Apptainer]() via our [module system]().
+To download (or *pull*) the container, first load [Apptainer]() via our [module system]().
Pulling and creating this container locally can take up to 15 minutes.

```shell
@@ -47,16 +47,16 @@ module load apptainer/1.3.6
apptainer pull docker://nvcr.io/nvidia/jax:24.04-py3
```

-Now you should have a file called `jax_24.04-py3.sif` in your folder.
+You should now have a file called `jax_24.04-py3.sif` in your directory.

Let's try to run our little script inside the container on the login node.
-For this, we will first *shell* inside the container.
+For this, we'll first *shell* into the container.

```shell
apptainer shell jax_24.04-py3.sif
```

-Once inside the container (notice the `Apptainer>` prompt in your terminal), we will run our script which will throw an error since we do not have a GPU available
+Once inside the container (indicated by the `Apptainer>` prompt), run the script. It will throw an error because no GPU is available:

```shell
Apptainer> python example.py
@@ -64,17 +64,17 @@ Apptainer> python example.py
RuntimeError: Unknown backend: 'gpu' requested, but no platforms that are instances of gpu are present. Platforms present are: cpu
```

-Ok, so let's exit the container with `exit` and send a short [Slurm job]() to the cluster via `srun` requesting a single A100 GPU.
-Instead of shelling into the container we only *execute* the `python` command inside the container.
-Depending on the current workload of the system, it can take a couple of minutes until the requested compute resources are allocated for our job. 
+Ok, so let's exit the container with `exit` and send a short [Slurm job]() to the cluster via `srun`, requesting a single A100 GPU.
+Instead of shelling into the container, *execute* the `python` command inside it.
+Depending on the current system workload, it can take a couple of minutes until the requested compute resources are allocated for our job.

```shell
srun --time=00:01:00 --gres=gpu:a100:1 --mem=16GB apptainer exec jax_24.04-py3.sif python example.py
```

-You should see the output of the script in your terminal and, to everyone's surprise, the GPU performed the computation much faster than the CPU.
+You should see the output of the script in your terminal and, to everyone's surprise, the GPU outperforms the CPU.

-We can also send the Slurm job via `sbatch` and write a little [job script](https://docs.mpcdf.mpg.de/doc/computing/raven-user-guide.html#batch-jobs-using-gpus)
+Alternatively, we can submit the Slurm job via `sbatch` using a [job script](https://docs.mpcdf.mpg.de/doc/computing/raven-user-guide.html#batch-jobs-using-gpus):

```bash
# content of the 'example.slurm' job script
@@ -88,14 +88,15 @@ module load apptainer/1.3.6
srun apptainer exec jax_24.04-py3.sif python example.py
```

-And submit it to the cluster with `sbatch example.slurm`
+Submit it with `sbatch example.slurm`.

-That's it, congratulations, you ran your first Python script inside a container using a GPU! 🎉
+That's it.
+Congratulations, you ran your first Python script inside a container using a GPU! 🎉


# Examples and blueprints

-In this repository you find following examples or blueprints that you can adapt and use for your projects:
+This repository includes the following examples and blueprints that you can adapt for your own projects:

- [transformers](transformers): ...
- [pytorch](nvidia_pytorch): ...
@@ -105,14 +106,13 @@

# Official base containers

-You probably want to build your own container with your own custom software stack and environment in it.
-We recommend, however, that you alway build on top of certain *base containers* depending on your application and the kind of GPUs you want to use.
+To build your own container with a custom software stack, we recommend starting from a *base container* that matches your application and the GPUs you plan to use.

-Nvidia and AMD both provide containers for common AI frameworks that come with all the necessary software to optimally use their hardware.
-Below is a list of links to these containers sorted by vendor and application.
+Nvidia and AMD provide containers for common AI frameworks, optimized for their hardware.
+Below are links to these containers, sorted by vendor and application.

-To use these in your custom container you have to specify their path and tag in the [Apptainer definition file]().
-For example, to build on top of Nvidia's PyTorch container your definition file should start like this (see also the [build section]() below)
+To use these in your custom container, specify their path and tag in the [Apptainer definition file]().
+For example, to build on Nvidia's PyTorch container, start your definition file like this (see also the [build section]() below):

```bash
BootStrap: docker
@@ -121,7 +121,7 @@ From: nvcr.io/nvidia/pytorch:25.04-py3
...
```

> [!warning]
-> Most of the these base containers are quite large and it will take some time to download and build them.
+> Most of these base containers are large and may take time to download and build.

## Nvidia
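
---

A note on `example.py`: the diff above shows only the script's docstring and its two final `print` calls; the body is elided by the hunk context and is not part of this change. For reference, a minimal sketch consistent with those fragments might look like the following. The `benchmark` helper, the matrix size, the iteration count, and the use of `jax.device_put` are illustrative assumptions, not the file's actual contents.

```python
"""Content of the example.py script"""
# Hypothetical sketch: only the docstring above and the two print lines
# below appear in the diff; everything in between is an assumption.
import time

import jax


def benchmark(m, n_iter=10):
    """Average wall-clock seconds to multiply m with itself, over n_iter runs."""
    (m @ m).block_until_ready()  # warm-up run triggers JIT compilation
    start = time.perf_counter()
    for _ in range(n_iter):
        result = m @ m
    result.block_until_ready()  # JAX dispatches asynchronously, so wait
    return (time.perf_counter() - start) / n_iter


m = jax.random.normal(jax.random.PRNGKey(0), (4096, 4096))

# Place a copy of the matrix on each backend; JAX runs each computation
# on the device that holds its operands.
m_cpu = jax.device_put(m, jax.devices("cpu")[0])
m_gpu = jax.device_put(m, jax.devices("gpu")[0])

print("CPU:", benchmark(m_cpu))
print("GPU:", benchmark(m_gpu))
```

On a node without a GPU, `jax.devices("gpu")` raises a `RuntimeError` of the form quoted in the README ("Unknown backend: 'gpu' requested..."), which is consistent with the login-node error the guide walks through.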