u, s, vh = sparse.linalg.svds(A, 2)  # 2 largest singular values
```
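%% Cell type:markdown id: tags:
The `svds` call above assumes an existing sparse matrix `A`. A self-contained sketch (the matrix contents below are made up for illustration) that also cross-checks the result against a dense SVD:
%% Cell type:code id: tags:
``` python
import numpy as np
from scipy import sparse
import scipy.sparse.linalg  # makes sparse.linalg available

# hypothetical small sparse matrix, for illustration only
A = sparse.random(50, 30, density=0.2, random_state=0, format="csr")

# iteratively compute the 2 largest singular values (returned in ascending order)
u, s, vt = sparse.linalg.svds(A, k=2)

# cross-check against the dense SVD of the same matrix (descending order)
s_dense = np.linalg.svd(A.toarray(), compute_uv=False)
print(np.allclose(sorted(s), sorted(s_dense[:2])))
```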
%% Cell type:markdown id: tags:
## SciPy Example 5: Interpolation
* Interpolation of 1D and multi-D data (structured grid and unstructured points)
* Splines and other polynomials
%% Cell type:code id: tags:
``` python
import scipy.interpolate as interpolate

x = np.linspace(0, 10.0, 12)
y = np.sin(x)
# piecewise-linear vs. shape-preserving cubic (PCHIP) interpolation
y_int1 = interpolate.interp1d(x, y)
y_int2 = interpolate.PchipInterpolator(x, y)

import matplotlib.pyplot as plt
plt.figure(figsize=(20, 6))
plt.subplot(1, 2, 1)
plt.plot(x, y)
x2 = np.linspace(0, 10.0, 50)
plt.plot(x2, y_int1(x2))
plt.subplot(1, 2, 2)
plt.plot(x, y)
im = plt.plot(x2, y_int2(x2))
```
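%% Cell type:markdown id: tags:
The bullets above also mention unstructured points. A minimal sketch using `scipy.interpolate.griddata` (the scattered sample points and the test function are made up for illustration):
%% Cell type:code id: tags:
``` python
import numpy as np
from scipy import interpolate

rng = np.random.default_rng(42)

# scattered samples of f(x, y) = x + y (illustrative choice)
pts = rng.uniform(0.0, 1.0, size=(200, 2))
vals = pts[:, 0] + pts[:, 1]

# interpolate the scattered data onto a regular grid
gx, gy = np.meshgrid(np.linspace(0.2, 0.8, 5), np.linspace(0.2, 0.8, 5))
grid_vals = interpolate.griddata(pts, vals, (gx, gy), method="linear")

# linear interpolation reproduces a linear function (up to round-off)
print(np.allclose(grid_vals, gx + gy))
```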
%% Cell type:markdown id: tags:
## NumPy/SciPy Summary
* NumPy: efficient handling of arrays
* SciPy: more advanced mathematical and numerical routines; uses NumPy under the hood *plus* other C/Fortran libraries
* Performance tips:
  * work on full arrays (slicing, NumPy routines, ...)
  * identify hotspots and optimize them
  * profile
  * code in Cython or C/Fortran with a Python interface -- or try Numba
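%% Cell type:markdown id: tags:
The "work on full arrays" tip can be illustrated with a small sketch (the function names are made up): both variants compute the same root-mean-square, but the second stays inside compiled NumPy code.
%% Cell type:code id: tags:
``` python
import numpy as np

def rms_loop(x):
    # explicit Python loop: interpreter overhead on every element
    total = 0.0
    for v in x:
        total += v * v
    return (total / len(x)) ** 0.5

def rms_vectorized(x):
    # same computation on the full array in compiled NumPy code
    return np.sqrt(np.mean(x * x))

x = np.linspace(0.0, 1.0, 10_001)
print(np.isclose(rms_loop(x), rms_vectorized(x)))
```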
%% Cell type:markdown id: tags:
## Further reading
* Harris, C.R., Millman, K.J., van der Walt, S.J. et al. *Array programming with NumPy.* **Nature** 585, 357–362 (2020). (https://doi.org/10.1038/s41586-020-2649-2)
* Bressert, E. (2012). *SciPy and NumPy* (1st edition). O'Reilly Media, Inc. (https://ebooks.mpdl.mpg.de/ebooks/Record/EB001944176)
* There are numerous books on the topic available: https://ebooks.mpdl.mpg.de/ebooks/Search/Results?type=AllFields&lookfor=numpy (MPG.eBooks work from any Max Planck IP address.)
%% Cell type:markdown id: tags:
# NumPy Input and Output
## Reading and writing NumPy arrays
Possibilities:
* NumPy's native IO facilities
* HDF5 using h5py
%% Cell type:markdown id: tags:
## Native NumPy Input/Output
### Saving NumPy arrays to files
%% Cell type:code id: tags:
``` python
import os
import numpy as np

x = np.linspace(0.0, 100.0, 101)
# save to text file (for debugging and small files only)
np.savetxt("x.txt", x)
```
%% Cell type:code id: tags:
``` python
# visualize the task graph necessary to perform the computation
a_mean.visualize()
```
%% Cell type:code id: tags:
``` python
# perform the computation on the Dask array
a_mean.compute()  # result: 49999.5
```
%% Cell type:code id: tags:
``` python
# perform the computation on the original NumPy array
data.mean()  # result: 49999.5
```
%% Cell type:markdown id: tags:
### Dask futures
%% Cell type:markdown id: tags:
### Dask-MPI
%% Cell type:markdown id: tags:
### Case Study
%% Cell type:markdown id: tags:
# Frameworks for parallel computing
**Python for HPC course**
Max Planck Computing and Data Facility, Garching
%% Cell type:markdown id: tags:
## Outline
* Motivation
* Overview on Parallel Frameworks
* Example: Dask
  * Concepts
  * Dask-MPI on a Slurm cluster
## Why Python frameworks for parallel computing?
## Why Python frameworks for parallel computing?
* Scale from a single local computer to large parallel resources, (ideally) with a minimum of code modifications
* Scale from a single local computer to large parallel resources, (ideally) with a minimum of code modifications
* Avoid the complexity of handling interprocess/internode communication explicitly ($\to$ MPI), better let the framework handle this!
* Avoid the complexity of handling interprocess/internode communication explicitly ($\to$ MPI), better let the framework handle this!
* Use cases: Data parallel problems that can be decomposed into tasks, e.g. processing, reduction, analysis of large amounts of data, training of certain neural networks, etc.
* Use cases: Data parallel problems that can be decomposed into tasks, e.g. processing, reduction, analysis of large amounts of data, training of certain neural networks, etc.
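%% Cell type:markdown id: tags:
The idea of decomposing a data-parallel problem into independent tasks can be sketched with the standard library alone, before any framework enters the picture (the chunk count and worker count here are arbitrary illustrative choices):
%% Cell type:code id: tags:
``` python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

data = np.arange(1_000_000, dtype=np.float64)

# decompose the reduction into independent per-chunk tasks
chunks = np.array_split(data, 8)

# execute the tasks concurrently and gather the partial results
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(np.sum, chunks))

# combine the partial results into the final reduction
total = sum(partial_sums)
print(total == data.sum())
```
Frameworks such as Dask automate exactly this pattern (chunking, scheduling, combining) and extend it across processes and nodes.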
%% Cell type:markdown id: tags:
## Overview on parallel frameworks (selection)
* [Apache Spark](https://spark.apache.org): designed for distributed big-data analytics; features include SQL, distributed caching, and multi-language bindings including Python
* [Dask](https://www.dask.org): parallel distributed computing; provides an implementation of the NumPy API (similarly for Pandas and scikit-learn) in combination with a powerful task scheduler, feels quite *pythonic*
* [Ray](https://www.ray.io): core library for distributed computing, plus a growing ecosystem of specific libraries (very often for AI applications)
%% Cell type:markdown id: tags:
## Background: Cloud vs HPC environments
### How to run and scale Python codes using such parallel frameworks?
* Cloud
  * (software) design often centered around web services
  * scaling works e.g. via container orchestration systems (e.g. Kubernetes)
  * Python frameworks for parallel computing are often designed with Cloud environments in mind (as these are available to a broader audience in contrast to 'real' HPC systems)
* HPC
  * not per se designed to run web services (e.g. interactive dashboards, dedicated scheduler services)
  * workloads are submitted via batch jobs
  * non-interactive use highly preferred

**Practical challenge: getting Python parallel frameworks to operate in concert with an HPC batch scheduler and infrastructure**
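%% Cell type:markdown id: tags:
One common pattern for Slurm, shown here only as a sketch (job name, node counts, and script name are made up), is to launch everything as a single MPI job via `dask-mpi`: inside `my_dask_script.py`, a call to `dask_mpi.initialize()` turns rank 0 into the Dask scheduler and the remaining ranks into workers before the client code connects with `distributed.Client()`.
%% Cell type:code id: tags:
``` bash
#!/bin/bash -l
#SBATCH --job-name=dask_demo      # illustrative job name
#SBATCH --nodes=2                 # illustrative resource request
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00

# launch scheduler, workers, and client together as one MPI job
srun -n $SLURM_NTASKS python my_dask_script.py
```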