Pilar Cossio · 8bfe977d
--- a/Performance-Variables.md
+++ b/Performance-Variables.md
@@ -5,10 +5,16 @@ We provide a list of the performance variables that enhance or modify the code

 - **GPU**=[x] (Default: 0) Set to 1 to enable GPU usage, set to 0 to use only the CPU. The GPU will be used to calculate the probabilities for all particle-images. The preparation of projections and PSF convolutions will be processed by the CPU. This is arranged in a pipeline to ensure continuous GPU utilization.

+- **GPUDEVICE**=[x] (Default: fastest) Only relevant if GPU=1
+    – If this is not set, BioEM will autodetect the fastest GPU. Only possible if MPI is not used.
+    – If x >= 0, BioEM will use GPU number x. Only possible if MPI is not used.
+    – If x = -1, BioEM runs with N MPI threads, and the system has G GPUs, then BioEM will use GPU with number (N % G). The idea is that one can place multiple MPI processes on one node, and each will use a different GPU. For a multi-node configuration, one must make sure that consecutive MPI ranks are placed on the same node, i.e. four processes on two nodes (node0 and node1) must be placed as (node0, node0, node1, node1) and not as (node0, node1, node0, node1), because in the latter case only 1 GPU per node will be used (by two MPI processes each).
+
+
 - **GPUALGO**=[x] (Default: 2) This option is only relevant if GPU=1 and FFTALGO=0. Hence, it is commonly not used, since FFTALGO defaults to 1. For the non-Fourier-algorithm there are three GPUALGO implementations:

    – x=2: This will parallelize over the particle-images, and over the center displacements. The approach requires less memory bandwidth than GPUALGO=0 or GPUALGO=1. However, it has several constraints on the problem configuration: i) the number of center displacements per dimension must be a power of 2, and ii) must be a factor of the number of CUDA threads per block.
 
-– x=1: This will parallelize over the particle-images, and then loop over the center displacements on the GPU. It is usually slower than GPUALGO=2 but there are no constraints on the problem configuration. The particle-images are not processed all at once but in chunks.
+    – x=1: This will parallelize over the particle-images, and then loop over the center displacements on the GPU. It is usually slower than GPUALGO=2 but there are no constraints on the problem configuration. The particle-images are not processed all at once but in chunks.

- – x=0: As GPUALGO=1, all particle images are processed at once. It is always slower than GPUALGO=1 and should not be used anymore.
\ No newline at end of file
+    – x=0: As GPUALGO=1, all particle images are processed at once. It is always slower than GPUALGO=1 and should not be used anymore.
\ No newline at end of file