Commit 37f46999 authored by Pilar Cossio

removed correct unnecessary files

parent aba3be7a
Pipeline #4756 passed with stage
cmake_minimum_required(VERSION 2.6)
###Set up options
option (INCLUDE_CUDA "Build BioEM with CUDA support" ON)
option (INCLUDE_OPENMP "Build BioEM with OpenMP support" ON)
option (INCLUDE_MPI "Build BioEM with MPI support" ON)
option (PRINT_CMAKE_VARIABLES "List all CMAKE Variables" OFF)
option (CUDA_FORCE_GCC "Force GCC as host compiler for CUDA part (If standard host compiler is incompatible with CUDA)" ON)
###Set up general variables
set (BIOEM_ICC_FLAGS "-xHost -O3 -fno-alias -fno-fnalias -unroll -g0 -ipo")
set (BIOEM_GCC_FLAGS "-O3 -march=native -fweb -mfpmath=sse -frename-registers -minline-all-stringops -ftracer -funroll-loops -fpeel-loops -fprefetch-loop-arrays -ffast-math -ggdb")
set (BIOEM_SOURCE_FILES "bioem.cpp" "main.cpp" "map.cpp" "model.cpp" "param.cpp" "cmodules/timer.cpp")
###Find Required Packages
pkg_check_modules(FFTW fftw3)
find_package(FFTW 3 REQUIRED)
find_package(Boost 1.43 REQUIRED)
###Find Optional Packages
###Find CUDA
set (BIOEM_CUDA_STATUS "Disabled")
set (BIOEM_CUDA_STATUS "Not Found")
#Use GCC as host compiler for CUDA even though host compiler for other files is not GCC
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS};--use_fast_math;-ftz=true;-O4;-Xptxas -O4")
list(APPEND CUDA_NVCC_FLAGS "-gencode=arch=compute_13,code=sm_13")
list(APPEND CUDA_NVCC_FLAGS "-gencode=arch=compute_20,code=sm_20")
list(APPEND CUDA_NVCC_FLAGS "-gencode=arch=compute_20,code=sm_21")
list(APPEND CUDA_NVCC_FLAGS "-gencode=arch=compute_30,code=sm_30")
list(APPEND CUDA_NVCC_FLAGS "-gencode=arch=compute_35,code=sm_35")
###Find OpenMP
set (BIOEM_OPENMP_STATUS "Disabled")
###Find MPI
set (BIOEM_MPI_STATUS "Disabled")
set (BIOEM_MPI_STATUS "Not Found")
set (BIOEM_MPI_STATUS "Found")
###Build Executable
#Hack to use GCC flags for GCC host compiler during NVCC compilation, although host compiler is in fact not GCC for other files
cuda_add_executable(bioEM ${BIOEM_SOURCE_FILES}
add_executable(bioEM ${BIOEM_SOURCE_FILES})
#Additional CXX Flags not used by CUDA compiler
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wall -pedantic")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Wno-vla -Wno-unused-result -Wno-unused-local-typedefs -pedantic")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unknown-pragmas")
###Add Libraries
target_link_libraries(bioEM ${CUDA_CUDA_LIBRARY})
target_link_libraries(bioEM ${FFTWF_LIBRARIES})
target_link_libraries(bioEM -L${FFTW_LIBDIR} -lfftw3 -lfftw3f)
target_link_libraries(bioEM -L${Boost_LIBRARY_DIRS} -lboost_program_options)
target_link_libraries(bioEM ${MPI_LIBRARIES})
###Show Status
message(STATUS "Build Status")
message(STATUS "FFTW library: ${FFTW_LIBDIR}")
message(STATUS "Boost directory: ${Boost_LIBRARY_DIRS}")
message(STATUS "FFTW includedir: ${FFTW_INCLUDEDIR}")
get_cmake_property(_variableNames VARIABLES)
foreach (_variableName ${_variableNames})
message(STATUS "${_variableName}=${${_variableName}}")
BioEM has two sets of parameters:
- One set describes the problem, such as the number of pixels.
These parameters are specified in a file passed to BioEM via the --Inputfile command line option.
- Another set describes the runtime configuration, which involves how to parallelize, whether to
use a GPU, and some other algorithmic settings. These parameters do not (or only slightly)
change the output, but have a large influence on the compute performance. They are passed to
BioEM via environment variables.
The two sets are treated differently, because the first set is related to the actual problem, while
the second set belongs to the compute node where the problem is processed. These runtime parameters
should be tuned for every particular architecture where BioEM is executed.
This document will explain the types of parallelization used within BioEM (1), list all relevant
environment variables (2), and provide some suggestions for runtime configurations in different
situations (3).
1. Ways of parallelization
BioEM compares various projections of a model to a set of reference maps.
Technically, the model is first projected along a given angular orientation,
then it is convolved with a convolution kernel in Fourier space to cope with imaging artifacts,
next it is shifted by a certain number of pixels to find the best position for a match,
and finally this modified projection is compared to a reference map.
From a computational complexity perspective, the comparison of a given projection and convolution
with all the reference maps is by far the most time consuming part. Calculating the projection and
performing the convolution together account for only 0-5% of the overall execution time (depending on
the number of maps and convolution kernels).
BioEM facilitates this task using an all to all comparison through a nested loop.
- The outermost loop is over the orientations.
- The next loop goes over the different convolution kernels.
- An inner loop runs over all maps.
- And finally the innermost loop iterates over the shifts.
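The loop nest described above can be sketched as follows. This is only an illustration of the
loop order, not BioEM's implementation: the function name and its arguments are hypothetical,
and the body of the innermost loop merely counts comparisons instead of computing probabilities.

```python
def count_comparisons(n_orientations, n_kernels, n_maps, n_shifts):
    """Walk the four nested loops in the documented order and count comparisons."""
    comparisons = 0
    for orientation in range(n_orientations):    # outermost loop: orientations
        for kernel in range(n_kernels):          # next loop: convolution kernels
            for ref_map in range(n_maps):        # inner loop: reference maps
                for shift in range(n_shifts):    # innermost loop: shifts
                    comparisons += 1             # stand-in for one map comparison
    return comparisons


# The total work is the product of the four loop lengths.
print(count_comparisons(2, 3, 4, 5))
```

Each of the four loop levels is a candidate dimension for parallelization.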
This offers in principle four dimensions for parallelization. BioEM provides the following options:
- MPI: BioEM can use MPI to parallelize over the orientations in the outermost loop. In this case
the probabilities for all reference maps / convolution kernels / shifts are calculated for a
certain subset of orientations by each MPI process. Afterward, the probabilities computed by
every MPI process are reduced to the final probabilities. If started via 'mpirun', BioEM will
automatically distribute the orientations evenly among all MPI processes.
- OpenMP: BioEM can use OpenMP to parallelize over the maps in the inner loop. As processing of
these maps is totally independent, there is no synchronization required at all. BioEM will
automatically multithread over the maps. The number of employed threads can be controlled with
the standard OMP_NUM_THREADS environment variable for OpenMP.
- Graphics Processing Units (GPUs): BioEM can use GPUs to speed up the processing. In this case,
the inner loop over all maps with all shifts is processed by the GPU in one step with
parallelization over the maps. The projections and the convolutions are still processed by the
CPU. This process is pipelined such that the CPU prepares the next projections and convolutions
while the GPU calculates the probabilities for all maps and for the previous projection and
convolution. Hence this is a horizontal parallelization layer among the maps with an additional
vertical layer through the pipeline. Usage of GPUs must be enabled with the GPU=1 environment
variable. One BioEM process will always only use one GPU, by default the fastest one. A GPU
device can be explicitly configured with the GPUDEVICE=[x] environment variable. Multiple GPUs
can be used through MPI. In this case every GPU will process all maps but calculate the
probabilities only for a subset of the orientations (see description of MPI above). Selection
of GPU devices for each MPI process must be carried out in one of the following ways:
* Either the scheduler must mask all GPUs but one, such that each MPI process sees only 1
GPU device.
* Or the GPUDEVICE=[x] environment variable must be set differently for each MPI process.
* Or it is possible to set GPUDEVICE=-1. In this case the MPI process with rank N on a system
with G GPUs will take the GPU with ID (N % G).
- GPU / CPU combined processing: Besides the pipeline approach described in the previous point,
which employs the CPU for creating the convolved projection maps and the GPU for calculating
the probabilities for every combination with a map, there is also the possibility to split the
set of maps among the CPU and the GPU. This is facilitated by the GPUWORKLOAD=[x] environment
variable, which sets the percentage of the number of maps processed by the GPU to x%. The
pipeline is still used, i.e. in an optimal situation the CPU will:
* Issue a GPU kernel call such that the GPU calculates the probabilities for x% of the maps
for the current orientation and convolution.
* Process its own fraction of (100-x)% of the maps for the current orientation and convolution
in parallel to the GPU.
* Afterward, finish the preparation of the next orientation and convolution before the GPU has
finished calculating the probabilities for the current orientation and convolution.
- Multiple Projections at once via OpenMP: BioEM can prepare the projection of multiple
orientations at once using OpenMP. The benefit compared to the pure OpenMP parallelization over
the maps is however tiny. This is relevant if: MPI is not used, OpenMP is used and usually if
no GPU is used, and if the number of reference maps is small. The number of projections at once
is determined by the BIOEM_PROJECTIONS_AT_ONCE=[x] environment variable.
- Fourier-algorithm to process all shifts in parallel: BioEM can operate using two algorithms:
A Fourier-algorithm and a non-Fourier-algorithm. Besides numerical effects, both produce
identical results assuming certain boundary conditions. The Fourier-algorithm automatically
takes all shifts into account without having to loop over the shifts. Hence, its runtime is
almost independent of the number of shifts. The non-Fourier-algorithm is faster for very
few shifts. Benchmarks have shown that the Fourier-algorithm is faster if the number of shifts
per direction is four or larger, which is the case for almost every relevant configuration.
Hence, the Fourier-algorithm should almost always be used. It can be activated / deactivated
through the FFTALGO=[x] environment variable, which defaults to 1. With the Fourier-algorithm
used normally, the number of nested loops is reduced to three, which makes the loop over the
maps the innermost loop.
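As an illustration of the MPI option in the list above, the following sketch splits the
orientations evenly among the MPI processes. It assumes a contiguous block partition; how BioEM
actually assigns orientations to ranks is not stated here, and the function name is hypothetical.

```python
def orientations_for_rank(n_orientations, n_ranks, rank):
    """Return the contiguous block of orientation indices handled by one rank.

    Assumption: blocks differ in size by at most one orientation.
    """
    base, extra = divmod(n_orientations, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


# 10 orientations on 4 MPI processes: every orientation is handled exactly once.
for rank in range(4):
    print(rank, list(orientations_for_rank(10, 4, rank)))
```

Every rank computes probabilities for all maps, kernels, and shifts on its own block, and the
per-rank results are reduced afterward.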
For the parallelization over the CPU cores of one node, there are in principle two options:
- One can simply use OpenMP to parallelize over the maps (optionally using the
BIOEM_PROJECTIONS_AT_ONCE=[x] variable to prepare multiple projections at once).
- One can use MPI with as many MPI processes as there are CPU cores and with OMP_NUM_THREADS=1.
In this case, the parallelization is done over the projections only.
In general, the first option is better for many maps while the second option is faster for
few maps.
Naturally, different methods of parallelization can be combined:
- The Fourier-algorithm can be combined (and is by default combined) with any chosen option.
- One can combine MPI with the GPU algorithm to use multiple GPUs at once (as described in
the GPU section).
- As stated in the previous paragraph, on one node OpenMP works better for many maps while MPI
is better with few maps. For a medium number of maps, both methods can be combined. For
instance, OMP_NUM_THREADS=[x] can be set to x = 1/4th of the number of CPU cores on the system,
and BioEM can be called with 'mpirun' and 4 MPI processes. In that case always four
orientations are processed in parallel and x maps are processed in parallel.
- On multiple nodes, one can either use MPI exclusively, using as many MPI processes as there are
CPU cores in total. Alternatively, one can use one MPI process per node and do the intra-node
parallelization with OpenMP. Naturally this can also be combined with the previous option, to
mix MPI and OpenMP inside a single node.
- One can use GPUs and CPU cores jointly to calculate the probabilities for all maps. For more
than one GPU, MPI must be employed. In this case, the number of MPI processes must match the
number of GPUs. So the above method to combine MPI and OpenMP inside one node must be employed,
in order to use all CPU cores.
2. List of environment variables.
FFTALGO=[x] (Default: 1)
Set to 1 to enable the Fourier-algorithm (default) or to 0 to use the non-Fourier-algorithm.
GPU=[x] (Default: 0)
Set to 1 to enable GPU usage, set to 0 to use only the CPU.
GPU will be used to calculate the probabilities for all maps. The preparation of projections
and convolutions will be processed by the CPU. This is arranged in a pipeline to ensure
continuous GPU utilization.
GPUALGO=[x] (Default: 2)
This option is only relevant if GPU=1 and FFTALGO=0.
Hence, it is commonly not used any more, since FFTALGO defaults to 1.
For the non-Fourier-algorithm there are three GPUALGO implementations:
- GPUALGO=2: This will parallelize over the maps and over the shifts. The approach requires less
memory bandwidth than GPUALGO=0 or GPUALGO=1. However, it poses several constraints
on the problem configuration:
* The number of shifts per dimension must be a power of 2.
* The total number of shifts must be a factor of the number of CUDA threads per
- GPUALGO=1: This will parallelize over the maps and then loop over the shifts on the GPU. It is
usually slower than GPUALGO=2 but there are no constraints on the problem
configuration. The maps are not processed all at once but in chunks.
- GPUALGO=0: As GPUALGO=1, but all maps are processed at once. It is always slower than GPUALGO=1
and should not be used anymore.
GPUDEVICE=[x] (Default: fastest)
Only relevant if GPU=1.
- If this is not set, BioEM will autodetect the fastest GPU.
- If x >= 0, BioEM will use GPU number x.
- If x = -1 and the system has G GPUs, the BioEM MPI process with rank N will use the GPU
with number (N % G). The idea is that one can place multiple MPI processes on one node, and
each will use a different GPU. For a multi-node configuration, one must make sure that
consecutive MPI ranks are placed on the same node, i.e. four processes on two nodes (node0 and
node1) must be placed as (node0, node0, node1, node1) and not as (node0, node1, node0, node1),
because in the latter case only 1 GPU per node will be used (by two MPI processes each).
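The effect of rank placement under GPUDEVICE=-1 can be checked with a small sketch. It assumes
only the rule stated above (GPU ID = rank % G); the helper names and the node labels are
hypothetical.

```python
def gpu_for_rank(rank, gpus_per_node):
    """GPU ID selected under GPUDEVICE=-1: MPI rank modulo the number of GPUs."""
    return rank % gpus_per_node


def gpus_used_per_node(placement, gpus_per_node):
    """Map each node to the set of GPU IDs its ranks select.

    placement[i] is the node hosting MPI rank i.
    """
    used = {}
    for rank, node in enumerate(placement):
        used.setdefault(node, set()).add(gpu_for_rank(rank, gpus_per_node))
    return used


consecutive = ["node0", "node0", "node1", "node1"]
interleaved = ["node0", "node1", "node0", "node1"]
print(gpus_used_per_node(consecutive, 2))  # both GPUs on each node are used
print(gpus_used_per_node(interleaved, 2))  # only one GPU per node is used
```

With the interleaved placement, ranks 0 and 2 share GPU 0 on node0 and ranks 1 and 3 share GPU 1
on node1, so half of the GPUs stay idle.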
GPUWORKLOAD=[x] (Default: 100)
Only relevant if GPU=1.
This defines the fraction of the workload assigned to the GPU in percent, or more precisely the
percentage of the maps processed by the GPU. The remaining maps will be processed by the CPU. Preparation
of projection and convolution will be processed by the CPU in any case.
GPUASYNC=[x] (Default: 1)
Only relevant if GPU=1.
This uses a pipeline to overlap the processing on the GPU, the preparation of projections and
convolutions on the CPU, and the DMA transfer. There is no reason to disable this except for
debugging purposes.
GPUDUALSTREAM=[x] (Default: 1)
Only relevant if GPU=1.
If this is set to 1, the GPU will use two streams in parallel. This can help to improve the GPU
utilization. Benchmarks have shown that this setting has only a very small positive effect.
OMP_NUM_THREADS=[x] (Default: Number of CPU cores)
This is the standard OpenMP environment variable to define the number of OpenMP threads.
It can be used for profiling purposes to analyze the scaling.
If one chooses to use MPI for intra-node parallelization (see last two paragraphs of chapter 1),
this can be set to x=1 (to use MPI exclusively) or to other values for a mixed MPI / OpenMP
configuration.
BIOEM_PROJECTIONS_AT_ONCE=[x]
This defines the number of projections prepared at once. Benchmarks have shown that the effect is
negligible. OpenMP is used to prepare these projections in parallel. It is mostly relevant, if
- OpenMP is used.
- No GPU is used.
- The number of reference maps is very small.
BIOEM_DEBUG_BREAK=[x] (Default: deactivated)
This is a debugging option. It will limit the number of projections and the number of convolutions
to a maximum of x each. It can be used for profiling to analyze scaling, and for fast sanity tests.
BIOEM_DEBUG_NMAPS=[x] (Default: deactivated)
As BIOEM_DEBUG_BREAK, with the difference that this limits the number of reference maps to a
maximum of x.
BIOEM_DEBUG_OUTPUT=[x] (Default: 2)
Change the verbosity of the output. Higher means more output, lower means less output.
- 0: Stands for no debug output.
- 1: Limited timing output.
- 2: Standard timing output showing durations of projection, convolution, and comparison (creation
of probabilities).
Values above 2 add successively more extensive output.
3. Suggestions for runtime configurations
The following settings should not be touched and should be left at their defaults:
FFTALGO (Default 1), GPUALGO (Default 2), GPUASYNC (Default 1), GPUDUALSTREAM (Default 1)
For profiling, one should limit the number of orientations and projections.
BIOEM_DEBUG_OUTPUT=2 is a good choice to get the timing. For a small number of maps it might
make sense to switch to BIOEM_DEBUG_OUTPUT=1.
For a production environment, BIOEM_DEBUG_OUTPUT=0 can reduce the size of the text output.
BIOEM_PROJECTIONS_AT_ONCE=[x] usually has a tiny effect or none, but never a negative one.
The memory footprint increases with x, so it should not be set unnecessarily large.
For best performance, choose a multiple of the number of OpenMP threads.
GPU=1 should usually be used if a GPU is available.
Performancewise, one Titan GPU corresponds roughly to 20 cores at 3 GHz.
If the CPU has significant compute capabilities, one should use GPUWORKLOAD=[x] for combined
CPU / GPU processing.
If one uses combined CPU / GPU processing, a good value for GPUWORKLOAD=[x] must be determined.
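A good starting point is to measure CPU and GPU individually (GPU=0 for a CPU-only run, GPU=1 with
the default GPUWORKLOAD=100 for a GPU-only run)
and compare the time for the comparison. Assuming the CPU takes c seconds for the comparison and the
GPU takes g seconds, a good starting point for GPUWORKLOAD=[x] is x = 100 * c / (c + g), so that
the faster device receives the larger share and both finish at about the same time.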
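The balance behind this starting value can be worked through numerically. If the GPU processes a
fraction f of the maps it needs about f * g seconds, while the CPU needs about (1 - f) * c seconds
for the rest; both finish together when f = c / (c + g). The sketch below assumes that the
comparison time scales linearly with the number of maps and ignores pipeline and transfer
overhead; the function names are hypothetical.

```python
def gpu_workload_percent(cpu_seconds, gpu_seconds):
    """Starting value for GPUWORKLOAD that balances CPU and GPU finish times.

    cpu_seconds: time for the full comparison on the CPU alone (c)
    gpu_seconds: time for the full comparison on the GPU alone (g)
    """
    return 100.0 * cpu_seconds / (cpu_seconds + gpu_seconds)


def finish_times(cpu_seconds, gpu_seconds, workload_percent):
    """Estimated (CPU, GPU) times when the GPU takes workload_percent of the maps."""
    f = workload_percent / 100.0
    return (1.0 - f) * cpu_seconds, f * gpu_seconds


# Example: the CPU alone needs 30 s, the GPU alone needs 10 s.
x = gpu_workload_percent(30.0, 10.0)
print(x, finish_times(30.0, 10.0, x))
```

The measured optimum can differ from this estimate, so x should still be tuned around the
starting value.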
On a single node, one should use OpenMP parallelization for many maps (> 1000) and MPI
parallelization for few maps (< 100). Assuming a system with N CPU cores, the command would be:
OMP_NUM_THREADS=1 mpirun -n [N] BioEM (for few maps)
For a medium number of maps, a combined MPI / OpenMP configuration can be better.
Assume 20 CPU cores, possible options would be (among others):
OMP_NUM_THREADS=1 mpirun -n 20 BioEM (20 MPI processes with 1 OMP thread each)
OMP_NUM_THREADS=2 mpirun -n 10 BioEM (10 MPI processes with 2 OMP threads each)
OMP_NUM_THREADS=5 mpirun -n 4 BioEM (4 MPI processes with 5 OMP threads each)
OMP_NUM_THREADS=10 mpirun -n 2 BioEM (2 MPI processes with 10 OMP threads each)
In any case, you should make sure that the number of MPI processes times the number of OMP threads
per process equals the number of (virtual) CPU cores.
In general, the more MPI processes, the better for few maps, the more OMP threads, the better for
many maps.
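The admissible combinations for a given core count can be enumerated with a short sketch. It
relies only on the constraint stated above (MPI processes times OMP threads equals the number of
cores); the function name is hypothetical, and the printed command lines follow the pattern of the
examples above.

```python
def mpi_openmp_splits(n_cores):
    """List all (mpi_processes, omp_threads) pairs whose product is n_cores."""
    return [(n_cores // threads, threads)
            for threads in range(1, n_cores + 1)
            if n_cores % threads == 0]


# Print one command line per split for a 20-core node.
for processes, threads in mpi_openmp_splits(20):
    print(f"OMP_NUM_THREADS={threads} mpirun -n {processes} BioEM")
```

Splits near the top of the list (many MPI processes) suit few maps, while splits near the bottom
(many OMP threads) suit many maps.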
If a system offers multiple GPUs, all GPUs should be used. This must be accomplished via MPI.
In this case, the number of MPI processes per node must match the number of GPUs per node.
There are different ways to make sure every MPI process uses a different GPU (as discussed in
the GPU paragraph of chapter 1). Assuming the MPI processes are placed such that consecutive MPI
ranks are placed on one node, one can use the GPUDEVICE=-1 setting. This is assumed here.
Let us assume an example of N nodes with C CPU cores each and G GPUs each.
The following command will use all GPUs and ignore the CPUs:
With the following command, you can use all the CPU cores as well. A combined MPI / OpenMP setting
as discussed in the previous paragraph must be used, under the constraint that the number of MPI
processes matches the number of GPUs:
Here, GPUWORKLOAD must be tuned as described before.
BioEM: Bayesian inference of Electron Microscopy
PRE-ALPHA VERSION: April 25, 2014
**** FFTW libraries:
**** BOOST libraries:
**** OpenMP:
**** CMake:
for compilation with CMakeLists.txt file.
**** Cuda: Parallel Code for GPUs.
*** The BioEM code compares one Model to multiple experimental
EM images.
*** Command line input & help is found by just running the
compiled executable ./bioEM
++++++++++++ FROM COMMAND LINE +++++++++++
Command line inputs:
--Inputfile arg (Mandatory) Name of input parameter file
--Modelfile arg (Mandatory) Name of model file
--Particlesfile arg (Mandatory) Name of particles file
--ReadPDB (Optional) If reading model file in PDB format
--ReadMRC (Optional) If reading particle file in MRC format
--ReadMultipleMRC (Optional) If reading Multiple MRCs
--DumpMaps (Optional) Dump maps after they were read from maps file
--LoadMapDump (Optional) Read Maps from dump instead of maps file
--help (Optional) Produce help message
-- Main output file: "Output_Probabilities"
RefMap #(number Particle Map) Probability #(log(P))
RefMap #(number Particle Map) Maximizing Param: #(Euler Angles) #(PSF parameters) #(center displacement)
**Important: It is recommended to compare log(P) with respect to other Models or to Noise as in [1].
-- (Optional) Write the probabilities for each triplet of Euler Angles (key word: WRITE_PROB_ANGLES in InputFile).
A directory with example EM particles, c-alpha PDB & simple Model, and
the corresponding launch scripts are provided.
-- Standard input file parameters are provided and recommended.
Two options are allowed for the map-particle files:
A) Simple *.txt or *.dat with data formatted as
printf"%8d%8d%12.4f\n" where the first two columns are
the pixel indexes and the third column is the intensity.
Multiple particles are read in the same file with the
separator "PARTICLE" & Number.
-- For this case it is recommended all particles
to be normalized to zero average and unit standard deviation.
B) Standard MRC particle file. If reading multiple MRCs
provide in command line
--Particlesfile FILE --ReadMRC --ReadMultipleMRC
where FILE contains the names of each mrc file to be read.
If reading only one MRC, provide in command line
--Particlesfile FILEMRC --ReadMRC
where FILEMRC is the name of the single mrc file.
-- Standard PDB file: Reading only CA atoms and corresponding
residues with proper density.
-- *.txt *.dat file: With format printf"%f %f %f %f %f\n",
the first three columns as the coordinates of atoms or
voxels, fourth column is the radius (\AA) and the
last column is the corresponding density.
(Useful for all atom representation or 3D EM density maps).
[1] Cossio, P and Hummer, G. J Struct Biol. 2013 Dec;184(3):427-37. doi: 10.1016/j.jsb.2013.10.006.