Commit 1f5be3fc authored by Pilar Cossio

Tutorial update

parent 8ae18d67
2
0.836000 1.369000 2.640000
-1.375000 0.400000 3.018000
\documentclass[12pt, psamsfonts]{book}
\usepackage[english]{babel}
\usepackage[latin1]{inputenc}
\usepackage[reqno]{amsmath}
\usepackage{threeparttable,rotating,fancybox}
\usepackage[colorlinks=true]{hyperref} %Insert hyperlinks in latex files
\usepackage{graphicx}
\usepackage{amsfonts}
\usepackage{latexsym}
\usepackage[dvips]{color}
%% \usepackage{amssymb, amsmath, amsfonts}
%% \usepackage{latin1}
%% \usepackage{color}
%% \usepackage[dvips]{graphics}
\usepackage{fancyhdr}
\setlength{\headheight}{15.2pt}
\pagestyle{fancy}
%\usepackage[utf8]{inputenc}
\pagenumbering{roman}
\newcommand*{\titleGM}{\begingroup % Create the command for including the title page in the document
\hbox{ % Horizontal box
\hspace*{0.1\textwidth} % Whitespace to the left of the title page
\rule{3pt}{\textheight} % Vertical line
\hspace*{0.05\textwidth} % Whitespace between the vertical line and title page text
\parbox[b]{0.75\textwidth}{ % Paragraph box which restricts text to less than the width of the page
{\noindent\Huge BioEM Manual}\\[2\baselineskip] % Title
{\large \textit{Software for Bayesian analysis of EM images}}\\[4\baselineskip] % Tagline or further description
{\Large \textsc{ }}%Pilar Cossio \\ David Rohr \\ Volker Linderstruth \\ Gerhard Hummer}} % Author name
\vspace{0.5\textheight} % Whitespace between the title block and the publisher
{\noindent Max Planck Institute of Biophysics }\\[\baselineskip] % Publisher and logo
}}
\endgroup}
\title{BioEM Manual}
\author{P. Cossio, D. Rohr, V. Lindenstruth, G. Hummer}
\date{March 2015}
\begin{document}
\pagestyle{plain}
\titleGM
\tableofcontents
%\pagestyle{fancy}
\chapter{The BioEM software}
\pagenumbering{arabic}
\section{Introduction}
The BioEM method uses Bayesian analysis to calculate the posterior probability
of a model given multiple experimental EM images.
The key idea is not to modify the raw images but to create a calculated image from the original model that is as similar as possible to the observed experimental image.
The calculated image takes into account the relevant factors in the experiment and image formation, such as the molecule orientation, interference effects, uncertainties in the particle center, offset in the intensities, and, importantly, noise.
Technically, the model is first rotated to a given orientation and projected along the $z$-axis.
The projection is then convolved with a point spread function (a step carried out in Fourier space) to account for imaging artefacts,
shifted by a certain number of pixels to account for the center displacement,
and finally compared to a reference particle-image.
The similarity between the calculated projection and the experimental image, for a given parameter set, is assessed through a likelihood function.
The posterior probability of the model is obtained by integrating the product of the likelihood function and the prior probabilities over all parameter ranges.
The BioEM software is used to perform this integration numerically. A detailed description of the BioEM method is found in Refs. \cite{CossioHummer_2013,BioEM_server}.
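
As a compact summary of the paragraph above, the posterior can be sketched as follows. The symbols are notation chosen here for illustration and may differ from that of the cited references: $m$ denotes the model, $\omega_k$ the $k$-th experimental image, $\Theta$ the set of parameters (orientation, PSF parameters, center displacement, normalization, offset, and noise), $L$ the likelihood, and $p$ the prior probabilities.
\begin{equation*}
P(m \mid \omega_k) \propto p(m) \int L(\omega_k \mid \Theta, m)\, p(\Theta)\, d\Theta .
\end{equation*}
Under the additional assumption that the $K$ images are statistically independent, the posterior for the whole image set is proportional to
\begin{equation*}
p(m) \prod_{k=1}^{K} \int L(\omega_k \mid \Theta, m)\, p(\Theta)\, d\Theta .
\end{equation*}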
\section{Installation}
Before installing BioEM, several programs and libraries
must be installed on the compute node. In the following,
we briefly describe the mandatory and the optional prerequisites.
\subsection{Prerequisite programs and libraries}
\begin{itemize}
\item {\it FFTW library:} a subroutine library for computing the discrete Fourier transform.
In BioEM, it is used to calculate the convolution of the ideal image with the point spread function (PSF), and
the cross-correlation of the calculated image with the experimental image. FFTW can be downloaded from www.fftw.org.
\item {\it BOOST library:} provides support for tasks and structures such as linear algebra, pseudorandom number generation, multithreading,
image processing, and unit testing. In BioEM, it is used to access and organize the input data.
BOOST can be downloaded from www.boost.org.
\item {\it OpenMP:} a programming interface that supports multi-platform shared-memory parallel programming.
It is normally included in the standard GNU and Intel C++ compilers, so no download should be necessary. For more information see http://openmp.org/.
\end{itemize}
The optional programs (whose use is
{\bf encouraged}) for easy compilation and optimal performance are described below.
\begin{itemize}
\item {\it CMake:} a cross-platform tool for managing the build process of software using a compiler-independent method ({\it i.e.} creating a Makefile).
CMake can be downloaded from www.cmake.org.
\item {\it CUDA:} a parallel computing platform for the graphics processing units (GPUs) produced by NVIDIA.
Thus, NVIDIA graphics cards are necessary for running BioEM with the CUDA implementation. For more information see
www.nvidia.com.
\item {\it MPI:} the Message Passing Interface is a standardized and portable message-passing system designed to function on a wide variety of parallel computers, with and without shared memory.
{\bf PC:} Difference between OpenMPI and MPICH (?). Is one recommended over the other?
\end{itemize}
After these programs are successfully installed on your compute node,
it will be possible to install BioEM.
{\it Note:} It is recommended that the same compiler that is used for the
libraries is also used for BioEM.
\subsection{Download}
A compressed directory of the BioEM software can be downloaded from [mpi biophys].
After downloading the {\it tar} file, uncompress by executing
\vspace{0.5cm}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{tar -zxvf BioEM.tar.gz}}}}
\vspace{0.5cm}
In the uncompressed {\it BioEM} directory, there are
\begin{itemize}
\item[--]the source code {\it cpp} files, and {\it include} directory with corresponding header files.
\item[--]the copyright license, and README file.
\item[--]the {\it CMakeLists.txt} file that is necessary for CMake (see section below).
\item[--]the {\it Tutorial} directory that includes the example files used in the tutorial (see chapter \ref{tutorial}). Inside this directory,
there is also a directory called {\it MODEL\_COMPARISON}.
\item[--]the {\it Quaternions} directory that includes list files of quaternions that sample uniformly
the rotational group (see section \ref{intor}).
\end{itemize}
\subsection{Installing with CMake}
The easiest installation is done with the CMake program,
which automatically generates a Makefile according to the
specific CPU/GPU architecture and desired features. CMake
uses the {\it CMakeLists.txt} file, which contains all the instructions to generate the Makefile.
This file is provided in the uncompressed BioEM directory.
At the beginning of the {\it CMakeLists.txt} the modifiable options are provided.
These options should be enabled/disabled ({\bf ON}/{\bf OFF}) according to the desired functionalities.
\begin{itemize}
\item To enable or disable {\it OpenMP}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{option (INCLUDE\_OPENMP "Build BioEM with OpenMP support" ON/OFF)}}}}
\item To enable or disable {\it MPI}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{option (INCLUDE\_MPI "Build BioEM with MPI support" ON/OFF)}}}}
\item To enable or disable {\it CUDA}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{option (INCLUDE\_CUDA "Build BioEM with CUDA support" ON/OFF)}}}}
\item To print out the CMake variables
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{option (PRINT\_CMAKE\_VARIABLES "List all CMAKE Variables" ON/OFF)}}}}
\item To use {\it CUDA} with the Intel compiler ({\it icc, mpiicc}),
one needs to turn {\bf ON} the \texttt{CUDA\_FORCE\_GCC} variable
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{option (CUDA\_FORCE\_GCC "Force GCC as host compiler for CUDA part (If
standard host compiler is incompatible with CUDA)" ON/OFF)}}}}
\end{itemize}
{\it Note:} For certain architectures, additional files ({\it e.g.} FindFFTW.cmake) could also be needed for successful CMake
execution. For more information on specific CMake features ({\it e.g.} changing the compiler), see www.cmake.org.
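
As an illustration of the options above, a CPU-only build with OpenMP enabled and with MPI and CUDA disabled might use the following lines in {\it CMakeLists.txt}. This particular combination is only an assumed example; the option names and descriptions are those listed above.
\begin{verbatim}
option (INCLUDE_OPENMP "Build BioEM with OpenMP support" ON)
option (INCLUDE_MPI "Build BioEM with MPI support" OFF)
option (INCLUDE_CUDA "Build BioEM with CUDA support" OFF)
option (PRINT_CMAKE_VARIABLES "List all CMAKE Variables" OFF)
\end{verbatim}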
\subsubsection{Steps for basic installation}
\begin{itemize}
\item[--] Set up the desired features in the CMakeLists.txt file.
\item[--] Generate a build directory in the main BioEM directory
\fbox{%
\parbox{10cm}{
{\footnotesize \texttt{mkdir build}}}}
\item[--]Access the build directory, and run CMake with the {\it CMakeLists.txt} file
\fbox{%
\parbox{10cm}{
{\footnotesize \texttt{cd build \\
cmake ../CMakeLists.txt}}}}
\item[--] If this process is successful, a Makefile and CMakeFiles directory should be generated in the build directory.
If this is not the case, turn on the \texttt{PRINT\_CMAKE\_VARIABLES} option in the CMakeLists.txt file, and re-run
CMake with verbosity.
\item[--] After generating the Makefile, run it in the build directory
\fbox{%
\parbox{10cm}{
{\footnotesize \texttt{make}}}}
\item[--] If this process is successful, a bioEM executable should be generated.
\end{itemize}
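
The commands from the steps above are collected here in the order in which they would be typed, starting from the main BioEM directory and assuming that the options in {\it CMakeLists.txt} have already been set:
\begin{verbatim}
mkdir build
cd build
cmake ../CMakeLists.txt
make
\end{verbatim}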
For a simple test, run
\vspace{0.5cm}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{./bioEM}}}}
\vspace{0.5cm}
The output on the screen should be
\vspace{0.5cm}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{
Command line inputs:\\
--Modelfile arg (Mandatory) Name of model file\\
--Particlesfile arg if BioEM (Mandatory) Name of particles file\\
--Inputfile arg if BioEM (Mandatory) Name of input parameter file\\
--PrintBestCalMap arg (Optional) Only print best calculated map (file nec.).
NO BioEM (!)\\
--ReadOrientation arg (Optional) Read orientation list instead of uniform
grid (file nec.)\\
--ReadPDB (Optional) If reading model file in PDB format\\
--ReadMRC (Optional) If reading particle file in MRC format \\
--ReadMultipleMRC (Optional) If reading Multiple MRCs \\
--DumpMaps (Optional) Dump maps after they were red from maps file\\
--LoadMapDump (Optional) Read Maps from dump instead of maps file\\
--OutputFile arg (Optional) For changing the outputfile name\\
--help (Optional) Produce help message\\
}}}}
\vspace{1cm}
\chapter{BioEM input}
The main input to the BioEM software is given on the command line, where the filenames
of the model, particle-images, and parameter file should be provided.
We now describe each input item in detail.
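
A typical invocation with the three mandatory inputs might look as follows. The filenames \texttt{model.txt}, \texttt{particles.txt}, and \texttt{param.inp} are hypothetical placeholders; the keywords are those printed by the help message.
\begin{verbatim}
./bioEM --Modelfile model.txt \
        --Particlesfile particles.txt \
        --Inputfile param.inp
\end{verbatim}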
\section{Model file}
The model is represented as spheres (or points) in 3-dimensional space, with corresponding radius and density.
%either in PDB or
% txt format.
Through the command line one has to provide the name of the file
that contains these items:
\vspace{0.5cm}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{ --Modelfile arg}}}}
\vspace{0.5cm}
Where \texttt{arg} is the filename. There are two types of model file formats that are read by BioEM:
\begin{itemize}
\item[--] {\it *.txt or *.dat file:}
Useful for all-atom representations or 3D voxel representations of density maps.
The format is "\%f \%f \%f \%f \%f":
the first three columns are the coordinates of the atoms or
voxels, the fourth column is the radius or voxel side length (\AA), and the
last column is the corresponding electron density
(Format: {\footnotesize \texttt{ x --- y --- z --- radius --- density }}).
\item[--] {\it pdb file:} BioEM reads only the C$_\alpha$ atom positions from standard {\it pdb} files, with
their corresponding residue type. Each residue is modeled as a sphere centered at the
C$_\alpha$, with the corresponding van der Waals radius and electron density (as in Ref.
\cite{CossioHummer_2013}). To read pdb files, the following command-line keyword is needed:
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{--ReadPDB}}}}
{\it Note:} the .pdb extension is not mandatory.
\end{itemize}
{\it Additional Feature:} It is possible to model the elements simply as points
(instead of spheres) by projecting only the density of each element. For this, add the keyword \texttt{"NO\_PROJECT\_RADIUS"},
in the input parameter file (see section \ref{Physparm}).
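
For illustration, a text model file with two spheres could look as follows. All numerical values are made up for this sketch and only show the five-column layout described above (three coordinates, the radius in \AA, and the density).
\begin{verbatim}
10.500 -3.200 7.800 2.000 6.000
-4.100 12.000 0.300 2.000 8.000
\end{verbatim}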
\section{Particle-image file}
The name of the experimental image file is needed as a command-line input in BioEM:
\vspace{0.5cm}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{ --Particlesfile arg}}}}
\vspace{0.5cm}
Where \texttt{arg} is the name of the file containing the experimental particle images.
Two format options are allowed for these files:
\begin{itemize}
\item[--] {\it *.txt or .dat file:} Data are formatted as
"\%8d\%8d\%16.8f" where the first two columns are
the pixel indexes, and the third column is the image intensity.
Multiple particles are read in the same file with the
separator \texttt{"PARTICLE" \& particle number}.
Pixel indexes should start at 0, and all pixels should be
included. It is recommended that particles
are normalized to zero average and unit standard deviation.
\item[--] {\it .mrc file:} BioEM also reads standard {\it .mrc} particle-image
files. To do so, an additional command-line keyword is needed:
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{ --ReadMRC }}}}
If reading multiple {\it mrc} files, the name of the file containing the {\it list}
of all the {\it mrc} files should be provided. An additional
keyword is also required:
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{--ReadMultipleMRC}}}}
{\it Example:}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{ --Particlesfile LIST --ReadMRC --ReadMultipleMRC}}}}
\texttt{LIST} is the name of the file containing the list of the multiple {\it mrc} files.
{\it Notes:} The {\it .mrc} extension is not mandatory.
By default, when reading {\it mrc} particles, the intensities are normalized to
zero average and unit standard deviation. Use keyword \texttt{NO\_MAP\_NORM} to unset this default.
\end{itemize}
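
The recommended normalization to zero average and unit standard deviation can be written explicitly. Here $I_{ij}$ is the intensity of pixel $(i,j)$ and $N$ is the number of pixels per dimension; both symbols are introduced only for this illustration.
\begin{equation*}
\mu = \frac{1}{N^2}\sum_{i,j} I_{ij}, \qquad
\sigma^2 = \frac{1}{N^2}\sum_{i,j} \left(I_{ij}-\mu\right)^2, \qquad
\tilde{I}_{ij} = \frac{I_{ij}-\mu}{\sigma}.
\end{equation*}
Whether BioEM internally uses the $1/N^2$ or the $1/(N^2-1)$ form of the variance is not specified here; for typical image sizes the difference is negligible.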
{\it Additional Features:} These are some useful command-line keywords for executing BioEM multiple times
with a large image set:
\vspace{0.5cm}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{--DumpMaps}}}}
\vspace{0.5cm}
writes out a file {\it maps.dump} containing the particle maps in binary format for a faster re-reading.
To read the dumped maps, use
\vspace{0.5cm}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{--LoadMapDump }}}}
\vspace{0.5cm}
See chapter \ref{tutorial} for a detailed description.
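
A possible two-step workflow with these keywords is sketched below. The filenames are hypothetical placeholders, and it is an assumption of this sketch that the mandatory inputs are still passed on the second call.
\begin{verbatim}
# first run: read the particles and write maps.dump
./bioEM --Modelfile model1.txt --Particlesfile particles.txt \
        --Inputfile param.inp --DumpMaps
# later run (e.g. with another model): read the maps from the dump
./bioEM --Modelfile model2.txt --Particlesfile particles.txt \
        --Inputfile param.inp --LoadMapDump
\end{verbatim}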
\section{Input-parameter file}
\label{Physparm}
BioEM has two sets of parameters.
One set describes the physical problem, like the number of pixels.
Another set describes the runtime configuration, which involves how to parallelize, whether to
use a GPU, and some other algorithmic settings. These parameters do not (or only slightly)
change the output, but have a large influence on the compute performance. They are passed to
BioEM via environment variables.
The two sets are treated differently, because the first set is related to the actual problem, while
the second set belongs to the compute node where the problem is processed.
In this section, we will only describe the input parameters related to the physical
problem. For a detailed description of the performance variables see chapter \ref{perfparm}.
In addition to the model and experimental images, BioEM needs
an input file describing the physical constraints and integration
limits of the algorithm:
\vspace{0.5cm}
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{ --Inputfile arg}}}}
\vspace{0.5cm}
where \texttt{arg} is the filename of the input file. This file contains specific
keywords that will describe the physical conditions, and constraints for the algorithm.
Below, we describe each keyword of the input parameter file in detail.
\subsection{Micrograph parameters}
Mandatory inputs for the description of the experimental map are
\begin{itemize}
\item \texttt{PIXEL\_SIZE (float)}
Pixel size in \AA\ of the experimental micrograph.
\item \texttt{NUMBER\_PIXELS (int)}
We assume a square particle-image. Here, \texttt{(int)} is the number of pixels in each
dimension, {\it e.g.} for a particle-image of $220 \times 220$ pixels, \texttt{(int) = 220}.
\end{itemize}
In a standard BioEM calculation, the integrations over the model orientations,
PSF parameters, and center displacement are performed numerically.
To do so, one needs to define the integration range
and grid spacing for each parameter. These quantities
depend on the experimental conditions (such as defocus ranges and preferred particle orientations),
and thus should be specified by the user.
\subsection{Integration of orientations}
\label{intor}
In BioEM, there are two ways of describing the orientation of the model
in 3D space: with Euler angles or with quaternions.
There are several possibilities for sampling the space of rotations, and
performing the integration.
\begin{itemize}
\item {\it Grid-sampling of the Euler Angles ($\alpha,\beta,\gamma$):}
Sampling of the full Euler angle space with uniform distribution of $\alpha \in [-\pi,\pi]$, $\cos(\beta) \in [-1,1]$
and $\gamma \in [-\pi,\pi]$. For this case one needs only to provide the number of grid points
in $\alpha$, and $\beta$ (because those in $\gamma$ will be the same as in $\alpha$).
The keywords in the parameter file are
\texttt{GRIDPOINTS\_ALPHA (int)}\\
\texttt{GRIDPOINTS\_BETA (int)}
{\it Note:} To sample uniformly it is recommended that
\texttt{GRIDPOINTS\_ALPHA$\sim$ 2*GRIDPOINTS\_BETA}.
\item {\it Grid-sampling of quaternions:}
With BioEM it is also possible to generate a grid in quaternion space for
the integration over the orientations. For this option, one should provide the
keyword:
\texttt{USE\_QUATERNIONS}
Additionally, to sample the quaternions in a hypercubic-grid, one should include
\texttt{GRIDPOINTS\_QUATERNION (int)}
where \texttt{(int)} is the grid spacing in each quaternion dimension $\in [-1,1]$.
\item {\it Read orientations from file:} With this feature the orientational grid points are not
calculated directly in the code, but are instead read from a file. This provides more flexibility in the sampling, and
integration. For this feature, an extra keyword in the command line is necessary:
\fbox{%
\parbox{12cm}{
{\footnotesize \texttt{--ReadOrientation arg}}}}
Where \texttt{arg} is the name of the file containing the orientations.
The first row of the file should have \texttt{(int)} equal to the total number of orientations.
The orientations can be described with the standard Euler angles,
or with quaternions.
\begin{itemize}
\item[--] The format for the Euler angle file should be "\%12.6f\%12.6f\%12.6f", ordered
as $\alpha,\beta,\gamma$, respectively.
\item[--] The format for the file containing the quaternions should be:
"\%12.6f\%12.6f\%12.6f\%12.6f".
One should recall that to use quaternions the extra keyword \texttt{USE\_QUATERNIONS} in the input parameter file is necessary.
\item[--] {\it Prior for orientations:} For this input, it is possible to
assign prior probabilities for each orientation. To do so, one should add an extra
column (of format "\%12.6f") that indicates the value of the prior probability for each orientation.
\end{itemize}
\end{itemize}
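
As a worked example for the grid-sampling of the Euler angles, and assuming that the grid is the full product of the points in $\alpha$, $\beta$, and $\gamma$, the number of sampled orientations is
\begin{equation*}
N_\mathrm{orient} = N_\alpha \, N_\beta \, N_\gamma = N_\alpha^2 \, N_\beta ,
\end{equation*}
since $N_\gamma = N_\alpha$. For instance, \texttt{GRIDPOINTS\_ALPHA 20} and \texttt{GRIDPOINTS\_BETA 10} (illustrative values that follow the recommendation above) would give $20^2 \times 10 = 4000$ orientations.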
{\bf Important Remark:} Not all of these possibilities sample the group
of rotations in 3D space ({\it SO3}) uniformly.
For uniform sampling of {\it SO3}, we recommend
the successive orthogonal images method from Ref. \cite{quaternionsMitchell}, with quaternions.
In the directory {\it Quaternions}, we provide lists of quaternions that sample {\it SO3} uniformly
using this method. Several lists with different total numbers of orientations are provided.
We highly recommend using these files instead of a trivial grid-sampling of the Euler angles or quaternions.
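
For the option of reading the orientations from a file, an Euler-angle list with two orientations could look as follows. The angle values are illustrative only; the first row holds the total number of orientations, and each following row holds $\alpha$, $\beta$, and $\gamma$ in three fields of width 12.
\begin{verbatim}
2
    0.836000    1.369000    2.640000
   -1.375000    0.400000    3.018000
\end{verbatim}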
\subsection{Integration of the CTF or PSF}
\subsubsection{Integration in Fourier space using the CTF and envelope:}
To take into account the interference effects in the
experiment, we multiply the ideal image $I_0$ from a model with the {\bf C}ontrast {\bf T}ransfer {\bf F}unction (CTF) in Fourier space,
$\mathrm{CTF}(s)\mathcal{F}(I_{0})$, where $s$ is the radial spatial frequency, and $\mathcal{F}$ indicates the Fourier transform.
An approximate expression for the CTF is $-A\cos(as^2/2)-\sqrt{1-A^2}\sin(as^2/2)$, where $a=2\pi \lambda \Delta f$, $\lambda$ is the
electron wavelength, and $\Delta f$ is the defocus. $A \in [0,1]$ is a parameter that establishes the contributions of the cosine and sine components.
In addition, the CTF is normally modulated by an envelope function (see Eqs. 2 and 3 in Ref. \cite{bioEM}), $\mathrm{Env}(s)=e^{-bs^2/2}$,
where the parameter $b=B/2$ is half of the B-factor\cite{PenzekXXX}. %We note that all this analysis is done in Fourier space.
To calculate the posterior probability, one must integrate numerically over the three parameters $a$, $b$, and $A$, or equivalently, the defocus $\Delta f$, B-factor $B$,
and amplitude $A$.
To do so, one should include in the input parameter file the start and end limits, and the number of grid points for each integration:
\textit{Parameter -- (start) -- (end) -- (gridpoints)}\\
\indent \texttt{CTF\_DEFOCUS (float) (float) (int)}\\
\indent \texttt{CTF\_B\_FACTOR (float) (float) (int)}\\
\indent \texttt{CTF\_AMPLITUDE (float) (float) (int)}\\
The defocus should be given in micrometers, and the B-factor in \AA$^2$.
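
Putting the expressions of this subsection together, and assuming that the envelope simply multiplies the CTF, the function applied to $\mathcal{F}(I_0)$ reads
\begin{equation*}
\mathrm{CTF}(s)\,\mathrm{Env}(s) = -e^{-bs^2/2}\left[A\cos\left(\frac{as^2}{2}\right)+\sqrt{1-A^2}\,\sin\left(\frac{as^2}{2}\right)\right].
\end{equation*}
As a numerical illustration with assumed values, a 300~kV microscope has $\lambda \approx 0.0197$~\AA, so a defocus of $\Delta f = 1~\mu\mathrm{m} = 10^4$~\AA\ gives $a = 2\pi\lambda\Delta f \approx 1.24\times 10^{3}$~\AA$^2$, and a B-factor of $B = 100$~\AA$^2$ gives $b = B/2 = 50$~\AA$^2$.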
\subsubsection{Integration in real space using the PSF:}
The {\bf P}oint {\bf S}pread {\bf F}unction (PSF), which is the real-space equivalent of the contrast transfer function,
is defined in the Supplementary Information of Ref. \cite{BioEM_server}.
Here, an ideal calculated image is convolved with the PSF to mimic the interference effects in the imaging experiment.
Similarly to the CTF, the PSF has three real-space variables: amplitude $A_R$, envelope $\chi$, and phase $\theta$ (following the same
notation as in Ref. \cite{BioEM_server}).
The amplitude defines the contribution of the PSF real-space sine or cosine parts, and is within $[0,1]$.
The envelope is the real-space equivalent of the B-factor, and should be given in
units of \AA$^{-2}$. The phase is also in real space, and should be given in units of \AA$^{-2}$.
The specific ranges of these parameters will depend on the imaging conditions of each experiment
(defocus, astigmatism, etc.).
To use the PSF (CTF analysis is default), one should include the keyword:
\texttt{USE\_PSF}
The information of the integration limits and grid points should be included in the input file as:
\textit{Parameter -- (start) -- (end) -- (gridpoints)}\\
\indent \texttt{PSF\_ENVELOPE (float) (float) (int)}\\
\indent \texttt{PSF\_PHASE (float) (float) (int)}\\
\indent \texttt{PSF\_AMPLITUDE (float) (float) (int)}\\
{\it Additional feature:} One can print out the corresponding CTF parameters that maximize the posterior,
with the keyword in the parameter file
\texttt{WRITE\_CTF\_PARAM}\\
{\it Note:} It is important to remark that one of these two procedures (either the CTF or PSF) is mandatory in the BioEM analysis.
\subsection{Integration of center displacement}
The integration of the particle center is done over a square and uniform grid.
The particle, along both directions, is translated from its center up to a maximum distance ({\it max displ.}).
Users should provide this maximum displacement and the integration grid spacing in units of pixels.
The keywords in the parameter file are:\\
\textit{Parameter - (max displ.) - (grid-space)}\\
\texttt{DISPLACE\_CENTER (int) (int)}\\
\vspace{0.2cm}
{\it Example:} If \texttt{[DISPLACE\_CENTER 10 2]}, the integration will be done
between $[x_c-10,x_c+10]$ along $x$ (where $c$ denotes center), and $[y_c-10,y_c+10]$ along $y$, with sampling every 2 pixels.
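
Assuming that both end points of the range are included, the number of sampled center positions in this example can be counted as
\begin{equation*}
n_x = n_y = \frac{2 \times 10}{2} + 1 = 11, \qquad n_x \, n_y = 121 ,
\end{equation*}
which corresponds to the offsets $-10, -8, \ldots, 8, 10$ pixels along each direction.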
\vspace{0.5cm}
{\bf Important Remark:}
The integration of the {\it normalization}, {\it offset} and {\it noise} parameters is carried out analytically. See
Supplementary Information of Ref. \cite{CossioHummer_2013}.
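
To summarize this section, a minimal input parameter file using the CTF could look as follows. All numerical values are illustrative assumptions that must be adapted to the experiment, and the one-keyword-per-line layout is also assumed here; the keywords are those described above.
\begin{verbatim}
PIXEL_SIZE 1.77
NUMBER_PIXELS 220
GRIDPOINTS_ALPHA 20
GRIDPOINTS_BETA 10
CTF_DEFOCUS 1.0 4.0 4
CTF_B_FACTOR 100.0 300.0 3
CTF_AMPLITUDE 0.1 0.5 3
DISPLACE_CENTER 10 2
\end{verbatim}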
\section{Additional features}
\label{crosscor}
Besides calculating the BioEM probability, one can use the code to extract other features
that can be useful in the analysis of the data.
These extra features are:
\begin{itemize}
\item {\it Model prior probability:} It is possible to
include the model's prior using the keyword in the parameter file: \\
\texttt{PRIOR\_MODEL (float)}\\
where \texttt{(float)} is the numerical value of the model's prior.
\item {\it Posterior probability as a function of orientations:}
One can write out the log-posterior as a function of each orientation.