#+TITLE:       Laboratory journal for BioEM project
#+AUTHOR:      Luka Stanisic, Markus Rampp, Pilar Cossio, ...
#+EMAIL:       luka.stanisic@mpcdf.mpg.de
#+LANGUAGE:    en
#+CATEGORY: WORK
#+STARTUP: logdrawer
#+TODO: TODO(t!) WAIT(w@) REMINDER(r!) | DONE(d!) MOVED(m!) POSTPONED(p@) CANCELLED(c@)
#+TAGS: EXPERIMENTS(E) CODE(C) READING(R) PRESENTATION(P) WRITING(W) MIXED(M) GENERAL(G) ADMINISTRATIVE(A)

* Goals
** TODO [#A] Smarter GPUWORKLOAD                                       :CODE:
   - [ ] Get rid of the guess value GPUWORKLOAD by implementing a simple autotuning approach (use the first few iterations to adapt the load balancing).
** TODO [#B] Dynamic (task-based) balancing of CPU/GPU workload
   :LOGBOOK:
   - State "TODO"       from              [2017-06-13 Tue 16:28]
   :END:
   - Probably with OpenMP 4, but this part still needs to be discussed
** TODO [#C] Intel KNL
   :LOGBOOK:
   - State "TODO"       from              [2017-06-23 Fri 15:52]
   :END:
   - Try to compile on Intel KNL
   - Investigate the performance
** Algorithmic improvements (new functionality)
* REMINDER Ideas
  :LOGBOOK:
  - State "REMINDER"   from              [2017-06-23 Fri 15:54]
  :END:
** Hyperthreading
   - Initial tests on draco showed no benefits from hyperthreading (on the contrary)
   - Markus has a feeling (and experience) that hyperthreading can't help much for codes such as BioEM
   - However, with some changes to the code, it might be interesting to investigate this option again (see the sketch below)
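
   A minimal sketch of such a re-test on draco (not verified; it reuses the hyperthreading settings that are commented out in the draco batch script further below, and assumes the build and input paths from the recipes in this labbook):

#+BEGIN_SRC 
# Sketch only: compare a run on physical cores against a run with hyperthreads on draco.
# Paths and BIOEM_DEBUG_BREAK value follow the recipes below; adjust as needed.
BUILD_DIR="$HOME/BioEM_project/build"
INPUT_DIR="$HOME/BioEM_project/inputs"
export FFTALGO=1 GPU=1 GPUDEVICE=-1 GPUWORKLOAD=100

# Physical cores only
OMP_NUM_THREADS=16 OMP_PLACES=cores BIOEM_DEBUG_BREAK=20 BIOEM_DEBUG_OUTPUT=0 \
  srun -n 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 \
  --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part --LoadMapDump

# With hyperthreads (settings taken from the commented lines in the draco batch script)
OMP_NUM_THREADS=32 OMP_PLACES=threads SLURM_HINT=multithread BIOEM_DEBUG_BREAK=20 BIOEM_DEBUG_OUTPUT=0 \
  srun -n 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 \
  --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part --LoadMapDump
#+END_SRC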
* Important codes
** Recipes for compilations and runs
*** [partially working] dvl machine
**** Machine description
     - Local machines at MPCDF for development
     - dvl01:
       + 20 CPU cores Intel machine (Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz)
       + 2 K40 GPUs
     - dvl02:
       + 20 CPU cores Intel machine (Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz)
       + 2 Nvidia K20Xm GPUs
**** Installation with gnu compilers

#+BEGIN_SRC 
# Loading necessary modules
module purge 
module load git/2.7.4
module load gcc/4.9
module load impi/5.1.3
module load cmake/3.6
module load boost/gcc/1.57
module load fftw/gcc/3.3.4
module load cuda/7.5

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation
cmake -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

**** Installation with Intel compilers

#+BEGIN_SRC 
# Loading necessary modules
module purge 
module load git/2.7.4
module load intel/15.0
module load impi/5.1.3
module load cmake/3.6
module load boost/intel/1.57
module load fftw/3.3.4
module load cuda/7.5

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation
cmake -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

**** Running Tutorial example

#+BEGIN_SRC 
# Loading necessary modules if needed (check installation gcc/Intel)

# Environment variables
export OMP_NUM_THREADS=5
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=100

# Paths
TUTORIAL_DIR="$afsdir/BioEM_fork/Tutorial_BioEM"
BUILD_DIR="$HOME/BioEM_project/build"

# Running tutorial test
mpiexec -n 2 $BUILD_DIR/bioEM --Inputfile $TUTORIAL_DIR/Param_Input --Modelfile $TUTORIAL_DIR/Model_Text --Particlesfile $TUTORIAL_DIR/Text_Image_Form
#+END_SRC

**** Running larger example from the paper

#+BEGIN_SRC 
# Loading necessary modules if needed (check installation gcc/Intel)

# Environment variables
export OMP_NUM_THREADS=10
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=100

# Paths
INPUT_DIR="$HOME/BioEM_project/inputs"
BUILD_DIR="$HOME/BioEM_project/build"

# Running only a few iterations, reading already generated maps (only if they are available in the same folder)
BIOEM_DEBUG_BREAK=20 BIOEM_DEBUG_OUTPUT=2 mpiexec -n 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part --LoadMapDump

# To check if the execution provided correct results
/afs/ipp-garching.mpg.de/u/sluka/BioEM_fork/Tutorial_BioEM/MODEL_COMPARISION/subtract_LogP.sh $INPUT_DIR/Output_Probabilities_20_ref Output_Probabilities | tail

# Running full example
BIOEM_DEBUG_OUTPUT=0 mpiexec -n 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part
#+END_SRC

**** Running multiple experiments example

#+BEGIN_SRC 
# Loading necessary modules if needed (check installation gcc/Intel)

# Environment variables
export OMP_NUM_THREADS=20
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=100

# Paths
INPUT_DIR="$HOME/BioEM_project/inputs"
BUILD_DIR="$HOME/BioEM_project/build"

echo "Time, Workload, GPUs, OMP_THREADS" > results.csv

# Running multiple experiments
res=$(BIOEM_DEBUG_BREAK=50 BIOEM_DEBUG_OUTPUT=0 BIOEM_AUTOTUNING=0 GPUWORKLOAD=100 OMP_NUM_THREADS=5 mpiexec -n 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part --LoadMapDump | tail -1 | cut -d ' ' -f 5)

# Write results into file
echo "$res, 100, 1, 5" >> results.csv

#+END_SRC
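
A minimal sketch (same assumptions and paths as above) of extending this into a scan over GPUWORKLOAD values and OpenMP thread counts, appending one CSV row per run:

#+BEGIN_SRC 
# Sketch only: scan GPUWORKLOAD and OMP_NUM_THREADS, reusing the CSV format above.
# Assumes the modules and the other environment variables from the block above are already set.
INPUT_DIR="$HOME/BioEM_project/inputs"
BUILD_DIR="$HOME/BioEM_project/build"
echo "Time, Workload, GPUs, OMP_THREADS" > results.csv

for workload in 60 70 80 90 100; do
  for threads in 5 10 20; do
    res=$(BIOEM_DEBUG_BREAK=50 BIOEM_DEBUG_OUTPUT=0 BIOEM_AUTOTUNING=0 \
          GPUWORKLOAD=$workload OMP_NUM_THREADS=$threads \
          mpiexec -n 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 \
          --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part \
          --LoadMapDump | tail -1 | cut -d ' ' -f 5)
    echo "$res, $workload, 1, $threads" >> results.csv
  done
done
#+END_SRC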

*** [error] miy (Minsky) machine - problems at CUDA runtime
**** Machine description
     - 3 identical machines with IBM Power8+ architecture
     - Local machines at MPCDF for development and testing Power8 capabilities
     - miy01-miy03:
       + 2 Power8+ CPUs (160 cores?)
       + 4 Nvidia P100 GPUs
**** Installation with gnu compilers

   - On the Minsky machine there are problems with gcc options that are not supported by the Power8 (IBM) architecture:
     + c++: error: unrecognized command line option ‘-march=native’
     + c++: error: unrecognized command line option ‘-mfpmath=sse’
     + c++: error: unrecognized command line option ‘-minline-all-stringops’
   - Without these options, it is possible to compile the project
   - Neither Git nor AFS are available

#+BEGIN_SRC 
# Loading necessary modules
module purge 
module load gcc/5.4
module load smpi/10.1
module load fftw/gcc-5.4
module load cuda/8.0

# Paths
SRC_DIR="$HOME/BioEM_project/BioEM"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation
cmake -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

**** Running Tutorial example

#+BEGIN_SRC 
# Loading necessary modules if needed (check installation gcc/Intel)

# Environment variables
export OMP_NUM_THREADS=5
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=100

# Paths
TUTORIAL_DIR="$HOME/BioEM_project/BioEM/Tutorial_BioEM"
BUILD_DIR="$HOME/BioEM_project/build"

# Running tutorial test
mpiexec -n 2 $BUILD_DIR/bioEM --Inputfile $TUTORIAL_DIR/Param_Input --Modelfile $TUTORIAL_DIR/Model_Text --Particlesfile $TUTORIAL_DIR/Text_Image_Form
#+END_SRC

**** Running larger example from the paper

#+BEGIN_SRC 
# Loading necessary modules if needed (check installation gcc/Intel)

# Environment variables
export OMP_NUM_THREADS=5
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=100

# Paths
INPUT_DIR="$HOME/BioEM_project/inputs"
BUILD_DIR="$HOME/BioEM_project/build"

# Running only a few iterations, reading already generated maps (only if they are available in the same folder)
BIOEM_DEBUG_BREAK=4 BIOEM_DEBUG_OUTPUT=2 mpiexec -n 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part --LoadMapDump

# Running full example
BIOEM_DEBUG_OUTPUT=0 mpiexec -n 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part
#+END_SRC

*** [working] hydra machine
**** Machine description (from http://www.mpcdf.mpg.de/services/computing/hydra/about-the-system)

     In October 2013, the main part of the iDataPlex HPC system HYDRA
     was installed at the MPCDF with Intel Ivy Bridge processors (~
     3500 nodes with 20 cores @ 2.8 GHz each), in addition to the
     Intel Sandy Bridge-EP processors (610 nodes with 16 cores @ 2.6
     GHz each) that are available since Sept 2012. 350 of the Ivy
     Bridge nodes are equipped with accelerator cards (338 nodes with
     2 NVIDIA K20X GPGPUs each, 12 nodes with 2 Intel Xeon Phi cards
     each).

     Most of the compute nodes have a main memory of 64 GB, 2 x 100 of
     the Ivy Bridge nodes and 20 of the Sandy Bridge nodes have a main
     memory of 128 GB.

     In total there are ~ 83.000 cores with a main memory of 280 TB
     and a peak performance of about 1.7 PetaFlop/s. The accelerator
     part of the HPC cluster has a peak performance of about 1
     PetaFlop/s.

     In addition to the compute nodes there are 8 login nodes and 26
     I/O nodes that serve the 5 PetaByte of disk storage.

     The common interconnect is a fast InfiniBand FDR14 network.

     The compute nodes are bundled into 5 domains, three domains
     (including the old Sandy Bridge processor domain) with 628 nodes
     each, one big domain with more than 1800 nodes, and one domain
     consisting of the 350 nodes with accelerator cards.  Within one
     domain, the InfiniBand network topology is a 'fat tree' topology
     for high efficient communication. The InfiniBand connection
     between the domains is much weaker, so batch jobs are restricted
     to a single domain.

**** [working] Installation with gnu compilers

#+BEGIN_SRC 
# Loading necessary modules
module purge 
module load git/2.8
module load cmake/3.5
module load gcc/5.4
module load mpi.ibm/1.4.0
module load fftw/gcc/3.3.4
module load boost/gcc/1.61
module load cuda/7.5

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# There are some strange problems with CUDA; however, this small hack with symbolic links seems to solve the issue
ln -s /u/system/SLES11/soft/cuda/libcuda/libcuda.so.352.79 libcuda.so.1
ln -s libcuda.so.1 libcuda.so

# Configuration and compilation (need to manually add CUDA_rt_LIBRARY)
cmake -DMPI_C_COMPILER=mpigcc -DMPI_CXX_COMPILER=mpicxx -DCMAKE_CXX_COMPILER=g++  -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=OFF -DCUDA_rt_LIBRARY=/u/system/SLES11/soft/cuda/7.5/lib64/libcudart.so -DCUDA_SDK_ROOT_DIR=/u/system/SLES11/soft/cuda/7.5/samples/common -DCUDA_CUDA_LIBRARY=$PWD/libcuda.so $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

**** [working] Installation with Intel compilers

#+BEGIN_SRC 
# Loading necessary modules
module purge
module load git/2.8
module load cmake/3.5
module load intel/16.0
module load mkl/11.3
module load mpi.ibm/1.4.0
module load fftw/3.3.4
module load boost/intel/1.61
module load cuda/7.5

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# There are some strange problems with CUDA; however, this small hack with symbolic links seems to solve the issue. The CMake setup probably needs to be improved
ln -s /u/system/SLES11/soft/cuda/libcuda/libcuda.so.352.79 libcuda.so.1
ln -s libcuda.so.1 libcuda.so

# Configuration and compilation (need to manually add CUDA_rt_LIBRARY and CUDA_SDK_ROOT_DIR)
cmake -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON -DCUDA_rt_LIBRARY=/u/system/SLES11/soft/cuda/7.5/lib64/libcudart.so -DCUDA_SDK_ROOT_DIR=/u/system/SLES11/soft/cuda/7.5/samples/common -DCUDA_CUDA_LIBRARY=$PWD/libcuda.so $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

**** [working] Running interactive Tutorial example (no GPUs)

#+BEGIN_SRC 
# Loading necessary modules if needed (check installation gcc/Intel)

# Environment variables
export OMP_NUM_THREADS=5
export OMP_PLACES=cores
export FFTALGO=1
export GPU=0

# Specific for interactive tests on hydra
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
echo localhost > host.list
echo localhost >> host.list
echo localhost >> host.list
echo localhost >> host.list

# Paths
TUTORIAL_DIR="$HOME/BioEM_project/tutorial"
BUILD_DIR="$HOME/BioEM_project/build"

# Running tutorial test
mpiexec -n 4 $BUILD_DIR/bioEM --Inputfile $TUTORIAL_DIR/Param_Input --Modelfile $TUTORIAL_DIR/Model_Text --Particlesfile $TUTORIAL_DIR/Text_Image_Form
#+END_SRC

**** [working] Running larger example from the paper
     
     - Sample scripts are available here: https://www.mpcdf.mpg.de/services/computing/hydra/sample-batch-script

#+BEGIN_SRC 
# @ shell=/bin/bash
#
# Sample script for LoadLeveler
#
# @ error   = job.err.$(jobid)
# @ output  = job.out.$(jobid)
# @ job_type = parallel
# @ requirements = (Feature=="gpu")
# @ node_usage= not_shared
# @ node = 4
# @ tasks_per_node = 2
# @ resources = ConsumableCpus(20)
# @ network.MPI = sn_all,not_shared,us
# @ wall_clock_limit = 01:00:00
# @ notification = complete
# @ notify_user = $(user)@rzg.mpg.de
# @ queue

# Loading necessary modules for Intel compilers
module purge
module load git/2.8
module load cmake/3.5
module load intel/16.0
module load mkl/11.3
module load mpi.ibm/1.4.0
module load fftw/3.3.4
module load boost/intel/1.61
module load cuda/7.5

# Environment variables
export OMP_NUM_THREADS=10
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1

# Environment variable to tune
export GPUWORKLOAD=90
export BIOEM_DEBUG_OUTPUT=0
export BIOEM_AUTOTUNING=0

# Paths
INPUT_DIR="$HOME/BioEM_project/inputs"
BUILD_DIR="$HOME/BioEM_project/build"
datafile="$HOME/BioEM_project/data/hydra_$BIOEM_AUTOTUNING.org"
> $datafile

# Environment capture
$HOME/BioEM_project/get_info.sh $datafile
echo "* BIOEM OUTPUT" >> $datafile

# Running full example
cd /ptmp/$USER/
mpiexec -n 8 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part >> $datafile
#+END_SRC

    - Later just do /llsubmit batch_script/
*** [working] new recipes for draco
**** Installation with Intel modules
#+BEGIN_SRC 
module purge
module load git/2.13
module load cmake/3.7
module load intel/17.0
module load impi/2017.3
module load fftw/3.3.6
module load boost/intel/1.64
module load cuda/8.0

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation (need to manually add CUDA_rt_LIBRARY and CUDA_SDK_ROOT_DIR)
cmake -DMPI_C_COMPILER=mpiicc -DMPI_CXX_COMPILER=mpiicpc -DCMAKE_CXX_COMPILER=icpc -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

**** Installation with gcc modules

Beforehand, CUDA_HOST_COMPILER needs to be manually set to gcc/5.4 (in CMakeLists.txt), as CUDA was compiled with that gcc version:
#+BEGIN_SRC 
# In CMakeLists.txt, set:
#set (CUDA_HOST_COMPILER /mpcdf/soft/SLES122/common/gcc/5.4.0/bin/gcc)
# instead of:
#set (CUDA_HOST_COMPILER gcc)

module purge
module load git/2.13
module load cmake/3.7
module load gcc/6.3
module load impi/2017.3
module load fftw/gcc/3.3.6
module load boost/gcc/1.64
module load cuda/8.0

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation
cmake -DMPI_C_COMPILER=mpigcc -DMPI_CXX_COMPILER=mpicxx -DCMAKE_CXX_COMPILER=g++ -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

*** [working] phys machine
**** Machine description
     No real documentation of the machine, but the characteristics are
     one of the following: 

     1. phys01 (E5-2680v3 + GTX1080) - suggested by Markus in his email
     2. HSW + GTX980 - suggested in the spreadsheet from Markus

**** [working] Installation with gnu compilers

#+BEGIN_SRC 
# Loading necessary modules
module purge 
module load git/2.7.4
module load cmake/3.5
module load gcc/4.9
module load impi/5.1.3
module load fftw/gcc/3.3.4
module load boost/gcc/1.57
module load cuda/7.5

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation (need to manually add CUDA_rt_LIBRARY)
cmake -DMPI_C_COMPILER=mpigcc -DMPI_CXX_COMPILER=mpicxx -DCMAKE_CXX_COMPILER=g++ -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

**** [working] Installation with Intel compilers

#+BEGIN_SRC 
# Loading necessary modules
module purge
module load git/2.7.4
module load cmake/3.5
module load intel/15.0
module load mkl/11.3
module load impi/5.1.3
module load fftw/3.3.4
module load boost/intel/1.60
module load cuda/7.5

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation (need to manually add CUDA_rt_LIBRARY and CUDA_SDK_ROOT_DIR)
cmake -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

**** [working] Running interactive Tutorial example (no GPUs)

#+BEGIN_SRC 
# Loading necessary modules if needed (check installation gcc/Intel)

# Environment variables
export OMP_NUM_THREADS=5
export OMP_PLACES=cores
export FFTALGO=1
export GPU=0

# Paths
TUTORIAL_DIR="$HOME/BioEM_project/tutorial"
BUILD_DIR="$HOME/BioEM_project/build"

# Running tutorial test
srun -n 4 $BUILD_DIR/bioEM --Inputfile $TUTORIAL_DIR/Param_Input --Modelfile $TUTORIAL_DIR/Model_Text --Particlesfile $TUTORIAL_DIR/Text_Image_Form
#+END_SRC

**** [working] Running larger example from the paper

#+BEGIN_SRC 
### run in /bin/bash
#$ -S /bin/bash
#$ -j y
#$ -N bioem-test
#$ -cwd
#$ -m e
#$ -M luka.stanisic@rzg.mpg.de
#$ -pe impi_hydra 96
#$ -l h_rt=01:00:00
#$ -P gpu
#$ -l use_gpus=1
##$ -R yes

# Loading necessary modules for Intel compilers
module purge
module load git/2.7.4
module load cmake/3.5
module load intel/15.0
module load mkl/11.3
module load impi/5.1.3
module load fftw/3.3.4
module load boost/intel/1.60
module load cuda/7.5

# Environment variables
export OMP_NUM_THREADS=12
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1

# Additional variable
export KMP_AFFINITY=compact,granularity=core,1

# Environment variable to tune
export GPUWORKLOAD=90
export BIOEM_DEBUG_OUTPUT=0
export BIOEM_AUTOTUNING=0

# Paths
INPUT_DIR="$HOME/BioEM_project/inputs"
BUILD_DIR="$HOME/BioEM_project/build"
datafile="$HOME/BioEM_project/data/phys_$BIOEM_AUTOTUNING.org"
> $datafile

# Environment capture
$HOME/BioEM_project/get_info.sh $datafile
echo "* BIOEM OUTPUT" >> $datafile

# Running full example
cd $HOME/jobs
mpiexec -perhost 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part >> $datafile
#+END_SRC

    - Later just do /qsub batch_script_phys/

**** Template of a batch script from Markus

#+BEGIN_SRC 
### run in /bin/bash
#$ -S /bin/bash
#$ -j y
#$ -N bioem-LARGE-GTX1080_01n_GPUWRK-scan
#$ -cwd
#$ -m e
#$ -M mjr@rzg.mpg.de
#$ -pe impi_hydra 24
#$ -l h_rt=06:00:00
#$ -l use_gpus=1
##$ -R yes

module load gcc/5.4 impi
module load cuda/8.0

BIOEM=../../build_gpu_1.0.1/bioEM

for gpuload in 60 65 70 75; do

WRKDIR=work_${JOB_NAME}.${JOB_ID}-GPU${gpuload}
mkdir $WRKDIR
cd $WRKDIR

ln -s ../2000FRH_Part
ln -s ../INPUT_FRH_Sep2016
ln -s ../Mod_X-ray_PDB

export OMP_NUM_THREADS=12
export KMP_AFFINITY=compact,granularity=core,1
export FFTALGO=1
export GPU=1
export GPUWORKLOAD=$gpuload
#export BIOEM_DEBUG_OUTPUT=1
export GPUDEVICE=-1


mpiexec -perhost 2 $BIOEM --Inputfile INPUT_FRH_Sep2016 --Modelfile Mod_X-ray_PDB --Particlesfile 2000FRH_Part  >& log_gpu.out
#mpiexec -perhost 2 nvprof -o ./bioemMPI.%q{PMI_RANK}.nvprof $BIOEM --Inputfile INPUT_FRH_Sep2016 --Modelfile Mod_X-ray_PDB --Particlesfile 2000FRH_Part  >& log_gpu.out


echo -n "GPUWORKLOAD=$gpuload : "
grep 'The code ran for' log_gpu.out | grep 'rank 0'

cd ..

done

#+END_SRC

** Deprecated
*** [partially working] draco machine
**** Machine description (from http://www.mpcdf.mpg.de/services/computing/draco/about-the-system)
     The extension cluster DRACO of the HPC system HYDRA was installed
     in May 2016 at the MPCDF with Intel 'Haswell' Xeon E5-2698
     processors (~ 880 nodes with 32 cores @ 2.3 GHz each). 106 of the
     nodes are equipped with accelerator cards (2 x PNY GTX980 GPUs
     each).

     Most of the compute nodes have a main memory of 128 GB, 4 nodes
     have 512 GB, 1 has 256 GB, 4 of the GPU nodes have a main memory
     of 256 GB.

     In January 2017, the DRACO cluster was expanded by 64 Intel
     'Broadwell' nodes that were purchased by the Fritz-Haber
     Institute. The 'Broadwell' nodes have 40 cores each and a main
     memory of 256 GB.

     In total there are 30.688 cores with a main memory of 128 TB and
     a peak performance of 1.12 PetaFlop/s.

     In addition to the compute nodes there are 4 login nodes and 8
     I/O nodes that serve the 1.5 PetaByte of disk storage.

     The common interconnect is a fast InfiniBand FDR14 network.

     The compute nodes and GPU nodes are bundled into 30 domains.
     Within one domain, the InfiniBand network topology is a 'fat
     tree' topology for high efficient communication. The InfiniBand
     connection between the domains is much weaker, so batch jobs are
     restricted to a single domain, that is 32 nodes.

**** [error: problems with boost] Installation with gnu compilers

#+BEGIN_SRC 
# Loading necessary modules
module purge 
module load git/2.8
module load cmake/3.5
module load gcc/4.9
module load impi/5.1.3
module load fftw/gcc/3.3.4
module load boost/gcc/1.61
module load cuda/7.5

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations previous 
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation (need to manually add CUDA_rt_LIBRARY)
cmake -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

**** [working] Installation with Intel compilers

#+BEGIN_SRC 
# Loading necessary modules
module purge
module load git/2.8
module load cmake/3.5
module load intel/16.0
module load mkl/11.3
module load impi/5.1.3
module load fftw/3.3.4
module load boost/intel/1.61
module load cuda/7.5

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations previous 
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation (need to manually add CUDA_rt_LIBRARY and CUDA_SDK_ROOT_DIR)
cmake -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

**** [working] Running interactive Tutorial example (no GPUs)

#+BEGIN_SRC 
# Loading necessary modules if needed (check installation gcc/Intel)

# Environment variables
export OMP_NUM_THREADS=5
export OMP_PLACES=cores
export FFTALGO=1
export GPU=0

# Paths
TUTORIAL_DIR="$HOME/BioEM_project/tutorial"
BUILD_DIR="$HOME/BioEM_project/build"

# Running tutorial test
srun -n 4 $BUILD_DIR/bioEM --Inputfile $TUTORIAL_DIR/Param_Input --Modelfile $TUTORIAL_DIR/Model_Text --Particlesfile $TUTORIAL_DIR/Text_Image_Form
#+END_SRC

**** [working] Running larger example from the paper

     - Sample scripts are available here: https://www.mpcdf.mpg.de/services/computing/draco/sample-batch-script

#+BEGIN_SRC 
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./tjob_hybrid.out.%j
#SBATCH -e ./tjob_hybrid.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J bioem_test
# Queue (Partition):
#SBATCH --partition=gpu
# Node feature:
#SBATCH --constraint="gpu"
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
# for OpenMP:
#SBATCH --cpus-per-task=16
#
#SBATCH --mail-type=none
#SBATCH --mail-user=<userid>@rzg.mpg.de
# Wall clock limit:
#SBATCH --time=01:00:00

# Loading necessary modules for Intel compilers
module purge
module load git/2.8
module load cmake/3.5
module load intel/16.0
module load mkl/11.3
module load impi/5.1.3
module load fftw/3.3.4
module load boost/intel/1.61
module load cuda/7.5

# Environment variables
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1

# For hyperthreading, which actually degrades performance
#export OMP_NUM_THREADS=32
#export OMP_PLACES=threads
#export SLURM_HINT=multithread 

# Environment variable to tune
export GPUWORKLOAD=90
export BIOEM_DEBUG_OUTPUT=0
export BIOEM_AUTOTUNING=0

# Paths
INPUT_DIR="$HOME/BioEM_project/inputs"
BUILD_DIR="$HOME/BioEM_project/build"
datafile="$HOME/BioEM_project/data/draco_$BIOEM_AUTOTUNING.org"
> $datafile

# Environment capture
$HOME/BioEM_project/get_info.sh $datafile
echo "* BIOEM OUTPUT" >> $datafile

# Running full example
cd /ptmp/$USER/
srun -n 8 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part >> $datafile
#+END_SRC

    - Later just do /sbatch batch_script_draco/
** General configuration remarks
   - Choose the number of compute nodes N
   - Check the number of cores per node C
   - Check the number of GPUs per node G

   - The total number of MPI ranks (mpiexec/srun -n parameter) is MPI_NODES=N*G
   - The number of MPI processes per node (tasks_per_node/--ntasks-per-node) is G
   - The number of cores per task and OMP_NUM_THREADS (?/--cpus-per-task) is C/G (see the worked example below)
     + If OMP_NUM_THREADS is larger than the available number of cores, the behavior is similar to using C/G

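   A small worked example of these rules, for the hydra GPU nodes (a sketch; C=20 cores and G=2 GPUs per node, matching the hydra batch script above):

#+BEGIN_SRC 
# Worked example: N=4 hydra GPU nodes, C=20 cores and G=2 GPUs per node
N=4   # number of nodes
C=20  # cores per node
G=2   # GPUs per node

MPI_NODES=$((N * G))        # total MPI ranks: 8  (mpiexec/srun -n)
TASKS_PER_NODE=$G           # MPI ranks per node: 2 (tasks_per_node / --ntasks-per-node)
OMP_THREADS=$((C / G))      # OpenMP threads per rank: 10 (--cpus-per-task)

echo "mpiexec -n $MPI_NODES with $TASKS_PER_NODE ranks/node and OMP_NUM_THREADS=$OMP_THREADS"
#+END_SRC
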
** General compilation remarks
   1. When compiling with Intel compilers, the following options are needed for the cmake command:

-DMPI_C_COMPILER=mpiicc -DMPI_CXX_COMPILER=mpiicpc -DCMAKE_CXX_COMPILER=icpc

   2. When compiling with GCC compilers, the CUDA_HOST_COMPILER variable needs to be manually set to gcc/5.4, even though the gcc/6.3 module is loaded (a combined example is given below). Additionally, the following options are needed for the cmake command:

-DMPI_C_COMPILER=mpigcc -DMPI_CXX_COMPILER=mpicxx -DCMAKE_CXX_COMPILER=g++ 
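
   A combined, hedged example of the full configure commands (the explicit CUDA_HOST_COMPILER path is taken from the draco gcc recipe above; passing it on the cmake command line instead of editing CMakeLists.txt is an assumption and may need to be adapted):

#+BEGIN_SRC 
# Intel compilers
cmake -DMPI_C_COMPILER=mpiicc -DMPI_CXX_COMPILER=mpiicpc -DCMAKE_CXX_COMPILER=icpc \
      -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/

# GCC compilers (CUDA_HOST_COMPILER pinned to gcc 5.4, as CUDA was built against it)
cmake -DMPI_C_COMPILER=mpigcc -DMPI_CXX_COMPILER=mpicxx -DCMAKE_CXX_COMPILER=g++ \
      -DCUDA_HOST_COMPILER=/mpcdf/soft/SLES122/common/gcc/5.4.0/bin/gcc \
      -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
#+END_SRC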
* Summaries, papers, reports 
=put here links to potential external repositories=

* Notes (meetings, audio, tasks, etc.)
** 2017-06-01
*** Suggestions from Markus (from email)

#+BEGIN_SRC 
# Run:
export OMP_NUM_THREADS=5
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=80
#run 80% of the work on the GPU

# 4 GPUs (1 per MPI-rank)
mpiexec -n 4 ../build/bioEM --Inputfile INPUT_FRH_Sep2016 --Modelfile Mod_X-ray_PDB --Particlesfile 2000FRH_Part

... try also other setups from the Tutorial

# target clusters:

hydra (E5-2680v2 + K20x)
draco (E5-2689v3 + GTX980)
miy01 (Power8 + P100)
phys01 (E5-2680v3 + GTX1080) ?
dvl01 (E5-2680v2 + K40) single test machine for interactive tests
#+END_SRC
     
     - First goal:: get rid of the guess value GPUWORKLOAD by implementing a simple autotuning approach (use the first few iterations to adapt the load balancing).
     - Let's discuss the perspectives and mid-term goals on June 12th.
       + algorithmic improvements (new functionality)
       + dynamic (task based ?) balancing of GPU/CPU workload
       + Xeon Phi?

** 2017-06-02
*** Trying installation/execution on hydra/minsky machines
   - Cloned the project (needed to use https as my username on ssh key is not the same)
   - Connecting to the machines
   - Installing BioEM with module loading on the miy machine (no OpenMP available, so it needs to be deactivated). Also, the default GCC CUDA flags seem to be incompatible, so they have to be deactivated manually
#+BEGIN_SRC 
ssh miy
git clone https://sluka@gitlab.mpcdf.mpg.de/MPIBP-Hummer/BioEM.git

# Loading necessary modules
module purge
module load gcc/5.4
module load smpi/10.1
module load fftw/gcc-5.4
module load cuda/8.0

cd ~/BioEM_project/BioEM
mkdir -p build
cd build
cmake -DCMAKE_INSTALL_PREFIX=$PWD -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON ..
make VERBOSE=1
#+END_SRC

   - To run on Minsky machines use:
#+BEGIN_SRC
# Loading necessary modules
module purge
module load gcc/5.4
module load smpi/10.1
module load fftw/gcc-5.4
module load cuda/8.0
 
# Environment variables
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=100

# Running
cd /home/sluka/BioEM_project/BioEM/Tutorial_BioEM

mpiexec -n 1 ../build/bioEM --Inputfile Param_Input --Modelfile Model_Text --Particlesfile Text_Image_Form > out.out
#+END_SRC

   - Running the bigger problem on Minsky machines, but with only 4 orientations:
#+BEGIN_SRC 
# Loading necessary modules
module purge
module load gcc/5.4
module load smpi/10.1
module load fftw/gcc-5.4
module load cuda/8.0

# Environment variables
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=100

# Running
cd /home/sluka/BioEM_project/inputs/

GPUWORKLOAD=100 BIOEM_DEBUG_BREAK=4 BIOEM_DEBUG_OUTPUT=2 mpiexec -n 4 ../BioEM/build/bioEM --Inputfile INPUT_FRH_Sep2016 --Modelfile Mod_X-ray_PDB --Particlesfile 2000FRH_Part
#+END_SRC

   - Installing BioEM with module loading on hydra machine
#+BEGIN_SRC 
ssh hydra
module load git
git clone https://sluka@gitlab.mpcdf.mpg.de/MPIBP-Hummer/BioEM.git

# Loading necessary modules (GCC)
module purge
module load cmake/3.5
module load gcc/4.9
module load mpi.ibm/1.4.0
module load fftw/gcc/3.3.4
module load boost/gcc/1.61
module load cuda/7.5

# Loading necessary modules (icc)
module purge
module load cmake/3.5
module load intel/16.0
module load mkl/11.3
module load mpi.ibm/1.4.0
module load fftw/3.3.4
module load boost/intel/1.61
module load cuda/7.5

cd BioEM
mkdir -p build
cd build
rm -rf *
rm ../CMakeCache.txt
cmake -DCMAKE_INSTALL_PREFIX=$PWD -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON ..
make VERBOSE=1

# Alternative
cmake -DCMAKE_INSTALL_PREFIX=$PWD -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON -DCUDA_rt_LIBRARY=/u/system/SLES11/soft/cuda/7.5/lib64/libcudart.so -DCUDA_SDK_ROOT_DIR=/u/system/SLES11/soft/cuda/7.5/samples/common -DCUDA_CUDA_LIBRARY=/u/system/SLES11/soft/cuda/7.5/lib64/stubs/libcuda.so ..


# Hack to overcome the issues
cd ..
ln -s /u/system/SLES11/soft/cuda/libcuda/libcuda.so.352.79 libcuda.so.1
ln -s libcuda.so.1 libcuda.so
#+END_SRC

    - Problems with "CUDA_SDK_ROOT_DIR" which needs to be set to "/u/system/SLES11/soft/cuda/7.5/samples/common"
    - Problems with "CUDA_rt_LIBRARY" which needs to be set to "/u/system/SLES11/soft/cuda/7.5/lib64/libcudart.so"
    - The most important problem is with "CUDA_CUDA_LIBRARY" which needs to be set to "/u/system/SLES11/soft/cuda/7.5/lib64/stubs/libcuda.so"
    - With that everything compiles, but not able to run "./bioEM" as there is the following error:
./bioEM: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
    - Indeed when doing "ldd bioEM" libcuda.so.1 is not found (as we gave it libcuda.so)
    - But there is no other libcuda.so on the machine
    - Trying some hacks to overcome the issue

#+BEGIN_SRC 
cd -
cd BioEM
ln -s /u/system/SLES11/soft/cuda/libcuda/libcuda.so.352.79 libcuda.so.1
ln -s libcuda.so.1 libcuda.so

cd build
rm -rf *
rm ../CMakeCache.txt
cmake -DCMAKE_INSTALL_PREFIX=$PWD -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON -DCUDA_rt_LIBRARY=/u/system/SLES11/soft/cuda/7.5/lib64/libcudart.so -DCUDA_SDK_ROOT_DIR=/u/system/SLES11/soft/cuda/7.5/samples/common -DCUDA_CUDA_LIBRARY=$PWD/../libcuda.so ..
make VERBOSE=1

./bioEM
#+END_SRC

    - Need to fix this issue in a proper way
    - [X] Run MPI execution of the BioEM with batch script
    - Test script on Hydra SandyBridge for running bioEM (more info for script configuration: https://www.mpcdf.mpg.de/services/computing/hydra/batch-system)
#+BEGIN_SRC 
# @ shell=/bin/bash
#
# Sample script for LoadLeveler
#
# @ error   = job.err.$(jobid)
# @ output  = job.out.$(jobid)
# @ job_type = parallel
# @ node_usage= not_shared
# @ node = 1
# @ tasks_per_node = 1
# @ resources = ConsumableCpus(16)
# @ network.MPI = sn_all,not_shared,us
# @ wall_clock_limit = 00:05:00
# @ notification = complete
# @ notify_user = $(user)@rzg.mpg.de
# @ queue

# run the program
cd /ptmp/$USER/
poe /u/$USER/BioEM/build/bioEM > prog.out
#+END_SRC

    - Full experimental script on Hydra IvyBridge with GPUs for running bioEM
#+BEGIN_SRC 
# @ shell=/bin/bash
#
# Sample script for LoadLeveler
#
# @ error   = job.err.$(jobid)
# @ output  = job.out.$(jobid)
# @ job_type = parallel
# @ requirements = (Feature=="gpu")
# @ node_usage= not_shared
# @ node = 1
# @ tasks_per_node = 1
# @ resources = ConsumableCpus(20)
# @ network.MPI = sn_all,not_shared,us
# @ wall_clock_limit = 01:00:00
# @ notification = complete
# @ notify_user = $(user)@rzg.mpg.de
# @ queue

# Loading necessary modules (icc)
module purge
module load cmake/3.5
module load intel/16.0
module load mkl/11.3
module load mpi.ibm/1.4.0
module load fftw/3.3.4
module load boost/intel/1.61
module load cuda/7.5

# Environment variables
export OMP_NUM_THREADS=5
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=80

# run the program
cd /ptmp/$USER/
poe /u/$USER/BioEM/build/bioEM > prog.out
#+END_SRC
 
    - To submit job use /llsubmit script_name/
    - To check the status /llq -u sluka/
    - To cancel job /llcancel job_id/
    - To see available batch classes /llclass/
    - To run in interactive mode, need to create file /host.list/ with the word /localhost/ in each line for every MPI node. However, it seems that it is not possible to use GPUs in that case
    - To run interactive mode use
#+BEGIN_SRC 
# Loading necessary modules (icc)
module purge
module load cmake/3.5
module load intel/16.0
module load mkl/11.3
module load mpi.ibm/1.4.0
module load fftw/3.3.4
module load boost/intel/1.61
module load cuda/7.5

export OMP_NUM_THREADS=5
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS

cd ~/BioEM/Tutorial_BioEM/

mpiexec -n 4 ../build/bioEM --Inputfile Param_Input --Modelfile Model_Text --Particlesfile Text_Image_Form > out2.out

#+END_SRC

    - To disable debug output use "export BIOEM_DEBUG_OUTPUT=0", it is enabled by default (value 2)
*** Machines
**** Minsky (miy)
     - 3 Miy machines
     - No batch system so use 'gpschedule'
     - 2 Power8+ CPUs and 4 Nvidia P100 GPUs
     - The GPUs are operated under MPS, so you can allocate multiple processes (MPI-ranks) on a single GPU.
     - Home and software directories are cross-mounted via NFS. There is a module system for additional software (like IBM's Spectrum MPI: module smpi/10.1)
     - The IBM software stack is not perfect yet. For example, ESSL and pESSL are missing a number of LAPACK and ScaLAPACK routines, so you may have to link additional libraries. We are trying to provide a complementary LAPACK (from netlib).
     - Connect directly from desktop:
       + ssh miy01.bc.rzg.mpg.de
       + ssh miy02.bc.rzg.mpg.de
       + ssh miy03.bc.rzg.mpg.de
**** KNL (knl)
     - The 4-node KNL test system is currently operated in cache/quadrant mode. Mode(s) in principle can be changed on request, but I'd like to ask you not to request changes too frequently since this requires manual reboots (twice actually) and some internal coordination overhead.
     - There are NFS-shared home directories but no high-performance filesystem
     - Connect directly from desktop:
       + ssh knl1.bc.rzg.mpg.de
       + ssh knl2.bc.rzg.mpg.de
       + ssh knl3.bc.rzg.mpg.de
       + ssh knl4.bc.rzg.mpg.de
     - You can start MPI jobs simply by
#+BEGIN_SRC 
module load impi
mpiexec -n 256 -perhost 64 -hosts knl1,knl2,knl3,knl4 ./a.out
# (example running on all 4 nodes)
#+END_SRC

     - The MPI authentication is via ssh keys, so $HOME/.ssh/authorized_keys has
to contain your public key.

**** Hydra (hydra)
     - Full machine info: https://www.mpcdf.mpg.de/services/computing/hydra/configuration
     
** 2017-06-06
*** DONE Running on hydra machine [3/3]
    :LOGBOOK:
    - State "DONE"       from              [2017-06-09 Fri 17:15]
    :END:
   - Managed to run on hydra machine
   - [X] Running Tutorial on Hydra machine
   - [X] Running configuration proposed by Markus on Hydra machine
   - [X] Don't have access to dvl01 and gp01 machines, need to ask for it
** 2017-06-07
*** Understanding the code
   - Analyzing BioEM code:
     + Projections and Convolutions are always computed by CPUs (OMP)
     + Only "Comparison" is done either on CPU or on GPU
     + Comparison is done on GPU [0,maxRef], and the rest on CPUs (maxRef,END] by calling a function from CUDA code
     + If maxRef changes from one iteration to another, it should not be a problem
     + maxRef is private, but it shouldn't be a problem to add a setter and getter or to declare it public
     + On the other hand maxRef is used for the initialization of CUDA, so maybe some more important changes are needed after all
   - GPUWORKLOAD<100 works only when OpenMP is enabled on the machine (a quick consistency check for the CPU/GPU split is sketched below)
   - There are problems using MPI+CUDA without OpenMP
   - When running with gdb without MPI and OMP, it actually works. However, this is a heisenbug, probably due to address space randomization. Need to add this in gdb "set disable-randomization off" to reproduce the problem
   - Not sure how "maps" works in /include/map.h/
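
   A quick consistency check for the CPU/GPU split (a sketch only; paths as in the recipes above, and it assumes the map dump and the subtract_LogP.sh comparison script from the Tutorial are available):

#+BEGIN_SRC 
# Sketch only: run the same truncated problem with two different GPU/CPU splits and
# compare the resulting probabilities; the LogP differences should be numerically negligible.
INPUT_DIR="$HOME/BioEM_project/inputs"
BUILD_DIR="$HOME/BioEM_project/build"
export OMP_NUM_THREADS=5 OMP_PLACES=cores FFTALGO=1 GPU=1 GPUDEVICE=-1

for workload in 100 50; do
  GPUWORKLOAD=$workload BIOEM_DEBUG_BREAK=20 BIOEM_DEBUG_OUTPUT=0 \
    mpiexec -n 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 \
    --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part --LoadMapDump
  mv Output_Probabilities Output_Probabilities_$workload
done

$afsdir/BioEM_fork/Tutorial_BioEM/MODEL_COMPARISION/subtract_LogP.sh \
  Output_Probabilities_100 Output_Probabilities_50 | tail
#+END_SRC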
** 2017-06-08
*** Trying miy machine
   - On Minsky machine problems with gcc options that are not supported by Power8 (IBM) machine:
     + c++: error: unrecognized command line option ‘-march=native’
     + c++: error: unrecognized command line option ‘-mfpmath=sse’
     + c++: error: unrecognized command line option ‘-minline-all-stringops’
   - Without it, it is possible to compile the project
   - However, it is not possible to run with GPUs as there are some problem with allocations
   
** 2017-06-09
*** Trying dvl machines
   - Managed to successfully run an example from Markus on a hydra node with GPUs (with workload 80%)
   - Implement simple workload autotuning, and try it during the weekend on hydra
   - Talk with Markus to understand completely the output. No communication between MPI ranks? 
   - Discuss possible WORKLOAD implementations
   - There are independent dvl01 and dvl02 machines (one with K40 and one with K20)
   - Installation on dvl01 machine
#+BEGIN_SRC 
# Loading necessary modules
module purge
module load intel/15.0
module load impi/5.1.3
module load cmake/3.6
module load boost/intel/1.57
module load fftw/3.3.4
module load cuda/7.5

cd ~/BioEM_project/BioEM
mkdir -p build
cd build
rm -rf *
cmake -DCMAKE_INSTALL_PREFIX=$PWD -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON ..
make VERBOSE=1
#+END_SRC
    - Run real example on dvl01 machine
#+BEGIN_SRC 
# Loading necessary modules
module purge
module load intel/15.0
module load impi/5.1.3
module load cmake/3.6
module load boost/intel/1.57
module load fftw/3.3.4
module load cuda/7.5

# Environment variables
export OMP_NUM_THREADS=5
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=100

# Running tutorial test
cd ~/BioEM_project/BioEM/Tutorial_BioEM
GPU=1 mpiexec -n 2 ../build/bioEM --Inputfile Param_Input --Modelfile Model_Text --Particlesfile Text_Image_Form

# Running full example
cd ~/BioEM_project/input_data/test1
GPUWORKLOAD=100 BIOEM_DEBUG_BREAK=4 BIOEM_DEBUG_OUTPUT=2 mpiexec -n 2 ~/BioEM_project/BioEM/build/bioEM --Inputfile INPUT_FRH_Sep2016 --Modelfile Mod_X-ray_PDB --Particlesfile 2000FRH_Part

#+END_SRC
     - Installation with GNU compilers
#+BEGIN_SRC 
# Loading necessary modules
module purge
module load gcc/4.9
module load impi/5.1.3
module load cmake/3.6
module load boost/gcc/1.57
module load fftw/gcc/3.3.4
module load cuda/7.5

cd ~/BioEM_project/BioEM
mkdir -p build
cd build
rm -rf *
cmake -DCMAKE_INSTALL_PREFIX=$PWD -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON ..
make VERBOSE=1
#+END_SRC
    - Run real example on dvl01 machine
#+BEGIN_SRC 
# Loading necessary modules
module purge
module load gcc/4.9
module load impi/5.1.3
module load cmake/3.6
module load boost/gcc/1.57
module load fftw/gcc/3.3.4
module load cuda/7.5

# Environment variables
export OMP_NUM_THREADS=5
export OMP_PLACES=cores
export FFTALGO=1
export GPU=1
export GPUDEVICE=-1
export GPUWORKLOAD=100

# Running tutorial test
cd ~/BioEM_project/BioEM/Tutorial_BioEM
GPU=1 mpiexec -n 2 ../build/bioEM --Inputfile Param_Input --Modelfile Model_Text --Particlesfile Text_Image_Form

# Running full example
cd ~/BioEM_project/input_data/test1
GPUWORKLOAD=100 BIOEM_DEBUG_BREAK=4 BIOEM_DEBUG_OUTPUT=2 mpiexec -n 2 ~/BioEM_project/BioEM/build/bioEM --Inputfile INPUT_FRH_Sep2016 --Modelfile Mod_X-ray_PDB --Particlesfile 2000FRH_Part

#+END_SRC

     - When running with cuda 7.5 and cuda 8.0, similar errors appear
#+BEGIN_SRC 
CUDA Error 46 / all CUDA-capable devices are busy or unavailable (/home/sluka/BioEM_project/BioEM/bioem_cuda.cu: 396)
#+END_SRC
     - The problem seems to come from the first malloc
     - If running without MPI, the problem is with the CUDA call to cuCtxDestroy(tmpContext) in the initialization

     - Installation on dvl02 is the same as on dvl01. Unfortunately, the errors are the same as well

     - One other error that is sometimes observed is
#+BEGIN_SRC 
bioEM: tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed.
#+END_SRC

     - Depending on the number of GPUs used, the error may change, but it still never works
     - Sometimes there is even a livelock (which is actually again a problem in the CUDA initialization, most probably at cuCtxDestroy)

*** MOVED Things to discuss with Markus [1/7]:
    :LOGBOOK:
    - State "MOVED"      from "TODO"       [2017-06-12 Mon 15:59]
    - State "TODO"       from "TODO"       [2017-06-12 Mon 14:54]
    - State "TODO"       from "TODO"       [2017-06-12 Mon 14:51]
    - State "TODO"       from "TODO"       [2017-06-12 Mon 09:50]
    :END:
    - Problems with installation on hydra, but managed to hack it. The execution seems to be perfectly fine, even with GPUs, however not possible to have interactive nodes with GPUs, while waiting for batch jobs is too long
    - [X] Problems with GPUs when running on miy and dvl nodes (installation seems to be fine)
      + The miy machine sometimes has problems with GPUs; sometimes it just hangs without executing anything
      + The dvl machine should work, but Markus didn't check it recently; he will test it and tell me about the output (and input configurations)
    - [ ] Discuss the output file, input parameters:
      + Orientations
      + Projections
      + Convolutions
      + Comparisons
    - [ ] Discuss different possibilities for GPUWORKLOAD implementations (simple, smarter, very advanced, full auto-tuning framework study, etc.)
      + Also, at which loop level to put it: projection or convolution/comparison (probably comparison)
    - [ ] From the paper: parallelize over projections via OpenMP instead of MPI using mutexes
    - [ ] Discuss dependencies between different parts of the algorithm (images independent? rotations?)
    - [-] Organization and development conventions
      + [X] Common/personal src, labbook, analysis scripts, data, etc.=Mostly up to me
      + [ ] Usage of BioEM gitlab, branches, large number of commits.=Mostly up to me, but need to figure out for the labbook
      + [ ] Storage of large data files (results)
    - [ ] Briefly discuss future plans
      + Audio meeting with people from Frankfurt
      + Typical timeline and lifespan for such a project (few to 6 months, but can launch new projects)
      + Algorithmic improvements
      + Task-based implementation
      + Using KNL
** 2017-06-12
*** Prototype implementation of Autotuning
    - Added some simple autotuning code
      + It simply starts from 100% and lowers the GPU share by 10% as long as performance improves, stopping once it gets worse (see the sketch below)
      + Much more advanced features are possible
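
    For reference, a shell-level emulation of the same idea (a sketch only; the actual prototype lives inside the code behind BIOEM_AUTOTUNING, this just reproduces the search externally with the paths from the recipes above):

#+BEGIN_SRC 
# Sketch only: start at GPUWORKLOAD=100 and lower it in steps of 10 while the measured
# time keeps improving; stop at the first regression and keep the previous value.
INPUT_DIR="$HOME/BioEM_project/inputs"
BUILD_DIR="$HOME/BioEM_project/build"
export OMP_NUM_THREADS=10 OMP_PLACES=cores FFTALGO=1 GPU=1 GPUDEVICE=-1

best_time=""
best_workload=100
for workload in 100 90 80 70 60 50; do
  t=$(GPUWORKLOAD=$workload BIOEM_DEBUG_BREAK=50 BIOEM_DEBUG_OUTPUT=0 BIOEM_AUTOTUNING=0 \
      mpiexec -n 2 $BUILD_DIR/bioEM --Inputfile $INPUT_DIR/INPUT_FRH_Sep2016 \
      --Modelfile $INPUT_DIR/Mod_X-ray_PDB --Particlesfile $INPUT_DIR/2000FRH_Part \
      --LoadMapDump | tail -1 | cut -d ' ' -f 5)
  echo "GPUWORKLOAD=$workload -> $t s"
  if [ -n "$best_time" ] && [ "$(echo "$t > $best_time" | bc)" = "1" ]; then
    break  # performance got worse, keep the previous value
  fi
  best_time=$t
  best_workload=$workload
done
echo "Best GPUWORKLOAD found: $best_workload"
#+END_SRC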
** 2017-06-13
*** Synthesized notes from previous email discussions
    1. Automatic autotuning for load balancing, user friendliness, and
       general optimizations
       1) including a possibly "multiscaling/hierarchical" approach in
          which the parameter search (in particular for orientations)
          uses down-sampled images initially, and zooms in onto the
          most relevant orientations (and CTF parameters etc) to speed
          up the search and (!) achieve higher accuracy
       2) thinking about smarter ways of handling
          rotations/reprojections/cross-correlation calculations
    2. Making the code useful also in standard EM reconstruction
       1) by determining precise orientations of individual
          model-particle combinations (zooming in in orientation
          space) that can then improve the class-averaging or even
          enable direct reconstruction from individual particles
       2) by dealing with more complex CTFs (possibly read in; this is
          an issue Pilar has been running into)
    3. Move toward refinement as an alternative to reconstruction
       1) by using hybrid models that combine high-resolution
          structures (e.g., from X-ray) with 3D reconstructed maps for
          the "rest", with overlaps "masked out";
       2) and then optimize the orientation/position of the high-res
          components, much the way now done by fitting into the
          reconstructed map, but here by fitting directly against
          particles
       3) evaluate gradients (with respect to [collective]
          positions/orientations of part of the model) of the log
          posterior, such that models can be optimized efficiently.
    4. Discussion regarding multiscaling and changing the loops
        + This was a point that we discussed previously, related to the
          paragraph on using OpenMP for the orientations. Currently the
          code is written and optimized to handle an all-orientations to
          all-projections comparison. But if we want to do a
          "multiscaling" approach, which in other words means selecting
          a few orientations for each image and doing the analysis only
          over those, then as the code stands, by "zooming in" we are
          losing GPU parallelization power
       + *Gerhard*: The multiscaling I have in mind is much much
         simpler: instead of working with a 256x256 image do one round
         of 64x64, then 128x128, then 256x256. At every round use
         orientations to zoom in on peak(s) of the log-posterior
         (making sure the peaks are covered; ie maxima are within
         domains). In this way, you would use the code as before, but
         with quaternions that are optimized between steps, such that
         the posterior integration is very accurate in orientation
         space.
       + *Pilar*: I think it is important to clarify that if there is no
         all-orientations to all-images comparison (i.e., now each
         particle has a specified set of orientations) then the
          parallelization over the GPU is not possible (it's lost! I
          can't use the GPU), regardless of whether we use many or few
          pixels. Currently, the code uses the GPU to do a very simple
         calculation: 1-"calculated image" vs ALL experimental
         particles. If we "zoom in" and say I want to explore for each
         particle just around a given orientation, then I cannot do
         the "1-calculated image vs all-experimental particles
         analysis", because now each calculated image will be
         different for each particle. Basically, what I want to say is
         that if we want to use the GPU in Round 2: >use orientations
         to zoom in on peak(s) of the log-posterior we need to modify
         the BioEM code. I think this is important because the GPU is
         useful & quick, but it will require a fair amount of work to
         change the loop so the orientations go to the GPU instead of
         the particle images (e.g. 1-particle to "all of its specific
         calculated images" instead of 1-calculated image to all
         particles.). The problem here will be the dependencies. If we
         want the GPU to do the loop over the orientations, we could
          make the GPU do everything: rotate, project, CTF &
          cross-correlation (but we would have to write it all in
          CUDA...). Alternatively, we could make a queue in which the
          CPU prepares the "stack" of calculated images for each
          particle, and then make the GPU only do a 1-particle to "all
          of its calculated images" comparison. I don't think this is
         trivial because it implies changing quite a bit the
         architecture of the code. I think this is a full
         project. Nevertheless, we should try to include the GPU in
         the subsequent rounds of "zoom-in" optimization.
       + *Gerhard*: in model space, one can also use hierarchical
         representations, doing the coarse-pixel search with a
         residue-based model, and only the fine pixel search for near
         optimal orientations with all atom
        + I honestly do not know how easy it will be to change the
          parallelization loop, mostly because there is a clear
          dependency orientation -> projection -> CTF ->
          comparison. This will require a fair amount of work, but it is
          definitely our next computational goal.
       + *Gerhard*: From your most recent results it seems that we still
          sample orientations only very roughly. If I assume your 13000
          quaternions to be evenly distributed, the average angle
         spacing is ~(8*pi*pi/13000.)**(1./3.)*180./pi = 10.5
         degrees. This is a lot, in my opinion, in particular for
         large/elongated molecules. Over 5 nm, a 10 degree rotation
         causes a sin(10deg)*5nm ~1 nm displacement.
        + *Pilar*: I agree with your overall statement (we do need to
          zoom over the orientations). However, I disagree with your
          calculation. I think you should take the square root and not
          1./3. It's a solid angle, so there are only 2 angles
          (e.g. theta & phi), not 3. I'm attaching an example of how
         4608 orientations look on the sphere. Intuitively, if I look
         at 13000 orientations on the sphere they do not seem
         separated by 10 deg along each angle. The average solid angle
         per orientation for 4608 is 28 deg^2 ~ 5 deg x 5 deg (along
          each angle). Nonetheless I agree that a refined grid over the
          correct orientation is important.
    5. Discussion regarding reconstructions
       + For this we would need to write a completely new subroutine,
         which takes the best orientations obtained from BioEM and
          generates a 3D map using the individual images. This is what
          the "normal" EM reconstruction algorithms do, such as Relion
          & EMAN & XMIPP. So maybe it might be better to do some type
          of BioEM patch and use one of the existing programs. I have
          met with the developers of XMIPP (Jose Carazo from Spain,
         where Scheres/Relion started) and they said that they would
         be happy to collaborate & implement BioEM in XMIPP. So maybe
         we could think of doing a collaborative project with them.
       + *Gerhard*: I agree that we do not want to do reconstructions
         ourselves. So this should be done by using existing code,
         possibly in collaboration. But the power of BioEM is that it
         can give accurate distributions of the possible orientations
         of a particle. This is invaluable for quality
         reconstructions. Same argument as above, but now in reverse:
         if I have a 3 degree error in my orientation, the position at
         5 nm shifts by 2.6 Å. This is likely a factor why the
         resolution of reconstructions is usually lower at the surface
         than in the core. As a first step it would be important to
         assess if Relion does this better than BioEM.
       + *Pilar*: I was talking to Matteo and Janet about this and they
         are a bit concerned about this: "...However, I don't think
         the EM community would be so thrilled about improving a map
         based on a model, where model bias could be an issue." In the
          last years there was this polemic about the model bias,
          e.g. reconstructing Einstein's face from noise particles,
          just by giving the "correct model orientations". This is just
          to keep in mind what we will face. I still think we should
          do it.
    6. Discussion regarding CTF sampling
       + 1.2) by dealing with more complex CTFs (possibly read in;
         this is an issue Pilar has been running into)
       + If this is just reading in CTF parameters for each particle,
         I think this might not be such a problem but it would have to
         be coordinated in the same way as the multiscale approach
         (point 0.1).
       + *Gerhard*: Yes, I also see this mainly as a code flexibility issue. In
         practice, it may be better not to reinvent the wheel and
         simply use the CTFs determined in the traditional way,
         instead of sampling over CTF parameters. Saves lots of CPU
         time (factor 10-100, since it is multiplicative?),
       + *Pilar*: Yes, we would save this much.
       + *Gerhard*: and improves the accuracy (because we cannot afford sampling
         the CTF very accurately, and we seem to have problems getting
         this "converged" in cases where the model is uncertain
         (ATPase)). We could also fit "traditional CTFs" and then fix
         the parameters, or sample a narrow window around them.
       + *Pilar*: I think sampling is still important (maybe over a
         reduced window), as I mentioned, if the CTF parameters are
         wrong, the best orientation-estimate will probably fail.
    7. Discussion regarding reconstruction
       + This is definitely the way to go! It also needs development
         from the methods/mathematical side. My preference would be to
         mix BioEM with complexes for refinement but this is still to
         think.....
       + *Gerhard*: Yes, that is one way to go on the model side (rigid body +
         flex linkers). As I see it, there are two simple routes:
         non-gradient-based optimizers and gradient-based
         optimizers. To work with the former will require very
          efficient posterior calculations, e.g. by focusing only on
         regions in the orientation/parameter space that are relevant,
         with these regions adaptively adjusted as the optimization
         proceeds. Again, from a programming side, this may require a
         bit of rethinking to get full-blast pipelines. To calculate
         gradients is hard, but not impossible. We have some basic
         results. Gradients would help a lot in the end game of
         refinement.
       + *Pilar*: We should try both routes.

*** Audio meeting with Pilar Cossio
**** Preparation
***** Technical
      - Propose https://meet.jit.si/bioem
      - Or create an Inria pad
      - Discuss labbook, gitlab and similar source code related
        questions
      - Where to keep the data, scripts, etc.
***** DONE Current status and things to discuss [6/6]
      :LOGBOOK:
      - State "DONE"       from "TODO"       [2017-06-13 Tue 16:22]
      - State "TODO"       from "TODO"       [2017-06-13 Tue 16:22]
      - State "TODO"       from "TODO"       [2017-06-13 Tue 16:21]
      - State "TODO"       from "TODO"       [2017-06-13 Tue 16:20]
      - State "TODO"       from "TODO"       [2017-06-13 Tue 16:20]
      - State "TODO"       from "TODO"       [2017-06-13 Tue 16:20]
      - State "TODO"       from              [2017-06-13 Tue 16:20]
      :END:
      - Problems with installation on hydra, but managed to work
        around them. The execution seems to be perfectly fine, even with
        GPUs; however, it is not possible to get interactive nodes with
        GPUs, and waiting for batch jobs takes too long
      - [X] Problems with GPUs when running on miy and dvl nodes
        (installation seems to be fine)
        + the miy machine sometimes has problems with GPUs; sometimes it
          just hangs without executing anything
        + the dvl machine should work, but Markus didn't check it
          recently; he will test it and tell Luka about the output
          (and input configurations)
      - [X] Possibly discuss the output file, input parameters:.=This is quite clear from the paper
        + Orientations
        + Projections
        + Convolutions
        + Comparisons
      - [X] From the paper: parallelize over projections via OpenMP
        instead of MPI using mutexes
      - [X] Discuss dependencies between different parts of the
        algorithm (images independent? rotations?).=We will see how this evolves with new algorithmic changes
      - [X] What type of results are we looking at, just overall
        execution time. If there are large output files envisioned,
        any conventions for storing it.=For now yes, just make sure results are correct
      - [X] What type of autotuning is needed, how advanced
        + Also decide at which for-loop level to apply it: projection
          or convolution/comparison (probably comparison)

***** Discuss goals
      - Smarter GPUWORKLOAD
      - Algorithmic improvements (new functionality)
      - ?dynamic (task-based) balancing of CPU/GPU workload
      - ?Intel KNL

**** Notes from the meeting :ATTACH:
     :PROPERTIES:
     :Attachments: OptimizationBioEM.pdf
     :ID:       3fbbc978-d881-4b53-beab-4d33e9242694
     :END:
     - Pad (which was not used at the end): https://pad.inria.fr/p/wicgnYxP3nWZsX3n
     - Pilar sent slides (attached) on possible optimizations inside an MPI process
     - Goals:
       1. Autotuning and see what kind of results it produces
       2. Possibly try task-based
       3. Try changing parallelization schemes (putting OpenMP on the highest level, and MPI somewhere lower)
     - We would definitely need in future the current implementation (possibly more optimized with autotuning or even task-based)
     - However, it would be nice to have another implementation which is much more adaptive and does not exhaustively check every orientation, but "zooms in" only on the most interesting ones
       + This reduction of the orientations to inspect would reduce the overall amount of work; however, it would then be hard to give enough work to CUDA
       + Currently most of the work is done on CPUs, and only comparison on GPUs or CPUs
       + First step would be to try new parallelization scheme with all orientations to check if it is producing correct results, and only later reduce the number of orientations through some more advanced algorithms
     - In the far future, the nice idea would be to couple COMPLEXES (on which Berenger is working) with BioEM project to reconstruct models
       + COMPLEXES computes models which serve later as an input to BioEM
       + The idea is to start with loose coupling, and later possibly do something much closer. This is more of an engineering challenge
       + In future, it could be Berenger or Luka who will tackle this
     - Organization
       + Try audios once a month
       + Use Gitlab issues for the communications
       + Pilar is coming back to Frankfurt in August. She might visit Munich somewhere in August for a few hours
       + The easiest way to proceed would be to do a private fork of the BioEM project, put everything there (notes, scripts, analysis) and add everyone who can contribute to the repository
       + A test case to use is the one used in the GPU paper, possibly using the environment variable BIOEM_DEBUG_BREAK to run just a few iterations. The command line inputs --DumpMaps and later --LoadMapDump could also speed up the process
     - Later discussion between Luka and Markus on autotuning implementation
       + Just measure the timings on GPUs and CPUs for an iteration, and then derive the optimal balance from that (see the sketch below)
       + Possibly rebalance every X iterations (or projections/orientations) if the balance changed
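
A hedged sketch of this "measure and derive" idea (assumptions: timeGpuOnly and timeCpuOnly are hypothetical measurements of one iteration with the comparisons done only on the GPU and only on the CPUs; the real BioEM autotuning is wired into the comparison loop differently):

#+BEGIN_SRC C++
// Sketch only: split the workload proportionally to the measured throughputs,
// i.e. give the GPU the share (1/t_gpu) / (1/t_gpu + 1/t_cpu) = t_cpu / (t_gpu + t_cpu).
#include <cstdio>

int deriveOptimalWorkload(double timeGpuOnly, double timeCpuOnly)
{
    double gpuShare = timeCpuOnly / (timeGpuOnly + timeCpuOnly);
    int workload = static_cast<int>(gpuShare * 100.0 + 0.5);
    printf("GPU-only: %.3f s, CPU-only: %.3f s -> GPUWORKLOAD = %d%%\n",
           timeGpuOnly, timeCpuOnly, workload);
    return workload;
}

// Example: if the GPU alone needs 2.0 s and the CPUs alone need 6.0 s,
// the balanced split is 75% of the comparisons on the GPU. Rebalancing
// every X iterations would simply repeat the measurement and this call.
#+END_SRC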
	 
** 2017-06-16
*** MOVED Developing autotuning on dvl machine [3/4]
    :LOGBOOK:
    - State "MOVED"      from "TODO"       [2017-06-22 Thu 10:00]
    - State "TODO"       from "TODO"       [2017-06-20 Tue 13:01]
    - State "TODO"       from              [2017-06-16 Fri 16:58]
    :END:
   - Added nicer error detection and printing for CUDA driver errors
   - The problem on the dvl device was related to the part that tests all CUDA devices, searching for the fastest one. Commenting out this part made both MPI and non-MPI executions possible.
   - [ ] Still, it seems that initializing device 1 and then 0 causes a problem (initializing 0 and then 1 seems to be fine). Need to inspect this problem in more detail
     + Markus thinks that this may be related to the way CUDA is configured on the /dvl/ machine. The special MPS mode is enabled there, which for some unknown reason causes trouble for the BioEM code.
     + We will need to inspect this more and check this hypothesis
     + If this is true, additional code in BioEM is needed, to make sure it can work on similar machines
     + We checked it, and indeed the problem on dvl machines was caused by the EXCLUSIVE_PROCESS mode of GPUs. In DEFAULT mode everything works smoothly, even with the GPU testing code
     + I need to investigate why EXCLUSIVE_PROCESS mode is causing troubles for BioEM
   - [X] When changing the code, constantly check that everything is still numerically correct. Possibly rely on the subtract_LogP.sh script available in Tutorial_Bio/MODEL_COMPARISON
     + If comparing only the first 20 models, there is a significant difference between the obtained Output_Probabilities and the ones sent by Markus together with the inputs. Hence another Output_Probabilities_20_ref was created as a reference
   - [X] When changing the workload during the execution, we never get the best performance for that workload (compared to when the same workload is tested without autotuning).=Actually, this was solved by doing deviceFinishRun() -> deviceStartRun(). These functions introduce overhead, but it seems that they are necessary
   - [X] Create a script for typical runs

*** Simple analysis of results1

#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session org-R
library(ggplot2)

df <- read.csv("data/results1.csv")

df$Workload <- factor(df$Workload, levels=c(" 60%", " 80%", " 100%", " Auto"))
df$GPUs <- as.factor(as.character(paste("#GPUs =", df$GPUs)))
df$OMP_THREADS <- as.factor(as.character(paste("#OMPs =", df$OMP_THREADS)))
df$OMP_THREADS <- factor(df$OMP_THREADS, levels=c("#OMPs = 5", "#OMPs = 20"))

ggplot(df, aes(x=Workload, y=Time, fill=Workload)) + geom_bar(stat="identity") + facet_grid(GPUs ~ OMP_THREADS) + theme_bw() + ylab("Overall execution time [s]") + xlab("Workload processed on GPUs")
#+end_src

#+RESULTS:
[[file:/tmp/babel-2425YP4/figure2425NIt.png]]

        - Maybe it would be good to save these results

#+begin_src R :results output graphics :file analysis/results1_analysis.pdf :exports both :width 6 :height 4 :session org-R
library(ggplot2)

df <- read.csv("data/results1.csv")

df$Workload <- factor(df$Workload, levels=c(" 60%", " 80%", " 100%", " Auto"))
df$GPUs <- as.factor(as.character(paste("#GPUs =", df$GPUs)))
df$OMP_THREADS <- as.factor(as.character(paste("#OMPs =", df$OMP_THREADS)))
df$OMP_THREADS <- factor(df$OMP_THREADS, levels=c("#OMPs = 5", "#OMPs = 20"))

ggplot(df, aes(x=Workload, y=Time, fill=Workload)) + geom_bar(stat="identity") + facet_grid(GPUs ~ OMP_THREADS) + theme_bw() + ylab("Overall execution time [s]") + xlab("Workload processed on GPUs")
#+end_src

#+RESULTS:
[[file:analysis/results1_analysis.pdf]]
** 2017-06-19
*** TODO Autotuning [7/8]
    :LOGBOOK:
    - State "TODO"       from "TODO"       [2017-07-03 Mon 08:51]
    - State "TODO"       from "TODO"       [2017-06-30 Fri 10:49]
    - State "TODO"       from "TODO"       [2017-06-30 Fri 10:49]
    - State "TODO"       from "TODO"       [2017-06-23 Fri 15:46]
    - State "TODO"       from "TODO"       [2017-06-23 Fri 12:58]
    - State "TODO"       from "TODO"       [2017-06-23 Fri 09:52]
    - State "TODO"       from "TODO"       [2017-06-22 Thu 17:27]
    - State "TODO"       from "TODO"       [2017-06-22 Thu 17:27]
    - State "TODO"       from "TODO"       [2017-06-22 Thu 09:59]
    - State "TODO"       from "TODO"       [2017-06-21 Wed 16:21]
    - State "TODO"       from "TODO"       [2017-06-21 Wed 16:21]
    - State "TODO"       from "TODO"       [2017-06-21 Wed 16:21]
    - State "TODO"       from "TODO"       [2017-06-20 Tue 14:40]
    - State "TODO"       from "TODO"       [2017-06-20 Tue 13:20]
    - State "TODO"       from "TODO"       [2017-06-20 Tue 13:01]
    - State "TODO"       from "TODO"       [2017-06-20 Tue 09:47]
    - State "TODO"       from "TODO"       [2017-06-19 Mon 16:49]
    - State "TODO"       from              [2017-06-19 Mon 16:48]
    :END:
    - Discussed the reported issues with Markus
    - He thinks that the approach is good
    - [X] Next step is to test it on the hydra, draco and phys machines, and see the influence
      + the phys machine uses the Sun Grid Engine job scheduler (/qstat/), but it is not well documented for MPCDF machines. Markus will send me a template for running jobs on this machine
    - [X] Only later realized that the influence of the pipelining is not so big, so actually measuring the time on "GPU only" and "CPU only" might be a better approach. Implementing another autotuning algorithm might be interesting; we could even evaluate both and present the results
    - [ ] Markus thinks that enabling autotuning with GPUWORKLOAD=-1 or without stating GPUWORKLOAD might be the best idea
      + Pilar agrees, and this should be activated by default (GPUWORKLOAD='')
      + Final code should look like this before merging (and keeping only one Autotuning algorithm)
    - [X] Discuss at the following meeting the possibility of recalibrating the workload during the execution. For now there seems to be no need for such a thing, but it can be added quite easily and cheaply in terms of performance overhead and code development
      + Pilar believes that this can be useful, check with Markus
      + Propose an environment variable for this, as well as for the number of comparisons to get stable performance (STABILIZER)
      + Pilar agreed, do it in such a way
    - [X] Check if there is another approach for autotuning, described in the Numerical Recipes book
      + Yes, with bisection; the performance results will show whether this is better than the other approaches (see the sketch after this list)
    - [X] Report error regarding documentation of the computation of GPUWORKLOAD
      + Done, waiting for the reply and someone to correct this
      + If Pilar doesn't reply by Wednesday 21st, commit in the main project and add tag 1.0.2
      + Fixed and pushed
    - [X] Discuss with Markus the possibility of hyperthreading and whether they generally use it
      + Could be a thing to test, although Markus doesn't expect significant improvements from it
      + On draco, hyperthreading (if I configured it correctly) seems to have a negative influence on the performance
    - [X] Create another type of visualization (Timeline vs Time for projection), possibly add events for the optimal workload (or colors for workload)
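
A hedged sketch of the bisection variant mentioned above (not the actual BioEM code; measureImbalance() is a hypothetical helper that runs a few comparisons at the given GPU workload and returns GPU time minus CPU time, negative while the GPU finishes first):

#+BEGIN_SRC C++
// Sketch only: bisect on the workload percentage until the GPU and CPU
// parts take (nearly) the same time.
#include <cstdio>
#include <functional>

int tuneWorkloadBisection(const std::function<double(int)> &measureImbalance,
                          int lo = 0, int hi = 100)
{
    // Invariant: the GPU is under-loaded at 'lo' and over-loaded at 'hi'.
    while (hi - lo > 1)
    {
        int mid = (lo + hi) / 2;
        if (measureImbalance(mid) < 0.0)
            lo = mid;   // GPU still finishes earlier -> give it more work
        else
            hi = mid;   // GPU became the bottleneck -> give it less work
    }
    printf("Bisection converged to GPUWORKLOAD = %d%%\n", lo);
    return lo;
}
#+END_SRC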
** 2017-06-20
*** TODO Other code things to add for BioEM [2/5]
    :LOGBOOK:
    - State "TODO"       from "TODO"       [2017-06-22 Thu 11:59]
    - State "TODO"       from "TODO"       [2017-06-20 Tue 16:23]
    - State "TODO"       from "TODO"       [2017-06-20 Tue 16:13]
    - State "TODO"       from              [2017-06-20 Tue 10:19]
    :END:
    - [X] Problems with installing BioEM on the hydra machine with the Intel recipe that used to work. The issues seemed unrelated to the autotuning changes, but rather something regarding bio:configure and reading the parameters. Actually, Luka was not using the right Intel compilers before; with the correct ones everything compiles fine
    - [X] Need to find a proper way of handling errors in the code
    - [ ] Need to do a nice cleanup before merging into the main project
      + Check if it is OK to add workload information to the "TimeComparison" line
    - [ ] Add nice printf for writing the Optimal Workload
      + Check if it is OK to add such info
    - [ ] Add more thorough CUDA profiling, possibly using specialized CUDA tools for that. We will certainly need it in the future when doing more development in BioEM
      + Already added debugging information
    - [ ] Ensure that pinning is done correctly (in the Intel case there shouldn't be any problem)

*** DONE Simple analysis of results2
    :LOGBOOK:
    - State "DONE"       from "TODO"       [2017-06-22 Thu 13:55]
    - State "TODO"       from              [2017-06-20 Tue 15:17]
    :END:

#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session org-R
library(ggplot2)

df <- read.table("data/results2.csv", header=TRUE, sep=",", strip.white=TRUE)

df$Algorithm <- factor(df$Algorithm, levels=c("Ref", "Static", "Auto1", "Auto2", "Auto3"))
df$GPUs <- as.factor(as.character(paste("#GPUs = ", df$GPUs)))
df$OMP_THREADS <- as.factor(as.character(paste("#OMPs = ", df$OMP_THREADS)))

ggplot(df, aes(x=Algorithm, y=Time, fill=Workload)) + geom_bar(stat="identity") + scale_fill_gradientn(colours=rainbow(4)) + facet_wrap( ~ Machine) + theme_bw() + ylab("Overall execution time [s]") + xlab("Workload algorithm")
#+end_src

#+RESULTS:
[[file:/tmp/babel-22888QGl/figure22888ofZ.png]]

        - Maybe it would be good to save these results

#+begin_src R :results output graphics :file analysis/results2_analysis.pdf :exports both :width 8 :height 4 :session org-R
library(ggplot2)

df <- read.table("data/results2.csv", header=TRUE, sep=",", strip.white=TRUE)

df$Algorithm <- factor(df$Algorithm, levels=c("Ref", "Static", "Auto1", "Auto2", "Auto3"))
df$GPUs <- as.factor(as.character(paste("#GPUs = ", df$GPUs)))
df$OMP_THREADS <- as.factor(as.character(paste("#OMPs = ", df$OMP_THREADS)))

ggplot(df, aes(x=Algorithm, y=Time, fill=Workload)) + geom_bar(stat="identity") + scale_fill_gradientn(colours=rainbow(4)) + facet_wrap( ~ Machine) + theme_bw() + ylab("Overall execution time [s]") + xlab("Workload algorithm")
#+end_src

#+RESULTS:
[[file:analysis/results2_analysis.pdf]]
** 2017-06-21
*** MOVED Autotuning [0/1]
    :LOGBOOK:
    - State "MOVED"      from "TODO"       [2017-06-22 Thu 09:59]
    - State "TODO"       from              [2017-06-21 Wed 16:24]
    :END:
    - Implementing third autotuning algorithm, based on bisection (from Numerical Recipes)
    - Running experiments on 4 nodes on machines hydra, draco, phys
    - [ ] Strange error on the phys machine where no error is supposed to occur (just after the computation). Temporarily disabled the check to make it work, but this needs to be investigated further
** 2017-06-22
*** TODO BioEM code errors regrouped [1/2]
    :LOGBOOK:
    - State "TODO"       from "TODO"       [2017-06-29 Thu 15:29]
    - State "TODO"       from "TODO"       [2017-06-22 Thu 10:51]
    - State "TODO"       from "TODO"       [2017-06-22 Thu 10:51]
    - State "TODO"       from              [2017-06-22 Thu 09:57]
    :END:
    - [ ] Strange error on the phys machine where no error is supposed to occur (just after the CUDA computation). It is happening in bioem_cuda.cu:300, although the error code is 0 (which normally means cudaSuccess, hence no error). Temporarily disabled the check to make it work, but this needs to be investigated further
      + The code was 0 because cudaGetLastError() resets the error state. Hence, using cudaPeekAtLastError() might be better (see the checking sketch after this list)
      + Actually, cudaPeekAtLastError() shows that the error was CUDA_ERROR_INVALID_DEVICE
      + The error occurs in the first CUDA compute call inside the for loop, in the call /multComplexMap/
    - [X] Still, it seems that initializing device 1 and then 0 causes a problem (initializing 0 and then 1 seems to be fine). Need to inspect this problem in more detail
      + Markus thinks that this may be related to the way CUDA is configured on the /dvl/ machine. The special MPS mode is enabled there, which for some unknown reason causes trouble for the BioEM code.
      + We will need to inspect this more and check this hypothesis
      + If this is true, additional code in BioEM is needed, to make sure it can work on similar machines
      + We checked it, and indeed the problem on dvl machines was caused by the EXCLUSIVE_PROCESS mode of GPUs. In DEFAULT mode everything works smoothly, even with the GPU testing code
      + I need to investigate why EXCLUSIVE_PROCESS mode is causing troubles for BioEM
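
A generic error-checking pattern related to the observation above (a sketch, not BioEM's checkCudaErrors macro): cudaPeekAtLastError() reports a launch error without clearing the error state, while cudaDeviceSynchronize() surfaces asynchronous errors that only appear once the kernel has actually run.

#+BEGIN_SRC C++
#include <cstdio>
#include <cuda_runtime.h>

// Call right after a kernel launch, e.g. after multComplexMap<<<...>>>(...).
inline void checkKernelLaunch(const char *what)
{
    // Synchronous launch errors: peek, so the error state is not cleared.
    cudaError_t err = cudaPeekAtLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "%s launch failed: %s\n", what, cudaGetErrorString(err));

    // Asynchronous execution errors only surface after synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "%s execution failed: %s\n", what, cudaGetErrorString(err));
}
#+END_SRC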
** 2017-06-23
*** Analyzing workload choice for Autotuning 3 on dvl

#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session org-R
  library(ggplot2)

  df1 <- read.table("data/workload_test.csv", header=FALSE, sep=",", strip.white=TRUE)
  df1$Exp <- "Exp1"
  df2 <- read.table("data/workload_test2.csv", header=FALSE, sep=",", strip.white=TRUE)
  df2$Exp <- "Exp2"
  df3 <- read.table("data/workload_test3.csv", header=FALSE, sep=",", strip.white=TRUE)
  df3$Exp <- "Exp3"

  df <- rbind(df1, df2, df3)
  names(df) <- c("Name", "MPI_Rank", "Proj", "Comp", "Duration", "Workload", "Exp")
  base <- floor(max(df$Proj)/(max(df$MPI_Rank)+1))
  df$Comp2 <- (df$Proj-base*df$MPI_Rank)*base + (df$Comp - base*df$MPI_Rank)

  ggplot(df, aes(x=Comp2, y=Duration, color=factor(Workload))) + geom_point()  + facet_wrap( Exp ~ MPI_Rank, ncol=2) + theme_bw() + ylab("Comparison duration [s]") + xlab("Comparison timeline")

#+end_src

#+RESULTS:
[[file:/tmp/babel-22888QGl/figure22888dQt.png]]

#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session org-R
  library(ggplot2)
  library(plyr)

  df1 <- read.table("data/workload_test.csv", header=FALSE, sep=",", strip.white=TRUE)
  df1$Exp <- "Exp1"
  df2 <- read.table("data/workload_test2.csv", header=FALSE, sep=",", strip.white=TRUE)
  df2$Exp <- "Exp2"
  df3 <- read.table("data/workload_test3.csv", header=FALSE, sep=",", strip.white=TRUE)
  df3$Exp <- "Exp3"

  df <- rbind(df1, df2, df3)
  names(df) <- c("Name", "MPI_Rank", "Proj", "Comp", "Duration", "Workload", "Exp")

  df <- ddply(df, .(MPI_Rank, Workload, Exp), summarise, Duration = max(Duration))

  ggplot(df, aes(x=Workload, y=Duration, color=factor(Workload))) + geom_point()  + facet_wrap( Exp ~ MPI_Rank, ncol=2) + theme_bw() + ylab("Max comparison duration [s]") + xlab("Workload value")

#+end_src

#+RESULTS:
[[file:/tmp/babel-22888QGl/figure22888QNb.png]]
*** Analyzing workload choice for Autotuning 3 on draco

#+begin_src R :results output graphics :file analysis/draco_workload1.pdf :exports both :width 9 :height 12 :session org-R 
  library(ggplot2)

  df1 <- read.table("data/draco_workload_test.csv", header=FALSE, sep=",", strip.white=TRUE)
  df1$Exp <- "Experiment 1"
  df2 <- read.table("data/draco_workload_test2.csv", header=FALSE, sep=",", strip.white=TRUE)
  df2$Exp <- "Experiment 2"
  df <- rbind(df1, df2)

  names(df) <- c("Name", "MPI_Rank", "Proj", "Comp", "Duration", "Workload", "Exp")

  base <- floor(max(df$Proj)/(max(df$MPI_Rank)+1))
  df$Comp2 <- (df$Proj-base*df$MPI_Rank)*base + (df$Comp - base*df$MPI_Rank)

  df$MPI_Rank <- as.factor(as.character(paste("MPI_Rank = ", df$MPI_Rank)))

  ggplot(df, aes(x=Comp2, y=Duration, color=factor(Workload))) + geom_point()  + facet_grid( MPI_Rank ~ Exp) + theme_bw() + ylab("Comparison duration [s]") + xlab("Comparison timeline")+xlim(2000000,2100000)

#+end_src

#+RESULTS:
[[file:analysis/draco_workload1.pdf]]

#+begin_src R :results output graphics :file analysis/draco_workload2.pdf :exports both :width 9 :height 12 :session org-R 
  library(ggplot2)
  library(plyr)

  df1 <- read.table("data/draco_workload_test.csv", header=FALSE, sep=",", strip.white=TRUE)
  df1$Exp <- "Experiment 1"
  df2 <- read.table("data/draco_workload_test2.csv", header=FALSE, sep=",", strip.white=TRUE)
  df2$Exp <- "Experiment 2"
  df <- rbind(df1, df2)

  names(df) <- c("Name", "MPI_Rank", "Proj", "Comp", "Duration", "Workload", "Exp")

  #df <- ddply(df, .(MPI_Rank, Workload, Exp), summarise, Duration = max(Duration))

  df$MPI_Rank <- as.factor(as.character(paste("MPI_Rank = ", df$MPI_Rank)))

  ggplot(df, aes(x=Workload, y=Duration, color=factor(Workload))) + geom_point()  + facet_grid( MPI_Rank ~ Exp) + theme_bw() + ylab("Comparison duration [s]") + xlab("Workload value")

#+end_src

#+RESULTS:
[[file:analysis/draco_workload2.pdf]]

#+begin_src R :results output graphics :file analysis/draco_workload1_zoom.pdf :exports both :width 6 :height 6 :session org-R 
  library(ggplot2)

  df1 <- read.table("data/draco_workload_test.csv", header=FALSE, sep=",", strip.white=TRUE)
  df1$Exp <- "Experiment 1"
  df2 <- read.table("data/draco_workload_test2.csv", header=FALSE, sep=",", strip.white=TRUE)
  df2$Exp <- "Experiment 2"
  df <- rbind(df1, df2)

  names(df) <- c("Name", "MPI_Rank", "Proj", "Comp", "Duration", "Workload", "Exp")

  base <- floor(max(df$Proj)/(max(df$MPI_Rank)+1))
  df$Comp2 <- (df$Proj-base*df$MPI_Rank)*base + (df$Comp - base*df$MPI_Rank)

  df$MPI_Rank <- as.factor(as.character(paste("MPI_Rank = ", df$MPI_Rank)))

  ggplot(df[df$MPI_Rank=="MPI_Rank =  4" & df$Exp=="Experiment 1",], aes(x=Comp2, y=Duration, color=factor(Workload))) + geom_point()  + facet_grid( MPI_Rank ~ Exp) + theme_bw() + ylab("Comparison duration [s]") + xlab("Comparison timeline")+xlim(2010000,2100000) + ggtitle("Zoomed view on bogus performance")

#+end_src

#+RESULTS:
[[file:analysis/draco_workload1_zoom.pdf]]

#+begin_src R :results output graphics :file analysis/draco_workload0.pdf :exports both :width 4.5 :height 12 :session org-R 
  library(ggplot2)

  df <- read.table("data/draco_workload_test0.csv", header=FALSE, sep=",", strip.white=TRUE)
  df$Exp <- "Static experiment 87%"
  names(df) <- c("Name", "MPI_Rank", "Proj", "Comp", "Duration", "Workload", "Exp")

  base <- floor(max(df$Proj)/(max(df$MPI_Rank)+1))
  df$Comp2 <- (df$Proj-base*df$MPI_Rank)*base + (df$Comp - base*df$MPI_Rank)

  df$MPI_Rank <- as.factor(as.character(paste("MPI_Rank = ", df$MPI_Rank)))

  df$Workload <- 87

  ggplot(df, aes(x=Comp2, y=Duration, color=factor(Workload))) + geom_point()  + facet_grid( MPI_Rank ~ Exp) + theme_bw() + ylab("Comparison duration [s]") + xlab("Comparison timeline")

#+end_src

#+RESULTS:
[[file:analysis/draco_workload0.pdf]]
** 2017-06-26
*** TODO Workload 87% issue
    :LOGBOOK:
    - State "TODO"       from              [2017-06-29 Thu 15:01]
    :END:
    - Strange observation on the draco machines: for workload 87% (as well as 86%) there is huge variability in the Comparison duration
    - Not sure why it is happening; maybe it is related to the part executed on the CPU
    - Could it be CPU frequency scaling?
      + That would explain the issue and the results
      + However, draco nodes normally work in /performance/ mode, so these things should not occur
    - Tried to experimentally detect the same problem on the dvl machine, but everything was stable
    - [X] Wait until the draco machine is back, with the new OS, reproduce the experiments and see if the issue persists
      + If needed, check PAPI counters, look at the frequency, profile in more detail
    - Can compute how much work was performed on the GPU and how much on the CPUs, then run separate experiments putting exactly that amount of workload on each resource. This is quite easy to do by commenting out parts of the GPU/CPU code (although the results will be wrong at the end)
      + Commenting out the OMP code in bioem_cuda.cu:369 doesn't create any problems for the execution
      + Setting the number of iterations to 0 in the GPU code in bioem_cuda.cu:283 doesn't create any problems either, and it gives an estimate of the OMP execution duration
    - No need for a large execution, as the problem is quite stable, so decrease the number of MPI nodes and the size of the problem
    - Actually, after the update draco seems to be in /powersave/ mode, so the performance is quite stable. However, the optimal workload value could now be quite different, as OMP is not as performant as it used to be
    - [ ] Discuss the draco governor modes with Christian
    - [X] Explain to Pilar in BioEM issue
    
*** TODO EXCLUSIVE_MODE on dvl GPUs issue
    :LOGBOOK:
    - State "TODO"       from              [2017-06-29 Thu 15:02]
    :END:
    - Exclusive mode is used to be able to benefit from Nvidia MPS
    - Normally it should not affect standard CUDA code, but for some reason there are problems with BioEM on the dvl machine
    - [ ] Do we want an MPS implementation of BioEM one day?
    - [X] The cuCtxCreate function was causing errors, maybe because only one context is allowed with EXCLUSIVE_PROCESS and that one already exists. Need to verify this hypothesis.=Yes
    - Added computeMode information to the output. It returns the value "3", which is missing from the enum listing we first consulted; in current CUDA versions it corresponds to cudaComputeModeExclusiveProcess
      + enum cudaComputeMode {
      + cudaComputeModeDefault = 0,
      + cudaComputeModeExclusive = 1,
      + cudaComputeModeProhibited = 2,
      + cudaComputeModeExclusiveProcess = 3 }
    - A lot of problems; other people have encountered the same issues
    - In the end, it seems that for CUDA versions > 6.5 we just *must not* call cudaSetDevice(), but only cudaDeviceSynchronize()
    - Explanation from https://github.com/kaldi-asr/kaldi/issues/1487
      + Unfortunately they don't say what cudaSetDevice does if it is
        called on a device with computeMode
        cudaComputeModeExclusiveProcess that is already occupied by
        another process . Now I would expect it to return
        cudaErrorDeviceAlreadyInUse indicating the device cannot be
        used, but that is not what happens. In fact it returns
        cudaSuccess. They again don't say what subsequent non-device
        management runtime functions return when an occupied
        "Exclusive Process" mode device is chosen with cudaSetDevice
        from a different process. This time they do what I expect them
        to do and return cudaErrorDevicesUnavailable. The problem is
        they continue to return cudaErrorDevicesUnavailable even if
        the device is no longer occupied. I tried calling
        cudaDeviceReset after every failed context creation attempt
        but that did not change anything. I tried calling cudaFree(0)
        instead of cudaDeviceSynchronize() but that did not change
        anything either. I also tried a bunch of other hacks that I
        can not remember now but nothing helped. It seems like there
        is nothing we can do at this point other than submitting a bug
        report to NVIDIA and wait for them to fix it. If anyone
        reading this can think of anything to work around this issue
        in the meantime, let me know.
      + BTW, someone here
        https://devtalk.nvidia.com/default/topic/415723/choosing-cuda-device-programmatically/
        is saying that when using compute exclusive mode, you should
        just not call cudaSetDevice-- presumably it will automatically
        pick a free device when you call
        cudaDeviceSynchronize(). Interestingly, this is how the code
        *originally* worked. However, after a certain version of the
        CUDA toolkit, the original code stopped working, and that's
        when we added the iteration over devices and the calls to
        cudaSetDevice. One possibility is to query the version of the
        CUDA toolkit, and if it's >= 8.0, go back to the original
        strategy of trying to call cudaDeviceSynchronize() and
        checking the error status.
    - Another interesting discussion: https://serverfault.com/questions/377005/using-cuda-visible-devices-with-sge
    - Other people complaining about the CUDA bug on this matter:
      + https://devtalk.nvidia.com/default/topic/869602/cuda-7-0-and-compute-exclusive-mode-on-multi-gpu-machine/
      + https://devtalk.nvidia.com/default/topic/857233/dual-gpu-cuda-6-5-vs-7-0/
    - For checking different type of CUDA errors, consult this explanation: https://devblogs.nvidia.com/parallelforall/how-query-device-properties-and-handle-errors-cuda-cc/
      + Basically both cudaGetLastError() and cudaDeviceSynchronize() are needed, for synchronous and asynchronous errors
    - Conclusion: when CUDA is set to EXCLUSIVE_PROCESS mode, processes
      are automatically mapped to different GPU devices. In some cases
      BioEM tries, for CUDA process 0 which is on GPU 0, to set its
      device to GPU 1. This creates an error, since there is already a
      running CUDA process 1 on GPU 1. However, when the compute mode is
      DEFAULT, this call is very much needed. Hence, call cudaSetDevice()
      only if the compute mode is DEFAULT
    - [X] CUDA processes are actually created with pProb.init. Is it supposed to be like that?
      + Probably not, it is better to move this code
    - Also, CUDA itself has bugs regarding cudaSetDevice() errors
    - BTW: for the BioEM code, GPUs perform 8.5% worse when run in EXCLUSIVE_PROCESS mode compared to DEFAULT mode
    - Proposed solution:.=Actually, better to do even more
#+BEGIN_SRC 
// Query the compute mode of the selected device
cudaGetDeviceProperties(&deviceProp, bestDevice);
if (deviceProp.computeMode == cudaComputeModeDefault)
{
   // DEFAULT mode: explicitly bind this process to the chosen device
   checkCudaErrors(cudaSetDevice(bestDevice));
}
else
{
   // EXCLUSIVE_PROCESS (or other) mode: do not call cudaSetDevice();
   // the driver maps the process to a free device, just synchronize it
   printf("CUDA device %d is not set in DEFAULT mode, make sure that processes are correctly pinned!\n", bestDevice);
   checkCudaErrors(cudaDeviceSynchronize());
}
#+END_SRC
    - [ ] Test new solution on several machines, with different inputs
    - [ ] Get the right patch, send it to the master BioEM project
    - [X] Write explanations to Pilar
    - [ ] Close the issue on BioEM_fork
    - After the Application group meeting, Andy said that NVIDIA clearly states that EXCLUSIVE_PROCESS mode should only be used with MPS
      
** 2017-06-27
*** New recipes for draco (not working)
    - Installation with Intel modules
#+BEGIN_SRC 
module purge
module load git/2.13
module load cmake/3.7
module load intel/17.0
module load impi/2017.3
module load fftw/3.3.6
module load boost/intel/1.64
module load cuda/8.0

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation (need to manually add CUDA_rt_LIBRARY and CUDA_SDK_ROOT_DIR)
cmake -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=ON $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

   - Problems compiling with Intel (seems to be related to FFTW-Intel compatibility)
#+BEGIN_SRC 
/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: /mpcdf/soft/SLES122/HSW/fftw/3.3.6-pl2/intel-17.0/impi-2017.3/lib/libfftw3f.a(trig.o): undefined reference to symbol '__libm_sse2_sincos'
/mpcdf/soft/SLES122/common/intel/ps2017.4/17.0/linux/compiler/lib/intel64/libimf.so: error adding symbols: DSO missing from command line
#+END_SRC
   - When the bioem module was installed on the draco machine (so presumably it works there), it was built with boost/1.54. However, it really looks like the problem is coming from FFTW


   - Installation with gcc modules
#+BEGIN_SRC 
module purge
module load git/2.13
module load cmake/3.7
module load gcc/6.3
module load impi/2017.3
module load fftw/gcc/3.3.6
module load boost/gcc/1.64
module load cuda/8.0

# Paths
SRC_DIR="$afsdir/BioEM_fork"
BUILD_DIR="$HOME/BioEM_project/build"
mkdir -p $BUILD_DIR
cd $BUILD_DIR

# Deleting files from previous installations
rm -rf $BUILD_DIR/*
rm -rf $SRC_DIR/CMakeFiles $SRC_DIR/CMakeCache.txt $SRC_DIR/Makefile $SRC_DIR/cmake_install.cmake

# Configuration and compilation (need to manually add CUDA_rt_LIBRARY and CUDA_SDK_ROOT_DIR)
cmake -DUSE_MPI=ON -DUSE_OPENMP=ON -DUSE_CUDA=ON -DPRINT_CMAKE_VARIABLES=ON -DCUDA_FORCE_GCC=OFF $SRC_DIR/
make -j5 VERBOSE=1
#+END_SRC

   - Problems compiling with gcc (seems to be related to CUDA-gcc compatibility)
#+BEGIN_SRC 
mpcdf/soft/SLES122/common/cuda/8.0.61/bin/nvcc -M -D__CUDACC__ /afs/ipp-garching.mpg.de/u/sluka/BioEM_fork/bioem_cuda.cu -o /u/sluka/BioEM_project/build/CMakeFiles/bioEM.dir//bioEM_generated_bioem_cuda.cu.o.NVCC-depend -ccbin gcc -m64 -DWITH_CUDA -DWITH_OPENMP -DWITH_MPI -Xcompiler ,\"-O3\",\"-march=native\",\"-fweb\",\"-mfpmath=sse\",\"-frename-registers\",\"-minline-all-stringops\",\"-ftracer\",\"-funroll-loops\",\"-fpeel-loops\",\"-fprefetch-loop-arrays\",\"-ffast-math\",\"-ggdb\",\"-g\" --use_fast_math -ftz=true -O4 -Xptxas -O4 -gencode=arch=compute_20,code=[sm_20,sm_21] -gencode=arch=compute_30,code=sm_30 -gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=[sm_60,sm_61] -DNVCC -I/mpcdf/soft/SLES122/common/cuda/8.0.61/include -I/afs/ipp-garching.mpg.de/u/sluka/BioEM_fork/include -I/mpcdf/soft/SLES122/HSW/fftw/3.3.6-pl2/gcc-6.3/impi-2017.3/include -I/mpcdf/soft/SLES122/common/boost/1.64/gcc/6.3.0/include -I/mpcdf/soft/SLES122/common/intel/ps2017.4/impi/2017.3/intel64/include
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
In file included from /mpcdf/soft/SLES122/common/cuda/8.0.61/include/cuda_runtime.h:78:0,
                 from <command-line>:0:
/mpcdf/soft/SLES122/common/cuda/8.0.61/include/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 5 are not supported!
 #error -- unsupported GNU version! gcc versions later than 5 are not supported!
  ^~~~~
CMake Error at bioEM_generated_bioem_cuda.cu.o.cmake:222 (message):
  Error generating
  /u/sluka/BioEM_project/build/CMakeFiles/bioEM.dir//./bioEM_generated_bioem_cuda.cu.o


CMakeFiles/bioEM.dir/build.make:63: recipe for target 'CMakeFiles/bioEM.dir/bioEM_generated_bioem_cuda.cu.o' failed
make[2]: *** [CMakeFiles/bioEM.dir/bioEM_generated_bioem_cuda.cu.o] Error 1
make[2]: Leaving directory '/draco/u/sluka/BioEM_project/build'
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/bioEM.dir/all' failed
make[1]: *** [CMakeFiles/bioEM.dir/all] Error 2
make[1]: Leaving directory '/draco/u/sluka/BioEM_project/build'
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

#+END_SRC

** 2017-06-28
*** TODO Issues with boost dependencies
    :LOGBOOK:
    - State "TODO"       from              [2017-06-29 Thu 14:55]
    :END:
    - [ ] Discuss with Pilar whether we can get rid of boost, as it creates unnecessary compatibility/installation issues
      + Had problems with boost on hydra and phys machines with gcc compilers

** 2017-06-29
*** Problem with memory allocation on draco
    - Getting /srun: error: drago32: task2: Bus error/. This is related to bad memory accesses. Here is an explanation of the difference between that and a segfault:
      + A segfault is accessing memory that you're not allowed to access. It's read-only, you don't have permission, it belongs to another process, etc...
      + A bus error is trying to access memory that can't possibly be there. You've used an address that's meaningless to the system, or the wrong kind of address for that operation.
    - The problem was probably coming from CUDA, as the contexts were not properly started in DEFAULT mode
    - New fixes to the code resolved this issue
** 2017-06-30
*** TODO See about new classes with Markus [1/2]
    :LOGBOOK:
    - State "TODO"       from              [2017-06-30 Fri 18:17]
    :END:
    - [X] Is he OK with such an approach.=Yes, both Markus and Pilar agree with this approach
    - [ ] How about the copyrights (and my name)?