This page describes the current workflow from User File Upload -> NormalizedArchive:
- A user uploads a file.
- The file is stored on web-repository:/uploads/hashUpload/archive.tar.gz.
- The upload is registered in dbRepository and parsed by the repository (???).
- The file is copied to: labdev:/nomad/repository/data/uploads/hashUpload/archive.tar.gz
- The file is uncompressed to: labdev:/nomad/repository/data/uploads/extract/hashUpload
- For folders with open access (i.e., there is no file called Restricted):
- For folders smaller than 10 GB, a single zip file is generated at: labdev:/nomad/nomadlab/raw-data/RhashArchive.zip
java -jar <jar file path> --small-upload-file <path to du file> --base-path /nomad/repository/data/extracted
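The <path to du file> above is not further specified; presumably it is a disk-usage listing of the extracted upload. The following is only an illustrative guess (the output file name and du options are assumptions):
#Illustrative only: produce a du listing for one extracted upload
du -b /nomad/repository/data/extracted/hashUpload > hashUpload.du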
- For folders larger than 10 GB, several zip files are generated at: labdev:/nomad/nomadlab/raw-data/RhashArchive.zip (how are the archive GIDs generated?)
- Create a file uploadHash.split with the following layout (an illustrative example is sketched after the run command below):
    line 1: "# " + path to the upload
    line 2: empty line
    line 3: a path inside the upload
    lines 4..n: files/directories inside that path
    line n+1: empty line
    then the pattern repeats, starting again with a path inside the upload (as in line 3)
1. Run:
java -jar <jar file path> --split-file <path to split file>
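For illustration, a split file following the layout above could be created like this (all paths here are invented; the real content must match the actual upload):
#Hypothetical example of an uploadHash.split file
cat > uploadHash.split <<'EOF'
# /nomad/repository/data/extracted/hashUpload

path/inside/the/upload
calc1/OUTCAR
calc1/INCAR

another/path/inside/the/upload
run.log
EOF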
- Each zip file in labdev:/nomad/nomadlab/raw-data/RhashArchive.zip is parsed:
- Create a file with the list of zip files to parse: listOfFiles.txt
- Load the calculations to parse onto RabbitMQ: TreeParserInitializationQueue (todo: create a stable jar) (@brambila: if you run this command with two different files, is the queue overwritten or is more data just added to it?)
sbt "treeparserinitializer/run --file /path/to/file/listOfFiles.txt"
- Generate the assignments:
- Generate the Docker container that executes the assignments in the queue (i.e., figures out which parser should be used for a given file and sends it to the SingleParserInitializationQueue):
sbt treeparser/docker
1. Run the Docker image (@danilo: do you stop this container after running?):
docker run -d -e NOMAD_ENV=labdev --user $UID:$(getent group docker | cut -d: -f3) --volumes-from parserVolContainer eu.nomad-laboratory/nomadtreeparserworker:versionOfTheDockerFile
1. This will start to consume the list in the TreeParserInitializationQueue and fill the SingleParserInitializationQueue. Note that if a file structure is not matched by any parser, this file will simply be ignored and not inserted in the queue.
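Regarding the question above about stopping the container: if it should be stopped once the queue is drained, the standard Docker commands are enough (sketch; the image tag is the one used in the run command):
#Find and stop the running tree-parser worker container
docker ps --filter ancestor=eu.nomad-laboratory/nomadtreeparserworker:versionOfTheDockerFile
docker stop $(docker ps -q --filter ancestor=eu.nomad-laboratory/nomadtreeparserworker:versionOfTheDockerFile)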
1. Generate a JAR file, which contains all the parsers with:
sbt calculationparser/assembly
1. Send the JAR file to EOS (or generate it directly on EOS).
1. Sync your local raw-data/data to the raw-data/data in labdev.
#Command to sync
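The exact sync command is not recorded here; an rsync along these lines is a plausible sketch (the local source path is an assumption):
#Illustrative only: mirror the local raw-data/data tree to labdev
rsync -avz --progress /local/path/to/raw-data/data/ labdev:/nomad/nomadlab/raw-data/data/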
1. Make sure you have access to /u/fawzi/lib on EOS. This folder contains all the dependencies for executing the jar file.
1. Finally, generate the submission script on EOS with:
module load mkl
export NOMAD_ENV=eos_prod
module add jdk
export PATH=${PATH}:/u/fawzi/bin
SBT_OPTS="-Xms1024M -Xmx4096M -Xss8M -XX:MaxPermSize=512M"
CMD="java $SBT_OPTS -Dnomad_lab.configurationToUse=eos_prod \
-Deos_prod.nomad_lab.replacements.rootNamespace=path/to/parsed/files \
-Deos_prod.nomad_lab.parser_worker_rabbitmq.numberOfWorkers=1 \
-Djava.library.path=/u/fawzi/lib \
-Deos_prod.nomad_lab.parsing_stats.jdbcUrl=\\"\\" \
-Dnomad_lab.integrated_pipeline.treeparserMinutesOfRuntime=530 \
-Dnomad_lab.parser_worker_rabbitmq.rabbitMQHost=labdev-nomad.esc.rzg.mpg.de \
-jar /u/brambila/bin/nomadCalculationParserWorker-assembly-1.8.0-119-gb50befb3-dirty.jar"
#for node in $(seq -w 1 32) ; do
for node in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 ; do
\$CMD &> out-${bName}-\${node}.txt &
done
wait
- Copy the files to /nomad/nomadlab/parsed/productionH5/S??/S*.h5 ???
- For each file located in /nomad/nomadlab/parsed/productionH5/S??/S*.h5, run the following (a loop sketch is given after this list):
hdf5lib=lib
jar=jars/tools.jar
java -Djava.library.path=$hdf5lib -jar $jar normalize --archive-uri nmd://$1 --normalizers MetaIndexRebuildNormalizer
- For each input file, a normalized file should be created on /nomad/nomadlab/normalized/newProductionH5/N??/N*.h5
- Generate Parquet File
- Generate Elasticsearch File
- Generate JSON File
- The results are stored on: labdev:/nomad/nomadlab/parsed/??? (@danilo: where is the output folder defined?)
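A minimal sketch for driving the normalize step over all parsed archives (it assumes, as in the index-building command in the normalizer section below, that the nmd:// identifier is just the S*.h5 file name without its extension):
#Sketch: run the MetaIndexRebuildNormalizer over every parsed archive
hdf5lib=lib
jar=jars/tools.jar
for f in /nomad/nomadlab/parsed/productionH5/S??/S*.h5 ; do
    gid=$(basename "$f" .h5)
    java -Djava.library.path=$hdf5lib -jar $jar normalize --archive-uri nmd://$gid --normalizers MetaIndexRebuildNormalizer
done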
File Upload By User -> File Stored on the Repository -> File Processed on the Repository -> ParsedArchive on the Archive -> NormalizedArchive on the Archive
- Upload process @jungho
- Processing UnprocessedFile: @jungho
- Processing ProcessedFile: @brambila
- Input:
- Once the files are in /nomad/nomadlab/raw-data/data, it is first necessary to generate a text file containing all the new input. Since files that were already parsed are skipped if they are sent to the queue again, a redundant list is not a problem. For example, you can generate an input file with all the data ingested in July using the command:
ls -lh /nomad/nomadlab/raw-data/data/*R*/*.zip | grep -v 2016 | grep Jul | awk '{print $9}' > listOfFiles.txt
1. The next step is the creation of the assignments in the queue. First, initialize the queue with the following commands (the last one inside the sbt shell):
cd nomad-lab-base
sbt
treeparserinitializer/run --file /path/to/file/listOfFiles.txt
This will fill the TreeParserInitializationQueue.
1. Finally, to generate the assignments, first generate the Docker image that executes the assignments in the queue (i.e., figures out which parser should be used for a given file and sends it to the SingleParserInitializationQueue), with the command (in sbt):
cd nomad-lab-base
sbt treeparser/docker
and then run the Docker image with:
docker run -d -e NOMAD_ENV=labdev --user $UID:$(getent group docker | cut -d: -f3) --volumes-from parserVolContainer eu.nomad-laboratory/nomadtreeparserworker:versionOfTheDockerFile
This will start consuming the list in the TreeParserInitializationQueue and fill the SingleParserInitializationQueue. Note that if a file structure is not matched by any parser, the file is simply ignored and not inserted in the queue.
1. Useful commands for RabbitMQ:
~/rabbitmqadmin list queues : to see the queues
~/rabbitmqadmin purge queue name=nameOfTheQueue : to purge a given queue.
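If only the queue depths are of interest, rabbitmqadmin can restrict the output to specific columns:
~/rabbitmqadmin list queues name messages : show only queue names and message counts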
1. Run the parsing:
1. First generate a JAR file, which contains all the parsers with (in sbt):
sbt calculationparser/assembly
1. Send the JAR file to EOS (or generate it directly on EOS).
Sync your local raw-data/data to the raw-data/data in labdev.
#Command to sync (see the rsync sketch in the first section above)
1. Make sure you have access to /u/fawzi/lib on EOS. This folder contains all the dependencies for executing the jar file.
1. Finally, generate the submission script on EOS with:
module load mkl
export NOMAD_ENV=eos_prod
module add jdk
export PATH=${PATH}:/u/fawzi/bin
SBT_OPTS="-Xms1024M -Xmx4096M -Xss8M -XX:MaxPermSize=512M"
CMD="java $SBT_OPTS -Dnomad_lab.configurationToUse=eos_prod \
-Deos_prod.nomad_lab.replacements.rootNamespace=path/to/parsed/files \
-Deos_prod.nomad_lab.parser_worker_rabbitmq.numberOfWorkers=1 \
-Djava.library.path=/u/fawzi/lib \
-Deos_prod.nomad_lab.parsing_stats.jdbcUrl=\\"\\" \
-Dnomad_lab.integrated_pipeline.treeparserMinutesOfRuntime=530 \
-Dnomad_lab.parser_worker_rabbitmq.rabbitMQHost=labdev-nomad.esc.rzg.mpg.de \
-jar /u/brambila/bin/nomadCalculationParserWorker-assembly-1.8.0-119-gb50befb3-dirty.jar"
#for i in $(seq -w 1 32) ; do
for i in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 ; do
\$CMD &> out-${bName}-\${i}.txt &
done
wait
This will write the parsed files in /scratch/username/nomad/parsed/"path/to/parsed/files".
1. Output: @brambila
1. ParsedArchive @brambila
1. Inserts rows in the database (which ones?) @brambila
1. Status of the process: Success / Fail @brambila
1. Process: @brambila
1. Where is the output file stored? @brambila
1. Which information is added to the database @brambila
1. Detailed Sequence of scripts and actions to go from Input to Output @brambila
1. How do you detect a failed process? @brambila
1. Actions to take in case of failure @brambila
1. List of failures @brambila
- Processing ParsedArchive: @asastre
- Input:
- Files located at /nomad/nomadlab/parsed/productionH5/S??/S*.h5
- List of normalizers; right now only MetaIndexRebuildNormalizer works
- Output: @asastre
- For each input file, a normalized file should be created on /nomad/nomadlab/normalized/newProductionH5/N??/N*.h5
- As far as I know there is no table related to this
- No control over this yet
- Process: @asastre
- I use the following script to run the normalizer on one archive (runNormalizer.sh):
- Input:
hdf5lib=lib
jar=jars/tools.jar
java -Djava.library.path=$hdf5lib -jar $jar normalize --archive-uri nmd://$1 --normalizers MetaIndexRebuildNormalizer
* I create a parallel index with the archiveGIDs of all files in order to run on several machines:
#Single machine
ls /nomad/nomadlab/normalized/newProductionH5/S??/S*.h5 | awk -F '/' '{print $NF}' | sed 's/\.h5//g' > index.txt
#Parallel running
awk -v nproc=4 '{para=NR%nproc; print > "index_"para".txt"}' index.txt
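Each index_N.txt chunk can then be consumed on its own machine by calling runNormalizer.sh once per archive GID; a minimal driver sketch (passing the chunk number as an argument is an assumption of this sketch):
#Sketch: process one index chunk, one archive GID per runNormalizer.sh call
chunk=$1   # e.g. 0..3 when nproc=4
while read -r gid ; do
    ./runNormalizer.sh "$gid"
done < "index_${chunk}.txt"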
1. To-Do:
* Control failures and classify them into data problems and normalizer problems.
* Add more normalizers.
* Define normalizers.
* When should a normalizer be run?
- Generation of Statistical information.
- Generation of Data for Elasticsearch