NOMAD Archive for parser developers
Introduction
The core architecture of the NOMAD Laboratory archive is described in the Base Layer description wiki page.
Querying
Some information on how to query the archive can be found in the querying wiki page.
Where to run
You can access a VM where the data and the whole infrastructure are set up: labdev-nomad.esc.rzg.mpg.de, which can be reached through gate.rzg.mpg.de. Please check out all the code under ~/myscratch, which is a local file system, as file operations on the GPFS-backed home directory are much too slow. You might also want to copy the ~fawzi/.ivy2 directory into your home to avoid having to download all dependencies (they are cached in it).
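A possible way to log in and set things up (the user name placeholder and the use of an ssh jump host are assumptions, adapt them to your account):

ssh -J <user>@gate.rzg.mpg.de <user>@labdev-nomad.esc.rzg.mpg.de
cd ~/myscratch            # check out and build the code here, not in the GPFS home
cp -r ~fawzi/.ivy2 ~/     # reuse the cached sbt/ivy dependencies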
Raw data and nomad uris
Everything starts from the raw data archives created from the NOMAD Repository. Each of these archives gets a unique identifier consisting of "R" followed by 28 characters (the archive gid).
You can scan any of those archives for calculations using the nomad uri nmd://<archiveGid>. A single calculation is identified by its main file, which has the nomad uri nmd://<archiveGid>/<path/to/main/file>.
Adding new raw data
New data should come from the NOMAD Repository. You can add new data there, but having it appear in the repository requires a repository parser for it. Furthermore, transferring the open access data to raw data is not fully automatic.
Luckily these two steps are independent, so you can also ask to transfer data that is not yet recognized by the repository; just ask fawzi.mohamed at fhi-berlin.mpg.de.
If you want to know the details, the raw data generation lives in the nomad-lab/raw-data-injection repository.
Kubernetes
The main pipeline uses kubernetes, as explained in the kubernetes wiki page. With it you can set up a whole pipeline and scale it across many machines to quickly parse large amounts of data in parallel.
Interactive use
Often you will not want to do that, but simply rely on the default test to parse most things and then do your checks with
sbt
tool/run parse --main-file-uri nmd://<archiveGid> --test-pipeline
This will parse the whole archive interactively; if you pass nmd://<archiveGid>/<path/to/main/file> instead, it will parse only that file.
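For example, to parse a single calculation:

tool/run parse --main-file-uri nmd://<archiveGid>/<path/to/main/file> --test-pipeline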
You can even use nmd://<calculationGid> or nmd://<parsedFileGid> if the mapping is already stored in the parsing statistics database.
The interactive run should give you enough information to debug your parser; it even prints the python command that you can re-execute to perform the python parsing step. This is normally easier than debugging in the pipeline.
Results
This run uses the default configuration, which puts the parsed results in /parsed/${USER}-default/<parserId>/<archiveGid>/<parsedFileGid>.json.
These are the results that should be used as the starting point for the analysis.
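For a quick look at one of these results (placeholders as above), something like the following works:

ls /parsed/${USER}-default/<parserId>/<archiveGid>/
python -m json.tool /parsed/${USER}-default/<parserId>/<archiveGid>/<parsedFileGid>.json | less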
Global pipeline runs
Global runs done with the pipeline can be used to identify the files that should then be analyzed interactively. These will be done at regular intervals, but if you cannot wait you can "roll your own".
Statistics
To help with this, all pipeline runs store detailed statistics in the statistics database (postgres database parsing_stats, running either on parsingstatisticsdb.default.cluster.local port 5432 inside kubernetes, or on localhost port 5435, with user parsing_stats and password pippo).
Queries on this DB can be used to find files that fail, regressions between parser versions, and so on.
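For example, to connect to the local instance (the exact table layout is not documented here, so inspect the schema first, e.g. with \dt or \d inside psql):

PGPASSWORD=pippo psql -h localhost -p 5435 -U parsing_stats parsing_stats

From there, ad-hoc SQL queries can list failing files per parser or compare results between parser versions.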
Filesystem queries for failures
An alternative way to find some information (but not everything) is to look in the filesystem: only if a parser is successful is a .json file as described before created (atomically); if a failure is encountered a *.json.failed file is created, and an internal exception while parsing generates a *.json.exception file.
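For example, assuming the default output location from above:

find /parsed/${USER}-default/<parserId> -name '*.json.failed'      # parse failures
find /parsed/${USER}-default/<parserId> -name '*.json.exception'   # internal parser exceptions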