Commit 78247f73 authored by Markus Scheidgen's avatar Markus Scheidgen

Updated documentation.

parent dba0398d
docs/components.png (image changed)
nomad@FAIR
==========
This project is a prototype for the continuation of the original NOMAD-coe software
and infrastructure with a simplified architecture and consolidated code base.
.. toctree::
   :maxdepth: 2
Introduction
============

This documentation describes the nomad software, which can be used to store,
process, and manage (computational) material science data, based on *big-data*
and *distributed computing* technologies. It comprises a set of services and
libraries that is referred to as the *nomad infrastructure*.

The original nomad software was developed as part of the NOMAD-coe project
(referred to as NOMAD-coe). This software, nomad@FAIR or just nomad, is part
of the FAIRDE.eu initiative.
There are different use-modes for the nomad software, but the most common one is
to run the nomad infrastructure on a cloud and provide clients access to
web-based GUIs and REST APIs. This nomad infrastructure logically comprises the
*nomad repository* for uploading, searching, and downloading raw calculation output
from the most relevant computational material science codes. A second part of nomad
is the archive. It allows access to all uploaded data in a common data format
via a data schema called *meta-info* that includes common and code-specific
specifications of structured data. Further services are available from
`nomad-coe.eu <http://nomad-coe.eu>`_, e.g. the *nomad encyclopedia*,
*analytics toolkit*, and *advanced graphics*.
Architecture
------------

The following depicts the *nomad@FAIR* architecture with respect to software components
in terms of python modules, gui components, and 3rd party services (e.g. databases,
search engines, etc.). It comprises a revised version of the repository and archive.
.. figure:: components.png
   :alt: nomad components

   The main modules of nomad

Nomad uses a series of 3rd party technologies that already solve most of nomad's
processing, storage, availability, and scaling goals:
minio.io
^^^^^^^^
http://minio.io is an S3-compatible object storage API that can be scaled in
various cloud and HPC contexts, running on a simple file system, NAS, or actual
object storage. We use it to store uploaded files, raw repository files for
download, and archive files. Minio enables clients to download and upload files
via presigned URLs. This allows us to provide selective, secure, and scalable
access to data files.
celery
^^^^^^
@@ -40,9 +56,8 @@ Elastic search allows for flexible scalable search and analytics.
mongodb
^^^^^^^
Mongo is used to store and track the state of the processing of uploaded files
and the calculations contained therein.
elastic stack
^^^^^^^^^^^^^
Data model
----------
.. figure:: data.png
   :alt: nomad's data model

   The main data classes in nomad
See :py:mod:`nomad.processing`, :py:mod:`nomad.users`, and :py:mod:`nomad.repo`
for further information.
Processing
----------
.. figure:: proc.png
   :alt: nomad's processing workflow

   The workflow of nomad's processing tasks
See :py:mod:`nomad.processing` for further information.
Design principles
-----------------
- simple first, complicated only when necessary
- adopting generic established 3rd party solutions before implementing specific solutions
- only uni-directional dependencies between components/modules, no cycles
- only one language: Python (except the GUI, of course)
General concepts
----------------

terms
^^^^^

There is some terminology consistently used in this documentation and the
source code:
- upload: A logical unit that comprises one (.zip) file uploaded by a user.
- calculation: A computation in the sense that it was created by an individual run of a CMS code.
- raw file: A user uploaded file (e.g. part of the uploaded .zip), usually code input or output.
- upload file/uploaded file: The actual (.zip) file a user uploaded.
- mainfile: The main output file of a CMS code run.
- aux file: An additional file the user uploaded within an upload.
- repo entry: Some quantities of a calculation that are used to represent that calculation in the repository.
- archive data: The normalized data of one calculation in nomad's meta-info-based format.
ids and hashes
^^^^^^^^^^^^^^
Throughout nomad, we use different ids and hashes to refer to entities. If something
is called *id*, it is usually a random uuid and has no semantic connection to the
entity it identifies. If something is called a *hash*, then it is a hash built from
the entity it identifies, i.e. either from the whole thing or just from some
properties of said entity.

The most common hashes are the *upload_hash* and *calc_hash*. The upload hash is
a hash over an uploaded file, as each upload usually refers to an individual user
upload (usually a .zip file). The calc_hash is a hash over the mainfile path within
an upload. The combination of upload_hash and calc_hash is used to identify
calculations. It allows us to identify calculations independently of any random
ids that are created during processing. To create hashes we use :func:`nomad.utils.hash`.
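As an illustration, here is a minimal sketch of how such hashes could be computed. The `hash_` function below is a hypothetical stand-in for :func:`nomad.utils.hash`; the real implementation may use a different digest, length, or encoding.

```python
import base64
import hashlib

def hash_(obj: str, length: int = 28) -> str:
    # Hypothetical stand-in for nomad.utils.hash: a short, URL-safe
    # digest of the given string. The actual implementation may differ.
    digest = hashlib.sha512(obj.encode('utf-8')).digest()
    return base64.urlsafe_b64encode(digest)[:length].decode('utf-8')

# upload_hash: a hash over the uploaded (.zip) file (here just an example name)
upload_hash = hash_('an_example_upload.zip')
# calc_hash: a hash over the mainfile path within the upload
calc_hash = hash_('some/path/vasprun.xml')

# the combination identifies a calculation independently of any random
# processing ids
archive_id = '%s/%s' % (upload_hash, calc_hash)
```

The point is that the same upload content and mainfile path always yield the same identifiers, regardless of when or where processing happens.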
docs/proc.png (image changed)
# Development Setup
## Introduction
The nomad infrastructure consists of a series of nomad and 3rd party services:
- nomad handler (python): a small daemon that triggers processing after upload
- nomad worker (python): task worker that will do the processing
- nomad api (python): the nomad REST API
- nomad gui: a small server serving the web-based react gui
- proxy: an nginx server that reverse proxies all services under one port
- minio: an object storage interface to all files
- elastic search: nomad's search and analytics engine
- mongodb: used to store processing state
- rabbitmq: a task queue used to distribute work in a cluster
All 3rd party services should be run via *docker-compose* (see below). The
nomad python services can also be run via *docker-compose* or manually started
with python. The gui can be run manually with a development server via yarn, or
with *docker-compose*.

Below you will find information on how to install all python dependencies and
code manually, how to use *docker*/*docker-compose*, and how to run services
with *docker-compose* or manually.

Keep in mind that *docker-compose* configures all services in a way that mirrors
the configuration of the python code in `nomad/config.py` and the gui config in
`gui/.env.development`.
## Install python code and dependencies
### Cloning and development tools
If not already done, you should clone nomad and create a python virtual environment.
To clone the repository:
```
git clone git@gitlab.mpcdf.mpg.de:mscheidg/nomad-xt.git
cd nomad-xt
```
You can use *virtualenv* to create a virtual environment. It will allow you
to keep nomad and its dependencies separate from your system's python
installation. Make sure to base the virtual environment on Python 3.

To install *virtualenv*, create an environment, and activate it, use:
```
pip install virtualenv
virtualenv -p `which python3` .pyenv
source .pyenv/bin/activate
```
We use *pip* to manage dependencies. There are multiple *requirements* files.
One of them, called *requirements-dev*, contains all tools necessary to develop
and build nomad. Install the development dependencies:
```
pip install -r requirements-dev.txt
```
### Install NOMAD-coe dependencies

Nomad is based on python modules from the NOMAD-coe project.
This includes parsers, normalizers, python-common, and the meta-info.
Those dependencies are managed and configured via python in
`nomad/dependencies.py`. This gives us more flexibility in interacting with
different parser and normalizer versions from within the running nomad
infrastructure.
This step is somewhat optional. These dependencies are only needed for
processing. If you do not develop on the processing, do not need to run workers
from your environment, and only use the docker image for processing, you can
skip this step.
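To illustrate the idea, a dependency declaration in `nomad/dependencies.py` might look roughly like the following. This is a hedged sketch: the real `PythonGit` class signature and the repository URL are assumptions, not the actual API.

```python
# Hypothetical sketch of a NOMAD-coe dependency declaration; the real
# PythonGit class in nomad/dependencies.py may have a different signature.
class PythonGit:
    def __init__(self, name: str, git_url: str, git_commit: str):
        self.name = name              # e.g. 'python-common'
        self.git_url = git_url        # the legacy module's git repository
        self.git_commit = git_commit  # pinned version/branch to check out

# example list of managed dependencies (URL is an assumption)
dependencies = [
    PythonGit(
        name='python-common',
        git_url='https://gitlab.mpcdf.mpg.de/nomad-lab/python-common.git',
        git_commit='master'),
]
```

Pinning each legacy module to a specific commit is what lets the running infrastructure switch between parser and normalizer versions in a controlled way.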
First, install some prerequisite requirements. We compiled the
*requirements-dep* file with python modules that are commonly used in
NOMAD-coe. It is optional, but you should install them first, as they sort out
some issues with installing dependencies in the right order later:
```
pip install -r requirements-dep.txt
```
To actually run the dependencies script:
```
python nomad/dependencies.py
```
This will check out the proper version of the respective NOMAD-coe modules,
install further requirements, and install the modules themselves.
### Install nomad

Finally, you can add nomad itself to the environment:
```
pip install -r requirements.txt
pip install -e .
```
## Build and run the infrastructure with docker
### Docker and nomad
Nomad depends on a set of databases, a search engine, and other services. Those
must run to make use of nomad. We use *docker* and *docker-compose* to create a
unified environment that is easy to build and to run.

You can use *docker* to run all necessary 3rd-party components and run all nomad
services manually from your python environment. Or you can run everything within
docker containers. The former is often preferred during development, since it
allows you to change things, debug, and re-run quickly. The latter brings you
closer to the environment that will be used to run nomad in production.
### Docker images for nomad
There are currently three different images and respectively three different docker files:
`requirements.Dockerfile`, `backend.Dockerfile`, and `frontend.Dockerfile`.
Nomad currently comprises three services: the *handler* (deals with user
uploads), the *worker* (does the actual processing), and the *api*. These
services can be run from one image that has the nomad python code and all
dependencies installed. This is covered by the `backend.Dockerfile`.

The `requirements.Dockerfile` builds an image that has all dependencies
pre-installed. We keep it separate, because the dependencies change rather
seldom and we do not want to reinstall them every time we build new images.

The frontend image is only for building and serving the gui.
Build the requirements image tagged `nomadxt_requirements`:
```
docker build -t nomadxt_requirements -f requirements.Dockerfile .
```
The other images are built via *docker-compose* and don't have to be created manually.
### Build with docker-compose
Now we can build the *docker-compose* setup that contains all external services
(rabbitmq, mongo, elastic, minio, elk) and nomad services (worker, handler,
api, gui).
```
cd ./infrastructure/nomadxt
docker-compose build
```
Docker-compose tries to cache individual build steps. Sometimes this causes
trouble and not everything necessary is rebuilt when you changed something. In
these cases use:
```
docker-compose build --no-cache
```
### Run everything with docker-compose
You can run all containers with:
```
docker-compose up
```
To shut down everything, just `ctrl-c` the running output. If you started
everything in *daemon* mode (`-d`), use:
```
docker-compose down
```
### Run containers selectively
The following services/containers are managed via our docker-compose:
- rabbitmq, minio, minio-config, mongo, elastic, elk
- worker, handler, api
- gui
- proxy
The *proxy* container runs *nginx* based reverse proxies that put all services under
a single port and different paths.
You can also run services selectively, e.g.
```
docker-compose up -d rabbitmq minio minio-config mongo elastic elk
docker-compose up worker handler
docker-compose up api gui proxy
```
## Accessing 3rd party services

Usually these services are only used by the nomad containers, but sometimes you
also need to check something or do some manual steps.
The file `infrastructure/nomadxt/.env` contains variables that control the ports
used to bind internal docker ports to your host machine. These are the ports you
have to use to connect to the respective services.
### ELK (elastic stack)
If you run the ELK stack (and enable logstash in nomad/config.py),
you can reach the Kibana with [localhost:5601](http://localhost:5601).
The index prefix for logs is `logstash-`.
### Minio
If you want to access the minio object storage via the mc client, register the
infrastructure's minio host to the minio client (mc).
```
mc config host add minio http://localhost:9007 AKIAIOSFODNN7EXAMPLE wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
### Serving minio, api, gui from one nginx

The docker-compose is set up to serve all client-accessible services from one
webserver via nginx *proxy_pass* directives. This is done in the `proxy`
container.
### mongodb and elastic search
You can access mongodb and elastic search via your preferred tools. Just make
sure to use the right ports (see above).
## Run nomad services manually
You can run the worker, handler, api, and gui as part of the docker
infrastructure, as seen above. But of course there are always reasons to run
them manually during development, like running them in a debugger, profiler,
etc.
### Run the nomad worker manually
You can also run the worker yourself, e.g. to develop on the processing. To
simply run a worker, do (from the root):
```
celery -A nomad.processing worker -l info
```
@@ -140,7 +240,7 @@ python nomad/api.py
### Run the gui
When you run the gui on its own (e.g. with the react dev server below), you
also have to run the API manually. This *inside docker* API is configured for
nginx paths and proxies, which are run by the gui container. But you can run
the *production* gui in docker and the dev server gui in parallel with an API
in docker.
Either with docker, or:
```
yarn
yarn start
```
## Run the tests
You need to have the infrastructure partially running: minio, elastic, rabbitmq,
redis. The rest should be mocked or provided by the tests. Make sure that you do
not run any worker and handler in parallel, as they will fight for tasks in the
queue.
@@ -37,7 +37,7 @@ FSConfig = namedtuple('FSConfig', ['tmp'])
ElasticConfig = namedtuple('ElasticConfig', ['host', 'calc_index'])
""" Used to configure elastic search. """
MongoConfig = namedtuple('MongoConfig', ['host', 'port', 'users_db'])
""" Used to configure mongo db. """
LogstashConfig = namedtuple('LogstashConfig', ['enabled', 'host', 'tcp_port', 'level'])
@@ -81,6 +81,7 @@ elastic = ElasticConfig(
)
mongo = MongoConfig(
host=os.environ.get('NOMAD_MONGO_HOST', 'localhost'),
port=int(os.environ.get('NOMAD_MONGO_PORT', 27017)),
users_db='users'
)
logstash = LogstashConfig(
@@ -22,6 +22,21 @@ from .fhiaims import FhiAimsBaseNormalizer
"""
After parsing, calculations have to be normalized with a set of *normalizers*.
In NOMAD-coe those were programmed in python (we'll reuse them) and scala
(we'll rewrite those). Currently the normalizers are:

- system.py
- symmetry.py
- fhiaims.py
- systemtype.py

The normalizers are available via

.. autodata:: nomad.normalizing.normalizers

There is one ABC for all normalizers:

.. autoclass:: nomad.normalizing.normalizer.Normalizer
    :members:
"""
# The loose explicit type is necessary to avoid an ABC class as item type that causes
@@ -30,7 +30,7 @@ r_frame_sequence_local_frames = 'frame_sequence_local_frames_ref'
class Normalizer(metaclass=ABCMeta):
"""
A base class for normalizers. Normalizers work on a :class:`AbstractParserBackend` instance
for read and write. Normalizer instances are reused.
Arguments:
backend: the backend used to read and write data from and to
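To make the pattern concrete, here is a hedged, self-contained sketch of how a normalizer works against a backend. Both the `Backend` stand-in and the `SystemSizeNormalizer` are hypothetical; the real `AbstractParserBackend` and base class API differ.

```python
from abc import ABCMeta, abstractmethod

class Backend:
    # Minimal stand-in for the parser backend that normalizers read and write.
    # The real AbstractParserBackend has a richer API.
    def __init__(self):
        self.data = {'atom_labels': ['H', 'H', 'O']}

    def get_value(self, name):
        return self.data[name]

    def add_value(self, name, value):
        self.data[name] = value

class Normalizer(metaclass=ABCMeta):
    # Sketch of the ABC: holds the backend, subclasses implement normalize().
    def __init__(self, backend):
        self._backend = backend

    @abstractmethod
    def normalize(self) -> None: ...

class SystemSizeNormalizer(Normalizer):
    # A hypothetical normalizer that derives a new quantity from parsed data.
    def normalize(self) -> None:
        atoms = self._backend.get_value('atom_labels')
        self._backend.add_value('number_of_atoms', len(atoms))

backend = Backend()
SystemSizeNormalizer(backend).normalize()
```

Since normalizer instances are reused across calculations, a real implementation must not keep per-calculation state beyond the backend it is handed.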
@@ -32,17 +32,28 @@ Each parser is defined via an instance of :class:`Parser`.
.. autoclass:: nomad.parsing.Parser
:members:
The implementation :class:`LegacyParser` is used for most NOMAD-coe parsers.
.. autoclass:: nomad.parsing.LegacyParser
The parser definitions are available via the following two variables.
.. autodata:: nomad.parsing.parsers
.. autodata:: nomad.parsing.parser_dict
Parsers are reused for multiple calculations.

Parsers in NOMAD-coe use a *backend* to create output. There are different
NOMAD-coe backends. In nomad@FAIR, we currently only use a single backend, a
version of NOMAD-coe's *LocalBackend*. It stores all parser results in memory.
The following classes provide an interface definition for *backends* as an ABC
and a concrete implementation based on NOMAD-coe's *python-common* module.
.. autoclass:: nomad.parsing.AbstractParserBackend
:members:
.. autoclass:: nomad.parsing.LocalBackend
:members:
"""
from nomad.parsing.backend import AbstractParserBackend, LocalBackend, LegacyLocalBackend, JSONStreamWriter, BadContextURI, WrongContextState
@@ -36,6 +36,7 @@ class Parser(metaclass=ABCMeta):
main_file_re: A regexp that matches main file paths that this parser can handle.
main_contents_re: A regexp that matches main file headers that this parser can parse.
"""
@abstractmethod
def is_mainfile(self, filename: str, open: Callable[[str], IO[Any]]) -> bool:
""" Checks if a file is a mainfile via the parsers ``main_contents_re``. """
@@ -58,7 +59,16 @@ class Parser(metaclass=ABCMeta):
class LegacyParser(Parser):
"""
A parser implementation for legacy NOMAD-coe parsers. Uses a
:class:`nomad.dependencies.PythonGit` to specify the old parser repository. It
uses regular expressions to match parsers to mainfiles.
Arguments:
    python_git: the git repository and commit that contains the legacy parser
    parser_class_name: the main parser class that implements NOMAD-coe's
        python-common *ParserInterface*. Instances of this class are currently
        not reused.
    main_file_re: A regexp that is used to match the paths of potential mainfiles
    main_contents_re: A regexp that is used to match the first 500 bytes of a
        potential mainfile.
"""
def __init__(
self, python_git: PythonGit, parser_class_name: str, main_file_re: str,
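A sketch of how the two regexps could be used to decide whether a file is a mainfile. The patterns below are made-up examples and the matching logic is assumed; the real `is_mainfile` in :class:`LegacyParser` may differ.

```python
import re

# Hypothetical example patterns; real parsers define their own regexps.
main_file_re = re.compile(r'^.*\.xml$')
main_contents_re = re.compile(r'<\?xml')

def is_mainfile(path: str, contents: bytes) -> bool:
    # The path must match main_file_re, and the first 500 bytes of the
    # file must match main_contents_re, as described in the docstring above.
    if main_file_re.match(path) is None:
        return False
    header = contents[:500].decode('utf-8', errors='ignore')
    return main_contents_re.search(header) is not None
```

Checking the path first keeps the common case cheap: file contents are only inspected for candidates whose names already look plausible.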
@@ -3,7 +3,7 @@ import systax.analysis.symmetryanalyzer
# A patch for the segfault protection of systax (which internally uses protection
# for spglib calls). We basically disable the protection, since the original
# multiprocessing-based protection somehow interferes with the celery worker
# infrastructure and leads to a deadlock. It's a TODO.
def segfault_protect_patch(f, *args, **kwargs):
return f(*args, **kwargs)
@@ -17,17 +17,20 @@ Processing comprises everything that is necessary to take an uploaded user file,
process it, and store all necessary data for *repository*, *archive*, and
potential future services (e.g. *encyclopedia*).

Processing is built on top of *celery* (http://www.celeryproject.org/) and
*mongodb* (http://www.mongodb.org).

Celery provides a task-based programming model for distributed computing. It uses
a broker, e.g. a distributed task queue like *RabbitMQ*, to distribute tasks. We
use mongodb to store the current state of processing in :class:`Upload` and
:class:`Calculation` documents. This combination allows us to easily distribute
processing work while having the processing state, i.e. (intermediate) results,
always available.
This module is structured into our *celery app* and abstract process base class
:class:`Proc` (``base.py``), the concrete processing classes
:class:`Upload` and :class:`Calc` (``data.py``), and the *handler* service that
initiates processing based on file storage notifications from *minio*
(``handler.py``, ``handlerdaemon.py``).
This module does not contain the functions to do the actual work. Those are encapsulated
in :py:mod:`nomad.files`, :py:mod:`nomad.repo`, :py:mod:`nomad.users`,
@@ -40,15 +43,14 @@ Refer to http://www.celeryproject.org/ to learn about celery apps and workers. The
nomad celery app uses a *RabbitMQ* broker. We use celery to distribute processing load
in a cluster.
Processing
----------
We use an abstract processing base class and document (:class:`Proc`) that
provides all necessary functions to execute a process as a series of potentially
distributed steps. In addition, the processing state is persisted in mongodb
using *mongoengine*. Instead of exchanging serialized state between celery
tasks, we use the mongodb documents to exchange data. Therefore, the mongodb
always contains the latest processing state.
We also don't have to deal with celery result backends and synchronizing with them.
.. autoclass:: nomad.processing.base.Proc
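The idea behind :class:`Proc` can be sketched in memory as follows. This is a hedged illustration only: the real class persists its state via mongoengine, and the task names and method names here are assumptions.

```python
class Proc:
    # In-memory stand-in for the mongoengine-backed process document.
    tasks = []  # ordered task names, defined by subclasses

    def __init__(self):
        self.status = 'PENDING'
        self.current_task = None

    def proceed(self, task: str):
        # Advance the process to the given task and persist the new state.
        assert task in self.tasks
        self.current_task = task
        self.status = 'RUNNING'
        self.save()
        if task == self.tasks[-1]:
            self.status = 'SUCCESS'
            self.save()

    def save(self):
        # In nomad this would persist the document to mongodb, so the
        # latest state is always available to other workers and the api.
        pass

class Upload(Proc):
    # hypothetical task sequence for an upload process
    tasks = ['extracting', 'parse_all', 'cleanup']
```

Because every state change goes through `save()`, any worker or the api can query the current state of a process at any time, without celery result backends.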
@@ -57,9 +59,9 @@ There are two concrete processes :class:`Upload` and :class:`Calc`. Instances of these
classes represent the processing state, as well as the respective entity.
.. figure:: proc.png
   :alt: nomad processing workflow

   This is the basic workflow of a nomad upload processing.
.. autoclass:: nomad.processing.data.Upload
:members:
@@ -35,7 +35,7 @@ import nomad.patch # pylint: disable=unused-import
def mongo_connect():
    return connect(db=config.mongo.users_db, host=config.mongo.host, port=config.mongo.port)
if config.logstash.enabled:
def initialize_logstash(logger=None, loglevel=logging.DEBUG, **kwargs):
@@ -51,7 +51,7 @@ app = Celery('nomad.processing', broker=config.celery.broker_url)
# ensure elastic and mongo connections
if 'sphinx' not in sys.modules:
    connect(db=config.mongo.users_db, host=config.mongo.host, port=config.mongo.port)
PENDING = 'PENDING'
from nomad.processing.data import Upload
import time

if __name__ == '__main__':
    suspicious = {}
    while True:
        for upload in Upload.objects(status='RUNNING', current_task='parse_all'):
            if upload.total_calcs == upload.processed_calcs:
                if upload.upload_id in suspicious:
                    del suspicious[upload.upload_id]
                    upload.status = 'SUCCESS'
                    upload.save()
                    print('Fixed suspicious %s' % upload.upload_id)
                else:
                    print('Found suspicious %s' % upload.upload_id)
                    suspicious[upload.upload_id] = upload.upload_id
        time.sleep(1)
@@ -14,10 +14,10 @@
"""
This module is about maintaining the repository search index and providing all
data to the repository related parts of nomad.

We use *elasticsearch_dsl* to interface with elastic search. The class
:class:`RepoCalc` is an elasticsearch_dsl document that is used to represent
repository index entries.