Commit adaba438 authored by Lauri Himanen

Merge branch '1628-update-development-docs' into 'develop'

Update development docs

Closes #1628

See merge request !1427
# How to navigate the code
NOMAD is a complex project with lots of parts. This guide gives you a rough overview of
the codebase and ideas about what to look at first.
## Git Projects
There is one [main NOMAD project](https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR)
(and its [fork on GitHub](https://github.com/nomad-coe/nomad)). This project contains
all the framework and infrastructure code. It instigates all checks, builds, and
deployments for the public NOMAD service, the NOMAD Oasis, and the `nomad-lab` Python
package. All contributions to NOMAD have to go through this project eventually.
All (Git) projects that NOMAD depends on are either a Git submodule (you find
them all in the `dependencies` directory or its subdirectories) or they are
listed as PyPI packages in the `pyproject.toml` of the main project (or one of its
submodules).
You can also have a look at the [list of parsers](../reference/parsers.md) and
[built-in plugins](../reference/plugins.md) that constitute the majority of these
projects. The only other projects are [MatID](https://github.com/nomad-coe/matid),
[DOS fingerprints](https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-dos-fingerprints),
and the
[NOMAD Remote Tools Hub](https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-remote-tools-hub).
!!! note
    The GitLab organization [nomad-lab](https://gitlab.mpcdf.mpg.de/nomad-lab) and the
    GitHub organizations for [FAIRmat](https://github.com/fairmat-nfdi) and the
    [NOMAD CoE](https://github.com/nomad-coe) all represent larger infrastructure and
    research projects, and they include many other Git projects that are not related.
    When navigating the codebase, only follow the submodules.
## Python code
There are three main directories with Python code:
- `nomad`: The actual NOMAD code. It is structured into more subdirectories and modules.
- `tests`: Tests ([pytest](https://docs.pytest.org)) for the NOMAD code.
It follows the same module structure, but Python files are prefixed with `test_`.
- `examples`: A few small Python scripts that might be linked in the documentation.
The `nomad` directory contains the following "main" modules. This list is not exhaustive,
but it should help you to navigate the codebase:
- `app`: The [FastAPI](https://fastapi.tiangolo.com/) APIs: v1 and v1.2 NOMAD APIs,
[OPTIMADE](https://www.optimade.org/), [DCAT](https://www.w3.org/TR/vocab-dcat-2/),
[h5grove](https://github.com/silx-kit/h5grove), and more.
- `archive`: Functionality to store and access archive files. This is the storage format
for all processed data in NOMAD. See also the docs on
[structured data](../learn/data.md).
- `cli`: The command line interface (based on [Click](https://click.palletsprojects.com)).
Subcommands are structured into submodules.
- `config`: NOMAD is configured through the `nomad.yaml` file. This contains all the
([Pydantic](https://docs.pydantic.dev/)) models and default config parameters.
- `datamodel`: The built-in schemas (e.g. `nomad.datamodel.metainfo.simulation` used by
all the theory parsers), as well as the base sections and the sections for the shared
entry structure. See also the docs on the [datamodel](../learn/data.md) and
[processing](../learn/basics.md).
- `metainfo`: The Metainfo system, i.e. the schema language that NOMAD uses.
- `normalizing`: All the normalizers. See also the docs on
[processing](../learn/basics.md#normalizing).
- `parsing`: The base classes for parsers, matching functionality, parser initialization,
some fundamental parsers like the *archive* parser. See also the docs on
[processing](../learn/basics.md#parsing).
- `processing`: It's all about processing uploads and entries. The interface to
[Celery](https://docs.celeryq.dev/en/stable/) and [MongoDB](https://www.mongodb.com).
- `units`: The unit and unit conversion system based on
[Pint](https://pint.readthedocs.io).
- `utils`: Utility modules, e.g. the structured logging system
([structlog](https://www.structlog.org/)), id generation, and hashes.
- `files.py`: Functionality to maintain the files of uploads in staging and published
state; the interface to the file system.
- `search.py`: The interface to
[Elasticsearch](https://www.elastic.co/guide/en/enterprise-search/current/start.html).
## GUI code
The NOMAD UI is written as a [React](https://react.dev/) single-page application (SPA). It
uses (among many other libraries) [MUI](https://mui.com/),
[Plotly](https://plotly.com/python/), and [D3](https://d3js.org/). The GUI code is
maintained in the `gui` directory. Most relevant code can be found in
`gui/src/components`. The application entry point is `gui/src/index.js`.
## Documentation
The documentation is based on [MkDocs](https://www.mkdocs.org/). The important files
and directories are:
- `docs`: Contains all the Markdown files that contribute to the documentation system.
- `mkdocs.yml`: The index and configuration of the documentation. New files have to be
added here as well.
- `nomad/mkdocs.py`: Python code that defines
[macros](https://mkdocs-macros-plugin.readthedocs.io/) which can be used in Markdown.
## Other top-level directories
- `dependencies`: Contains all the submodules, e.g. the parsers.
- `ops`: Contains artifacts to run NOMAD components, e.g. `docker-compose.yaml` files,
and our Kubernetes Helm chart.
- `scripts`: Contains scripts used during the build or for certain development tasks.
# How to write a normalizer
A normalizer can be any Python algorithm that takes the archive of an entry as input
and manipulates (usually expands) the given archive. This way, a normalizer can add
additional sections and quantities based on the information already available in the
archive.
All normalizers are executed after parsing. Normalizers are run for each entry (i.e. each
set of files that represent a code run). Normalizers are run in a particular order, and
you can make assumptions about the availability of data created by other normalizers.
A normalizer is run in any case, but it might choose not to do anything. A normalizer
can perform any operation on the archive, but in general it should only add more
information, not alter existing information.
## Starting example
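For illustration, a minimal normalizer could look like the following sketch. It assumes
the older `section_run`/`section_system` schema used throughout this guide and a
hypothetical `unit_cell_volume` quantity that is defined further below; the exact
attribute names are defined by the `Normalizer` base class.

```python
from nomad.normalizing import Normalizer
from nomad.atomutils import get_volume


class UnitCellVolumeNormalizer(Normalizer):
    def normalize(self, logger=None) -> None:
        # The entry's archive is available as a field on the normalizer
        # (the attribute name may differ between NOMAD versions).
        for system in self.entry_archive.section_run[-1].section_system:
            # `unit_cell_volume` is a hypothetical quantity; it has to be
            # defined in the Metainfo as shown below.
            system.unit_cell_volume = get_volume(system.lattice_vectors.magnitude)

        # A logger is also available on the object.
        self.logger.debug('computed unit cell volumes')
```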
You simply inherit from `Normalizer` and implement the `normalize` method. The
`archive` is available as a field. There is also a logger on the object that can be used.
Be aware that the processing will already report the run of the normalizer and log its
execution time and any exceptions that might be thrown.
Of course, if you add new information to the archive, this also needs to be defined in the
Metainfo. For example you could extend the section system with a special system definition
that extends the existing section system definition:
```python
import numpy as np

from nomad.metainfo import Quantity
from nomad.datamodel.metainfo.public import section_system as System


class UnitCellVolumeSystem(System):
    # Hypothetical quantity used by the normalizer sketch above.
    unit_cell_volume = Quantity(type=np.dtype(np.float64), unit='m**3')
```
Or you simply alter the `section_system` class (`nomad/datamodel/metainfo/public.py`).
## System normalizer
There is a special base class for normalizing systems that allows you to run the
normalization on all (or only the resulting) `representative` systems:
```python
from nomad.normalizing import SystemBasedNormalizer
from nomad.atomutils import get_volume


class UnitCellVolumeNormalizer(SystemBasedNormalizer):
    # The exact name of the method to implement is defined by the
    # SystemBasedNormalizer base class; `_normalize_system` is assumed here.
    def _normalize_system(self, system, is_representative):
        system.unit_cell_volume = get_volume(system.lattice_vectors.magnitude)
```
The parameter `is_representative` will be true for the `representative` systems, i.e.
the final step in a geometry optimization or other workflow.
## Adding a normalizer to the processing
For any new normalizer class to be recognized by the processing, it has to be added to the
`normalizers` list in `nomad/normalizing/__init__.py`:

```python
normalizers: Iterable[Type[Normalizer]] = [
    # ... the existing normalizers ...
    UnitCellVolumeNormalizer,
]
```
## Testing a normalizer
To simply try out a normalizer, you could use the CLI and run the parse command:
```shell
nomad --debug parse --show-archive <path-to-example-file>
```
But eventually you need to add a more formal test. Place your `pytest` tests in
`tests/normalizing/test_unitcellvolume.py` similar to the existing tests. Necessary
test data can be added to `tests/data/normalizers`.
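For illustration, a minimal test could look like the following sketch. The archive is
built by hand, and the import of `UnitCellVolumeNormalizer` is a placeholder for wherever
you defined the class:

```python
import numpy as np

from nomad.datamodel import EntryArchive
from nomad.datamodel.metainfo.public import section_run as Run, section_system as System
from mynormalizers import UnitCellVolumeNormalizer  # hypothetical module


def test_unit_cell_volume():
    # Build a tiny archive by hand instead of parsing a real example file.
    archive = EntryArchive()
    system = archive.m_create(Run).m_create(System)
    system.lattice_vectors = np.eye(3) * 1e-10  # a 1 Å cubic cell (values in meters)

    UnitCellVolumeNormalizer(archive).normalize()

    assert system.unit_cell_volume is not None
```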
# How to write a parser
NOMAD uses parsers to convert raw code input and output files into NOMAD's common archive
format. This is the documentation on how to develop such a parser.
## Getting started
Let's assume we need to write a new parser from scratch.
First we need to install the `nomad-lab` Python package to get the necessary libraries:
```shell
pip install nomad-lab
```
We have prepared an example parser project that you can work with:
```shell
git clone https://github.com/nomad-coe/nomad-parser-example.git --branch hello-world
```
Alternatively, you can fork the example project on GitHub to create your own parser. Clone
your fork accordingly.
The project structure should be:
```text
example/exampleparser/__init__.py
example/exampleparser/__main__.py
example/exampleparser/metainfo.py
example/exampleparser/parser.py
example/tests/data/example.out
example/README.md
example/setup.py
```
Next, you should install your new parser with pip. The `-e` parameter installs the parser
in *development mode*. This means you can change the sources without having to reinstall:
```shell
cd example
pip install -e .
```
The main code file `exampleparser/parser.py` looks roughly like this (abbreviated):

```python
class ExampleParser(MatchingParser):
    def __init__(self):
        super().__init__(name='parsers/example', code_name='EXAMPLE')

    def run(self, mainfile, archive, logger):
        # `Run` comes from the common NOMAD metainfo imported at the top of
        # the file (imports not shown).
        logger.info('Hello World')

        run = archive.m_create(Run)
        run.program_name = 'EXAMPLE'
```
A parser is a simple program with a single class. The base class `MatchingParser`
provides the necessary interface to NOMAD. We provide some basic information
about our parser in the constructor. The *main* function `run` simply takes a filepath
and an empty archive as input. Now it's up to you to open the given file and populate the
given archive accordingly. In the plain *hello world*, we simply create a log entry,
populate the archive with a *root section* `Run`, and set the program name to `EXAMPLE`.
You can run the parser with the included `__main__.py`. It takes a file as its argument:
```shell
python -m exampleparser tests/data/example.out
```
The output should show the log entry and the minimal archive with one `section_run` and
the respective `program_name`:
```json
{
  "section_run": [
    {
      "program_name": "EXAMPLE"
    }
  ]
}
```
## Parsing test files
Let's do some actual parsing. Here we demonstrate how to parse ASCII files with some
structure information in them, as typically produced by materials science codes.
On the `master` branch of the example project, we have a more 'realistic' example:
```shell
git checkout master
```
This example imagines a potential code output that looks like this
(`tests/data/example.out`):
```text
2020/05/15
*** super_code v2 ***
system 1
--------
sites: H(1.23, 0, 0), H(-1.23, 0, 0), O(0, 0.33, 0)
cell: (0, 0, 0), (1, 0, 0), (1, 1, 0)
energy: 1.29372
```
At the top there is some general information. Below that is a list of simulated systems
with "sites" and "lattice" describing the crystal structure, as well as a computed energy
value as an example of a code-specific quantity from a 'magic source'.
In order to convert the information from this file into the archive, we first have to
parse the necessary quantities: the date, system, energy, etc. The `nomad-lab` Python
package provides a `text_parser` module for declarative parsing of text files. You can
define text file parsers like this:
```python
def str_to_sites(string):
    # Convert a matched site string like 'H(1.23, 0, 0)' into a symbol and a
    # position array.
    sym, pos = string.split('(')
    pos = np.array(pos.split(')')[0].split(',')[:3], dtype=float)
    return sym, pos


# The quantity definitions below are an abbreviated, illustrative sketch; see
# the example project for the full list of quantities and regular expressions.
mainfile_parser = UnstructuredTextFileParser(quantities=[
    Quantity('date', r'(\d{4}\/\d{2}\/\d{2})', repeats=False),
    Quantity(
        'calculation', r'\s*system \d+([\s\S]+?energy: [\d\.]+)', repeats=True,
        sub_parser=UnstructuredTextFileParser(quantities=[
            Quantity('sites', r'([A-Z][a-z]?\([\d\.\, \-]+\))',
                     str_operation=str_to_sites, repeats=True),
            Quantity('energy', r'energy: ([\d\.]+)', repeats=False)]))
])
```
The quantities to be parsed can be specified as a list of `Quantity` objects with a name
and a *regular expression (re)* pattern. The matched value should be enclosed in a
group(s) denoted by `(...)`.
By default, the parser uses the `findall` method of `re`, hence overlap between matches is
not tolerated. If overlap cannot be avoided, you should switch to the `finditer` method by
passing `findall=False` to the parser. Multiple matches for the quantity are returned if
`repeats=True` (default). The name, data type, shape and unit for the quantity can also be
initialized by passing a `metainfo.Quantity`.
An external function `str_operation` can also be passed to perform more specific string
operations on the matched value. A local parsing on a matched block can be carried out by
nesting a `sub_parser`. This is also an instance of the `UnstructuredTextFileParser` with
a list of quantities to parse. To access a parsed quantity, you can use the `get` method.
We can apply these parser definitions like this:
```python
mainfile_parser.mainfile = mainfile
mainfile_parser.parse()
```
This will populate the `mainfile_parser` object with parsed data and it can be accessed
like a Python dict with quantity names as keys:
```python
run = archive.m_create(Run)
run.program_name = 'super_code'

for calculation in mainfile_parser.get('calculation'):
    # For each matched calculation block, create the corresponding sections
    # (system, single configuration calculation, ...) and copy the parsed
    # values into the archive. The full mapping is elided here; see the
    # example project's parser.py for the details.
    ...
```
You can still run the parser on the given example file:
```shell
python -m exampleparser tests/data/example.out
```
Now you should get a more comprehensive archive with all the provided information from
the `example.out` file.
**TODO: more examples and explanations for unit conversion, logging, types, scalars,
vectors, and multi-line matrices.**
## Extending the Metainfo
The NOMAD Metainfo defines the schema of each archive. There are predefined schemas for
all domains (e.g. `common_dft.py` for electronic structure codes; `common_ems.py` for
experimental data, etc.). The sections `Run`, `System`, and the single configuration
calculations (`SCC`) in the example are taken from `common_dft.py`. While this covers most
of the data usually provided in code input/output files, some data is typically
format-specific and applies only to a certain code or method. For these cases, you can
extend the Metainfo like this (`exampleparser/metainfo.py`):
```python
# We extend the existing common definition of a section "single configuration calculation"
class ExampleSCC(SCC):
    # Code-specific quantities are conventionally prefixed with x_<code>_; this
    # one holds the 'magic value' from the example output.
    x_example_magic_value = Quantity(type=int, description='The magic value from a magic source.')
```

## Testing a parser

Until now, we have simply run our parser on some example data and manually observed the
output.
To improve the parser quality and ease further development, you should get into the
habit of testing the parser.
We use the Python unit test framework `pytest`:
```shell
pip install pytest
```
A typical test would take one example file, parse it, and check assertions about the
output:
```python
def test_example():
    # Assumes: `import logging`, `from nomad.datamodel import EntryArchive`,
    # and `from exampleparser import ExampleParser` at the top of the file.
    parser = ExampleParser()
    archive = EntryArchive()
    parser.run('tests/data/example.out', archive, logging)

    run = archive.section_run[0]
    assert run.program_name == 'super_code'
    # ... further assertions about the parsed systems and energies ...
```
You can run all tests in the `tests` directory like this:
```shell
pytest -svx tests
```
You should define individual test cases with example files that demonstrate certain
features of the underlying code/format.
## Structured data files with NumPy
The `DataTextParser` uses the `numpy.loadtxt` function to load a structured data file.
The loaded data can be accessed via the `data` property.
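For illustration, a rough usage sketch; the import path and the call pattern are
assumptions based on the text parser shown earlier, so check the module itself for the
actual interface:

```python
from nomad.parsing.file_parser import DataTextParser  # import path may differ by version

parser = DataTextParser()
parser.mainfile = 'tests/data/example_data.dat'  # hypothetical whitespace-separated file
# `data` exposes the array that numpy.loadtxt produced for the mainfile.
print(parser.data.shape)
```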
## XML Parser
The `XMLParser` uses the ElementTree module to parse an XML file. The `parse` method of
the parser takes in an XPath-style key to access individual quantities. By default,
automatic data type conversion is performed, which can be switched off by setting
`convert=False`.
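Again only a rough sketch; the import path, file name, XPath key, and the way the parsed
value is returned are assumptions:

```python
from nomad.parsing.file_parser import XMLParser  # import path may differ by version

parser = XMLParser()
parser.mainfile = 'tests/data/example.xml'  # hypothetical XML output file
# Assumption: `parse` returns the (type-converted) value for the given XPath-style
# key; pass convert=False to keep the raw strings instead.
energy = parser.parse('calculation/energy')
```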
## Add the parser to NOMAD
NOMAD has to know which files your parser is responsible for. During processing, the
parser for a given file is selected based on the parser attributes.
Consider the example where we use the `MatchingParser` constructor to add additional
attributes that determine which files the parser is intended for:
```python
class ExampleParser(MatchingParser):
    def __init__(self):
        # The regular expressions below are illustrative placeholders; use
        # patterns that actually identify your code's files.
        super().__init__(
            name='parsers/example', code_name='EXAMPLE',
            mainfile_mime_re=r'(application/.*)|(text/.*)',
            mainfile_contents_re=r'\*\*\* super_code v\d+ \*\*\*',
            mainfile_name_re=r'.*\.out',
            supported_compressions=['gz', 'bz2'],
            mainfile_contents_dict={'program': {'version': '1', 'name': 'EXAMPLE'}})
```
- `mainfile_mime_re`: A regular expression on the MIME type of files. The parser is run
only on files with matching MIME type. The MIME type is *guessed* with libmagic.
- `mainfile_contents_re`: A regular expression that is applied to the first 4k of a file.
The parser is run only on files where this matches.
- `mainfile_name_re`: A regular expression that can be used to match against the name and
path of the file.
- `supported_compressions`: A list of [`gz`, `bz2`] if the parser supports compressed
files.
- `mainfile_alternative`: If `True`, a file is `mainfile` unless another file in the same
directory matches `mainfile_name_re`.
- `mainfile_contents_dict`: A dictionary to match the contents of the file. If provided,
it will load the file and match the value of the key(s) provided.
Not all of these attributes have to be used. Those that are given must all match in order
to use the parser on a file.
The NOMAD infrastructure keeps a list of parser objects (in
`nomad/parsing/parsers.py::parsers`). These parsers are considered in the order they
appear in the list. The first matching parser is used to parse a given file.
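As a rough sketch (the import is hypothetical and the list is abbreviated; the real list
in `nomad/parsing/parsers.py` contains many entries), adding your parser could look like
this:

```python
from exampleparser import ExampleParser  # hypothetical import of your parser

parsers = [
    # ... the already registered parser objects ...
    ExampleParser(),  # order matters: the first parser whose attributes match wins
]
```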
While each parser project should provide its own tests, a single example file should be
added to the infrastructure parser tests (`tests/parsing/test_parsing.py`).
Once the parser is added, it also becomes available through the command line interface,
and normalizers are applied as well:
```shell
nomad parse tests/data/example.out
```
## Developing an existing parser
To refine an existing parser, you should install the parser via the `nomad-lab` package:
```shell
pip install nomad-lab
```
Clone the parser project:
```shell
git clone <parser-project-url>
cd <parser-dir>
```
Either remove the installed parser and `pip install` the cloned version:
```shell
rm -rf <path-to-your-python-env>/lib/python3.9/site-packages/<parser-module-name>
pip install -e .
```
Or set `PYTHONPATH` so that the cloned code takes precedence over the installed code:
```shell
PYTHONPATH=. nomad parse <path-to-example-file>
```
Alternatively, you can also do a full developer setup of the NOMAD infrastructure and
develop the parser from there.
# How to extend the search
## The search indices
NOMAD uses Elasticsearch as the underlying search engine. The respective indices
are automatically populated during processing and other NOMAD operations. The indices
are built from some of the archive information of each entry. These are mostly the
sections `metadata` (ids, user metadata, other "administrative" and "internal" metadata)
and `results` (a summary of all extracted (meta)data). However, these sections are not
indexed verbatim. What exactly is indexed, and how, is determined by the Metainfo and the
`elasticsearch` Metainfo extension.
### The `elasticsearch` Metainfo extension
Here is the definition of `results.material.elements` as an example:
```python
class Material(MSection):
    ...
    elements = Quantity(
        type=MEnum(chemical_symbols), shape=['0..*'],
        description='Names of the different elements present in the structure.',
        # The annotation below is what makes the quantity searchable; the full
        # definition in the NOMAD source has a few more details.
        a_elasticsearch=Elasticsearch(material_type, many_all=True))
```
Extensions are denoted with the `a_` prefix as in `a_elasticsearch`.
Since extensions can have all kinds of values, the `elasticsearch` extension is rather
complex and uses the `Elasticsearch` class.
There can be multiple values. Each `Elasticsearch` instance configures a different part
of the index. This means that the same quantity can be indexed multiple times, for
example if you need both a text- and a keyword-based search for the same data. Here
is a version of the `metadata.mainfile` definition as another example:
```python
mainfile = metainfo.Quantity(
    type=str, categories=[MongoEntryMetadata, MongoSystemMetadata],
    description='The path to the mainfile from the root directory of the uploaded files',
    a_elasticsearch=[
        Elasticsearch(_es_field='keyword'),
        Elasticsearch(
            mapping=dict(type='text', analyzer=path_analyzer.to_dict()),
            field='path', _es_field='')
    ]
)
```
### The different indices
The first (optional) argument for `Elasticsearch` determines where the data is indexed.
There are three principal places:
- the entry index (`entry_type`, default)
- the materials index (`material_type`)
- the entries within the materials index (`material_entry_type`)
#### Entry index
This is the default and is used even if another (additional) value is given. All data
is put into the entry index.
#### Materials index
This is a separate index from the entry index and contains aggregated material
information. Each document in this index represents a material. We use a hash over some
material properties (elements, system type, symmetry) to define what a material is and
which entries belong to which material.
Some parts of the material documents contain the material information that is always
the same for all entries of this material. Examples are elements, formulas, symmetry.
#### Material entries
The materials index also contains entry-specific information that allows you to filter
materials for the existence of entries with certain criteria. Examples are publish status,
user metadata, used method, or property data.
### Adding quantities
In principle, all quantities could be added to the index, but by convention and for
simplicity, only quantities defined in the sections `metadata` and `results` should be
added. This means that if you want to add custom quantities from your parser, for example,
you will also need to customize the results normalizer to copy or reference parsed data.
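As a minimal, illustrative sketch of such an addition (the section and quantity names are
hypothetical stand-ins for something you would add under `results`, and the import path of
the annotation class is an assumption):

```python
from nomad.metainfo import MSection, Quantity
from nomad.metainfo.elasticsearch_extension import Elasticsearch


class MyProperties(MSection):
    # A plain Elasticsearch() annotation indexes the quantity into the entry
    # index (the default described above).
    my_label = Quantity(type=str, a_elasticsearch=Elasticsearch())
```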
## The search API
The search API does not have to change. It automatically supports all quantities with the
`elasticsearch` extension. The keys that you can use in the API are the Metainfo paths of
the respective quantities, e.g. `results.material.elements` or `mainfile` (note that the
`metadata.` prefix is always omitted). If there are multiple `elasticsearch` annotations
for the same quantity, all but one define a `field` parameter, which is added to the
quantity path, e.g. `mainfile.path`.
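As a rough illustration (assuming the public NOMAD API URL; the exact query operators are
documented in the API reference):

```python
import requests

# Count entries whose material contains both Ti and O, using the Metainfo path
# `results.material.elements` as the query key.
response = requests.post(
    'https://nomad-lab.eu/prod/v1/api/v1/entries/query',
    json={
        'query': {'results.material.elements': {'all': ['Ti', 'O']}},
        'pagination': {'page_size': 1},
    })
print(response.json()['pagination']['total'])
```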
## The search web interface
Coming soon ...