Commit b0d59050 authored by Markus Scheidgen

Added more documentation on data, schemas, and plugins.

Changelog: Added
parent a459ac0f
1 merge request: !1390 Added more learn documentation on data and schemas.
Pipeline #172320 passed
Showing 331 additions and 111 deletions
This guide is about using NOMAD's REST APIs directly, e.g. via Python's *requests*.
To access the processed data with our client library `nomad-lab` follow
[How to access the processed data](archive_query.md). You can watch our
[How-to access the processed data](archive_query.md). You can watch our
[video tutorial on the API](../tutorial.md#access-data-via-api).
## Different options to use the API
......@@ -46,7 +46,7 @@ API functions that allows you to try these functions in the browser.
Install the [NOMAD Python client library](pythonlib.md) and use its `ArchiveQuery`
functionality for a more convenient, query-based access to archive data following the
[How to access the processed data](archive_query.md) guide.
[How-to access the processed data](archive_query.md) guide.
## Using requests
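For example, a minimal query against the public NOMAD API might look like this (a sketch; the query values and printed fields are illustrative):

```python
import requests

base_url = 'https://nomad-lab.eu/prod/v1/api/v1'

# Query the entries endpoint; query and pagination values are example choices.
response = requests.post(
    f'{base_url}/entries/query',
    json={
        'query': {'results.material.elements': ['Si']},
        'pagination': {'page_size': 1},
    },
)
response.raise_for_status()
result = response.json()
print(result['pagination']['total'], 'matching entries')
```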
......@@ -250,7 +250,7 @@ the API:
- Raw files, the files as they were uploaded to NOMAD.
- Archive data, all of the extracted data for an entry.
There are also different entities (see also [Datamodel](../learn/how_nomad_works.md#datamodel-uploads-entries-files-datasets)) with different functions in the API:
There are also different entities (see also [Datamodel](../learn/basics.md)) with different functions in the API:
- Entries
- Uploads
......
......@@ -5,7 +5,7 @@ based on the schema rather than plain JSON. See also this guide on using
to work with processed data.
As a requirement, you have to install the `nomad-lab` Python package. Follow the
[How to install nomad-lab](pythonlib.md) guide.
[How-to install nomad-lab](pythonlib.md) guide.
## Getting started
......
# How to run a parser
# How-to run a parser
You can find a [list of all parsers](../reference/parsers.md) and supported files in the reference.
......
......@@ -13,7 +13,7 @@ button. This will bring you to the upload page.
Click the `CREATE ENTRY` button. This will bring up a dialog to choose an ELN schema.
All ELNs (as any entry in NOMAD) need to follow a schema. You can choose from uploaded
custom schemas or NOMAD build-in schemas. You can choose the `Basic ELN` to create a
custom schemas or NOMAD built-in schemas. You can choose the `Basic ELN` to create a
simple ELN entry.
The name of your ELN entry will be the filename for your ELN without the `.archive.json`
......@@ -33,5 +33,5 @@ click the `ADD EXAMPLE UPLOADS` button. The `Electronic Lab Notebook` example, w
contain a schema and entries that instantiate different parts of the schema.
The *ELN example sample* (`sample.archive.json`) demonstrates what you can do.
Follow the [How to write a schema](../schemas/basics.md) and [How to define ELN](../schemas/elns.md)
Follow the [How-to write a schema](../schemas/basics.md) and [How-to define ELN](../schemas/elns.md)
guides to create your own customized ELNs.
......@@ -170,7 +170,7 @@ Please follow the following rules when logging:
- If a logger is not already provided, only use
:py:func:`nomad.utils.get_logger` to acquire a new logger. Never use the
build-in logging directly. These loggers work like the system loggers, but
built-in logging directly. These loggers work like the system loggers, but
allow you to pass keyword arguments with additional context data. See also
the [structlog docs](https://structlog.readthedocs.io/en/stable/).
- In many contexts, a logger is already provided (e.g. api, processing, parser, normalizer).
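For example (a minimal sketch; the keyword arguments are made-up context data):

```python
from nomad import utils

# Acquire a structured logger instead of using Python's built-in logging directly.
logger = utils.get_logger(__name__)

# Context data goes into keyword arguments, not into the message string.
logger.info('parsed mainfile', mainfile='data/vasprun.xml', n_calculations=3)
```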
......
# How to write a normalizer
# How-to write a normalizer
## Introduction
......
......@@ -34,14 +34,14 @@ It covers the whole publish, explore, analyze cycle:
</div>
<div markdown="block">
### How to guides
### How-to guides
The documentation provides step-by-step instructions for a wide range of tasks. For example:
- [How to upload and publish data](data/upload.md)
- [How to write a custom ELN](schemas/elns.md)
- [How to run a parser locally](apis/local_parsers.md)
- [How to install NOMAD Oasis](oasis/install.md)
- [How-to upload and publish data](data/upload.md)
- [How-to write a custom ELN](schemas/elns.md)
- [How-to run a parser locally](apis/local_parsers.md)
- [How-to install NOMAD Oasis](oasis/install.md)
</div>
......
docs/learn/architecture.png (image replaced)
docs/learn/archive-example.png (new image)
NOMAD is based on a *bottom-up* approach to data management. Instead of only supporting data in a specific
predefined format, we process files to extract data from an extendable variety of data formats.
Converting heterogeneous files into homogeneous, machine-actionable processed data is the
basis for making data FAIR. It allows us to build search interfaces, APIs, visualization, and
analysis tools independent of specific file formats.
<figure markdown>
![datamodel](datamodel.png)
<figcaption>NOMAD's datamodel and processing</figcaption>
</figure>
## Uploads
Users create **uploads** to organize files. Think of an upload like a project:
many files can be put into a single upload and an upload can be structured with directories.
You can collaborate on uploads, share uploads, and publish uploads. The files in an
upload are called **raw files**.
Raw files are managed by users and they are never changed by NOMAD.
!!! note
As a *rule*, **raw files** are not changed during processing (or otherwise). However,
to achieve certain functionality, a parser, normalizer, or schema developer might decide to bend
this rule. Use cases include generating additional mainfiles (and entries), updating
related mainfiles to automate ELNs, or
generating additional files to convert a mainfile into a standardized format like NeXus or CIF.
## Entries
All uploaded **raw files** are analysed to find files with a recognized format. Each file
that follows a recognized format is a **mainfile**. For each mainfile, NOMAD will create
a database **entry**. The entry is permanently tied to its mainfile. The entry id, for example,
is a hash over the upload id and the mainfile path (and an optional key) within the upload.
This **matching** process is automatic, and users cannot create entries
manually.
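To illustrate the idea (a simplified sketch, not necessarily NOMAD's exact implementation):

```python
import base64
import hashlib
from typing import Optional

def entry_id(upload_id: str, mainfile: str, key: Optional[str] = None) -> str:
    # Combine upload id, mainfile path, and the optional key into one stable hash.
    parts = [upload_id, mainfile] + ([key] if key else [])
    digest = hashlib.sha512('|'.join(parts).encode('utf-8')).digest()
    # Shorten to a URL-safe string that can serve as an identifier.
    return base64.urlsafe_b64encode(digest)[:28].decode('ascii')
```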
!!! note
We say that raw files are not changed by NOMAD and that users cannot create entries,
so what about ELNs? There is a *create entry* button in the UI. However,
NOMAD simply creates an editable **mainfile** that indirectly creates an entry.
The user might use NOMAD as an editor to change the file, but the content is
determined by the user, in contrast to the processed data that is created
from raw files by NOMAD.
## Processing
The processing of entries is also automatic. Initially, and on each mainfile change,
the entry corresponding to the mainfile is processed. Processing consists of
**parsing**, **normalizing**, and storing the created data.
### Parsing
Parsers are small programs that transform data from a recognized *mainfile* into a
structured, machine-processable tree of data that we call the *archive* or [**processed data**](data.md)
of the entry. Only one parser is used for each entry. Which parser is used is determined
during matching and depends on the file format. Parsers can be added to NOMAD as
[plugins](../plugins/parsers.md); see the list of [all built-in parsers](../reference/parsers.md).
!!! note
A special case is the parsing of NOMAD archive files. Usually a parser converts a file
from a source format into NOMAD's *archive* format for processed data. But users can
also create files following this format themselves. They can be uploaded either as `.json` or `.yaml` files
by using the `.archive.json` or `.archive.yaml` extension. In these cases, we also consider
these files mainfiles, and they are processed as well. Here the parsing
is a simple syntax check that basically just copies the data, but normalization might
still modify and augment the data substantially. One use case for these *archive* files
is ELNs: the NOMAD UI acts as an editor for the respective `.json` file, but on each save, the
corresponding file goes through all the regular processing steps. This allows
ELN schema developers to add all kinds of functionality, such as updating referenced
entries, parsing linked files, or creating new entries for automation.
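For illustration, a minimal user-created archive mainfile could be written like this (the referenced section definition and the `name` quantity are placeholders; they must match an actual schema):

```python
import json

# The `data` section instantiates a section definition. The `m_def` value is a
# placeholder reference to a definition in an uploaded schema file.
archive = {
    'data': {
        'm_def': '../upload/raw/schema.archive.yaml#/definitions/section_definitions/0',
        'name': 'my first sample',
    }
}

with open('sample.archive.json', 'w') as f:
    json.dump(archive, f, indent=2)
```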
### Normalizing
While parsing converts a mainfile into processed data, normalizing only operates on the
processed data. Learn more about why we normalize in the documentation on [structured data](./data.md).
There are two principal ways to implement normalization in NOMAD:
**normalizers** and **normalize** functions.
[Normalizers](../develop/normalizers.md) are small programs that take processed data as input.
There is a list of normalizers registered in the [NOMAD configuration](../reference/config.md#normalize).
In the future, normalizers might be
added as plugins as well. They run in the configured order. Every normalizer is run
on all entries, and each normalizer decides whether to act depending on what
it sees in the processed data.
Normalize functions are special functions implemented as part of section definitions
in [Python schemas](../plugins/schemas.md#writing-schemas-in-python-compared-to-yaml-schemas).
There is a special normalizer that will go through all processed data and execute these
functions if they are defined. Normalize functions get the respective section instance as
input. This allows [schema plugin](../plugins/schemas.md) developers to add normalizing to their sections.
Read about our [structured data](./data.md) to learn more about the different sections.
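For example, a minimal sketch of a section definition with a normalize function (the class and its quantities are made up for illustration):

```python
from nomad.datamodel.data import EntryData
from nomad.metainfo import Quantity

class Sample(EntryData):
    length = Quantity(type=float, unit='m')
    width = Quantity(type=float, unit='m')
    area = Quantity(type=float, unit='m**2')

    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        # Derive a quantity from other quantities of the same section instance.
        if self.length is not None and self.width is not None:
            self.area = self.length * self.width
```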
### Storing and indexing
As a last technical step, the processed data is stored and some information is passed
into the search index. The store for processed data is internal to NOMAD; processed
data cannot be accessed directly, only via the [archive API](../apis/api.md#access-processed-data-archives)
or the [ArchiveQuery](../apis/archive_query.md) Python library functionality.
What information is stored in the search index is determined
by the *metadata* and *results* sections and cannot be changed by users or plugins.
However, all scalar values in the processed data are also indexed as key-value pairs.
!!! attention
This part of the documentation should be more substantiated. There will be a learn section
about the search soon.
## Files
We already said that all uploaded files are **raw files**. Recognized files that have an
entry are called **mainfiles**. Only the mainfile of the entry is
passed to the parser during processing. However, a parser can call other tools or read other files.
Therefore, we consider all files in the same directory as the mainfile to be **auxiliary files**,
even though there is not necessarily a formal relationship with the entry. Whether
formal relationships with auxiliary files are established, e.g. via a reference to the file within
the processed data, is up to the parser.
## Datasets
Users can build collections of entries to form **datasets**. You can imagine datasets
like tags or albums in other systems. Each entry can be contained in many datasets, and
a dataset can hold many entries. Datasets can also overlap. Datasets are only
indirectly related to files. The main purpose of **datasets** in NOMAD is to have citable
collections of data. Users can get a DOI for their datasets. Datasets have no influence
on the processing of data.
docs/learn/data-flow.png (new image)
# Structured data and the NOMAD Metainfo
NOMAD structures data into **sections**, where each section can contain data and more sections.
This allows you to browse complex data as you would browse files and directories on your computer.
Each section follows a **definition**, and all the contained data and sub-sections have a
specific name, description, possible type, shape, and unit. This means that all data follows a **schema**.
This not only helps human exploration, but also makes the data machine-interpretable,
increases consistency and interoperability, and enables search, APIs, visualization, and
analysis.
<figure markdown>
![processed data screenshot](screenshot.png)
<figcaption>Browsing structured data in the NOMAD UI (<a href="https://nomad-lab.eu/prod/v1/gui/search/entries/entry/id/zQJMKax7xk384h_rx7VW_-6bRIgi/data/run/0/system/0/atoms/positions">link</a>)</figcaption>
</figure>
## Schema language
The basis for structured data is a schema written in a **schema language**. Our
schema language is called the **NOMAD Metainfo** language. It
provides the means to define sections, organize definitions into **packages**, and define
section properties (**sub-sections** and **quantities**).
<figure markdown>
![schema language](schema_language.png)
<figcaption>The NOMAD Metainfo schema language for structured data definitions</figcaption>
</figure>
Packages contain section definitions, and section definitions contain definitions for
sub-sections and quantities. Sections can inherit the properties of other sections. While
sub-sections allow you to define containment hierarchies for sections, quantities can
use section definitions (or other quantity definitions) as a type to define references.
If you are familiar with other schema languages and other means to define structured data
(JSON Schema, XML Schema, pydantic, database schemas, ORM, etc.), you might recognize
these concepts under different names. Sections are similar to *classes*, *concepts*, *entities*, or *tables*.
Quantities are related to *properties*, *attributes*, *slots*, or *columns*.
Sub-sections might be called *containment* or *composition*. Sub-sections and quantities
with a section type also define *relationships*, *links*, or *references*.
Our guide on [how-to write a schema](../schemas/basics.md) explains these concepts with an example.
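As a small illustration (the names are invented), the same concepts expressed in the Python flavor of the Metainfo:

```python
from nomad.metainfo import MSection, Quantity, SubSection

class Atom(MSection):
    # Quantities define data with a name, type, shape, and unit.
    label = Quantity(type=str)
    position = Quantity(type=float, shape=[3], unit='angstrom')

class System(MSection):
    # Sub-sections define containment hierarchies.
    atoms = SubSection(sub_section=Atom, repeats=True)
    # Quantities with a section type define references.
    representative_atom = Quantity(type=Atom)
```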
## Schema
NOMAD represents many different types of data. Therefore, we cannot speak of just *the one*
schema. The entirety of NOMAD schemas is called the **NOMAD Metainfo**.
Definitions used in the NOMAD Metainfo fall into three different categories. First,
we have sections that define a **shared entry structure**. Those are independent of the
type of data (and processed file type). They allow all generic parts to be found without
any deeper understanding of the specific data. Second, we have definitions of
**re-usable base sections** for shared common concepts and their properties.
Specific schemas can use and extend these base sections. Base sections define a fixed
interface or contract that can be used to build tools (e.g. search, visualizations, analysis)
around them. Lastly, there are **specific schemas**. Those re-use base sections and
complement the shared entry structure. They define specific data structures to represent
specific types of data.
<figure markdown>
![schema language](schema.png)
<figcaption>
The three different categories of NOMAD schema definitions
</figcaption>
</figure>
### Shared entry structure
The processed data (archive) of each entry shares the same structure. All entries instantiate
the same root section `EntryArchive` and share the common sections `metadata:EntryMetadata`
and `results:Results`. They also all contain a *data* section, but the section
definition used varies depending on the type of data of the specific entry. There is the
literal `data:EntryData` sub-section, where `EntryData` is abstract and specific entries
use concrete definitions that inherit from `EntryData`. There are also specific *data*
sections, like `run` for simulation data and `nexus` for NeXus data.
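Sketched as a simplified dictionary, the shared top level of every archive looks roughly like this:

```python
# Simplified sketch of the shared top-level structure of every entry's archive.
archive_skeleton = {
    'metadata': {},  # EntryMetadata: entry and upload ids, authors, timestamps, ...
    'results': {},   # Results: searchable, interoperable summary of the data
    'data': {},      # EntryData: concrete definition varies per type of entry
    # plus specific data sections, e.g. 'run' for simulations or 'nexus' for NeXus data
}
```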
!!! attention
The results, originally only designed for computational data, will soon be revised
and replaced by a different section. However, the necessity and function of a section
like this remains.
<figure markdown>
![schema language](super_structure.png)
<figcaption>
All entries instantiate the same root section and share the same structure.
</figcaption>
</figure>
### Base sections
Base section is a very loose category. In principle, every section definition can be
inherited from or can be re-used in different contexts. There are some dedicated (or even abstract)
base section definitions (mostly defined in the `nomad.datamodel.metainfo` package and sub-packages),
but schema authors should not strictly limit themselves to these definitions.
The goal is to re-use as much as possible and to not re-invent the same sections over
and over again. Tools built around certain base sections provide an incentive to
use them.
!!! attention
There is no detailed how-to or reference documentation on the existing base sections
and how to use them yet.
One example of re-usable base sections is the [workflow package](../schemas/workflows.md).
It allows workflows to be defined in a common way and placed in
the shared entry structure, and the UI provides a card with workflow visualization and
navigation for all entries that contain a workflow.
!!! attention
Currently there are two versions of the workflow schema. They are stored in two
top-level `EntryArchive` sub-sections (`workflow` and `workflow2`). This
will change soon to something that supports multiple workflows used in
specific schemas and results.
### Specific schemas
Specific schemas allow users and plugin developers to describe their data in full detail.
However, users (and machines) not familiar with the specifics will struggle to interpret
these kinds of data. Therefore, it is important to also translate (at least some of) the data
into a more generic and standardized form.
<figure markdown>
![schema language](data.png)
<figcaption>
From specific data to more general interoperable data.
</figcaption>
</figure>
The **results** section provides a shared structure designed around base section definitions.
This allows you to put (at least some of) your data where it is easy to find, and in a
form that is easy to interpret. Your non-interoperable, but highly
detailed data needs to be transformed into an interoperable (but potentially limited) form.
Typically, a parser is responsible for populating the specific schema, and the
interoperable schema parts (e.g. the results section) are populated during normalization.
This separates certain aspects of the conversion and potentially enables re-use
of normalization routines. The necessary effort for normalization depends on how much
the specific schema deviates from the base sections. There are three levels:

- the parser (or uploaded archive file) populates the results section directly
- the specific schema re-uses the base sections used for the results, and normalization
can be automated
- the specific schema represents the same information differently, and a translating
normalization algorithm needs to be implemented (see the sketch after this list)
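A sketch of that last level, where a normalize function translates a specific quantity into the shared results section (the specific quantity is made up; the results quantities follow the built-in `Results` and `Material` sections):

```python
from nomad.datamodel.data import EntryData
from nomad.datamodel.results import Material, Results
from nomad.metainfo import Quantity

class MySample(EntryData):
    # Made-up specific quantity.
    formula = Quantity(type=str)

    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        # Translate the specific representation into the interoperable results section.
        if archive.results is None:
            archive.results = Results()
        if archive.results.material is None:
            archive.results.material = Material()
        archive.results.material.chemical_formula_descriptive = self.formula
```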
### Exploring the schema
All built-in definitions that come with NOMAD or one of the installed plugins can
be explored with the [Metainfo browser](https://nomad-lab.eu/prod/v1/gui/analyze/metainfo/nomad.datamodel.datamodel.EntryArchive). You can start with the root section `EntryArchive`
and browse based on sub-sections, or explore the Metainfo through packages.
To see all user-provided uploaded schemas, you can use a [search for the sub-section `definitions`](https://nomad-lab.eu/prod/v1/gui/search/entries?quantities=definitions).
The sub-section `definitions` is a top-level `EntryArchive` sub-section. See also our
[how-to on writing and uploading schemas](../schemas/basics.md#uploading-schemas).
### Contributing to the Metainfo
The shared entry structure (including the results section) is part of the NOMAD source code.
It interacts with core functionality and needs to be highly controlled.
Contributions here are only possible through merge requests.
Base sections can be contributed via plugins. Here they can be explored in the Metainfo
browser, your plugin can provide more tools, and you can make use of normalize functions.
See also our [how-to on writing schema plugins](../plugins/schemas.md). You could
also provide base sections via uploaded schemas, but those are harder to explore and
distribute to other NOMAD installations.
Specific schemas can be provided via plugins or as uploaded schemas. When you upload
schemas, you most likely also upload data in archive files (or use ELNs to edit such files).
Here you can also provide schemas and data in the same file. In many cases,
specific schemas will be small and only re-combine existing base sections.
See also our
[how-to on writing schemas](../schemas/basics.md).
## Data
All processed data in NOMAD instantiates Metainfo schema definitions and the *archive* of
each entry is always an instance of `EntryArchive`. This provides an abstract structure
for all data. However, it is independent of the actual representation of data in computer memory
or how it might be stored in a file or database.
The Metainfo has many serialized forms. You can write `.archive.json` or `.archive.yaml`
files yourself. NOMAD internally stores all processed data in [MessagePack](https://msgpack.org/). Some
of the data is stored in MongoDB or Elasticsearch. When you request processed data via the
API, you receive it in JSON. When you use the [ArchiveQuery](../apis/archive_query.md), all data is represented
as Python objects (see also [here](../plugins/schemas.md#starting-example)).
No matter what the representation is, you can rely on the structure, names, types, shapes, and units
defined in the schema to interpret the data.
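For example (a sketch assuming the `nomad-lab` package is installed and an `example.archive.json` file exists):

```python
import json

from nomad.datamodel import EntryArchive

# Load an archive from its JSON representation into Python objects ...
with open('example.archive.json') as f:
    archive = EntryArchive.m_from_dict(json.load(f))

# ... work with it through the schema-defined structure ...
print(archive.data)

# ... and serialize it back into a plain dictionary / JSON.
print(json.dumps(archive.m_to_dict(), indent=2))
```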
docs/learn/data.png (new image)
docs/learn/datamodel.png (image replaced)
docs/learn/how-does-nomad-work.png (new image)
# How does NOMAD work?
## Managing data based on automatically extracted rich metadata
![how does nomad work](how-does-nomad-work.png)
NOMAD is based on a *bottom-up* approach. Instead of only managing data of a specific
predefined format, we use parsers and processing to support an extendable variety of
data formats. Uploaded *raw* files are analysed and files with a recognized format are parsed.
Parsers are small programs that transform data from the recognized *mainfiles* into a common machine-processable
version that we call the *archive*. The information in the common archive representation
drives everything else. It is the basis for our search interface, the representation of materials
and their properties, as well as all analytics.
## A common hierarchical machine processable format for all data
![archive example](archive-example.png)
The *archive* is a hierarchical data format with a strict schema.
All the information is organized into logical nested *sections*.
Each *section* comprises a set of *quantities* on a common subject.
All *sections* and *quantities* are supported by a formal schema that defines names, descriptions, types, shapes, and units.
We sometimes call this data *archive* and the schema *metainfo*.
## Datamodel: *uploads*, *entries*, *files*, *datasets*
Uploaded *raw* files are managed in *uploads*.
Users can create *uploads* and use them like projects.
You can share them with other users, incrementally add and modify data in them, publish (incl. embargo) them, or transfer them between NOMAD installations.
As long as an *upload* is not published, you can continue to provide files, delete the upload again, or test how NOMAD is processing your files.
Once an upload is published, it becomes immutable.
<figure markdown>
![datamodel](datamodel.png){ width=600 }
<figcaption>NOMAD's main entities</figcaption>
</figure>
An *upload* can contain an arbitrary directory structure of *raw* files.
For each recognized *mainfile*, NOMAD creates an entry.
Therefore, an *upload* contains a list of *entries*.
Each *entry* is associated with its *mainfile*, an *archive*, and all other *auxiliary* files in the same directory.
*Entries* are automatically aggregated into *materials* based on the extracted materials metadata.
*Entries* (of many uploads) can be manually curated into *datasets* for which you can also get a DOI.
docs/learn/schema.png (new image)
docs/learn/schema_language.png (new image)
# Schemas and Structured Data in NOMAD
NOMAD stores all processed data in a *well-defined*, *structured*, and *machine-readable*
format. Well-defined means that each element is supported by a formal definition that provides
a name, description, location, shape, type, and possible unit for that data. It has a
hierarchical structure that logically organizes data in sections and subsections and allows
cross-references between pieces of data. Formal definitions and corresponding
data structures enable the machine processing of NOMAD data.
![archive example](archive-example.png)
## The Metainfo is the schema for Archive data.
The Archive stores descriptive and structured information about materials-science
data. Each entry in NOMAD is associated with one Archive that contains all the processed
information of that entry. What information can possibly exist in an archive, how this
information is structured, and how this information is to be interpreted is governed
by the Metainfo.
## On schemas and definitions
Each piece of Archive data has a formal definition in the Metainfo. These definitions
provide data types with names, descriptions, categories, and further information that
applies to all incarnations of a certain data type.
Consider a simulation `Run`. Each
simulation run in NOMAD is characterized by a *section* called *run*. It can contain
*calculation* results, simulated *systems*, applied *methods*, the used *program*, etc.
What constitutes a simulation run is *defined* in the metainfo with a *section definition*.
All other elements in the Archive (e.g. *calculation*, *system*, ...) have similar definitions.
Definitions follow a formal model. Depending on the definition type, each definition
has to provide certain information: *name*, *description*, *shape*, *units*, *type*, etc.
## Types of definitions
- *Sections* are the building block for hierarchical data. A section can contain other
sections (via *subsections*) and data (via *quantities*).
- *Subsections* define a containment relationship between sections.
- *Quantities* define a piece of data in a section.
- *References* are special quantities that allow references to be defined from one section to
another section or quantity.
- *Categories* allow definitions to be categorized.
- *Packages* are used to organize definitions.
## Interfaces
The Archive format and Metainfo schema are abstract and not bound to any
specific storage format. Archive and Metainfo can be represented in various ways.
For example, NOMAD internally stores archives in a binary format, but serves them via the
API in JSON. Users can upload archive files (as `.archive.json` or `.archive.yaml` files).
Metainfo schemas can be programmed with Python classes, but can also be uploaded as
archive files (the Metainfo itself is just a specific Archive schema). The following
chart provides a sense of various ways that data can be entered into NOMAD:
![nomad data flow](data-flow.png)
There are various interfaces to provide or retrieve Archive data and Metainfo schemas.
The following documentation sections will explain a few of them.
docs/learn/screenshot.png (new image)