Commit b0d59050 authored by Markus Scheidgen

Added more documentation on data, schemas, and plugins.

Changelog: Added
parent a459ac0f
1 merge request: !1390 Added more learn documentation on data and schemas.
Pipeline #172320 passed
Showing 331 additions and 111 deletions
This guide is about using NOMAD's REST APIs directly, e.g. via Python's *requests*.
To access the processed data with our client library `nomad-lab` follow
[How to access the processed data](archive_query.md). You can watch our
[How-to access the processed data](archive_query.md). You can watch our
[video tutorial on the API](../tutorial.md#access-data-via-api).
## Different options to use the API
......@@ -46,7 +46,7 @@ API functions that allows you to try these functions in the browser.
Install the [NOMAD Python client library](pythonlib.md) and use its `ArchiveQuery`
functionality for a more convenient, query-based access to archive data following the
[How to access the processed data](archive_query.md) guide.
[How-to access the processed data](archive_query.md) guide.
## Using requests
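For example, a minimal query against the public NOMAD API might look like this (a sketch; the query values and printed fields are illustrative):

```python
import requests

base_url = 'https://nomad-lab.eu/prod/v1/api/v1'

# Query the entries endpoint; query and pagination values are example choices.
response = requests.post(
    f'{base_url}/entries/query',
    json={
        'query': {'results.material.elements': ['Si']},
        'pagination': {'page_size': 1},
    },
)
response.raise_for_status()
result = response.json()
print(result['pagination']['total'], 'matching entries')
```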
......@@ -250,7 +250,7 @@ the API:
- Raw files, the files as they were uploaded to NOMAD.
- Archive data, all of the extracted data for an entry.
There are also different entities (see also [Datamodel](../learn/how_nomad_works.md#datamodel-uploads-entries-files-datasets)) with different functions in the API:
There are also different entities (see also [Datamodel](../learn/basics.md)) with different functions in the API:
- Entries
- Uploads
......
......@@ -5,7 +5,7 @@ based on the schema rather than plain JSON. See also this guide on using
to work with processed data.
As a requirement, you have to install the `nomad-lab` Python package. Follow the
[How to install nomad-lab](pythonlib.md) guide.
[How-to install nomad-lab](pythonlib.md) guide.
## Getting started
......
# How to run a parser
# How-to run a parser
You can find a [list of all parsers](../reference/parsers.md) and supported files in the reference.
......
......@@ -13,7 +13,7 @@ button. This will bring you to the upload page.
Click the `CREATE ENTRY` button. This will bring up a dialog to choose an ELN schema.
All ELNs (as any entry in NOMAD) need to follow a schema. You can choose from uploaded
custom schemas or NOMAD build-in schemas. You can choose the `Basic ELN` to create a
custom schemas or NOMAD built-in schemas. You can choose the `Basic ELN` to create a
simple ELN entry.
The name of your ELN entry will be the filename for your ELN without the `.archive.json`
......@@ -33,5 +33,5 @@ click the `ADD EXAMPLE UPLOADS` button. The `Electronic Lab Notebook` example, w
contain a schema and entries that instantiate different parts of the schema.
The *ELN example sample* (`sample.archive.json`) demonstrates what you can do.
Follow the [How to write a schema](../schemas/basics.md) and [How to define ELN](../schemas/elns.md)
Follow the [How-to write a schema](../schemas/basics.md) and [How-to define ELN](../schemas/elns.md)
guides to create your own customized ELNs.
......@@ -170,7 +170,7 @@ Please follow the following rules when logging:
- If a logger is not already provided, only use
:py:func:`nomad.utils.get_logger` to acquire a new logger. Never use the
build-in logging directly. These loggers work like the system loggers, but
built-in logging directly. These loggers work like the system loggers, but
allow you to pass keyword arguments with additional context data. See also
the [structlog docs](https://structlog.readthedocs.io/en/stable/).
- In many contexts, a logger is already provided (e.g. api, processing, parser, normalizer).
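For example (a minimal sketch; the keyword arguments are made-up context data):

```python
from nomad import utils

# Acquire a structured logger instead of using Python's built-in logging directly.
logger = utils.get_logger(__name__)

# Context data goes into keyword arguments, not into the message string.
logger.info('parsed mainfile', mainfile='data/vasprun.xml', n_calculations=3)
```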
......
# How to write a normalizer
# How-to write a normalizer
## Introduction
......
......@@ -34,14 +34,14 @@ It covers the whole publish, explore, analyze cycle:
</div>
<div markdown="block">
### How to guides
### How-to guides
The documentation provides step-by-step instructions for a wide range of tasks. For example:
- [How to upload and publish data](data/upload.md)
- [How to write a custom ELN](schemas/elns.md)
- [How to run a parser locally](apis/local_parsers.md)
- [How to install NOMAD Oasis](oasis/install.md)
- [How-to upload and publish data](data/upload.md)
- [How-to write a custom ELN](schemas/elns.md)
- [How-to run a parser locally](apis/local_parsers.md)
- [How-to install NOMAD Oasis](oasis/install.md)
</div>
......
docs/learn/architecture.png (image replaced)
docs/learn/archive-example.png (new image)
NOMAD is based on a *bottom-up* approach to data management. Instead of only supporting data in a specific
predefined format, we process files to extract data from an extendable variety of data formats.
Converting heterogeneous files into homogeneous, machine-actionable processed data is the
basis for making data FAIR. It allows us to build search interfaces, APIs, visualization, and
analysis tools independent of specific file formats.
<figure markdown>
![datamodel](datamodel.png)
<figcaption>NOMAD's datamodel and processing</figcaption>
</figure>
## Uploads
Users create **uploads** to organize files. Think of an upload like a project:
many files can be put into a single upload and an upload can be structured with directories.
You can collaborate on uploads, share uploads, and publish uploads. The files in an
upload are called **raw files**.
Raw files are managed by users and they are never changed by NOMAD.
!!! note
As a *rule*, **raw files** are not changed during processing (or otherwise). However,
to achieve certain functionality, a parser, normalizer, or schema developer might decide to bend
this rule. Use cases include generating additional mainfiles (and entries), updating
related mainfiles to automate ELNs, or
generating additional files to convert a mainfile into a standardized format like NeXus or CIF.
## Entries
All uploaded **raw files** are analysed to find files with a recognized format. Each file
that follows a recognized format is a **mainfile**. For each mainfile, NOMAD will create
a database **entry**. The entry is permanently tied to its mainfile. The entry id, for example,
is a hash over the upload id and the mainfile path (and an optional key) within the upload.
This **matching** process is automatic, and users cannot create entries
manually.
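To illustrate the idea (a simplified sketch, not necessarily NOMAD's exact implementation):

```python
import base64
import hashlib
from typing import Optional

def entry_id(upload_id: str, mainfile: str, key: Optional[str] = None) -> str:
    # Combine upload id, mainfile path, and the optional key into one stable hash.
    parts = [upload_id, mainfile] + ([key] if key else [])
    digest = hashlib.sha512('|'.join(parts).encode('utf-8')).digest()
    # Shorten to a URL-safe string that can serve as an identifier.
    return base64.urlsafe_b64encode(digest)[:28].decode('ascii')
```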
!!! note
We say that raw files are not changed by NOMAD and that users cannot create entries,
so what about ELNs? There is a *create entry* button in the UI. However,
NOMAD simply creates an editable **mainfile** that indirectly creates an entry.
The user might use NOMAD as an editor to change the file, but the content is
determined by the user, in contrast to the processed data that is created
from raw files by NOMAD.
## Processing
The processing of entries is also automatic. Initially, and on each mainfile change,
the entry corresponding to the mainfile is processed. Processing consists of
**parsing**, **normalizing**, and storing the created data.
### Parsing
Parsers are small programs that transform data from a recognized *mainfile* into a
structured, machine-processable tree of data that we call the *archive* or [**processed data**](data.md)
of the entry. Only one parser is used for each entry. Which parser is used is determined
during matching and depends on the file format. Parsers can be added to NOMAD as
[plugins](../plugins/parsers.md); see the list of [all built-in parsers](../reference/parsers.md).
!!! note
A special case is the parsing of NOMAD archive files. Usually a parser converts a file
from a source format into NOMAD's *archive* format for processed data. But users can
also create files following this format themselves. They can be uploaded either as `.json` or `.yaml` files
by using the `.archive.json` or `.archive.yaml` extension. In these cases, we also consider
these files mainfiles, and they are processed as well. Here the parsing
is a simple syntax check that basically just copies the data, but normalization might
still modify and augment the data substantially. One use case for these *archive* files
is ELNs: the NOMAD UI acts as an editor for the respective `.json` file, but on each save, the
corresponding file goes through all the regular processing steps. This allows
ELN schema developers to add all kinds of functionality, such as updating referenced
entries, parsing linked files, or creating new entries for automation.
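For illustration, a minimal user-created archive mainfile could be written like this (the referenced section definition and the `name` quantity are placeholders; they must match an actual schema):

```python
import json

# The `data` section instantiates a section definition. The `m_def` value is a
# placeholder reference to a definition in an uploaded schema file.
archive = {
    'data': {
        'm_def': '../upload/raw/schema.archive.yaml#/definitions/section_definitions/0',
        'name': 'my first sample',
    }
}

with open('sample.archive.json', 'w') as f:
    json.dump(archive, f, indent=2)
```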
### Normalizing
While parsing converts a mainfile into processed data, normalizing only operates on the
processed data. Learn more about why we normalize in the documentation on [structured data](./data.md).
There are two principal ways to implement normalization in NOMAD:
**normalizers** and **normalize** functions.
[Normalizers](../develop/normalizers.md) are small programs that take processed data as input.
There is a list of normalizers registered in the [NOMAD configuration](../reference/config.md#normalize).
In the future, normalizers might be
added as plugins as well. They run in the configured order. Every normalizer is run
on all entries, and each normalizer decides whether to act depending on what
it sees in the processed data.
Normalize functions are special functions implemented as part of section definitions
in [Python schemas](../plugins/schemas.md#writing-schemas-in-python-compared-to-yaml-schemas).
There is a special normalizer that will go through all processed data and execute these
functions if they are defined. Normalize functions get the respective section instance as
input. This allows [schema plugin](../plugins/schemas.md) developers to add normalizing to their sections.
Read about our [structured data](./data.md) to learn more about the different sections.
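For example, a minimal sketch of a section definition with a normalize function (the class and its quantities are made up for illustration):

```python
from nomad.datamodel.data import EntryData
from nomad.metainfo import Quantity

class Sample(EntryData):
    length = Quantity(type=float, unit='m')
    width = Quantity(type=float, unit='m')
    area = Quantity(type=float, unit='m**2')

    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        # Derive a quantity from other quantities of the same section instance.
        if self.length is not None and self.width is not None:
            self.area = self.length * self.width
```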
### Storing and indexing
As a last technical step, the processed data is stored and some information is passed
into the search index. The store for processed data is internal to NOMAD; processed
data cannot be accessed directly, only via the [archive API](../apis/api.md#access-processed-data-archives)
or the [ArchiveQuery](../apis/archive_query.md) Python library functionality.
What information is stored in the search index is determined
by the *metadata* and *results* sections and cannot be changed by users or plugins.
However, all scalar values in the processed data are also indexed as key-value pairs.
!!! attention
This part of the documentation should be more substantiated. There will be a learn section
about the search soon.
## Files
We already said that all uploaded files are **raw files**. Recognized files that have an
entry are called **mainfiles**. Only the mainfile of the entry is
passed to the parser during processing. However, a parser can call other tools or read other files.
Therefore, we consider all files in the same directory as the mainfile to be **auxiliary files**,
even though there is not necessarily a formal relationship with the entry. Whether
formal relationships with auxiliary files are established, e.g. via a reference to the file within
the processed data, is up to the parser.
## Datasets
Users can build collections of entries to form **datasets**. You can imagine datasets
like tags or albums in other systems. Each entry can be contained in many datasets, and
a dataset can hold many entries. Datasets can also overlap. Datasets are only
indirectly related to files. The main purpose of **datasets** in NOMAD is to have citable
collections of data. Users can get a DOI for their datasets. Datasets have no influence
on the processing of data.
docs/learn/data-flow.png (new image)
# Structured data and the NOMAD Metainfo
NOMAD structures data into **sections**, where each section can contain data and more sections.
This allows you to browse complex data as you would browse files and directories on your computer.
Each section follows a **definition**, and all the contained data and sub-sections have a
specific name, description, possible type, shape, and unit. This means that all data follows a **schema**.
This not only helps human exploration, but also makes the data machine-interpretable,
increases consistency and interoperability, and enables search, APIs, visualization, and
analysis.
<figure markdown>
![processed data screenshot](screenshot.png)
<figcaption>Browsing structured data in the NOMAD UI (<a href="https://nomad-lab.eu/prod/v1/gui/search/entries/entry/id/zQJMKax7xk384h_rx7VW_-6bRIgi/data/run/0/system/0/atoms/positions">link</a>)</figcaption>
</figure>
## Schema language
The basis for structured data is a schema written in a **schema language**. Our
schema language is called the **NOMAD Metainfo** language. It
provides the means to define sections, organize definitions into **packages**, and define
section properties (**sub-sections** and **quantities**).
<figure markdown>
![schema language](schema_language.png)
<figcaption>The NOMAD Metainfo schema language for structured data definitions</figcaption>
</figure>
Packages contain section definitions, and section definitions contain definitions for
sub-sections and quantities. Sections can inherit the properties of other sections. While
sub-sections allow you to define containment hierarchies for sections, quantities can
use section definitions (or other quantity definitions) as a type to define references.
If you are familiar with other schema languages and other means to define structured data
(JSON Schema, XML Schema, pydantic, database schemas, ORM, etc.), you might recognize
these concepts under different names. Sections are similar to *classes*, *concepts*, *entities*, or *tables*.
Quantities are related to *properties*, *attributes*, *slots*, or *columns*.
Sub-sections might be called *containment* or *composition*. Sub-sections and quantities
with a section type also define *relationships*, *links*, or *references*.
Our guide on [how-to write a schema](../schemas/basics.md) explains these concepts with an example.
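As a small illustration (the names are invented), the same concepts expressed in the Python flavor of the Metainfo:

```python
from nomad.metainfo import MSection, Quantity, SubSection

class Atom(MSection):
    # Quantities define data with a name, type, shape, and unit.
    label = Quantity(type=str)
    position = Quantity(type=float, shape=[3], unit='angstrom')

class System(MSection):
    # Sub-sections define containment hierarchies.
    atoms = SubSection(sub_section=Atom, repeats=True)
    # Quantities with a section type define references.
    representative_atom = Quantity(type=Atom)
```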
## Schema
NOMAD represents many different types of data. Therefore, we cannot speak of just *the one*
schema. The entirety of NOMAD schemas is called the **NOMAD Metainfo**.
Definitions used in the NOMAD Metainfo fall into three different categories. First,
we have sections that define a **shared entry structure**. Those are independent of the
type of data (and processed file type). They allow all generic parts to be found without
any deeper understanding of the specific data. Second, we have definitions of
**re-usable base sections** for shared common concepts and their properties.
Specific schemas can use and extend these base sections. Base sections define a fixed
interface or contract that can be used to build tools (e.g. search, visualizations, analysis)
around them. Lastly, there are **specific schemas**. Those re-use base sections and
complement the shared entry structure. They define specific data structures to represent
specific types of data.
<figure markdown>
![schema language](schema.png)
<figcaption>
The three different categories of NOMAD schema definitions
</figcaption>
</figure>
### Shared entry structure
The processed data (archive) of each entry shares the same structure. All entries instantiate
the same root section `EntryArchive` and share the common sections `metadata:EntryMetadata`
and `results:Results`. They also all contain a *data* section, but the section
definition used varies depending on the type of data of the specific entry. There is the
literal `data:EntryData` sub-section, where `EntryData` is abstract and specific entries
use concrete definitions that inherit from `EntryData`. There are also specific *data*
sections, like `run` for simulation data and `nexus` for NeXus data.
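Sketched as a simplified dictionary, the shared top level of every archive looks roughly like this:

```python
# Simplified sketch of the shared top-level structure of every entry's archive.
archive_skeleton = {
    'metadata': {},  # EntryMetadata: entry and upload ids, authors, timestamps, ...
    'results': {},   # Results: searchable, interoperable summary of the data
    'data': {},      # EntryData: concrete definition varies per type of entry
    # plus specific data sections, e.g. 'run' for simulations or 'nexus' for NeXus data
}
```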
!!! attention
The results, originally only designed for computational data, will soon be revised
and replaced by a different section. However, the necessity and function of a section
like this remains.
<figure markdown>
![schema language](super_structure.png)
<figcaption>
All entries instantiate the same root section and share the same structure.
</figcaption>
</figure>
### Base sections
Base section is a very loose category. In principle, every section definition can be
inherited from or can be re-used in different contexts. There are some dedicated (or even abstract)
base section definitions (mostly defined in the `nomad.datamodel.metainfo` package and sub-packages),
but schema authors should not strictly limit themselves to these definitions.
The goal is to re-use as much as possible and to not re-invent the same sections over
and over again. Tools built around certain base sections provide an incentive to
use them.
!!! attention
There is no detailed how-to or reference documentation on the existing base sections
and how to use them yet.
One example of re-usable base sections is the [workflow package](../schemas/workflows.md).
It allows workflows to be defined in a common way and placed in
the shared entry structure, and the UI provides a card with workflow visualization and
navigation for all entries that contain a workflow.
!!! attention
Currently there are two versions of the workflow schema. They are stored in two
top-level `EntryArchive` sub-sections (`workflow` and `workflow2`). This
will change soon to something that supports multiple workflows used in
specific schemas and results.
### Specific schemas
Specific schemas allow users and plugin developers to describe their data in full detail.
However, users (and machines) not familiar with the specifics will struggle to interpret
these kinds of data. Therefore, it is important to also translate (at least some of) the data
into a more generic and standardized form.
<figure markdown>
![schema language](data.png)
<figcaption>
From specific data to more general interoperable data.
</figcaption>
</figure>
The **results** section provides a shared structure designed around base section definitions.
This allows you to put (at least some of) your data where it is easy to find, and in a
form that is easy to interpret. Your non-interoperable, but highly
detailed data needs to be transformed into an interoperable (but potentially limited) form.
Typically, a parser is responsible for populating the specific schema, and the
interoperable schema parts (e.g. the results section) are populated during normalization.
This separates certain aspects of the conversion and potentially enables re-use
of normalization routines. The necessary effort for normalization depends on how much
the specific schema deviates from the base sections. There are three levels:

- the parser (or uploaded archive file) populates the results section directly
- the specific schema re-uses the base sections used for the results, and normalization
can be automated
- the specific schema represents the same information differently, and a translating
normalization algorithm needs to be implemented (see the sketch after this list)
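A sketch of that last level, where a normalize function translates a specific quantity into the shared results section (the specific quantity is made up; the results quantities follow the built-in `Results` and `Material` sections):

```python
from nomad.datamodel.data import EntryData
from nomad.datamodel.results import Material, Results
from nomad.metainfo import Quantity

class MySample(EntryData):
    # Made-up specific quantity.
    formula = Quantity(type=str)

    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        # Translate the specific representation into the interoperable results section.
        if archive.results is None:
            archive.results = Results()
        if archive.results.material is None:
            archive.results.material = Material()
        archive.results.material.chemical_formula_descriptive = self.formula
```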
### Exploring the schema
All built-in definitions that come with NOMAD or one of the installed plugins can
be explored with the [Metainfo browser](https://nomad-lab.eu/prod/v1/gui/analyze/metainfo/nomad.datamodel.datamodel.EntryArchive). You can start with the root section `EntryArchive`
and browse based on sub-sections, or explore the Metainfo through packages.
To see all user-provided uploaded schemas, you can use a [search for the sub-section `definitions`](https://nomad-lab.eu/prod/v1/gui/search/entries?quantities=definitions).
The sub-section `definitions` is a top-level `EntryArchive` sub-section. See also our
[how-to on writing and uploading schemas](../schemas/basics.md#uploading-schemas).
### Contributing to the Metainfo
The shared entry structure (including the results section) is part of the NOMAD source code.
It interacts with core functionality and needs to be highly controlled.
Contributions here are only possible through merge requests.
Base sections can be contributed via plugins. Here they can be explored in the Metainfo
browser, your plugin can provide more tools, and you can make use of normalize functions.
See also our [how-to on writing schema plugins](../plugins/schemas.md). You could
also provide base sections via uploaded schemas, but those are harder to explore and
distribute to other NOMAD installations.
Specific schemas can be provided via plugins or as uploaded schemas. When you upload
schemas, you most likely also upload data in archive files (or use ELNs to edit such files).
Here you can also provide schemas and data in the same file. In many cases,
specific schemas will be small and only re-combine existing base sections.
See also our
[how-to on writing schemas](../schemas/basics.md).
## Data
All processed data in NOMAD instantiates Metainfo schema definitions and the *archive* of
each entry is always an instance of `EntryArchive`. This provides an abstract structure
for all data. However, it is independent of the actual representation of data in computer memory
or how it might be stored in a file or database.
The Metainfo has many serialized forms. You can write `.archive.json` or `.archive.yaml`
files yourself. NOMAD internally stores all processed data in [MessagePack](https://msgpack.org/). Some
of the data is stored in MongoDB or Elasticsearch. When you request processed data via the
API, you receive it in JSON. When you use the [ArchiveQuery](../apis/archive_query.md), all data is represented
as Python objects (see also [here](../plugins/schemas.md#starting-example)).
No matter what the representation is, you can rely on the structure, names, types, shapes, and units
defined in the schema to interpret the data.
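For example (a sketch assuming the `nomad-lab` package is installed and an `example.archive.json` file exists):

```python
import json

from nomad.datamodel import EntryArchive

# Load an archive from its JSON representation into Python objects ...
with open('example.archive.json') as f:
    archive = EntryArchive.m_from_dict(json.load(f))

# ... work with it through the schema-defined structure ...
print(archive.data)

# ... and serialize it back into a plain dictionary / JSON.
print(json.dumps(archive.m_to_dict(), indent=2))
```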
docs/learn/data.png (new image)
docs/learn/datamodel.png (image replaced)
docs/learn/how-does-nomad-work.png (new image)
# How does NOMAD work?
## Managing data based on automatically extracted rich metadata
![how does nomad work](how-does-nomad-work.png)
NOMAD is based on a *bottom-up* approach. Instead of only managing data of a specific
predefined format, we use parsers and processing to support an extendable variety of
data formats. Uploaded *raw* files are analysed and files with a recognized format are parsed.
Parsers are small programs that transform data from the recognized *mainfiles* into a common machine-processable
version that we call the *archive*. The information in the common archive representation
drives everything else. It is the basis for our search interface, the representation of materials
and their properties, as well as all analytics.
## A common hierarchical machine processable format for all data
![archive example](archive-example.png)
The *archive* is a hierarchical data format with a strict schema.
All the information is organized into logical nested *sections*.
Each *section* comprises a set of *quantities* on a common subject.
All *sections* and *quantities* are supported by a formal schema that defines names, descriptions, types, shapes, and units.
We sometimes call this data *archive* and the schema *metainfo*.
## Datamodel: *uploads*, *entries*, *files*, *datasets*
Uploaded *raw* files are managed in *uploads*.
Users can create *uploads* and use them like projects.
You can share them with other users, incrementally add and modify data in them, publish (incl. embargo) them, or transfer them between NOMAD installations.
As long as an *upload* is not published, you can continue to provide files, delete the upload again, or test how NOMAD is processing your files.
Once an upload is published, it becomes immutable.
<figure markdown>
![datamodel](datamodel.png){ width=600 }
<figcaption>NOMAD's main entities</figcaption>
</figure>
An *upload* can contain an arbitrary directory structure of *raw* files.
For each recognized *mainfile*, NOMAD creates an entry.
Therefore, an *upload* contains a list of *entries*.
Each *entry* is associated with its *mainfile*, an *archive*, and all other *auxiliary* files in the same directory.
*Entries* are automatically aggregated into *materials* based on the extracted materials metadata.
*Entries* (of many uploads) can be manually curated into *datasets* for which you can also get a DOI.
docs/learn/schema.png (new image)
docs/learn/schema_language.png (new image)
# Schemas and Structured Data in NOMAD
NOMAD stores all processed data in a *well-defined*, *structured*, and *machine-readable*
format. Well-defined means that each element is supported by a formal definition that provides
a name, description, location, shape, type, and possible unit for that data. It has a
hierarchical structure that logically organizes data in sections and subsections and allows
cross-references between pieces of data. Formal definitions and corresponding
data structures enable the machine processing of NOMAD data.
![archive example](archive-example.png)
## The Metainfo is the schema for Archive data.
The Archive stores descriptive and structured information about materials-science
data. Each entry in NOMAD is associated with one Archive that contains all the processed
information of that entry. What information can possibly exist in an archive, how this
information is structured, and how this information is to be interpreted is governed
by the Metainfo.
## On schemas and definitions
Each piece of Archive data has a formal definition in the Metainfo. These definitions
provide data types with names, descriptions, categories, and further information that
applies to all incarnations of a certain data type.
Consider a simulation `Run`. Each
simulation run in NOMAD is characterized by a *section* called *run*. It can contain
*calculation* results, simulated *systems*, applied *methods*, the used *program*, etc.
What constitutes a simulation run is *defined* in the metainfo with a *section definition*.
All other elements in the Archive (e.g. *calculation*, *system*, ...) have similar definitions.
Definitions follow a formal model. Depending on the definition type, each definition
has to provide certain information: *name*, *description*, *shape*, *units*, *type*, etc.
## Types of definitions
- *Sections* are the building block for hierarchical data. A section can contain other
sections (via *subsections*) and data (via *quantities*).
- *Subsections* define a containment relationship between sections.
- *Quantities* define a piece of data in a section.
- *References* are special quantities that allow references to be defined from one section to
another section or quantity.
- *Categories* allow definitions to be categorized.
- *Packages* are used to organize definitions.
## Interfaces
The Archive format and Metainfo schema are abstract and not bound to any
specific storage format. Archive and Metainfo can be represented in various ways.
For example, NOMAD internally stores archives in a binary format, but serves them via the
API in JSON. Users can upload archive files (as `.archive.json` or `.archive.yaml` files).
Metainfo schemas can be programmed with Python classes, but can also be uploaded as
archive files (the Metainfo itself is just a specific Archive schema). The following
chart provides a sense of various ways that data can be entered into NOMAD:
![nomad data flow](data-flow.png)
There are various interfaces to provide or retrieve Archive data and Metainfo schemas.
The following documentation sections will explain a few of them.
docs/learn/screenshot.png (new image)