Commit 3f984a55 authored by Markus Scheidgen

Updated the metainfo documentation.

parent b3384164
......@@ -12,6 +12,7 @@ and infrastructure with a simplyfied architecture and consolidated code base.
dev_guidelines
api_tutorial
api
ops
metainfo
parser_tutorial
reference
ops
Metainfo
========
.. automodule:: nomad.metainfo
# Copyright 2018 Markus Scheidgen
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The NOMAD meta-info lets you define schemas for physics data independently of the
storage format used. You can define physics quantities with types, complex shapes
(vectors, matrices, etc.), units, links, and descriptions. Large numbers of these
quantities can be organized in containment hierarchies of extendable sections, with
references between sections and additional quantity categories.
NOMAD uses the meta-info to define all archive data, repository meta-data, (and encyclopedia
data). The meta-info provides a convenient Python interface to create,
manipulate, and access data. We also use it to map data to various storage formats,
including JSON, (HDF5), mongodb, and elastic search.
Starting example
----------------
.. code-block:: python

    import ase.data
    from nomad.metainfo import MSection, Quantity, SubSection, Enum, units

    class System(MSection):
        \"\"\"
        A system section includes all quantities that describe a single simulated
        system (a.k.a. geometry).
        \"\"\"

        n_atoms = Quantity(
            type=int, description='''
            Defines the number of atoms in the system.
            ''')

        atom_labels = Quantity(type=Enum(ase.data.chemical_symbols), shape=['n_atoms'])
        atom_positions = Quantity(type=float, shape=['n_atoms', 3], unit=units.m)
        simulation_cell = Quantity(type=float, shape=[3, 3], unit=units.m)
        pbc = Quantity(type=bool, shape=[3])

    class Run(MSection):
        systems = SubSection(sub_section=System, repeats=True)

We define a simple metainfo schema with two `sections` called ``System`` and ``Run``. Sections
organize related data into, well, `sections`. Each section can have two types of
properties: `quantities` and `sub-sections`. Sections and their properties are defined with
Python classes and their attributes.
Each `quantity` defines a piece of data. Basic quantity attributes are its `type`, `shape`,
`unit`, and `description`.
`Sub-sections` allow sections to be placed inside each other and thereby form containment
hierarchies of sections and the data in them. Basic sub-section attributes are
`sub_section` (i.e. a reference to the section definition of the sub-section) and `repeats`
(determines whether a sub-section can be contained once or multiple times).
The above only defines a schema. To use the schema and create actual data, we have to
instantiate the above classes:
.. code-block:: python

    run = Run()
    system = run.m_create(System)
    system.n_atoms = 3
    system.atom_labels = ['H', 'H', 'O']

    print(system.atom_labels)
    print(system.n_atoms)

Section `instances` can be used like regular Python objects: quantities and sub-sections
can be set and accessed like any other Python attribute. Special meta-info methods, starting
with ``m_``, allow us to realize more complex semantics. For example, ``m_create`` will
instantiate a sub-section and add it to the `parent` section in one step.
Another example of an ``m_``-method is:
.. code-block:: python

    run.m_to_json(indent=2)

This will serialize the data into JSON:
.. code-block:: JSON

    {
      "m_def": "Run",
      "systems": [
        {
          "n_atoms": 3,
          "atom_labels": [
            "H",
            "H",
            "O"
          ]
        }
      ]
    }

Definitions
-----------
.. autoclass:: Definition
Quantities
----------
.. autoclass:: Quantity
.. _metainfo-sections:
Sections
--------
With sections it is paramount to always be clear about what is being discussed. The loose
term `section` can refer to one of the following three:

* `section definition`
  Which is a Python object that represents the definition of
  a section, its sub-sections and quantities. `Section definitions` should not be
  written directly. `Section definitions` are objects of :class:`Section`.

* `section class`
  Which is a Python class and :class:`MSection` descendant that is
  used to express a `section definition` in Python. Each `section class` is tightly
  associated with its `section definition`. The `section definition` can be accessed
  with the class attribute ``m_def``. The `section definition` is automatically created
  from the `section class` upon defining the class through metaclass voodoo.

* `section instance`
  The instance (object) of a `section class`; it `follows` the
  definition associated with the instantiated `section class`. The followed
  section definition can be accessed with the object attribute ``m_def``.

A `section class` looks like this:
.. code-block:: python

    class SectionName(BaseSection):
        ''' Section description '''
        m_def = Section(**section_attributes)

        quantity_name = Quantity(**quantity_attributes)
        sub_section_name = SubSection(**sub_section_attributes)

The various Python elements of this class are mapped to the respective `section definition`
attributes after the class is defined. The ``SectionName`` becomes the `name`. The
``BaseSection`` is either :class:`MSection` or another `section class`; in the latter case,
that `section class`'s `section definition` becomes a member of `base_sections`. The
``section_attributes`` become additional attributes of the `section definition`. The
various ``Quantity`` and ``SubSection`` attributes become the `quantities` and `sub_sections`.
Each `section class` has to directly or indirectly extend :class:`MSection`. This
provides certain class and object features to all `section classes` and all `section instances`.
Read :ref:`metainfo-reflection` to learn more.
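
To illustrate how a `section definition` could be derived from a `section class`, here is a
minimal, self-contained sketch. It does not use the nomad implementation: ``SketchQuantity``,
``SketchSectionMeta``, and ``SketchSection`` are hypothetical names, the definition is a plain
dict rather than a :class:`Section` object, and the real metaclass likely does considerably more.

```python
class SketchQuantity:
    # Hypothetical stand-in for Quantity: it only records its attributes.
    def __init__(self, type=None, shape=None):
        self.type = type
        self.shape = shape
        self.name = None  # filled in by the metaclass below


class SketchSectionMeta(type):
    # Builds a simple "definition" dict right after the class body was executed.
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        quantities = {}
        for attr_name, attr in namespace.items():
            if isinstance(attr, SketchQuantity):
                attr.name = attr_name  # the name is taken from the class attribute
                quantities[attr_name] = attr
        cls.m_def = {
            'name': name,
            'base_sections': [b.m_def for b in bases if hasattr(b, 'm_def')],
            'quantities': quantities,
        }


class SketchSection(metaclass=SketchSectionMeta):
    pass


class System(SketchSection):
    n_atoms = SketchQuantity(type=int)
    pbc = SketchQuantity(type=bool, shape=[3])


print(System.m_def['name'])
print(sorted(System.m_def['quantities']))
```

In this sketch both the class and its instances reach the same definition through ``m_def``,
which mirrors the relationship between `section class`, `section instance`, and
`section definition` described above.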
.. autoclass:: Section
Sub-Sections
------------
.. autoclass:: SubSection
.. _metainfo-categories:
Categories
----------
.. autoclass:: Category
Packages
--------
.. autoclass:: Package
.. _metainfo-custom-types:
Custom data types
-----------------
.. autoclass:: DataType
:members:
.. autoclass:: Enum
.. _metainfo-reflection:
Reflection and custom data storage
----------------------------------
When manipulating metainfo data in Python, all data is represented as Python objects, where
objects correspond to `section instances` and their attributes to `quantity values` or
`section instances` of sub-sections. By defining sections with `section classes`, each
of these Python objects already has an interface to get and set quantities and
sub-sections. But often this interface is too limited, or the specific section and
quantity definitions are unknown when writing code.
.. autoclass:: MSection
:members:
:class:`MSection` does not keep all its data directly, but uses a data object that
descends from :class:`MData`.
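
The idea of delegating storage to a separate data object can be sketched in plain Python.
This is only an illustration under assumptions: ``DictStorage`` and ``SketchSection`` are
hypothetical names, and the ``get``/``set`` methods are not the actual :class:`MData`
interface.

```python
class DictStorage:
    # Hypothetical storage backend: keeps quantity values in a plain dict.
    def __init__(self):
        self.dct = {}

    def get(self, name):
        return self.dct.get(name)

    def set(self, name, value):
        self.dct[name] = value


class SketchSection:
    def __init__(self, storage=None):
        # object.__setattr__ bypasses our own __setattr__ defined below.
        object.__setattr__(
            self, 'm_data', storage if storage is not None else DictStorage())

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. for stored values.
        return self.m_data.get(name)

    def __setattr__(self, name, value):
        # All regular attribute assignments are forwarded to the storage object.
        self.m_data.set(name, value)


section = SketchSection()
section.n_atoms = 3
print(section.n_atoms)
print(section.m_data.dct)
```

Because the section only forwards reads and writes, a different storage object with the
same two methods could be passed to the constructor without changing the section class.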
.. autoclass:: MData
:members:
.. autoclass:: MDataDict
.. autoclass:: MetainfoError
.. autoclass:: DeriveError
.. autoclass:: MetainfoReferenceError
.. _metainfo-urls:
References and metainfo URLs
----------------------------
When in Python memory, quantity values that reference other sections simply contain a
Python reference to the respective `section instance`. However, upon serializing/storing
metainfo data, these references have to be represented differently.
Currently, this metainfo implementation only supports references within a single
section hierarchy (e.g. the same JSON file). References are stored as paths from the
root section, over sub-sections, to the referenced section. Each path segment is
the name of the sub-section or an index in a repeatable sub-section:
``/system/0/symmetry``.
References are automatically serialized by :py:meth:`MSection.m_to_dict`. When de-serializing
data with :py:meth:`MSection.m_from_dict`, these references are not resolved right away,
because the referenced section might not yet be available. Instead, references are stored
as :class:`MProxy` instances. These objects are automatically replaced by the referenced
object when a respective quantity is accessed.
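
How such a path could be resolved lazily is sketched below on plain dicts and lists. This
is not the nomad implementation: ``resolve_path`` and ``SketchProxy`` are hypothetical
helpers, and the archive content (including ``system_ref`` and ``space_group``) is made up
for illustration.

```python
def resolve_path(root, path):
    # Walks from the root over sub-section names and list indices.
    current = root
    for segment in [s for s in path.split('/') if s]:
        if isinstance(current, list):
            current = current[int(segment)]  # index in a repeatable sub-section
        else:
            current = current[segment]  # name of a sub-section
    return current


class SketchProxy:
    # Hypothetical placeholder that keeps the URL until the value is needed.
    def __init__(self, url):
        self.url = url

    def resolve(self, root):
        return resolve_path(root, self.url)


archive = {
    'system': [
        {'n_atoms': 3, 'symmetry': {'space_group': 1}},
    ],
    'calculation': {'system_ref': SketchProxy('/system/0/symmetry')},
}

proxy = archive['calculation']['system_ref']
print(proxy.resolve(archive))
```

The proxy can be created before the referenced section exists in the hierarchy; the path
is only followed when ``resolve`` is called.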
.. autoclass:: MProxy
A more complex example
----------------------
.. literalinclude:: ../nomad/metainfo/example.py
:language: python
"""
from .metainfo import MSection, MCategory, Definition, Property, Quantity, SubSection, \
Section, Category, Package, Enum, Datetime, m_package, units
Section, Category, Package, Enum, Datetime, MProxy, MetainfoError, DeriveError, \
MetainfoReferenceError, DataType, MData, MDataDict, m_package, units
......@@ -12,128 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The NOMAD meta-info allows to define physics data quantities. These definitions are
necessary for all computer representations of respective data (e.g. in Python,
search engines, data-bases, and files).
This module provides various Python interfaces to
- define meta-info data
- create and manipulate data that follows these definitions
- (de-)serialize meta-info data in JSON (i.e. represent data in JSON formatted files)
Here is a simple example that demonstrates the definition of System related quantities:
.. code-block:: python

    class System(MSection):
        \"\"\"
        A system section includes all quantities that describe a single simulated
        system (a.k.a. geometry).
        \"\"\"

        n_atoms = Quantity(
            type=int, description='''
            Defines the number of atoms in the system.
            ''')

        atom_labels = Quantity(type=Enum(ase.data.chemical_symbols), shape=['n_atoms'])
        atom_positions = Quantity(type=float, shape=['n_atoms', 3], unit=units.m)
        simulation_cell = Quantity(type=float, shape=[3, 3], unit=units.m)
        pbc = Quantity(type=bool, shape=[3])

    class Run(MSection):
        systems = SubSection(sub_section=System, repeats=True)

Here, we define a `section` called ``System``. The section mechanism organizes
related data into, well, sections. Sections form containment hierarchies. Here,
containment is a parent-child (whole-part) relationship. In this example, many ``Systems``
are part of one ``Run``. Each ``System`` can contain values for the defined quantities:
``n_atoms``, ``atom_labels``, ``atom_positions``, ``simulation_cell``, and ``pbc``.
Quantities state type, shape, and physics unit to specify possible quantity
values.
Here is an example, where we use the above definition to create, read, and manipulate
data that follows these definitions:

.. code-block:: python

    run = Run()
    system = run.m_create(System)
    system.n_atoms = 3
    system.atom_labels = ['H', 'H', 'O']

    print(system.atom_labels)
    print(run.m_to_json(indent=2))

This last statement will produce the following JSON:

.. code-block:: JSON

    {
      "m_def": "Run",
      "System": [
        {
          "m_def": "System",
          "m_parent_index": 0,
          "n_atoms": 3,
          "atom_labels": [
            "H",
            "H",
            "O"
          ]
        }
      ]
    }

This is the JSON representation, a serialized version of the Python representation in
the example above.
Sections can be extended with new quantities outside the original section definition.
This provides the key mechanism to extend commonly defined parts with (code) specific
quantities:
.. code-block:: Python
class Method(nomad.metainfo.common.Method):
x_vasp_incar_ALGO=Quantity(
type=Enum(['Normal', 'VeryFast', ...]),
links=['https://cms.mpi.univie.ac.at/wiki/index.php/ALGO'])
\"\"\"
A convenient option to specify the electronic minimisation algorithm (as of VASP.4.5)
and/or to select the type of GW calculations.
\"\"\"
All meta-info definitions and classes for meta-info data objects (i.e. section instances)
inherit from :class:`MSection`. This base class provides common functions and properties
for all meta-info data objects. Names of these common parts are prefixed with ``m_``
to distinguish them from user-defined quantities. This also constitutes the `reflection`
interface (in addition to Python's built-in ``getattr``, ``setattr``) that allows you to
create and manipulate meta-info data without prior knowledge of the underlying
definitions when writing the program.
.. autoclass:: MSection
The following classes can be used to define and structure meta-info data:
- sections are defined by sub-classing :class:`MSection` and using :class:`Section` to
populate the class attribute `m_def`
- quantities are defined by assigning :class:`Quantity` instances to class attributes
of a section
- references (from one section to another) can be defined with quantities that use
section definitions as type
- dimensions can be defined by simply using quantity names in shapes
- categories (former `abstract type definitions`) can be given in quantity definitions
to assign quantities to additional specialization-generalization hierarchies
See the reference of classes :class:`Section` and :class:`Quantity` for details.
.. autoclass:: Section
.. autoclass:: Quantity
"""
from typing import Type, TypeVar, Union, Tuple, Iterable, List, Any, Dict, Set, \
Callable as TypingCallable, cast
from collections.abc import Iterable as IterableABC
......@@ -172,6 +50,7 @@ class DeriveError(MetainfoError):
class MetainfoReferenceError(MetainfoError):
""" An error indicating that a reference could not be resolved. """
pass
# Metainfo quantity data types
......@@ -187,7 +66,11 @@ class Enum(list):
class MProxy():
""" A placeholder object that acts as reference to a value that is not yet resolved. """
""" A placeholder object that acts as reference to a value that is not yet resolved.
Attributes:
url: The reference represented as a URL string.
"""
def __init__(self, url: str):
self.url = url
......@@ -463,8 +346,7 @@ class MData:
Metainfo data objects store the data of a single section instance. This interface
constitutes the minimal functionality for accessing and modifying section data.
Different implementations of this interface can realize different storage backends,
or apply different levels of rigor in type and shape checks.
Different implementations of this interface can realize different storage backends.
All section instances will implement this interface, usually by delegating calls to
a standalone implementation of this interface. This allows to configure various
......@@ -518,7 +400,7 @@ class MData:
class MDataDict(MData):
""" A simple dict-backed implementation of :class:`MData`. """
""" A simple dict-backed implementation of :class:`MData`. It is used by default. """
def __init__(self, dct: Dict[str, Any] = None):
if dct is None:
......@@ -595,40 +477,34 @@ class MDataDict(MData):
class MSection(metaclass=MObjectMeta):
"""Base class for all section instances on all meta-info levels.
All metainfo objects instantiate classes that inherit from ``MSection``. Each
section or quantity definition is an ``MSection``, each actual (meta-)data carrying
section is an ``MSection``. This class constitutes the reflection interface of the
meta-info, since it allows you to manipulate sections (and therefore all meta-info data)
without having to know the specific sub-class.
All `section instances` indirectly instantiate :class:`MSection`, and therefore all
members of :class:`MSection` are available on all `section instances`. :class:`MSection`
provides many special attributes and functions (they all start with ``m_``) that let you
reflect on a `section's definition` and manipulate the `section instance`
without a priori knowledge of the `section definition`.
It also carries all the data for each section. All sub-classes only define specific
sections in terms of possible sub-sections and quantities. The data is managed here.
The reflection interface for reading and manipulating quantity values consists of
Python's built-in ``getattr``, ``setattr``, and ``del``, as well as member functions
:func:`m_add_value`, and :func:`m_add_values`.
Attributes:
m_def: The `section definition` that this `section instance` follows as a
:class:`Section` object.
Sub-sections and parent sections can be read and manipulated with :data:`m_parent`,
:func:`m_sub_section`, :func:`m_create`.
m_parent:
If this section is a sub-section, this references the parent section instance.
.. code-block:: python
m_parent_sub_section:
If this section is a sub-section, this is the :class:`SubSection` that defines
this relationship.
system = run.m_create(System)
assert system.m_parent == run
assert run.m_sub_section(System, system.m_parent_index) == system
m_parent_index:
For repeatable sections, the parent keeps a list of sub-sections. This is the index
of this section in the respective parent sub-section list.
m_data: The :class:`MData` implementation that stores the section data. It keeps
the quantity values and sub-sections. It should only be read directly
(and never manipulated).
Attributes:
m_def: The section definition that defines this section, its possible
sub-sections and quantities.
m_parent: The parent section instance that this section is a sub-section of.
m_parent_sub_section: The sub section definition that holds this section in the parent.
m_parent_index: For repeatable sections, the parent keeps a list of sub-sections for
each section definition. This is the index of this section in the respective
parent sub-section list.
m_data: The dictionary that holds all data of this section. It keeps the quantity
values and sub-sections. It should only be read directly (and never manipulated),
and only if you know what you are doing. You should always use the reflection interface
if possible.
"""
m_def: 'Section' = None
......@@ -889,6 +765,7 @@ class MSection(metaclass=MObjectMeta):
return value
def m_is_set(self, quantity_def: 'Quantity') -> bool:
""" True if the given quantity is set. """
quantity_def = self.__resolve_synonym(quantity_def)
if quantity_def.derived is not None:
return True
......@@ -1111,7 +988,7 @@ class MSection(metaclass=MObjectMeta):
return section
def m_to_json(self, **kwargs):
"""Returns the data of this section as a json string. """
""" Returns the data of this section as a json string. """
return json.dumps(self.m_to_dict(), **kwargs)
def m_all_contents(self) -> Iterable[Content]:
......@@ -1123,7 +1000,7 @@ class MSection(metaclass=MObjectMeta):
yield content
def m_contents(self) -> Iterable[Content]:
"""Returns an iterable over all direct sub-sections. """
""" Returns an iterable over all direct sub-sections. """
for sub_section_def in self.m_def.all_sub_sections.values():
if sub_section_def.repeats:
index = 0
......@@ -1151,6 +1028,7 @@ class MSection(metaclass=MObjectMeta):
return '%s/%s' % (self.m_parent.m_path().rstrip('/'), segment)
def m_root(self, cls: Type[MSectionBound] = None) -> MSectionBound:
""" Returns the root section, i.e. the first section up the parent chain that has no parent. """
if self.m_parent is None:
return cast(MSectionBound, self)
else:
......@@ -1228,6 +1106,7 @@ class MSection(metaclass=MObjectMeta):
return errors
def m_all_validate(self):
""" Evaluates all constraints in the whole section hierarchy, incl. this section. """
errors: List[str] = []
for section, _, _, _ in itertools.chain([(self, None, None, None)], self.m_all_contents()):
for error in section.m_validate():
......@@ -1274,6 +1153,36 @@ class MCategory(metaclass=MObjectMeta):
# Metainfo M3 (i.e. definitions of definitions)
class Definition(MSection):
""" A common base for all metainfo definitions.
All metainfo `definitions` (sections, quantities, sub-sections, packages, ...) share
some common attributes. These are defined in a common base: all
metainfo items extend this common base and inherit from ``Definition``.
Attributes:
name: Each `definition` has a name. Names have to be valid Python identifiers.
They can contain letters, numbers and _, but must not start with a number.
This also qualifies them as identifiers in most storage formats and databases,
makes them URL safe, etc.
Names must be unique within the :class:`Package` or :class:`Section` that
this definition is part of.
description: The description can be an arbitrary human readable text that explains
what this definition is about.
links: Each definition can be accompanied by a list of URLs. These should point
to resources that further explain the definition.
categories: All metainfo definitions can be put into one or more `categories`.
Categories organize the definitions themselves. This is different from
sections, which organize the data (e.g. quantity values) and not the definitions
of data (e.g. quantity definitions). See :ref:`metainfo-categories` for more details.
Additional helper functions for `definitions`:
.. automethod:: all_definitions
"""
__all_definitions: Dict[Type[MSection], List[MSection]] = {}
......@@ -1293,7 +1202,11 @@ class Definition(MSection):
@classmethod
def all_definitions(cls: Type[MSectionBound]) -> Iterable[MSectionBound]:
""" Returns all definitions of this definition class. """
""" Class method that returns all definitions of this class.
This can be used to get a list of all globally available `definitions` of a certain
kind. E.g. to get all `quantities`: ``Quantity.all_definitions()``.
"""
return cast(Iterable[MSectionBound], Definition.__all_definitions.get(cls, []))
def on_set(self, quantity_def, value):
......@@ -1310,16 +1223,102 @@ class Property(Definition):
class Quantity(Property):
"""Used to define quantities that store a certain piece of (meta-)data.
""" Definition of an atomic piece of data.
Quantities are the basic building block with meta-info data. The Quantity class is
used to define quantities within sections. A quantity definition
is a (physics) quantity with name, type, shape, and potentially a unit.
Quantity definitions are the main building block of meta-info schemas. Each quantity
represents a single piece of data.
In Python terms, quantities are descriptors. Descriptors define how to get, set, and
delete values for an object attribute. Meta-info descriptors ensure that
type and shape fit the set values.
"""
To define quantities, use objects of this class as class attribute values in
`section classes`. The name of a quantity is automatically taken from its `section class`
attribute. You can provide all other attributes to the constructor with keyword arguments.
See :ref:`metainfo-sections` to learn about `section classes`.
In Python terms, ``Quantity`` is a descriptor. Descriptors define how to get and
set attributes in a Python object. This allows us to use sections like regular
Python objects and quantities like regular Python attributes.
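
For readers unfamiliar with the descriptor protocol, here is a minimal generic sketch. It
is not the ``Quantity`` implementation: ``TypedAttribute`` is a hypothetical class that
only performs a simple type check and stores values in the instance ``__dict__``.

```python
class TypedAttribute:
    # Hypothetical descriptor: checks the type on set, stores values per instance.
    def __init__(self, type):
        self.type = type
        self.name = None

    def __set_name__(self, owner, name):
        # Called by Python when the owning class is created.
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # access on the class returns the descriptor itself
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        if not isinstance(value, self.type):
            raise TypeError('%s expects %s' % (self.name, self.type.__name__))
        obj.__dict__[self.name] = value


class System:
    n_atoms = TypedAttribute(type=int)


system = System()
system.n_atoms = 3
print(system.n_atoms)
```

Assigning a value of the wrong type, e.g. ``system.n_atoms = 'three'``, raises a
``TypeError`` in this sketch, while reads and writes otherwise look like plain attribute
access.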
Beyond basic :class:`Definition` attributes, Quantities are defined with the following
attributes.
Attributes:
type:
Defines the datatype of quantity values. This is the type of individual elements
in a potentially complex shape. If you define a list of integers for example,
the `shape` would be list and the `type` integer:
``Quantity(type=int, shape=['0..*'])``.
The `type` can be one of:
- a built-in primitive Python type: ``int``, ``str``, ``bool``, ``float``
- an instance of :class:`Enum`, e.g. ``Enum('one', 'two', 'three')``
- a section to define references to other sections as quantity values
- a custom meta-info :class:`DataType`, see :ref:`metainfo-custom-types`
- a numpy `dtype`, e.g. ``np.dtype('float32')``
- ``typing.Any`` to support any value
If set to a numpy `dtype`, this quantity will use a numpy array to store values internally.
If a regular (nested) Python list is given, it will be automatically converted.
The given `dtype` will be used in the numpy array.
To define a reference, either a `section class` or an instance of :class:`Section`
can be given. See :ref:`metainfo-sections` for details. Instances of the given section
constitute valid values for this type. Upon serialization, referenced section
instances will be represented with metainfo URLs. See :ref:`metainfo-urls`.