Alvin Noe Ladines · b52d8ef0 · 7ede1f24 · 6abab8d4 · 64711e99 · cb7cf390
--- a/docs/howto/customization/mapping_parser.md 0 → 100644

+ 217

− 0
+++ b/docs/howto/customization/mapping_parser.md 0 → 100644
+# How to write data to archive with MappingParser
+
+`MappingParser` is a generic parser class implemented in
+`nomad/parsing/file_parser/mapping_parser.py` that handles the conversion between a
+data object and a Python dictionary. We refer to an instance of
+this class as a 'mapping parser' throughout this section. In the following, the abstract
+properties and methods of the mapping parser are explained, the available implementations
+are described, and `Mapper`, which is required to convert one
+mapping parser into another, is introduced.
+
+## MappingParser
+
+The mapping parser has several abstract properties and methods. The most important
+ones are:
+
+- `filepath`: path to the input file to be parsed
+- `data_object`: object resulting from loading the file in memory with `load_file`
+- `data`: dictionary representation of `data_object`
+- `mapper`: instance of `Mapper` required by `convert`
+- `load_file`: method to load the file given by `filepath`
+- `to_dict`: method to convert `data_object` into `data`
+- `from_dict`: method to convert `data` into `data_object`
+- `convert`: method to convert to another mapping parser
+
+Depending on the inheriting class, `data_object` can be, for example, an XML element tree
+or a metainfo section. To convert a mapping parser to another parser,
+the target parser must provide a [`Mapper`](#mapper) object. We refer to this simply as
+'mapper' throughout.
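The interface listed above can be pictured with a small stand-in. The sketch below is not the NOMAD implementation: `ToyJSONParser` and its JSON backend are hypothetical and were chosen only so that the example runs with the standard library. It mirrors the documented names `filepath`, `data_object`, `data`, `load_file`, `to_dict` and `from_dict`; `mapper` and `convert` are left out.

```python
import json
import os
import tempfile


class ToyJSONParser:
    """Hypothetical stand-in that mimics the documented interface."""

    def __init__(self, filepath=None):
        self.filepath = filepath
        self._data_object = None

    def load_file(self):
        # Load the file given by `filepath` into memory.
        with open(self.filepath) as f:
            return f.read()

    @property
    def data_object(self):
        # Lazily load the file the first time the object is requested.
        if self._data_object is None and self.filepath is not None:
            self._data_object = self.load_file()
        return self._data_object

    def to_dict(self):
        # Convert `data_object` (here a JSON string) into a dictionary.
        return json.loads(self.data_object)

    def from_dict(self, data):
        # Convert a dictionary back into `data_object`.
        self._data_object = json.dumps(data)

    @property
    def data(self):
        # Dictionary representation of `data_object`.
        return self.to_dict()


# Write a small file, then read it back through the toy parser.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.json')
    with open(path, 'w') as f:
        json.dump({'a': {'b': [1, 2]}}, f)
    parser = ToyJSONParser(filepath=path)
    loaded = parser.data

# Round trip: dictionary -> data object -> dictionary.
other = ToyJSONParser()
other.from_dict(loaded)
round_trip = other.data
```

In this toy, `data_object` is simply the raw JSON string. In the real parsers it is whatever the backend library returns, such as an element tree or an HDF5 group.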
+
+In the following, we describe the currently implemented mapping parsers.
+
+### XMLParser
+
+This is the mapping parser for XML files. It uses [`lxml`](https://lxml.de/) to
+load the file as an element tree. The dictionary is generated by iteratively parsing the
+elements of the tree in `to_dict`. The values parsed from element `text` are automatically
+converted to a corresponding data type. If attributes are present, the value is wrapped in
+a dictionary under the key given by `value_key` (`'__value'` by default), while the attribute keys
+are prefixed by `attribute_prefix` (`'@'` by default). The following XML:
+
+```xml
+<a>
+  <b name='item1'>name</b>
+  <b name='item2'>name2</b>
+</a>
+```
+
+will be converted to:
+
+```python
+    data = {
+      'a' : {
+        'b': [
+          {'@name': 'item1', '__value': 'name'},
+          {'@name': 'item2', '__value': 'name2'}
+        ]
+      }
+    }
+```
+
+The conversion can be reversed using the `from_dict` method.
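The element-to-dictionary conversion described above can be sketched in a few lines. This is an illustration under stated assumptions, not the `XMLParser` code: it uses the standard library `xml.etree.ElementTree` instead of `lxml`, and the helper names `convert_text` and `element_to_value` are hypothetical. Merging the attributes of a non-leaf element into its child dictionary is also an assumption.

```python
import xml.etree.ElementTree as ET

VALUE_KEY = '__value'  # documented default of `value_key`
ATTRIBUTE_PREFIX = '@'  # documented default of `attribute_prefix`


def convert_text(text):
    # Try int, then float; fall back to the stripped string.
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def element_to_value(element):
    children = list(element)
    if children:
        result = {}
        for child in children:
            value = element_to_value(child)
            if child.tag in result:
                # Repeated tags are collected in a list.
                if not isinstance(result[child.tag], list):
                    result[child.tag] = [result[child.tag]]
                result[child.tag].append(value)
            else:
                result[child.tag] = value
    else:
        result = convert_text(element.text or '')
    if element.attrib:
        attributes = {ATTRIBUTE_PREFIX + k: v for k, v in element.attrib.items()}
        if isinstance(result, dict):
            # Assumption: attributes of a non-leaf element join its children.
            result = {**attributes, **result}
        else:
            # Leaf element: wrap the value next to its attributes.
            result = {**attributes, VALUE_KEY: result}
    return result


xml_text = """
<a>
  <b name='item1'>name</b>
  <b name='item2'>name2</b>
</a>
"""
root = ET.fromstring(xml_text)
data = {root.tag: element_to_value(root)}
```

For the XML shown earlier, `data` has the same structure as the dictionary given in the text.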
+
+### HDF5Parser
+
+This is the mapping parser for HDF5 files. It uses [`h5py`](https://www.h5py.org/) to load
+the file as an HDF5 group. As with [XMLParser](#xmlparser), the HDF5 datasets are
+iteratively parsed from the underlying groups, and any attributes present are
+parsed as well. The `from_dict` method is also implemented to convert a dictionary into an
+HDF5 group.
+
+### MetainfoParser
+
+This is the mapping parser for NOMAD archive files or metainfo sections.
+It accepts a schema root node annotated with `MappingAnnotation` as `data_object`.
+`create_mapper` generates the actual mapper from the annotations matching the `annotation_key`.
+If a `filepath` is specified, it instead falls back on the [`ArchiveParser`](--ref--).  <!-- TODO: add reference -->
+
+The annotation should always point to a parsed value via a `path` (JMESPath format).
+It may optionally specify a multi-argument `operator` for data transformation.  <!-- most operators are binary, would change the name -->
+In this case, specify a tuple consisting of:
+
+- the operator name, defined within the same scope.
+- a list of paths with the corresponding values for the operator arguments.  <!-- @Alvin: can you verify? -->
+
+Similar to `MSection`, it can be converted to (`to_dict`) or from (`from_dict`) a Python `dict`.
+Other attributes are currently accessible.
+
+```python
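+import numpy as np
+
+from nomad.datamodel.data import ArchiveSection
+from nomad.datamodel.metainfo.annotations import Mapper as MappingAnnotation
+from nomad.metainfo import Quantity, SubSection
+from nomad.parsing.file_parser.mapping_parser import MetainfoParser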
+
+class BSection(ArchiveSection):
+    v = Quantity(type=np.float64, shape=[2, 2])
+    v.m_annotations['mapping'] = dict(
+        xml=MappingAnnotation(mapper='.v'),
+        hdf5=MappingAnnotation(mapper=('get_v', ['.v[0].d'])),
+    )
+
+    v2 = Quantity(type=str)
+    v2.m_annotations['mapping'] = dict(
+        xml=MappingAnnotation(mapper='.c[0].d[1]'),
+        hdf5=MappingAnnotation(mapper='g.v[-2]'),
+    )
+
+class ExampleSection(ArchiveSection):
+    b = SubSection(sub_section=BSection, repeats=True)
+    b.m_annotations['mapping'] = dict(
+        xml=MappingAnnotation(mapper='a.b1'), hdf5=MappingAnnotation(mapper='.g1')
+    )
+
+ExampleSection.m_def.m_annotations['mapping'] = dict(
+    xml=MappingAnnotation(mapper='a'), hdf5=MappingAnnotation(mapper='g')
+)
+
+parser = MetainfoParser()
+parser.data_object = ExampleSection(b=[BSection()])
+parser.annotation_key = 'xml'
+parser.mapper
+# Mapper(source=Path(path='a'....
+```
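To make the two annotation forms concrete, the sketch below shows one way a plain path and an `(operator name, [paths])` tuple could be evaluated against source data. Everything in it is hypothetical: `resolve` handles only dotted keys and integer indices (a small subset of JMESPath) and ignores the leading `.`, `ToySource` stands in for a source parser, and the body of `get_v` is an assumption written with plain lists.

```python
import re

# Matches either a key or an integer index such as [0] or [-2].
TOKEN = re.compile(r'([^.\[\]]+)|\[(-?\d+)\]')


def resolve(data, path):
    # Walk nested dicts and lists along a path like '.v[0].d'.
    for key, index in TOKEN.findall(path):
        data = data[key] if key else data[int(index)]
    return data


class ToySource:
    """Hypothetical source parser holding a dictionary and one operator."""

    def __init__(self, data):
        self.data = data

    @staticmethod
    def get_v(value):
        # Drop the first row and keep the first two columns.
        return [row[:2] for row in value[1:]]


def evaluate(source, mapper):
    # A string is a plain path; a tuple is (operator name, list of paths).
    if isinstance(mapper, str):
        return resolve(source.data, mapper)
    name, paths = mapper
    operator = getattr(source, name)
    return operator(*[resolve(source.data, p) for p in paths])


source = ToySource({'v': [{'d': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]}]})
direct = evaluate(source, '.v[0].d[0]')
computed = evaluate(source, ('get_v', ['.v[0].d']))
```

Here `direct` is the first row of the matrix, while `computed` is the result of passing the whole matrix through `get_v`.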
+
+### Converting mapping parsers
+
+The following sample Python code illustrates the mapping of the contents of an
+HDF5 file to an archive. First, we create a `MetainfoParser` object for the archive. The
+annotation key is set to `hdf5`, which generates a
+[mapper](#mapper) from the `hdf5` annotations in the schema definitions. Essentially,
+only metainfo sections and quantities with the `hdf5` annotation will be mapped. The mapper
+contains paths for the source (HDF5) and the target (archive). The archive is then
+set as the `data_object` of the archive parser. Here, the archive already contains some data,
+which should be merged with the data to be parsed. Next, a parser for the HDF5 data is
+created. We use a custom subclass of `HDF5Parser` that implements the `get_v` method
+referenced in the annotation of `BSection.v`. In this example, we do not read the data from an HDF5 file but
+instead generate it from a dictionary using the `from_dict` method. Invoking the
+`convert` method populates the archive parser's data object with the corresponding
+HDF5 data.
+
+```python
+    class ExampleHDF5Parser(HDF5Parser):
+        @staticmethod
+        def get_v(value):
+            return np.array(value)[1:, :2]
+
+    archive_parser = MetainfoParser()
+    archive_parser.annotation_key = 'hdf5'
+    archive_parser.data_object = ExampleSection(b=[BSection(v=np.eye(2))])
+
+    hdf5_parser = ExampleHDF5Parser()
+    d = dict(
+        g=dict(
+            g1=dict(v=[dict(d=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))]),
+            v=['x', 'y', 'z'],
+            g=dict(
+                c1=dict(
+                    i=[4, 6],
+                    f=[
+                        {'@index': 0, '__value': 1},
+                        {'@index': 2, '__value': 2},
+                        {'@index': 1, '__value': 1},
+                    ],
+                    d=[dict(e=[3, 0, 4, 8, 1, 6]), dict(e=[1, 7, 8, 3, 9, 1])],
+                ),
+                c=dict(v=[dict(d=np.eye(3), e=np.zeros(3)), dict(d=np.ones((3, 3)))]),
+            ),
+        )
+    )
+    hdf5_parser.from_dict(d)
+
+    hdf5_parser.convert(archive_parser)
+
+    # >>> archive_parser.data_object
+    # ExampleSection(b, b2)
+    # >>> archive_parser.data_object.b[1].v
+    # array([[4., 5.],
+    #        [7., 8.]])
+```
+
+## Mapper
+
+A mapper is required to convert a mapping parser to a target mapping parser
+by mapping data from the source to the target. There are three kinds of mapper: `Map`,
+`Evaluate` and `Mapper`, each inheriting from `BaseMapper`. A mapper has the attributes
+`source` and `target`, which define the paths to the source data and the target, respectively.
+`Map` is intended for mapping data directly from source to target. The path to the data is
+given by the attribute `path`. `Evaluate` executes the function named by
+`function_name` with the arguments given by the mapped values of the paths in
+`function_args`. Lastly, `Mapper` allows the nesting of mappers by providing a list of
+mappers to its attribute `mapper`. All paths are instances of `Path`, with the string
+value of the path to the data given by the attribute `path`. The value of `path` should
+follow the [JMESPath specification](https://jmespath.org/specification.html) but can be
+prefixed by `.` to indicate that the path is relative to the parent. This tells the
+mapper from which source to get the data.
+
+```python
+    Mapper(
+        source=Path(path='a.b2'), target=Path(path='b2'), mapper=[
+            Mapper(
+                source=Path(path='.c', parent=Path(path='a.b2')),
+                target=Path(path='.c', parent=Path(path='b2')), mapper=[
+                    Map(
+                        target=Path(
+                            path='.i', parent=Path(path='.c', parent=Path(path='b2'))
+                        ),
+                        path=Path(
+                            path='.d', parent=Path(path='.c', parent=Path(path='a.b2'))
+                        )
+                    ),
+                    Evaluate(
+                        target=Path(
+                            path='.g', parent=Path(path='.c', parent=Path(path='b2'))
+                        ),
+                        function_name='slice', function_args=[Path(path='a.b2.c.f.g.i')]
+                    )
+                ]
+            )
+        ],
+    )
+```
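The role of the `.` prefix and the `parent` attribute can be illustrated with a toy model. The sketch below is an assumption about how a relative path might be expanded, not the behaviour of the actual `Path` class: `ToyPath`, `absolute` and `lookup` are hypothetical names, and `lookup` supports only plain dotted keys rather than full JMESPath.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPath:
    """Hypothetical stand-in for `Path`; only `path` and `parent` are modelled."""

    path: str
    parent: Optional['ToyPath'] = None

    def absolute(self) -> str:
        # A leading '.' marks the path as relative to the parent.
        if self.path.startswith('.') and self.parent is not None:
            return self.parent.absolute() + self.path
        return self.path.lstrip('.')


def lookup(data, dotted):
    # Walk a nested dictionary along plain dotted keys.
    for key in dotted.split('.'):
        data = data[key]
    return data


# Same nesting as the source paths of the `Map` entry above.
source_root = ToyPath('a.b2')
source_c = ToyPath('.c', parent=source_root)
source_d = ToyPath('.d', parent=source_c)

source_data = {'a': {'b2': {'c': {'d': [1, 2, 3]}}}}
full_path = source_d.absolute()
value = lookup(source_data, full_path)
```

In this toy, the chain of parents expands `.d` to `a.b2.c.d`, which is then looked up in the source dictionary.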