Normalization refactor
We currently have a concept of "normalization" that is separate from parsing. For this we have separate classes that are run in a predetermined order based on what is configured in nomad.yaml. In addition, schemas may define a "normalize" function that specifies schema-specific functionality, which is run by a special `MetainfoNormalizer`, typically at the very end.
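To make the current flow concrete, here is a toy model of it. This is only a sketch: every class and variable name below is a hypothetical stand-in, not one of NOMAD's real classes or configuration keys. It only illustrates the shape of "one globally ordered list of normalizer classes, with schema-defined normalize functions run by a special normalizer at the end".

```python
# Toy model of the current flow. All names are hypothetical stand-ins,
# not NOMAD's real classes or configuration keys.


class SystemNormalizer:
    def normalize(self, archive: dict) -> None:
        archive["log"].append("system")


class ResultsNormalizer:
    def normalize(self, archive: dict) -> None:
        archive["log"].append("results")


class MySection:
    """Stand-in for a schema section that defines its own normalize function."""

    def normalize(self, archive: dict) -> None:
        archive["log"].append("schema:MySection")


class SchemaNormalizer:
    """Stand-in for the special normalizer that calls schema-defined normalize functions."""

    def normalize(self, archive: dict) -> None:
        for section in archive["sections"]:
            section.normalize(archive)


# One global order shared by every parser and schema (imagine it read from config).
GLOBAL_ORDER = [SystemNormalizer, ResultsNormalizer, SchemaNormalizer]


def process(archive: dict) -> dict:
    # Every entry goes through the same normalizers in the same order.
    for normalizer_cls in GLOBAL_ORDER:
        normalizer_cls().normalize(archive)
    return archive


archive = process({"sections": [MySection()], "log": []})
print(archive["log"])
```

In this model the schema has no say in when its own normalize function runs relative to the class-based normalizers.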
There are certain limitations to this approach:
- A single global order for the normalizers does not work for every parser and schema.
- The distinction between parsing and normalizing is often blurred.
- It is hard to track the execution order of normalizers.
- It is hard to understand the difference between a `Normalizer` and a `normalize` function.
- It is hard for a plugin developer to make use of the existing normalization code in `nomad.normalizing`.
To simplify this, we could adopt a different strategy, such as:
- All processing of an entry could be moved into a single `normalize`/`parse` function of the schema. In the case of Python schemas this is almost the case already. In the case of a parser, we would explicitly define a schema as well that would just combine the contents of the current `metainfo` folder and `parser.py` module. Essentially parser == schema.
- There would no longer be a set of separate `Normalizers` that would be run after parsing: each schema could still make use of their functionality, but it would be in full control of the call order.
- For parsers/schemas that behave very similarly (such as all electronic structure parsers) there could be base classes that trigger a common set of normalization procedures, but the schema could override this at any point.
- The current code in `nomad.normalizing` is based on classes that operate on archives. The classes typically contain different methods for processing different parts of the data. It is hard for a plugin developer to use the existing normalization functionality since it is abstracted inside these monolithic classes. The situation could be improved by providing several smaller pure functions inside the `nomad.normalizing.<name>` modules.
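The points above could look roughly like the following. This is a minimal sketch under stated assumptions, not a proposed implementation: the function names, the base class, and the dict-based "entry" are all hypothetical and do not correspond to anything that exists in NOMAD today. It shows small pure functions, a base class that triggers a common set of steps, and a schema that overrides the call order.

```python
from collections import Counter
from typing import List


# Small pure functions, as they might live in a nomad.normalizing.<name> module.
# (Hypothetical names; they take plain data and return plain data.)
def count_atoms(symbols: List[str]) -> int:
    return len(symbols)


def formula_from_symbols(symbols: List[str]) -> str:
    """Build a simple formula with elements in alphabetical order."""
    counts = Counter(symbols)
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(counts.items()))


class ElectronicStructureSchema:
    """Hypothetical base class that triggers a common set of normalization steps."""

    def parse(self, mainfile_text: str) -> dict:
        raise NotImplementedError

    def normalize(self, data: dict) -> dict:
        data["n_atoms"] = count_atoms(data["symbols"])
        data["formula"] = formula_from_symbols(data["symbols"])
        return data

    def process(self, mainfile_text: str) -> dict:
        # Parsing and normalizing live in one place: parser == schema.
        return self.normalize(self.parse(mainfile_text))


class MyCodeSchema(ElectronicStructureSchema):
    """Hypothetical schema for one code; it controls its own call order."""

    def parse(self, mainfile_text: str) -> dict:
        # Toy format: one element symbol per whitespace-separated token.
        return {"symbols": mainfile_text.split()}

    def normalize(self, data: dict) -> dict:
        # Run a code-specific fix first, then the common steps from the base class.
        data["symbols"] = [s.capitalize() for s in data["symbols"]]
        return super().normalize(data)


entry = MyCodeSchema().process("h o h")
print(entry)
```

Because the helpers are pure functions, a plugin developer could call `formula_from_symbols` directly without instantiating any normalizer class or constructing a full archive.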
I feel that this would bridge the gap between parsers and Python schemas and would make plugin development much more intuitive.