Restructure parser plug ins

removed parsers label

mentioned in issue #1744

created branch 1536-restructure-parser-plug-ins to address this issue

mentioned in merge request !1538 (closed)

mentioned in issue #1836

The described issue is related to a NOMAD plugin or plugin under development. Please refer all further discussion to this issue about refactoring NOMAD plugins.

New parent issue: #1836

Hi @ladinesa @jrudz @ndaelman,

I think we could start to make decisions about the organization of the parser plugins, and I would like to hear your opinions on:

1. How should a parser plugin be structured? I think we can simplify the current layout to a sub-folder nomad-coe/nomad-parser-plugin-<name>/parser/ that contains only two modules (no metainfo subfolder): parser.py and extended_schema.py. As I understood yesterday, this is going to be merged eventually, but why not simplify our lives already? Plus, this structure could also be used once the reader and writer are combined in a schema for a file, so that we can maintain one module per file (mainfile.py, <auxiliary_file_1_name>.py, ...; here mainfile.py can keep its name, and the auxiliary file modules can have their own names).
2. Should we keep using x_<codename>_ for code-specific quantities? I'd say we can stop using these, as the original intent of base classes is to inherit from somewhere else and extend the metainfo with user-defined quantities without the need for that convention. However, I think there should be some discussion and development on how to handle situations like overwriting quantities (we should have this more under control, as normalization depends on what a Quantity is assumed to be), and perhaps other topics.
3. What about not adding certain unused (or not sufficiently developed) parser plugins as submodules in the central NOMAD? I agree we should move them to plugins, but maybe we can skip linking them as submodules until they are developed enough.
4. And lastly, I think we should start promoting this structure a bit among devs, so that they can develop and maintain their own parsers. For this, I think we could create a skeleton that they can start working in (with comments and explanations of what to populate); I wonder whether we can refine the parser example for this. --> Work out the documentation by using nomad-parser-plugin-example as the template for devs to use.

Let me know what you think. I am facing these questions in #1846 (closed), #1878, #1773; I guess you all have your own situations (like MOLPRO, elphbolt, and other MD parsers).

I'm not sure why we have to readdress these points again... Maybe it's just a lack of note-taking in older meetings? Anyhow, if there are new developments that appear in the issues you linked, pls provide the actual thread / note where they're being discussed.

Going point-by-point in a generic fashion:

The plugin structure has been set by Markus S. It is the same for everyone in NOMAD. If you want to streamline it, talk to him. I agree that the subfolder structure is a bit convoluted for smaller projects, but that could easily be mitigated by a template.

1.1 @ladinesa can probably explain this better, but metainfo is mostly for descriptive purposes. This will likely become important when searching for plugins in a "store front". It will also become the new location of where to specify file matching.

1.2 The structure you propose with auxiliary files is exactly what we started doing already (see molpro parser). If you ask @ladinesa for a review, he'll point this out. mainfile.py is just called parser.py. No value in changing the name. Actually, semantically, parser.py conveys the structure more correctly.

Should we keep using x_<codename>_ for code-specific quantities? I'd say we can stop using these [...]

I agree with the sentiment. Some old parsers, like FHI-aims, in particular need a redo to gut these properties. Still, as you mentioned, then there are the control aspects:

2.1 Normalization handling: this is almost trivial, as the dev should know when to overwrite normalization. To facilitate this process, we should provide minimal functions that target individual (groups of) quantities. The only "problem" appears when higher sections touch lower-section quantities. We should thus (a) refrain from cross-normalizing as much as possible and (b) clarify to any dev that some quantities are used in different sections too. They are then responsible for inheriting and extending / overwriting these as well.
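A minimal sketch of what "minimal functions that target individual quantities" could look like. It uses plain Python dataclasses as stand-ins rather than the actual NOMAD metainfo classes, and all class and quantity names (Outputs, CustomOutputs, energy_total, n_atoms, energy_offset) are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outputs:
    """Hypothetical stand-in for a base section of the standard schema."""

    energy_total: Optional[float] = None  # assumed quantity name
    n_atoms: Optional[int] = None  # assumed quantity name
    energy_per_atom: Optional[float] = None  # derived during normalization

    def normalize_energy_per_atom(self) -> None:
        """Small, targeted step that derives exactly one quantity."""
        if self.energy_total is not None and self.n_atoms:
            self.energy_per_atom = self.energy_total / self.n_atoms

    def normalize(self) -> None:
        """Runs the targeted steps; touches only this section's own quantities."""
        self.normalize_energy_per_atom()


@dataclass
class CustomOutputs(Outputs):
    """Hypothetical code-specific extension of the base section."""

    energy_offset: float = 0.0  # assumed code-specific quantity

    def normalize_energy_per_atom(self) -> None:
        """Overrides a single step instead of rewriting normalize() as a whole."""
        if self.energy_total is not None and self.n_atoms:
            shifted = self.energy_total - self.energy_offset
            self.energy_per_atom = shifted / self.n_atoms
```

With this split, a dev only overrides the step whose assumptions change, and the base normalize() keeps orchestrating the rest.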

2.2 Name clashes: these are very hard to foretell, and arguably the only case where x_<program_name> still serves a purpose. I think the best strategy here is to be very clear on the terminology in the standard schema (probably develop a terminology list / ontology that runs ahead of our current quantities) and hope that conflicting cases only occur rarely.
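One way to at least detect such clashes early is a small check that lists the names an extension redefines from its base class. This is only a sketch on plain Python classes; find_name_clashes, Outputs, CodeOutputs and the quantity names are hypothetical and not part of NOMAD:

```python
def find_name_clashes(base: type, extension: type) -> set:
    """Return the public names that `extension` redefines from `base`."""
    inherited = {name for name in dir(base) if not name.startswith("_")}
    own = {name for name in vars(extension) if not name.startswith("_")}
    return inherited & own


class Outputs:
    """Hypothetical stand-in for a section of the standard schema."""

    band_gap = "standard definition"

    def normalize(self):
        pass


class CodeOutputs(Outputs):
    """Hypothetical code-specific extension."""

    band_gap = "code-specific quantity with a different meaning"  # clash
    smearing_width = "new code-specific quantity"  # no clash


clashes = find_name_clashes(Outputs, CodeOutputs)
```

Here clashes contains only band_gap. Intentional overrides (for example a redefined normalize) would be reported as well, so the result is a list to review rather than a list of errors.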

I'm not sure what you mean... Each oasis admin decides themselves which plugins are installed. On your local pc, you are that admin. As the plugins are rolled out, you won't have to bother with any parsers, except those you are developing at that moment. Obviously, all standard ones should be available in the central service.

I think we could create a skeleton that they can start working

Skeleton? We should have both a template and documentation on how to write / contribute to a parser, yes. We already set a date for the latter...

Sorry, I didn't know these have been discussed already. In my issues, they are definitely not.

1. I meant the folder's module structure, but I saw your molpro structure and this is pretty much what I had in mind, so I will try to follow it. I am also curious where you would put the metainfo extension if you needed one (which is what I said we could avoid doing, a sub-folder nomad-parser-plugin-/parser/metainfo/).
2. Ok.
3. Yeah, sorry for the confusion: I meant in the central NOMAD via the simulation-parsers submodule; are we going to connect all the parser plugins, even if these are empty?
4. Indeed, this goes into the documentation and using nomad-parser-plugin-example for it. I will edit my comment to resolve this question.

Sorry, I didn't know these have been discussed already. In my issues, they are definitely not.

No prob, I guess that there are too many discussions that go in parallel. The conclusions could be communicated better...

If there are aspects that I overlooked but that pop up in your issues, then feel free to point them out.

I meant only parsers that have been updated, of course. Since the projects are stand-alone, we can do this on a case-by-case basis. In practice, when we have the plugin version of a parser, we ask Markus to install that one. Deprecation is as simple as removing the matching from the central code (and once all relevant parsers have been substituted, removing the old dependency). This will likely entail reprocessing any calculation of that code (and workflows that use them).

In terms of the x_<code_name>: I partially agree that we may want to deprecate this particular convention. However, I would opt to keep some sort of standard naming for storing data that is in the raw file but that we don't recognize and cannot connect to the NOMAD schema. These quantities do not need to be stored individually but should still be available in the archive. In the H5MD parser I have opted for subsections named, e.g., x_h5md_custom_calculations (but I would be happy to remove the parser-specific label, e.g., custom_calculations), which ends up being a list of custom observables that the user stores in the archive but that are not recognized as observables existing in the schema. I anticipate this to be a way for users to easily store custom data in the archive, but also to propose additions to the schema by pointing to these custom quantities in existing archives.
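A rough sketch of this idea, assuming a generic custom_calculations list that collects anything the schema does not recognize. It is not the actual H5MD parser code; KNOWN_OBSERVABLES, CustomCalculation, Outputs and store_observable are hypothetical names on plain Python dataclasses:

```python
from dataclasses import dataclass, field

# Assumed subset of observable names that the standard schema recognizes.
KNOWN_OBSERVABLES = {"temperature", "pressure"}


@dataclass
class CustomCalculation:
    """One unrecognized observable, kept as found in the raw file."""

    name: str
    value: float
    unit: str = ""


@dataclass
class Outputs:
    """Hypothetical stand-in for an outputs section of the schema."""

    observables: dict = field(default_factory=dict)
    custom_calculations: list = field(default_factory=list)


def store_observable(outputs: Outputs, name: str, value: float, unit: str = "") -> None:
    """Store recognized observables by name; collect the rest as custom entries."""
    if name in KNOWN_OBSERVABLES:
        outputs.observables[name] = value
    else:
        outputs.custom_calculations.append(CustomCalculation(name, value, unit))


outputs = Outputs()
store_observable(outputs, "temperature", 300.0, "K")
store_observable(outputs, "my_order_parameter", 0.7)
```

After these two calls, temperature lands in observables, while my_order_parameter is kept in custom_calculations, where it stays available in the archive and could later be pointed to when proposing a schema addition.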

@ndaelman Yeah, this is what I have in mind too. In any case, some parsers do not have more than a few dozen entries or so.

@jrudz let's summarize then what we can teach people on the usage of our base classes and some conventions:

A user inherits a base class defined in our data schema, e.g., Outputs.
They extend it by inheriting in some class CustomOutputs(Outputs) or VASPOutputs(Outputs).
Outputs.normalize() takes care of normalizing only the things that are defined in the base class. CustomOutputs.normalize() can handle custom normalizations, just as @ndaelman described in 2.1 and 2.2 above.

Note that none of these steps requires x_<codename>; we just need to be careful that the normalizations in the simulation base classes normalize what they should.
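The steps above can be sketched in plain Python as follows. The classes are simplified stand-ins, not the actual NOMAD base classes, and the quantity names (total_energy, entropy_term) are hypothetical:

```python
class Outputs:
    """Hypothetical stand-in for a base class from the data schema."""

    def __init__(self, total_energy=None):
        self.total_energy = total_energy  # assumed base quantity
        self.normalized = []  # records which quantities were normalized

    def normalize(self):
        # The base class only normalizes quantities it defines itself.
        if self.total_energy is not None:
            self.total_energy = float(self.total_energy)
            self.normalized.append("total_energy")


class VASPOutputs(Outputs):
    """Hypothetical extension; the new quantity carries no x_vasp_ prefix."""

    def __init__(self, total_energy=None, entropy_term=None):
        super().__init__(total_energy)
        self.entropy_term = entropy_term  # assumed code-specific quantity

    def normalize(self):
        super().normalize()  # base quantities are handled by the base class
        if self.entropy_term is not None:
            self.entropy_term = float(self.entropy_term)
            self.normalized.append("entropy_term")


out = VASPOutputs(total_energy="1.5", entropy_term=2)
out.normalize()
```

Calling super().normalize() first keeps the base behavior intact, so the extension only adds the handling of its own quantities.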

mentioned in issue #1886 (closed)

mentioned in merge request !1686 (merged)
