Metainfo entries for Machine Learning models, workflows and output

A user asked for advice on how to have the output of his ML models stored as NOMAD entries. In particular, he is interested in crystal structures as the output of his ML predictions. This could be extended to any kind of prediction, for example bulk_modulus values predicted by formulas.

In the same way that we provide infrastructure for generating entries from the output of DFT codes, we could also do so for ML in materials informatics. Maybe something is already available for this, or there is a concept that I am not aware of. @lucamghi please feel free to educate me on this topic. Obviously, with the upcoming variety of ML codes, writing code-specific parsers does not sound to me like a scalable and sustainable way to proceed. But maybe the new NOMAD features coming along, like custom schemas, offer a solution in this regard.

Here is an example of a particular case: one user is doing generative ML for crystal structures. The output of the prediction is a set of crystal structures that he would like to have as NOMAD entries, so that he can benefit from the features and interoperability that NOMAD offers. One option would be to have the following in an upload:

A notebook that the user uses to train and run his model, which hopefully could be run directly on NOMAD even after the upload is published.
The output of the prediction (structure files, e.g. .cif files).
A custom schema file customML.schema.archive.yaml. This schema contains sections and quantities that can handle the output. For the crystal structure case, we could have a base section BaseSectionWithStructureFile with a Quantity called structure_file. This structure file gets parsed into the NOMAD metainfo, populating run.system.atoms, and the normalizers are then called to populate results.materials, creating relatively rich NOMAD entries. The schema will also need some metadata with a minimal description of the code used, the ML method, and so on. This could be provided as a base class with ELN components, or written directly in an archive.json file that references the custom schema.
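
To make the archive.json option more concrete, here is a minimal sketch of how one archive.json per predicted structure could be generated in the notebook. It is only an illustration under assumptions: the m_def reference string is my guess at how an uploaded schema is referenced and should be checked against the NOMAD documentation, and the quantities code_name and ml_method are hypothetical placeholders for the minimal metadata, which is still to be defined. Only structure_file and the schema file name come from the proposal above.

```python
import json
from pathlib import Path

# ASSUMPTION: reference to the first section defined in the custom schema file.
# The exact reference syntax must be verified against the NOMAD documentation.
SCHEMA_REF = (
    "../upload/raw/customML.schema.archive.yaml"
    "#/definitions/section_definitions/0"
)


def build_archive(structure_file, code_name, ml_method):
    """Return the content of one archive.json for one predicted structure."""
    return {
        "data": {
            "m_def": SCHEMA_REF,
            "structure_file": structure_file,  # quantity named in the proposal
            "code_name": code_name,            # hypothetical metadata quantity
            "ml_method": ml_method,            # hypothetical metadata quantity
        }
    }


def write_archives(upload_dir, code_name, ml_method):
    """Write <name>.archive.json next to every .cif file found in upload_dir."""
    written = []
    for cif in sorted(Path(upload_dir).glob("*.cif")):
        target = cif.with_name(cif.stem + ".archive.json")
        archive = build_archive(cif.name, code_name, ml_method)
        target.write_text(json.dumps(archive, indent=2))
        written.append(target)
    return written
```

Calling write_archives("my_upload", "my_generator", "VAE") on a folder of .cif files would then produce one archive file per structure, so each predicted structure could become its own entry without a code-specific parser.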

We could then design an example upload for this kind of workflow that users can adapt to their own needs.

With the help of @himanel1 and @ladinesa, I have written an ElnWithStructureFile base section that roughly does this normalization, but to make it work we needed to make a change in nomad/normalizing/method.py. All of this is in this branch as a proof of concept, but I have no idea what the minimal metadata needed to describe the ML workflow accompanying these entries would be.

@mscheidg, @lucamghi, @afekete what do you think about this topic? Please feel free to tag anyone else who might be interested.

Edited Sep 30, 2022 by Jose Marquez Prieto