We were talking about a full refactoring of the section run into data. The idea is to slowly build the metadata schema for data maintaining the three main sections: System, Method and Calculation.
The goal is to have normalize() specifically in each of the main sections (with specific defined functions inside), and tackle normalization there (instead of in SystemNormalizer, MethodNormalizer and all the potential list of normalizers for Calculation).
Furthermore, in this refactoring we will heavily clean-up the sub-sections and quantities currently deprecated, unused or wrongly defined in the metainfo.
Lastly, we should also refactor the plugin projects structure (as Pavel pointed out and as we know from the EELSDB parser, it is a bit confusing). An open question here: how much of the current parsers we want to maintain? IMO: we should deprecate some of them, as they are actually empty or only populating very simple metadata.
As a dynamic (we can modify this at will):
Start with System and slowly build the quantities needed to define it.
After that and add extensive testing, go into Method.
Finally, we take care of Calculation. This is the most heavily refactored one.
Some simple rules on developing:
Always add typing for passing arguments to functions.
Be as general as possible. One of the current problems on run is that some quantities are a bit restrictive because of the packing that was done before; in other words, keep as much as possible a clean class structures (e.g., avoid the Dos or BandStructure current structure).
Add extensive descriptions in the normalize functions. Be verbose on which quantities are minimal and which ones are extra.
Always ask for feedback on some unknown quantities.
✓
3 of 3 checklist items completed
· Edited
Designs
Child items
...
Show closed items
Linked items
0
Link issues together to show that they're related.
Learn more.
This is quite a challenge, but I like the ambition. A few tips, concerns, add-ons:
Involve @ladinesa into these steps as closely as possible. You will probably mess up his parsers alot.
I had similar thoughts about the workflowEntryArchive subsection. We are also thinking to improve the general normalize-function design. We should discuss/plan the other refactors together in a meeting with @mscheidg, @hnaesstroem, @ladinesa, @himanel1, @ndaelman, @jrudz, @pizarroj
I like to add a requirement: Make sure that the simulation package is basically a plugin and not imported by any other nomad-FAIR module. We currently have ase, pymatgen, mdanalysis, mdtraj, matid and other quite heavy dependencies in the main nomad-lab installation. This refactor could help to make some of them optional.
Try to write down some documentation/best-practices on the use of composition vs. inheritance, naming conventions, etc. before hand
Thanks for the feedback @mscheidg, this all sounds reasonable to me. From the Area C side, we will put together an initial strategy/proposal and then call a meeting with the above-mentioned people to discuss.
@himanel1@jrudz I have a couple of minor doubts regarding the schema for System:
Why AtomsGroup is a separate sub-section of Atoms? Shouldn't it be under Atoms?
Prototype refers to AFLOW metadata. Can we move this into the AFLOW plugin? With the new User-defined search this could go there. Or is there anything else I didn't find about this metadata?
This is a class, how can it be under Atoms without being a sub-section?
Not saying that this is relevant in this discussion, but you could use (multiple) inheritance to introduce the properties at the same level.
Alternatively, there's also composition (more flexible), but then you have to add a manual mapping if you want to avoid the layering of data structures.
AtomsGroup and Atomsnow live as separate sub-sections under System (system.atoms and system.atoms_group. But I would have grouped them in the same section, i.e., system.atoms describing the full atomic system, and system.atoms.atoms_group nesting subgrous of Atoms.
Ah, ok, I see. I cannot remember if I did this for any particular reason. It was basically the first thing I did when I joined, so probably not. From my perspective now I would say that it makes sense as a subsection of atoms.
@himanel1 any technical arguments from your side about keeping atoms_group out of atoms?
The only other thing we should consider maybe is the support for coarse-grained descriptions (which I would also like to tackle during this refactoring). There may be an argument to keep the atoms_group at a higher level (renamed to something more general), such that the grouping can reference either atoms or some coarser entities. I am not sure though, I need to think about it more.
There is an alternative which could prevent these potential issues: move all quantities inside Atoms to System, and keep AtomsGroup as a sub-section of System.
so we specialize System to include the usual atomic system parameters and in case there are atom groups we use CompositeSystem with components for atoms_group. basically, what @pizarroj just said i think.
I think you both should be included in the discussions in that repo, but we can also keep it to this issue (I can copy-paste the message here). What do you think @jrudz?
IMO the discussion there is maybe more extensive than Laurie and Alvin might want (or have time) to follow. Maybe we can try to summarize here once in a while when we have bigger breakthroughs or questions? I don't know, whatever is easiest for people I guess. Have you already shared your slides here?
@mscheidg We still have many conceptual things to figure out here, however, should we already schedule a short meeting to discuss more about how the refactoring would work in practice or would you prefer to start that discussion here first?
For the former, I imagine that 30 min would be enough to get the major questions out of the way and walk us through Area D's perspective?
I think early this morning I can share the branch with the agreed initial base sections and schema for System (or how we should call it from now on to differentiate from the baseclass.py one, ModelSystem). There are some details here that Area D missed but that I will try to explain later on. For now, I have only three points (mainly for @mscheidg@himanel1@ladinesa , please comment on this as you were involved one or another way on them) that I need help with:
Is OK if we fully deprecate Prototype, SpringerMaterial and Constraint? The first seems an AFLOW specific metainfo, the second could be safely deleted imo, and the third is unused (or at least, I didn't find any case where this is used).
@mscheidg what about deprecating, or leaving out for now, Descriptors and SOAP. I see someone called James Darby worked on this, but I am not 100% sure how this should make it into the base class, but please, feel free to argue against it.
@himanel1 what about the section topology in results? I see there you flattened everything, and Nathan pointed out yesterday that this is because of ES. Our idea here is to remove AtomsGroup and simply nest ModelSystem over itself. This, together with defining some of the quantities directly under ModelSystem is working already for the parent-child system trees, but I have the question of what would happen when we eventually connect data and results. We can also discuss this in a Zoom if you want, as there are some other details behind this that we should both know.
For @ndaelman and @jrudz , as an example testing I will be using Wannier90, as it will show all functionalities one way or another. But, do you want me to add some other parser to the testing?
From my side I don't think it's necessary to include another parser. Let's just see how it plays out for your example.
Small comment abount nr. 2: In the near future, I was planning to extend the descriptors class to include some other representations. It doesn't much matter to me if this is deprecated now and then revived once we have the base classes set up or carried over immediately, just thought I would mention it. If we do want to carry it over, we could discuss how to represent this in our system class early next week.
No problem, we can also keep it. I am curious about @mscheidg and James Darby usage of this, as maybe we could reuse it via normalization in some cases.
@himanel1 what about the section topology in results? I see there you flattened everything, and Nathan pointed out yesterday that this is because of ES. Our idea here is to remove AtomsGroup and simply nest ModelSystem over itself. This, together with defining some of the quantities directly under ModelSystem is working already for the parent-child system trees, but I have the question of what would happen when we eventually connect data and results. We can also discuss this in a Zoom if you want, as there are some other details behind this that we should both know.
@ndaelman is right: it is only flattened because of ES. I think nested ModelSystems are the way to go in your schema. If results.material.topology is still around when the computational data is migrated into the new schema, then I would still "flatten" it out there so that each part can be searched for in ES.
One of the objectives for refactoring into base sections was exactly to avoid nesting (in the JSON-sense):
It becomes a pain in the ass to work with, since we have to traverse an object-tree to add anything within the hierarchy, instead of a list... Of course, you wouldn't manually append to the list, but rather have a specialized method (attach or __add__) handle that. The only inputs would be the parent and child labels, as well as the group being added.
System acts as a ledger showing the various groupings. Since the hierarchy comes from multiple sources (e.g. parser, normalizer, uploader) the hierarchies are no longer simple partitions. Re-using building blocks or denoting overlap is easier in this structure.
This approach makes mapping to the topology almost trivial.
To be clear, I'd have a non-repeating System section, which contains a repeating section SubSystem that references its constituent components, as well as the identifier of it parent SubSystem.
@pizarroj Note how SubSystem would be a further abstraction of AtomicCell in your class diagram.
The latter will definitely be used, but I wouldn't keep ModelSystem leaner by using a sub_system. We can state the type at normalization using Python's duck typing, i.e. this sub_system is atomic system since it contains atom_types...
This suggestion is more in line with a graph-like design (where we are working towards), while the current version follows document-based paradigm (like ES).
Note that ES only applies to results and has no (little) bearing on data.
To the first comment: you are thinking on something like ModelSystem(sub_system=[1,2,...,N]) where any element of the list sub_system will contain something to point to its parent. This is what results.material.topology does right now.
But, I'll argue that it is much more cumbersome for someone to code that flattened list, rather than nest ModelSystem over itself. I think the later is much more natural from the parser standpoint, while the former can be extracted when mapping that nested structure to results.materials.topology with the current scripts in an automated way and help ES to handle these situations.
Btw, ModelSystem has to repeat anyways, as we want to cover time evolutions. Your suggestion will results in a list of list, while this data model will in a list of nested sections.
In any case, things get much more complicated once we start talking about grouping different branch levels, and how to do it in a consistent way. I remember trying a flatten list as in results, but it was not clear for when there are multiple branches with multiple childs. In that case, having a flattened list is the real pain.
But, I'll argue that it is much more cumbersome for someone to code that flattened list [...]
The ease of working with either depends on the tools at our disposal for extending and navigating.
We should equip these classes with appropriate functionalities so a dev does not have to compose the structure directly:
For navigation the nested structure has m_xpath at best (and nested for-loops with error management at worst). In the nested approach you still have to keep track of the exact depth in the hierarchy, maybe even the entire path. Conversely, if we have unique naming system (e.g. given name + index), this would be much more easily powered in a list. In other words, you could jump in at any depth without concerning yourself too much with the full hierarchy structure.
Extending is trivial in lists, since it's just appending and updating some metadata, i.e. parents, children, depth, branch_number(?). In nested structures, meanwhile, you either have to populate an entire branch level at once or rely on some sophisticated navigation tools (see previous bullet point). The code logic in our parsers is not always that linear, though... This is a huge time loss that I've faced time-and-time again in the parsers, not just a one-off.
I remember trying a flatten list as in results, but it was not clear for when there are multiple branches with multiple childs.
Can you elaborate pls?
I'm not sure what the exact problem was that you encountered.
If it helps the readability, we can trigger a sort at normalization time.
Btw, ModelSystem has to repeat anyways, as we want to cover time evolutions.
This is a separate discussion.
Have we settled on containing the full time-evolution in data vs workflow?
Honestly, a combined approach seems best there.
Your suggestion will results in a list of list, while this data model will in a list of nested sections.
This is indeed one option of showing multiple systems, though it is a time-first approach (and limited to individual entries).
I guess that users may be interested in the time-evolution of a specific subsystem, in which case a hierarchy-first approach, where you pick the relevant subsystem and then trace it over time, may be preferable.
Another question here is how we handle changing subsystems.
E.g. is a subsystem still the same if its composition changes?
Note: we should also have a step-index next to a time-label.
Is OK if we fully deprecate Prototype, SpringerMaterial and Constraint? The first seems an AFLOW specific metainfo, the second could be safely deleted imo, and the third is unused (or at least, I didn't find any case where this is used).
I think SpingerMaterial can be deprecated. It probably contains things functional_type and compound_type that we have already deemed quite useless and they are not currenlty shown in the GUI.
The information in Prototype is quite useful, but I'm not sure if it needs its own section. In the results, prototype_aflow_id and prototype_formula are part of the material.symmetry section, which makes more sense to me.
Constraint I know nothing about, so I'm fine with removing it.
The information in Prototype is quite useful, but I'm not sure if it needs its own section. In the results, prototype_aflow_id and prototype_formula are part of the material.symmetry section, which makes more sense to me.
I was checking further, and indeed you are right, those 2 quantites (+ strukturberich_designation ) are used. However, I think we can move them to the Symmetry section under model_system and extract them by normalization.
Before going in holidays, I wanted to push some initial try on the new ModelSystem schema, and how this can use base sections, and be used as a base section for computations. The changes are not so heavy, and it was mainly a clean-up together with some improvements in the way we define and normalize the system.
Please, feel free to branch out and run the setup dev env. Here you have some file to quickly test how it looks:
Include or not Descriptors under ModelSystem. Personally, I am in favor of including them, with their normalization within the class.
Improve the front-end when a section inherits from EntryData (right now, the description and the way of naming quantities is not very good, plus the description of some quantities does not appear when hovering, but rather there is some ? symbol which is very annoying to use).
Currently, data and results are not connected. We will tackle this when everything is ready first, then we transfer the mappings from data to the new results section. My dream is that this ModelSystem section can be mapped directly (or at least, just with a minimal effort) in results.
There are some "TODO"s in the schemas and normalizations. Feel free to give feedback on these (otherwise, I will go personally to someone and ask ).
With that said, enjoy Christmas and your holidays. We can talk early next year about the next steps.
@pizarroj Thanks for the update, I just had a brief walk through the example, great start! Thanks a lot for your efforts. Looking forward to discussing more with everyone next year
I am adding also a UML diagram with all the quantities and functions defined there, just in case you prefer to take a quick look.
Here, if the class name does not have color it is a base class (as defined in baseclasses.py) otherwise it is an Area C specific class which will be defined in a plugin schema. Computation as in the turquoise box is what will appear in data to be filled up by parsers and custom schemas. The arrows have the same meaning as defined by Hampus (empty white head means "inherits from", fill black rombo means "contained in", normal arrow means "references to").
(or dark-mode, if you prefer so)
I used draw.io and created a repo to keep this kind of diagram there. I am sure the style is too complex for adding this to the documentation, so feel free to suggest other ideas.
The described issue is related to a NOMAD plugin or plugin under development.
Please refer all further discussion to this issue about refactoring NOMAD plugins.