nomad-FAIR issues
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues
2020-12-18T16:07:58Z

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/401
3-tiered (archive) storage
2020-12-18T16:07:58Z
Markus Scheidgen

## motivation
Reading many calculations from an upload's msg-pack archive file is very time consuming. The optimisation potential is limited due to the unforeseeable nature of archive request patterns. While "what" is requested only varies among a few popular sections, those sections are spread over the archive file. We either read the whole file or do a lot of little reads. There is no good middle ground. We could try to have a better layout in the file, but that would destroy the simplicity of just using the metainfo-defined structure.
## tiered architecture
Therefore, we leave the archive msg-pack files as they are and put another storage in front that only contains the popular data in a different layout. In a sense we already have elastic search as another datastore and will now add a new mongodb collection `archive` in between. Each tier will contain some of the data of the tier below. The lowest tier (msg-pack) is the source of truth, with the exception of user-metadata, where the mongo collection `calc` is the source of truth.
||es|mongo|msg|
|-|-|-|-|
|ids|x|x|x|
|metadata|partial|x|x|
|user-metadata|x|-|-|
|workflows/enc|partial|x|x|
|run|-|-|x|
The `archive.py` module will hide this 3-tiered storage system behind the existing query mechanism `upload_id x [calc_id] x schema -> json/dict`. Instead of always reading everything from files, we now analyse the schema based on annotations in the metainfo to determine what information can be read from what datasource.
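A minimal sketch of how the data source could be chosen from the requested sections. The tier contents are copied from the table above; the names `TIER_SECTIONS`, `TIER_ORDER` and `select_tier` are hypothetical, and the real implementation would derive the tier contents from metainfo annotations instead of a hard-coded dict.

```python
from typing import Dict, List, Set

# Assumption: tier contents hard-coded from the table above. The real system
# would compute this from metainfo annotations/categories.
TIER_SECTIONS: Dict[str, Set[str]] = {
    'mongo': {'metadata', 'workflows'},
    'msg': {'metadata', 'workflows', 'run'},
}

# Fastest tier first; the msg-pack files are the source of truth.
TIER_ORDER: List[str] = ['mongo', 'msg']


def select_tier(required_sections: Set[str]) -> str:
    """Return the first (fastest) tier that holds all requested sections."""
    for tier in TIER_ORDER:
        if required_sections <= TIER_SECTIONS[tier]:
            return tier
    raise KeyError(f'no tier provides {sorted(required_sections)}')
```

A request that only touches `metadata` would be answered from mongo, while any request that includes `run` falls through to the msg-pack files.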
## references
In addition the new implementation will (partially or conditionally) "resolve" references. The referenced sections are also added to the output. Their placement in lists will be honoured by filling the lists with `null` values for absent sections. This way the client will be able to resolve references normally.
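The list placement described above could look roughly like this. The helper name `place_in_list` and the section names `calculation` and `energy` are made up for illustration; only the padding with `None` (serialised as `null`) is taken from the description.

```python
from typing import Any, Dict, List


def place_in_list(target: List[Any], index: int, section: Dict[str, Any]) -> None:
    """Put `section` at `index`, padding the list with None for absent sections."""
    while len(target) <= index:
        target.append(None)
    target[index] = section


# Hypothetical request: something references /run/0/calculation/2, but the
# calculations themselves were not requested. Only the referenced section is
# added to the output, at the position the reference points to.
output: Dict[str, Any] = {'run': [{'calculation': []}]}
place_in_list(output['run'][0]['calculation'], 2, {'energy': 1.0})
```

Because the referenced section keeps its original index, a path such as `/run/0/calculation/2` still resolves on the client without any special handling.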
## metainfo annotations and categories
We use annotations (and categories?) to control what goes into what tier. For elastic search, mongo:calc (`MongoMetadata`), user-metadata (`EditableUserMetadata`) we already have something like this. For the beginning, we will try to go by a category called `FrequentlyAccessed` and assign it to `EntryMetadata/section_metadata` and `Workflow/section_workflow`.

Markus Scheidgen

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1436
A base section for custom schemas to normalize and populate results from crystal structure files
2023-05-08T12:47:15Z
Jose Marquez Prieto

We need a base section that allows us to normalize and populate `results` from crystal structure files. This might be related to #1137 and partly to #1038

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/392
A better selection and presentation of search properties
2020-12-18T16:07:58Z
Markus Scheidgen

The [properties tab](https://nomad-lab.eu/prod/rae/beta/gui/search?visualization=properties) on the search is a little all over the place.
- select a meaningful set of quantities (maybe mix with results from encyclopedia processing?)
- pick good labels
- add tooltips (based on metainfo description with link to metainfo)
Replaces #355

Lauri Himanen

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/788
A convenient metainfo schema file format
2022-04-28T07:53:09Z
Markus Scheidgen

The metainfo already has "json-compatible" serialization and deserialization functions (`MSection.m_from_dict()`, `MSection.m_to_dict()`). These can also be used for schemas, which technically are just "normal" metainfo data. In principle, it is easy to use yaml alongside json to deserialize (i.e. parse) a yaml-based schema. However, ...
- the structure is based on dicts and arrays. Packages have `section_definitions`, Sections have `quantities`, etc. But most of these array elements have a unique name that could be used in a more convenient dict.
- singular/plural aliases would be nice (e.g. `base_section: foo` in addition to `base_sections: ['foo']`)
- some of the schema terms could use friendlier aliases, e.g. use `sections` instead of `section_definitions`
- some types are serialized in a rather complicated way. Instead of needing to use `{"type_kind": "python", "type_data": "str"}`, `{"type_kind": "numpy", "type_data": "float64"}`, `{"type_kind": "reference", "type_data": "#/section_definitions/0"}`, we could guess the right "kind" from simple string values like `str`, `np.float64`, or `#/section_definitions/0`.
- references are based on the dict array structure, which is not ideal for the same reason as before. Something like `#/section_definitions/0` referring to the first section definition in the array of `section_definitions` could be replaced by something based on the section name. E.g., `#/Process` in case the first section of the package is named `Process`.
We are only considering the deserialize direction.
Example:
```yaml
m_def: 'nomad.metainfo.metainfo.Package'
sections:
  Sample:
    base_section: 'nomad.datamodel.metainfo.Sample'
    quantities:
      sample_id:
        type: str
        description: |
          This is a description with *markup* using [markdown](https://markdown.org).
          It can have multiple lines, because yaml allows to easily do this.
        m_annotations:
          eln:
            component: StringEditComponent
  Process:
    quantities:
      samples:
        type: '#/Sample'
        shape: ['*']
    sub_sections:
      samples:
        section_def: '#/Sample'
        repeats: true
  SpecialProcess:
    base_section: '#/Process'
    quantities:
      values:
        type: np.float64
        shape: [3, 3]
```
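The type shorthand from the list above could be expanded along these lines. This is only a sketch under assumptions: `normalize_type`, `PYTHON_TYPES` and the `section_names` argument are hypothetical and not part of the metainfo API; only the verbose `type_kind`/`type_data` form is taken from this issue.

```python
from typing import Dict, List

# Assumption: the set of python base types accepted as shorthand.
PYTHON_TYPES = {'str', 'int', 'float', 'bool'}


def normalize_type(value: str, section_names: List[str]) -> Dict[str, str]:
    """Expand a shorthand type string into the verbose type_kind/type_data form.

    `section_names` lists the package's section definitions in order, so that a
    name-based reference can be turned into an index-based one.
    """
    if value in PYTHON_TYPES:
        return {'type_kind': 'python', 'type_data': value}
    if value.startswith('np.'):
        return {'type_kind': 'numpy', 'type_data': value[len('np.'):]}
    if value.startswith('#/'):
        name = value[2:]
        if name in section_names:
            # name-based reference, e.g. '#/Process' -> '#/section_definitions/1'
            index = section_names.index(name)
            return {'type_kind': 'reference',
                    'type_data': f'#/section_definitions/{index}'}
        # otherwise assume the reference is already in the regular path form
        return {'type_kind': 'reference', 'type_data': value}
    raise ValueError(f'cannot guess the type kind of {value!r}')
```

Raising an error for anything that cannot be guessed matters here, because the format is meant to be written by end-users.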
How to approach this?
- [x] replicate the given example in Python
- [x] serialize this Python example to yaml and json with `m_to_dict()`
- [x] compare with the given yaml above to understand the points given in this issue
- [x] `MSection.m_from_dict` is used to deserialise. Implement Package, Section, Quantity, SubSection specific overwrites for `m_from_dict` that resolve the convenience notation into the regular form before calling the `super` implementation
- [x] Add good error handling. This format will be used by end-users.
- [x] Add extensive tests (also for the error handling)

Mohammad Nakhaee

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/488
Adapt encyclopedia normalizer to workflows
2021-02-11T06:11:59Z
Markus Scheidgen

The encyclopedia normalizer depends on framesequence and sampling method to determine a calculation type. @ladinesa started to remove the use of these old metainfo definitions. @himanel1 should adapt the encyclopedia normalizer. For now it should look for workflow information and use this. Only if no workflows are present, it should revert to the old behaviour. Once the migration to workflows is complete this should also be removed.
Currently (branch new-vasp-parser) this breaks the CI/CD because `tests/data/api/enc_private_material.zip::private/vasprun.xml` is parsed by the new vasp parser and no framesequence is generated anymore -> calc_type=unavailable -> no enc entry -> broken assertion in `tests/app/flask/test_api_encyclopedia.py::TestEncyclopedia::test_material`
Btw, the test case is terrible as it depends on multiple parsers, test data, etc. If one of these deps changes, it's really hard to figure out why this fails. Ideally this could be broken down into multiple test cases at some point (e.g. when implementing a new enc API based on flask api).

v0.10.0
Lauri Himanen

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/427
Add a licence user-metadata in preparation for OASIS etc.
2021-03-02T11:38:01Z
Markus Scheidgen

v0.10.0

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/438
Add categories for normalized metadata
2021-06-17T16:41:04Z
Lauri Himanen

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1555
Add deprecation chip
2023-12-21T16:06:51Z
Nathan Daelman

Considering our current workflow, where we first have to reparse new quantities before deprecating old ones, we should have a mechanism to communicate the upcoming updates to the user.
I suggest some kind of "_deprecated_" chip (like the _repeated_ annotation that we have in the Metainfo) that is shown both in Metainfo and the Entry Data.
I think a chip is much more visible than, for example, updating the definition.
Maybe it should also link to the quantity replacing it, in case there is one.

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/501
Add Pandas object (datatype) support to NOMAD
2022-07-22T13:57:02Z
Aviral Vaid

Currently, NOMAD supports python base variable types such as `int`, `float`, `str`, _etc._ and `numpy` arrays.
A lot of upcoming data is expected to be columnar, and `pandas` is one of the standard libraries that is used to treat such data. NOMAD is also using the `pint` python package to keep track of units in scientific data.
The `pint` library has an extension available for the `pandas` library called `pint_pandas`. It adds a parameter "unit" to `pandas Series`. Operations involving different columns also behave as expected, such as multiplication between two quantities and throwing an error if two quantities with unmatched units are being added or subtracted. The examples can be seen on the documentation page of [`pint_pandas`](https://pint.readthedocs.io/en/stable/pint-pandas.html).

Amir Golparvar

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/782
Add support for indexing metainfo Units
2022-04-01T12:45:51Z
Lauri Himanen

The ElasticSearch extension does not currently allow indexing `nomad.metainfo.metainfo.Unit`, which is basically a string when serialized. It should be added to the list of supported metainfo types.

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1037
Align metainfo with nexus even more
2023-01-27T15:43:06Z
Markus Scheidgen

We still have some nexus situations that are hard to map to nomad even with #837 done.
We try to press fields into quantities. But fields might contain so much additional data: the actual name, the actual unit used, multiple values, etc. It might be easiest to differentiate between a simple quantity and a complex quantity. Simple quantities can be serialised with just the value and the definition provides all context, complex quantities are serialized as objects with value and additional metadata.
Now we can put more functions on the definitions that will require additional metadata in the data, like `variable`, `attributes`, `dimensionality`, `repeats` (fields that can occur multiple times).
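Here is a Python sketch that shows the difference between simple and complex serialisation (`q` stands for a quantity with the properties named above; the names and values in the complex case are placeholders):
```python
def serialize(q):
    """Return the JSON form of quantity `q`: simple if possible, else complex."""
    is_complex = bool(
        q.variable or q.attributes or q.repeats
        or (q.dimensionality and not q.unit))
    if not is_complex:
        # simple: only the value, the definition provides all context
        return q.value
    # complex: one object per occurrence, with value and additional metadata
    return [
        {
            'm_source_name': 'realName1',
            'm_source_unit': 'A',
            'm_value': q.value,
            'attr_1': 'some value',
        },
        {
            'm_source_name': 'realName2',
            'm_source_unit': 'm',
            'm_value': q.value,
        },
    ]
```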
This should be backwards compatible. The archive browser needs to support the complex serialisation as well. It should be checked if this would somehow affect the archive API.
With both serializations at the same time, the serialization would stay lean if possible and be more verbose if necessary.
Other benefits: this would also make us more flexible in the future as more and more metadata can be added to quantities. The complex serialisation is more in line with how hdf5 works. We could get rid of the more hacky attribute, source_unit, source_name `@`-based syntax.

Theodore Chang

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/837
Align nexus schema definition language with metainfo
2022-09-06T11:11:39Z
Markus Scheidgen

Add stuff to `metainfo.py`:
- [ ] names (variable names, names that are not valid identifiers)
- [x] groups and fields with attributes
- [x] units on data
- X "dimension" instead of unit –– this is purely informative on the schema level for the metainfo and can be stored with "more"
- X required property –– this is purely informative on the schema level for the metainfo and can be stored with "more"
- [x] descriptions for enum values (#655)
Fix the rest:
- [ ] apply changes to gui
- [ ] apply changes to nxdl conversion
- [ ] apply changes to nexus parser

Markus Scheidgen

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/705
Allow custom metainfo definitions as part of the data
2022-05-13T08:26:44Z
Markus Scheidgen

Find lots of relevant information here: https://docs.google.com/presentation/d/1Q2OYB88rubQ-7abvWMlp2TsGLZv7zLSJdR0EZckcbOE/edit?usp=sharing
- [x] server-side implementation of cross entry references and definition references
- [x] Python client context
- [x] Javascript implementation
- [x] Adopted browser GUI
!the following is an older description:
Currently, all metainfo definitions have to be part of NOMAD python sources. They are used to generate the javascript version as well. But, we could also allow some metainfo definitions to be localized in an archive.
Actually, all metainfo definitions should be referenced by the archive anyways. This means that definitions need to be referenceable via more global URLs. We should have a generic mechanism that covers both built-in and on-the-fly definitions. This would be a natural extension for url references across archives, if you consider metainfos being archives themselves.
- requires URLs for metainfo definitions that can be used to proxy definitions
- of course these URLs need to be resolvable via some API
- urls are used to keep a registry of metainfo definitions. This can be made up of "predefined" definitions that are part of the installation, and definitions loaded on the fly.
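
The registry idea from the list above could be sketched as follows. Everything here is hypothetical (the class `DefinitionRegistry`, its methods, and the `loader` callback that stands in for an API call); it only illustrates keeping "predefined" definitions and definitions loaded on the fly behind the same URL-based lookup.

```python
from typing import Any, Callable, Dict

Definition = Dict[str, Any]


class DefinitionRegistry:
    """Maps definition URLs to definitions (a sketch, not the NOMAD API)."""

    def __init__(self, loader: Callable[[str], Dict[str, Definition]]):
        # Assumption: `loader` fetches all definitions of one package URL,
        # e.g. through some API, and returns them keyed by fragment.
        self._loader = loader
        self._packages: Dict[str, Dict[str, Definition]] = {}

    def register(self, package_url: str, definitions: Dict[str, Definition]) -> None:
        """Add predefined definitions that are part of the installation."""
        self._packages[package_url] = definitions

    def resolve(self, url: str) -> Definition:
        """Resolve '<package_url>#<fragment>'; unknown packages load on the fly."""
        package_url, _, fragment = url.partition('#')
        if package_url not in self._packages:
            self._packages[package_url] = self._loader(package_url)
        return self._packages[package_url][fragment]
```

Built-in packages would be registered once at start-up, while a package from an upload is only fetched the first time one of its definitions is referenced.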
### Metainfo in archive
An `EntryArchive` could contain a metainfo package as one of its sub-sections. This package could extend the regular
`EntryArchive` with more definitions.
```yml
m_def: 'https://nomad-lab.eu/prod/v1/api/v1/metainfo#nomad.datamodel/section_definitions/EntryArchive'
metainfo:
  section_definitions:
    - name: EntryArchive
      base_sections:
        - 'https://nomad-lab.eu/prod/v1/api/v1/metainfo#nomad.datamodel/section_definitions/EntryArchive'
      extends_base_section: true
      sub_sections:
        - name: custom_data
          section_def: '#/metainfo/section_definitions/CustomData'
    - name: CustomData
      quantities:
        - name: my_field
          type: str
custom_data:
  my_field: 'Some data'
```
- Has huge potential for name collisions in `EntryArchive`
- The package in each archive would need to be registered, e.g. with `http://mynomad.de/api/v1/entries/{}/archive#metainfo`
- kinda bad if lots of entries would contain the same definitions
### Metainfo in upload
If archives always have to reference the definitions of their root section, an archive could reference a section definition in the same upload. This section could inherit from the regular `EntryArchive`.
Archive:
```yml
m_def: '../upload/mainfile/mymetainfo.yml#/section_definitions/EntryArchive'
custom_data:
  my_field: 'Some data'
```
Metainfo:
```yml
m_def: 'https://nomad-lab.eu/prod/v1/api/v1/metainfo#nomad.metainfo/section_definitions/Package'
url: ../uploads/files/mymetainfo.yml
section_definitions:
  - name: EntryArchive
    base_sections:
      - 'https://nomad-lab.eu/prod/v1/api/v1/metainfo#nomad.datamodel/section_definitions/EntryArchive'
    sub_sections:
      - name: custom_data
        section_def: '#/metainfo/section_definitions/CustomData'
  - name: CustomData
    quantities:
      - name: my_field
        type: str
```
- does not have the disadvantages of metainfo in archive
- cannot be serialised into the archive automatically and the user must take care of providing the file and the right url reference. Maybe we could add fully relative urls: `../mymetainfo.yml#/section_definitions/EntryArchive`.

Markus Scheidgen

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/471
Allow to metainfo reference to quantities
2021-03-02T11:38:02Z
Markus Scheidgen

This is necessary for #470

Markus Scheidgen

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/552
Allow to modify sub-sections via python fields.
2021-06-18T07:59:03Z
Markus Scheidgen

We want to extend the metainfo to allow something like this:
```
parent.energy_total = Energy(value=1.0)
```
This would replace:
```
parent.m_add_sub_section(ParentClass.energy_total, Energy(value=1.0))
```
This would be possible, but it's quite some work to make it work with repeating sub-sections. Here you want something like:
```
run.section_single_configuration_calculation.append(SCC())
```
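For the repeating case, a list subclass that hooks into `append` could look like the following sketch. `SubSectionList`, `Section` and `Run` are stand-ins written for this illustration and are not the real metainfo classes; a complete version would have to cover the other modifying methods (`extend`, `insert`, item assignment, removal) in the same way.

```python
class SubSectionList(list):
    """List of sub-sections that also updates each appended item's parent."""

    def __init__(self, parent):
        super().__init__()
        self._parent = parent

    def append(self, section):
        # hook: modify the sub-section (parent link) and the parent (the list)
        section.m_parent = self._parent
        super().append(section)


class Section:
    """Stand-in for a metainfo section; not the real MSection."""

    def __init__(self):
        self.m_parent = None


class Run(Section):
    def __init__(self):
        super().__init__()
        # the repeating sub-section is exposed as a hooked list
        self.section_single_configuration_calculation = SubSectionList(self)


run = Run()
scc = Section()
run.section_single_configuration_calculation.append(scc)
```

After the call, `scc.m_parent` points to `run`, which is the part a plain python list could not take care of.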
Remember that we need to hook into the process and modify both parent and sub-section. To implement this, python properties are not enough, because for repeating sub-sections we need a special metainfo list class to hook into append (and other modifying methods). Possible, but quite some work.

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1680
Annotations as definition properties
2024-01-08T07:13:53Z
Hampus Naesstroem

Annotations are additional key-value pairs that can be added to any metainfo object. In practice however, they are predominantly used on definitions. Them not being definition properties seems to be confusing and creates more issues than it solves.
Topic was changed from the following, based on the discussion below:
### Adding ELN Annotation in Inheriting Section Fails
Trying to add an ELN annotation to an inheriting section using `m_copy()` fails (the NumberEditQuantity is not showing up for entries of type `B`):
```python
class A(EntryData):
    some_property = Quantity(type=float)


class B(A):
    some_property = A.some_property.m_copy()
    some_property.a_eln = ELNAnnotation(component=ELNComponentEnum.NumberEditQuantity)
```
This does work for specifying the type of a sub section.

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1515
API call does not return code-specific quantities
2023-05-26T15:03:49Z
Nathan Daelman

When running a query formulated as
```python
from nomad.client import ArchiveQuery

q = ArchiveQuery(query={"entry_id:any": ["sfabyz06es5UbQZb3d0IR_PShAO8"]}, required="*")
entry, = q.download()
for data in entry.run[0].method[0]:
    print(data)
```
All the standard quantities are returned, but code/application-specific quantities (e.g. those starting with `x_fhiaims_`) are not.
We should at least allow the option of returning these, perhaps even make it the default.
This issue was raised by @mkuban.

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/695
ArchiveBrowser does not show inherited properties
2021-12-15T09:06:39Z
Markus Scheidgen

Markus Scheidgen

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1845
Attribute inheritance
2024-02-29T11:46:44Z
Markus Scheidgen

The current metainfo does not support attribute inheritance. Attributes are primarily used in nexus. Here, the lack of attribute inheritance causes an excessive duplication of attribute definitions.
- [x] add inheritance to the metainfo python implementation
- [x] show attribute inheritance in the metainfo browser
- [x] adapt the generated nexus metainfo

Markus Scheidgen

https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1487
Bad geometry convergence extraction
2023-12-21T15:58:54Z
Nathan Daelman

I found [this entry](https://nomad-lab.eu/prod/v1/staging/gui/search/entries/entry/id/oDNOV2gZM8Ts9LCiUcKfmWuKk07l) (entry_id: oDNOV2gZM8Ts9LCiUcKfmWuKk07l, upload_id: CXorGZWOS3Khzk7QCIwtQA), which highlights several areas where our presentation of the convergence criteria in a geometry optimization should be improved.
The figure shows how the convergence criteria were apparently reached at step 7, yet the relaxation still continues.
In reality, the output shows that none of the 4 criteria were met at step 7.
This teaches us that:
- [ ] The extraction of the convergence parameters in `Crystal` has to be double-checked.
- [ ] The GUI could be more explicit about which stopping criterion is shown (namely `convergence_tolerance_energy_difference`).
@himanel1 @ Adrianna Wojas: how do you think that we could better convey the full range of criteria in the GUI?
Some codes stop once 1 is met, others require all to be met...
Should we for example have tabs following multiple properties (see above), each with 1 or 2 convergence criteria lines?
- [ ] Moreover, the wavefunction can also be used as a convergence criterion and should be added.

Nathan Daelman