# nomad-FAIR issues
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues

## #614 Replace "domains" from parsers and normalisers with a more flexible system
Markus Scheidgen, updated 2022-05-13
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/614

We mostly removed domains (#591). They are now only used to assign normalizers to archives produced by certain parsers.
We need a more flexible normaliser system that is based on sections in the archive and not the archive as a whole.
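A section-based system could dispatch normalizers on which sections are actually present in an archive, instead of on the parser that produced it. A minimal sketch of the idea, with all names (`NormalizerRegistry`, the `'run'` section, the decorator API) being hypothetical illustrations rather than the actual NOMAD interfaces:

```python
# Hypothetical sketch: normalizers declare which archive section they operate
# on, instead of being tied to the parser/domain that produced the archive.
from collections import defaultdict


class NormalizerRegistry:
    def __init__(self):
        self._by_section = defaultdict(list)

    def register(self, section_name):
        # Decorator that associates a normalizer function with a section name.
        def decorator(func):
            self._by_section[section_name].append(func)
            return func
        return decorator

    def normalize(self, archive):
        # Run every normalizer whose section is actually present in the
        # archive (here modeled as a plain dict).
        for section_name, normalizers in self._by_section.items():
            section = archive.get(section_name)
            if section is None:
                continue
            for normalizer in normalizers:
                normalizer(section)


registry = NormalizerRegistry()


@registry.register('run')
def normalize_run(section):
    section['normalized'] = True


archive = {'run': {}, 'workflow': {}}
registry.normalize(archive)
print(archive['run'])  # {'normalized': True}; 'workflow' had no normalizer registered
```

The point of the sketch is that adding a new normalizer is a matter of registering it for a section; no parser needs to know about it.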
## #604 Optimize parser matching
Markus Scheidgen, updated 2023-12-21
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/604

Parser matching becomes increasingly expensive as more and more parsers are added. For an upload with a very large number of files (>100k), parser matching takes more time than the processing timeout allows.

We need to optimise parser matching:
- consolidate the regex patterns into a single regex: one for paths, one for contents
- read the file context (the first few kB) for regex matching in parallel
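The consolidation idea can be illustrated with named groups: combine every parser's mainfile pattern into one alternation, so a single scan identifies the matching parser. The parser names and patterns below are made up for illustration and are not the actual NOMAD matching rules:

```python
# Illustrative sketch: instead of trying every parser's mainfile regex
# separately, combine them into one alternation with named groups. The name
# of the matched group then identifies the parser.
import re

parser_path_patterns = {
    'vasp': r'.*OUTCAR.*',
    'exciting': r'.*INFO\.OUT',
}

combined = re.compile(
    '|'.join(f'(?P<{name}>{pattern})' for name, pattern in parser_path_patterns.items())
)

def match_parser(path: str):
    m = combined.fullmatch(path)
    if m is None:
        return None
    # Match.lastgroup is the name of the matched alternative, i.e. the parser.
    return m.lastgroup

print(match_parser('upload/OUTCAR'))     # vasp
print(match_parser('upload/INFO.OUT'))   # exciting
print(match_parser('upload/README.md'))  # None
```

With one compiled regex, matching 100k paths is a single pass per file instead of one pass per file per parser.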
## #584 Reprocessing to test metainfo refactor
Markus Scheidgen, updated 2021-11-16, milestone v1.0.0-beta
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/584

We need to start some larger reprocessing runs to test after the metainfo refactors. Test that:
- the parsers are working
- the `results` sections and the elastic index are working
- the results match the old metadata

This concludes #419 #558 #557 #551 #540

## #581 CLI command for downloading from aflow needs to be updated
David Sikter, updated 2023-12-21
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/581

The CLI command **synchdb** defined in **nomad.cli.client.update_database** needs to be updated to run in nomad v>=1.0.0, if for no other reason than that it sets metadata during publish, which should no longer be supported.

## #572 Remove config.fs.tmp and use a fix subdir of config.fs.staging instead
Markus Scheidgen, updated 2021-12-22
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/572

With the new incremental upload, we extract files to `config.fs.tmp` and then potentially move them to `config.fs.staging`. Depending on the deployment, those directories might be on different volumes/mounts, in which case the move would actually be a copy+delete. This can also cause issues, e.g. files cannot be moved across volumes in k8s/docker (`Invalid cross-device link`, `mv: setting attribute 'security.selinux' for 'security.selinux': Permission denied`, `mv: listing attributes of 'staging/test': Cannot allocate memory`).
Anyhow, `config.fs.tmp` is used exclusively for staging purposes. Instead of having it configurable as a separate directory, we could use a fixed `tmp` sub-directory of `config.fs.staging` and remove `config.fs.tmp`. In the meantime, we configure `config.fs.tmp` to be a sub-directory of `config.fs.staging` on our clusters.

Furthermore, there is a `config.fs.local_tmp` parameter that is not used at all.

We should get rid of both `config.fs.local_tmp` and `config.fs.tmp`.
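The proposed layout can be sketched in a few lines: extracting into a `tmp` sub-directory of the staging filesystem guarantees the final move is an atomic same-filesystem rename. The helper name and directory layout are hypothetical, not the actual NOMAD code:

```python
# Sketch of the proposal: keep the extraction tmp directory as a fixed
# sub-directory of staging, so the final move never crosses devices
# (no EXDEV / 'Invalid cross-device link') and os.rename stays atomic.
import os
import tempfile


def extract_then_move(staging_dir: str, final_name: str) -> str:
    # fixed 'tmp' sub-dir of staging replaces the separate config.fs.tmp
    tmp_root = os.path.join(staging_dir, 'tmp')
    os.makedirs(tmp_root, exist_ok=True)

    # extract the upload into tmp on the *same* volume as staging ...
    tmp_path = tempfile.mkdtemp(dir=tmp_root)

    # ... so this rename is a cheap metadata operation, not a copy+delete
    final_path = os.path.join(staging_dir, final_name)
    os.rename(tmp_path, final_path)
    return final_path
```

Because `tmp_root` lives under `staging_dir`, the two paths are on one filesystem by construction, regardless of how the deployment mounts its volumes.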
## #463 Switch from Sphinx to mkdocs and mkdocstrings
Markus Scheidgen, updated 2021-12-22
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/463

Sphinx has a lot of downsides that can be solved with a more modern documentation approach:
- no mixing of markdown and restructuredtext
- inconsistent references
- verbose markup
- weak type hint support
- `auto...` commands that no one understands
- errors during creation that no one understands

In addition, mkdocs material would give us a consistent theme.
## #460 Refactor (the use of) config.api_url
Markus Scheidgen, updated 2021-01-11, assignee Markus Scheidgen
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/460

This function has several issues:
- we sometimes need the GUI url
- we have multiple APIs
- we currently use the ssl parameter to distinguish internal/external use, but even external use might be without ssl (e.g. on an OASIS)
## #405 Improved infrastructure for Materials-oriented data
Lauri Himanen, updated 2020-11-03, assignee Lauri Himanen
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/405

**Problem**

The way we store data related to materials deserves some additional thought, as we are increasingly getting involved with data and functionality that revolves around materials and not just entries/calculations.

We already include several kinds of external material-related data: AFLOW prototype data, Springer data and, most recently, the similarity data. So unifying the way we ingest and store this kind of data is becoming important. This is probably also relevant for the eventual implementation of the NOMAD Portal, which will aggregate material data from different sources.

After implementing the encyclopedia API with the current ES index, it has become clear that a fast and flexible materials search cannot be implemented with only one index containing the data from entries. We will need a separate elasticsearch index for materials. This would make all material-oriented queries (encyclopedia) MUCH easier to implement and more performant.

**Metainfo**

I'm suggesting the following actions:

- Create a new metainfo class called "Material". It will serve as our schema for all materials-related information that is stored in ES/mongo/etc. This metainfo is completely separated from EntryArchive, so they do not share any parent metainfo. This approach allows us to consistently model the data contents and also handle the ES/mongo storage of the data through annotations. Notice that this metainfo is *separate* from the material-related data that we store per entry. Initially only the similarity data will use this metainfo structure, but more can be added later.
- Create a common CLI interface for populating any materials data from external static sources (similarity, AFLOW, Springer, etc.). These scripts would need to be called e.g. when setting up a new deployment. The actual data that is ingested should not be part of the git repository, but should instead be linked through nomad.yml configuration variables. This is e.g. how the Springer data is currently handled. To simplify things, only the similarity data will initially be handled with this mechanism; other data will be migrated later.
- When we refactor the way we store material-related data, we can use this Material root metainfo as a single source that keeps getting updated as entries related to the material are added/removed. It would also serve as our primary search index for material-related queries.

**ElasticSearch**

On a technical level, we need to somehow manage calculation/material relationships with ES. This is a fairly complex relationship, as there can be materials with zero to many calculations (zero in the future with the NOMAD Portal), and calculations with zero or one material. A typical child/parent relationship is thus not directly applicable. This is further complicated by the fact that some calculations are private or embargoed, meaning that they should only be visible in authenticated queries. There are, however, several ways to do this:

1. **Denormalization**: This is what we are doing currently. Each calculation carries information about its material.
   - PROS: We get along with one flat index. The information about calculations is automatically in sync with the information shown for materials: deleted, embargoed and private calculations are automatically handled correctly for material queries.
   - CONS: This increases the dataset for material queries 1000-fold, which is not a problem for most queries. The main problem is that some more complex queries need deep aggregations to access information across the calculations, and these operations are becoming prohibitively expensive.
2. [**Transforms**](https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-overview.html): A mechanism that caches pre-defined aggregation results into a separate index for faster retrieval. The caching is typically done periodically.
   - PROS: This would allow us to mainly work with a flat index and let ES do the bulk of the work for us. The information about calculations is automatically (eventually, due to the refresh interval) in sync with the information shown for materials: deleted, embargoed and private calculations are automatically handled correctly for material queries.
   - CONS: We have to work within the aggregation framework to define the cached results (which becomes fairly difficult), there is no possibility to have materials with zero calculations (NOMAD Portal), and we cannot manually affect the materials index layout or its entries (e.g. inserting external information about similarities or prototypes). Also, there will be a delay between successful processing and the calculation showing up in the index.
3. **Completely separate index**: A new ES index with a custom data layout for material data that also duplicates a selected part of the calculation data. Basically denormalization once again, but this time the other way around: calculation data denormalized into a materials index. Syncing between material data and newly added calculations is done by automatically running ingestion pipelines when a new calculation entry is added. In order to handle calculation deletion and data visibility (embargoed, private), a smart data layout is needed, together with processes that automatically update this index along with the material data. Perhaps storing calculations as inner/nested objects of a material would be a good option?
   - PROS: Maximum performance. Material queries will have the same time-complexity as queries for individual calculations, which is the best that ES can offer. All types of relationships are handled, and adding external data is not a problem.
   - CONS: We have to be careful in setting up the syncing of the data correctly; especially handling calculation visibility will become quite hard, and proper testing of the syncing is paramount. Also, as the calculation data associated with a material is duplicated, the material index will become quite large: the order of magnitude will be similar to the calculations index.
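Option 3's "calculations as nested objects" idea can be sketched as an index mapping plus a nested query. All field names below are illustrative placeholders, not the actual NOMAD schema:

```python
# Hypothetical mapping sketch for option 3: a dedicated materials index that
# denormalizes a selected part of each calculation as nested objects. A
# material can then have zero..many calculations, and visibility flags can
# be filtered per calculation inside a nested query.
materials_mapping = {
    'mappings': {
        'properties': {
            'material_id': {'type': 'keyword'},
            'formula': {'type': 'keyword'},
            # externally ingested data (e.g. similarities), stored but not indexed
            'similarity': {'type': 'object', 'enabled': False},
            'calculations': {
                # 'nested' keeps each calculation's fields together, so the
                # published/embargo flags of one calculation cannot be mixed
                # with the calc_id of another during filtering
                'type': 'nested',
                'properties': {
                    'calc_id': {'type': 'keyword'},
                    'published': {'type': 'boolean'},
                    'with_embargo': {'type': 'boolean'},
                },
            },
        }
    }
}

# A query for materials with at least one publicly visible calculation:
public_materials_query = {
    'query': {
        'nested': {
            'path': 'calculations',
            'query': {
                'bool': {
                    'must': [
                        {'term': {'calculations.published': True}},
                        {'term': {'calculations.with_embargo': False}},
                    ]
                }
            },
        }
    }
}
```

This is exactly the trade-off named in the CONS: the nested documents must be kept in sync by reindexing the material whenever one of its calculations is added, deleted, or changes visibility.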
## #308 Band structure: normalizer and metainfo update
Lauri Himanen, updated 2023-01-10
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/308

The band structure processing should be refactored. This would consist of the following steps:

* [ ] Update the band structure metainfo so that the duplicate normalized values are removed (`section_k_band` vs `section_k_band_normalized`) and add new metainfo for the reciprocal cell and band gaps. The metainfo for `k_band_path_normalized_is_standard` should be renamed and moved to `section_k_band`. The whole section could also be renamed, as `section_k_band` to me implies a single band, whereas in reality it contains all the bands. I would suggest "electronic_band_structure" (I think phonon band structures should be put under a "phonon_band_structure" instead of using a flag to separate between the different kinds).
* [ ] The shape of the energy values is currently [number_of_spin_channels, number_of_k_points_per_segment, number_of_band_segment_eigenvalues]. This makes sense from a parser perspective, as the output is typically stratified over k-points. However, when the band structure is analyzed or visualized, it makes more sense to store it in the shape [number_of_spin_channels, number_of_band_segment_eigenvalues, number_of_k_points_per_segment]. This way the bands are stored as a contiguous block of memory and can easily and efficiently be looped over for visualization or band gap analysis. On the parser side, this only requires swapping axes 1 and 2 of the numpy array before storing the values.
* [ ] Create a BandStructureNormalizer that will:
  * [x] Calculate band gaps
  * [ ] Add labels for high-symmetry points according to the Setyawan/Curtarolo standard. There is a partial implementation in the VASP parser; it, however, supports only a subset of Bravais lattices for some reason.
  * [ ] Check if the path follows the Setyawan/Curtarolo standard.
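The axis re-ordering described in the second step is a single `swapaxes` call on the parser side (the array sizes below are made up for illustration):

```python
# Re-order band structure energies from [spin, kpoints, bands] (parser
# output order) to [spin, bands, kpoints], so each band is a contiguous
# row that can be looped over efficiently for plotting or gap analysis.
import numpy as np

n_spin, n_kpoints, n_bands = 2, 100, 20
energies = np.zeros((n_spin, n_kpoints, n_bands))

# swapaxes returns a view; ascontiguousarray materializes the new layout
reordered = np.ascontiguousarray(energies.swapaxes(1, 2))
print(reordered.shape)  # (2, 20, 100)
```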
## #307 Parser re-compile "submatchers" all the time
Markus Scheidgen, updated 2023-12-21
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/307

Many of the legacy NOMAD CoE parsers use `SimpleMatcher`s (SM). In order to use a parse tree of SMs, the tree has to be "compiled". This takes quite a while and should only be done once for each parse tree. Unfortunately, most parsers do not allow that: the SM tree is built and compiled for each parser run.

I managed to add a cache to the compile function, so that each SM tree is only compiled once. While some parsers only create the SM tree once, some parsers don't. In principle this should be avoidable, but the code structure does not allow it.

Examples of such parsers are:
- quantum espresso
- crystal
- cp2k
- cpmd

Besides this, the parsers are not optimised for reuse at all. While the legacy/nomadcore modules suggest reusability in some places (e.g. parser vs. context, interface vs. parsers), it is not thought through, and lots of initialisation is done again, again, and again.

Tasks:
- replace simple_parser.mainFunction and baseclasses.ParserInterface with a unified interface that really promotes parser reuse
- rewrite parsers, one by one, to use this interface
- clean up the parser code in the process: pep8, dead code, unnecessary imports
- test
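The compile cache described above could look roughly like the following. This is a simplified sketch, not the actual nomadcore code; `compile_tree` and the identity-based cache key are illustrative:

```python
# Sketch of caching compiled SimpleMatcher trees: compile a tree only once,
# keyed by the root object's identity. This only helps if the parser builds
# its SM tree once (e.g. at module level) and reuses it; parsers that
# rebuild the tree on every run defeat the cache, which is exactly the
# problem described above.
_compile_cache = {}


def _do_expensive_compile(root_matcher):
    # stand-in for the real (slow) tree compilation step
    return ('compiled', root_matcher)


def compile_tree(root_matcher):
    key = id(root_matcher)
    if key not in _compile_cache:
        _compile_cache[key] = _do_expensive_compile(root_matcher)
    return _compile_cache[key]


tree = object()  # stand-in for a module-level SimpleMatcher tree
assert compile_tree(tree) is compile_tree(tree)  # compiled only once
```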
## #263 Optimized use of Elasticsearch
Markus Scheidgen, updated 2020-02-21
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/263

In all our dealings with elasticsearch we request the whole source document, which is then also used in the API/GUI communication, regardless of whether we need all the information or not.

- [x] only transfer calc_id/upload_id when scanning to stream download data
- [x] do not transfer quantities to clients if not explicitly requested

## #229 Instable uploading GUI for new uploads
Markus Scheidgen, updated 2020-06-04, assignee Markus Scheidgen
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/229

Uploads provided through the UI use a fake upload object, because the upload does not yet exist on the server. This might lead to unexpected bugs and makes handling in the respective components hard. If possible, there should be a different solution for displaying uploading files.
## #47 Parsing without extraction
Markus Scheidgen, updated 2023-12-21
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/47

We should explore the possibility to parse files directly from .zip files. This probably means that existing parsers need to be checked for external file handling in the case of:
- multiple raw files per calculation
- external references between calculations (e.g. exciting)
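Reading a mainfile straight out of the archive is straightforward with the standard library; the helper below is a hypothetical sketch, not existing NOMAD API. It covers single-file parsers; the two cases above (multiple raw files per calculation, cross-references between calculations) additionally need access to sibling zip members:

```python
# Sketch of parsing without extraction: ZipFile.open yields a readable
# stream for a member without unpacking the archive to disk.
import io
import zipfile


def open_mainfile(upload_zip: str, mainfile: str):
    zf = zipfile.ZipFile(upload_zip)
    # wrap the binary member stream as text for line-oriented parsers
    return io.TextIOWrapper(zf.open(mainfile), encoding='utf-8')
```

A parser that only reads its mainfile sequentially could consume such a stream unchanged; parsers that seek around or open neighbouring files would need further adaptation.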
## #35 Mime type based parsing, decompression
Markus Scheidgen, updated 2019-08-09
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/35

- [x] mime based matching #99
- [x] parser matching with compression support
- [ ] integration with parsers
## #1710 Refactor normalizing workflow tests
Alvin Noe Ladines, updated 2023-12-21, assignee Alvin Noe Ladines
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1710

Some tests under normalizing/test_workflow.py should technically be in datamodel/metainfo/test_workflow.py. It is understandable that the two are confused, but I personally think that all tests regarding the metainfo defs, including the implementation of normalize, should be under the latter. Only normalizations done in nomad/normalizing/workflow.py should be in the other.

@pizarroj

## #1637 Some Tabular parser features to revise or add
Andrea Albino, updated 2023-10-05, assignee Amir Golparvar
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1637

While @jschumann and I were independently testing the parser, we noticed some features to fix or to add:

- the following file has a repeated sub_sub_section along the row (the columns "Element" and "fraction"). If one has only two blocks of this sub_sub_section, the parsing is fine. If one adds a third block of columns, some errors are raised.
- the filename of generated entries is currently the name of the m_def class plus a sequential number; I think it should be picked from a quantity of the m_def class.
- empty cells in an excel file should not raise errors, just leave the quantity unfilled
- after loading and parsing one of these schemas and excel files, if I hit the reprocess button, some entries change their process status to failure, and opening them displays the error "Something went wrong in this part of the app (Javascript error)."
- these generated entries cannot be individually deleted until I save them a second time and they are actually written into physical files

To test the issue above, create an entry choosing the "CatalystCollection" class and drop the excel file inside it:
[entry_schema_basi.archive.yaml](/uploads/eee887b85facc5f272f06a30f7ff4944/entry_schema_basi.archive.yaml)
[test_upload_w_3_elements.xlsx](/uploads/5ce89535e09f8934f4ba61e6b1cfc7ea/test_upload_w_3_elements.xlsx)

@amgo is following us in testing.

## #1603 Replace the celery-based example data usage with the ExampleData
Ahmed Ilyas, updated 2023-11-06, assignee Ahmed Ilyas
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1603

Related to #1580

## #1497 Store RFC timestamp seed in mongodb
Markus Scheidgen, updated 2023-05-23, assignee Theodore Chang
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1497

The source of truth in NOMAD are the raw files, and everything that is not in a raw file gets its truth from mongodb: comments, datasets, references (i.e. urls). Storing the RFC timestamps only in the archive does not fit this policy, and loading them from the old archive upon reprocessing is very ugly. Also, these timestamps might become very important and should be handled with proper care.
We should store the RFC timestamp in mongodb, similar to `comment`, `references`, `datasets`, and `entry_coauthors`. It does not necessarily need to be the full timestamp section with all its properties, but some string from which the actual timestamp model in the archive can be re-created. The archive will still contain the timestamp, and from there it is visible in the UI, but upon reprocessing it will be read from mongo and not from an old archive version.

## #1484 Rewrite msgpack reader/writer
Theodore Chang, updated 2023-06-23, assignee Theodore Chang
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1484

The existing msgpack reader ignores fields with primitive values.

## #1233 Optimization for the GUI test
Mohammad Nakhaee, updated 2023-05-16, assignee Mohammad Nakhaee
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1233