Improved infrastructure for Materials-oriented data
Problem
The way we store data related to materials deserves some additional thought, as we are increasingly getting involved with data and functionality that revolves around materials and not just entries/calculations.
We already include several kinds of external material-related data: AFLOW prototype data, Springer data and, most recently, the similarity data. Unifying the way we ingest and store this kind of data is therefore becoming important. This is probably also relevant for the eventual implementation of NOMAD Portal, which will aggregate material data from different sources.
After implementing the encyclopedia API with the current ES index, it has become clear that a fast and flexible materials search cannot be built on a single index that only contains the data from entries. We will need a separate Elasticsearch index for materials. This would make all material-oriented queries (encyclopedia) MUCH easier to implement and more performant.
Metainfo
I'm suggesting the following actions:
-
Creating a new metainfo class called "Material". It will serve as our schema for all material-related information that is stored in ES/mongo/etc. This metainfo is completely separate from EntryArchive, so they do not share any parent metainfo. This approach allows us to model the data contents consistently and also to handle the ES/mongo storage of the data through annotations. Notice that this metainfo is separate from the material-related data that we store per entry. Initially only the similarity data will use this metainfo structure, but more can be added later.
-
We need to create a common CLI for populating any materials data from external static sources (similarity, AFLOW, Springer, etc.). These scripts would need to be called e.g. when setting up a new deployment. The actual data that is ingested should not be part of the git repository, but should instead be linked through nomad.yml configuration variables. This is e.g. how the Springer data is currently handled. To simplify things, only the similarity data will initially be handled with this mechanism and other data will be migrated later.
-
When we refactor the way we store material-related data we can use this Material root metainfo as a single source that keeps getting updated as entries related to the material are added/removed. It would also serve as our primary search index for material-related queries.
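The common CLI described above could look roughly like the following. This is only a sketch under assumptions: the program name `materials-ingest`, the `CONFIG` dictionary (standing in for values read from nomad.yml), the registry and the assumed JSON layout of the similarity file are all hypothetical, and writing to ES/mongo is left out.

```python
import argparse
import json
from typing import Callable, Dict, Iterable, List, Optional

# Stand-in for values that would come from nomad.yml. The key and path are
# hypothetical and only illustrate linking external data through configuration.
CONFIG: Dict[str, str] = {
    "similarity": "/data/external/similarity.json",
}

# Registry mapping a source name to a function that turns a file into
# material documents. Only similarity is registered initially.
INGESTORS: Dict[str, Callable[[str], Iterable[dict]]] = {}


def ingestor(name: str):
    """Decorator that registers an ingestion function under a source name."""
    def register(func):
        INGESTORS[name] = func
        return func
    return register


@ingestor("similarity")
def ingest_similarity(path: str) -> Iterable[dict]:
    # Assumed file layout: {"<material_id>": ["<similar_id>", ...], ...}
    with open(path) as f:
        data = json.load(f)
    for material_id, similar in data.items():
        yield {"material_id": material_id, "similarity": similar}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="materials-ingest")
    parser.add_argument("source", choices=sorted(INGESTORS))
    parser.add_argument("--path", help="override the configured data path")
    return parser


def main(argv: Optional[List[str]] = None) -> List[dict]:
    args = build_parser().parse_args(argv)
    # Fall back to the configured location when no explicit path is given.
    path = args.path or CONFIG[args.source]
    docs = list(INGESTORS[args.source](path))
    # A real implementation would store these documents in ES/mongo here.
    return docs
```

Calling `main(["similarity"])` would read the configured file. Migrating AFLOW or Springer data later would then only mean registering another function under a new source name.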
Elasticsearch
On a technical level, we need to somehow manage calculation/material relationships with ES. This is a fairly complex relationship, as there can be materials with zero to many calculations (zero in the future with the NOMAD Portal), and calculations with zero or one material. A typical child/parent relationship is thus not directly applicable. This is further complicated by the fact that some calculations are private or embargoed, meaning that they should only be visible in authenticated queries. There are, however, several ways to do this:
-
Denormalization: This is what we are doing currently. Each calculation carries information about its material.
PROS:
- This way we get by with one flat index. The information about calculations is thus automatically in sync with the information shown for materials: deleted, embargoed and private calculations are automatically handled correctly for the material queries.
CONS:
- This increases the dataset for material queries 1000-fold, but this is not a problem for most queries. The main problem is that some more complex queries need to use deep aggregations to access information across the calculations. These operations are becoming prohibitively expensive.
-
Transforms: This is a mechanism that caches pre-defined aggregation results into a separate index for faster retrieval. The caching is typically done periodically.
PROS:
- This would allow us to mainly work with a flat index, and let ES do the bulk of the work for us. Thus the information about calculations is automatically (eventually, due to the refresh interval) in sync with the information shown for materials: deleted, embargoed and private calculations are automatically correctly handled for the material queries.
CONS:
- The downside is that we have to work in the aggregation framework to define the cached results (this becomes fairly difficult), that there is no possibility to have materials with zero calculations (NOMAD Portal), and that we cannot manually affect the materials index layout or entries (e.g. inserting external information about similarities or prototypes). Also, there will be a delay between successful processing and the calculation showing up in the index.
-
Completely separate index: A new ES index with a custom data layout for material data that also duplicates a selected part of the calculation data. Basically denormalization once again, but this time the other way around: calculation data denormalized into a materials index. Syncing between material data and newly added calculations is done by automatically running ingestion pipelines when a new calculation entry is added. In order to handle calculation deletion and data visibility (embargoed, private), a smart data layout is needed together with processes that automatically update this index together with material data. Perhaps storing calculations as inner/nested objects of a material would be a good option?
PROS:
- Maximum performance. Material queries will have the same time-complexity as queries for individual calculations, which is the best that ES can offer. All types of relationships are handled, adding external data is not a problem.
CONS:
- We have to be careful to set up the syncing of data correctly. Handling calculation visibility in particular will become quite hard. Proper testing of the syncing is paramount.
- As the calculation data that is associated with a material is duplicated, the material index will become quite large. The order of magnitude will be similar to the calculations index.
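To make the separate-index option more concrete, here is a minimal sketch of what a nested layout and the accompanying sync steps might look like. All field names (`calculations`, `published`, `with_embargo`, `owners`, etc.) are assumptions for illustration, not an agreed schema. The mapping and query are plain dictionaries in the Elasticsearch query DSL, and the add/remove helpers only operate on an in-memory document to show the intended syncing logic.

```python
from typing import Any, Dict, List, Optional

# Hypothetical mapping for a separate materials index. "nested" is the
# Elasticsearch field type for arrays of objects that are queried
# independently of each other; every field name here is an assumption.
MATERIAL_MAPPING: Dict[str, Any] = {
    "mappings": {
        "properties": {
            "material_id": {"type": "keyword"},
            "formula": {"type": "keyword"},
            "calculations": {
                "type": "nested",
                "properties": {
                    "calc_id": {"type": "keyword"},
                    "published": {"type": "boolean"},
                    "with_embargo": {"type": "boolean"},
                    "owners": {"type": "keyword"},
                },
            },
        }
    }
}


def visible_materials_query(user_id: Optional[str] = None) -> Dict[str, Any]:
    """Match materials with at least one calculation visible to the user."""
    public = {
        "bool": {
            "filter": [
                {"term": {"calculations.published": True}},
                {"term": {"calculations.with_embargo": False}},
            ]
        }
    }
    should: List[Dict[str, Any]] = [public]
    if user_id is not None:
        # Authenticated users additionally see calculations they own.
        should.append({"term": {"calculations.owners": user_id}})
    return {
        "query": {
            "nested": {
                "path": "calculations",
                "query": {"bool": {"should": should, "minimum_should_match": 1}},
            }
        }
    }


def remove_calculation(material: Dict[str, Any], calc_id: str) -> None:
    """Drop a calculation; the material may remain with zero calculations."""
    material["calculations"] = [
        c for c in material["calculations"] if c["calc_id"] != calc_id
    ]


def add_calculation(material: Dict[str, Any], calc: Dict[str, Any]) -> None:
    """Insert or replace a calculation inside a material document."""
    remove_calculation(material, calc["calc_id"])
    material["calculations"].append(calc)
```

Note that a nested query like this one only matches materials that have at least one matching calculation, so materials with zero calculations (the NOMAD Portal case) would need an additional clause. Re-running `add_calculation` when an embargo is lifted or an entry is published is one possible way to keep visibility in sync.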