3-tiered (archive) storage
Reading many calculations from an upload's msg-pack archive file is very time consuming. The optimisation potential is limited due to the unforeseeable nature of archive request patterns. While "what" is requested only varies among a few popular sections, those sections are spread over the archive file. We either read the whole file or do a lot of little reads; there is no good middle ground. We could try a better layout in the file, but that would destroy the simplicity of just using the structure defined by the metainfo.
Therefore, we leave the archive msg-pack files as they are and put another store in front of them that contains only the popular data, in a different layout. In a sense we already have Elasticsearch as another datastore, and we will now add a new MongoDB collection, archive, in between. Each tier will contain some of the data of the tier below. The lowest tier (msg-pack) is the source of truth, with the exception of user-metadata, where the mongo collection calc is the source of truth.
The archive.py module will hide this 3-tiered storage system behind the existing query mechanism upload_id x [calc_id] x schema -> json/dict. Instead of always reading everything from files, we now analyse the schema, based on annotations in the metainfo, to determine what information can be read from which datasource.
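As a rough illustration of that schema analysis, the sketch below groups the top-level sections of a requested schema by the tier that could serve them. Everything in it is an assumption made for the example: the function plan_reads, the table TIER_OF_SECTION, the tier labels, the section names, and the "*" wildcard are hypothetical and not the actual archive.py API. In the design described here, the table would be derived from metainfo annotations rather than written by hand.

```python
from typing import Any, Dict, List

# Hypothetical lookup: which tier can serve which top-level section.
# In the described design this would come from metainfo annotations.
TIER_OF_SECTION: Dict[str, str] = {
    "section_metadata": "mongo:calc",
    "section_workflow": "mongo:archive",
}

# Anything not annotated falls back to the lowest tier.
SOURCE_OF_TRUTH = "msg-pack"


def plan_reads(schema: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group the requested top-level sections by the tier that can serve them."""
    plan: Dict[str, List[str]] = {}
    for section in schema:
        tier = TIER_OF_SECTION.get(section, SOURCE_OF_TRUTH)
        plan.setdefault(tier, []).append(section)
    return plan


# Illustrative request; "*" stands for "everything below this section".
schema = {
    "section_metadata": "*",
    "section_workflow": "*",
    "section_run": {"section_system": "*"},
}
plan = plan_reads(schema)
```

With such a plan, only the sections that end up under msg-pack would require opening the archive file. A real implementation would also have to descend into nested sections instead of stopping at the top level.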
In addition, the new implementation will (partially or conditionally) "resolve" references: the referenced sections are also added to the output. Their placement in lists will be honoured by filling the lists with null values for absent sections. This way the client will be able to resolve references normally.
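The null padding can be sketched as follows. This is a minimal example under stated assumptions, not the actual implementation: pad_sparse_list and resolve are hypothetical helpers, and the slash-separated reference path is assumed here only for illustration. The point is that a referenced section keeps its original list index, so an index-based reference still resolves on the client.

```python
import json
from typing import Any, Dict, List, Optional


def pad_sparse_list(present: Dict[int, Any]) -> List[Optional[Any]]:
    """Place each available section at its original index, None elsewhere."""
    if not present:
        return []
    length = max(present) + 1
    return [present.get(i) for i in range(length)]


def resolve(root: Any, path: str) -> Any:
    """Follow a slash-separated path (assumed format) through dicts and lists."""
    node = root
    for part in path.strip("/").split("/"):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Only the section at index 2 was referenced and therefore included.
archive = {"systems": pad_sparse_list({2: {"name": "final"}})}

# None is serialised as JSON null, so positions survive the round trip.
payload = json.dumps(archive)
target = resolve(json.loads(payload), "/systems/2")
```

The padded list stops at the highest included index. Trailing absent sections do not need placeholders, because no included reference points past that index.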
metainfo annotations and categories
We use annotations (and categories?) to control what goes into which tier. For Elasticsearch, mongo:calc (MongoMetadata), and user-metadata (EditableUserMetadata) we already have something like this. To begin with, we will try to go by a category called FrequentlyAccessed and assign it to