3-tiered (archive) storage
motivation
Reading many calculations from an upload's msg-pack archive file is very time consuming. The optimisation potential is limited by the unforeseeable nature of archive request patterns. While what is requested varies only among a few popular sections, those sections are spread across the archive file. We either read the whole file or do many small reads; there is no good middle ground. We could try a better layout within the file, but that would destroy the simplicity of just using the structure defined by the metainfo.
tiered architecture
Therefore, we leave the archive msg-pack files as they are and put another storage in front of them that contains only the popular data, in a different layout. In a sense, we already have Elasticsearch as another datastore, and we will now add a new MongoDB collection, archive, in between. Each tier contains some of the data of the tier below. The lowest tier (msg-pack) is the source of truth, with the exception of user-metadata, for which the mongo collection calc is the source of truth.
| data | es | mongo | msg |
|---|---|---|---|
| ids | x | x | x |
| metadata | partial | x | x |
| user-metadata | x | - | - |
| workflows/enc | partial | x | x |
| run | - | - | x |
The archive.py module will hide this 3-tiered storage system behind the existing query mechanism upload_id x [calc_id] x schema -> json/dict. Instead of always reading everything from files, we now analyse the schema, based on annotations in the metainfo, to determine which information can be read from which datasource.
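The schema analysis could look roughly like the following minimal sketch. It assumes the requested schema is a dict keyed by top-level section name, and it uses a hand-written coverage map in place of the metainfo annotations. The names TIER_SECTIONS, choose_tier and section_run are hypothetical, and the es tier is left out because its coverage is only partial.

```python
# Hypothetical coverage map: which top-level sections each tier can serve.
# Tiers are ordered from fastest to slowest; in the real system this
# information would come from metainfo annotations, not a literal.
TIER_SECTIONS = [
    ('mongo', {'section_metadata', 'section_workflow'}),
    ('msg', {'section_metadata', 'section_workflow', 'section_run'}),
]


def choose_tier(required: dict) -> str:
    """Return the first (fastest) tier that covers all requested sections."""
    requested = set(required)  # top-level keys of the requested schema
    for tier, sections in TIER_SECTIONS:
        if requested <= sections:
            return tier
    raise KeyError(f'no tier serves all of {sorted(requested)}')
```

With this map, a request for section_metadata alone is served from mongo, while any request that touches section_run falls through to the msg-pack files.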
references
In addition, the new implementation will (partially or conditionally) "resolve" references: the referenced sections are also added to the output. Their placement in lists will be honoured by filling the lists with null values for absent sections. This way, the client will be able to resolve references normally.
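The null padding can be illustrated with a small sketch. The reference format ('/systems/2'), the key names, and the helpers place_at and resolve are assumptions made for the example, not the actual implementation.

```python
def place_at(target: list, index: int, section: dict) -> list:
    """Put section at index, padding with None so the index stays valid."""
    if index >= len(target):
        target.extend([None] * (index + 1 - len(target)))
    target[index] = section
    return target


def resolve(root: dict, path: str):
    """Follow a reference such as '/systems/2' through dicts and lists."""
    node = root
    for part in path.strip('/').split('/'):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# The output only contains the one referenced section; the two sections
# before it are absent and therefore represented by None.
output = {'workflow': {'final_system': '/systems/2'}, 'systems': []}
place_at(output['systems'], 2, {'label': 'relaxed'})
```

Because the referenced section keeps its original list index, a client can follow the reference in the partial output exactly as it would in the full archive.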
metainfo annotations and categories
We use annotations (and categories?) to control what goes into which tier. For Elasticsearch, mongo:calc (MongoMetadata), and user-metadata (EditableUserMetadata) we already have something like this. To begin with, we will go by a category called FrequentlyAccessed and assign it to EntryMetadata/section_metadata and Workflow/section_workflow.
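On the write side, the category could drive which parts of a processed archive are copied into the middle tier. The following is only a sketch: SECTION_CATEGORIES is a hypothetical stand-in for what the metainfo would provide, and section_run and the field values are made up for the example.

```python
# Hypothetical stand-in for the metainfo: section name -> assigned categories.
SECTION_CATEGORIES = {
    'section_metadata': {'FrequentlyAccessed'},
    'section_workflow': {'FrequentlyAccessed'},
    'section_run': set(),
}


def frequently_accessed(archive: dict) -> dict:
    """Return only the top-level sections that carry FrequentlyAccessed."""
    return {
        name: value
        for name, value in archive.items()
        if 'FrequentlyAccessed' in SECTION_CATEGORIES.get(name, set())
    }


# Example archive with made-up content.
archive = {
    'section_metadata': {'calc_id': 'c1'},
    'section_run': [{'program_name': 'example'}],
    'section_workflow': {'workflow_type': 'geometry_optimization'},
}
mongo_document = frequently_accessed(archive)
```

Here mongo_document keeps section_metadata and section_workflow and drops section_run, which stays available only in the msg-pack file.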