Aggregations for custom quantities
We have two types of searchable items in our search index:
- "Static" quantities: Stored in our search index using the same hierarchy as in the metainfo. Each has its own dedicated mapping in ES. More performant search, less scalable (mapping explosion).
- "Dynamic" quantities: Stored as nested fields under searchable_quantities. Any compatible primitive quantities stored under data are currently indexed. Currently no proper search, less performant search, more scalable. Currently contains:
- Quantities from schema plugins
- Quantities from schemas contained in the data itself
- Nexus quantities stored under the 'native'
At the API level, we should harmonize the way this information can be searched. The idea is to modify the search API to support aggregations and searches for dynamic quantities transparently. This would allow us to expose the data for use in filter menus, search boxes, widgets etc.
Range aggregations (numbers and datetimes)
How it works
- During app bootup, all plugins defined in nomad.yaml are loaded and config.plugins.filtered_values will contain all enabled plugins.
- From each schema (a schema is a section inheriting from EntryData) contained in a plugin, we extract all scalar quantities and automatically create Elasticsearch annotation objects with dynamic=True for each one. The id used to differentiate quantities is discussed below. These annotations will be saved into EntryType.quantities just like regular annotations. The search API will use this information to know which paths correspond to dynamic quantities. In the future, it should also be possible to control which quantities to index by looking at an ES annotation in the schema.
- As usual, the GUI will get information about all ES annotated quantities through artifacts.js. This file contains the field 'searchQuantities' that is populated by looking at all registered ES annotations, no matter whether they are dynamic or static.
- The GUI may now perform regular search calls for dynamic quantities. The API will convert them into the correct calls:
- For aggregations, _api_to_es_aggregation adds a nested aggregation and an additional filter aggregation that ensures that the aggregation targets the correct quantity.
- For queries, _api_to_es_query ensures that an appropriate nested query is added, with an additional filter that focuses the search on a specific path+value.
- For include/exclude directives: including a dynamic quantity will automatically include searchable_quantities in the underlying ES call. Excluding searchable_quantities or any of the dynamic quantities is allowed, and will trigger an extra filtering step by our API.
- For aggregations, _es_to_api_aggregation maps aggregation responses to use the quantity id.
- _es_to_entry_dict will transform the hits. The return structure will follow the dynamic quantity path instead of using searchable_quantities. Include/exclude directives that cannot be handled by ES are also handled here.
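The request-side translation described above can be sketched as plain Elasticsearch request bodies. This is only an illustration under stated assumptions, not the actual implementation: the helper names dynamic_histogram_aggregation and dynamic_range_query are hypothetical, and the nested field names assume the path and double_value fields listed for a searchable quantity in this document.

```python
# Hypothetical sketch: how an API call on a dynamic quantity could be
# wrapped into nested Elasticsearch request bodies.
NESTED_PATH = "searchable_quantities"  # nested field named in this document


def dynamic_histogram_aggregation(quantity_id: str, interval: float) -> dict:
    """Nested aggregation + filter aggregation targeting one quantity."""
    return {
        "nested": {"path": NESTED_PATH},
        "aggs": {
            "filtered": {
                # Restrict the nested documents to the requested quantity.
                "filter": {"term": {f"{NESTED_PATH}.path": quantity_id}},
                "aggs": {
                    "values": {
                        "histogram": {
                            "field": f"{NESTED_PATH}.double_value",
                            "interval": interval,
                        }
                    }
                },
            }
        },
    }


def dynamic_range_query(quantity_id: str, gte: float, lte: float) -> dict:
    """Nested query that matches path and value in the same nested document."""
    return {
        "nested": {
            "path": NESTED_PATH,
            "query": {
                "bool": {
                    "filter": [
                        {"term": {f"{NESTED_PATH}.path": quantity_id}},
                        {
                            "range": {
                                f"{NESTED_PATH}.double_value": {"gte": gte, "lte": lte}
                            }
                        },
                    ]
                }
            },
        }
    }


agg = dynamic_histogram_aggregation("nomadschemaexample/schema/MySection.value", 0.5)
query = dynamic_range_query("nomadschemaexample/schema/MySection.value", 0.0, 1.0)
```

The quantity id nomadschemaexample/schema/MySection.value is made up for the example. The point of the filter is that both the path match and the value condition apply to the same nested document.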
How quantities are identified
We need a good way to distinguish quantities. Identifiers are needed e.g.:
- In the API queries to properly resolve the target quantity
- In the response structure to identify which quantity is returned
- In our GUI forms when displaying a list of available quantities.
The identifier should have two easily separable parts: one for the schema, and another for the quantity path within the schema. Note that instead of using the schema as the identifier, one could also use the quantity definition + schema path as an identifier (we do store the definition in the searchable quantity). This is not currently implemented, as it would be conceptually different from the static quantity searches and might lead to unexpected search hits from other schemas.
Schemas may be declared in a YAML file, or as Python classes inheriting from EntryData. Multiple schemas can be defined per plugin/YAML file, and inheritance should be taken into account. The canonical way to identify a section defining a schema is to use the function definition_reference. It will return a global reference of the following format:
nomadschemaexample.schema.MySection # Python
../uploads/<upload_id>/archive/<entry_id>#/definitions/section_definitions/2 # yaml
This syntax is problematic in search and in the GUI, mostly because of the . character:
- In Elasticsearch, . denotes hierarchy. This complicates the reconstruction of query results and the checking of include/exclude directives.
- Our entire GUI expects quantities to be separated into sections using the . character. We are e.g. using path strings (e.g. 'nomadschemaexample.schema.MySection.name') to recursively look deeper into objects, which breaks things.
- We also need to be able to separate the schema from the schema path. If the schema name contains dots, doing the separation would require an additional character, most of which are already reserved for some other purpose.
The current solution is to replace dots with / in the Python schema id and to strip out the leading dots in the YAML schema id.
The quantity id within the schema is fairly easy to construct, as it can be a regular, dot-separated path.
TLDR: With this, the final quantity identifiers look like this:
nomadschemaexample/schema/MySection.name # Python
/uploads/<upload_id>/archive/<entry_id>#/definitions/section_definitions/2.name # yaml
where the schema and schema path can be separated at the first occurrence of the . character.
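The identifier scheme can be sketched as follows. This is a minimal illustration, not the actual code: the function names schema_id, quantity_id and split_quantity_id are hypothetical, detecting a YAML reference by the presence of # is an assumption made for the example, and splitting at the first dot is inferred from the two example identifiers above.

```python
from typing import Tuple


def schema_id(definition_reference: str) -> str:
    """Turn a definition reference into a dot-free schema id."""
    if "#" in definition_reference:
        # Assumed YAML case: strip the leading dots of the relative reference.
        return definition_reference.lstrip(".")
    # Assumed Python case: replace the dots of the qualified name with slashes.
    return definition_reference.replace(".", "/")


def quantity_id(definition_reference: str, quantity_path: str) -> str:
    """Join schema id and the dot-separated quantity path within the schema."""
    return f"{schema_id(definition_reference)}.{quantity_path}"


def split_quantity_id(identifier: str) -> Tuple[str, str]:
    """Split at the first dot: the schema id itself contains no dots."""
    schema, _, path = identifier.partition(".")
    return schema, path


python_id = quantity_id("nomadschemaexample.schema.MySection", "name")
yaml_id = quantity_id(
    "../uploads/<upload_id>/archive/<entry_id>#/definitions/section_definitions/2",
    "name",
)
```

Because the schema part never contains a dot after the conversion, a nested quantity path such as subsection.name still splits cleanly at the first dot.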
Fields for a searchable quantity:
path: Path of the quantity in a specific schema definition.
definition: The definition for the quantity.
path_archive: Path of the value in the archive.
double_value: The actual value to store.
It would be possible to separate the path attribute into two different parts, the schema and schema_path, but there are currently a few reasons not to do this:
- In the 'User defined quantities' menu we search for all unique quantities to list them. This is done using a terms aggregation on a single field. If we split out path_schema, we would not be able to get unique quantities with a terms aggregation; we would need some sort of multi-terms aggregation, which is currently not implemented in our API.
- At the moment, the search always targets a specific quantity in a specific schema, which means that we can identify the targeted value with one criterion based on path: this is probably faster than matching two separate fields. If in the future we want to enable searches that target specific schemas or specific paths, we would probably need to add these fields or perform a more complicated string match query.
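To illustrate the first point, listing the unique quantities with a single path field only needs one terms aggregation inside a nested aggregation. The sketch below builds such a request body. The helper name unique_quantities_aggregation and the size default are hypothetical, and the field name assumes the path field described above.

```python
def unique_quantities_aggregation(size: int = 100) -> dict:
    """Terms aggregation over the single path field of the nested documents."""
    return {
        "nested": {"path": "searchable_quantities"},
        "aggs": {
            "paths": {
                # One bucket per unique schema + quantity path combination.
                "terms": {"field": "searchable_quantities.path", "size": size}
            }
        },
    }


body = unique_quantities_aggregation(size=50)
```

With the identifier split over two fields, one terms aggregation per field would return the unique schemas and the unique paths separately, but not which combinations of them actually occur.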
Problem with schemas defined as part of data:
In order for the API to properly resolve dynamic quantities that are defined in a custom user-specific schema, the API should have access to their definition in order to translate the aggregations to the searchable_quantities field. For the translation, we need the full definition (schema+schema_path+dtype), and since any user can have any number of custom quantities, we would need to perform a fairly complicated collection of API calls to resolve all of the custom quantity definitions from YAML files.
Instead of attempting to load all custom quantity definitions, we could support an extended syntax that contains also the data type to resolve the value field, e.g.:
/uploads/<upload_id>/entries/<entry_id>#firstname.lastname@example.org # yaml
In an app definition, these custom quantities could then be "registered" (registration should contain the data type and description) and used as usual.
I would argue that in most cases it is difficult to distinguish which quantities should be stored under keyword_value and which quantities should be stored under searchable_quantities. To simplify things, I propose that by default all string quantities are stored under a text mapping, which also has a multi-field with a keyword mapping to enable aggregations on it.
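A text mapping with a keyword multi-field could look roughly like the sketch below. The field name str_value and the helper string_value_mapping are hypothetical and only serve the example. The text type, the fields multi-field and the ignore_above parameter are standard Elasticsearch mapping features.

```python
def string_value_mapping(ignore_above: int = 256) -> dict:
    """Full-text field with a keyword sub-field for aggregations."""
    return {
        "type": "text",
        "fields": {
            # The sub-field is addressed as <field>.keyword in aggregations.
            "keyword": {"type": "keyword", "ignore_above": ignore_above}
        },
    }


# Hypothetical excerpt of the nested mapping; str_value is an assumed name.
searchable_quantities_mapping = {
    "type": "nested",
    "properties": {
        "path": {"type": "keyword"},
        "str_value": string_value_mapping(),
    },
}
```

Full-text queries would then target searchable_quantities.str_value, while terms aggregations would target searchable_quantities.str_value.keyword.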
Turning more stuff into plugins
The most consistent way to support dynamic quantities would be to index all quantities from custom schemas by default and to turn nexus + other schemas into plugins. This would allow scaling the searchable quantities by controlling which plugins are enabled.
Only scalar fields
The existing implementation is enabled only for scalar fields. Maybe this can be lifted?
No nested queries
Because we are flattening all quantities to a single list, information about the hierarchy is lost. This means that nested queries are not possible with our current setup for custom quantities. E.g. if an entry contains a Section with the quantities quantityA and quantityB, it will not be possible to query for entries containing a Section which fulfills quantityA = a and quantityB = b simultaneously.
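The effect of the flattening can be reproduced with a small, self-contained sketch. The entry structure, the values and the helper names below are all made up for illustration and do not come from the actual implementation.

```python
from typing import Any, Dict, Iterator, List, Tuple


def flatten(section: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) pairs for all scalars, dropping list positions."""
    for key, value in section.items():
        path = f"{prefix}.{key}" if prefix else key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                yield from flatten(item, path)
            else:
                yield path, item


def matches_flat(flat_items: List[Tuple[str, Any]], conditions: Dict[str, Any]) -> bool:
    """Each condition is checked independently against the flat list."""
    return all((path, value) in flat_items for path, value in conditions.items())


def matches_same_section(sections: List[Dict[str, Any]], conditions: Dict[str, Any]) -> bool:
    """All conditions must hold within one and the same section instance."""
    return any(all(s.get(k) == v for k, v in conditions.items()) for s in sections)


# Hypothetical entry: no single section has quantityA = a AND quantityB = b.
entry = {
    "section": [
        {"quantityA": "a", "quantityB": "x"},
        {"quantityA": "y", "quantityB": "b"},
    ]
}
flat = list(flatten(entry))

flat_hit = matches_flat(flat, {"section.quantityA": "a", "section.quantityB": "b"})
nested_hit = matches_same_section(entry["section"], {"quantityA": "a", "quantityB": "b"})
print(flat_hit, nested_hit)  # True False
```

The flat list reports a match although the two values come from different section instances, which is exactly the information that is lost.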
Limit of 10 000 nested docs
By default, a single document can contain at most 10 000 nested documents (Elasticsearch index setting index.mapping.nested_objects.limit). For very large archives, this limit may be hit at some point when searchable_quantities is populated.