Aggregations for custom quantities

We have two types of searchable items in our search index:

"Static" quantities: Stored in our search index using the same hierarchy as in the metainfo. Each has its own dedicated mapping in ES. Search is more performant, but the approach is less scalable (mapping explosion).
"Dynamic" quantities: Stored as nested fields under searchable_quantities. All compatible primitive quantities stored under data are currently indexed. There is currently no proper search support; search is less performant, but the approach is more scalable. Currently contains:
- Quantities from schema plugins
- Quantities from schemas contained in the data itself
- Nexus quantities stored under the 'native' nexus field

At the API level, we should harmonize the way this information can be searched. The idea is to modify the search API to support aggregations and searches for dynamic quantities transparently. This would allow us to expose the data for use in filter menus, search boxes, widgets etc.

How it works

During app bootup, all plugins defined in nomad.yaml are loaded and config.plugins.filtered_values will contain all enabled plugins.
From each schema (a schema is a section inheriting from EntryData) contained in a plugin, we extract all scalar quantities and automatically create Elasticsearch annotation objects with dynamic=True for each one. The id used to differentiate quantities is discussed below. These annotations will be saved into EntryType.quantities just like regular annotations. The search API will use this information to know which paths correspond to dynamic quantities. In the future, it should also be possible to control which quantities to index by looking at an ES annotation in the schema.
As usual, the GUI will get information about all ES annotated quantities through artifacts.js. This file contains the field 'searchQuantities' that is populated by looking at all registered ES annotations, no matter whether they are dynamic or static.
The GUI may now perform regular search calls for dynamic quantities. The API will translate them into the correct ES calls:
- For aggregations, _api_to_es_aggregation adds nested aggregation and additional filter aggregation that ensures that the aggregation targets the correct quantity.
- For queries, _api_to_es_query will ensure that an appropriate nested query is added with the additional filter that focuses the search on a specific path+value.
- _api_to_es_required handles the include and exclude directives. Including a dynamic quantity will automatically include searchable_quantities in the underlying ES call. Excluding searchable_quantities or any of the dynamic quantities is allowed, and will trigger an extra filtering step by our API.
- _es_to_api_aggregation maps aggregation responses to use the quantity id instead of searchable_quantities.
- _es_to_entry_dict will transform the hits. The return structure will follow the dynamic quantity path instead of using searchable_quantities. Include/exclude that cannot be handled by ES is also handled here.
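
The query and aggregation translation described above can be sketched roughly as follows. This is only an illustration of the nested query + filter pattern, not the actual code of _api_to_es_query or _api_to_es_aggregation. The function names, the aggregation names ("filtered", "values") and the example quantity id are hypothetical; the field names searchable_quantities, path and long_value are taken from this issue.

```python
NESTED_PATH = "searchable_quantities"


def nested_term_query(quantity_id: str, value, value_field: str) -> dict:
    """Build a nested query that matches one dynamic quantity by path and value.

    Both conditions sit inside the same nested query, so they must hold for
    the same nested document (i.e. the same stored quantity).
    """
    return {
        "nested": {
            "path": NESTED_PATH,
            "query": {
                "bool": {
                    "filter": [
                        {"term": {f"{NESTED_PATH}.path": quantity_id}},
                        {"term": {f"{NESTED_PATH}.{value_field}": value}},
                    ]
                }
            },
        }
    }


def nested_terms_aggregation(quantity_id: str, value_field: str, size: int = 10) -> dict:
    """Build a nested aggregation restricted to one quantity by a filter aggregation."""
    return {
        "nested": {"path": NESTED_PATH},
        "aggs": {
            "filtered": {
                # Only nested documents belonging to the target quantity are kept.
                "filter": {"term": {f"{NESTED_PATH}.path": quantity_id}},
                "aggs": {
                    "values": {
                        "terms": {"field": f"{NESTED_PATH}.{value_field}", "size": size}
                    }
                },
            }
        },
    }


# Hypothetical quantity id, used only for illustration.
query = nested_term_query("nomadschemaexample/schema/MySection.count", 3, "long_value")
agg = nested_terms_aggregation("nomadschemaexample/schema/MySection.count", "long_value")
```

The returned dictionaries are plain Elasticsearch query DSL, so they could be passed as the body of a search request by whichever client is in use.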

How quantities are identified

We need a good way to distinguish quantities. Identifiers are needed e.g. in:

- In the API queries to properly resolve the target quantity
- In the response structure to identify which quantity is returned
- In our GUI forms when displaying a list of available quantities

The identifier should have two, easily separable parts: one for the schema, another one for the quantity path within the schema. Note that instead of using the schema as identifier, one could also use the quantity definition + schema path as an identifier (we do store the definition in the searchable quantity). This is not currently implemented as it would be conceptually different from the static quantity searches and might lead to unexpected search hits from other schemas.

Schemas may be declared in a YAML file, or as Python classes inheriting from EntryData. Multiple schemas can be defined per plugin/YAML file, and inheritance should be taken into account. The canonical way to identify a section defining a schema is to use the function definition_reference. It will return a global reference of the following format:

nomadschemaexample.schema.MySection  # Python
../uploads/<upload_id>/archive/<entry_id>#/definitions/section_definitions/2  # yaml

This syntax is problematic in search and in the GUI, mostly because of the . character:

- In Elasticsearch, . denotes hierarchy. This complicates the reconstruction of query results, and the checking of include/exclude directives.
- Our entire GUI expects quantities to be separated into sections using the . character. We are e.g. using path strings (e.g. 'nomadschemaexample.schema.MySection.name') to recursively look deeper into objects, which breaks when the schema name itself contains dots.
- We also need to be able to separate the schema from the schema path. If the schema name contains dots, doing the separation would require the use of an additional character, most of which are already reserved for some other purpose.

The current solution is to replace dots with / in the Python schema id and to strip out the leading dots in the YAML schema id.

The quantity id within the schema is fairly easy to construct, as it can be a regular, dot-separated path.

TLDR: With this, the final quantity identifiers look like this:

nomadschemaexample/schema/MySection.name  # Python
/uploads/<upload_id>/archive/<entry_id>#/definitions/section_definitions/2.name  # yaml

where the schema and schema path can be separated at the first occurrence of .
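
As a minimal sketch of the scheme above (the helper names are hypothetical and this is not the actual implementation), the identifiers can be built and split like this. It assumes that the schema id itself contains no . after the conversion, e.g. that upload and entry ids contain no dots.

```python
from typing import Tuple


def python_schema_id(definition_reference: str) -> str:
    """Replace dots with / in a Python definition reference."""
    return definition_reference.replace(".", "/")


def yaml_schema_id(definition_reference: str) -> str:
    """Strip the leading dots from a YAML definition reference."""
    return definition_reference.lstrip(".")


def quantity_id(schema_id: str, quantity_path: str) -> str:
    """Join the schema id and the dot-separated quantity path."""
    return f"{schema_id}.{quantity_path}"


def split_quantity_id(identifier: str) -> Tuple[str, str]:
    """Split an identifier into (schema id, quantity path) at the first dot."""
    schema_id, _, quantity_path = identifier.partition(".")
    return schema_id, quantity_path


py_id = quantity_id(python_schema_id("nomadschemaexample.schema.MySection"), "name")
yaml_id = quantity_id(
    yaml_schema_id("../uploads/<upload_id>/archive/<entry_id>#/definitions/section_definitions/2"),
    "name",
)
```

Because the quantity path may itself contain dots for nested sections, the split has to happen at the first dot and not at the last one.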

Fields for a searchable quantity:

- path: Path of the quantity in a specific schema definition.
- definition: The definition for the quantity.
- path_archive: Path of the value in the archive.
- bool_value/text_value/long_value/double_value: The actual value to store.
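
To make the record layout concrete, here is a rough sketch of how one such record could be assembled. The helper names and the example values are hypothetical, and the mapping from Python types to value fields is an assumption based on the field names listed above.

```python
from typing import Any, Dict


def value_field_for(value: Any) -> str:
    """Pick the value field for a scalar (assumed mapping, see lead-in)."""
    # bool must be checked before int, since bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool_value"
    if isinstance(value, int):
        return "long_value"
    if isinstance(value, float):
        return "double_value"
    if isinstance(value, str):
        return "text_value"
    raise TypeError(f"Unsupported scalar type: {type(value).__name__}")


def searchable_quantity(path: str, definition: str, path_archive: str, value: Any) -> Dict[str, Any]:
    """Assemble one nested document for searchable_quantities."""
    return {
        "path": path,
        "definition": definition,
        "path_archive": path_archive,
        value_field_for(value): value,
    }


# Placeholder values for illustration only.
record = searchable_quantity(
    path="nomadschemaexample/schema/MySection.name",
    definition="nomadschemaexample.schema.MySection.name",
    path_archive="data.name",
    value="sample 1",
)
```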

It would be possible to separate the path attribute into two different parts: schema + schema_path, but there are currently a few reasons not to do this:

- In the 'User defined quantities' menu we search for all unique quantities to list them out. This is done using an aggregation on a single field. If we split path into schema and path_schema, we could no longer get the unique quantities with a terms aggregation, but would need some sort of multi terms aggregation, which is currently not implemented in our API.
- At the moment, the search always targets a specific quantity in a specific schema, which means that we can identify the targeted value with a single criterion based on path: this is probably faster than matching two separate fields. If we later want to enable searches that target specific schemas or specific paths, we would probably need to add these fields or perform a more complicated string match query.

Problem with schemas defined as part of data:

In order for the API to properly resolve dynamic quantities that are defined in a custom user-specific schema, the API should have access to their definition in order to translate the aggregations to the searchable_quantities field. For the translation, we need the full definition (schema+schema_path+dtype), and since any user can have any number of custom quantities, we would need to perform a fairly complicated collection of API calls to resolve all of the custom quantity definitions from YAML files.

Instead of attempting to load all custom quantity definitions, we could support an extended syntax that contains also the data type to resolve the value field, e.g.:

/uploads/<upload_id>/entries/<entry_id>#definitions/0@str.name  # yaml

In an app definition, these custom quantities could then be "registered" (registration should contain the data type and description) and used as usual.
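
A possible way to parse such an extended identifier is sketched below. Since the syntax is only a proposal, everything here is an assumption: the function name, the dtype tokens other than str, and the mapping from dtype to value field.

```python
from typing import NamedTuple

# Assumed dtype tokens; only "str" appears in the example above.
DTYPE_TO_FIELD = {
    "str": "text_value",
    "int": "long_value",
    "float": "double_value",
    "bool": "bool_value",
}


class ParsedQuantity(NamedTuple):
    schema: str
    dtype: str
    path: str
    value_field: str


def parse_extended_id(identifier: str) -> ParsedQuantity:
    """Split '<schema>@<dtype>.<path>' into its parts."""
    # The schema part and the quantity path are separated at the first dot.
    schema_and_dtype, dot, path = identifier.partition(".")
    # The dtype follows the last @ of the schema part.
    schema, at, dtype = schema_and_dtype.rpartition("@")
    if not dot or not at or not path:
        raise ValueError(f"Not an extended quantity id: {identifier!r}")
    if dtype not in DTYPE_TO_FIELD:
        raise ValueError(f"Unknown data type: {dtype!r}")
    return ParsedQuantity(schema, dtype, path, DTYPE_TO_FIELD[dtype])


parsed = parse_extended_id("/uploads/<upload_id>/entries/<entry_id>#definitions/0@str.name")
```

With the data type embedded in the identifier, the API could pick the value field without loading the YAML definition.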

String quantities

I would argue that in most cases it is difficult to distinguish which quantities should be stored under keyword_value and which quantities should be stored under text_value in searchable_quantities. To simplify things, I propose that by default all string quantities are stored under a text mapping, which has a multi-field for keyword mapping as well to enable aggregations on it.
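
As an illustration of this proposal, the mapping for the string value field could look roughly like the following, written as a Python dict. The sub-field name keyword is an assumption, and this is not the current mapping.

```python
# Proposed mapping for string values: full-text search on the text field,
# aggregations on the keyword multi-field.
text_value_mapping = {
    "type": "text",
    "fields": {
        "keyword": {"type": "keyword"},  # sub-field name is an assumption
    },
}

searchable_quantities_mapping = {
    "type": "nested",
    "properties": {
        "path": {"type": "keyword"},
        "text_value": text_value_mapping,
    },
}

# A terms aggregation would then target the keyword sub-field.
aggregation_field = "searchable_quantities.text_value.keyword"
```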

Turning more stuff into plugins

The most consistent way of supporting dynamic quantities would be to index all quantities from custom schemas by default and turn nexus + other schemas into plugins. This would allow scaling the searchable quantities by controlling which plugins are enabled.

Limitations

Only scalar fields

The existing implementation is enabled only for scalar fields. Maybe this can be lifted?

No nested queries

Because we are flattening all quantities to a single list, information about the hierarchy is lost. This means that nested queries are not possible with our current setup for custom quantities. E.g. if an entry has the following structure:

data: {
  Section: [
    {
      quantityA: a,
      quantityB: b,
    },
    {
      quantityA: a,
      quantityB: c,
    },
  ]
}

it will not be possible to query for entries containing a Section that fulfills quantityA = a and quantityB = b simultaneously.
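
The effect can be reproduced with a small sketch in plain Python, without Elasticsearch. The flattening below is a simplified stand-in for how searchable_quantities is populated, and the function names are hypothetical.

```python
def flatten(node, prefix=""):
    """Yield (path, value) for every scalar leaf; list indices are dropped."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for child in node:
            yield from flatten(child, prefix)
    else:
        yield prefix, node


def matches_flat(entry, conditions):
    """True if every (path, value) condition matches some flattened quantity."""
    flat = set(flatten(entry))
    return all(condition in flat for condition in conditions)


def matches_same_section(entry, a, b):
    """True only if one single Section has both quantityA == a and quantityB == b."""
    return any(
        section.get("quantityA") == a and section.get("quantityB") == b
        for section in entry["data"]["Section"]
    )


# No single Section has quantityA = a and quantityB = b here.
entry = {
    "data": {
        "Section": [
            {"quantityA": "a", "quantityB": "c"},
            {"quantityA": "x", "quantityB": "b"},
        ]
    }
}
conditions = [("data.Section.quantityA", "a"), ("data.Section.quantityB", "b")]

flat_hit = matches_flat(entry, conditions)            # True: a false positive
nested_hit = matches_same_section(entry, "a", "b")    # False: the intended answer
```

After flattening, the two conditions are satisfied by quantities from different Section instances, and nothing in the flat list lets us tell that apart from a genuine match.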

Limit of 10 000 nested docs

By default, Elasticsearch allows a single document to contain at most 10 000 nested documents (the index.mapping.nested_objects.limit setting). For very large archives, this limit may be hit when searchable_quantities is populated.

Edited Oct 20, 2023 by Lauri Himanen