Aggregations for custom quantities
We have two types of searchable items in our search index:
- "Static" quantities: Stored in our search index using the same hierarchy as in the metainfo. Each has its own dedicated mapping in ES. More performant search, less scalable (mapping explosion).
- "Dynamic" quantities: Stored as nested fields under searchable_quantities. Any compatible primitive quantities stored under data are currently indexed. Currently no proper search, less performant search, more scalable. Currently contains:
  - Quantities from schema plugins
  - Quantities from schemas contained in the data itself
  - Nexus quantities stored under the 'native' nexus field
On the level of the API interface, we should harmonize the way this information can be searched. The idea is to modify the search API to support aggregations and searches for dynamic quantities transparently. This would allow us to expose the data for use in filter menus, search boxes, widgets etc.
Tasks
- Aggregations:
  - Terms aggregations
  - Histogram aggregations
  - Range aggregations (numbers and datetimes)
- Query:
  - Match/Term
  - Range
- Include/exclude
- Sorting
- Tests:
  - Terms aggregation
  - Histogram aggregation
  - Range aggregation
  - Query
  - Include/exclude
How it works
- During app bootup, all plugins defined in nomad.yaml are loaded and config.plugins.filtered_values will contain all enabled plugins.
- From each schema (a schema is a section inheriting from EntryData) contained in a plugin, we extract all scalar quantities and automatically create Elasticsearch annotation objects with dynamic=True for each one. The id used to differentiate quantities is discussed below. These annotations will be saved into EntryType.quantities just like regular annotations. The search API will use this information to know which paths correspond to dynamic quantities. In the future, it should also be possible to control which quantities to index by looking at an ES annotation in the schema.
- As usual, the GUI will get information about all ES annotated quantities through artifacts.js. This file contains the field 'searchQuantities' that is populated by looking at all registered ES annotations, no matter whether they are dynamic or static.
- The GUI may now perform regular search calls for dynamic quantities. The API will trigger conversions to the correct calls:
  - For aggregations, _api_to_es_aggregation adds a nested aggregation and an additional filter aggregation that ensures that the aggregation targets the correct quantity.
  - For queries, _api_to_es_query will ensure that an appropriate nested query is added with the additional filter that focuses the search on a specific path+value.
  - _api_to_es_required handles the include and exclude directives. Including a dynamic quantity will automatically include searchable_quantities in the underlying ES call. Excluding searchable_quantities or any of the dynamic quantities is allowed, and will trigger an extra filtering step by our API.
  - _es_to_api_aggregation maps aggregation responses to use the quantity id instead of searchable_quantities.
  - _es_to_entry_dict will transform the hits. The return structure will follow the dynamic quantity path instead of using searchable_quantities. Include/exclude that cannot be handled by ES is also handled here.
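The aggregation and query conversions listed above can be sketched with plain Elasticsearch query DSL dictionaries. This is only an illustration: the function names, the aggregation names (nested_agg, filtered, agg), the example quantity MySection.count and the choice of long_value as the value field are assumptions, and the code does not reproduce the actual _api_to_es_aggregation or _api_to_es_query implementations.

```python
NESTED_PATH = "searchable_quantities"


def nested_terms_aggregation(quantity_id: str, value_field: str, size: int = 10) -> dict:
    """Wrap a terms aggregation so that it only sees one dynamic quantity."""
    return {
        "nested_agg": {
            # Step into the nested documents stored under searchable_quantities.
            "nested": {"path": NESTED_PATH},
            "aggs": {
                "filtered": {
                    # Keep only the nested documents of the targeted quantity.
                    "filter": {"term": {f"{NESTED_PATH}.path": quantity_id}},
                    "aggs": {
                        "agg": {
                            "terms": {
                                "field": f"{NESTED_PATH}.{value_field}",
                                "size": size,
                            }
                        }
                    },
                }
            },
        }
    }


def nested_term_query(quantity_id: str, value_field: str, value) -> dict:
    """Match entries where a single nested document has both the path and the value."""
    return {
        "nested": {
            "path": NESTED_PATH,
            "query": {
                "bool": {
                    "filter": [
                        {"term": {f"{NESTED_PATH}.path": quantity_id}},
                        {"term": {f"{NESTED_PATH}.{value_field}": value}},
                    ]
                }
            },
        }
    }


# Hypothetical integer quantity 'count' in the example schema.
QUANTITY = "nomadschemaexample/schema/MySection.count"
aggregation = nested_terms_aggregation(QUANTITY, "long_value")
query = nested_term_query(QUANTITY, "long_value", 3)
```

Both filters sit inside the nested context, so the path and the value are always checked against the same nested document.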
How quantities are identified
We need a good way to distinguish quantities. Identifiers are needed e.g.:
- In the API queries to properly resolve the target quantity
- In the response structure to identify which quantity is returned
- In our GUI forms when displaying a list of available quantities
The identifier should have two, easily separable parts: one for the schema, another one for the quantity path within the schema. Note that instead of using the schema as identifier, one could also use the quantity definition + schema path as an identifier (we do store the definition in the searchable quantity). This is not currently implemented as it would be conceptually different from the static quantity searches and might lead to unexpected search hits from other schemas.
Schemas may be declared in a YAML file, or as Python classes inheriting from EntryData. Multiple schemas can be defined per plugin/YAML file, and inheritance should be taken into account. The canonical way to identify a section defining a schema is to use the function definition_reference. It will return a global reference of the following format:
nomadschemaexample.schema.MySection # Python
../uploads/<upload_id>/archive/<entry_id>#/definitions/section_definitions/2 # yaml
This syntax is problematic in search and in the GUI, mostly because of the . character:
- In Elasticsearch, . denotes hierarchy. This complicates the reconstruction of query results, and the checking of include/exclude directives.
- Our entire GUI expects quantities to be separated into sections using the . character. We are e.g. using path strings (e.g. 'nomadschemaexample.schema.MySection.name') to recursively look deeper into objects, which breaks things.
- We also need to be able to separate the schema from the schema path. If the schema name contains dots, doing the separation would require an additional character, most of which are already reserved for some other purpose.
The current solution is to replace dots with / in the Python schema id and to strip out the leading dots in the YAML schema id.
The quantity id within the schema is fairly easy to construct, as it can be a regular, dot-separated path.
TLDR: With this, the final quantity identifiers look like this:
nomadschemaexample/schema/MySection.name # Python
/uploads/<upload_id>/archive/<entry_id>#/definitions/section_definitions/2.name # yaml
where the schema and schema path can be separated at the first occurrence of the . character.
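The identifier rules above can be sketched as follows. The function names are hypothetical, and detecting a YAML reference by its leading dots is an assumption made only for this illustration.

```python
from typing import Tuple


def schema_id_from_reference(reference: str) -> str:
    """Turn a definition reference into a schema id that contains no dots."""
    if reference.startswith("."):
        # YAML schema reference ('../uploads/...'): strip the leading dots.
        return reference.lstrip(".")
    # Python schema reference: replace the dots with slashes.
    return reference.replace(".", "/")


def quantity_id(reference: str, path: str) -> str:
    """Join the schema id and the dot-separated quantity path."""
    return f"{schema_id_from_reference(reference)}.{path}"


def split_quantity_id(identifier: str) -> Tuple[str, str]:
    """Split an identifier into (schema, path) at the first dot."""
    schema, _, path = identifier.partition(".")
    return schema, path


python_id = quantity_id("nomadschemaexample.schema.MySection", "name")
yaml_id = quantity_id(
    "../uploads/<upload_id>/archive/<entry_id>#/definitions/section_definitions/2",
    "name",
)
```

Because the schema part never contains a dot, everything after the first dot can be treated as a regular dot-separated path within the schema.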
Fields for a searchable quantity:
- path: Path of the quantity in a specific schema definition.
- definition: The definition for the quantity.
- path_archive: Path of the value in the archive.
- bool_value/text_value/long_value/double_value: The actual value to store.
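As a rough illustration of these fields, the sketch below flattens a plain dictionary into a list of searchable quantity documents. The function names, the format of path_archive and the omission of the definition field are assumptions made for the example; real archives are metainfo sections, not dictionaries.

```python
from typing import Any, Dict, Iterator


def value_field(value: Any) -> str:
    """Pick the typed field that stores a scalar value."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "bool_value"
    if isinstance(value, int):
        return "long_value"
    if isinstance(value, float):
        return "double_value"
    if isinstance(value, str):
        return "text_value"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def flatten(
    schema_id: str, data: Dict[str, Any], path: str = "", path_archive: str = "data"
) -> Iterator[Dict[str, Any]]:
    """Yield one searchable quantity document per scalar value."""
    for key, value in data.items():
        child_path = f"{path}.{key}" if path else key
        child_archive = f"{path_archive}.{key}"
        if isinstance(value, dict):
            yield from flatten(schema_id, value, child_path, child_archive)
        elif isinstance(value, list):
            # Repeated subsections: the index appears only in path_archive.
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    yield from flatten(
                        schema_id, item, child_path, f"{child_archive}.{index}"
                    )
            # Lists of plain values are skipped: only scalars are indexed.
        else:
            yield {
                "path": f"{schema_id}.{child_path}",
                "path_archive": child_archive,
                value_field(value): value,
            }


docs = list(
    flatten(
        "nomadschemaexample/schema/MySection",
        {"name": "x", "sub": [{"n": 1}, {"n": 2}], "flag": True},
    )
)
```

Note that the two values of sub.n end up with the same path and differ only in path_archive.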
It would be possible to separate the path attribute into two different parts: schema + schema_path, but there are currently a few reasons not to do this:
- In the 'User defined quantities' menu we search for all unique quantities to list them out. This is done using an aggregation for a field. If we split path into schema and schema_path, we would not be able to get unique quantities from them with a terms aggregation; we would need to use some sort of multi terms aggregation, which is currently not implemented in our API.
- At the moment, the search always targets a specific quantity in a specific schema, which means that we can identify the targeted value with one criterion based on path: this is probably faster than matching two separate fields. If in the future we want to enable searches that target specific schemas, or specific paths, we would probably need to add these fields or perform a more complicated string match query.
Problem with schemas defined as part of data:
In order for the API to properly resolve dynamic quantities that are defined in a custom user-specific schema, the API should have access to their definition in order to translate the aggregations to the searchable_quantities field. For the translation, we need the full definition (schema+schema_path+dtype) and since any user can have any number of custom quantities, we would need to perform a fairly complicated collection of API calls to resolve all of the custom quantity definitions from YAML files.
Instead of attempting to load all custom quantity definitions, we could support an extended syntax that contains also the data type to resolve the value field, e.g.:
/uploads/<upload_id>/entries/<entry_id>#definitions/0@str.name # yaml
In an app definition, these custom quantities could then be "registered" (registration should contain the data type and description) and used as usual.
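A minimal sketch of how such an extended identifier could be parsed is shown below. Only the str type appears in the example above; the other type names, the mapping to value fields and all function and class names are assumptions.

```python
from typing import NamedTuple

# Hypothetical mapping from the type suffix to the field that holds the value.
VALUE_FIELDS = {
    "str": "text_value",
    "int": "long_value",
    "float": "double_value",
    "bool": "bool_value",
}


class DynamicQuantity(NamedTuple):
    schema: str
    dtype: str
    path: str
    value_field: str


def parse_extended_id(identifier: str) -> DynamicQuantity:
    """Split '<schema>@<dtype>.<path>' into its parts."""
    schema, separator, rest = identifier.partition("@")
    if not separator:
        raise ValueError("missing '@<dtype>' after the schema part")
    dtype, separator, path = rest.partition(".")
    if not separator or not path:
        raise ValueError("missing quantity path")
    if dtype not in VALUE_FIELDS:
        raise ValueError(f"unknown data type: {dtype}")
    return DynamicQuantity(schema, dtype, path, VALUE_FIELDS[dtype])


parsed = parse_extended_id(
    "/uploads/<upload_id>/entries/<entry_id>#definitions/0@str.name"
)
```

With the data type carried in the identifier, the value field can be chosen without loading the schema definition.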
String quantities
I would argue that in most cases it is difficult to distinguish which quantities should be stored under keyword_value and which quantities should be stored under text_value in searchable_quantities. To simplify things, I propose that by default all string quantities are stored under a text mapping, which has a multi-field for keyword mapping as well to enable aggregations on it.
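The proposed mapping could look roughly like the following, written as a Python dictionary in Elasticsearch mapping syntax. The sub-field name keyword, the use of text_value as the field name and the helper function are assumptions.

```python
# Text mapping with a keyword multi-field: the same string is analyzed for
# full-text search and also stored verbatim for aggregations.
STRING_VALUE_MAPPING = {
    "type": "text",
    "fields": {
        "keyword": {"type": "keyword"},
    },
}


def string_field(exact: bool, base: str = "searchable_quantities.text_value") -> str:
    """Return the field to target: the keyword sub-field for aggregations and
    exact matches, the analyzed text field otherwise."""
    return f"{base}.keyword" if exact else base
```

A terms aggregation would then target string_field(True), while a full-text match query would target string_field(False).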
Turning more stuff into plugins
The most consistent way for supporting dynamic quantities would be to by default index all quantities from custom schemas and turn nexus + other schemas into plugins. This would allow scaling the searchable quantities by controlling which plugins are enabled.
Limitations
Only scalar fields
The existing implementation is enabled only for scalar fields. Maybe this can be lifted?
No nested queries
Because we are flattening all quantities to a single list, information about the hierarchy is lost. This means that nested queries are not possible with our current setup for custom quantities. E.g. if an entry has the following structure:
data: {
Section: [
{
quantityA: a,
quantityB: b,
},
{
quantityA: a,
quantityB: c,
},
]
}
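it will not be possible to query for entries containing a Section which fulfills quantityA = a and quantityB = b simultaneously: in the flattened list we cannot tell whether two values come from the same Section instance, so the two conditions may be satisfied by different instances.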
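The effect can be reproduced without Elasticsearch. The sketch below compares an entry-level match on the flattened values with a match that requires one section to satisfy all criteria; the function names and the value x are made up for the illustration.

```python
def flatten_sections(sections):
    """Collect (quantity, value) pairs from all sections into one flat set."""
    return {(name, value) for section in sections for name, value in section.items()}


def matches_flat(sections, criteria):
    """Entry-level match: each criterion may be met by a different section."""
    flat = flatten_sections(sections)
    return all((name, value) in flat for name, value in criteria.items())


def matches_nested(sections, criteria):
    """Section-level match: one section must meet all criteria at once."""
    return any(
        all(section.get(name) == value for name, value in criteria.items())
        for section in sections
    )


sections = [
    {"quantityA": "a", "quantityB": "c"},
    {"quantityA": "x", "quantityB": "b"},
]
criteria = {"quantityA": "a", "quantityB": "b"}

print(matches_flat(sections, criteria))    # True: a false positive
print(matches_nested(sections, criteria))  # False: no single section matches
```

With the flattened representation only the first kind of match is available, so this entry would be returned even though no single Section has both values.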
Limit of 10 000 nested docs
By default, a single document can contain 10 000 nested documents (the Elasticsearch index setting index.mapping.nested_objects.limit). For very large archives, this limit may be hit at some point when searchable_quantities is populated.