
Possibility to use custom schemas in apps and search

Lauri Himanen requested to merge search into develop

This MR adds the possibility of using scalar quantities from custom schemas (both Python and YAML) in queries, aggregations and app definitions. New documentation for Apps is also included.

Highlights

  • The path in the data is no longer enough to target a quantity, because e.g. data.sample.id may be defined by several different schemas. We need to add an identifier for the schema to the quantity name.

  • Schemas are identified using the qualified_name of the root section (can be fetched with section.qualified_name()):

    • Python schemas: path of the class name, e.g. nomadschemaexample.schema.MySchema
    • YAML schemas: entry_id:gTqaJYQ7IH20dl5PeX7ZPzkHudI8.MySchema

    The other option would be to use the reference syntax (can be fetched with definition_reference()), but the references to YAML files are very hard to use, as they look like this: ../uploads/Yl6DTVCVS1GqYRwEhmexrA/raw/schema.archive.yaml#/definitions/section_definitions/1.

  • To target a quantity in a schema, the full quantity identifier is <path>#<schema_name>, e.g. data.sample.id#nomadschemaexample.schema.MySchema.

  • The GUI will support simplified rendering of schema identifiers to remove clutter from the UI, but in the app config and in our backend the quantities need to be identified by this full name.

  • Currently # is used as the separator between path and schema name. The choice is complicated by the fact that many separators are reserved for other purposes:

    • . denotes section hierarchy in paths, and is also used in schema identifiers.
    • : is used in the YAML schema name and also for query modifiers, e.g. material.elements:all: ['Si', 'C']
    • / is used in inner_section_definitions and in YAML schema paths
    • & is reserved for URL query parameters
    • @ is reserved for indicating a hash digest for a definition. Used to e.g. distinguish between different versions of a schema.
    • Any operators commonly used in boolean logic (+, -, &, |) should not be used if we want to later add support for them in the search bar.
  • For technical reasons, the API calls targeting YAML quantities will need to include the data type. The current syntax is like this: <path>#<schema_name>#<dtype>. This data type is added fully transparently by the GUI, but it is up to the user to include it in manual API calls (you will get a meaningful warning if you omit it).
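
The naming scheme above can be sketched as a pair of small helpers. This is only an illustration and not code from this MR: the names QuantityId, build_quantity_id and parse_quantity_id are hypothetical, the dtype label "str" in the usage below is an assumed example value, and the sketch assumes that neither the path nor the schema name contains #.

```python
from typing import NamedTuple, Optional

SEPARATOR = "#"  # separator between path, schema name and optional dtype


class QuantityId(NamedTuple):
    """Hypothetical container for the parts of a full quantity identifier."""
    path: str                    # e.g. "data.sample.id"
    schema: str                  # e.g. "nomadschemaexample.schema.MySchema"
    dtype: Optional[str] = None  # only present for YAML quantities in manual API calls


def build_quantity_id(path: str, schema: str, dtype: Optional[str] = None) -> str:
    """Join the parts into <path>#<schema_name>[#<dtype>]."""
    parts = [path, schema] + ([dtype] if dtype else [])
    for part in parts:
        if SEPARATOR in part:
            raise ValueError(f"{part!r} must not contain {SEPARATOR!r}")
    return SEPARATOR.join(parts)


def parse_quantity_id(name: str) -> QuantityId:
    """Split <path>#<schema_name>[#<dtype>] back into its parts."""
    parts = name.split(SEPARATOR)
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 parts separated by '#', got {len(parts)}")
    return QuantityId(*parts)


# Python schema: path plus qualified name of the root section.
python_name = build_quantity_id("data.sample.id", "nomadschemaexample.schema.MySchema")

# YAML schema: the dtype suffix ("str" is an assumed label) is appended for manual API calls.
yaml_name = build_quantity_id(
    "data.sample.id", "entry_id:gTqaJYQ7IH20dl5PeX7ZPzkHudI8.MySchema", "str"
)
```

Because . and : already appear inside the path and the schema name, splitting on # is the only step needed to recover the parts, which is the practical benefit of the separator choice discussed above.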

Example setups for testing

Python schema

  1. Get the code: git checkout search
  2. Add the test Python schema included in our source code to your PYTHONPATH: export PYTHONPATH="${PYTHONPATH}:/<root folder>/nomad-FAIR/tests/data/plugins"
  3. Copy and use this nomad.yaml file: nomad.yaml
  4. Boot up docker, appworker, GUI
  5. Log in, create a new upload, and upload this file: dataset.zip
  6. Go to "Explore/My Python Schema" to try out different things.

YAML schema

  1. Get the code: git checkout search
  2. Boot up docker, appworker, GUI
  3. Log in, create a new upload, and upload this file containing the schema: schema.archive.yaml. Note down the upload_id and entry_id for the schema.
  4. Modify this nomad.yaml file: nomad.yaml so that it uses the entry_id you got from the previous step.
  5. Set the upload_id on line 8 of this script: generator.py. Run the script and zip the produced dataset folder.
  6. Restart appworker with the new nomad.yaml file. Log in and upload the zipped dataset.
  7. Go to "Explore/My YAML Schema" to try out different things.

Known limitations compared to "native" quantities

  • Search boxes cannot show suggestions for values (due to the hierarchy being flattened within ES, our suggestion mechanism cannot filter out the values to show).
  • Only scalar quantities are available (might be possible to get around, not sure)
  • Nested queries are not possible (due to the hierarchy being flattened in ES)
  • By default, a single ES document can contain at most 10 000 nested documents, so very large archives may hit this limit when search_quantities is populated.
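
As a rough way to judge how close an archive might come to the nested-document limit, the sketch below counts scalar values in a plain dict/list representation of an archive. It rests on assumptions that are not confirmed by this MR: that each scalar quantity value produces one nested document in search_quantities, that lists of dicts represent repeated subsections, and that lists of plain values are array quantities which are skipped. The function name count_scalar_quantities and the example archive are hypothetical.

```python
from typing import Any

NESTED_LIMIT = 10_000  # default limit mentioned in the list above


def count_scalar_quantities(node: Any) -> int:
    """Count scalar leaf values in a nested dict/list structure."""
    if isinstance(node, dict):
        return sum(count_scalar_quantities(value) for value in node.values())
    if isinstance(node, list):
        # Assumption: dict items are repeated subsections; plain values
        # belong to array quantities, which are not searchable and are skipped.
        return sum(count_scalar_quantities(item) for item in node if isinstance(item, dict))
    return 0 if node is None else 1


# Hypothetical archive fragment for illustration only.
archive = {
    "data": {
        "sample": {"id": "a", "mass": 1.5, "tags": ["x", "y"]},
        "steps": [{"t": 1}, {"t": 2}],
    }
}

n_scalars = count_scalar_quantities(archive)
within_limit = n_scalars <= NESTED_LIMIT
```

The count is only an estimate under the stated assumptions; the actual number of nested documents depends on how search_quantities is populated.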
Edited by Lauri Himanen
