v1 processing module
The processing module runs processors (from nomad.normalizing) on data. In addition, it uses nomad.search to read/save data. Data is organised in the entities Upload and Entry. The module also maintains basic metadata and processing state for these entities. The work is distributed via celery.
We have two (or three) different types of processors (their signatures are sketched below):
- mainfile -> entry parser -> archive
- archive <-> normalizer
- archives + mainfile -> workflow parser -> archive
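For orientation, the three types could be expressed as the following signatures. This is an illustrative sketch, not the actual nomad interfaces; the names and the Archive placeholder are assumptions.

```python
from typing import Any

Archive = dict[str, Any]  # placeholder for the archive data structure


def entry_parser(mainfile: str) -> Archive:
    ...  # mainfile -> archive


def normalizer(archive: Archive) -> None:
    ...  # reads and amends the archive in place


def workflow_parser(mainfile: str, archives: list[Archive]) -> Archive:
    ...  # combines existing archives (+ mainfile) into a new archive
```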
A typical process, e.g. to process an Entry, would be:
- Upload:match -> mainfile -> Upload:call entry processing -> Entry:run entry parser -> Entry:run normalizer -> archive -> Entry:save
This already shows one problem. Because we run celery processes on entities (to use the entity for keeping processing state), a single logical process (e.g. process an Entry) is spread over multiple celery processes (e.g. Upload:match -> Upload:call entry -> Entry:parse+normalize+save -> Upload:do more steps).
Each celery process is tied to an Upload object and its respective database entry. We can use the database entries to store a celery process' state and access it from everywhere at all times. Celery processes are implemented as decorated entity methods. This system has proved to be stable even for long-running processes, error conditions, failing workers, etc. I would like to keep it this way and not use more complex celery (no redis, no canvas, etc.).
Keep in mind that there is a difference between celery processes (implemented as entity methods) and normal methods. An Entry process can still call an Upload method that is then run within this Entry process. Celery processes (via the methods that implement them) can be called by the API (e.g. by the user), the CLI, or other processes.
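A minimal, self-contained sketch of this mechanism (not the actual nomad.processing API; all names are illustrative, and the call runs inline instead of being dispatched as a celery task). The point is that the process state is persisted on the entity's database entry before and after the method runs, so it is visible from the API or CLI at all times:

```python
from enum import Enum
from functools import wraps


class ProcessStatus(str, Enum):
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'
    SUCCESS = 'SUCCESS'
    FAILURE = 'FAILURE'


def process(func):
    '''Marks an entity method as a process. In the real system the call would be
    dispatched as a celery task; either way the state lives on the entity.'''
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        self.process_status = ProcessStatus.RUNNING
        self.save()  # persist state so it can be read from everywhere at all times
        try:
            result = func(self, *args, **kwargs)
            self.process_status = ProcessStatus.SUCCESS
            return result
        except Exception:
            self.process_status = ProcessStatus.FAILURE
            raise
        finally:
            self.save()
    return wrapper


class Entry:
    '''Stand-in for the mongo-backed Entry document.'''

    def __init__(self, entry_id: str):
        self.entry_id = entry_id
        self.process_status = ProcessStatus.PENDING

    def save(self) -> None:
        pass  # would write the document to mongo

    @process
    def process_entry(self) -> None:
        ...  # run the parser, run normalizers, save the archive
```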
Pros and Cons of the old implementation
- + process and process state via celery and mongo
- - tasks
- - metadata handling
- - entry embargo
- - too many process specialisations: upload processing vs reprocessing vs oasis processing
- - workflows: upload -> entries, phonons
Upload:process (former process and reprocess)
- match, identify entries
- find entries that need processing
- trigger entry processing
- update materials index (maybe this can be done by Entry)
- sync with other nomads
- admin operations (check consistency, ...)
- save (archive, mongo, elastic)
Some steps of Upload:process need to wait for all Entries (e.g. trigger more entries, update materials index).
Entry:process conditionally calls Upload:continue.
This part is a little ugly. But I still prefer it to more complex celery. It has worked in the past. Maybe it can be implemented more gracefully: the last action of Entry:process is to call an Upload method that checks if a condition to continue is reached (via state aggregated over all Entries) and triggers a new Upload-based celery process. This way, the logical Upload:process is realized by multiple celery processes: one triggers the Entry processing, the other is triggered after all Entries have completed.
This should be the only sync mechanism we need. But how can we realize more complex scenarios?
Uploads can aggregate Entry state via mongo. Instead of storing that all Entries have been processed, an Upload can simply query whether all Entries have been processed.
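A sketch of how that aggregation could look, assuming a pymongo handle and illustrative collection/field names; the real implementation would use the mongoengine documents of nomad.processing:

```python
from pymongo import MongoClient

db = MongoClient().nomad  # assumed database handle


def check_join(upload_id: str) -> bool:
    '''Intended as the last action of every Entry process: once no entry of the
    upload is still pending or running, the Upload-level continuation process
    would be triggered. The Upload stores nothing; it derives the condition.'''
    still_processing = db.entry.count_documents({
        'upload_id': upload_id,
        'process_status': {'$in': ['PENDING', 'RUNNING']},
    })
    return still_processing == 0
```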
Some Entries depend on other Entries. Some Entries must be processed after the processing of others has completed.
Currently, we have only implemented one cycle of Upload -> Entries -> Upload. I suggest to do many cycles: Upload -> Entries -> Upload -> more Entries -> Upload.
Parsers determine a slot number. If they depend on slot 1 entries, they choose slot 2. E.g. VASP is slot 1, Phonopy is slot 2. We do as many cycles as there are slots among the entries of an upload. The processing could be extended dynamically and does not need to be predetermined, e.g. a parser could slot itself again. This would be super simple, and no complex celery workflows need to be created.
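A sketch of the proposed slot cycles (names are illustrative; in practice each cycle would be an Upload-level celery process that waits for the join condition sketched above before the next slot starts):

```python
def process_in_slots(entries) -> None:
    '''Process entries slot by slot, so that slot-2 entries (e.g. Phonopy)
    can read the finished archives of slot-1 entries (e.g. VASP).'''
    slots = sorted({entry.slot for entry in entries})
    for slot in slots:
        for entry in (e for e in entries if e.slot == slot):
            entry.process_entry()  # would trigger the Entry celery process
        # here: wait for the join condition before starting the next slot
```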
Why not celery canvas/workflows
- requires redis
- requires storing celery state and nomad data in two different places
- provides lots of stuff we don't need
- workflows might be very large (Gaussian uploads with 100,000+ entries)
The concrete process depends on the data type (e.g. the parser). Parsers should nominate (or even call) normalisers (currently normalisers are selected via domains).
In the past, it was good enough to run the whole entry process in one celery process. This would be different if we needed to run only (certain) normalisers.
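A sketch of what parser-nominated normalisers could look like. The class and attribute names are assumptions, shown only to contrast with the current domain-based selection:

```python
class SystemNormalizer:
    def __init__(self, archive):
        self.archive = archive

    def normalize(self) -> None:
        ...  # amend the archive


class VASPParser:
    # the parser declares which normalisers should run on its archives
    normalizers = [SystemNormalizer]

    def parse(self, mainfile: str, archive) -> None:
        ...


def run_normalizers(parser, archive) -> None:
    for normalizer_cls in parser.normalizers:
        normalizer_cls(archive).normalize()
```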
- process and re-process only parts of an upload
- run the full slot cycle, but only consider out-of-date Entries
- processing based on Entry state
- out-of-date flag set by the Upload API
- user requests reprocess
- watchdog sets the out-of-date flag
- matching discovers new or out-of-date mainfiles
- how to mark slot 2 entries as out of date? Perhaps if there was at least one reprocessed slot 1 entry? (see the sketch below)
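A sketch (pymongo, assumed field names) of how an out-of-date driven cycle could select its entries, and how higher-slot entries could be flagged once a lower-slot entry was reprocessed:

```python
def mark_dependent_entries(db, upload_id: str, reprocessed_slot: int) -> None:
    '''Flag all higher-slot entries of the upload after an entry in
    `reprocessed_slot` was reprocessed.'''
    db.entry.update_many(
        {'upload_id': upload_id, 'slot': {'$gt': reprocessed_slot}},
        {'$set': {'out_of_date': True}})


def entries_to_process(db, upload_id: str, slot: int):
    '''Only consider out-of-date entries of the given slot during a cycle.'''
    return db.entry.find(
        {'upload_id': upload_id, 'slot': slot, 'out_of_date': True})
```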
- provide default metadata at all times (during errors, etc.; see the sketch below)
- read user metadata from upload files (don't overwrite on reprocess)
- maintain processing metadata
- no more upload-based user metadata (e.g. no publish with metadata)
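A sketch of default metadata that could be provided at all times, even when parsing fails. The field names are assumptions, not the actual metadata schema:

```python
def default_metadata(entry) -> dict:
    '''Metadata that is always available, independent of parsing success.'''
    return {
        'entry_id': entry.entry_id,
        'upload_id': entry.upload_id,
        'mainfile': entry.mainfile,
        'processed': entry.process_status == 'SUCCESS',
    }
```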