# v1 processing module
The processing module runs processors (from `nomad.parsing`, `nomad.normalizing`) on data. In addition, it uses `nomad.files` and `nomad.search` to read and save data. Data is organised in the entities `Entry` and `Upload`. The module also maintains basic metadata and processing state for these entities. The work is distributed via celery.
## Processors
We have two (or three) different types of processors:
- mainfile -> entry parser -> archive
- archive <-> normalizer
- archives + mainfile -> workflow parser -> archive
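The three processor types above can be sketched as plain function signatures. This is an illustrative sketch only; the names and the dict-based archive are assumptions, not the real nomad APIs.

```python
# Hypothetical sketch of the three processor types; names and the
# dict-based archive are illustrative, not the actual nomad interfaces.

def entry_parser(mainfile):
    # mainfile -> entry parser -> archive
    return {"mainfile": mainfile, "results": {}}

def normalizer(archive):
    # archive <-> normalizer: reads the archive and enriches it in place
    archive["normalized"] = True
    return archive

def workflow_parser(mainfile, archives):
    # archives + mainfile -> workflow parser -> archive
    return {"mainfile": mainfile, "inputs": archives}
```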
A typical process, e.g. to process an `Entry`, would be:

- Upload:match -> mainfile -> Upload:call entry processing -> Entry:run entry parser -> Entry:run normalizer -> archive -> Entry:save
This already shows one problem: because we run celery processes on entities (to use the entity for keeping processing state), a single logical process (e.g. process an `Entry`) is spread over multiple celery processes (e.g. Upload:match -> Upload:call entry -> Entry:parse+normalize+save -> Upload:do more steps).
## Celery processing
Each celery process is tied to an `Entry` or `Upload` object and its respective database entry. We can use the database entries to store a celery process' state and access it from everywhere at all times. Celery processes are implemented as decorated entity methods. This system has proved to be stable, even for long running processes, error conditions, failing workers, etc. I would like to keep it this way and not use more complex celery (no redis, no canvas, etc.).
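The decorated-entity-method pattern can be sketched as follows. This is a minimal, self-contained illustration, not the actual nomad code: the entity attribute stands in for the mongo document, and the decorator runs synchronously instead of dispatching to a celery worker.

```python
import functools

# Illustrative sketch: celery processes as decorated entity methods. The
# decorator records process state on the entity; in the real system this
# is a mongo document readable from everywhere at all times.

class Proc:
    def __init__(self):
        self.process_status = None
        self.errors = []

def process(method):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # in the real system the body would run asynchronously in a worker
        self.process_status = "running"
        try:
            result = method(self, *args, **kwargs)
            self.process_status = "success"
            return result
        except Exception as error:
            self.process_status = "failure"
            self.errors.append(str(error))
    return wrapper

class Entry(Proc):
    @process
    def process_entry(self):
        # ... parse, normalize, save ...
        return "archive"
```

Because the state lives on the entity's database document rather than in celery, it survives failing workers and can be inspected by the API or CLI at any time.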
Keep in mind that there is a difference between celery processes (implemented as entity methods) and normal methods. An `Entry` process can still call an `Upload` method that is then run within this `Entry` process. Celery processes (via the methods that implement them) can be called by the API (e.g. by the user), the CLI, or other processes.
## Pros and Cons of the old implementation
- + process and process state via celery and mongo
- - tasks
- - metadata handling
- - entry embargo
- - too many process specialisations: upload processing vs reprocessing vs oasis processing
- - workflows: upload -> entries, phonons
## Features
### Upload

- incremental `process` (former process and reprocess)
  - match, identify entries
  - find entries that need processing
  - trigger entry processing
  - update materials index (maybe this can be done by Entry)
- `publish`
- `delete`
- `sync` with other nomads
- admin operations (check consistency, `index_all`, `re_pack`, etc.)
### Entry

- `process` entry
  - parse
  - normalize
  - save (archive, mongo, elastic)
## Synchronisation
- Upload:`process` calls Entry:`process` and needs to wait for all Entries (e.g. to trigger more entries, update the materials index)
- Entry:`process` conditionally calls Upload: continue `process`
This part is a little ugly, but I still prefer it to more complex celery. It has worked in the past. Maybe it can be implemented more gracefully: the last action of Entry:`process` is to call an Upload method that checks whether a condition to continue has been reached (via state aggregated over all Entries) and, if so, triggers a new Upload-based celery process. This way the logical Upload:`process` is realized by multiple celery processes: one triggers the Entry processing, the other is triggered after all Entries have completed.
This should be the only sync mechanism we need. But how can we realize more complex scenarios?
## Aggregated state
Uploads can aggregate Entry state via mongo. Instead of storing a flag that all Entries have been processed, an Upload can simply query whether all its entries have been processed.
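The idea can be sketched with a plain list standing in for the mongo collection; the field names are assumptions for illustration.

```python
# Sketch of aggregated state (fields illustrative): the Upload derives
# "all entries processed" from a query over entry documents instead of
# storing it. A plain list stands in for the mongo collection; in the
# real system this would be a count query over unfinished entries.

entry_docs = [
    {"upload_id": "u1", "process_status": "success"},
    {"upload_id": "u1", "process_status": "running"},
]

def all_entries_processed(upload_id):
    return not any(
        doc["upload_id"] == upload_id and doc["process_status"] != "success"
        for doc in entry_docs
    )
```

Deriving the condition keeps the Upload document free of redundant state that could get out of sync with the entries.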
## Workflows
Some Entries depend on other Entries. Some Entries must be processed after the processing of others has completed.
Currently, we only implement one cycle: Upload -> Entries -> Upload. I suggest to do many cycles: Upload -> Entries -> Upload -> more Entries -> Upload.
## Slotting
Parsers determine a slot number. If they depend on slot 1 entries, they choose slot 2; e.g. VASP is slot 1, Phonopy is slot 2. We do as many cycles as there are slots among the entries of an upload. The processing could be extended dynamically and does not need to be predetermined, e.g. a parser could slot itself again. This would be super simple, and no complex celery workflows need to be created.
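The slot cycle can be sketched as a simple loop; the data layout is an assumption for illustration, not the actual nomad data model.

```python
# Sketch of slot-based cycles (data layout illustrative): the Upload runs
# one Entry-processing cycle per slot, in ascending order, so slot-2
# entries (e.g. Phonopy) can read the archives of slot-1 entries (e.g.
# VASP) that were produced in the previous cycle.

def process_upload(entries):
    order = []
    for slot in sorted({entry["slot"] for entry in entries}):
        # one cycle: Upload -> Entries (of this slot) -> Upload
        for entry in entries:
            if entry["slot"] == slot:
                order.append(entry["mainfile"])
    return order
```

Because the set of slots is computed from the entries themselves, the number of cycles adapts to whatever an upload contains, without any predefined workflow graph.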
## Why not celery canvas/workflows
- requires redis
- requires storing celery state and nomad data in two different places
- provides lots of stuff we don't need
- workflows might be very large (gaussian uploads with 100,000+ entries)
### Entry:`process`

The concrete process depends on the data type (e.g. parser). Parsers should nominate (or even call) normalisers (currently, normalisers are selected via domains). In the past, it was good enough to run the whole entry process in one celery process. This would be different if we need to run only (certain) normalisers.
### Incremental Upload:`process`

- process and re-process only parts of an upload
- run full slot cycle, but only consider out of date Entries
- processing based on Entry state
- out of date flag set by Upload API
- user requests reprocess
- watchdog sets out of date flag
- matching discovers new or out of date mainfiles
- how to mark slot 2 entries as out of date? If there was at least one reprocessed slot 1 entry?
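A possible answer to the last question can be sketched as follows; the fields and the propagation rule (flag slot n+1 entries whenever a slot n entry was reprocessed) are assumptions for illustration.

```python
# Sketch of incremental processing (fields illustrative): only entries
# flagged out of date are processed, and reprocessing any slot-1 entry
# flags the dependent slot-2 entries so the next cycle picks them up.

def entries_to_process(entries):
    return [entry for entry in entries if entry["out_of_date"]]

def flag_dependents(entries, reprocessed_slots):
    for entry in entries:
        if entry["slot"] - 1 in reprocessed_slots:
            entry["out_of_date"] = True
```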
## Entry metadata
- provide default metadata at all times (during errors, etc.)
- read user metadata from upload files (don't overwrite on reprocess)
- maintain processing metadata
- no more upload based user metadata (e.g. no publish with metadata)
Replaces #319 (closed).