v1 processing module
The processing module runs processors (from nomad.normalizing) on data. In addition, it uses nomad.search to read/save data. Data is organised in the entities Upload and Entry. The module also maintains basic metadata and processing state for these entities. The work is distributed via celery.
We have two (or three) different types of processors (their signatures are sketched below):
- mainfile -> entry parser -> archive
- archive <-> normalizer
- archives + mainfile -> workflow parser -> archive
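For orientation, the three types could be expressed as the following signatures. This is an illustrative sketch, not the actual nomad interfaces; the names and the Archive placeholder are assumptions.

```python
from typing import Any

Archive = dict[str, Any]  # placeholder for the archive data structure


def entry_parser(mainfile: str) -> Archive:
    ...  # mainfile -> archive


def normalizer(archive: Archive) -> None:
    ...  # reads and amends the archive in place


def workflow_parser(mainfile: str, archives: list[Archive]) -> Archive:
    ...  # combines existing archives (+ mainfile) into a new archive
```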
A typical process, e.g. to process an Entry, would be:
- Upload:match -> mainfile -> Upload:call entry processing -> Entry:run entry parser -> Entry:run normalizer -> archive -> Entry:save
This already shows one problem. Because we run celery processes on entities (to use the entity for keeping processing state), a single logical process (e.g. process an Entry) is spread over multiple celery processes (e.g. Upload:match -> Upload:call entry -> Entry:parse+normalize+save -> Upload:do more steps).
Each celery process is tied to an Upload object and its respective database entry. We can use the database entries to store a celery process' state and access it from everywhere at all times. Celery processes are implemented as decorated entity methods. This system has proved to be stable even for long-running processes, error conditions, failing workers, etc. I would like to keep it this way and not use more complex celery (no redis, no canvas, etc.).
Keep in mind that there is a difference between celery processes (implemented as entity methods) and normal methods. An Entry process can still call an Upload method that is then run within this Entry process. Celery processes (via the methods that implement them) can be called by the API (e.g. by the user), the CLI, or other processes.
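A minimal, self-contained sketch of this mechanism (not the actual nomad.processing API; all names are illustrative, and the call runs inline instead of being dispatched as a celery task). The point is that the process state is persisted on the entity's database entry before and after the method runs, so it is visible from the API or CLI at all times:

```python
from enum import Enum
from functools import wraps


class ProcessStatus(str, Enum):
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'
    SUCCESS = 'SUCCESS'
    FAILURE = 'FAILURE'


def process(func):
    '''Marks an entity method as a process. In the real system the call would be
    dispatched as a celery task; either way the state lives on the entity.'''
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        self.process_status = ProcessStatus.RUNNING
        self.save()  # persist state so it can be read from everywhere at all times
        try:
            result = func(self, *args, **kwargs)
            self.process_status = ProcessStatus.SUCCESS
            return result
        except Exception:
            self.process_status = ProcessStatus.FAILURE
            raise
        finally:
            self.save()
    return wrapper


class Entry:
    '''Stand-in for the mongo-backed Entry document.'''

    def __init__(self, entry_id: str):
        self.entry_id = entry_id
        self.process_status = ProcessStatus.PENDING

    def save(self) -> None:
        pass  # would write the document to mongo

    @process
    def process_entry(self) -> None:
        ...  # run the parser, run normalizers, save the archive
```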
Pros and Cons of the old implementation
- + process and process state via celery and mongo
- - tasks
- - metadata handling
- - entry embargo
- - too many process specialisations: upload processing vs reprocessing vs oasis processing
- - workflows: upload -> entries, phonons
Upload:process (former process and reprocess)
- match, identify entries
- find entries that need processing
- trigger entry processing
- update materials index (maybe this can be done by Entry)
- sync with other nomads
- admin operations (check consistency, ...)
- save (archive, mongo, elastic)
Some steps of Upload:process need to wait for all Entries (e.g. trigger more entries, update materials index).
Entry:process conditionally calls Upload:continue.
This part is a little ugly. But I still prefer it to more complex celery. It has worked in the past. Maybe it can be implemented more gracefully: the last action of Entry:process is to call an Upload method that checks if a condition to continue is reached (via state aggregated over all Entries) and triggers a new Upload-based celery process. This way, the logical Upload:process is realized by multiple celery processes: one triggers the Entry processing, the other is triggered after all Entries have completed.
This should be the only sync mechanism we need. But how can we realize more complex scenarios?
Uploads can aggregate Entry state via mongo. Instead of storing that all Entries have been processed, an Upload can simply query whether all Entries have been processed.
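A sketch of how that aggregation could look, assuming a pymongo handle and illustrative collection/field names; the real implementation would use the mongoengine documents of nomad.processing:

```python
from pymongo import MongoClient

db = MongoClient().nomad  # assumed database handle


def check_join(upload_id: str) -> bool:
    '''Intended as the last action of every Entry process: once no entry of the
    upload is still pending or running, the Upload-level continuation process
    would be triggered. The Upload stores nothing; it derives the condition.'''
    still_processing = db.entry.count_documents({
        'upload_id': upload_id,
        'process_status': {'$in': ['PENDING', 'RUNNING']},
    })
    return still_processing == 0
```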
Some Entries depend on other Entries. Some Entries must be processed after the processing of others has completed.
Currently, we have only implemented one cycle of Upload -> Entries -> Upload. I suggest to do many cycles: Upload -> Entries -> Upload -> more Entries -> Upload.
Parsers determine a slot number. If they depend on slot 1 entries, they choose slot 2. E.g. VASP is slot 1, Phonopy is slot 2. We do as many cycles as there are slots among the entries of an upload. The processing could be extended dynamically and does not need to be predetermined, e.g. a parser could slot itself again. This would be super simple, and no complex celery workflows need to be created.
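A sketch of the proposed slot cycles (names are illustrative; in practice each cycle would be an Upload-level celery process that waits for the join condition sketched above before the next slot starts):

```python
def process_in_slots(entries) -> None:
    '''Process entries slot by slot, so that slot-2 entries (e.g. Phonopy)
    can read the finished archives of slot-1 entries (e.g. VASP).'''
    slots = sorted({entry.slot for entry in entries})
    for slot in slots:
        for entry in (e for e in entries if e.slot == slot):
            entry.process_entry()  # would trigger the Entry celery process
        # here: wait for the join condition before starting the next slot
```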
Why not celery canvas/workflows
- requires redis
- requires storing celery state and nomad data in two different places
- provides lots of stuff we don't need
- workflows might be very large (Gaussian uploads with 100,000+ entries)
The concrete process depends on the data type (e.g. the parser). Parsers should nominate (or even call) normalisers (currently normalisers are selected via domains).
In the past, it was good enough to run the whole entry process in one celery process. This would be different if we needed to run only (certain) normalisers.
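A sketch of what parser-nominated normalisers could look like. The class and attribute names are assumptions, shown only to contrast with the current domain-based selection:

```python
class SystemNormalizer:
    def __init__(self, archive):
        self.archive = archive

    def normalize(self) -> None:
        ...  # amend the archive


class VASPParser:
    # the parser declares which normalisers should run on its archives
    normalizers = [SystemNormalizer]

    def parse(self, mainfile: str, archive) -> None:
        ...


def run_normalizers(parser, archive) -> None:
    for normalizer_cls in parser.normalizers:
        normalizer_cls(archive).normalize()
```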
- process and re-process only parts of an upload
- run the full slot cycle, but only consider out-of-date Entries
- processing based on Entry state
- out-of-date flag set by the Upload API
- user requests reprocess
- watchdog sets the out-of-date flag
- matching discovers new or out-of-date mainfiles
- how to mark slot 2 entries as out of date? Perhaps if there was at least one reprocessed slot 1 entry? (see the sketch below)
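A sketch (pymongo, assumed field names) of how an out-of-date driven cycle could select its entries, and how higher-slot entries could be flagged once a lower-slot entry was reprocessed:

```python
def mark_dependent_entries(db, upload_id: str, reprocessed_slot: int) -> None:
    '''Flag all higher-slot entries of the upload after an entry in
    `reprocessed_slot` was reprocessed.'''
    db.entry.update_many(
        {'upload_id': upload_id, 'slot': {'$gt': reprocessed_slot}},
        {'$set': {'out_of_date': True}})


def entries_to_process(db, upload_id: str, slot: int):
    '''Only consider out-of-date entries of the given slot during a cycle.'''
    return db.entry.find(
        {'upload_id': upload_id, 'slot': slot, 'out_of_date': True})
```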
- provide default metadata at all times (during errors, etc.; see the sketch below)
- read user metadata from upload files (don't overwrite on reprocess)
- maintain processing metadata
- no more upload-based user metadata (e.g. no publish with metadata)
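A sketch of default metadata that could be provided at all times, even when parsing fails. The field names are assumptions, not the actual metadata schema:

```python
def default_metadata(entry) -> dict:
    '''Metadata that is always available, independent of parsing success.'''
    return {
        'entry_id': entry.entry_id,
        'upload_id': entry.upload_id,
        'mainfile': entry.mainfile,
        'processed': entry.process_status == 'SUCCESS',
    }
```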