Entry processing dependencies
The current entry processing system is designed for entries that are completely independent: all parsing, normalization, etc. is performed without reference to other entries within an upload.
There is, however, a need to support entries that have dependencies within an upload (dependencies across uploads are not supported for now). For example, some programs orchestrate workflows that require information from multiple parsers (currently only `phonopy` and `elastic`). With the advent of workflow software (fireworks, AiiDA, atomate, etc.) this is becoming more frequent. This functionality can also be very useful for experimental data: an experiment is typically conducted as a series of steps, each step possibly having a separate parser.
The branch `nomad_pipelines` contains a prototype that replaces the single-shot entry processing with pipelines. A pipeline consists of a series of named stages that may have dependencies. Each mainfile is associated with a pipeline, which is synchronized to run with the pipelines of other mainfiles found within the same upload. The synchronization is based on the dependencies declared by the different stages.
For example, an upload may have the following mainfiles and pipelines:
```
mainfile1 -> Pipeline(Stage("a", func_a, dependencies=["b"]))
mainfile2 -> Pipeline(Stage("b", func_b))
mainfile3 -> Pipeline(Stage("c", func_c))
```
Given these pipelines, the stages that come first in a dependency chain are executed in parallel (`func_b` for `mainfile2` and `func_c` for `mainfile3`), after which the dependent stages are executed in order (`func_a` for `mainfile1`). This mechanism also supports longer chains and multiple dependencies.
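To illustrate the intended ordering, the following is a minimal sketch of how the stages from all pipelines of an upload could be grouped into dependency levels, where every level can run in parallel. The `Stage`, `Pipeline`, and `resolve_levels` names are illustrative only and do not correspond to the actual prototype code:

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Stage:
    name: str
    function: Callable
    dependencies: List[str] = field(default_factory=list)


@dataclass
class Pipeline:
    mainfile: str
    stages: List[Stage]


def resolve_levels(pipelines: List[Pipeline]) -> List[List[Stage]]:
    """Group all stages of an upload into levels: every stage in a level
    only depends on stages in earlier levels, so each level can run in parallel."""
    stages = {stage.name: stage for pipeline in pipelines for stage in pipeline.stages}
    levels: List[List[Stage]] = []
    placed: set = set()
    while len(placed) < len(stages):
        level = [
            stage for name, stage in stages.items()
            if name not in placed and all(dep in placed for dep in stage.dependencies)
        ]
        if not level:
            raise ValueError('Invalid dependency chain (cycle or missing stage)')
        levels.append(level)
        placed.update(stage.name for stage in level)
    return levels
```

Applied to the example above, this would produce two levels: stages `b` and `c` first, then stage `a`.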
Currently this functionality is implemented with Celery, in particular with Celery groups, chains and chords. To support chaining groups of tasks, Celery requires a result backend to be present. The RabbitMQ-based backend (`rpc`) does not, however, support chords. For this reason, the popular `redis` backend is used instead. Redis only manages the task execution order and does not really store any of the results (all Celery tasks are immutable and their results are ignored).
The following should be done if we proceed with the prototype:
- Create tests for the pipelines (execution order, exceptions, invalid dependency chains, etc.); see the sketch after this list
- Create a working version of the phonon calculation processing (the phonon parser should be able to read the method information from the archive produced by the FHI-aims parser)
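For the first item, and assuming the hypothetical `Stage`/`Pipeline`/`resolve_levels` sketch from above, tests for the execution order and for invalid dependency chains might start out roughly like this:

```python
import pytest

# Assumes Stage, Pipeline and resolve_levels from the sketch above.


def test_execution_order():
    pipelines = [
        Pipeline('mainfile1', [Stage('a', lambda: None, dependencies=['b'])]),
        Pipeline('mainfile2', [Stage('b', lambda: None)]),
        Pipeline('mainfile3', [Stage('c', lambda: None)]),
    ]
    levels = resolve_levels(pipelines)
    # Independent stages form the first level, dependent stages come later.
    assert [sorted(stage.name for stage in level) for level in levels] == [['b', 'c'], ['a']]


def test_invalid_dependency_chain():
    # A cyclic dependency should be reported instead of hanging.
    pipelines = [
        Pipeline('mainfile1', [Stage('a', lambda: None, dependencies=['b'])]),
        Pipeline('mainfile2', [Stage('b', lambda: None, dependencies=['a'])]),
    ]
    with pytest.raises(ValueError):
        resolve_levels(pipelines)
```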