Data synchronisation between NOMAD installations.
We have some basic mirror functionality based on Flask and the CLI. It was designed for migrating between different NOMAD versions, not so much for the Oasis.
We will reimplement this with FastAPI. This time, the Oasis is the main use-case.
The API needs to support two major directions:
- Publish from oasis (basically like POST uploads?is_oasis_upload). This has two sides, initiating the action in the Oasis via API or CLI and receiving the data in the central NOMAD also via API.
- Pull from central NOMAD (basically what is now in mirror). This also has two sides, initiating the pull via CLI on the Oasis and providing the data in the central NOMAD via API.
Both directions work on whole uploads. There is no finer granularity. Both directions require functionality to build a transferable upload.zip. This basically means we need to define a special .zip format. Here we have to implement several options:
- raw files + nomad.json (requires full processing) [this is what we should start with]
- raw files + archive files (requires only indexing)
- just archive files (requires only indexing, and a NOMAD that deals with missing raw-files) [for experiments and FAIRmat]
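The first option (raw files + nomad.json) could be bundled roughly as sketched here. This is only an illustration under assumptions, not the actual format: the function names `build_bundle` and `read_bundle`, the `bundle_info.json` manifest with its fields, the `raw/` prefix, and the kind labels are all hypothetical.

```python
import io
import json
import zipfile

# Hypothetical labels for the three bundle options listed above,
# mapped to what the receiving NOMAD would have to do with them.
BUNDLE_KINDS = {
    'raw+metadata': 'full processing',
    'raw+archive': 'indexing only',
    'archive': 'indexing only',
}


def build_bundle(upload_id, raw_files, metadata, kind='raw+metadata'):
    '''Pack raw files and nomad.json into an in-memory .zip (assumed layout).'''
    if kind not in BUNDLE_KINDS:
        raise ValueError(f'unknown bundle kind: {kind}')
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as bundle:
        # Assumed manifest that tells the receiver what the bundle contains.
        info = {'upload_id': upload_id, 'kind': kind}
        bundle.writestr('bundle_info.json', json.dumps(info))
        bundle.writestr('nomad.json', json.dumps(metadata))
        for path, content in raw_files.items():
            bundle.writestr(f'raw/{path}', content)
    return buffer.getvalue()


def read_bundle(data):
    '''Unpack a bundle created by build_bundle.'''
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        info = json.loads(bundle.read('bundle_info.json'))
        metadata = json.loads(bundle.read('nomad.json'))
        raw_files = {
            name[len('raw/'):]: bundle.read(name)
            for name in bundle.namelist()
            if name.startswith('raw/')
        }
    return info, metadata, raw_files
```

A manifest like this would let the receiving side decide between full processing and indexing only, without inspecting the rest of the .zip first.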
There are a ton of additional aspects that we should have in mind, but not necessarily implement right at the beginning:
- CLI that allows incrementally pulling "all" uploads (for mirrors and forking)
- support for "FedExed hard drives" (basically an offline mode for mirrors)
- version awareness
- how to provide safe file transfer?
- parallel file transfer?
- support for "just metadata" (without raw, full archive, partial archive)
- synchronising users if the user management is not the same
To implement this, two endpoints are required, one for receiving an upload (push), one for providing an upload (pull). It could be as simple as:
POST sync/upload
GET sync/upload/{id}
The functionality that implements upload provision and adding a received upload should probably be implemented in processing.data.Upload. This gives us flexibility and allows calling it from either the API or the CLI.
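To illustrate why keeping the logic in one place helps, here is a rough sketch in which both an API handler and a CLI command are thin wrappers around the same two methods. Everything in it is an assumption: the `Upload` class is a stand-in and not the real processing.data.Upload, and `export_bundle`, `import_bundle`, `api_post_sync_upload`, `cli_push`, and the in-memory `UPLOADS` store are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Upload:
    '''Stand-in for processing.data.Upload; the real class looks different.'''
    upload_id: str
    files: Dict[str, bytes] = field(default_factory=dict)

    def export_bundle(self) -> dict:
        # Hypothetical: provide this upload in a transferable form (pull side).
        return {'upload_id': self.upload_id, 'files': dict(self.files)}

    @classmethod
    def import_bundle(cls, bundle: dict) -> 'Upload':
        # Hypothetical: create an upload from a received bundle (push side).
        return cls(upload_id=bundle['upload_id'], files=dict(bundle['files']))


# In-memory store standing in for the receiving NOMAD's database.
UPLOADS: Dict[str, Upload] = {}


def api_post_sync_upload(bundle: dict) -> dict:
    '''Roughly what a POST sync/upload handler would delegate to.'''
    upload = Upload.import_bundle(bundle)
    UPLOADS[upload.upload_id] = upload
    return {'upload_id': upload.upload_id}


def api_get_sync_upload(upload_id: str) -> dict:
    '''Roughly what a GET sync/upload/{id} handler would delegate to.'''
    return UPLOADS[upload_id].export_bundle()


def cli_push(upload: Upload, send: Callable[[dict], dict]) -> dict:
    '''A CLI push: export locally, hand the bundle to any transport.'''
    return send(upload.export_bundle())
```

In this sketch, `cli_push(Upload('u1', {'a.txt': b'x'}), send=api_post_sync_upload)` exercises the same code path as an HTTP request would, only without the network in between.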
We should also rewrite the mirror CLI for the following use-cases:
- pull individual uploads
- pull all public uploads
- pull all uploads (e.g. for forking NOMAD's data)
Pull synchronisation should also work for staging uploads and embargoed data, when we use this to fork our data locally as a special use-case.
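The three CLI use-cases mostly differ in which uploads get selected. A minimal sketch of that selection, including skipping uploads that were already pulled, might look like this. The `RemoteUpload` record, its `published` and `with_embargo` fields, the mode names, and `select_uploads` itself are assumptions for illustration, not existing NOMAD code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class RemoteUpload:
    '''Hypothetical summary of an upload as listed by the central NOMAD.'''
    upload_id: str
    published: bool
    with_embargo: bool


def select_uploads(
        remote: Iterable[RemoteUpload],
        already_pulled: Set[str],
        mode: str = 'public',
        upload_ids: Optional[List[str]] = None) -> List[str]:
    '''Return the ids of uploads to pull for the given (assumed) mode.'''
    remote = list(remote)
    if mode == 'individual':
        wanted = set(upload_ids or [])
        candidates = [u for u in remote if u.upload_id in wanted]
    elif mode == 'public':
        # Published and not under embargo.
        candidates = [u for u in remote if u.published and not u.with_embargo]
    elif mode == 'all':
        # Includes staging uploads and embargoed data (forking use-case).
        candidates = remote
    else:
        raise ValueError(f'unknown mode: {mode}')
    # Incremental pull: skip what is already present locally.
    return [u.upload_id for u in candidates if u.upload_id not in already_pulled]
```

Keeping the set of already pulled ids on the Oasis side would make repeated runs incremental, at the cost of not noticing uploads that changed on the central NOMAD after they were pulled.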