Migration based on extracted files
Unfortunately, the current approach of mixing existing uploads with files extracted from different locations is problematic, because the files are too diverse in nature.
To be reliable, we need an approach based strictly on extracted files. Furthermore, it needs to bundle uploads, since the origin upload directories can contain multiple TB of data (e.g. aflowlib).
We should follow a strategy similar to the one used for the origin archive bundles.
Implement:

- a script to collect a manifest of files and their sizes
- partition this manifest
- keep this manifest in mongo for repeatability and incrementality; it should work on different base directories (e.g. /nomad and /data)
- update migration.py to work with this manifest
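The first two steps could be sketched as follows. This is a minimal, hypothetical illustration (the function names, the greedy size-based partitioning, and the size limit are assumptions, not existing code); the mongo persistence and the migration.py integration are omitted:

```python
import os

def collect_manifest(base: str) -> list[tuple[str, int]]:
    """Walk the base directory and record (relative path, size in bytes)
    for every regular file found."""
    manifest = []
    for root, _dirs, files in os.walk(base):
        for name in files:
            path = os.path.join(root, name)
            manifest.append((os.path.relpath(path, base), os.path.getsize(path)))
    return manifest

def partition_manifest(
    manifest: list[tuple[str, int]], max_bytes: int
) -> list[list[tuple[str, int]]]:
    """Greedily split the manifest into partitions whose total size stays
    below max_bytes; a single oversized file still gets its own partition."""
    partitions: list[list[tuple[str, int]]] = []
    current: list[tuple[str, int]] = []
    current_size = 0
    for path, size in sorted(manifest):
        if current and current_size + size > max_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append((path, size))
        current_size += size
    if current:
        partitions.append(current)
    return partitions
```

Each partition (its file list and total size) could then be stored as one mongo document, so an interrupted run can skip already-processed partitions and new files under the same base can be picked up incrementally.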