CoE -> FAIRDI repository db and raw data migration functions
To create a real production nomad@FAIRDI, we must be able to migrate the data from NOMAD CoE.
This must be reproducible and testable. It should be incremental.
Its implementation depends on #80 (closed), #81 (closed).
User metadata
The migration must identify this data and allow it to be posted with upload commits (a query sketch follows the list):
- ownerships, coauthorships, shareships
- calcsets (i.e. datasets); `metadata.chemical_formula` holds the dataset name
- `user_metadata` (permission ~ [open(0), restricted(1)]; label ~ comments)
- `metadata_citations`, `citations` (value, kind ~ [references(external), DOIs(internal)])
- calculation checksum (depending on what this actually is and means)
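As a starting point, pulling this metadata out of the source db could look roughly like the sketch below. The join via `metadata_citations` and all column names are assumptions derived from the entities listed above and must be verified against the actual CoE schema; the connection string is a placeholder.

```python
from sqlalchemy import create_engine, text

# Connection string is a placeholder, not the actual CoE db location.
engine = create_engine('postgresql://localhost/nomad_coe')

# Hypothetical query: table/column names follow the entities named above,
# but the exact schema is an assumption and must be checked.
query = text('''
    SELECT um.calc_id,
           um.permission,        -- 0 = open, 1 = restricted
           um.label AS comment,  -- label ~ comments
           c.value AS citation,
           c.kind                -- references (external) vs. DOIs (internal)
    FROM user_metadata um
    LEFT JOIN metadata_citations mc ON mc.calc_id = um.calc_id
    LEFT JOIN citations c ON c.citation_id = mc.citation_id
''')

with engine.connect() as connection:
    for row in connection.execute(query):
        print(row)
```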
Some of this user metadata should also be uploadable/postable for non-migration uploads, e.g. bulk uploads of external project data:
- coauthorships, shareships
- comments
- references
- datasets (needs a different implementation; the migration one just comprises the dataset id (calc_id) and assumes it is copied during migration)
User metadata commit body example
Keys starting with an `_underscore` are only allowed for the migration/admin user. The other keys can also be used by regular users/clients, e.g. to bulk upload external project data (e.g. Aflowlib). A client-side sketch of posting such a body follows the example.
```
{
    "with_embargo": false,
    "comments": ["This is a comment"],
    "references": ["http://external.ref"],
    "coauthors": [{"user_id": "id"}, ...],
    "_upload_time": "a proper date",  // overrides the migration upload time
    "_uploader": 19282,  // user id, overrides the migration uploader
    "calculations": [  // overwrites top-level data
        {
            "mainfile": "/path/to/mainfile.out",
            "_PID": 1726312,
            "_datasets": [
                {
                    "_id": 200603,
                    "dois": [],
                    "name": "MRopo_etal_amino_acids_conformer_dataset"
                },
                {
                    "_id": 318834,
                    "dois": [
                        "http://dx.doi.org/10.17172/NOMAD/20150526220502"
                    ],
                    "name": "MRopo_etal_amino_acids_conformer_dataset_SciData2015"
                }
            ],
            "_checksum": 1928272332,
            "with_embargo": true,
            "comments": ["This is a comment"],
            "references": ["http://external.ref"],
            "coauthors": [{"user_id": "id"}, ...]
        }
    ]
}
```
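A minimal client-side sketch of posting such a body; the host, route, and auth header here are assumptions, not the actual API:

```python
import requests

commit_body = {
    'with_embargo': False,
    'comments': ['This is a comment'],
    'references': ['http://external.ref'],
    'coauthors': [{'user_id': 'id'}],
}

response = requests.post(
    'http://localhost/api/uploads/<upload_id>/commit',  # hypothetical route
    headers={'X-Token': '<auth token>'},                # hypothetical auth header
    json=commit_body,
)
response.raise_for_status()
```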
Questions
- Are citations per dataset or per calculation? Non-trivial, since both are entries in the calculations table.
- Can new TIDs/CIDs be different?
- Can we safely flatten datasets (calcs belonging to all recursive parents)?
- login_tokens vs. sessions?
- What is `calculations.checksum`?
Migration implementation
- copy users from the source before migration
- build an index (calc_id, mainfile, upload) from the source db
- extract metadata for the index
- identify uploads on the filesystem
- upload via the regular upload API
- match found mainfiles against the index (sketched below)
- validate the calculation metadata against the source
- commit the upload with additional data: restriction, coauthorship, shareship, references, comments, and (only for migration) datasets (parent calc id), pid, uploader, upload_time
- provide an effective logging mechanism to identify inconsistencies (use ELK via regular logging)
- check whether all indexed calcs/uploads have been found on the filesystem
- implement tests
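A sketch of the matching, validation, and commit-preparation step, assuming an index keyed by (upload, mainfile); the `upload.name`, `upload.calcs`, `calc.mainfile`, and `calc.metadata` attributes are illustrative assumptions, not the actual upload API:

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger('nomad.migration')  # regular logging, shipped to ELK


@dataclass
class SourceCalc:
    """One entry of the (calc_id, mainfile, upload) index from the source db."""
    calc_id: int
    mainfile: str
    upload: str
    metadata: dict = field(default_factory=dict)


def metadata_matches(processed: dict, source: dict) -> bool:
    # Placeholder validation: compare only the keys present in the source record.
    return all(processed.get(key) == value for key, value in source.items())


def match_and_commit(upload, index: dict) -> list:
    """Match the calcs of one processed upload against the source-db index,
    log inconsistencies, and collect the per-calc commit metadata."""
    matched = set()
    commit_metadata = []
    for calc in upload.calcs:
        source = index.get((upload.name, calc.mainfile))
        if source is None:
            logger.error('upload in, but calc not in source db: %s/%s',
                         upload.name, calc.mainfile)
            continue
        matched.add(source.calc_id)
        if not metadata_matches(calc.metadata, source.metadata):
            logger.error('mismatching metadata for calc %d', source.calc_id)
        # migration-only keys carry the underscore prefix, see the body example above
        commit_metadata.append(dict(**source.metadata,
                                    mainfile=calc.mainfile, _PID=source.calc_id))
    for source in index.values():
        if source.upload == upload.name and source.calc_id not in matched:
            logger.error('calc in source db, not in upload: %d', source.calc_id)
    return commit_metadata
```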
Test conditions
- success: upload in both, all calcs in both, everything matches
- upload not in index/source db
- upload in the source db, but calc is not
- calc in the source db, but not in the upload
- mismatching metadata
- failed upload processing (e.g. broken archive)
- failed calc processing
- upload from archive
- upload from extracted

These conditions map naturally onto a parametrized test, sketched below.
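A minimal pytest skeleton; the `migration_env` fixture and its methods are hypothetical placeholders:

```python
import pytest

# The ids mirror the conditions above.
@pytest.mark.parametrize('condition', [
    'success',                 # upload in both, all calcs in both, matches
    'upload_not_in_source',
    'calc_not_in_source',
    'calc_missing_in_upload',
    'mismatching_metadata',
    'failed_upload',           # e.g. broken archive
    'failed_calc_processing',
    'upload_from_archive',
    'upload_from_extracted',
])
def test_migration(migration_env, condition):
    report = migration_env.migrate(condition)            # hypothetical fixture API
    migration_env.assert_consistent(condition, report)
```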
Iterating uploads/calcs in source db
Facts from the db:
- there are `metadata` records with `location = null` and `filenames != null` (); here, filenames shows the upload and denotes a dataset
- there are records with `location = null` and `filenames = null`, 2 of them with `calculations.nested_depth = 0`, but still with children: all datasets
- conclusively, all calculations have a location
- there are still 100k+ calculations without origin
- `location` is useless, sometimes absolute, sometimes upload relative
- `filenames` prefixes are: `/data/n...`, `$EXTRACTED`, `/nomad/`
- there is no index on `filenames`
There are two principal directions:
- go through all `metadata` records to find calculations and resolve them to raw data files (sketched below)
- go through all raw data files and try to find the corresponding entries, e.g. in `metadata`
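The first direction requires normalizing the inconsistent `filenames` prefixes listed above. A minimal sketch, assuming upload-relative paths are what the matching needs; the truncated `/data/n...` prefix cannot be handled until it is completed from the actual data:

```python
# Observed `filenames` prefixes; '/data/n...' is truncated in the notes above
# and must be completed from the actual data before it can be added here.
KNOWN_PREFIXES = ('$EXTRACTED', '/nomad/')


def to_upload_relative(filename: str) -> str:
    """Strip a known prefix so that the remainder starts with the upload name.
    Unknown or truncated prefixes are surfaced for manual review."""
    for prefix in KNOWN_PREFIXES:
        if filename.startswith(prefix):
            return filename[len(prefix):].lstrip('/')
    raise ValueError('unknown filenames prefix: %s' % filename)
```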
The raw data in `/data/nomad/extracted` and `/data2/uploads` only partially match:
- both contain some *.tar.gz and other files; ignore them for now, check later
- extracted contains further sub-directories: `materialsProject`, `database_final_v5`; they are just links
- otherwise, there are uploads not in extracted and vice versa
- the differences are small, ~300 out of 11,000
Only extracted files are referenced in the source db.
Not all calculations in the source db are organized in uploads (`origin_id`).
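To quantify the ~300 differences between the two trees, a comparison along these lines could work; a sketch only, where the *.tar.gz handling ("ignore them for now, check later") is reduced to name matching:

```python
from pathlib import Path


def upload_names(root: str) -> set:
    """Upload names under root; *.tar.gz archives are reduced to their stem
    so that an archive and its extracted directory count as the same upload."""
    names = set()
    for entry in Path(root).iterdir():
        name = entry.name
        if name.endswith('.tar.gz'):
            name = name[:-len('.tar.gz')]
        names.add(name)
    return names


extracted = upload_names('/data/nomad/extracted')
uploads = upload_names('/data2/uploads')

# the differences should be small, ~300 out of ~11,000
print('only in extracted: %d' % len(extracted - uploads))
print('only in uploads: %d' % len(uploads - extracted))
```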