nomad-lab / nomad-FAIR · Issues · #84 (Closed)

Issue created Dec 23, 2018 by Markus Scheidgen (@mscheidg), Owner · 24 of 26 checklist items completed

coe->fairdi repository db and raw data migration functions

To create a real production nomad@FAIRDI, we must be able to migrate the data from NOMAD CoE.

This must be reproducible and testable. It should be incremental.

Its implementation depends on #80 (closed), #81 (closed).

User metadata

The migration must identify this data and allow posting it with upload commits:

  • ownerships, coauthorships, shareships
  • calcsets (i.e. data sets); metadata.chemical_formula holds the dataset name
  • user_metadata (permission~[open(0), restricted(1)]; label~comments)
  • metadata_citations, citations (value, kind~[references(external), DOIs(internal)])
  • calculation checksum (depending on what this actually is and means)

Some of this user metadata should also be uploadable/postable for non-migration uploads, e.g. bulk uploads of external project data:

  • coauthorships, shareships
  • comments
  • references
  • datasets (need a different implementation; the migration one just comprises the dataset id (calc_id) and assumes it is copied during migration)

User metadata commit body example

Keys starting with an _underscore are only allowed for the migration/admin user. The other keys can also be used by regular users/clients, e.g. to bulk upload external project data (e.g. Aflowlib).

{
  "with_embargo": false,
  "comments": ["This is a comment"],
  "references": ["http://external.ref"],
  "coauthors": [{"user_id": "id"}, ...],
  "_upload_time": "a proper date", // overrides the migration upload time
  "_uploader": 19282, // user id, overrides the migration uploader
  "calculations": [ // overwrites top-level data
    {
      "mainfile": "/path/to/mainfile.out",
      "_PID": 1726312,
      "_datasets" : [ 
        {
          "_id" : 200603,
          "dois" : [],
          "name" : "MRopo_etal_amino_acids_conformer_dataset"
        }, 
        {
          "_id" : 318834,
          "dois" : [ 
            "http://dx.doi.org/10.17172/NOMAD/20150526220502"
          ],
          "name" : "MRopo_etal_amino_acids_conformer_dataset_SciData2015"
        }
      ],
      "_checksum": 1928272332,
      "with_embargo": true,
      "comments": ["This is a comment"],
      "references": ["http://external.ref"],
      "coauthors": [{"user_id": "id"}, ...]
    }
  ]
}
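For illustration, a rough client-side sketch of posting such a commit body for one upload. The base URL, route, token, and upload id below are placeholders, not the actual upload API:

import requests

# placeholders: the real base url, route, and token come from the actual upload API
base_url = 'https://nomad.example.org/api'
upload_id = 'some-upload-id'
access_token = 'some-token'

user_metadata = {
    'with_embargo': False,
    'comments': ['This is a comment'],
    'references': ['http://external.ref'],
    'coauthors': [{'user_id': 'id'}],
    '_uploader': 19282,  # underscore keys: migration/admin only
}

# hypothetical route that commits the upload together with its user metadata
response = requests.post(
    '%s/uploads/%s/commit' % (base_url, upload_id),
    json=user_metadata,
    headers={'Authorization': 'Bearer %s' % access_token})
response.raise_for_status()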

Questions

  • Are citations per dataset or per calculation? Non-trivial, since both are calculations table entries.
  • Can new TID/CIDs be different?
  • Can we safely flatten datasets (calcs belonging to all recursive parents)?
  • login_tokens vs sessions?
  • What is calculations.checksum?

Migration implementation

  • copies users from source before migration
  • builds an index (calc_id, mainfile, upload) from the source db
  • extracts metadata for the index
  • identifies uploads from the filesystem
  • uploads via the regular upload API
  • matches found mainfiles with the index (see the sketch after this list)
  • validates the calculation metadata against the source
  • commits the upload with additional data: restriction, coauthorship, shareship, references, comments, and only for migration: datasets (parent calc id), pid, uploader, upload_time
  • an effective logging mechanism to identify inconsistencies (use ELK via regular logging)
  • check if all index calcs/uploads have been found on the filesystem
  • implement tests
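A minimal sketch of the index and matching steps above, assuming a psycopg2 connection to the source db. The table/column names in the query and the upload_mainfiles() helper (yielding the mainfile paths found in one upload on the filesystem) are placeholders, not the real schema or API:

import psycopg2

def build_index(dsn):
    """Builds a (upload, mainfile) -> calc_id index from the source db."""
    index = {}
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cursor:
            # table and column names are assumptions about the source schema
            cursor.execute('SELECT calc_id, mainfile, upload FROM calculations')
            for calc_id, mainfile, upload in cursor:
                index[(upload, mainfile)] = calc_id
    return index

def match_upload(upload, index, upload_mainfiles, logger):
    """Matches mainfiles found on the filesystem against the source db index."""
    matched, unmatched = [], []
    for mainfile in upload_mainfiles(upload):
        calc_id = index.get((upload, mainfile))
        if calc_id is None:
            # calc is in the upload, but not in the source db (a test condition below)
            logger.warning('calc not in source db: %s/%s' % (upload, mainfile))
            unmatched.append(mainfile)
        else:
            matched.append((calc_id, mainfile))
    return matched, unmatched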

Test conditions

  • success: upload in both, all calcs in both, matches
  • upload not in index/source db
  • upload in, but calc not in source db
  • calc in source db, not in upload
  • mismatching metadata
  • failed upload process (e.g. broken archive)
  • failed calc process
  • upload from archive
  • upload from extracted
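These conditions could be encoded as one parametrized test, e.g. with pytest. The migration and example_uploads fixtures and the shape of the returned report are placeholders for whatever the implementation ends up providing:

import pytest

# one id per condition above; 'success' is the positive control
CASES = [
    'success',
    'upload_not_in_source_db',
    'calc_not_in_source_db',
    'calc_missing_in_upload',
    'mismatching_metadata',
    'failed_upload_processing',
    'failed_calc_processing',
    'upload_from_archive',
    'upload_from_extracted',
]

@pytest.mark.parametrize('case', CASES)
def test_migration(case, migration, example_uploads):
    # 'migration' and 'example_uploads' are hypothetical fixtures that provide the
    # migration entry point and one prepared test upload per condition
    report = migration.migrate(example_uploads[case])
    if case in ('success', 'upload_from_archive', 'upload_from_extracted'):
        assert not report.errors
    else:
        assert report.errors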

Iterating uploads/calcs in source db

Facts from db:

  • there are metadata records with location = null and filenames != null; here filenames shows the upload and denotes a dataset
  • there are records with location = null and filenames = null; 2 of them have calculations.nested_depth = 0 but still have children: all datasets
  • conclusively, all calculations have a location
  • there are still 100k+ calculations without origin
  • location is useless: sometimes absolute, sometimes upload-relative
  • filenames prefixes are: /data/n..., $EXTRACTED, /nomad/
  • there is no index on filenames

There are two principal directions:

  • go through all metadata records to find calculations and resolve them to raw data files (sketched after this list)
  • go through all raw data files and try to find corresponding entries, e.g. in metadata
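A sketch of the first direction (from metadata records to raw data files), assuming a psycopg2 connection. The column names, the list-valued filenames, and the extracted root path are assumptions; only the prefixes listed above are handled:

import os
import psycopg2

EXTRACTED_ROOT = '/data/nomad/extracted'

def resolve_filename(filename):
    """Maps one source-db filenames entry to a path on the raw data filesystem."""
    if filename.startswith('$EXTRACTED'):
        return filename.replace('$EXTRACTED', EXTRACTED_ROOT, 1)
    if filename.startswith('/'):
        return filename  # already absolute (/data/n..., /nomad/ prefixes)
    return None  # unknown prefix, log and inspect manually

def iter_calc_files(dsn):
    """Direction 1: go through all metadata records and resolve them to raw files."""
    with psycopg2.connect(dsn) as conn:
        # server-side cursor, since filenames is not indexed and the table is large
        with conn.cursor(name='metadata_scan') as cursor:
            # column names and the list-valued filenames are assumptions about the schema
            cursor.execute('SELECT calc_id, filenames FROM metadata WHERE filenames IS NOT NULL')
            for calc_id, filenames in cursor:
                for filename in filenames:
                    path = resolve_filename(filename)
                    yield calc_id, filename, path is not None and os.path.exists(path)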

The raw data trees /data/nomad/extracted and /data2/uploads only partially match:

  • both contain some *.tar.gz and other files; ignore for now, check later
  • extracted contains further subdirectories: materialsProject, database_final_v5; they are just links
  • otherwise there are uploads not in extracted and vice versa (see the comparison sketch after this list)
  • the differences are small, ~300 out of 11,000
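A quick sanity check for this difference, comparing the top-level upload names in both trees while skipping *.tar.gz files and the linked subdirectories; the mount points are the ones listed above:

import os

EXTRACTED = '/data/nomad/extracted'
UPLOADS = '/data2/uploads'
LINKS = {'materialsProject', 'database_final_v5'}  # just links, see above

def upload_names(root):
    return {
        name for name in os.listdir(root)
        if name not in LINKS and not name.endswith('.tar.gz')}

only_extracted = upload_names(EXTRACTED) - upload_names(UPLOADS)
only_uploads = upload_names(UPLOADS) - upload_names(EXTRACTED)
print('%d only in extracted, %d only in uploads' % (len(only_extracted), len(only_uploads)))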

Only extracted files are referenced in source db

Not all calculations in source db are organized in uploads (origin_id)

Edited Jan 27, 2019 by Markus Scheidgen