nomad-FAIR issue #348 (closed)
Created May 15, 2020 by Markus Scheidgen (@mscheidg, Owner)

A new framework for parsing semi-structured text files

The NOMAD CoE parsers use the simple_parser.py module for most of the parsers. While this module contains a lot of good ideas, it is probably not a good fit for new parsers.

Introduction

Why does the old library not work anymore?

  • It was built for a streaming backend that we don't need anymore. The streaming backend introduced lots of unnecessary constraints on the order in which data can be produced. This meant that a lot of data had to be stored temporarily in local variables, and it led to the introduction of the "CachingBackend" with a complex system for declaring temporary data.
  • It assumes that the structure of the input text matches the structure of the metainfo. Again, this requires lots of temporary data.
  • It is easy to get lost in its "syntax" (look at the parsers to see what I mean).
  • The API is very convoluted, with various parser, nested context, and backend classes. Ideally a parser would use one class and one object.
  • The API allowed writing parsers that are not reusable across multiple parse runs. We have to re-initialise and re-compile all the involved regular expressions all the time.
  • The code that controls the parsing and the code that post-processes the parsed data are separated, which makes it hard to understand what happens.
  • There are no reusable modules for recurring patterns, e.g. for vectors, matrices, etc.
  • There are lots of different parser interfaces (ParserInterface, mainFunction, SmartParser) that the infrastructure needs to address.
  • Unit conversion was built into the parsing part; it should have been handled by the backend/metainfo.
  • The general code organisation and quality in python_common is questionable. One big problem is that we cannot separate the dependencies that all parsers need from those that only a few need (e.g. mdanalysis, mdtraj).
  • There is no common mechanism to deal with multiple files and sub-parsers.

We should create a new framework for developing parsers for semi-structured text files, for the following reasons:

  • There will be new parsers to write for the new projects.
  • It would give us a fighting chance to modernise the existing parsers.
  • There are requests for more code support, hence more parsers.
  • It is really hard to explain the "WTFs" of the old framework to someone, because most of the reasons for them are gone by now.

The new framework should have the following features (a rough interface sketch follows the list):

  • There is a simple interface. A parser consists of one class; instances can be reused.
  • It uses the new metainfo Python library. Target quantities can be addressed by paths; arbitrary sections can be addressed; there is no need for opening and closing sections in a particular order.
  • Support for dealing with files: a mechanism that abstracts from the actual files; developers should not manually open files. Ideally the framework allows parsing files directly from the uploaded zip file.
  • This list should be extended over time.
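
To make this concrete, here is a minimal sketch of what such an interface could look like. Everything in it is made up for illustration (TextParser, Quantity, a plain dict standing in for the metainfo archive); it is not a design decision, just one possible shape:

import re

class Quantity:
    # hypothetical: binds one regular expression to a metainfo path
    def __init__(self, path, pattern, repeats=False):
        self.path = path
        self.pattern = re.compile(pattern)  # compiled once, reused across runs
        self.repeats = repeats

class TextParser:
    # hypothetical base class: a parser is one class, one reusable instance
    quantities = []

    def parse(self, mainfile, archive):
        # the framework is responsible for file access; a real
        # implementation could read straight from the uploaded zip
        with open(mainfile) as f:
            text = f.read()
        for q in self.quantities:
            matches = q.pattern.findall(text)
            if matches:
                # quantities are addressed by path, in any order;
                # no explicit opening/closing of sections
                archive[q.path] = matches if q.repeats else matches[0]

class SuperCodeParser(TextParser):
    # made-up subclass for the super_code.out example in the exercise below
    quantities = [
        Quantity('section_run/code_version', r'super_code v(\w+)'),
        Quantity('section_run/section_scc/energy_total', r'energy: (\S+)',
                 repeats=True),
    ]

parser = SuperCodeParser()  # one object; the regexes were compiled once, at class definition

The point is only the shape: declarative quantities, one class per parser, reusable instances, and no manual file handling.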

Exercise

@himanel1 @ladinesa @temokmx @mscheidg We should all do this to brainstorm a new framework design.

Assume the new parser framework is already implemented. Using the fake input file and output metainfo below as references, what would the parser look like? Write some pseudo-Python.

Input file, e.g. super_code.out:

2020/05/15
               *** super_code v2 ***
               
system 1
--------
sites: H(1.23, 0, 0), H(-1.23, 0, 0), O(0, 0.33, 0)
latice: (0, 0, 0), (1, 0, 0), (1, 1, 0)
energy: 1.29372

*** This was done with magic source                                ***
***                                x°42                            ***


system 2
--------
sites: H(1.23, 0, 0), H(-1.23, 0, 0), O(0, 0.33, 0)
cell: (0, 0, 0), (1, 0, 0), (1, 1, 0)
energy: 1.29372

Output metainfo:

{
  "section_run": [
    {
       "code_name": "super_code",
       "code_version": "2",
       "time": "2020-05-15-00:00:00",
       "section_system": [
         {
           "atom_labels": ["H", "H", "O"],
           "atom_positions": [[6.23e-18, 0, 0], ...],
           "lattice_vectors": [[0, 0, 0], ...]
         },
         {
           "atom_labels": ["H", "H", "O"],
           "atom_positions": [[6.23e-18, 0, 0], ...],
           "lattice_vectors": [[0, 0, 0], ...]
         }
       ],
       "section_scc": [
         {
           "scc_to_system_ref": "/section_run/0/section_system/0",
           "energy_total": 3.28172e-18,
           "x_super_code_sites_magic_sauce": "x°42"
         },
         {
           "scc_to_system_ref": "/section_run/0/section_system/1",
           "energy_total": 3.28172e-18
         }
       ]
    }
  ]
}
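
To seed the discussion, here is one rough shape a solution could take. This is a sketch, not a proposal: plain re and dicts stand in for the future framework and the metainfo, parse_super_code is a made-up name, and unit conversion (which the metainfo layer should handle) is left out, so raw file values are stored as-is:

import re

def parse_super_code(text):
    # quantity and section names follow the output metainfo above
    run = {'section_system': [], 'section_scc': []}

    date = re.search(r'(\d{4})/(\d{2})/(\d{2})', text)
    if date:
        run['time'] = '-'.join(date.groups()) + '-00:00:00'
    header = re.search(r'\*\*\* (\w+) v(\w+) \*\*\*', text)
    if header:
        run['code_name'], run['code_version'] = header.groups()

    # each 'system N' header starts a new block; the input uses both
    # the 'latice' and 'cell' spellings, so accept either
    for i, block in enumerate(re.split(r'system \d+\n-+\n', text)[1:]):
        sites = re.search(r'sites: (.*)', block).group(1)
        labels, positions = [], []
        for label, xyz in re.findall(r'([A-Z][a-z]?)\(([^)]*)\)', sites):
            labels.append(label)
            positions.append([float(v) for v in xyz.split(',')])
        cell = re.search(r'(?:latice|cell): (.*)', block).group(1)
        lattice = [[float(v) for v in xyz.split(',')]
                   for xyz in re.findall(r'\(([^)]*)\)', cell)]
        run['section_system'].append({
            'atom_labels': labels,
            'atom_positions': positions,
            'lattice_vectors': lattice})

        scc = {
            'scc_to_system_ref': '/section_run/0/section_system/%d' % i,
            'energy_total': float(re.search(r'energy: (\S+)', block).group(1))}
        magic = re.search(r'\*\*\*\s*(x°\d+)\s*\*\*\*', block)
        if magic:
            scc['x_super_code_sites_magic_sauce'] = magic.group(1)
        run['section_scc'].append(scc)

    return {'section_run': [run]}

An open question for the comments: how much of this re plumbing should the framework hide behind declarative quantity definitions (like the interface sketch further up), and how much should stay explicit in the parser code?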

Please put your solutions in separate comments and discuss.
