A new framework for parsing semi-structured text files
The NOMAD CoE parsers use the simple_parser.py module for most of the parsers. While this module contains a lot of good ideas, it is probably not a good fit for new parsers.
Introduction
Why does the old library not work anymore?
- It was built for a streaming backend that we don't need anymore. The streaming backend introduced lots of unnecessary constraints on the order in which data can be produced. This meant that a lot of data had to be temporarily stored in local variables, and it led to the introduction of the "CachingBackend" with a complex system for declaring temporary data.
- It assumes that the structure of the input text matches the structure of the metainfo. Again, lots of temporary data.
- It is easy to get lost in its "syntax" (look at parsers to see what I mean).
- The API is very convoluted, with various parsers, nested contexts, and backend classes. Ideally a parser would use one class and one object.
- The API allowed writing parsers that are not reusable across multiple parse runs. We have to reinitialise and re-compile all the involved regular expressions all the time.
- The code that controls the parsing and the code that post-processes the parsed data are separated, which makes it hard to understand what happens.
- There are no reusable modules for recurring patterns, e.g. for vectors, matrices, etc.
- There are lots of different parser interfaces (ParserInterface, mainFunction, SmartParser) that the infrastructure needs to address.
- Unit conversion was built into the parsing part; it should have been handled by the backend/metainfo.
- The general code organisation and quality in python_common is questionable. One big problem is that we cannot separate the dependencies that all parsers need from those that only a few need (e.g. mdanalysis, mdtraj).
- There is no common mechanism to deal with multiple files and sub-parsers.
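To make the point about missing reusable modules concrete, here is a minimal sketch of what a shared building block for vectors and matrices could look like. The names FLOAT, VECTOR, parse_vectors and parse_matrix are hypothetical; nothing like this exists in simple_parser.py. The pattern is compiled once at import time, so it does not need to be rebuilt for every parse run.

```python
import re

# Hypothetical shared building blocks (assumed names, not an existing API).
FLOAT = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"

# Matches "(x, y, z)" and captures the three components.
# Compiled once at import time and shared by every parser that needs it.
VECTOR = re.compile(r"\(\s*(%s)\s*,\s*(%s)\s*,\s*(%s)\s*\)" % (FLOAT, FLOAT, FLOAT))


def parse_vectors(text):
    """Return every '(x, y, z)' triple found in text as a list of float lists."""
    return [[float(c) for c in match.groups()] for match in VECTOR.finditer(text)]


def parse_matrix(text):
    """Interpret exactly three vectors in text as the rows of a 3x3 matrix."""
    rows = parse_vectors(text)
    if len(rows) != 3:
        raise ValueError("expected 3 row vectors, found %d" % len(rows))
    return rows
```

A parser for a specific code would then call parse_matrix on a line such as "cell: (0, 0, 0), (1, 0, 0), (1, 1, 0)" instead of re-implementing the regular expression.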
We should create a new framework for developing parsers for semi-structured text files for the following reasons:
- There will be new parsers to write for the new projects.
- It would give us a fighting chance to modernise the existing parsers.
- There are requests for more code support, hence more parsers.
- It is really hard to explain the "WTFs" of the old framework to someone, because most of the reasons for them are gone by now.
The new framework should have the following features:
- There is a simple interface. A parser consists of one class, and instances can be reused.
- It uses the new metainfo Python library. Target quantities can be addressed by paths; arbitrary sections can be addressed; there is no need for opening and closing sections in a particular order.
- Support for dealing with files: a mechanism that abstracts from the actual files; developers should not manually open files. Ideally the framework allows parsing files directly from the uploaded zip file.
- This list should be extended over time.
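The features above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed API: SuperCodeParser, parse, open_file and zip_opener are hypothetical names, and a plain dict keyed by path strings stands in for the new metainfo library, whose real interface is not shown here. The sketch shows one class whose instance is reused across parse runs, quantities written by path in any order, and a file abstraction that reads directly from a zip archive through the standard zipfile module.

```python
import io
import re
import zipfile


class SuperCodeParser:
    """Hypothetical single-class parser; one instance serves many parse runs."""

    def __init__(self):
        # Compiled once per instance, not once per parse run.
        self._energy = re.compile(r"^energy:\s*(\S+)", re.MULTILINE)

    def parse(self, path, open_file):
        """Parse `path`; the stream comes from `open_file`, never from open()."""
        with open_file(path) as stream:
            text = stream.read()
        # Stand-in for the metainfo: values addressed by path, in any order.
        result = {}
        for index, match in enumerate(self._energy.finditer(text)):
            key = "section_run/0/section_scc/%d/energy_total" % index
            result[key] = float(match.group(1))
        return result


def zip_opener(archive):
    """Return an open_file callable that reads zip members as UTF-8 text."""
    def open_file(path):
        return io.TextIOWrapper(archive.open(path), encoding="utf-8")
    return open_file
```

Because the parser only ever sees the open_file callable, the same instance can parse from a directory, a zip upload, or an in-memory test fixture without any change to the parsing code.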
Exercise
@himanel1 @ladinesa @temokmx @mscheidg We should all do this to brainstorm a new framework design.
Assume the new parser framework is already implemented. Using this fake input file and output metainfo as references, what would the parser look like? Write some pseudo-Python.
Input file, e.g. super_code.out:
2020/05/15
*** super_code v2 ***
system 1
--------
sites: H(1.23, 0, 0), H(-1.23, 0, 0), O(0, 0.33, 0)
latice: (0, 0, 0), (1, 0, 0), (1, 1, 0)
energy: 1.29372
*** This was done with magic source ***
*** x°42 ***
system 2
--------
sites: H(1.23, 0, 0), H(-1.23, 0, 0), O(0, 0.33, 0)
cell: (0, 0, 0), (1, 0, 0), (1, 1, 0)
energy: 1.29372
Output metainfo:
{
  "section_run": [
    {
      "code_name": "super_code",
      "code_version": "2",
      "time": "2020-05-15-00:00:00",
      "section_system": [
        {
          "atom_labels": ["H", "H", "O"],
          "atom_positions": [[6.23e-18, 0, 0], ...],
          "lattice_vectors": [[0, 0, 0], ...]
        },
        {
          "atom_labels": ["H", "H", "O"],
          "atom_positions": [[6.23e-18, 0, 0], ...],
          "lattice_vectors": [[0, 0, 0], ...]
        }
      ],
      "section_scc": [
        {
          "scc_to_system_ref": "/section_run/0/section_system/0",
          "energy_total": 3.28172e-18,
          "x_super_code_sites_magic_sauce": "x°42"
        },
        {
          "scc_to_system_ref": "/section_run/0/section_system/1",
          "energy_total": 3.28172e-18
        }
      ]
    }
  ]
}
Please put your solutions in separate comments and discuss.
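As a starting point for the discussion, here is one possible sketch written with nothing but the standard re module, so it says nothing about what the framework's API should be. All names are hypothetical. It deliberately leaves out unit conversion (the values stay as written in the file) and the magic sauce quantity, and it accepts both the "latice" and "cell" keys that appear in the fake input.

```python
import re

# All names are hypothetical. Units are NOT converted and the
# "magic sauce" quantity is ignored to keep the sketch short.
HEADER = re.compile(r"\*\*\* (\w+) v(\S+) \*\*\*")
DATE = re.compile(r"^(\d{4})/(\d{2})/(\d{2})$", re.MULTILINE)
SYSTEM = re.compile(r"^system \d+\n-+\n", re.MULTILINE)
SITES = re.compile(r"^sites:(.*)$", re.MULTILINE)
LATTICE = re.compile(r"^(?:latice|cell):(.*)$", re.MULTILINE)
ENERGY = re.compile(r"^energy:\s*(\S+)", re.MULTILINE)
SITE = re.compile(r"([A-Z][a-z]?)\(([^)]*)\)")
VECTOR = re.compile(r"\(([^)]*)\)")


def floats(csv):
    """Turn '1.23, 0, 0' into [1.23, 0.0, 0.0]."""
    return [float(x) for x in csv.split(",")]


def parse(text):
    name, version = HEADER.search(text).groups()
    run = {
        "code_name": name,
        "code_version": version,
        "time": "%s-%s-%s-00:00:00" % DATE.search(text).groups(),
        "section_system": [],
        "section_scc": [],
    }
    # Everything after each "system N" heading belongs to that system.
    for index, block in enumerate(SYSTEM.split(text)[1:]):
        sites = SITE.findall(SITES.search(block).group(1))
        lattice = VECTOR.findall(LATTICE.search(block).group(1))
        run["section_system"].append({
            "atom_labels": [label for label, _ in sites],
            "atom_positions": [floats(xyz) for _, xyz in sites],
            "lattice_vectors": [floats(vector) for vector in lattice],
        })
        run["section_scc"].append({
            "scc_to_system_ref": "/section_run/0/section_system/%d" % index,
            "energy_total": float(ENERGY.search(block).group(1)),
        })
    return {"section_run": [run]}
```

Even this short version has to post-process every match by hand and would crash on a missing line, which is the kind of bookkeeping the framework should take over.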