Commit 903677b2 authored by Lauri Himanen

Added new package for local analysis

parent 4f494e50
# CP2K NoMaD Parser
# QuickStart
- Clone repository
git clone
- Run the setup script. For a local, user-specific install
without sudo permissions use (omit --user for a system-wide install):
python install --user
- To check that everything works, run the test script in the tests folder:
cd cp2kparser/tests/cp2k_2.6.2
- If you want to try out parsing for a custom CP2K calculation, place all
relevant output and input files inside a common directory and run the
following command within that folder:
python -m cp2kparser
# Structure
Currently the Python package is divided into three subpackages:
- Engines: Classes for parsing different types of files
- Generics: Generic utility classes and base classes
- Implementation: The classes that actually define the parser functionality.
## Engines
Nearly all the "engines", that is, the modules that parse certain types of
files, are reusable in other parsers. They could be put into a common
repository where other developers can improve and extend them. One should also
write tests for the engines to validate their behaviour and ease
performance analysis.
The engine classes also work as interfaces. You can change the engine behaviour
while maintaining the same API in the parsers. For example, one might improve
the performance of an engine, but if the function calls remain the same, no other
code has to be changed.
Currently implemented engines that could be reused (not tested properly yet):
- AtomsEngine: For reading various atomic coordinate files. Currently uses ASE
to read the files.
- RegexEngine: For parsing text files with regular expressions. Uses the re2
library if available (falls back to the default Python regex implementation if
re2 is not found).
- CSVEngine: For parsing CSV-like files. It is very
flexible, as you can specify comments, column delimiters, column
indices and the patterns used to separate different configurations.
- XMLEngine: For parsing XML files using XPath syntax.
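
The fallback described for RegexEngine can be sketched as follows. This is a minimal illustration, not the engine's actual code: it assumes the re2 bindings are importable as `re2` and mirror the standard `re` API, and the helper `find_first` is a hypothetical name introduced only for this example.

```python
# Prefer the re2 bindings when they are installed; otherwise fall back to
# the standard library. Assumption: the re2 module mirrors the `re` API.
try:
    import re2 as re
except ImportError:
    import re


def find_first(pattern, text):
    """Return the first captured group of `pattern` in `text`, or None.

    Hypothetical helper, used only to illustrate the fallback.
    """
    match = re.search(pattern, text)
    return match.group(1) if match else None


energy = find_first(r"Total energy:\s+(-?\d+\.\d+)", "Total energy:  -17.25")
```

Because the calling code only uses the module through the name `re`, the rest of the parser does not need to know which implementation was loaded.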
## Generics
In the generics folder there is a module that defines a
class called NomadParser. This acts as a base class for the cp2k parser defined
in the implementation folder.
The NomadParser class defines the interface which is eventually used by e.g.
the scala code (will be modified later to conform to the common interface).
This class is also responsible for some common tasks that are present in all
parsers:
- Unit conversion
- JSON encoding
- Caching
- Time measurement for performance analysis
- Providing file contents, sizes and handles
# Tools and Methods
The following is a list of tools/methods that can help the development process.
## Documentation
The Google style guide provides a good template on how to document your code.
Documenting makes it much easier to follow the logic behind your parser.
## Logging
Python has a great logging package which helps in following the program flow
and catching different errors and warnings. In cp2kparser the file
cp2kparser/generics/ defines the behaviour of the logger. There you can set up
the log levels even at a modular level. A more easily readable formatting is
also provided for the log messages.
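
Module-level control of log levels can be sketched with the standard logging package alone. The logger names below (`mypackage`, `mypackage.engines`, `mypackage.generics`) are hypothetical and only illustrate the hierarchical naming; this is not the configuration used by cp2kparser.

```python
import logging

# Hypothetical hierarchical logger names, used only for illustration.
package_logger = logging.getLogger("mypackage")
module_logger = logging.getLogger("mypackage.engines")

# Default level for the whole package.
package_logger.setLevel(logging.WARNING)
# Raise the verbosity of one module only.
module_logger.setLevel(logging.DEBUG)

# A sibling module without its own level inherits WARNING from the package logger.
sibling_logger = logging.getLogger("mypackage.generics")

debug_on_module = module_logger.isEnabledFor(logging.DEBUG)
debug_on_sibling = sibling_logger.isEnabledFor(logging.DEBUG)
```

Here only the `mypackage.engines` logger lets DEBUG messages through, while every other logger under `mypackage` keeps the package default.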
## Testing
The parsers can become quite complicated and maintaining them without
systematic testing can become troublesome. Unit tests provide one way to
test each parseable quantity and Python has a very good library for
unit testing. When the parser
supports a new quantity it is quite fast to create unit tests for it. These
tests will validate the parsing, and also easily detect bugs that may arise when
the code is modified in the future.
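
A unit test for a single parsed quantity might look like the sketch below. The function `parse_total_energy` and the `ENERGY = <value>` line format are hypothetical stand-ins for a real parser call; only the `unittest` usage is the point of the example.

```python
import unittest


def parse_total_energy(text):
    """Hypothetical stand-in for a parser function: reads 'ENERGY = <float>'."""
    for line in text.splitlines():
        if line.strip().startswith("ENERGY"):
            return float(line.split("=")[1])
    return None


class TestTotalEnergy(unittest.TestCase):

    def test_value_is_parsed(self):
        self.assertAlmostEqual(parse_total_energy("ENERGY = -17.25\n"), -17.25)

    def test_missing_value_gives_none(self):
        self.assertIsNone(parse_total_energy("no energy here\n"))


# Run the tests programmatically so the snippet also works when pasted into a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTotalEnergy)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

One test case per quantity keeps failures easy to trace back to the part of the parser that changed.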
## Unit conversion
You can find unit conversion tools in the python-common repository and its
nomadcore package. The unit conversion is currently done by
Pint, which has a very natural syntax,
support for numpy arrays and an easily reconfigurable constant/unit declaration.
## Profiling
The parsers have to be reasonably fast. For some codes there is already a
significant amount of data in the NoMaD repository and the time taken to parse
it will depend on the performance of the parser. Also, each time the parser
evolves after system deployment, the existing data may have to be reparsed at
least partially.
By profiling which functions take the most computational time and memory during
parsing you can identify the bottlenecks in the parser. There are already
existing profiling tools which you can plug into your scripts very easily.
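
As one possible choice, the standard library's `cProfile` and `pstats` modules can be wrapped around a parsing call as sketched below. The function `slow_parse` is a hypothetical stand-in for a real parsing routine, and this snippet only covers time, not memory.

```python
import cProfile
import io
import pstats


def slow_parse(n_lines):
    """Hypothetical stand-in for a parsing routine."""
    return [float(i) * 0.5 for i in range(n_lines)]


profiler = cProfile.Profile()
profiler.enable()
values = slow_parse(10000)
profiler.disable()

# Collect the statistics into a string, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

Printing `report` lists the most expensive calls first, which is usually enough to spot the function that dominates the parsing time.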
# Manual for uploading a CP2K calculation
The print level (GLOBAL/PRINT_LEVEL) of a CP2K run will affect how much
information can be parsed from it. Try to use print levels MEDIUM and above to
get the best parsing results.
All the files that are needed to run the calculation should be included in the
upload, including the basis set and potential files. The folder structure does
not matter, as the whole directory is searched for relevant files.
Although CP2K often doesn't care about the file extensions, using them enables
the parser to automatically identify the files and makes it perform better
(it only needs to decompress part of the files in HDF5). Please use these default file
extensions:
- Output file: .out (Only one)
- Input file: .inp (Only one. If you have "include" files, use some other extension e.g. .inc)
- XYZ coordinate files: .xyz
- Protein Data Bank files: .pdb
- Crystallographic Information Files: .cif
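
The extension-based identification described above can be sketched as a recursive directory search. This is an illustration under assumptions, not the parser's actual lookup: the role labels in `EXTENSION_ROLES` and the function name `classify_files` are hypothetical.

```python
import os
import shutil
import tempfile

# Extension-to-role mapping taken from the list above; the role names are
# labels chosen for this sketch only.
EXTENSION_ROLES = {
    ".out": "output",
    ".inp": "input",
    ".xyz": "xyz_coordinates",
    ".pdb": "pdb_coordinates",
    ".cif": "cif_structure",
}


def classify_files(dirpath):
    """Walk `dirpath` recursively and group file paths by recognised extension."""
    found = {}
    for root, _dirs, files in os.walk(dirpath):
        for name in sorted(files):
            role = EXTENSION_ROLES.get(os.path.splitext(name)[1].lower())
            if role is not None:
                found.setdefault(role, []).append(os.path.join(root, name))
    return found


# Small demonstration on a temporary directory.
demo_dir = tempfile.mkdtemp()
for relpath in ["run.out", "run.inp", os.path.join("traj", "pos.xyz"), "basis.inc"]:
    full = os.path.join(demo_dir, relpath)
    if not os.path.isdir(os.path.dirname(full)):
        os.makedirs(os.path.dirname(full))
    open(full, "w").close()

roles = classify_files(demo_dir)
shutil.rmtree(demo_dir)
```

The include file with the `.inc` extension is ignored by the lookup, which is why a distinct extension for include files keeps the single input file unambiguous.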
# Notes for CP2K developers
Here is a list of features/fixes that would make the parsing of CP2K results
easier:
- The pdb trajectory output doesn't seem to conform to the actual standard as
the different configurations are separated by the END keyword which is
supposed to be written only once in the file. The format specification
states that different configurations should start with MODEL and end with
ENDMDL tags.
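
Until the output follows the specification, a reader has to accept both layouts. The sketch below is a hypothetical splitter, not the code used in cp2kparser: it treats either END or ENDMDL as the end of a configuration, and the ATOM lines in the demonstration are shortened and not valid PDB records.

```python
def split_pdb_configurations(text):
    """Split PDB trajectory text into lists of ATOM/HETATM lines.

    A configuration is closed by either an ENDMDL record or an END record,
    so both layouts mentioned above are accepted.
    """
    configurations = []
    current = []
    for line in text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            current.append(line)
        elif record in ("ENDMDL", "END"):
            if current:
                configurations.append(current)
                current = []
    if current:
        configurations.append(current)
    return configurations


# Shortened ATOM lines, for illustration only.
end_style = "ATOM      1  H\nEND\nATOM      1  H\nEND\n"
model_style = "MODEL 1\nATOM      1  H\nENDMDL\nMODEL 2\nATOM      1  H\nENDMDL\nEND\n"
n_end = len(split_pdb_configurations(end_style))
n_model = len(split_pdb_configurations(model_style))
```

Both inputs yield two configurations, so the same reader works for the current output and for a specification-conforming one.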
## More Complex Parsing Scenarios
The existing utilities can be used alone to make a parser in many
cases. The SimpleMatchers provide a very nice declarative way to define the
parsing process and take care of unit conversion and pushing the results to
the scala layer.
Still, you may find it useful to have additional help in handling more complex
scenarios. During parser development you may encounter these questions:
- How to manage different versions of the parsed code that may even have completely different syntax?
- How to handle multiple files?
- How to integrate all this with the functionality that is already provided?
The NomadParser class is meant to help in structuring your code. It uses the same
input and output format as the mainFunction. Here is a minimal
example of a parser that subclasses NomadParser:

    class MyParser(NomadParser):
        """This class is responsible for setting up the actual parser implementation
        and provides the input and output for parsing. It inherits the NomadParser
        class to get access to many useful features.
        """
        def __init__(self, input_json_string):
            NomadParser.__init__(self, input_json_string)
            self.version = None
            self.implementation = None

        def setup_version(self):
            """The parsers should be able to support different versions of the same
            code. In this function you can determine which version of the software
            we're dealing with and initialize the correct implementation.
            """
            self.version = "1"
            self.implementation = globals()["MyParserImplementation{}".format(self.version)](self)

        def parse(self, name):
            """After the version has been identified and an implementation is
            set up, you can start parsing. Returns the result of the
            implementation function called 'name'.
            """
            return getattr(self.implementation, name)()

The class MyParser only defines how to set up a parser based on the given input.
The actual dirty work is done by a parser implementation class. NomadParser
does not enforce any specific style for the implementation. By wrapping the
results in a Result object, you get the automatic unit conversion and JSON
backend support. A very minimal example of a parser implementation class:

    import numpy as np

    class MyParserImplementation1():
        """This is an implementation class that contains the actual parsing logic
        for a certain software version. There can be multiple implementation
        classes and MyParser decides which one to use.
        """
        supported_quantities = ["energy_total", "particle_forces", "particle_position"]

        def __init__(self, parser):
            self.parser = parser

        def energy_total(self):
            """Returns the total energy. Used to illustrate how to parse a single
            result.
            """
            result = Result()
            result.unit = "joule"
            result.value = 2.0
            return result

        def particle_forces(self):
            """Returns multiple force configurations as a list. Has to load the
            entire list into memory. You should avoid loading very big files
            unnecessarily into memory. See the function 'particle_position()' for
            an example of how to avoid loading the entire file into memory.
            """
            result = Result()
            result.unit = "newton"
            xyz_string = self.parser.get_file_contents("forces")
            forces = []
            i_forces = []
            for line in xyz_string.split('\n'):
                line = line.strip()
                if not line:
                    continue
                if line.startswith("i"):
                    # A new configuration starts: store the previous one
                    if i_forces:
                        forces.append(np.array(i_forces))
                        i_forces = []
                elif line.startswith("2"):
                    i_forces.append([float(x) for x in line.split()[-3:]])
            if i_forces:
                forces.append(np.array(i_forces))
            result.value_iterable = forces
            return result

        def particle_position(self):
            """An example of a function returning a generator. This function does
            not load the whole position file into memory, but goes through it line
            by line and returns configurations as soon as they are ready.
            """
            def position_generator():
                """This inner function is the generator, a function that remembers
                its state and can yield intermediate results.
                """
                xyz_file = self.parser.get_file_handle("positions")
                i_forces = []
                for line in xyz_file:
                    line = line.strip()
                    if not line:
                        continue
                    if line.startswith("i"):
                        if i_forces:
                            yield np.array(i_forces)
                            i_forces = []
                    elif line.startswith("2"):
                        i_forces.append([float(x) for x in line.split()[-3:]])
                if i_forces:
                    yield np.array(i_forces)

            result = Result()
            result.unit = "angstrom"
            result.value_iterable = position_generator()
            return result

The MyParser class decides which implementation to use based on e.g. the
software version number that is available on one of the input files. New
implementations corresponding to other software versions can then be easily
defined and they can also use the functionality of another implementation by
subclassing. Example:

    class MyParserImplementation2(MyParserImplementation1):
        """Implementation for a different version of the electronic structure
        software. Subclasses MyParserImplementation1. In this version the
        energy unit has changed and the 'energy_total' function from
        MyParserImplementation1 is overridden.
        """
        def energy_total(self):
            """The energy unit has changed in this version."""
            result = Result()
            result.unit = "hartree"
            result.value = 2.0
            return result

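
As an alternative to looking the class up through `globals()`, the version dispatch can also be written with an explicit mapping. The sketch below is self-contained and uses hypothetical names (`ImplementationV1`, `ImplementationV2`, `IMPLEMENTATIONS`, `get_implementation`) and plain tuples instead of Result objects; it only illustrates the pattern of selecting a subclass by version.

```python
class ImplementationV1(object):
    """Hypothetical implementation for version "1"."""

    def energy_total(self):
        return (2.0, "joule")


class ImplementationV2(ImplementationV1):
    """Hypothetical implementation for version "2": only the unit differs."""

    def energy_total(self):
        value, _unit = ImplementationV1.energy_total(self)
        return (value, "hartree")


# Explicit mapping from version string to implementation class.
IMPLEMENTATIONS = {"1": ImplementationV1, "2": ImplementationV2}


def get_implementation(version):
    """Return an implementation instance, failing loudly for unknown versions."""
    try:
        return IMPLEMENTATIONS[version]()
    except KeyError:
        raise ValueError("Unsupported version: {}".format(version))


result_v1 = get_implementation("1").energy_total()
result_v2 = get_implementation("2").energy_total()
```

An explicit mapping makes the set of supported versions visible in one place and turns an unsupported version into a clear error instead of a failed name lookup.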
MyParser could now be used as follows:

    input_json = """{
        "metaInfoFile": "metainfo.json",
        "tmpDir": "/home",
        "metainfoToKeep": [],
        "metainfoToSkip": [],
        "files": {
            "<path to forces file>": "forces",
            "<path to positions file>": "positions"
        }
    }"""
    # input_json is already a JSON string, so it is passed in as-is
    parser = MyParser(input_json)

The input JSON string is used to initialize the parser. The 'metaInfoFile'
attribute contains the metainfo definitions used by the parser. From this file
the parser can determine the type, shape and existence of metainfo
definitions.
The 'files' object contains all the files that are given to the parser. The
attribute names are the file paths and their values are optional ids. The ids
are not typically given and they have to be assigned by using the
setup_file_id() function of NomadParser. Assigning ids helps to manage the
files.
from setuptools import setup


def main():
    # Start package setup
    setup(
        package_data={
            '': ['*.json', '*.pickle'],
        },
        description="NoMaD parser implementation for CP2K",
        author="Lauri Himanen",
    )

# Run main function by default
if __name__ == "__main__":
    main()
Metadata-Version: 1.0
Name: nomadanalysis
Version: 0.1
Summary: Tools for analysing calculation results parsed by NOMAD parsers.
Home-page: UNKNOWN
Author: Lauri Himanen
License: GPL3
Description: UNKNOWN
Platform: UNKNOWN
\ No newline at end of file
\ No newline at end of file
#! /usr/bin/env python
# This will activate the logging utilities for nomadanalysis
import utils.log
# Import the common classes here for less typing
from .analyzer import Analyzer
import sys
import logging
from nomadcore.local_meta_info import loadJsonFile
from nomadcore.parser_backend import JsonParseEventsWriterBackend
from nomadanalysis.local_backend import LocalBackend
logger = logging.getLogger(__name__)


class Analyzer(object):
    """Gives access to the results stored in the backend of a parser."""

    def __init__(self, parser=None):
        self.parser = parser

    def parse(self):
        # Without a parser there is nothing to return
        if not self.parser:
            logger.error("A parser hasn't been defined.")
            return None
        return self.parser.parser_context.backend.results


if __name__ == "__main__":

    # Initialize backend
    metainfo_path = "/home/lauri/Dropbox/nomad-dev/nomad-meta-info/meta_info/nomad_meta_info/cp2k.nomadmetainfo.json"
    metainfoenv, warnings = loadJsonFile(metainfo_path)
    backend = LocalBackend(metainfoenv)
    # backend = JsonParseEventsWriterBackend(metainfoenv, sys.stdout)

    # Initialize parser
    from cp2kparser import CP2KParser
    dirpath = "/home/lauri/Dropbox/nomad-dev/parser-cp2k/cp2kparser/cp2kparser/tests/cp2k_2.6.2/forces/outputfile/n"
    parser = CP2KParser(dirpath=dirpath, metainfo_path=metainfo_path, backend=backend)

    # Initialize analyzer
    analyser = Analyzer(parser)
    results = analyser.parse()

    # Get Results
    xc = results["XC_functional"]
    # temps = results["cp2k_md_temperature_instantaneous"]
    print(xc.values)
from nomadanalysis import Analyzer
from cp2kparser import CP2KParser
# Initialize the parser you want to use
parser = CP2KParser()
parser.dirpath = "/home/lauri/Dropbox/nomad-dev/parser-cp2k/cp2kparser/cp2kparser/tests/cp2k_2.6.2/forces/outputfile/n"
parser.metainfo_to_keep = ["section_run"]
# Initialize the analyzer
analyzer = Analyzer(parser)
results = analyzer.parse()
import StringIO


class LocalBackend(object):
    """Backend that keeps a Result object per metainfo name in memory."""

    def __init__(self, metaInfoEnv, fileOut=None):
        self.__metaInfoEnv = metaInfoEnv
        # A StringIO default argument would be shared by all instances, so
        # a new buffer is created here when none is given.
        self.fileOut = fileOut if fileOut is not None else StringIO.StringIO()
        self.__gIndex = -1
        self.__openSections = set()
        self.__writeComma = False
        self.__lastIndex = {}
        self.results = {}
        self.stats = {}

    def openSection(self, metaName):
        """opens a new section and returns its new unique gIndex"""
        newIndex = self.__lastIndex.get(metaName, -1) + 1
        self.openSectionWithGIndex(metaName, newIndex)
        return newIndex

    def openSectionWithGIndex(self, metaName, gIndex):
        """opens a new section where gIndex is generated externally
        gIndex should be unique (no reopening of a closed section)"""
        self.__lastIndex[metaName] = gIndex
        self.__openSections.add((metaName, gIndex))
        self.__jsonOutput({"event": "openSection", "metaName": metaName, "gIndex": gIndex})

    def __jsonOutput(self, dic):
        pass

    def closeSection(self, metaName, gIndex):
        self.__openSections.remove((metaName, gIndex))

    def addValue(self, metaName, value, gIndex=-1):
        if self.results.get(metaName) is None:
            self.results[metaName] = Result()

    def addRealValue(self, metaName, value, gIndex=-1):
        if self.results.get(metaName) is None:
            self.results[metaName] = Result()

    def addArrayValues(self, metaName, values, gIndex=-1):
        if self.results.get(metaName) is None:
            self.results[metaName] = Result()

    def metaInfoEnv(self):
        return self.__metaInfoEnv

    def startedParsingSession(self, mainFileUri, parserInfo, parsingStatus=None, parsingErrors=None):
        pass

    def finishedParsingSession(self, parsingStatus, parsingErrors, mainFileUri=None, parserInfo=None):
        pass


class Result(object):
    """Container for the values collected for one metainfo name."""

    def __init__(self):
        self.values = []
        self.arrayValues = []
"""
This module is used to control the logging in the nomad analysis package.

Each module in the package can have its own logger, so that you can control
the logging on a modular level easily.

If you want to use a logger on a module simply add the following in the module:

    import logging
    logger = logging.getLogger(__name__)

This creates a logger with a hierarchical name. The hierarchical name allows
the logger to inherit logger properties from a parent logger, but also allows
module level control for logging.

A custom formatting is also used for the log messages. The formatting is done
by the LogFormatter class and is different for different levels.
"""
import logging
import textwrap


class LogFormatter(logging.Formatter):

    def format(self, record):
        level = record.levelname
        module = record.module
        message = record.msg

        # INFO and DEBUG messages get a compact one-title layout,
        # all other levels are drawn inside a box
        if level == "INFO" or level == "DEBUG":
            return make_titled_message("{}:{}".format(level, module), message)
        return "\n " + make_title(level, width=64) + "\n" + make_message(message, width=64, spaces=8) + "\n"


def make_titled_message(title, message, width=80):
    """Styles a message to be printed into console.
    """
    wrapper = textwrap.TextWrapper(width=width-5)
    lines = wrapper.wrap(message)
    styled_message = ""
    first = True
    for line in lines:
        if first:
            new_line = " >> {}: ".format(title) + line
            styled_message += new_line
            first = False
        else:
            new_line = 5*" " + line
            styled_message += "\n" + new_line
    return styled_message


def make_message(message, width=80, spaces=0):
    """Styles a message to be printed into console.
    """
    wrapper = textwrap.TextWrapper(width=width-6)
    lines = wrapper.wrap(message)
    styled_message = ""
    first = True
    for line in lines:
        new_line = spaces*" " + "| " + line + (width-6-len(line))*" " + " |"
        if first:
            styled_message += new_line
            first = False
        else:
            styled_message += "\n" + new_line
    styled_message += "\n" + spaces*" " + "|" + (width-2)*"-" + "|"
    return styled_message


def make_title(title, width=80):
    """Styles a title to be printed into console.
    """
    space = width-len(title)-4
    # Integer division keeps the padding a whole number of characters
    pre_space = space//2-1
    post_space = space-pre_space
    line = "|" + str((pre_space)*"=") + " "
    line += title
    line += " " + str((post_space)*"=") + "|"
    return line