Commit f1c0863e authored by Lauri Himanen

Cleanup and refactoring of the code, directly implemented the option of local parsing to the ParserInterface baseclass.
parent 8996181b
# CP2K NoMaD Parser
This is the parser for [CP2K](
It is part of the [NOMAD Laboratory](
This is the main repository of the [NOMAD]( parser for
# Installation
This parser is a submodule of the nomad-lab-base repository. Developers within
the NoMaD project will automatically get a copy of this repository when they
download and install the base repository.
## Within NOMAD
When used within the NOMAD Laboratory, this parser will be available as a
submodule of the nomad-lab-base repository. You can download the base repository
with the command:
# Structure
The scala layer can access the parser functionality through the file, by calling the following command:
git clone --recursive
And the installation will be done according to the instructions found [here](
## Standalone Installation
The parser is also available as a standalone package within the repository:
git clone
If used in this standalone mode you can use the installation script
parser-cp2k/parser/parser-cp2k/ with the following command:
python install --user
python path/to/main/file
After the local install the parser will be available to python import under the name
# Usage
This scala interface is placed in its own file to keep it apart from the
rest of the code. Some parsers will have the interface in the same file as the
parsing code, but I feel that this is a cleaner approach.
## Within NOMAD
The scala layer can access the parser through the file.
The parser is designed to support multiple versions of CP2K with a [DRY](
approach: The initial parser class is based on CP2K 2.6.2, and other versions
will be subclassed from it. By subclassing, all the previous functionality is
preserved, new functionality can be added easily, and old functionality is
overridden only where necessary.
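The subclassing approach described above can be sketched as follows. This is a minimal illustration only: the class names, the method names and the assumed output line formats are hypothetical and are not taken from the parser's actual code.

```python
class CP2KBaseParser(object):
    """Hypothetical base implementation, standing in for the CP2K 2.6.2 parser."""
    version = "2.6.2"

    def parse_energy(self, line):
        # Assumed line format for this illustration: "ENERGY| <value>"
        return float(line.split("|")[1])

    def parse_atom_count(self, line):
        # Assumed line format for this illustration: "... <integer>"
        return int(line.split()[-1])


class CP2KNewerParser(CP2KBaseParser):
    """Hypothetical parser for a later version where only the energy line changed."""
    version = "3.0"

    def parse_energy(self, line):
        # Overridden: assume the newer version prints "Total energy: <value>"
        return float(line.split(":")[1])


old_parser = CP2KBaseParser()
new_parser = CP2KNewerParser()

old_energy = old_parser.parse_energy("ENERGY| -17.25")
new_energy = new_parser.parse_energy("Total energy: -17.25")
# parse_atom_count is inherited unchanged from the base class
atom_count = new_parser.parse_atom_count("Number of atoms: 3")
```

Only the method whose behaviour differs between versions is redefined, so the rest of the base class is reused as-is.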
## Standalone
The parser can be used in a python only standalone mode with a separate
'nomadtoolkit' package. In this local mode the parser can be used like this:
# Standalone Mode
The parser is designed to be usable also outside the NoMaD project as a
separate python package. This standalone python-only mode is primarily for
people who want to easily access the parser without the need to set up the whole
"NOMAD Stack". It is also used when running unit tests. The nomadtoolkit
package is currently used by the developer only and is thus not available
through gitlab. Here is an example of the call syntax:
from nomadtoolkit import Analyzer
from cp2kparser import CP2KParser
import matplotlib.pyplot as mpl
# Initialize the contents and the parser you want to use.
paths = "/home/lauri/Dropbox/nomad-dev/parser-cp2k/parser/parser-cp2k/cp2kparser/tests/cp2k_2.6.2/functionals/lda"
parser = CP2KParser(contents=paths)
# 1. Initialize a parser by giving a path to the calculation folder that
# contains all the relevant files.
path = "path/to/folder"
parser = CP2KParser(path)
# Initialize the analyzer. The analyzer will initialize the parser with a local
# backend so that the results will be available as a python dictionary.
analyzer = Analyzer(parser)
# 2. Initialize the analyzer.
# By default all the quantities will be in SI. You can override the units here.
default_units = ["eV"]
analyzer = Analyzer(parser, default_units)
# 3. Parse
results = analyzer.parse()
cell = results["simulation_cell"]
n_atoms = results["number_of_atoms"]
atom_position = results["atom_position"]
atom_label = results["atom_label"]
print cell.value
print n_atoms.value
print atom_position.value
print atom_label.value
# 4. Analyze the results
scf_energies = results["energy_total_scf_iteration"]
# Tools and Methods
The following is a list of tools/methods that can help the development process.
This section describes some of the guidelines that are used in the development
of this parser.
## Documentation
The [google style guide]( provides a good template on how to document your code.
Documenting makes it much easier to follow the logic behind your parser.
## Logging
Python has a great [logging
@@ -114,30 +104,11 @@ existing profiling tools such as
which you can plug into your scripts very easily.
# Manual for uploading a CP2K calculation
The print level (GLOBAL/PRINT_LEVEL) of a CP2K run will affect how much
information can be parsed from it. Use print level MEDIUM or above to get the
best parsing results.
All the files that are needed to run the calculation should be included in the
upload, including the basis set and potential files. The folder structure does
not matter, as the whole directory is searched for relevant files.
Although CP2K often doesn't care about the file extensions, using them enables
the parser to automatically identify the files and makes it perform better
(only needs to decompress part of files in HDF5). Please use these default file
extensions:
- Output file: .out (Only one)
- Input file: .inp (Only one. If you have "include" files, use some other extension e.g. .inc)
- XYZ coordinate files: .xyz
- Protein Data Bank files: .pdb
- Crystallographic Information Files: .cif
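As a rough illustration of why the extensions help, the identification step could look like the sketch below. It assumes files are classified purely by extension; the function names and the type labels are invented for this example and do not describe the parser's actual implementation.

```python
import os

# Extensions from the list above. The type labels are names chosen for this
# sketch, not identifiers used by the parser.
EXTENSION_TYPES = {
    ".out": "output",
    ".inp": "input",
    ".xyz": "xyz",
    ".pdb": "pdb",
    ".cif": "cif",
}


def classify_files(paths):
    """Group file paths by the type implied by their extension."""
    found = {}
    for path in paths:
        extension = os.path.splitext(path)[1]
        file_type = EXTENSION_TYPES.get(extension)
        if file_type is not None:
            found.setdefault(file_type, []).append(path)
    return found


def check_upload(found):
    """Report a problem unless exactly one output and one input file exist."""
    problems = []
    for file_type in ("output", "input"):
        count = len(found.get(file_type, []))
        if count != 1:
            problems.append(
                "expected exactly one {} file, found {}".format(file_type, count))
    return problems


example = classify_files(
    ["run/si.inp", "run/si.out", "run/geo.xyz", "run/basis.inc", "run/BASIS_SET"])
problems = check_upload(example)
```

In this example the include file `basis.inc` and the extensionless `BASIS_SET` file are simply left unclassified, so they do not collide with the single allowed `.inp` file.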
# Notes for CP2K developers
Here is a list of features/fixes that would make the parsing of CP2K results
easier:
- The pdb trajectory output doesn't seem to conform to the actual standard as
the different configurations are separated by the END keyword which is
supposed to be written only once in the file. The [format
specification]( states that
different configurations should start with MODEL and end with ENDMDL tags.
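To make the difference concrete, here is a minimal sketch of how a reader could split a trajectory that does follow the MODEL/ENDMDL convention. The function name is made up for this example, and the ATOM lines are shortened placeholders rather than complete PDB records.

```python
def split_pdb_models(lines):
    """Yield one list of record lines per MODEL ... ENDMDL block."""
    current = None
    for line in lines:
        # The record name occupies the first six columns of a PDB line
        record = line[:6].strip()
        if record == "MODEL":
            current = []
        elif record == "ENDMDL":
            if current is not None:
                yield current
            current = None
        elif current is not None:
            current.append(line.rstrip("\n"))


# Shortened placeholder records, not complete PDB ATOM lines
trajectory = [
    "MODEL        1\n",
    "ATOM      1  H\n",
    "ATOM      2  H\n",
    "ENDMDL\n",
    "MODEL        2\n",
    "ATOM      1  H\n",
    "ATOM      2  H\n",
    "ENDMDL\n",
    "END\n",
]

models = list(split_pdb_models(trajectory))
```

With this layout the single trailing END record is unambiguous, whereas a file that uses END between configurations gives a reader no standard way to tell a frame boundary from the end of the file.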
"""The classes which make up the CP2K input tree.

These are defined in their own module, instead of the xmlpreparser module,
because the pickling of these classes is wrong if they are defined in the same
file which is run in console (module will be then __main__).
"""
from collections import defaultdict
import logging
logger = logging.getLogger(__name__)
class CP2KInput(object):
    """The contents of a CP2K simulation including default values and default
    units from the version-specific xml file.
    """

    def __init__(self, root_section):
        self.root_section = root_section

    @staticmethod
    def decode_cp2k_unit(unit):
        """Given a CP2K unit name, decode it as Pint unit definition.
        """
        unit_map = {
            # Length
            "bohr": "bohr",
            "m": "meter",
            "pm": "picometer",
            "nm": "nanometer",
            "angstrom": "angstrom",
            # Angle
            "rad": "radian",
            "deg": "degree",
            # Energy
            "Ry": "rydberg"
        }
        pint_unit = unit_map.get(unit)
        if pint_unit:
            return pint_unit
        else:
            logger.error("Unknown CP2K unit definition '{}'.".format(unit))

    def set_parameter(self, path, value):
        parameter, section = self.get_parameter_and_section(path)
        parameter.value = value

    def set_keyword(self, path, value):
        keyword, section = self.get_keyword_and_section(path)
        if keyword and section:
            keyword.value = value
        elif section is not None:
            # Keywords not known to the section are stored as free-form text
            split_path = path.rsplit("/", 1)
            keyword = split_path[1]
            section.default_keyword += keyword + " " + value + "\n"

    def get_section(self, path):
        split_path = path.split("/")
        section = self.root_section
        for part in split_path:
            section = section.get_subsection(part)
            if not section:
                logger.error("Error in getting section at path '{}'.".format(path))
                return None
        return section

    def get_keyword_and_section(self, path):
        split_path = path.rsplit("/", 1)
        keyword = split_path[1]
        section_path = split_path[0]
        section = self.get_section(section_path)
        # Without a section there is nothing to look the keyword up in
        if section is None:
            return (None, None)
        keyword = section.get_keyword(keyword)
        if keyword:
            return (keyword, section)
        return (None, section)

    def get_keyword(self, path):
        """Returns the keyword that is specified by the given path.
        If the keyword has no value set, returns the default value defined in
        the XML.
        """
        keyword, section = self.get_keyword_and_section(path)
        if keyword:
            if keyword.value is not None:
                return keyword.get_value()
            else:
                if section.accessed:
                    return keyword.default_value

    def get_default_keyword(self, path):
        return self.get_section(path).default_keyword

    def set_section_accessed(self, path):
        section = self.get_section(path)
        section.accessed = True

    def get_keyword_default(self, path):
        keyword, section = self.get_keyword_and_section(path)
        if keyword:
            return keyword.default_value

    def get_default_unit(self, path):
        keyword, section = self.get_keyword_and_section(path)
        if keyword:
            return keyword.default_unit

    def get_unit(self, path):
        keyword, section = self.get_keyword_and_section(path)
        if keyword:
            return keyword.get_unit()

    def get_parameter_and_section(self, path):
        section = self.get_section(path)
        parameter = section.parameter
        return (parameter, section)

    def get_parameter(self, path):
        parameter, section = self.get_parameter_and_section(path)
        if parameter:
            if parameter.value:
                return parameter.value
            elif section and section.accessed:
                return parameter.lone_value
class Keyword(object):
    """Information about a keyword in a CP2K calculation.
    """

    def __init__(self, default_name, default_value, default_unit_value):
        self.value = None
        self.unit = None
        self.value_no_unit = None
        self.default_name = default_name
        self.default_value = default_value
        self.default_unit = default_unit_value

    def get_value(self):
        """If the units of this value can be changed, return the value without
        its unit definition. Otherwise return the value as it is.
        """
        if self.default_unit:
            if not self.value_no_unit:
                self.decode_cp2k_unit_and_value()
            return self.value_no_unit
        else:
            return self.value

    def get_unit(self):
        if self.default_unit:
            if not self.unit:
                self.decode_cp2k_unit_and_value()
            return self.unit
        else:
            logger.error("The keyword '{}' does not have a unit.".format(self.default_name))

    def decode_cp2k_unit_and_value(self):
        """Split the raw keyword value into a Pint unit definition and a value
        without the unit.
        """
        splitted = self.value.split(None, 1)
        unit_definition = splitted[0]
        if unit_definition.startswith('[') and unit_definition.endswith(']'):
            # An explicit unit, e.g. "[angstrom] 5.0", overrides the default
            unit_definition = unit_definition[1:-1]
            self.unit = CP2KInput.decode_cp2k_unit(unit_definition)
            self.value_no_unit = splitted[1]
        elif self.default_unit:
            logger.debug("No special unit definition found, returning default unit.")
            self.unit = CP2KInput.decode_cp2k_unit(self.default_unit)
            self.value_no_unit = self.value
        else:
            logger.debug("The value has no unit, returning bare value.")
            self.value_no_unit = self.value
class Section(object):
    """An input section in a CP2K calculation.
    """

    def __init__(self, name):
        self.accessed = False
        self.name = name
        self.keywords = defaultdict(list)
        self.default_keyword = ""
        self.parameter = None
        self.sections = defaultdict(list)

    def get_keyword(self, name):
        keyword = self.keywords.get(name)
        if keyword:
            if len(keyword) == 1:
                return keyword[0]
        logger.error("The keyword '{}' in '{}' does not exist or has too many entries.".format(name, self.name))

    def get_subsection(self, name):
        subsection = self.sections.get(name)
        if subsection:
            if len(subsection) == 1:
                return subsection[0]
            else:
                logger.error("The subsection '{}' in '{}' has too many entries.".format(name, self.name))
        else:
            logger.error("The subsection '{}' in '{}' does not exist.".format(name, self.name))
class SectionParameters(object):
    """Section parameters in a CP2K calculation.

    Section parameters are the short values that can be added right after a
    section name, e.g. &PRINT ON, where ON is the section parameter.
    """

    def __init__(self, default_value, lone_value):
        self.value = None
        self.default_value = default_value
        self.lone_value = lone_value
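The bracketed unit handling in `Keyword.decode_cp2k_unit_and_value` can be tried out in isolation with a sketch like the one below. `UNIT_MAP` is a subset of the map in `decode_cp2k_unit`, and `split_unit_and_value` is a name invented for this example. The sketch assumes that a unit given in brackets takes precedence over the default unit.

```python
# Subset of the CP2K-to-Pint unit names listed in decode_cp2k_unit
UNIT_MAP = {
    "bohr": "bohr",
    "angstrom": "angstrom",
    "nm": "nanometer",
    "deg": "degree",
}


def split_unit_and_value(raw_value, default_unit=None):
    """Return (pint_unit, value_without_unit) for a raw keyword value.

    Hypothetical helper: the real parser stores these on a Keyword object.
    """
    parts = raw_value.split(None, 1)
    first = parts[0]
    if len(parts) == 2 and first.startswith("[") and first.endswith("]"):
        # Explicit unit such as "[angstrom] 5.0 5.0 5.0"
        return UNIT_MAP.get(first[1:-1]), parts[1]
    if default_unit:
        # No explicit unit: fall back to the default unit of the keyword
        return UNIT_MAP.get(default_unit), raw_value
    # The keyword has no unit at all
    return None, raw_value


explicit = split_unit_and_value("[angstrom] 5.0 5.0 5.0", default_unit="bohr")
default = split_unit_and_value("10.0", default_unit="bohr")
unitless = split_unit_and_value("ON")
```

Returning the unit and the bare value as a pair keeps the sketch free of state; the classes above cache both on the keyword instead so that the split happens only once.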
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""Provides functions for creating a python object representing a CP2K input

Creates preparsed versions of the cp2k_input.xmls and pickles them (python
version of serialization). The pickle files can then be easily reused without
doing the xml parsing again.

The actual calculation input contents can later be added to this object. Then
the object can be queried for the results, or the default values defined in
the XML.
"""
import xml.etree.cElementTree as ET
import logging
import cPickle as pickle
from cp2kparser.parsing.cp2kinputenginedata.input_tree import *
logger = logging
def generate_object_tree(xml_file):
    xml_element = ET.parse(xml_file)
    object_tree = recursive_tree_generation(xml_element)
    return object_tree


def recursive_tree_generation(xml_element):

    # Make new section object for the root
    section_name_element = xml_element.find("NAME")
    if section_name_element is not None:
        section_name = section_name_element.text
    else:
        section_name = "CP2K_INPUT"
    section = Section(section_name)

    # Section parameters. Compare against None explicitly, because an
    # element without children evaluates to False.
    parameter = xml_element.find("SECTION_PARAMETERS")
    if parameter is not None:
        sp_default_element = parameter.find("DEFAULT_VALUE")
        sp_default_value = None
        if sp_default_element is not None:
            sp_default_value = sp_default_element.text
        sp_lone_element = parameter.find("LONE_KEYWORD_VALUE")
        sp_lone_value = None
        if sp_lone_element is not None:
            sp_lone_value = sp_lone_element.text
        parameter_object = SectionParameters(sp_default_value, sp_lone_value)
        section.parameter = parameter_object

    # Keywords
    for keyword in xml_element.findall("KEYWORD"):
        keyword_names = keyword.findall("NAME")
        default_name = None
        aliases = []
        for name in keyword_names:
            keytype = name.get("type")
            if keytype == "default":
                default_name = name.text
            else:
                aliases.append(name.text)
        default_keyword_element = keyword.find("DEFAULT_VALUE")
        default_keyword_value = None
        if default_keyword_element is not None:
            default_keyword_value = default_keyword_element.text
        default_unit_element = keyword.find("DEFAULT_UNIT")
        default_unit_value = None
        if default_unit_element is not None:
            default_unit_value = default_unit_element.text
        keyword_object = Keyword(default_name, default_keyword_value, default_unit_value)
        # Make the same keyword object reachable by its default name and aliases
        section.keywords[default_name].append(keyword_object)
        for alias in aliases:
            section.keywords[alias].append(keyword_object)

    # Sections
    for sub_section_element in xml_element.findall("SECTION"):
        sub_section = recursive_tree_generation(sub_section_element)
        section.sections[sub_section.name].append(sub_section)

    # Return section
    return section
# Run main function by default
if __name__ == "__main__":
    with open("./cp2k_262/cp2k_input.xml", 'r') as xml_file:
        object_tree = CP2KInput(generate_object_tree(xml_file))
    file_name = "./cp2k_262/cp2k_input_tree.pickle"
    with open(file_name, "wb") as fh:
        pickle.dump(object_tree, fh, protocol=2)
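The preparsing idea used by this script (walk the XML once, store the result as a pickle, load the pickle later) can be sketched on a small scale as shown below. The XML fragment is made up for illustration and only mirrors the tag names used above. Plain dictionaries stand in for the `Section` and `Keyword` classes, and `build_tree` is a hypothetical name.

```python
import pickle
import xml.etree.ElementTree as ET

# Made-up fragment that only mirrors the tag names used by the preparser
XML = """
<CP2K_INPUT>
  <SECTION>
    <NAME>GLOBAL</NAME>
    <KEYWORD>
      <NAME type="default">PRINT_LEVEL</NAME>
      <NAME type="alias">IOLEVEL</NAME>
      <DEFAULT_VALUE>MEDIUM</DEFAULT_VALUE>
    </KEYWORD>
  </SECTION>
</CP2K_INPUT>
"""


def build_tree(element):
    """Recursively turn an XML section into nested dictionaries."""
    name_element = element.find("NAME")
    name = name_element.text if name_element is not None else "CP2K_INPUT"
    keywords = {}
    for keyword in element.findall("KEYWORD"):
        default_element = keyword.find("DEFAULT_VALUE")
        default = default_element.text if default_element is not None else None
        # Register the default value under the default name and every alias
        for name_tag in keyword.findall("NAME"):
            keywords[name_tag.text] = default
    sections = [build_tree(sub) for sub in element.findall("SECTION")]
    return {"name": name, "keywords": keywords, "sections": sections}


tree = build_tree(ET.fromstring(XML.strip()))

# Serialize once, then reuse without parsing the XML again
serialized = pickle.dumps(tree, protocol=2)
restored = pickle.loads(serialized)
```

Dictionaries are used here because they can be unpickled from any module. Instances of custom classes are pickled by reference to the module that defines them, which is why the input tree classes above live in their own module rather than in the script that is run as `__main__`.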
import numpy as np
import logging
logger = logging.getLogger(__name__)
from io import StringIO
try:
    import re2 as re
except ImportError:
    import re
    logger.warning(
        "re2 package not found. Using re package instead. "
        "If you want to use re2 please see the following links:"
    )
class CSVParser(object):
    """Used to parse out freeform CSV-like content.
    Currently only can parse floating point information.

    Reads the given file or string line by line, ignoring commented sections.
    Each line with data is split with a given delimiter expression (regex).
    From the split line the specified columns will be returned as floating
    point numbers in a numpy array.

    If given a separator specification (regex), the algorithm will try to split
    the contents into different configurations which will be separated by a
    line that matches the separator.
    """
    def __init__(self, parser):
        """
        Args:
            parser: Instance of a NomadParser or its subclass. Allows
                access to e.g. unified file reading methods.
        """
        self.parser = parser

    def iread(self, contents, columns, delimiter=r"\s+", comments=r"#", separator=None):
        """Used to iterate a CSV-like file. If a separator is provided the file
        is iterated one configuration at a time. Only keeps one configuration
        of the file in memory. If no separator is given, the whole file will be
        read as a single configuration.

        The contents are separated into configurations whenever the separator
        regex is encountered on a line.
        """

        def split_line(line):
            """Chop off comments, strip, and split at delimiter.
            """
            if line.isspace():
                return None
            if comments:
                line = compiled_comments.split(line, maxsplit=1)[0]
            line = line.strip('\r\n ')
            if line:
                return compiled_delimiter.split(line)
            else:
                return []

        def is_separator(line):
            """Check if the given line matches the separator pattern.
            Separators are used to split a file into multiple configurations.
            """
            if separator:
                return compiled_separator.match(line) is not None
            return False

        # If string or unicode provided, create stream
        if isinstance(contents, (str, unicode)):
            contents = StringIO(unicode(contents))

        # Precompile the different regexs before looping
        if comments:
            comments = (re.escape(comment) for comment in comments)
            compiled_comments = re.compile('|'.join(comments))
        if separator:
            compiled_separator = re.compile(separator)
        compiled_delimiter = re.compile(delimiter)

        # Columns as list
        if columns is not None:
            columns = list(columns)

        # Start iterating
        configuration = []
        for line in contents:  # This actually reads line by line and only keeps the current line in memory

            # If separator encountered, yield the stored configuration
            if is_separator(line):
                if configuration:
                    yield np.array(configuration)
                    configuration = []

            # Ignore comments, separate by delimiter
            vals = split_line(line)
            line_forces = []
            if vals:
                for column in columns:
                    try:
                        value = vals[column]
                    except IndexError:
                        logger.warning("The given index '{}' could not be found on the line '{}'. The given delimiter or index could be wrong.".format(column, line))
                    try:
                        value = float(value)
                    except ValueError:
                        logger.warning("Could not cast value '{}' to float. Currently only floating point values are accepted".format(value))