Commit 4f494e50 authored by Lauri Himanen

Refactoring

parent d07576e8
[submodule "cp2kparser/cp2kparser/metainfo"]
path = cp2kparser/cp2kparser/metainfo
url = git@gitlab.mpcdf.mpg.de:nomad-lab/nomad-meta-info.git
# CP2K NoMaD Parser
The NoMaD parser for CP2K. Under development. Will be modified to conform to
the common parser structure when it is available.
# QuickStart
- Clone the repository
```shell
git clone git@gitlab.mpcdf.mpg.de:nomad-lab/parser-cp2k.git
```
- Install the package by running the setup.py script. For a local, user-specific
install without sudo permissions use (omit --user for a system-wide install):
```shell
python setup.py install --user
```
- You can check that everything works by running the test script in the tests folder:
```shell
cd cp2kparser/tests/cp2k_2.6.2
python run_tests.py
```
- If you want to try out parsing a custom CP2K calculation, place all
relevant output and input files inside a common directory and run the
following command within that folder:
```shell
python -m cp2kparser
```
# Structure
Currently the Python package is divided into three subpackages:
- Engines: Classes for parsing different types of files
- Generics: Generic utility classes and base classes
- Implementation: The classes that actually define the parser functionality.
## Engines
Basically all the "engines", that is, the modules that parse a certain type of
file, are reusable in other parsers. They could be put into a common
repository where other developers can improve and extend them. One should also
write tests for the engines to validate their behaviour and ease
performance analysis.
The engine classes also work as interfaces. You can change an engine's behaviour
while maintaining the same API in the parsers. For example, one might improve
the performance of an engine, but if the function calls remain the same, no other
code has to be changed.
Currently implemented engines that could be reused (not tested properly yet; a
short usage sketch follows the list):
- AtomsEngine: For reading various atomic coordinate files. Currently uses ASE
to read the files.
- RegexEngine: For parsing text files with regular expressions. Uses the re2
library if available (falls back to the default Python regex implementation if
re2 is not found).
- CSVEngine: For parsing CSV-like files. Has a very
flexible nature as you can specify comments, column delimiters, column
indices and the patterns used to separate different configurations.
- XMLEngine: For parsing XML files using XPath syntax.
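As a rough illustration of how these engines are meant to be used, here is a
minimal sketch that drives the CSVEngine included in this repository directly.
The import path is assumed from the package layout described above, and the
parser argument and file contents are made up for the example:
```python
from cp2kparser.engines.csvengine import CSVEngine

# Hypothetical trajectory-like data: two configurations separated by a
# "frame" marker line
contents = (
    "# frame\n"
    "0.0 0.1 0.2\n"
    "0.3 0.4 0.5\n"
    "# frame\n"
    "1.0 1.1 1.2\n"
)

# Normally a NomadParser instance would be passed in; None is enough here
engine = CSVEngine(None)

# iread() yields one numpy array per configuration; columns selects which
# whitespace-separated fields to keep
for configuration in engine.iread(contents, columns=(0, 1, 2), separator=r"# frame"):
    print(configuration)
```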
## Generics
In the generics folder there is a module called nomadparser.py that defines a
class called NomadParser. This acts as a base class for the cp2k parser defined
in the implementation folder.
The NomadParser class defines the interface that is eventually used by e.g.
the Scala code (it will be modified later to conform to the common interface).
This class is also responsible for some common tasks that are present in all
parsers (a small illustrative sketch follows the list):
- Unit conversion
- JSON encoding
- Caching
- Time measurement for performance analysis
- Providing file contents, sizes and handles
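The exact NomadParser API is not reproduced in this README, so the following is
only a conceptual sketch of the division of labour; the class location is taken
from the description above, but the hook method shown is hypothetical:
```python
from cp2kparser.generics.nomadparser import NomadParser


class MyCodeParser(NomadParser):
    """Illustrative subclass: real implementations live in the implementation
    subpackage, and the method name below is invented for this sketch.
    """
    def get_quantity(self, name):
        # The subclass only implements the code-specific extraction; unit
        # conversion, JSON encoding, caching and timing are handled by the
        # NomadParser base class.
        raise NotImplementedError()
```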
# Tools and Methods
The following is a list of tools/methods that can help the development process.
## Documentation
The [google style guide](https://google.github.io/styleguide/pyguide.html?showone=Comments#Comments) provides a good template on how to document your code.
Documenting makes it much easier to follow the logic behind your parser.
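For example, a Google-style docstring for a small parsing helper could look
like this (the function itself is just a placeholder):
```python
def parse_cell(lines):
    """Parse the simulation cell vectors from CP2K output lines.

    Args:
        lines: A list of strings, one per output line.

    Returns:
        A 3x3 nested list of floats containing the cell vectors in the
        order a, b, c.

    Raises:
        ValueError: If the cell block cannot be found in the given lines.
    """
```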
## Logging
Python has a great [logging package](https://docs.python.org/2/library/logging.html) which helps in
following the program flow and catching different errors and warnings. In
cp2kparser the file cp2kparser/generics/logconfig.py defines the behaviour of
the logger. There you can set up log levels even on a per-module basis. A more
easily readable formatting is also provided for the log messages.
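The usual pattern, used throughout this package, is to create a module-level
logger and let logconfig.py decide what is actually shown. Adjusting the
verbosity of a single module could then look roughly like this (the module
path in the last line is only an example):
```python
import logging

logger = logging.getLogger(__name__)


def parse_energy(line):
    """Parse a total energy from one output line, logging any problems."""
    logger.debug("Parsing energy from line: %s", line)
    try:
        return float(line.split()[-1])
    except (IndexError, ValueError):
        logger.warning("Could not parse an energy from line: %s", line)
        return None


# Show only warnings and errors from one particularly verbose module
logging.getLogger("cp2kparser.engines.regexengine").setLevel(logging.WARNING)
```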
## Testing
The parsers can become quite complicated and maintaining them without
systematic testing can become troublesome. Unit tests provide one way to
test each parseable quantity and Python has a very good [library for
unit testing](https://docs.python.org/2/library/unittest.html). When the parser
supports a new quantity it is quite fast to create unit tests for it. These
tests will validate the parsing, and also easily detect bugs that may arise when
the code is modified in the future.
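As a concrete sketch, a test for the CSVEngine described above could look like
the following; the import path is assumed from the package layout:
```python
import unittest

import numpy as np

from cp2kparser.engines.csvengine import CSVEngine


class TestCSVEngine(unittest.TestCase):
    """Checks that CSVEngine extracts the requested columns as floats."""

    def test_column_selection(self):
        engine = CSVEngine(None)
        contents = "1.0 2.0 3.0\n4.0 5.0 6.0\n"
        configurations = list(engine.iread(contents, columns=(0, 2)))
        self.assertEqual(len(configurations), 1)
        self.assertTrue(np.allclose(configurations[0], [[1.0, 3.0], [4.0, 6.0]]))


if __name__ == "__main__":
    unittest.main()
```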
## Unit conversion
The NoMaD parsers need a unified approach to unit conversion. The parsers
should use the same set of physical constants, and a system that does the
conversion semiautomatically. I would propose using
[Pint](https://pint.readthedocs.org/en/0.6/) as it has a very natural syntax,
support for numpy arrays and an easily reconfigurable constant/unit declaration
mechanism. The constants and units can be shared as simple text files across
all parsers.
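For example, converting parsed values with Pint is roughly this simple (the
units used below come from Pint's default registry):
```python
import numpy as np
from pint import UnitRegistry

ureg = UnitRegistry()

# A scalar: one electron volt expressed in joules
energy = (1.0 * ureg.electron_volt).to(ureg.joule)
print(energy.magnitude)

# Pint also works directly on numpy arrays: bond lengths from angstrom to metres
lengths = np.array([0.96, 1.09, 1.54]) * ureg.angstrom
print(lengths.to(ureg.meter).magnitude)
```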
## Profiling
The parsers have to be reasonably fast. For some codes there is already a
significant amount of data in the NoMaD repository and the time taken to parse
it will depend on the performance of the parser. Also each time the parser
evolves after system deployment, the existing data may have to be reparsed at
least partially.
By profiling what functions take the most computational time and memory during
parsing you can identify the bottlenecks in the parser. There are already
existing profiling tools such as
[cProfile](https://docs.python.org/2/library/profile.html#module-cProfile)
which you can plug into your scripts very easily.
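For instance, profiling a full parsing run and printing the most expensive
calls takes only a few lines; parse_path is the same entry point used by the
package's __main__.py:
```python
import cProfile
import pstats
import os

from cp2kparser.implementation.autoparser import parse_path

# Profile a complete parsing run of the current directory and store the stats
cProfile.run("parse_path(os.getcwd(), [], [])", "parse.prof")

# Print the 10 functions with the largest cumulative time
stats = pstats.Stats("parse.prof")
stats.sort_stats("cumulative").print_stats(10)
```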
# Manual for uploading a CP2K calculation
The print level (GLOBAL/PRINT_LEVEL) of a CP2K run will affect how much
information can be parsed from it. Try to use print levels MEDIUM and above to
get the best parsing results.
All the files that are needed to run the calculation should be included in the
upload, including the basis set and potential files. The folder structure does
not matter, as the whole directory is searched for relevant files.
Although CP2K often doesn't care about the file extensions, using them enables
the parser to automatically identify the files and makes it perform better
(only part of the files in the HDF5 upload needs to be decompressed). Please
use these default file extensions:
- Output file: .out (Only one)
- Input file: .inp (Only one. If you have "include" files, use some other extension e.g. .inc)
- XYZ coordinate files: .xyz
- Protein Data Bank files: .pdb
- Crystallographic Information Files: .cif
# Notes for CP2K developers
Here is a list of features/fixes that would make the parsing of CP2K results
easier:
- Include the number of simulated atoms in the output file. I could
not find a way to easily determine the number of atoms for all run types (MD,
GEO_OPT,...). Currently the number of atoms is deduced by counting the atoms
in the initial coordinate file/COORD section.
- The PDB trajectory output doesn't seem to conform to the actual standard, as
the different configurations are separated by the END keyword, which is
supposed to be written only once in the file. The [format specification](http://www.wwpdb.org/documentation/file-format)
states that different configurations should start with MODEL and end with
ENDMDL tags.
#! /usr/bin/env python
import cp2kparser.generics.logconfig
import os
from cp2kparser.implementation.autoparser import parse_path
import argparse
import json
parser = argparse.ArgumentParser(description='Parse a CP2K calculation from a folder.')
parser.add_argument('-metaInfoToKeep', type=str, help='A json list containing the names of the metainfos to keep during parsing.')
parser.add_argument('-metaInfoToSkip', type=str, help='A json list containing the names of the metainfos to skip during parsing.')
args = parser.parse_args()
# Try to decode the metaInfoToKeep argument
metaInfoToKeep = []
if args.metaInfoToKeep:
    try:
        metaInfoToKeep = json.loads(args.metaInfoToKeep)
    except ValueError:
        raise Exception("Could not decode the 'metaInfoToKeep' argument as a json list. You might need to surround the string with single quotes if it contains double quotes.")

# Try to decode the metaInfoToSkip argument
metaInfoToSkip = []
if args.metaInfoToSkip:
    try:
        metaInfoToSkip = json.loads(args.metaInfoToSkip)
    except ValueError:
        raise Exception("Could not decode the 'metaInfoToSkip' argument as a json list. You might need to surround the string with single quotes if it contains double quotes.")
path = os.getcwd()
parse_path(path, metaInfoToKeep, metaInfoToSkip)
Subproject commit 163501eabba0fa385f28edcb55aa577de96e7624
import numpy as np
import logging
logger = logging.getLogger(__name__)
from io import StringIO
try:
    import re2 as re
except ImportError:
    import re
    logger.warning((
        "re2 package not found. Using re package instead. "
        "If you want to use re2 please see the following links:"
        " https://github.com/google/re2"
        " https://pypi.python.org/pypi/re2/"
    ))
else:
    re.set_fallback_notification(re.FALLBACK_WARNING)
#===============================================================================
class CSVEngine(object):
"""Used to parse out freeform CSV-like content.
Currently only can parse floating point information.
Reads the given file or string line by line, ignoring commented sections.
Each line with data is split with a given delimiter expression (regex).
From the split line the specified columns will be returned as floating
point numbers in a numpy array.
If given a separator specification (regex), the algorithm will try to split
the contents into different configurations which will be separated by a
line that matches the separator.
"""
def __init__(self, parser):
"""
Args:
cp2k_parser: Instance of a NomadParser or it's subclass. Allows
access to e.g. unified file reading methods.
"""
self.parser = parser
    def iread(self, contents, columns, delimiter=r"\s+", comments=r"#", separator=None):
        """Used to iterate a CSV-like file. If a separator is provided the file
        is iterated one configuration at a time. Only keeps one configuration
        of the file in memory. If no separator is given, the whole file will be
        handled.

        The contents are separated into configurations whenever the separator
        regex is encountered on a line.
        """
        def split_line(line):
            """Chop off comments, strip, and split at delimiter."""
            if line.isspace():
                return None
            if comments:
                line = compiled_comments.split(line, maxsplit=1)[0]
            line = line.strip('\r\n ')
            if line:
                return compiled_delimiter.split(line)
            else:
                return []

        def is_separator(line):
            """Check if the given line matches the separator pattern.

            Separators are used to split a file into multiple configurations.
            """
            if separator:
                return compiled_separator.search(line)
            return False

        # If string or unicode provided, create stream
        if isinstance(contents, (str, unicode)):
            contents = StringIO(unicode(contents))

        # Precompile the different regexs before looping
        if comments:
            comments = (re.escape(comment) for comment in comments)
            compiled_comments = re.compile('|'.join(comments))
        if separator:
            compiled_separator = re.compile(separator)
        compiled_delimiter = re.compile(delimiter)

        # Columns as list
        if columns is not None:
            columns = list(columns)

        # Start iterating. The file object is read line by line so only the
        # current line is kept in memory.
        configuration = []
        for line in contents:

            # If separator encountered, yield the stored configuration
            if is_separator(line):
                if configuration:
                    yield np.array(configuration)
                    configuration = []
            else:
                # Ignore comments, separate by delimiter
                vals = split_line(line)
                line_values = []
                if vals:
                    for column in columns:
                        try:
                            value = vals[column]
                        except IndexError:
                            logger.warning("The given index '{}' could not be found on the line '{}'. The given delimiter or index could be wrong.".format(column, line))
                            return
                        try:
                            value = float(value)
                        except ValueError:
                            logger.warning("Could not cast value '{}' to float. Currently only floating point values are accepted".format(value))
                            return
                        else:
                            line_values.append(value)
                    configuration.append(line_values)

        # The last configuration is yielded even if the separator is not
        # present at the end of the file or is not given at all
        if configuration:
            yield np.array(configuration)
import os
import logging
logger = logging.getLogger(__name__)
try:
    import re2 as re
except ImportError:
    import re
    logger.warning((
        "re2 package not found. Using re package instead. "
        "If you want to use re2 please see the following links:"
        " https://github.com/google/re2"
        " https://pypi.python.org/pypi/re2/"
    ))
else:
    re.set_fallback_notification(re.FALLBACK_WARNING)
#===============================================================================
class Regex(object):
"""Represents a regex search used by the RegexEngine class.
In addition to a regular regex object from the re2 or re module, this
object wraps additional information about a regex search:
regex_string: The regular expression as a string. Supports also the
more verbose form
(https://docs.python.org/2/library/re.html#re.VERBOSE). Currently
supports only one capturing group.
index: Index for the wanted match. Can be a single integer number (also
negative indices supported) or if the special value "all" is provided,
all results will be returned.
separator: If a separator is defined, the input file can be chopped
into smaller pieces which are separated by the given separator. The
separator is a strig representing a regular epression. The smaller pieces are
then searched independently. This approach allows bigger files to be
handled piece by piece without loading the whole file into memory.
direction: If a separator is defined, this parameter defines whether
the file is chopped into pieces starting from the end or from the
start.
from_beginning: If true, the input must match the regular expression
right from the start. Any matches in the middle of the input are not
searched.
"""
def __init__(self, regex_string, index="all", separator=None, direction="down", from_beginning=False):
self.regex_string = regex_string
self.index = index
self.separator = separator
self.direction = direction
self.from_beginning = from_beginning
self.compiled_regex = None
self.compiled_separator = None
self.inner_regex = None
self.check_input()
self.compile()
def set_inner_regex(self, inner_regex):
self.inner_regex = inner_regex
def compile(self):
self.compiled_regex = re.compile(self.regex_string, re.VERBOSE)
self.compiled_separator = re.compile(self.separator, re.VERBOSE)
def check_input(self):
if self.direction != "down" and self.direction != "up":
logger.error("Unsupported direction value '{}' in a regex".format(self.direction))
def match(self, string):
return self.compiled_regex.match(string)
def search(self, string):
return self.compiled_regex.search(string)
def findall(self, string):
return self.compiled_regex.findall(string)
def finditer(self, string):
return self.compiled_regex.finditer(string)
#===============================================================================
class RegexEngine(object):
"""Used for parsing values values from files with regular expressions.
"""
def __init__(self, parser):
self.regexs = None
self.results = {}
self.regex_dict = {}
self.target_dict = {}
self.files = None
self.extractors = None
self.extractor_results = {}
self.output = None
self.regexs = None
self.cache = {}
self.compiled_regexs = {}
self.file_contents = {}
    def parse(self, regex, file_handle):
        """Use the given regex to parse contents from the given file handle."""
        file_name = file_handle.name
        logger.debug("Searching regex in file '{}'".format(file_name))
        result = self.recursive_extraction(regex, file_handle)
        if result:
            return result

        # Couldn't find the quantity
        logger.debug("Could not find a result for {}.".format(regex.regex_string))

    def recursive_extraction(self, regex, data):
        """Goes through the extractor tree recursively until the final
        extractor is found and returns the value given by it. The value can be
        of any dimension but contains only strings.
        """
        # # Early return with cached result
        # result = self.extractor_results.get(extractor_id)
        # if result:
        #     return result

        result = None
        # If separator specified, do a blockwise search
        if regex.separator is not None:
            logger.debug("Going into blockwise regex search")
            result = self.regex_block_search(data, regex)
        # Regular string search
        else:
            logger.debug("Going into full regex search")
            result = self.regex_search_string(data, regex)

        # See if the tree continues
        if regex.inner_regex is not None:
            logger.debug("Entering next regex recursion level.")
            return self.recursive_extraction(regex.inner_regex, result)
        else:
            return result
    def regex_search_string(self, data, regex):
        """Do a regex search on the data. This loads the entire data into
        memory, so it might not be the best option for big files. See
        'regex_block_search' for reading the file piece by piece.
        """
        from_beginning = regex.from_beginning
        index = regex.index

        # If given a file object, read all of it as a string
        if isinstance(data, file):
            data.seek(0)
            contents = data.read()
        else:
            contents = data

        result = None
        if from_beginning:
            logger.debug("Doing full string search from beginning.")
            return regex.match(contents)
        elif index == "all":
            logger.debug("Doing full string search for all results.")
            result = regex.findall(contents)
            if not result:
                logger.debug("No matches.")
        elif index >= 0:
            logger.debug("Doing full string search with specified index.")
            iterator = regex.finditer(contents)
            i = 0
            while i <= index:
                try:
                    match = iterator.next()
                except StopIteration:
                    if i == 0:
                        logger.debug("No results.")
                    else:
                        logger.debug("Invalid regex index.")
                    break
                if i == index:
                    result = match.groups()[0]
                i += 1
        elif index < 0:
            matches = regex.findall(contents)
            if not matches:
                logger.debug("No matches.")
            else:
                try:
                    result = matches[index]
                except IndexError:
                    logger.debug("Invalid regex index.")

        return result
    def regex_block_search(self, file_handle, regex):
        """Do a regex search on the data. This function can load the file piece
        by piece to avoid loading huge files into memory. The piece-wise search
        can also be used to search the file from bottom to top.
        """
        separator = regex.separator
        direction = regex.direction
        index = regex.index
        from_beginning = regex.from_beginning
        logger.debug("Doing blockwise search with separator: '{}', direction: '{}', from_beginning: '{}' and index '{}'".format(separator, direction, from_beginning, index))

        # Determine the direction in which the blocks are read
        if direction == "up":
            logger.debug("Searching from bottom to top.")
            generator = self.reverse_block_generator(file_handle, separator)
        elif direction == "down":
            logger.debug("Searching from top to bottom.")
            generator = self.block_generator(file_handle, separator)
        else:
            logger.error("Unknown direction specifier: {}".format(direction))
            return

        # If all results are wanted, just collect the matches from all blocks
        if index == "all":
            logger.debug("Searching for all matches.")
            results = []
            for block in generator:
                matches = regex.findall(block)
                if matches:
                    if isinstance(matches, list):
                        for match in matches:
                            results.append(match)
                    else:
                        results.append(matches.groups()[0])
            return results

        # If an index is given, search until the correct index is found
        i_result = 0
        counter = 0
        for block in generator:
            logger.debug("Searching for a specific index.")
            counter += 1
            if from_beginning:
                result = regex.match(block)
                if result:
                    logger.debug("Found match in beginning of block.")
                    if index + 1 > i_result + 1:
                        i_result += 1
                    else:
                        return result.groups()[0]
            else:
                results = regex.findall(block)
                if results:
                    if isinstance(results, list):
                        n_results = len(results)
                    else:
                        n_results = 1
                    logger.debug("Found results within block.")
                    if index + 1 > i_result + n_results:
                        i_result += n_results
                    else:
                        if n_results == 1:
                            return results.groups()[0]
                        else:
                            return results[i_result + (n_results - 1) - index]
    def reverse_block_generator(self, fh, separator_pattern, buf_size=1000000):
        """A generator that returns chunks of a file piece by piece in reverse
        order.
        """
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        total_size = remaining_size = fh.tell()

        # Compile the separator, and a version of it anchored to the end of
        # the string
        compiled_separator = re.compile(separator_pattern)
        end_match = separator_pattern + r'$'
        compiled_end_match = re.compile(end_match)

        while remaining_size > 0:
            offset = min(total_size, offset + buf_size)
            fh.seek(-offset, os.SEEK_END)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = compiled_separator.split(buffer)
            # The first line of the buffer is probably not a complete line so
            # we'll save it and append it to the last line of the next buffer
            # we read
            if segment is not None:
                # If this chunk ends with the separator, do not concatenate
                # the segment to the last line of the new chunk; yield the
                # segment separately instead
                if compiled_end_match.search(buffer):
                    yield segment
                else:
                    lines[-1] += segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if len(lines[index]):
                    yield lines[index]
        yield segment