Added new package for local analysis

Commit 903677b2 authored 9 years ago by Himanen, Lauri (himanel1)
--- a/cp2kparser/README.md
+++ b/cp2kparser/README.md
+# CP2K NoMaD Parser
+
+# QuickStart
+- Clone repository
+
+    ```shell
+    git clone git@gitlab.mpcdf.mpg.de:nomad-lab/parser-cp2k.git
+    ```
+
+- Install the package with the setup.py script. For a local, user-specific
+  install without sudo permissions use (omit --user for a system-wide install):
+
+    ```shell
+    python setup.py install --user
+    ```
+
+- To check that everything works, run the test script in the tests folder:
+
+    ```shell
+    cd cp2kparser/tests/cp2k_2.6.2
+    python run_tests.py
+    ```
+
+- If you want to try out parsing for a custom cp2k calculation, place all
+  relevant output and input files inside a common directory and run the
+  following command within that folder:
+
+    ```shell
+    python -m cp2kparser
+    ```
+
+# Structure
+Currently the python package is divided into three subpackages:
+- Engines: Classes for parsing different types of files
+- Generics: Generic utility classes and base classes
+- Implementation: The classes that actually define the parser functionality.
+
+## Engines
+Basically all the "engines", that is the modules that parse certain types of
+files, are reusable in other parsers. They could be put into a common
+repository where other developers can improve and extend them. One should also
+write tests for the engines that would validate their behaviour and ease the
+performance analysis.
+
+The engine classes work also as interfaces. You can change the engine behaviour
+while maintaining the same API in the parsers. For example one might improve
+the performance of an engine but if the function calls remain the same no other
+code has to be changed.
+
+Currently implemented engines that could be reused (not tested properly yet):
+- AtomsEngine: For reading various atomic coordinate files. Currently uses ASE
+  to read the files.
+- RegexEngine: For parsing text files with regular expressions. Uses the re2
+  library if available (falls back to the default Python regex implementation
+  if re2 is not found).
+- CSVEngine: For parsing CSV-like files. Very flexible: you can specify
+  comments, column delimiters, column indices and the patterns used to
+  separate different configurations.
+- XMLEngine: For parsing XML files using XPath syntax.
+
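The re2 fallback described for RegexEngine can be sketched as follows. This is a minimal illustration, not the engine's actual code: it assumes the optional `re2` module is API-compatible with `re` for the calls used, and the `ENERGY_PATTERN` line format and the `find_energy` helper are hypothetical.

```python
# Prefer the re2 bindings when installed; otherwise use the standard library.
try:
    import re2 as re  # optional third-party module, assumed compatible here
except ImportError:
    import re

# Hypothetical output line; the real CP2K output format may differ.
ENERGY_PATTERN = re.compile(r"Total energy:\s+(-?\d+\.\d+)")


def find_energy(text):
    """Return the first matched energy as a float, or None if absent."""
    match = ENERGY_PATTERN.search(text)
    return float(match.group(1)) if match else None
```

Because both modules are bound to the same name, the calling code does not need to know which implementation is in use.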
+## Generics
+In the generics folder there is a module called nomadparser.py that defines a
+class called NomadParser. This acts as a base class for the cp2k parser defined
+in the implementation folder.
+
+The NomadParser class defines the interface which is eventually used by e.g.
+the scala code (will be modified later to conform to the common interface).
+This class is also responsible for some common tasks that are present in all
+parsers:
+
+- Unit conversion
+- JSON encoding
+- Caching
+- Time measurement for performance analysis
+- Providing file contents, sizes and handles
+
+# Tools and Methods
+
+The following is a list of tools/methods that can help the development process.
+
+## Documentation
+The [google style guide](https://google.github.io/styleguide/pyguide.html?showone=Comments#Comments) provides a good template on how to document your code.
+Documenting makes it much easier to follow the logic behind your parser.
+
+## Logging
+Python has a great [logging
+package](https://docs.python.org/2/library/logging.html) which helps in
+following the program flow and catching different errors and warnings. In
+cp2kparser the file cp2kparser/generics/logconfig.py defines the behaviour of
+the logger. There you can set up the log levels even at a modular level. A more
+easily readable formatting is also provided for the log messages.
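Module-level log control relies on the hierarchical logger names of the standard logging package. The sketch below uses hypothetical logger names (`mypackage...`) rather than the ones configured in logconfig.py, and a plain format string instead of the custom formatter.

```python
import logging

# Hypothetical logger names used for illustration only.
package_logger = logging.getLogger("mypackage")
engine_logger = logging.getLogger("mypackage.engines.regexengine")
generic_logger = logging.getLogger("mypackage.generics.util")

# One level for the whole package ...
package_logger.setLevel(logging.INFO)
# ... and a more verbose level for a single module.
engine_logger.setLevel(logging.DEBUG)

# A single handler on the package logger serves all child loggers.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s: %(message)s"))
package_logger.addHandler(handler)

engine_logger.debug("shown: this module is set to DEBUG")
generic_logger.debug("hidden: inherits INFO from the package logger")
```

Child loggers without an explicit level inherit the level of the nearest configured parent, which is what makes per-module overrides cheap.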
+
+## Testing
+The parsers can become quite complicated and maintaining them without
+systematic testing can become troublesome. Unit tests provide one way to
+test each parseable quantity and python has a very good [library for
+unit testing](https://docs.python.org/2/library/unittest.html). When the parser
+supports a new quantity it is quite fast to create unit tests for it. These
+tests will validate the parsing, and also easily detect bugs that may arise when
+the code is modified in the future.
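A unit test for a single parseable quantity might look like the sketch below. The `parse_total_energy` function and the sample output line are hypothetical stand-ins; a real test would call the parser on a stored output file instead.

```python
import unittest


def parse_total_energy(text):
    """Hypothetical stand-in for a parser method returning one quantity."""
    for line in text.splitlines():
        if line.strip().startswith("ENERGY|"):
            return float(line.split()[-1])
    return None


class TestTotalEnergy(unittest.TestCase):
    def test_value_is_parsed(self):
        output = "some header\n ENERGY| Total energy [a.u.]:  -17.25\n"
        self.assertAlmostEqual(parse_total_energy(output), -17.25)

    def test_missing_value_gives_none(self):
        self.assertIsNone(parse_total_energy("no energy here"))


def run_suite():
    """Run the tests programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTotalEnergy)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Adding one small test case per quantity keeps failures easy to attribute when the parser is modified later.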
+
+## Unit conversion
+You can find unit conversion tools from the python-common repository and its
+nomadcore package. The unit conversion is currently done by
+[Pint](https://pint.readthedocs.org/en/0.6/), which has a very natural syntax,
+support for numpy arrays and an easily reconfigurable constant/unit declaration
+mechanism.
+
+## Profiling
+The parsers have to be reasonably fast. For some codes there is already a
+significant amount of data in the NoMaD repository and the time taken to parse
+it will depend on the performance of the parser. Also each time the parser
+evolves after system deployment, the existing data may have to be reparsed at
+least partially.
+
+By profiling what functions take the most computational time and memory during
+parsing you can identify the bottlenecks in the parser. There are already
+existing profiling tools such as
+[cProfile](https://docs.python.org/2/library/profile.html#module-cProfile)
+which you can plug into your scripts very easily.
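As a rough illustration of plugging cProfile into a script, the sketch below profiles a single call and collects the report as text. The `parse_numbers` workload and the `profile_call` helper are hypothetical; only the `cProfile`, `pstats` and `io` calls come from the standard library.

```python
import cProfile
import io
import pstats


def parse_numbers(text):
    """Hypothetical parsing work to be profiled."""
    return [float(token) for token in text.split()]


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    # Sort by cumulative time and keep only the ten most expensive entries.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


values, report = profile_call(parse_numbers, "1.0 2.5 3.0 " * 1000)
```

Printing `report` lists the functions that dominate the run time, which is usually enough to spot a bottleneck in a parser.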
+
+# Manual for uploading a CP2K calculation
+The print level (GLOBAL/PRINT_LEVEL) of a CP2K run will affect how much
+information can be parsed from it. Try to use print levels MEDIUM and above to
+get best parsing results.
+
+All the files that are needed to run the calculation should be included in the
+upload, including the basis set and potential files. The folder structure does
+not matter, as the whole directory is searched for relevant files.
+
+Although CP2K often doesn't care about the file extensions, using them enables
+the parser to automatically identify the files and makes it perform better
+(only needs to decompress part of files in HDF5). Please use these default file
+extensions:
+ - Output file: .out (Only one)
+ - Input file: .inp (Only one. If you have "include" files, use some other extension e.g. .inc)
+ - XYZ coordinate files: .xyz
+ - Protein Data Bank files: .pdb
+ - Crystallographic Information Files: .cif
+
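Identifying files by extension while searching a whole directory can be sketched as below. This is an assumption about how such a lookup could work, not the parser's actual code: the `EXTENSION_KINDS` labels and the `classify_files` helper are illustrative names, and only the extensions come from the list above.

```python
import os

# Extensions taken from the list above; the labels are illustrative names.
EXTENSION_KINDS = {
    ".out": "output",
    ".inp": "input",
    ".xyz": "xyz_coordinates",
    ".pdb": "pdb",
    ".cif": "cif",
}


def classify_files(root):
    """Walk root recursively and group file paths by recognised extension."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            extension = os.path.splitext(filename)[1].lower()
            kind = EXTENSION_KINDS.get(extension)
            if kind is not None:
                found.setdefault(kind, []).append(os.path.join(dirpath, filename))
    return found
```

Files with unrecognised extensions, such as include files named `.inc`, are simply left out of the result.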
+# Notes for CP2K developers
+Here is a list of features/fixes that would make the parsing of CP2K results
+easier:
+ - The pdb trajectory output doesn't seem to conform to the actual standard as
+   the different configurations are separated by the END keyword which is
+   supposed to be written only once in the file. The [format specification](http://www.wwpdb.org/documentation/file-format)
+   states that different configurations should start with MODEL and end with
+   ENDMDL tags.
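Until the trajectory output follows the specification, a reader has to accept both separators. The sketch below is a hypothetical helper (`split_pdb_frames` is not part of the parser) that treats either ENDMDL or END as the end of a configuration and keeps only ATOM and HETATM records.

```python
def split_pdb_frames(lines):
    """Yield one list of ATOM/HETATM records per configuration.

    A frame is closed by ENDMDL (per the format specification) or by END
    (the separator described in the note above).
    """
    frame = []
    for line in lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            frame.append(line.rstrip("\n"))
        elif record in ("ENDMDL", "END"):
            if frame:
                yield frame
                frame = []
    # Emit trailing records if the file has no final terminator.
    if frame:
        yield frame
```

Since it is a generator over lines, it can be fed an open file handle and never needs the whole trajectory in memory.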
--- a/cp2kparser/library.md
+++ b/cp2kparser/library.md
+## More Complex Parsing Scenarios
+
+The utilities in simple_parser.py can be used alone to make a parser in many
+cases. The SimpleMatchers provide a very nice declarative way to define the
+parsing process and take care of unit conversion and pushing the results to
+the scala layer.
+
+Still you may find it useful to have additional help in handling more complex
+scenarios. During the parser development you may encounter these questions:
+    - How to manage different versions of the parsed code that may even have completely different syntax?
+    - How to handle multiple files?
+    - How to integrate all this with the functionality that is provided in simple_parser.py?
+
+The NomadParser class is meant to help in structuring your code. It uses the
+same input and output format as the mainFunction in simple_parser.py. Here is
+a minimal example of a parser that subclasses NomadParser:
+
+```python
+class MyParser(NomadParser):
+    """
+    This class is responsible for setting up the actual parser implementation
+    and provides the input and output for parsing. It inherits the NomadParser
+    class to get access to many useful features.
+    """
+    def __init__(self, input_json_string):
+        NomadParser.__init__(self, input_json_string)
+        self.version = None
+        self.implementation = None
+        self.setup_version()
+
+    def setup_version(self):
+        """The parsers should be able to support different versions of the same
+        code. In this function you can determine which version of the software
+        we're dealing with and initialize the correct implementation
+        accordingly.
+        """
+        self.version = "1"
+        self.implementation = globals()["MyParserImplementation{}".format(self.version)](self)
+
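+    def parse(self):
+        """After the version has been identified and an implementation is
+        set up, you can start parsing. Returns the result of every supported
+        quantity, keyed by the quantity name.
+        """
+        results = {}
+        for name in self.implementation.supported_quantities:
+            results[name] = getattr(self.implementation, name)()
+        return results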
+```
+
+The class MyParser only defines how to setup a parser based on the given input.
+The actual dirty work is done by a parser implementation class. NomadParser
+does not enforce any specific style for the implementation. By wrapping the
+results in a Result object, you get the automatic unit conversion and JSON
+backend support. A very minimal example of a parser implementation class:
+
+```python
+import numpy as np
+
+
+class MyParserImplementation1():
+    """This is an implementation class that contains the actual parsing logic
+    for a certain software version. There can be multiple implementation
+    classes and MyParser decides which one to use.
+    """
+    supported_quantities = ["energy_total", "particle_forces", "particle_position"]
+
+    def __init__(self, parser):
+        self.parser = parser
+
+    def energy_total(self):
+        """Returns the total energy. Used to illustrate how to parse a single
+        result.
+        """
+        result = Result()
+        result.unit = "joule"
+        result.value = 2.0
+        return result
+
+    def particle_forces(self):
+        """Returns multiple force configurations as a list. Has to load the
+        entire list into memory. You should avoid loading very big files
+        unnecessarily into memory. See the function 'particle_position()' for
+        an example of how to avoid loading the entire file into memory.
+        """
+        result = Result()
+        result.unit = "newton"
+        xyz_string = self.parser.get_file_contents("forces")
+        forces = []
+        i_forces = []
+        for line in xyz_string.split('\n'):
+            line = line.strip()
+            if not line:
+                continue
+            if line.startswith("i"):
+                if i_forces:
+                    forces.append(np.array(i_forces))
+                    i_forces = []
+                continue
+            elif line.startswith("2"):
+                continue
+            else:
+                i_forces.append([float(x) for x in line.split()[-3:]])
+        if i_forces:
+            forces.append(np.array(i_forces))
+        result.value_iterable = forces
+        return result
+
+    def particle_position(self):
+        """An example of a function returning a generator. This function does
+        not load the whole position file into memory, but goes through it line
+        by line and returns configurations as soon as they are ready.
+        """
+
+        def position_generator():
+            """This inner function is the generator, a function that remembers
+            its state and can yield intermediate results.
+            """
+            xyz_file = self.parser.get_file_handle("positions")
+            i_positions = []
+            for line in xyz_file:
+                line = line.strip()
+                if not line:
+                    continue
+                if line.startswith("i"):
+                    if i_positions:
+                        yield np.array(i_positions)
+                        i_positions = []
+                    continue
+                elif line.startswith("2"):
+                    continue
+                else:
+                    i_positions.append([float(x) for x in line.split()[-3:]])
+            if i_positions:
+                yield np.array(i_positions)
+
+        result = Result()
+        result.unit = "angstrom"
+        result.value_iterable = position_generator()
+        return result
+```
+
+The MyParser class decides which implementation to use based on e.g. the
+software version number that is available on one of the input files. New
+implementations corresponding to other software versions can then be easily
+defined and they can also use the functionality of another implementation by
+subclassing. Example:
+
+```python
+class MyParserImplementation2(MyParserImplementation1):
+    """Implementation for a different version of the electronic structure
+    software. Subclasses MyParserImplementation1. In this version the
+    energy unit has changed and the 'energy_total' function from
+    MyParserImplementation1 is overridden.
+    """
+
+    def energy_total(self):
+        """The energy unit has changed in this version."""
+        result = Result()
+        result.unit = "hartree"
+        result.value = 2.0
+        return result
+```
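Selecting an implementation by version and reusing another implementation through subclassing can also be done with an explicit registry instead of a lookup through `globals()`. The sketch below is one possible arrangement, not the NomadParser API: the class names, the `IMPLEMENTATIONS` dict and `select_implementation` are hypothetical, and plain dicts stand in for Result objects.

```python
class ImplementationV1(object):
    """Hypothetical implementation for version 1 of the parsed code."""

    def energy_total(self):
        return {"unit": "joule", "value": 2.0}


class ImplementationV2(ImplementationV1):
    """Reuses version 1 and overrides only what changed."""

    def energy_total(self):
        return {"unit": "hartree", "value": 2.0}


# Explicit registry instead of a lookup through globals().
IMPLEMENTATIONS = {"1": ImplementationV1, "2": ImplementationV2}


def select_implementation(version):
    """Return an implementation instance for a version string."""
    try:
        return IMPLEMENTATIONS[version]()
    except KeyError:
        raise ValueError("Unsupported version: {}".format(version))
```

An unknown version then fails with a clear error instead of a `KeyError` on a generated class name.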
+MyParser could now be used as follows:
+
+```python
+    input_json = """{
+        "metaInfoFile": "metainfo.json",
+        "tmpDir": "/home",
+        "metainfoToKeep": [],
+        "metainfoToSkip": [],
+        "files": {
+            "forces.xyz": "forces",
+            "positions.xyz": "positions"
+        }
+    }
+    """
+
+    parser = MyParser(input_json)
+    parser.parse()
+```
+
+The input JSON string is used to initialize the parser. The 'metaInfoFile'
+attribute contains the metainfo definitions used by the parser. From this file
+the parser can determine the type, shape and existence of metainfo
+definitions.
+
+The 'files' object contains all the files that are given to the parser. The
+attribute names are the file paths and their values are optional ids. The ids
+are not typically given and they have to be assigned by using the
+setup_file_id() function of NomadParser. Assigning ids helps to manage the
+files.
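The layout of the 'files' object can be inspected with the standard json module. The sketch below only decodes the input and separates files that already have an id from those that do not. Representing a missing id as `null` and the extra `run.out` entry are assumptions made for illustration, and the sketch does not call setup_file_id().

```python
import json

input_json = """{
    "metaInfoFile": "metainfo.json",
    "tmpDir": "/home",
    "metainfoToKeep": [],
    "metainfoToSkip": [],
    "files": {
        "forces.xyz": "forces",
        "positions.xyz": "positions",
        "run.out": null
    }
}"""

settings = json.loads(input_json)

# Map each id to its path and collect the files that still need an id.
paths_by_id = {}
files_without_id = []
for path, file_id in settings["files"].items():
    if file_id:
        paths_by_id[file_id] = path
    else:
        files_without_id.append(path)
```

The entries left in `files_without_id` are the ones that would still need an id assigned before they can be looked up by id.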
--- a/cp2kparser/setup.py
+++ b/cp2kparser/setup.py
+from setuptools import setup
+
+
+#===============================================================================
+def main():
+    # Start package setup
+    setup(
+        name="cp2kparser",
+        version="0.1",
+        include_package_data=True,
+        package_data={
+            '': ['*.json', '*.pickle'],
+        },
+        description="NoMaD parser implementation for CP2K",
+        author="Lauri Himanen",
+        author_email="lauri.himanen@gmail.com",
+        license="GPL3",
+        packages=["cp2kparser"],
+        install_requires=[
+            'pint',
+            'numpy',
+            'ase'
+        ],
+        zip_safe=False
+    )
+
+# Run main function by default
+if __name__ == "__main__":
+    main()
--- a/nomadanalysis/README.md
+++ b/nomadanalysis/README.md
+# Nomad Analysis
--- a/nomadanalysis/nomadanalysis.egg-info/PKG-INFO
+++ b/nomadanalysis/nomadanalysis.egg-info/PKG-INFO
+Metadata-Version: 1.0
+Name: nomadanalysis
+Version: 0.1
+Summary: Tools for analysing calculation results parsed by NOMAD parsers.
+Home-page: UNKNOWN
+Author: Lauri Himanen
+Author-email: lauri.himanen@gmail.com
+License: GPL3
+Description: UNKNOWN
+Platform: UNKNOWN
--- a/nomadanalysis/nomadanalysis.egg-info/SOURCES.txt
+++ b/nomadanalysis/nomadanalysis.egg-info/SOURCES.txt
+setup.py
+nomadanalysis/__init__.py
+nomadanalysis/analyzer.py
+nomadanalysis.egg-info/PKG-INFO
+nomadanalysis.egg-info/SOURCES.txt
+nomadanalysis.egg-info/dependency_links.txt
+nomadanalysis.egg-info/not-zip-safe
+nomadanalysis.egg-info/requires.txt
+nomadanalysis.egg-info/top_level.txt
\ No newline at end of file
--- a/nomadanalysis/nomadanalysis.egg-info/dependency_links.txt
+++ b/nomadanalysis/nomadanalysis.egg-info/dependency_links.txt
+
--- a/nomadanalysis/nomadanalysis.egg-info/not-zip-safe
+++ b/nomadanalysis/nomadanalysis.egg-info/not-zip-safe
+
--- a/nomadanalysis/nomadanalysis.egg-info/requires.txt
+++ b/nomadanalysis/nomadanalysis.egg-info/requires.txt
+pint
+numpy
\ No newline at end of file
--- a/nomadanalysis/nomadanalysis.egg-info/top_level.txt
+++ b/nomadanalysis/nomadanalysis.egg-info/top_level.txt
+nomadanalysis
--- a/nomadanalysis/nomadanalysis/__init__.py
+++ b/nomadanalysis/nomadanalysis/__init__.py
+#! /usr/bin/env python
+
+# This will activate the logging utilities for nomadanalysis
+from .utils import log
+
+# Import the common classes here for less typing
+from .analyzer import Analyzer
--- a/nomadanalysis/nomadanalysis/analyzer.py
+++ b/nomadanalysis/nomadanalysis/analyzer.py
+import sys
+import logging
+from nomadcore.local_meta_info import loadJsonFile
+from nomadcore.parser_backend import JsonParseEventsWriterBackend
+from nomadanalysis.local_backend import LocalBackend
+
+
+logger = logging.getLogger(__name__)
+
+
+class Analyzer(object):
+    def __init__(self, parser=None):
+        self.parser = parser
+
+    def parse(self):
+        if not self.parser:
+            logger.error("A parser hasn't been defined.")
+            return None
+        self.parser.parse()
+
+        return self.parser.parser_context.backend.results
+
+
+if __name__ == "__main__":
+
+    # Initialize backend
+    metainfo_path = "/home/lauri/Dropbox/nomad-dev/nomad-meta-info/meta_info/nomad_meta_info/cp2k.nomadmetainfo.json"
+    metainfoenv, warnings = loadJsonFile(metainfo_path)
+    backend = LocalBackend(metainfoenv)
+    # backend = JsonParseEventsWriterBackend(metainfoenv, sys.stdout)
+
+    # Initialize parser
+    from cp2kparser import CP2KParser
+    dirpath = "/home/lauri/Dropbox/nomad-dev/parser-cp2k/cp2kparser/cp2kparser/tests/cp2k_2.6.2/forces/outputfile/n"
+    parser = CP2KParser(dirpath=dirpath, metainfo_path=metainfo_path, backend=backend)
+
+    # Initialize analyzer
+    analyser = Analyzer(parser)
+    results = analyser.parse()
+
+    # Get Results
+    xc = results["XC_functional"]
+    # temps = results["cp2k_md_temperature_instantaneous"]
+    print(xc.values)
--- a/nomadanalysis/nomadanalysis/examples/1_basics.py
+++ b/nomadanalysis/nomadanalysis/examples/1_basics.py
+from nomadanalysis import Analyzer
+from cp2kparser import CP2KParser
+
+# Initialize the parser you want to use
+parser = CP2KParser()
+parser.dirpath = "/home/lauri/Dropbox/nomad-dev/parser-cp2k/cp2kparser/cp2kparser/tests/cp2k_2.6.2/forces/outputfile/n"
+parser.metainfo_to_keep = ["section_run"]
+
+# Initialize the analyzer
+analyzer = Analyzer(parser)
+results = analyzer.parse()
--- a/nomadanalysis/nomadanalysis/local_backend.py
+++ b/nomadanalysis/nomadanalysis/local_backend.py
+import StringIO
+
+
+class LocalBackend(object):
+
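+    def __init__(self, metaInfoEnv, fileOut=None):
+        self.__metaInfoEnv = metaInfoEnv
+        # Create the default stream per instance; a default argument would be
+        # evaluated once and shared between all LocalBackend instances.
+        self.fileOut = fileOut if fileOut is not None else StringIO.StringIO()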
+        self.__gIndex = -1
+        self.__openSections = set()
+        self.__writeComma = False
+        self.__lastIndex = {}
+        self.results = {}
+        self.stats = {}
+
+    def openSection(self, metaName):
+        """opens a new section and returns its new unique gIndex"""
+        newIndex = self.__lastIndex.get(metaName, -1) + 1
+        self.openSectionWithGIndex(metaName, newIndex)
+        return newIndex
+
+    def openSectionWithGIndex(self, metaName, gIndex):
+        """opens a new section where gIndex is generated externally
+        gIndex should be unique (no reopening of a closed section)"""
+        self.__lastIndex[metaName] = gIndex
+        self.__openSections.add((metaName, gIndex))
+        self.__jsonOutput({"event":"openSection", "metaName":metaName, "gIndex":gIndex})
+
+    def __jsonOutput(self, dic):
+        pass
+
+    def closeSection(self, metaName, gIndex):
+        self.__openSections.remove((metaName, gIndex))
+
+    def addValue(self, metaName, value, gIndex=-1):
+        if self.results.get(metaName) is None:
+            self.results[metaName] = Result()
+        self.results[metaName].values.append(value)
+
+    def addRealValue(self, metaName, value, gIndex=-1):
+        if self.results.get(metaName) is None:
+            self.results[metaName] = Result()
+        self.results[metaName].values.append(value)
+
+    def addArrayValues(self, metaName, values, gIndex=-1):
+        if self.results.get(metaName) is None:
+            self.results[metaName] = Result()
+        self.results[metaName].arrayValues.append(values)
+
+    def metaInfoEnv(self):
+        return self.__metaInfoEnv
+
+    def startedParsingSession(self, mainFileUri, parserInfo, parsingStatus = None, parsingErrors = None):
+        pass
+
+    def finishedParsingSession(self, parsingStatus, parsingErrors, mainFileUri = None, parserInfo = None):
+        pass
+
+
+class Result(object):
+    def __init__(self):
+        self.values = []
+        self.arrayValues = []
--- a/nomadanalysis/nomadanalysis/utils/__init__.py
+++ b/nomadanalysis/nomadanalysis/utils/__init__.py
--- a/nomadanalysis/nomadanalysis/utils/log.py
+++ b/nomadanalysis/nomadanalysis/utils/log.py
+"""
+This module is used to control the logging in the nomad analysis package.
+
+Each module in the package can have its own logger, so that you can control
+the logging on a modular level easily.
+
+If you want to use a logger on a module simply add the following in the module
+preamble:
+    import logging
+    logger = logging.getLogger(__name__)
+
+This creates a logger with a hierarchical name. The hierarchical name allows
+the logger to inherit logger properties from a parent logger, but also allows
+module level control for logging.
+
+A custom formatting is also used for the log messages. The formatting is done
+by the LogFormatter class and is different for different levels.
+"""
+import logging
+import textwrap
+
+
+#===============================================================================
+class LogFormatter(logging.Formatter):
+
+    def format(self, record):
+        level = record.levelname
+        module = record.module
+        message = record.getMessage()
+
+        if level == "INFO" or level == "DEBUG":
+            return make_titled_message("{}:{}".format(level, module), message)
+        else:
+            return "\n        " + make_title(level, width=64) + "\n" + make_message(message, width=64, spaces=8) + "\n"
+
+
+#===============================================================================
+def make_titled_message(title, message, width=80):
+    """Styles a message to be printed into console.
+    """
+    wrapper = textwrap.TextWrapper(width=width-5)
+    lines = wrapper.wrap(message)
+    styled_message = ""
+    first = True
+    for line in lines:
+        if first:
+            new_line = "  >> {}: ".format(title) + line
+            styled_message += new_line
+            first = False
+        else:
+            new_line = 5*" " + line
+            styled_message += "\n" + new_line
+
+    return styled_message
+
+
+#===============================================================================
+def make_message(message, width=80, spaces=0):
+    """Styles a message to be printed into console.
+    """
+    wrapper = textwrap.TextWrapper(width=width-6)
+    lines = wrapper.wrap(message)
+    styled_message = ""
+    first = True
+    for line in lines:
+        new_line = spaces*" " + "|  " + line + (width-6-len(line))*" " + "  |"
+        if first:
+            styled_message += new_line
+            first = False
+        else:
+            styled_message += "\n" + new_line
+    styled_message += "\n" + spaces*" " + "|" + (width-2)*"-" + "|"
+    return styled_message
+
+
+#===============================================================================
+def make_title(title, width=80):
+    """Styles a title to be printed into console.
+    """
+    space = width-len(title)-4
+    pre_space = space // 2 - 1
+    post_space = space-pre_space
+    line = "|" + str((pre_space)*"=") + " "
+    line += title
+    line += " " + str((post_space)*"=") + "|"
+    return line
+
+
+#===============================================================================
+# The highest level logger setup
+root_logger = logging.getLogger("nomadanalysis")
+root_logger.setLevel(logging.INFO)
+
+# Create console handler and set level to debug
+root_console_handler = logging.StreamHandler()
+root_console_handler.setLevel(logging.DEBUG)
+root_console_formatter = LogFormatter()
+root_console_handler.setFormatter(root_console_formatter)
+root_logger.addHandler(root_console_handler)
--- a/nomadanalysis/setup.py
+++ b/nomadanalysis/setup.py
+from setuptools import setup
+
+
+#===============================================================================
+def main():
+    # Start package setup
+    setup(
+        name="nomadanalysis",
+        version="0.1",
+        description="Tools for analysing calculation results parsed by NOMAD parsers.",
+        author="Lauri Himanen",
+        author_email="lauri.himanen@gmail.com",
+        license="GPL3",
+        packages=["nomadanalysis"],
+        install_requires=[
+            'pint',
+            'numpy',
+        ],
+        zip_safe=False
+    )
+
+# Run main function by default
+if __name__ == "__main__":
+    main()