diff --git a/cp2kparser/README.md b/cp2kparser/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2ed3ac766aa0696d18ebcaa87d9e4f05ee050e83
--- /dev/null
+++ b/cp2kparser/README.md
@@ -0,0 +1,148 @@
# CP2K NoMaD Parser

# QuickStart
- Clone the repository:

  ```shell
  git clone git@gitlab.mpcdf.mpg.de:nomad-lab/parser-cp2k.git
  ```

- Install the package by running the setup.py script. For a local, user-specific
  install without sudo permissions use (omit --user for a system-wide install):

  ```shell
  python setup.py install --user
  ```

- You can check that everything works by running the test script in the tests
  folder:

  ```shell
  cd cp2kparser/tests/cp2k_2.6.2
  python run_tests.py
  ```

- If you want to try out parsing a custom CP2K calculation, place all the
  relevant output and input files inside a common directory and run the
  following command within that folder:

  ```shell
  python -m cp2kparser
  ```

# Structure
Currently the Python package is divided into three subpackages:
- Engines: Classes for parsing different types of files
- Generics: Generic utility classes and base classes
- Implementation: The classes that actually define the parser functionality

## Engines
All the "engines", i.e. the modules that parse a certain type of file, are
reusable in other parsers. They could be put into a common repository where
other developers can improve and extend them. One should also write tests for
the engines that validate their behaviour and ease performance analysis.

The engine classes also work as interfaces. You can change the engine behaviour
while maintaining the same API in the parsers. For example, one might improve
the performance of an engine, but as long as the function calls remain the same
no other code has to be changed.

Currently implemented engines that could be reused (not tested properly yet):
- AtomsEngine: For reading various atomic coordinate files. Currently uses ASE
  to read the files.
- RegexEngine: For parsing text files with regular expressions. Uses the re2
  library if available and falls back to the standard Python regex module if
  re2 is not found.
- CSVEngine: For parsing CSV-like files. Very flexible: you can specify comment
  markers, column delimiters, column indices and the patterns used to separate
  different configurations.
- XMLEngine: For parsing XML files using XPath syntax.

## Generics
The generics folder contains a module called nomadparser.py that defines a
class called NomadParser. This acts as a base class for the CP2K parser defined
in the implementation folder.

The NomadParser class defines the interface which is eventually used by e.g.
the Scala code (it will be modified later to conform to the common interface).
This class is also responsible for some common tasks that are present in all
parsers:

- Unit conversion
- JSON encoding
- Caching
- Time measurement for performance analysis
- Providing file contents, sizes and handles

# Tools and Methods

The following is a list of tools and methods that can help the development
process.

## Documentation
The [google style guide](https://google.github.io/styleguide/pyguide.html?showone=Comments#Comments)
provides a good template for documenting your code. Documenting makes it much
easier to follow the logic behind your parser.
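
For example, a function documented in the Google style looks roughly like this
(the function itself is made up purely for illustration):

```python
def scale_energy(energy, factor=1.0):
    """Scale an energy value by a constant factor.

    Args:
        energy: The energy to scale, as a float in joule.
        factor: A dimensionless scaling factor.

    Returns:
        The scaled energy as a float in joule.
    """
    return energy * factor
```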
## Logging
Python has a great [logging
package](https://docs.python.org/2/library/logging.html) which helps in
following the program flow and in catching different errors and warnings. In
cp2kparser the file cp2kparser/generics/logconfig.py defines the behaviour of
the logger. There you can set up the log levels even at a modular level. A more
easily readable formatting is also provided for the log messages.

## Testing
The parsers can become quite complicated and maintaining them without
systematic testing can become troublesome. Unit tests provide one way to test
each parseable quantity, and Python has a very good [library for unit
testing](https://docs.python.org/2/library/unittest.html). When the parser
supports a new quantity it is quite fast to create unit tests for it. These
tests will validate the parsing and also easily detect bugs that may arise when
the code is modified in the future.

## Unit conversion
You can find unit conversion tools in the python-common repository and its
nomadcore package. The unit conversion is currently done with
[Pint](https://pint.readthedocs.org/en/0.6/), which has a very natural syntax,
support for numpy arrays and an easily reconfigurable constant/unit declaration
mechanism.

## Profiling
The parsers have to be reasonably fast. For some codes there is already a
significant amount of data in the NoMaD repository, and the time taken to parse
it will depend on the performance of the parser. Also, each time the parser
evolves after system deployment, the existing data may have to be reparsed at
least partially.

By profiling which functions take the most computational time and memory during
parsing you can identify the bottlenecks in the parser. There are already
existing profiling tools such as
[cProfile](https://docs.python.org/2/library/profile.html#module-cProfile)
which you can plug into your scripts very easily; a short sketch is given at
the end of this file.

# Manual for uploading a CP2K calculation
The print level (GLOBAL/PRINT_LEVEL) of a CP2K run will affect how much
information can be parsed from it. Try to use print level MEDIUM or above to
get the best parsing results.

All the files that are needed to run the calculation should be included in the
upload, including the basis set and potential files. The folder structure does
not matter, as the whole directory is searched for relevant files.

Although CP2K often doesn't care about the file extensions, using them enables
the parser to automatically identify the files and makes it perform better
(only part of the files needs to be decompressed from HDF5). Please use these
default file extensions:
 - Output file: .out (only one)
 - Input file: .inp (only one; if you have "include" files, use some other extension, e.g. .inc)
 - XYZ coordinate files: .xyz
 - Protein Data Bank files: .pdb
 - Crystallographic Information Files: .cif

# Notes for CP2K developers
Here is a list of features/fixes that would make the parsing of CP2K results
easier:
 - The PDB trajectory output doesn't seem to conform to the actual standard, as
   the different configurations are separated by the END keyword, which is
   supposed to be written only once in the file. The [format
   specification](http://www.wwpdb.org/documentation/file-format) states that
   different configurations should start with MODEL and end with ENDMDL tags.
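
# Example: profiling a parser run
As a rough sketch of the cProfile usage mentioned in the Profiling section, the
snippet below profiles a single parser run and prints the twenty most expensive
calls. The run_parser() function is only a placeholder for whatever entry point
your parser exposes; it is not part of this package.

```python
import cProfile
import pstats


def run_parser():
    """Placeholder for the actual parsing entry point of your parser."""
    pass


# Collect profiling data for one parser run and store it in a file.
cProfile.run("run_parser()", "parser.profile")

# Print the 20 functions with the largest cumulative time.
stats = pstats.Stats("parser.profile")
stats.sort_stats("cumulative").print_stats(20)
```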
diff --git a/cp2kparser/library.md b/cp2kparser/library.md
new file mode 100644
index 0000000000000000000000000000000000000000..87c619810628425252ede025044fd5594308df45
--- /dev/null
+++ b/cp2kparser/library.md
@@ -0,0 +1,186 @@
## More Complex Parsing Scenarios

The utilities in simple_parser.py can be used alone to make a parser in many
cases. The SimpleMatchers provide a very nice declarative way to define the
parsing process and take care of unit conversion and of pushing the results to
the Scala layer.

Still, you may find it useful to have additional help in handling more complex
scenarios. During parser development you may encounter these questions:
 - How to manage different versions of the parsed code that may even have a completely different syntax?
 - How to handle multiple files?
 - How to integrate all this with the functionality that is provided in simple_parser.py?

The NomadParser class is meant to help in structuring your code. It uses the
same input and output format as the mainFunction in simple_parser.py. Here is a
minimal example of a parser that subclasses NomadParser:

```python
class MyParser(NomadParser):
    """
    This class is responsible for setting up the actual parser implementation
    and provides the input and output for parsing. It inherits the NomadParser
    class to get access to many useful features.
    """
    def __init__(self, input_json_string):
        NomadParser.__init__(self, input_json_string)
        self.version = None
        self.implementation = None
        self.setup_version()

    def setup_version(self):
        """The parsers should be able to support different versions of the
        same code. In this function you can determine which version of the
        software we're dealing with and initialize the correct implementation
        accordingly.
        """
        self.version = "1"
        self.implementation = globals()["MyParserImplementation{}".format(self.version)](self)

    def parse(self, name):
        """After the version has been identified and an implementation has
        been set up, you can start parsing. The quantity to parse is requested
        by its name.
        """
        return getattr(self.implementation, name)()
```
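
In a real parser, setup_version() would typically inspect one of the given
files to figure out which version of the code produced them. A minimal sketch
of that idea is shown below; the file id "output" and the version pattern are
assumptions made only for this illustration:

```python
import re


class MyVersionedParser(MyParser):
    """A variation of MyParser that reads the version number from the main
    output file instead of hard-coding it.
    """
    def setup_version(self):
        # Both the file id "output" and the regular expression are
        # illustrative; adapt them to the code you are parsing.
        output = self.get_file_contents("output")
        match = re.search(r"VERSION\s+(\d+)", output)
        self.version = match.group(1) if match else "1"
        self.implementation = globals()[
            "MyParserImplementation{}".format(self.version)](self)
```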
The class MyParser only defines how to set up a parser based on the given
input. The actual dirty work is done by a parser implementation class.
NomadParser does not enforce any specific style for the implementation. By
wrapping the results in a Result object, you get the automatic unit conversion
and JSON backend support. A very minimal example of a parser implementation
class:

```python
import numpy as np


class MyParserImplementation1(object):
    """This is an implementation class that contains the actual parsing logic
    for a certain software version. There can be multiple implementation
    classes and MyParser decides which one to use.
    """
    supported_quantities = ["energy_total", "particle_forces", "particle_position"]

    def __init__(self, parser):
        self.parser = parser

    def energy_total(self):
        """Returns the total energy. Used to illustrate how to parse a single
        result.
        """
        result = Result()
        result.unit = "joule"
        result.value = 2.0
        return result

    def particle_forces(self):
        """Returns multiple force configurations as a list. Has to load the
        entire list into memory. You should avoid loading very big files
        unnecessarily into memory. See the function 'particle_position()' for
        an example of how to avoid loading the entire file into memory.
        """
        result = Result()
        result.unit = "newton"
        xyz_string = self.parser.get_file_contents("forces")
        forces = []
        i_forces = []
        for line in xyz_string.split('\n'):
            line = line.strip()
            if not line:
                continue
            if line.startswith("i"):
                # A new frame starts at the comment line: store the forces
                # gathered for the previous frame.
                if i_forces:
                    forces.append(np.array(i_forces))
                    i_forces = []
                continue
            elif line.startswith("2"):
                # Skip the atom count line (2 atoms in this example file).
                continue
            else:
                i_forces.append([float(x) for x in line.split()[-3:]])
        if i_forces:
            forces.append(np.array(i_forces))
        result.value_iterable = forces
        return result

    def particle_position(self):
        """An example of a function returning a generator. This function does
        not load the whole position file into memory, but goes through it line
        by line and returns configurations as soon as they are ready.
        """

        def position_generator():
            """This inner function is the generator, a function that remembers
            its state and can yield intermediate results.
            """
            xyz_file = self.parser.get_file_handle("positions")
            i_positions = []
            for line in xyz_file:
                line = line.strip()
                if not line:
                    continue
                if line.startswith("i"):
                    if i_positions:
                        yield np.array(i_positions)
                        i_positions = []
                    continue
                elif line.startswith("2"):
                    continue
                else:
                    i_positions.append([float(x) for x in line.split()[-3:]])
            if i_positions:
                yield np.array(i_positions)

        result = Result()
        result.unit = "angstrom"
        result.value_iterable = position_generator()
        return result
```

The MyParser class decides which implementation to use based on e.g. the
software version number that is available in one of the input files. New
implementations corresponding to other software versions can then be easily
defined, and they can also reuse the functionality of another implementation by
subclassing it. Example:

```python
class MyParserImplementation2(MyParserImplementation1):
    """Implementation for a different version of the electronic structure
    software. Subclasses MyParserImplementation1. In this version the energy
    unit has changed, so the 'energy_total' function from
    MyParserImplementation1 is overridden.
    """

    def energy_total(self):
        """The energy unit has changed in this version."""
        result = Result()
        result.unit = "hartree"
        result.value = 2.0
        return result
```

MyParser could now be used as follows:

```python
input_json = """{
    "metaInfoFile": "metainfo.json",
    "tmpDir": "/home",
    "metainfoToKeep": [],
    "metainfoToSkip": [],
    "files": {
        "forces.xyz": "forces",
        "positions.xyz": "positions"
    }
}
"""

parser = MyParser(input_json)
parser.parse("energy_total")
```

The input JSON string is used to initialize the parser. The 'metaInfoFile'
attribute points to the file with the metainfo definitions used by the parser.
From this file the parser can determine the type, shape and existence of the
metainfo definitions.

The 'files' object contains all the files that are given to the parser. The
attribute names are the file paths and their values are optional ids. The ids
are typically not given, and they then have to be assigned by using the
setup_file_id() function of NomadParser. Assigning ids helps to manage the
files.
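
If the ids had not been given in the 'files' object above, assigning them could
look roughly like the snippet below. The argument order of setup_file_id() is
an assumption made for this sketch; check the NomadParser sources for the
actual signature.

```python
parser = MyParser(input_json)

# Associate each file path with the id that the implementation classes use.
# The argument order (path, id) is assumed here for illustration.
parser.setup_file_id("forces.xyz", "forces")
parser.setup_file_id("positions.xyz", "positions")

parser.parse("energy_total")
```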
diff --git a/cp2kparser/setup.py b/cp2kparser/setup.py new file mode 100644 index 0000000000000000000000000000000000000000..f0459abb5fb3ad50b03b1137cd9d003dc46d177a --- /dev/null +++ b/cp2kparser/setup.py @@ -0,0 +1,29 @@ +from setuptools import setup + + +#=============================================================================== +def main(): + # Start package setup + setup( + name="cp2kparser", + version="0.1", + include_package_data=True, + package_data={ + '': ['*.json', '*.pickle'], + }, + description="NoMaD parser implementation for CP2K", + author="Lauri Himanen", + author_email="lauri.himanen@gmail.com", + license="GPL3", + packages=["cp2kparser"], + install_requires=[ + 'pint', + 'numpy', + 'ase' + ], + zip_safe=False + ) + +# Run main function by default +if __name__ == "__main__": + main() diff --git a/nomadanalysis/README.md b/nomadanalysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bf317aae64b6cac7ab8a214a5929ff56b5528334 --- /dev/null +++ b/nomadanalysis/README.md @@ -0,0 +1 @@ +# Nomad Analysis diff --git a/nomadanalysis/nomadanalysis.egg-info/PKG-INFO b/nomadanalysis/nomadanalysis.egg-info/PKG-INFO new file mode 100644 index 0000000000000000000000000000000000000000..082b26ce93cbb1db21cb8dec971a1a6230a7994f --- /dev/null +++ b/nomadanalysis/nomadanalysis.egg-info/PKG-INFO @@ -0,0 +1,10 @@ +Metadata-Version: 1.0 +Name: nomadanalysis +Version: 0.1 +Summary: Tools for analysing calculation results parsed by NOMAD parsers. +Home-page: UNKNOWN +Author: Lauri Himanen +Author-email: lauri.himanen@gmail.com +License: GPL3 +Description: UNKNOWN +Platform: UNKNOWN diff --git a/nomadanalysis/nomadanalysis.egg-info/SOURCES.txt b/nomadanalysis/nomadanalysis.egg-info/SOURCES.txt new file mode 100644 index 0000000000000000000000000000000000000000..e5e018a61ce1792c406b8bdacc2d80744509166b --- /dev/null +++ b/nomadanalysis/nomadanalysis.egg-info/SOURCES.txt @@ -0,0 +1,9 @@ +setup.py +nomadanalysis/__init__.py +nomadanalysis/analyzer.py +nomadanalysis.egg-info/PKG-INFO +nomadanalysis.egg-info/SOURCES.txt +nomadanalysis.egg-info/dependency_links.txt +nomadanalysis.egg-info/not-zip-safe +nomadanalysis.egg-info/requires.txt +nomadanalysis.egg-info/top_level.txt \ No newline at end of file diff --git a/nomadanalysis/nomadanalysis.egg-info/dependency_links.txt b/nomadanalysis/nomadanalysis.egg-info/dependency_links.txt new file mode 100644 index 0000000000000000000000000000000000000000..8b137891791fe96927ad78e64b0aad7bded08bdc --- /dev/null +++ b/nomadanalysis/nomadanalysis.egg-info/dependency_links.txt @@ -0,0 +1 @@ + diff --git a/nomadanalysis/nomadanalysis.egg-info/not-zip-safe b/nomadanalysis/nomadanalysis.egg-info/not-zip-safe new file mode 100644 index 0000000000000000000000000000000000000000..8b137891791fe96927ad78e64b0aad7bded08bdc --- /dev/null +++ b/nomadanalysis/nomadanalysis.egg-info/not-zip-safe @@ -0,0 +1 @@ + diff --git a/nomadanalysis/nomadanalysis.egg-info/requires.txt b/nomadanalysis/nomadanalysis.egg-info/requires.txt new file mode 100644 index 0000000000000000000000000000000000000000..e076fe33b652b52ea3d15ae9d9261213ae377e52 --- /dev/null +++ b/nomadanalysis/nomadanalysis.egg-info/requires.txt @@ -0,0 +1,2 @@ +pint +numpy \ No newline at end of file diff --git a/nomadanalysis/nomadanalysis.egg-info/top_level.txt b/nomadanalysis/nomadanalysis.egg-info/top_level.txt new file mode 100644 index 0000000000000000000000000000000000000000..5f449b7a78769ec29b0e4b9c3f604f2d09b79cb7 --- /dev/null +++ 
b/nomadanalysis/nomadanalysis.egg-info/top_level.txt @@ -0,0 +1 @@ +nomadanalysis diff --git a/nomadanalysis/nomadanalysis/__init__.py b/nomadanalysis/nomadanalysis/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..786cde3164857ccb25fc2d850bfe4c3c3eb7cf6f --- /dev/null +++ b/nomadanalysis/nomadanalysis/__init__.py @@ -0,0 +1,7 @@ +#! /usr/bin/env python + +# This will activate the logging utilities for nomadanalysis +import utils.log + +# Import the common classes here for less typing +from .analyzer import Analyzer diff --git a/nomadanalysis/nomadanalysis/analyzer.py b/nomadanalysis/nomadanalysis/analyzer.py new file mode 100644 index 0000000000000000000000000000000000000000..de1336fd899c8037654fae1481e96f3897cd40df --- /dev/null +++ b/nomadanalysis/nomadanalysis/analyzer.py @@ -0,0 +1,43 @@ +import sys +import logging +from nomadcore.local_meta_info import loadJsonFile +from nomadcore.parser_backend import JsonParseEventsWriterBackend +from nomadanalysis.local_backend import LocalBackend + + +logger = logging.getLogger(__name__) + + +class Analyzer(object): + def __init__(self, parser=None): + self.parser = parser + + def parse(self): + if not self.parser: + logger.error("A parser hasn't been defined.") + self.parser.parse() + + return self.parser.parser_context.backend.results + + +if __name__ == "__main__": + + # Initialize backend + metainfo_path = "/home/lauri/Dropbox/nomad-dev/nomad-meta-info/meta_info/nomad_meta_info/cp2k.nomadmetainfo.json" + metainfoenv, warnings = loadJsonFile(metainfo_path) + backend = LocalBackend(metainfoenv) + # backend = JsonParseEventsWriterBackend(metainfoenv, sys.stdout) + + # Initialize parser + from cp2kparser import CP2KParser + dirpath = "/home/lauri/Dropbox/nomad-dev/parser-cp2k/cp2kparser/cp2kparser/tests/cp2k_2.6.2/forces/outputfile/n" + parser = CP2KParser(dirpath=dirpath, metainfo_path=metainfo_path, backend=backend) + + # Initialize analyzer + analyser = Analyzer(parser) + results = analyser.parse() + + # Get Results + xc = results["XC_functional"] + # temps = results["cp2k_md_temperature_instantaneous"] + print xc.values diff --git a/nomadanalysis/nomadanalysis/examples/1_basics.py b/nomadanalysis/nomadanalysis/examples/1_basics.py new file mode 100644 index 0000000000000000000000000000000000000000..1592424c0d7862e37341ec2356ee9b5799b7a70f --- /dev/null +++ b/nomadanalysis/nomadanalysis/examples/1_basics.py @@ -0,0 +1,11 @@ +from nomadanalysis import Analyzer +from cp2kparser import CP2KParser + +# Initialize the parser you want to use +parser = CP2KParser() +parser.dirpath = "/home/lauri/Dropbox/nomad-dev/parser-cp2k/cp2kparser/cp2kparser/tests/cp2k_2.6.2/forces/outputfile/n" +parser.metainto_to_keep = ["section_run"] + +# Initialize the analyzer +analyzer = Analyzer(parser) +results = analyzer.parse() diff --git a/nomadanalysis/nomadanalysis/local_backend.py b/nomadanalysis/nomadanalysis/local_backend.py new file mode 100644 index 0000000000000000000000000000000000000000..b8fa931fedc090fbf865ea9ef6cace589d3bfee5 --- /dev/null +++ b/nomadanalysis/nomadanalysis/local_backend.py @@ -0,0 +1,63 @@ +import StringIO + + +class LocalBackend(object): + + def __init__(self, metaInfoEnv, fileOut=StringIO.StringIO()): + self.__metaInfoEnv = metaInfoEnv + self.fileOut = fileOut + self.__gIndex = -1 + self.__openSections = set() + self.__writeComma = False + self.__lastIndex = {} + self.results = {} + self.stats = {} + + def openSection(self, metaName): + """opens a new section and returns its new unique gIndex""" + newIndex 
= self.__lastIndex.get(metaName, -1) + 1 + self.openSectionWithGIndex(metaName, newIndex) + return newIndex + + def openSectionWithGIndex(self, metaName, gIndex): + """opens a new section where gIndex is generated externally + gIndex should be unique (no reopening of a closed section)""" + self.__lastIndex[metaName] = gIndex + self.__openSections.add((metaName, gIndex)) + self.__jsonOutput({"event":"openSection", "metaName":metaName, "gIndex":gIndex}) + + def __jsonOutput(self, dic): + pass + + def closeSection(self, metaName, gIndex): + self.__openSections.remove((metaName, gIndex)) + + def addValue(self, metaName, value, gIndex=-1): + if self.results.get(metaName) is None: + self.results[metaName] = Result() + self.results[metaName].values.append(value) + + def addRealValue(self, metaName, value, gIndex=-1): + if self.results.get(metaName) is None: + self.results[metaName] = Result() + self.results[metaName].values.append(value) + + def addArrayValues(self, metaName, values, gIndex=-1): + if self.results.get(metaName) is None: + self.results[metaName] = Result() + self.results[metaName].arrayValues.append(values) + + def metaInfoEnv(self): + return self.__metaInfoEnv + + def startedParsingSession(self, mainFileUri, parserInfo, parsingStatus = None, parsingErrors = None): + pass + + def finishedParsingSession(self, parsingStatus, parsingErrors, mainFileUri = None, parserInfo = None): + pass + + +class Result(object): + def __init__(self): + self.values = [] + self.arrayValues = [] diff --git a/nomadanalysis/nomadanalysis/utils/__init__.py b/nomadanalysis/nomadanalysis/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/nomadanalysis/nomadanalysis/utils/log.py b/nomadanalysis/nomadanalysis/utils/log.py new file mode 100644 index 0000000000000000000000000000000000000000..29b692c976db1a6c7524b77329c8538723ba0af6 --- /dev/null +++ b/nomadanalysis/nomadanalysis/utils/log.py @@ -0,0 +1,99 @@ +""" +This module is used to control the logging in the nomac analysis package. + +Each module in the package can have it's own logger, so that you can control +the logging on a modular level easily. + +If you want to use a logger on a module simply add the following in the module +preamble: + import logging + logger = logging.getLogger(__name__) + +This creates a logger with a hierarchical name. The hierarchical name allows +the logger to inherit logger properties from a parent logger, but also allows +module level control for logging. + +A custom formatting is also used for the log messages. The formatting is done +by the LogFormatter class and is different for different levels. +""" +import logging +import textwrap + + +#=============================================================================== +class LogFormatter(logging.Formatter): + + def format(self, record): + level = record.levelname + module = record.module + message = record.msg + + if level == "INFO" or level == "DEBUG": + return make_titled_message("{}:{}".format(level, module), message) + else: + return "\n " + make_title(level, width=64) + "\n" + make_message(message, width=64, spaces=8) + "\n" + + +#=============================================================================== +def make_titled_message(title, message, width=80): + """Styles a message to be printed into console. 
+ """ + wrapper = textwrap.TextWrapper(width=width-5) + lines = wrapper.wrap(message) + styled_message = "" + first = True + for line in lines: + if first: + new_line = " >> {}: ".format(title) + line + styled_message += new_line + first = False + else: + new_line = 5*" " + line + styled_message += "\n" + new_line + + return styled_message + + +#=============================================================================== +def make_message(message, width=80, spaces=0): + """Styles a message to be printed into console. + """ + wrapper = textwrap.TextWrapper(width=width-6) + lines = wrapper.wrap(message) + styled_message = "" + first = True + for line in lines: + new_line = spaces*" " + "| " + line + (width-6-len(line))*" " + " |" + if first: + styled_message += new_line + first = False + else: + styled_message += "\n" + new_line + styled_message += "\n" + spaces*" " + "|" + (width-2)*"-" + "|" + return styled_message + + +#=============================================================================== +def make_title(title, width=80): + """Styles a title to be printed into console. + """ + space = width-len(title)-4 + pre_space = space/2-1 + post_space = space-pre_space + line = "|" + str((pre_space)*"=") + " " + line += title + line += " " + str((post_space)*"=") + "|" + return line + + +#=============================================================================== +# The highest level logger setup +root_logger = logging.getLogger("nomadparser") +root_logger.setLevel(logging.INFO) + +# Create console handler and set level to debug +root_console_handler = logging.StreamHandler() +root_console_handler.setLevel(logging.DEBUG) +root_console_formatter = LogFormatter() +root_console_handler.setFormatter(root_console_formatter) +root_logger.addHandler(root_console_handler) diff --git a/nomadanalysis/setup.py b/nomadanalysis/setup.py new file mode 100644 index 0000000000000000000000000000000000000000..60a18a926af243389fb7c79bdcc4fb4cbc029bd6 --- /dev/null +++ b/nomadanalysis/setup.py @@ -0,0 +1,24 @@ +from setuptools import setup + + +#=============================================================================== +def main(): + # Start package setup + setup( + name="nomadanalysis", + version="0.1", + description="Tools for analysing calculation results parsed by NOMAD parsers.", + author="Lauri Himanen", + author_email="lauri.himanen@gmail.com", + license="GPL3", + packages=["nomadanalysis"], + install_requires=[ + 'pint', + 'numpy', + ], + zip_safe=False + ) + +# Run main function by default +if __name__ == "__main__": + main()