Python Interface to HDF5 Archive Files
The module "python-common/common/python/nomadcore/archive.py" provides access to HDF5 Archive files through a simple, Pythonic interface. Currently, only reading the original files is supported. Writing could be implemented if needed, but it has been left out for now to prevent accidental modification of the Archive files.
Here is a quick overview of the functionality:
- The ArchiveHDF5 class has a 'calculations' dictionary that holds all the separate calculations, keyed by their hashes. There is also a 'repositories' dictionary for accessing the calculations from a specific repository upload.
- The data inside a calculation can be accessed through a dictionary-like interface that also supports recursive search and indexing. All datasets are returned as numpy arrays. To get a specific dataset, provide an index for each section, e.g. calc["section_run:0/section_system:1/simulation_cell"]
- You can get a list of sections by leaving out the index, e.g. calc["section_run:0/section_system"] returns a list of the section_system sections that belong to the first section_run.
- The sections returned by queries are searchable with the same syntax, e.g. simulation_cell = calc["section_run:0"]["section_system:0"]["simulation_cell"]
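To make the query syntax above concrete, here is a minimal sketch of how such a path could be split into section names and optional indices. The helper `parse_query` is hypothetical and is not part of nomadcore; it only illustrates the "name:index/name" convention and makes no claim about how ArchiveHDF5 resolves queries internally.

```python
from typing import List, Optional, Tuple


def parse_query(path: str) -> List[Tuple[str, Optional[int]]]:
    """Split an archive-style path into (name, index) pairs.

    Hypothetical helper for illustration only; not part of nomadcore.
    An index of None means no index was given for that path element.
    """
    parts = []
    for token in path.strip("/").split("/"):
        # "section_system:1" -> ("section_system", ":", "1")
        # "simulation_cell"  -> ("simulation_cell", "", "")
        name, sep, index = token.partition(":")
        parts.append((name, int(index) if sep else None))
    return parts


# A fully indexed path that points at a single dataset.
indexed = parse_query("section_run:0/section_system:1/simulation_cell")

# Leaving out the last index asks for every section_system in section_run 0.
unindexed = parse_query("section_run:0/section_system")
```

In this sketch the final element of an indexed path carries no index because it names a dataset rather than a section, while a missing index on a section name corresponds to the "return a list" case described above.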
Here is an example of using the interface:
import numpy as np
from nomadcore.archive import ArchiveHDF5
path = "./hdftest.h5"
archive = ArchiveHDF5(path)
# Access data from a specific calculation. All data is returned as numpy
# arrays.
calc = archive.calculations["CzZ46B-Rp4Swdxt3cJX1vSYmblMtB"]
simulation_cell = calc["section_run:0/section_system:0/simulation_cell"]
section_run = calc["section_run:0"]
program_name = section_run["program_name"]
# The sections extend the dictionary interface. The 'keys', 'values', 'items'
# and '__len__' will only consider the direct children of the section.
a = "program_name" in section_run
b = section_run.keys()
c = section_run.values()
d = section_run.items()
e = section_run.get("program_name")
f = len(section_run)
# To access multiple sections, leave the index out of the query. All the
# matching sections are then returned as a list.
sccs = section_run["section_single_configuration_calculation"]
for scc in sccs:
    energy = scc["energy_total"]
# The calculations also contain information about the mainfile URI and the
# parser
mainfile_uri = calc.mainfile_uri
parser_info = calc.parser_info
# Get calculations from a specific repository inside the file
repository = archive.repositories["RZzPSYJHo1o6aXsWLzQUYS-EdaisU"]
for calc_id, calc in repository.items():
    uri = calc.mainfile_uri
# Loop over all calculations
for calculation in archive.calculations.values():
    uri = calculation.mainfile_uri
# To enable a cache that keeps any values you set in memory, pass a flag when
# creating the archive. You can then set values, but they persist only for the
# lifetime of the object.
archive = ArchiveHDF5(path, use_write_cache=True)
calc = archive.calculations["CzZ46B-Rp4Swdxt3cJX1vSYmblMtB"]
calc["section_run:0/section_system:0/simulation_cell"] = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
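
The write cache described in the comments above can be pictured as an in-memory overlay on top of read-only data. The following is a minimal sketch of that idea, using a plain dictionary as a stand-in for the file contents. The class `ReadOnlyWithCache` is hypothetical and is not the nomadcore implementation; in particular, raising an error when the cache is disabled is an assumption made for the sketch, not documented behavior. Only the flag name `use_write_cache` is taken from the example above.

```python
class ReadOnlyWithCache:
    """Hypothetical read-only store with an optional in-memory write cache."""

    def __init__(self, backing, use_write_cache=False):
        # 'backing' stands in for the file contents and is never modified.
        self._backing = backing
        self._cache = {} if use_write_cache else None

    def __getitem__(self, key):
        # Values set through the cache shadow the stored ones.
        if self._cache is not None and key in self._cache:
            return self._cache[key]
        return self._backing[key]

    def __setitem__(self, key, value):
        if self._cache is None:
            # Assumption for this sketch: writes are rejected without the cache.
            raise ValueError("writing requires use_write_cache=True")
        self._cache[key] = value


backing = {"simulation_cell": "original"}

cached = ReadOnlyWithCache(backing, use_write_cache=True)
cached["simulation_cell"] = "modified"

# A new object built on the same data does not see the cached value.
fresh = ReadOnlyWithCache(backing)
```

Here `cached` returns the value that was set, while `backing` and `fresh` still hold the original one, which mirrors the statement that cached values persist only for the lifetime of the object.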