nomad-lab / nomad-FAIR / Issues / #891
Issue created Jun 09, 2022 by Lauri Himanen (@himanel1, Maintainer)

Hierarchical metainfo for complex structures

We are adding support for a more detailed breakdown of the atomistic structure of entries. Essentially this means that we are replacing the contents of results.material (which is able to describe only a single material) with a hierarchy of different structural components that have been identified within the entry. Now that we have built some prototypes, it would be good to discuss the new metainfo that will be needed. This would also be a good opportunity to refactor some of the names/contents of what we store about the chemical composition/structure.

For some visual reference, here is a link to a development deployment that includes a visualization of the topology: https://nomad-lab.eu/dev/rae/improved-support-for-complex-structural-top/gui/search/entries/entry/id/KunbQcxjcEBBlcMC96H9QtaPc1fK

Hierarchy and relations

My initial suggestion for storing the hierarchy and relations:

  • There is a single root, which corresponds to the 'original' system, i.e. the representative system that is chosen by our pipeline. It may be possible to think about multiple roots if we encounter e.g. systems that dramatically change during the simulation (e.g. a chemical reaction?).
  • Each subsystem has a reference to its parent and a list of its children. This ensures easy navigation up/down the hierarchy.
  • Each subsystem indicates the relation with its parent. This way we have proper ontological "tuples" that specify a parent, child and their relationship.
  • For search reasons, this hierarchy will be stored as a list (repeating, nested objects in ES terms). This allows one to easily search for properties of any single subsystem. In the GUI this list is reconstructed as a tree for display purposes.
  • During this early phase, I'm storing this hierarchy under results.material.topology.
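
As a rough illustration of the list-to-tree reconstruction described above, here is a minimal sketch in plain Python. It uses ordinary dictionaries instead of metainfo sections, and the `system_id` values and labels are made up for the example; only the field names (`system_id`, `parent_system`, `child_systems`, `system_relation`) follow the proposal.

```python
from typing import Dict, List


def build_tree(topology: List[dict]) -> dict:
    """Rebuild a nested tree from a flat list of subsystems.

    Each subsystem refers to its parent and children by system_id, so the
    flat (search-friendly) list can be turned back into a tree for display.
    """
    by_id: Dict[str, dict] = {s["system_id"]: s for s in topology}
    roots = [s for s in topology if s.get("parent_system") is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root system")

    def expand(system: dict) -> dict:
        return {
            "label": system["label"],
            # The relation to the parent; None for the root.
            "relation": system.get("system_relation", {}).get("type"),
            "children": [expand(by_id[c]) for c in system.get("child_systems", [])],
        }

    return expand(roots[0])


# Hypothetical ids and labels, for illustration only.
topology = [
    {"system_id": "topology/0", "label": "original",
     "child_systems": ["topology/1"]},
    {"system_id": "topology/1", "label": "surface",
     "parent_system": "topology/0",
     "system_relation": {"type": "subsystem"},
     "child_systems": ["topology/2"]},
    {"system_id": "topology/2", "label": "conventional cell",
     "parent_system": "topology/1",
     "system_relation": {"type": "idealization"}},
]

tree = build_tree(topology)
```

Each node in the result carries the (parent, child, relation) information of the proposal, while the stored data stays a flat list.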

Metadata

My initial suggestion for the metainfo stored for individual subsystems is as follows:

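import numpy as np
from nomad.metainfo import MSection, Quantity, SubSection, MEnum

# chemical_symbols, bravais_lattices, crystal_systems and WyckoffSet are
# assumed to be defined elsewhere.


class System(MSection):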
    description = Quantity(
        type=str,
        description='''
        A full description of this system.
        ''')
    system_id = Quantity(
        type=str,
        description='''
        The path of this section within the metainfo that is used as a unique identifier.
        ''')
    method = Quantity(
        type=MEnum('parser', 'user', 'matid'),
        description='''
        The source of topological information.
        ''')
    label = Quantity(
        type=str,
        description='''
        Descriptive short label for this system.
        ''')
    material_id = Quantity(
        type=str,
        description='''
        A fixed length, unique material identifier in the form of a hash
        digest.
        ''')
    material_name = Quantity(
        type=str,
        description='''
        Meaningful name for this material if any can be assigned.
        ''')
    structural_type = Quantity(
        type=MEnum(['bulk', 'surface', '2D', '1D', 'atom', 'group', 'molecule', 'monomer', 'not processed', 'unavailable']),
        default='not processed',
        description='''
        Classification based on structural features.
        ''')
    functional_type = Quantity(
        type=str,
        shape=['0..*'],
        description='''
        Classification based on the functional properties.
        ''')
    compound_type = Quantity(
        type=str,
        shape=['0..*'],
        description='''
        Classification based on the chemical formula.
        ''')
    elements = Quantity(
        type=MEnum(chemical_symbols),
        shape=['0..*'],
        default=[],
        description='''
        Names of the different elements present in the structure.
        ''')
    n_elements = Quantity(
        type=int,
        default=0,
        derived=lambda s: len(s.elements),
        description='''
        Number of different elements in the structure as an integer.
        ''')
    elements_exclusive = Quantity(
        type=str,
        derived=lambda s: ' '.join(sorted(s.elements)),
        description='''
        String containing the chemical elements in alphabetical order and
        separated by a single whitespace. This quantity can be used for
        exclusive element searches where you want to find entries/materials
        with only certain given elements.
        ''')
    formula_hill = Quantity(
        type=str,
        description='''
            The chemical formula for a structure in Hill form with element symbols followed by
            integer chemical proportion numbers. The proportion number MUST be omitted if it is 1.
        ''')
    formula_reduced = Quantity(
        type=str,
        description='''
            The reduced chemical formula for a structure as a string with element symbols and
            integer chemical proportion numbers. The proportion number MUST be omitted if it is 1.
        ''')
    formula_anonymous = Quantity(
        type=str,
        description='''
            The anonymous formula is the formula_reduced, but where the elements are
            instead first ordered by their chemical proportion number, and then, in order left to
            right, replaced by anonymous symbols A, B, C, ..., Z, Aa, Ba, ..., Za, Ab, Bb, ... and
            so on.
        ''')
    formula_reduced_fragments = Quantity(
        type=str,
        shape=['*'],
        description='''
        The reduced formula separated into individual terms containing both the atom
        type and count. Used for searching parts of a formula.
        ''')
    parent_system = Quantity(
        type=str,
        description='''
        Reference to the parent system.
        ''')
    child_systems = Quantity(
        type=str,
        shape=['*'],
        description='''
        References to the child systems.
        ''')
    n_atoms = Quantity(
        type=int,
        shape=[],
        description='''
        The total number of species (atoms, particles) in the system.
        ''')
    indices = Quantity(
        type=np.dtype(np.int64),
        shape=['*', '*'],
        description='''
        Indices of the atoms belonging to this group. These indices refer to
        the original system. Each row represents a new instance.
        ''')
    atoms_ref = Quantity(
        type=Atoms,
        description='''
        Reference to an atomistic structure that is associated with this
        system.
        ''')
    system_relation = SubSection(sub_section=Relation.m_def, repeats=False)
    atoms = SubSection(sub_section=Atoms.m_def, repeats=False)
    cell = SubSection(sub_section=Cell.m_def, repeats=False)
    symmetry = SubSection(sub_section=Symmetry.m_def, repeats=False)
    prototype = SubSection(sub_section=Prototype.m_def, repeats=False)

class Relation(MSection):
    type = Quantity(
        type=MEnum('subsystem', 'idealization'),
        description='''
        The type of relation.
        ''')

class Atoms(MSection):
    concentrations = Quantity(
        type=np.dtype(np.float64),
        shape=['n_atoms'],
        description='''
        Concentrations of the species defined by labels which can be assigned for systems
        with variable compositions.
        ''')
    labels = Quantity(
        type=str,
        shape=['n_atoms'],
        description='''
        List containing the labels of the atoms. In the usual case, these correspond to
        the chemical symbols of the atoms. One can also append an index if there is a
        need to distinguish between species with the same symbol, e.g., atoms of the
        same species assigned to different atom-centered basis sets or pseudo-potentials,
        or simply atoms in different locations in the structure such as those in the bulk
        and on the surface. In the case where a species is not an atom, and therefore
        cannot be represented by a chemical symbol, the label can simply be the name of
        the particles.
        ''')
    positions = Quantity(
        type=np.dtype(np.float64),
        shape=['n_atoms', 3],
        unit='meter',
        description='''
        Positions of all the species, in cartesian coordinates. This metadata defines a
        configuration and is therefore required. For alloys where concentrations of
        species are given for each site in the unit cell, it stores the position of the
        sites.
        ''')
    velocities = Quantity(
        type=np.dtype(np.float64),
        shape=['n_atoms', 3],
        unit='meter / second',
        description='''
        Velocities of the nuclei, defined as the change in cartesian coordinates of the
        nuclei with respect to time.
        ''')
    lattice_vectors = Quantity(
        type=np.dtype(np.float64),
        shape=[3, 3],
        unit='meter',
        description='''
        Lattice vectors in cartesian coordinates of the simulation cell. The
        last (fastest) index runs over the $x,y,z$ Cartesian coordinates, and the first
        index runs over the 3 lattice vectors.
        ''')
    lattice_vectors_reciprocal = Quantity(
        type=np.dtype(np.float64),
        shape=[3, 3],
        unit='1/meter',
        description='''
        Reciprocal lattice vectors in cartesian coordinates of the simulation cell. The
        first index runs over the $x,y,z$ Cartesian coordinates, and the second index runs
        over the 3 lattice vectors.
        ''')
    local_rotations = Quantity(
        type=np.dtype(np.float64),
        shape=['n_atoms', 3, 3],
        description='''
        A rotation matrix defining the orientation of each atom. If the rotation matrix
        cannot be specified for an atom, it should be set to the zero matrix for that
        atom (not the identity!)
        ''')
    periodic = Quantity(
        type=bool,
        shape=[3],
        description='''
        Denotes if periodic boundary condition is applied to each of the lattice vectors.
        ''')
    supercell_matrix = Quantity(
        type=np.dtype(np.int32),
        shape=[3, 3],
        description='''
        Specifies the matrix that transforms the unit-cell into the super-cell in which
        the actual calculation is performed.
        ''')
    species = SubSection(sub_section=Species.m_def, repeats=False)
    wyckoff_sets = SubSection(sub_section=WyckoffSet.m_def, repeats=True)

class Species(MSection):
    name = Quantity(
        type=str,
        description='''
        Name that uniquely identifies this species within a system.
        ''')
    mass = Quantity(
        type=np.dtype(np.float64),
        shape=[],
        unit='kilogram',
        description='''
        Mass of the species.
        ''')
    atomic_number = Quantity(
        type=np.dtype(np.int32),
        shape=[],
        description='''
        The atomic number of the species if available.
        ''')

class Symmetry(MSection):
    bravais_lattice = Quantity(
        type=MEnum(bravais_lattices),
        shape=[],
        description='''
        Identifier for the Bravais lattice in Pearson notation. The first lowercase letter
        identifies the crystal family and can be one of the following: a (triclinic), m
        (monoclinic), o (orthorhombic), t (tetragonal), h (hexagonal) or c (cubic). The
        second uppercase letter identifies the centring and can be one of the following: P
        (primitive), S (one face centred), I (body centred), R (rhombohedral centring) or F
        (all faces centred).
        ''')
    crystal_system = Quantity(
        type=MEnum(crystal_systems),
        shape=[],
        description='''
        Name of the crystal system.
        ''')
    hall_number = Quantity(
        type=np.dtype(np.int32),
        shape=[],
        description='''
        The Hall number for this system.
        ''')
    hall_symbol = Quantity(
        type=str,
        shape=[],
        description='''
        The Hall symbol for this system.
        ''')
    point_group = Quantity(
        type=str,
        shape=[],
        description='''
        Symbol of the crystallographic point group in the Hermann-Mauguin notation.
        ''')
    space_group_number = Quantity(
        type=np.dtype(np.int32),
        shape=[],
        description='''
        Specifies the International Union of Crystallography (IUCr) number of the 3D space
        group of this system.
        ''')
    space_group_symbol = Quantity(
        type=str,
        shape=[],
        description='''
        The International Union of Crystallography (IUCr) short symbol of the 3D
        space group of this system.
        ''')
    choice = Quantity(
        type=str,
        shape=[],
        description='''
        String that specifies the centering, origin and basis vector settings of the 3D
        space group that defines the symmetry group of the simulated physical system (see
        section system). Values are as defined by spglib.
        ''')
    strukturbericht_designation = Quantity(
        type=str,
        description='''
        Classification of the material according to the historically grown
        'strukturbericht'.
        ''')
    symmetry_method = Quantity(
        type=str,
        shape=[],
        description='''
        Identifies the source of the symmetry information contained within this
        section. If equal to 'spg_normalized' the information comes from a
        normalization step.
        ''')
    origin_shift = Quantity(
        type=np.dtype(np.float64),
        shape=[3],
        description='''
        Vector $\\mathbf{p}$ from the origin of the standardized system to the origin of
        the original system. Together with the matrix $\\mathbf{P}$, found in
        transformation_matrix, the transformation between the standardized
        coordinates $\\mathbf{x}_s$ and original coordinates $\\mathbf{x}$ is then given
        by $\\mathbf{x}_s = \\mathbf{P} \\mathbf{x} + \\mathbf{p}$.
        ''')
    transformation_matrix = Quantity(
        type=np.dtype(np.float64),
        shape=[3, 3],
        description='''
        Matrix $\\mathbf{P}$ that is used to transform the original coordinates to the
        standardized coordinates. Together with the vector $\\mathbf{p}$, found in
        origin_shift, the transformation between the standardized
        coordinates $\\mathbf{x}_s$ and original coordinates $\\mathbf{x}$ is then given by
        $\\mathbf{x}_s = \\mathbf{P} \\mathbf{x} + \\mathbf{p}$.
        ''')
    symmorphic = Quantity(
        type=bool,
        shape=[],
        description='''
        Specifies if the space group is symmorphic. Set to True if all
        translations are zero.
        ''')

class Prototype(MSection):
    aflow_id = Quantity(
        type=str,
        shape=[],
        description='''
        AFLOW id of the prototype (see
        http://aflowlib.org/CrystalDatabase/prototype_index.html) identified on the basis
        of the space_group and normalized_wyckoff.
        ''')
    assignment_method = Quantity(
        type=str,
        shape=[],
        description='''
        Method used to identify the prototype.
        ''')
    label = Quantity(
        type=str,
        shape=[],
        description='''
        Label of the prototype identified on the basis of the space_group and
        normalized_wyckoff. The label is in the same format as in the read_prototypes
        function: <space_group_number>-<prototype_name>-<Pearson's symbol>.
        ''')
    name = Quantity(
        type=str,
        description='''
        A common name for this prototypical structure, e.g. fcc, bcc.
        ''')
    formula = Quantity(
        type=str,
        description='''
        The formula of the prototypical material for this structure.
        ''')

class Cell(MSection):
    a = Quantity(
        type=np.dtype(np.float64),
        unit='m',
        description='''
        Length of the first basis vector.
        ''')
    b = Quantity(
        type=np.dtype(np.float64),
        unit='m',
        description='''
        Length of the second basis vector.
        ''')
    c = Quantity(
        type=np.dtype(np.float64),
        unit='m',
        description='''
        Length of the third basis vector.
        ''')
    alpha = Quantity(
        type=np.dtype(np.float64),
        unit='radian',
        description='''
        Angle between second and third basis vector.
        ''')
    beta = Quantity(
        type=np.dtype(np.float64),
        unit='radian',
        description='''
        Angle between first and third basis vector.
        ''')
    gamma = Quantity(
        type=np.dtype(np.float64),
        unit='radian',
        description='''
        Angle between first and second basis vector.
        ''')
    volume = Quantity(
        type=np.dtype(np.float64),
        unit='m ** 3',
        description='''
        Volume of the cell.
        ''')
    atomic_density = Quantity(
        type=np.dtype(np.float64),
        unit='1 / m ** 3',
        description='''
        Atomic density of the material (atoms/volume).
        ''')
    mass_density = Quantity(
        type=np.dtype(np.float64),
        unit='kg / m ** 3',
        description='''
        Mass density of the material.
        ''')

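To make the Cell quantities more concrete, here is a minimal sketch of how they could be derived from a 3x3 array of lattice vectors (one vector per row, as in Atoms.lattice_vectors). The function name `cell_parameters` and the returned dictionary are assumptions made for this example and are not part of the proposed metainfo.

```python
from typing import Optional

import numpy as np


def cell_parameters(lattice_vectors, n_atoms: Optional[int] = None) -> dict:
    """Derive Cell-like quantities from lattice vectors given in meters."""
    v = np.asarray(lattice_vectors, dtype=np.float64)

    def angle(u, w) -> float:
        # Angle between two vectors in radians, clipped for numerical safety.
        cos = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
        return float(np.arccos(np.clip(cos, -1.0, 1.0)))

    volume = float(abs(np.linalg.det(v)))
    result = {
        "a": float(np.linalg.norm(v[0])),
        "b": float(np.linalg.norm(v[1])),
        "c": float(np.linalg.norm(v[2])),
        "alpha": angle(v[1], v[2]),  # between second and third vector
        "beta": angle(v[0], v[2]),   # between first and third vector
        "gamma": angle(v[0], v[1]),  # between first and second vector
        "volume": volume,
    }
    if n_atoms is not None:
        result["atomic_density"] = n_atoms / volume  # atoms / m ** 3
    return result


# A cubic cell with a 4 angstrom edge that contains four atoms.
params = cell_parameters(np.eye(3) * 4.0e-10, n_atoms=4)
```
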
Here are some notes about the proposed metainfo:

  • The metainfo is designed so that it could be reused in run.system, or anywhere where we store information about structures.
  • System: the quantities stored directly in this section contain generic information about the subsystem: its structural type, any possible functional/compound types, formulas etc.
  • Atoms: is filled only if an atomistic representation is available. Mimics what is currently stored in run.system.atoms. Notice that the Wyckoff positions are stored here, instead of being stored in symmetry.
  • Cell: is filled only if the subsystem has a cell. Provides a place for storing lattice constants and other cell properties.
  • Symmetry: is filled only if symmetry is available. Mimics what is currently stored in run.system.symmetry.
  • Prototype: is filled only if a prototype is available. Mimics what is currently stored in run.system.prototype.
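
To illustrate the formula-related quantities stored directly in System, here is a small sketch that derives them from element counts. The helper names are hypothetical. The alphabetical ordering of the reduced formula, the descending ordering of proportions in the anonymous formula (as in the OPTIMADE examples), and dropping a count of 1 from the fragments are assumptions of this sketch, not decisions made in this issue.

```python
from functools import reduce
from math import gcd
from typing import Dict, List


def reduce_counts(counts: Dict[str, int]) -> Dict[str, int]:
    """Divide all element counts by their greatest common divisor."""
    divisor = reduce(gcd, counts.values())
    return {element: n // divisor for element, n in counts.items()}


def reduced_fragments(counts: Dict[str, int]) -> List[str]:
    """Terms such as 'H2' or 'O'; a count of 1 is omitted (assumption)."""
    reduced = reduce_counts(counts)
    return [f"{el}{n if n > 1 else ''}" for el, n in sorted(reduced.items())]


def formula_reduced(counts: Dict[str, int]) -> str:
    """Reduced formula with alphabetically ordered elements (assumption)."""
    return "".join(reduced_fragments(counts))


def anonymous_symbol(index: int) -> str:
    """0 -> A, 25 -> Z, 26 -> Aa, 27 -> Ba, ..., 51 -> Za, 52 -> Ab, ..."""
    letter = chr(ord("A") + index % 26)
    suffix = "" if index < 26 else chr(ord("a") + index // 26 - 1)
    return letter + suffix


def formula_anonymous(counts: Dict[str, int]) -> str:
    """Anonymous formula with proportions in descending order (assumption)."""
    proportions = sorted(reduce_counts(counts).values(), reverse=True)
    return "".join(
        f"{anonymous_symbol(i)}{n if n > 1 else ''}"
        for i, n in enumerate(proportions)
    )


def elements_exclusive(elements: List[str]) -> str:
    """Same logic as the derived quantity in the proposal."""
    return " ".join(sorted(elements))


counts = {"O": 2, "H": 4}  # two water molecules
summary = {
    "formula_reduced": formula_reduced(counts),
    "formula_anonymous": formula_anonymous(counts),
    "formula_reduced_fragments": reduced_fragments(counts),
    "elements_exclusive": elements_exclusive(list(counts)),
}
```

With this, an exclusive element search reduces to an exact match on the elements_exclusive string, and a partial formula search to matching individual entries of formula_reduced_fragments.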

Specific questions

  1. Any suggestions for the final location of this data are welcome. E.g. should we store it in results.topology, or completely replace results.material with this hierarchy?
  2. What types of formulas should we store and what names should we use (e.g. chemical_formula vs formula)?
  3. How to handle non-stoichiometric (e.g. Fe_0.95O) or other specialized chemical formulae coming from experiments? Should these be pushed to formula_descriptive, which would become a place for non-standard, but useful formulas? (similar to optimade)
  4. Should we attempt to support non-standard species definitions? E.g. someone doing a simulation with deuterium or other species with custom masses or basis sets. (The species information is now stored in system.atoms.species).
  5. Some structure types may contain useful additional information. E.g. we could store the orientation of a surface. Where should this kind of information be stored?
  6. Maybe instead of having a special enum for structural_type=molecule_group, we could have a more generic container called 'group', that could be used to logically combine any type of subsystems? (Now modified. The group subtype is currently deduced by analysing the child_systems, but an enum that describes the group contents can be added later if required e.g. for the search).
Edited Jun 17, 2022 by Lauri Himanen