Hierarchical metainfo for complex structures
We are adding support for a more detailed breakdown of the atomistic structure of entries. Essentially this means that we are replacing the contents of results.material (which is able to describe only a single material) with a hierarchy of different structural components that have been identified within the entry. Now that we have built some prototypes, it would be good to discuss the new metainfo that will be needed. This would also be a good opportunity to refactor some of the names/contents of what we store about the chemical composition/structure.
For some visual reference, here is a link to a development deployment that includes a visualization of the topology: https://nomadlab.eu/dev/rae/improvedsupportforcomplexstructuraltop/gui/search/entries/entry/id/KunbQcxjcEBBlcMC96H9QtaPc1fK
Hierarchy and relations
My initial suggestion for storing the hierarchy and relations:
 There is a single root, which corresponds to the 'original' system, i.e. the representative system that is chosen by our pipeline. It may be possible to think about multiple roots if we encounter e.g. systems that dramatically change during the simulation (e.g. a chemical reaction?).
 Each subsystem has a reference to its parent and a list of its children. This ensures easy navigation up/down the hierarchy.
 Each subsystem indicates the relation with its parent. This way we have proper ontological "tuples" that specify a parent, child and their relationship.
 For search reasons, this hierarchy will be stored as a list (repeating, nested objects in ES terms). This allows one to easily search for properties of any single subsystem. In the GUI this list is reconstructed as a tree for display purposes.
 During this early phase, I'm storing this hierarchy under results.material.topology.
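As a rough illustration of the list-versus-tree idea above, here is a minimal sketch in plain Python. It is not the actual implementation: the dictionary keys loosely follow the System section proposed under "Metadata" (the relation is flattened into a plain "relation" key for brevity), and all ids, labels and structural types are made-up example values. The tree is rebuilt from the parent references alone, and the flat list is queried directly, which is roughly what a nested query does on the search side.

```python
# Hypothetical flat topology list. Keys loosely follow the proposed System
# section; ids, labels and types are invented for illustration only.
topology = [
    {"system_id": "results/material/topology/0", "label": "original",
     "structural_type": "surface", "parent_system": None, "relation": None},
    {"system_id": "results/material/topology/1", "label": "substrate",
     "structural_type": "surface",
     "parent_system": "results/material/topology/0", "relation": "subsystem"},
    {"system_id": "results/material/topology/2", "label": "conventional cell",
     "structural_type": "bulk",
     "parent_system": "results/material/topology/1", "relation": "idealization"},
    {"system_id": "results/material/topology/3", "label": "adsorbate",
     "structural_type": "molecule",
     "parent_system": "results/material/topology/0", "relation": "subsystem"},
]


def find(systems, **criteria):
    """Search the flat list: return systems whose fields match all criteria."""
    return [s for s in systems if all(s.get(k) == v for k, v in criteria.items())]


def build_tree(systems):
    """Rebuild the nested tree from the flat list using the parent references."""
    by_id = {s["system_id"]: {**s, "children": []} for s in systems}
    roots = []
    for node in by_id.values():
        parent_id = node["parent_system"]
        if parent_id is None:
            roots.append(node)  # no parent: this is a root system
        else:
            by_id[parent_id]["children"].append(node)
    return roots


def render(node, depth=0):
    """Return indented lines showing each system and its relation to its parent."""
    relation = node["relation"] or "root"
    lines = [f"{'  ' * depth}{node['label']} ({relation})"]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines


for root in build_tree(topology):
    print("\n".join(render(root)))
```

Since build_tree returns a list of roots, the same sketch would also cover the multiple-root case mentioned above without changes.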
Metadata
My initial suggestion for the metainfo stored for individual subsystems is as follows:
import numpy as np
from nomad.metainfo import MSection, Quantity, SubSection, MEnum


class System(MSection):
    description = Quantity(
        type=str,
        description='''
        A full description of this system.
        ''')
    system_id = Quantity(
        type=str,
        description='''
        The path of this section within the metainfo that is used as a unique identifier.
        ''')
    method = Quantity(
        type=MEnum('parser', 'user', 'matid'),
        description='''
        The source of topological information.
        ''')
    label = Quantity(
        type=str,
        description='''
        Descriptive short label for this system.
        ''')
    material_id = Quantity(
        type=str,
        description='''
        A fixed length, unique material identifier in the form of a hash
        digest.
        ''')
    material_name = Quantity(
        type=str,
        description='''
        Meaningful name for this material if any can be assigned.
        ''')
    structural_type = Quantity(
        type=MEnum([
            'bulk', 'surface', '2D', '1D', 'atom', 'group', 'molecule', 'monomer',
            'unavailable', 'not processed']),
        default='not processed',
        description='''
        Classification based on structural features.
        ''')
    functional_type = Quantity(
        type=str,
        shape=['0..*'],
        description='''
        Classification based on the functional properties.
        ''')
    compound_type = Quantity(
        type=str,
        shape=['0..*'],
        description='''
        Classification based on the chemical formula.
        ''')
    elements = Quantity(
        type=MEnum(chemical_symbols),
        shape=['0..*'],
        default=[],
        description='''
        Names of the different elements present in the structure.
        ''')
    n_elements = Quantity(
        type=int,
        default=0,
        derived=lambda s: len(s.elements),
        description='''
        Number of different elements in the structure as an integer.
        ''')
    elements_exclusive = Quantity(
        type=str,
        derived=lambda s: ' '.join(sorted(s.elements)),
        description='''
        String containing the chemical elements in alphabetical order and
        separated by a single whitespace. This quantity can be used for
        exclusive element searches where you want to find entries/materials
        with only certain given elements.
        ''')
    formula_hill = Quantity(
        type=str,
        description='''
        The chemical formula for a structure in Hill form with element symbols followed by
        integer chemical proportion numbers. The proportion number MUST be omitted if it is 1.
        ''')
    formula_reduced = Quantity(
        type=str,
        description='''
        The reduced chemical formula for a structure as a string with element symbols and
        integer chemical proportion numbers. The proportion number MUST be omitted if it is 1.
        ''')
    formula_anonymous = Quantity(
        type=str,
        description='''
        The anonymous formula is the formula_reduced, but where the elements are
        instead first ordered by their chemical proportion number, and then, in order left to
        right, replaced by anonymous symbols A, B, C, ..., Z, Aa, Ba, ..., Za, Ab, Bb, ... and
        so on.
        ''')
    formula_reduced_fragments = Quantity(
        type=str,
        shape=['*'],
        description='''
        The reduced formula separated into individual terms containing both the atom
        type and count. Used for searching parts of a formula.
        ''')
    parent_system = Quantity(
        type=str,
        description='''
        Reference to the parent system.
        ''')
    child_systems = Quantity(
        type=str,
        shape=['*'],
        description='''
        References to the child systems.
        ''')
    n_atoms = Quantity(
        type=int,
        shape=[],
        description='''
        The total number of species (atoms, particles) in the system.
        ''')
    indices = Quantity(
        type=np.dtype(np.int64),
        shape=['*', '*'],
        description='''
        Indices of the atoms belonging to this group. These indices refer to
        the original system. Each row represents a new instance.
        ''')
    atoms_ref = Quantity(
        type=Atoms,
        description='''
        Reference to an atomistic structure that is associated with this
        system.
        ''')
    system_relation = SubSection(sub_section=Relation.m_def, repeats=False)
    atoms = SubSection(sub_section=Atoms.m_def, repeats=False)
    cell = SubSection(sub_section=Cell.m_def, repeats=False)
    symmetry = SubSection(sub_section=SymmetryNew.m_def, repeats=False)
    prototype = SubSection(sub_section=Prototype.m_def, repeats=False)


class Relation(MSection):
    type = Quantity(
        type=MEnum('subsystem', 'idealization'),
        description='''
        The type of relation.
        ''')


class Atoms(MSection):
    concentrations = Quantity(
        type=np.dtype(np.float64),
        shape=['n_atoms'],
        description='''
        Concentrations of the species defined by labels which can be assigned for systems
        with variable compositions.
        ''')
    labels = Quantity(
        type=str,
        shape=['n_atoms'],
        description='''
        List containing the labels of the atoms. In the usual case, these correspond to
        the chemical symbols of the atoms. One can also append an index if there is a
        need to distinguish between species with the same symbol, e.g., atoms of the
        same species assigned to different atom-centered basis sets or pseudopotentials,
        or simply atoms in different locations in the structure such as those in the bulk
        and on the surface. In the case where a species is not an atom, and therefore
        cannot be represented by a chemical symbol, the label can simply be the name of
        the particles.
        ''')
    positions = Quantity(
        type=np.dtype(np.float64),
        shape=['n_atoms', 3],
        unit='meter',
        description='''
        Positions of all the species, in cartesian coordinates. This metadata defines a
        configuration and is therefore required. For alloys where concentrations of
        species are given for each site in the unit cell, it stores the position of the
        sites.
        ''')
    velocities = Quantity(
        type=np.dtype(np.float64),
        shape=['n_atoms', 3],
        unit='meter / second',
        description='''
        Velocities of the nuclei, defined as the change in cartesian coordinates of the
        nuclei with respect to time.
        ''')
    lattice_vectors = Quantity(
        type=np.dtype(np.float64),
        shape=[3, 3],
        unit='meter',
        description='''
        Lattice vectors in cartesian coordinates of the simulation cell. The
        last (fastest) index runs over the $x,y,z$ Cartesian coordinates, and the first
        index runs over the 3 lattice vectors.
        ''')
    lattice_vectors_reciprocal = Quantity(
        type=np.dtype(np.float64),
        shape=[3, 3],
        unit='1/meter',
        description='''
        Reciprocal lattice vectors in cartesian coordinates of the simulation cell. The
        first index runs over the $x,y,z$ Cartesian coordinates, and the second index runs
        over the 3 lattice vectors.
        ''')
    local_rotations = Quantity(
        type=np.dtype(np.float64),
        shape=['n_atoms', 3, 3],
        description='''
        A rotation matrix defining the orientation of each atom. If the rotation matrix
        cannot be specified for an atom, it should be set to the zero matrix (not the
        identity!) for that atom.
        ''')
    periodic = Quantity(
        type=bool,
        shape=[3],
        description='''
        Denotes if periodic boundary condition is applied to each of the lattice vectors.
        ''')
    supercell_matrix = Quantity(
        type=np.dtype(np.int32),
        shape=[3, 3],
        description='''
        Specifies the matrix that transforms the unit cell into the supercell in which
        the actual calculation is performed.
        ''')
    species = SubSection(sub_section=Species.m_def, repeats=False)
    wyckoff_sets = SubSection(sub_section=WyckoffSet.m_def, repeats=True)


class Species(MSection):
    name = Quantity(
        type=str,
        description='''
        Name that uniquely identifies this species within a system.
        ''')
    mass = Quantity(
        type=np.dtype(np.float64),
        shape=[],
        unit='kilogram',
        description='''
        Mass of the species.
        ''')
    atomic_number = Quantity(
        type=np.dtype(np.int32),
        shape=[],
        description='''
        The atomic number of the species if available.
        ''')


class Symmetry(MSection):
    bravais_lattice = Quantity(
        type=MEnum(bravais_lattices),
        shape=[],
        description='''
        Identifier for the Bravais lattice in Pearson notation. The first lowercase letter
        identifies the crystal family and can be one of the following: a (triclinic), m
        (monoclinic), o (orthorhombic), t (tetragonal), h (hexagonal) or c (cubic). The
        second uppercase letter identifies the centring and can be one of the following: P
        (primitive), S (one-face centred), I (body centred), R (rhombohedral centring) or F
        (all faces centred).
        ''')
    crystal_system = Quantity(
        type=MEnum(crystal_systems),
        shape=[],
        description='''
        Name of the crystal system.
        ''')
    hall_number = Quantity(
        type=np.dtype(np.int32),
        shape=[],
        description='''
        The Hall number for this system.
        ''')
    hall_symbol = Quantity(
        type=str,
        shape=[],
        description='''
        The Hall symbol for this system.
        ''')
    point_group = Quantity(
        type=str,
        shape=[],
        description='''
        Symbol of the crystallographic point group in the Hermann-Mauguin notation.
        ''')
    space_group_number = Quantity(
        type=np.dtype(np.int32),
        shape=[],
        description='''
        Specifies the International Union of Crystallography (IUC) number of the 3D space
        group of this system.
        ''')
    space_group_symbol = Quantity(
        type=str,
        shape=[],
        description='''
        The International Union of Crystallography (IUC) short symbol of the 3D
        space group of this system.
        ''')
    choice = Quantity(
        type=str,
        shape=[],
        description='''
        String that specifies the centering, origin and basis vector settings of the 3D
        space group that defines the symmetry group of the simulated physical system (see
        section system). Values are as defined by spglib.
        ''')
    strukturbericht_designation = Quantity(
        type=str,
        description='''
        Classification of the material according to the historically grown
        'strukturbericht'.
        ''')
    symmetry_method = Quantity(
        type=str,
        shape=[],
        description='''
        Identifies the source of the symmetry information contained within this
        section. If equal to 'spg_normalized' the information comes from a
        normalization step.
        ''')
    origin_shift = Quantity(
        type=np.dtype(np.float64),
        shape=[3],
        description='''
        Vector $\\mathbf{p}$ from the origin of the standardized system to the origin of
        the original system. Together with the matrix $\\mathbf{P}$, found in
        transformation_matrix, the transformation between the standardized
        coordinates $\\mathbf{x}_s$ and original coordinates $\\mathbf{x}$ is then given
        by $\\mathbf{x}_s = \\mathbf{P} \\mathbf{x} + \\mathbf{p}$.
        ''')
    transformation_matrix = Quantity(
        type=np.dtype(np.float64),
        shape=[3, 3],
        description='''
        Matrix $\\mathbf{P}$ that is used to transform the standardized coordinates to the
        original coordinates. Together with the vector $\\mathbf{p}$, found in
        origin_shift, the transformation between the standardized
        coordinates $\\mathbf{x}_s$ and original coordinates $\\mathbf{x}$ is then given by
        $\\mathbf{x}_s = \\mathbf{P} \\mathbf{x} + \\mathbf{p}$.
        ''')
    symmorphic = Quantity(
        type=bool,
        shape=[],
        description='''
        Specifies if the space group is symmorphic. Set to True if all
        translations are zero.
        ''')


class Prototype(MSection):
    aflow_id = Quantity(
        type=str,
        shape=[],
        description='''
        AFLOW id of the prototype (see
        http://aflowlib.org/CrystalDatabase/prototype_index.html) identified on the basis
        of the space_group and normalized_wyckoff.
        ''')
    assignment_method = Quantity(
        type=str,
        shape=[],
        description='''
        Method used to identify the prototype.
        ''')
    label = Quantity(
        type=str,
        shape=[],
        description='''
        Label of the prototype identified on the basis of the space_group and
        normalized_wyckoff. The label is in the same format as in the read_prototypes
        function: <space_group_number>-<prototype_name>-<Pearson's symbol>.
        ''')
    name = Quantity(
        type=str,
        description='''
        A common name for this prototypical structure, e.g. fcc, bcc.
        ''')
    formula = Quantity(
        type=str,
        description='''
        The formula of the prototypical material for this structure.
        ''')


class Cell(MSection):
    a = Quantity(
        type=np.dtype(np.float64),
        unit='m',
        description='''
        Length of the first basis vector.
        ''')
    b = Quantity(
        type=np.dtype(np.float64),
        unit='m',
        description='''
        Length of the second basis vector.
        ''')
    c = Quantity(
        type=np.dtype(np.float64),
        unit='m',
        description='''
        Length of the third basis vector.
        ''')
    alpha = Quantity(
        type=np.dtype(np.float64),
        unit='radian',
        description='''
        Angle between second and third basis vector.
        ''')
    beta = Quantity(
        type=np.dtype(np.float64),
        unit='radian',
        description='''
        Angle between first and third basis vector.
        ''')
    gamma = Quantity(
        type=np.dtype(np.float64),
        unit='radian',
        description='''
        Angle between first and second basis vector.
        ''')
    volume = Quantity(
        type=np.dtype(np.float64),
        unit='m ** 3',
        description='''
        Volume of the cell.
        ''')
    atomic_density = Quantity(
        type=np.dtype(np.float64),
        unit='1 / m ** 3',
        description='''
        Atomic density of the material (atoms/volume).
        ''')
    mass_density = Quantity(
        type=np.dtype(np.float64),
        unit='kg / m ** 3',
        description='''
        Mass density of the material.
        ''')
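To make the formula-related quantities in System more concrete, here is a small standalone sketch of how elements_exclusive, formula_reduced_fragments and formula_anonymous could be derived from a simple formula string. It is only an illustration and not the code used in the pipeline: the helper names are hypothetical, the parser only handles plain formulas without brackets or fractional counts, the fragments are assumed to be ordered alphabetically, and the anonymous formula is assumed to be ordered by descending proportion as in the OPTIMADE convention.

```python
import re
from functools import reduce
from math import gcd
from string import ascii_lowercase, ascii_uppercase


def parse_formula(formula):
    """Parse a plain formula such as 'Si2O4' into {'Si': 2, 'O': 4}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def reduce_counts(counts):
    """Divide all counts by their greatest common divisor."""
    divisor = reduce(gcd, counts.values())
    return {symbol: n // divisor for symbol, n in counts.items()}


def term(symbol, count):
    """Format one formula term, omitting the proportion number if it is 1."""
    return symbol if count == 1 else f"{symbol}{count}"


def elements_exclusive(elements):
    """Alphabetically sorted element symbols separated by single spaces."""
    return " ".join(sorted(elements))


def formula_reduced_fragments(counts):
    """Individual terms of the reduced formula (assumed alphabetical order)."""
    reduced = reduce_counts(counts)
    return [term(symbol, reduced[symbol]) for symbol in sorted(reduced)]


def anonymous_symbol(index):
    """Return the symbols A, B, ..., Z, Aa, Ba, ..., Za, Ab, Bb, ... by index."""
    if index < 26:
        return ascii_uppercase[index]
    suffix, upper = divmod(index - 26, 26)
    return ascii_uppercase[upper] + ascii_lowercase[suffix]


def formula_anonymous(counts):
    """Reduced formula with elements replaced by anonymous symbols.

    Assumption: terms are ordered by descending proportion number.
    """
    ordered = sorted(reduce_counts(counts).values(), reverse=True)
    return "".join(term(anonymous_symbol(i), n) for i, n in enumerate(ordered))


counts = parse_formula("Si2O4")
print(elements_exclusive(counts))         # O Si
print(formula_reduced_fragments(counts))  # ['O2', 'Si']
print(formula_anonymous(counts))          # A2B
```

With this layout an exclusive element search becomes an exact match on the elements_exclusive string, while a partial formula search matches individual entries of formula_reduced_fragments.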
Here are some notes about the metainfo above:
 The metainfo is designed so that it could be reused in run.system, or anywhere where we store information about structures.
 System: the direct metainfo contains generic information about this subsystem: its structural type, any possible functional/compound types, formulas etc.
 Atoms: is filled only if an atomistic representation is available. Mimics what is currently stored in run.system.atoms. Notice that the Wyckoff positions are stored here, instead of being stored in symmetry.
 Cell: is filled only if the subsystem has a cell. Provides a place for storing lattice constants and other cell properties.
 Symmetry: is filled only if symmetry is available. Mimics what is currently stored in run.system.symmetry.
 Prototype: is filled only if a prototype is available. Mimics what is currently stored in run.system.prototype.
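As an illustration of what the Cell section would hold, the following sketch derives the lattice constants, angles, volume and densities from a set of lattice vectors using only the standard library. It assumes that lattice_vectors stores one lattice vector per row in meters (as described for Atoms.lattice_vectors); the function names and the example hexagonal cell are hypothetical and not taken from the pipeline.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def norm(u):
    return math.sqrt(dot(u, u))


def angle(u, v):
    """Angle between two vectors in radians, clamped against rounding errors."""
    cosine = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cosine)))


def cell_parameters(lattice_vectors):
    """Return the quantities of the Cell section for row-wise lattice vectors."""
    va, vb, vc = lattice_vectors
    return {
        "a": norm(va),
        "b": norm(vb),
        "c": norm(vc),
        "alpha": angle(vb, vc),  # between second and third vector
        "beta": angle(va, vc),   # between first and third vector
        "gamma": angle(va, vb),  # between first and second vector
        "volume": abs(dot(va, cross(vb, vc))),
    }


def densities(volume, masses):
    """Atomic density (1 / m^3) and mass density (kg / m^3) for atom masses in kg."""
    return len(masses) / volume, sum(masses) / volume


# Made-up hexagonal cell with a = b = 2.5 angstrom and c = 4 angstrom.
a0, c0 = 2.5e-10, 4.0e-10
lattice_vectors = [
    (a0, 0.0, 0.0),
    (-a0 / 2, a0 * math.sqrt(3) / 2, 0.0),
    (0.0, 0.0, c0),
]
params = cell_parameters(lattice_vectors)
print(math.degrees(params["gamma"]))  # approximately 120 degrees
```

Angles are kept in radians and lengths in meters to match the units declared in Cell.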
Specific questions
 Any suggestions for the final location of this data are welcome. E.g. should we store it in results.topology, or completely replace results.material with this hierarchy?
 What types of formulas should we store and what names should we use (e.g. chemical_formula vs formula)?
 How to handle non-stoichiometric (e.g. Fe_0.95O) or other specialized chemical formulae coming from experiments? Should these be pushed to formula_descriptive, which would become a place for nonstandard, but useful formulas? (similar to OPTIMADE)
 Should we attempt to support nonstandard species definitions? E.g. someone doing a simulation with deuterium or other species with custom masses or basis sets. (The species information is now stored in system.atoms.species.)
 Some structure types may contain useful additional information. E.g. we could store the orientation of a surface. Where should this kind of information be stored?
 Maybe instead of having a special enum for structural_type=molecule_group, we could have a more generic container called 'group', that could be used to logically combine any type of subsystems? (Now modified. The group subtype is currently deduced by analysing the child_systems, but an enum that describes the group contents can be added later if required, e.g. for the search.)
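For the last point, deducing the contents of a generic 'group' from its child_systems could look roughly like the sketch below. This is purely hypothetical: the function name, the "mixed_group" value and the example ids are invented for illustration, and only 'molecule_group' corresponds to a value mentioned above.

```python
def group_contents(group, systems_by_id):
    """Summarise a group from the structural types of its child systems.

    Returns e.g. 'molecule_group' when all children share one structural
    type, 'mixed_group' (hypothetical value) otherwise, and None when the
    group has no children.
    """
    child_types = sorted({
        systems_by_id[child_id]["structural_type"]
        for child_id in group["child_systems"]
    })
    if not child_types:
        return None
    if len(child_types) == 1:
        return f"{child_types[0]}_group"
    return "mixed_group"


# Made-up example: a group that collects two molecules.
systems_by_id = {
    "topology/1": {"structural_type": "group",
                   "child_systems": ["topology/2", "topology/3"]},
    "topology/2": {"structural_type": "molecule", "child_systems": []},
    "topology/3": {"structural_type": "molecule", "child_systems": []},
}
print(group_contents(systems_by_id["topology/1"], systems_by_id))  # molecule_group
```

If such a summary turns out to be useful for the search, it could be stored as a derived enum on the group instead of being recomputed in the GUI.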