Metainfo improvements

The DataType idea

Move all types into their own DataTypes and remove the "if then else" logic from the set, get, and (de)serialize operations. E.g. NPArray(dtype) and PythonType(type) should deal with numpy and Python types respectively. This should make it much easier to extend NPArray(dtype) with something like HDF5Dataset(dtype) in the future. It might also make it easier to add more specialised types. Also, m_to_dict, __set_normalized, etc. look horrible in their current state.
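As a rough illustration of the idea, each DataType subclass could own its normalization and (de)serialization, instead of central if/else chains. All names and signatures below are a hypothetical sketch, not the actual nomad.metainfo API:

```python
import numpy as np


class DataType:
    def set_normalized(self, value):
        """Normalize a value on assignment; default is a pass-through."""
        return value

    def serialize(self, value):
        """Turn a normalized value into a JSON-serializable one."""
        return value

    def deserialize(self, value):
        """Inverse of serialize."""
        return value


class PythonType(DataType):
    """Deals with plain Python types like int, float, str."""

    def __init__(self, type):
        self.type = type

    def set_normalized(self, value):
        return self.type(value)


class NPArray(DataType):
    """Deals with numpy arrays of a given dtype."""

    def __init__(self, dtype):
        self.dtype = np.dtype(dtype)

    def set_normalized(self, value):
        return np.asarray(value, dtype=self.dtype)

    def serialize(self, value):
        return value.tolist()

    def deserialize(self, value):
        return np.asarray(value, dtype=self.dtype)
```

With this split, an HDF5Dataset(dtype) would only need to override the same few hooks rather than touch central set/get/serialize code.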

Parameters for DataTypes

Potentially, those new types need parameters like dtype or type. Note that types can already have parameters: Reference and SectionReference, for example, take the referenced section definition as a parameter.
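A parameterized type would just be a DataType instance created with constructor arguments, analogous to how Reference/SectionReference already take the referenced section definition. A minimal illustrative sketch (all names and values hypothetical):

```python
class DataType:
    """Base class from the proposed design; empty here for illustration."""


class Reference(DataType):
    def __init__(self, section_def):
        # the referenced section definition is a parameter of the type
        self.section_def = section_def


class NPArray(DataType):
    def __init__(self, dtype):
        # the numpy dtype is a parameter of the type
        self.dtype = dtype


# a quantity definition could then use parameterized type instances, e.g.:
array_type = NPArray(dtype='float64')
ref_type = Reference(section_def='MySectionDefinition')  # placeholder value
```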

Unit and shape

But the unit and the shape should stay in the quantity definition. In all DataType operations, you will have access to the quantity definition anyway. Maybe DataType needs a field supports_shapes or something. Maybe some of the set, get, and (de)serialize operations will need to know whether they should manage the list vs. scalar distinction or whether the data type is doing this itself. For something like Quantity(type=Datetime, shape=['*']), m_to_dict would create a list and ask Datetime to serialize the elements. For Quantity(type=np.float64, shape=[1, 2]), m_to_dict would just call serialize on NPArray and expect it to serialize the whole array and not just an element. For data types that do not support shapes themselves, only scalars and lists would work; for higher shapes we would throw an error.

Backwards compatibility in type definitions

The QuantityType (the type for types) could duck-type and help with backwards compatibility. E.g. every time you use type=np.float64, QuantityType replaces it in its set_normalized function with NPArray(np.float64). Keep in mind that we also need backwards compatibility in how QuantityType serializes types. For example, an NPArray(np.float64) should still (de)serialize to {type_kind: 'numpy', type_data: 'float64'}, etc.
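A sketch of both directions, normalizing legacy values and keeping the established wire format. The class names mirror the examples above but are hypothetical, not the real implementation:

```python
import numpy as np


class NPArray:
    def __init__(self, dtype):
        self.dtype = np.dtype(dtype)


class QuantityType:
    def set_normalized(self, value):
        # backwards compatibility: a plain numpy scalar type like
        # np.float64 is replaced with an equivalent NPArray instance
        if isinstance(value, type) and issubclass(value, np.generic):
            return NPArray(value)
        return value

    def serialize(self, value):
        # keep the established serialized form for numpy-based types
        if isinstance(value, NPArray):
            return {'type_kind': 'numpy', 'type_data': value.dtype.name}
        raise NotImplementedError
```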

Duck-typing and type conversion

When we need to map the metainfo to other systems like Pydantic, Optimade, Mongo, etc., we often use MTypes to figure out whether a quantity value is compatible with a respective foreign Pydantic, Optimade, Mongo, etc. type. Here, MTypes provides lists of types called number, numpy, etc. Maybe DataType can define functions to implement this more explicitly:

class DataType:
    def compatible_with(self, target_type: Type) -> bool:
        """
        Returns true if the given type is compatible. All compatible types can be used in `convert`.
        Also, values of all compatible types can be assigned to quantities with this type.
        """
        return target_type == self

    def convert(self, target_type: Type[T], value) -> T:
        """
        Converts the given value into a value of the given compatible type.
        This does not check whether the given type is actually compatible.
        Use `compatible_with` to check.
        """
        return value

class NPArray(DataType):
    def __init__(self, dtype):
        self.dtype = dtype

    def compatible_with(self, target_type):
        if self.dtype.type in (np.float64, np.float32):
            return target_type == float
        if self.dtype.type in (np.int64, np.uint64):
            return target_type == int
        return target_type == self.dtype.type

    def convert(self, target_type, value):
        if target_type == self.dtype.type:
            return value
        return target_type(value)
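A mapping layer (Pydantic, Mongo, etc.) could then use these two hooks as shown below. This is a self-contained demo that repeats a minimal, corrected version of the NPArray sketch; it is illustrative only:

```python
import numpy as np


class NPArray:
    """Minimal stand-in for the NPArray sketch above."""

    def __init__(self, dtype):
        self.dtype = np.dtype(dtype)

    def compatible_with(self, target_type):
        if self.dtype.type in (np.float64, np.float32):
            return target_type == float
        if self.dtype.type in (np.int64, np.uint64):
            return target_type == int
        return target_type == self.dtype.type

    def convert(self, target_type, value):
        if target_type == self.dtype.type:
            return value
        return target_type(value)


# e.g. a foreign schema generator could ask whether the quantity's numpy
# type can be represented as a plain Python float before converting values
np_type = NPArray(np.float64)
assert np_type.compatible_with(float)
converted = np_type.convert(float, np.float64(1.5))
```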

Smaller things:

  • Maybe we also add more standard types, e.g. Pydantic(pydantic_model), Dataframe(...) for table data.
  • Non-standard data types should be moved to nomad.datamodel.metainfo. Ideally, nomad.metainfo could be reduced to pure Python (no numpy, no pandas, no nomad.config). Possible ways to inject dependencies are specialisations of DataType and Context.
  • Cleanup: a concise way to define annotations
  • Cleanup: remove "more" attributes
  • Cleanup: remove/deprecate label property
  • Cleanup: Remove unused submodules: benchmarks, legacy, generate
  • Cleanup: Deprecate Category
  • Cleanup: Remove Environments completely
  • Split the package vertically (metainfo, extensions, annotations, context, datatypes) and not horizontally (metainfo, utils). This will be hard as a lot of stuff between MSection, Definition, Datatype, Annotation, Context is cyclic by nature.
  • Similar to the context and annotation implementations, the extensions should also go into nomad.datamodel.metainfo, where they are only imported if actually needed and may depend on more than basic Python packages.
  • Also, nexus should definitely move: first into nomad.datamodel.metainfo, but eventually into its own plugin.

Not everything has to be in one MR.

Edited Mar 20, 2024 by Markus Scheidgen