Metainfo improvements
### The DataType idea

Move all types into their own `DataTypes` and remove the "if then else" logic from the *set*, *get*, and *(de)serialize* operations. E.g. `NPArray(dtype)` and `PythonType(type)` should deal with numpy and Python types respectively. This should make it much easier to extend `NPArray(dtype)` with something like `HDF5Dataset(dtype)` in the future. It might also make it easier to add more specialised types. Also, `m_to_dict`, `__set_normalized`, etc. look horrible in their current state.

### Parameters for DataTypes

Potentially those new types need parameters like `dtype` or `type`. Note that types can already have parameters: `Reference` and `SectionReference`, for example, take the referenced section definition as a parameter.

### Unit and shape

The unit and the shape should stay in the quantity definition. In all `DataType` operations, you will have access to the quantity definition anyway. Maybe `DataType` needs a field `supports_shapes` or something similar. The *set*, *get*, and *(de)serialize* operations will need to know whether they should manage list vs. scalar handling themselves or whether the data type does this.

For something like `Quantity(type=Datetime, shape=['*'])`, `m_to_dict` would create a list and would ask `Datetime` to serialize the elements. For `Quantity(type=np.float64, shape=[1,2])`, `m_to_dict` would just call `serialize` on `NPArray` and would expect it to serialize the whole array and not just an element. For data types that do not support shapes themselves, only scalars and lists would work; for higher-dimensional shapes we would throw an error.

### Backwards compatibility in type definitions

The `QuantityType` (the type for types) could duck-type and help with backwards compatibility. E.g. every time you use `type=np.float64`, `QuantityType` replaces it in its `set_normalized` function with `NPArray(np.float64)`. Keep in mind that we also need backwards compatibility in how `QuantityType` serializes types.
For example, an `NPArray(np.float64)` should still (de)serialize to `{type_kind: 'numpy', type_data: 'float64'}`, etc.

### Duck-typing and type conversion

When we need to map the metainfo to other systems like Pydantic, Optimade, Mongo, etc., we often make use of `MTypes` to figure out whether a quantity value is compatible with the respective foreign type. Here `MTypes` provides lists of types that are called `number`, `numpy`, etc. Maybe `DataType` can define functions to implement this more explicitly:

```py
from typing import Any, Type, TypeVar

import numpy as np

T = TypeVar('T')


class DataType:
    def compatible_with(self, target_type: type) -> bool:
        """
        Returns true if the given type is compatible. All compatible types
        can be used in `convert`. Also values in all compatible types can be
        assigned to quantities with self type.
        """
        return target_type == self

    def convert(self, target_type: Type[T], value: Any) -> T:
        """
        Converts the given value into a value of the given compatible type.
        This will not assert if the given type is actually compatible.
        Use `compatible_with` to check.
        """
        return value


class NPArray(DataType):
    def __init__(self, dtype):
        # Accept both `np.float64` and `np.dtype('float64')`.
        self.dtype = np.dtype(dtype)

    def compatible_with(self, target_type: type) -> bool:
        # The dtype's own scalar type is always compatible.
        if target_type == self.dtype.type:
            return True
        if self.dtype.type in (np.float64, np.float32):
            return target_type is float
        if self.dtype.type in (np.int64, np.uint64):
            return target_type is int
        return False

    def convert(self, target_type: Type[T], value: Any) -> T:
        if target_type == self.dtype.type:
            return value
        # Scalar conversion, e.g. np.float64 -> float.
        return target_type(value)
```

### Smaller things:

- Maybe we also add more standard types, e.g. `Pydantic(pydantic_model)`, `Dataframe(...)` for table data.
- Non-standard data types should be moved to `nomad.datamodel.metainfo`. Ideally, `nomad.metainfo` could be reduced to pure Python (no numpy, no pandas, no nomad.config). Possible ways to inject dependencies are specialisations of `DataType` and `Context`.
- Cleanup: a concise way to define annotations
- Cleanup: remove "more" attributes
- Cleanup: remove/deprecate `label` property
- Cleanup: remove unused submodules: `benchmarks`, `legacy`, `generate`
- Cleanup: deprecate `Category`
- Cleanup: remove `Environments` completely
- Split the package vertically (metainfo, extensions, annotations, context, datatypes) and not horizontally (metainfo, utils). This will be hard, as a lot of stuff between `MSection`, `Definition`, `Datatype`, `Annotation`, `Context` is cyclic by nature.
- Similar to the context and annotation implementations, the extensions should also go into `nomad.datamodel.metainfo`, where they are only imported if actually needed and where they might depend on more than basic Python packages.
- Also, `nexus` should definitely move: first into `nomad.datamodel.metainfo`, but eventually into its own plugin.

Not everything has to be in one MR.
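
To make the shape handling proposed under "Unit and shape" more concrete, here is a minimal sketch of how a serializer could dispatch on a `supports_shapes` flag. Everything in it is an assumption for illustration: `serialize_quantity` is a hypothetical stand-in for the relevant part of `m_to_dict`, the `serialize` signature is invented, and `NestedList` is a pure-Python placeholder for a shape-aware type such as `NPArray`, used only to keep the sketch free of numpy.

```python
from datetime import datetime
from typing import Any, List


class DataType:
    # Hypothetical flag: True if the type handles whole shaped values itself.
    supports_shapes = False

    def serialize(self, value: Any) -> Any:
        return value


class Datetime(DataType):
    # Scalar-only type: serializes a single element.
    def serialize(self, value: datetime) -> str:
        return value.isoformat()


class NestedList(DataType):
    # Placeholder for a shape-aware type like NPArray (assumption).
    supports_shapes = True

    def serialize(self, value: Any) -> List[list]:
        # Serializes the whole 2D value at once.
        return [list(row) for row in value]


def serialize_quantity(data_type: DataType, shape: list, value: Any) -> Any:
    """Hypothetical stand-in for the shape handling in `m_to_dict`."""
    if len(shape) == 0 or data_type.supports_shapes:
        # Scalars, or types that manage their own shape.
        return data_type.serialize(value)
    if len(shape) == 1:
        # The caller builds the list, the type serializes the elements.
        return [data_type.serialize(item) for item in value]
    raise TypeError(
        f'{type(data_type).__name__} does not support shape {shape}')


print(serialize_quantity(Datetime(), ['*'], [datetime(2024, 1, 2)]))
# prints ['2024-01-02T00:00:00']
```

With this split, a new type only has to decide once whether it handles shapes itself. The list-vs-scalar logic stays in one place instead of being repeated in every *set*, *get*, and *(de)serialize* branch.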