Metainfo improvements
The DataType idea
Move all types into their own DataTypes and remove the "if then else" logic from set, get, and (de)serialize operations. E.g. NPArray(dtype) and PythonType(type) should deal with np and python types respectively. This should make it much easier to extend NPArray(dtype) with something like HDF5Dataset(dtype) in the future. It might also make it easier to add more specialised types. Also, m_to_dict, __set_normalized, etc. look horrible in their current state.
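A minimal sketch of the idea, assuming a hypothetical DataType base class with normalize, serialize, and deserialize hooks. These method names and the heavily reduced Quantity class are illustrative only and are not the existing metainfo API. The point is that the quantity delegates to its type object instead of branching on the type:

```python
from datetime import datetime


class DataType:
    """Hypothetical base class: all type-specific behaviour lives in subclasses."""

    def normalize(self, value):
        # Called on set; validates or coerces the given value.
        return value

    def serialize(self, value):
        return value

    def deserialize(self, value):
        return value


class PythonType(DataType):
    def __init__(self, type_):
        self.type = type_

    def normalize(self, value):
        if not isinstance(value, self.type):
            raise TypeError(f'{value!r} is not a {self.type.__name__}')
        return value


class Datetime(DataType):
    def normalize(self, value):
        if not isinstance(value, datetime):
            raise TypeError(f'{value!r} is not a datetime')
        return value

    def serialize(self, value):
        return value.isoformat()

    def deserialize(self, value):
        return datetime.fromisoformat(value)


class Quantity:
    """Reduced stand-in for a quantity definition (assumption, not the real class)."""

    def __init__(self, type):
        self.type = type

    def set(self, value):
        # No "if then else" on the type: the data type does the work.
        return self.type.normalize(value)

    def to_dict(self, value):
        return self.type.serialize(value)


created = Quantity(type=Datetime())
value = created.set(datetime(2024, 1, 2, 3, 4, 5))
print(created.to_dict(value))  # prints 2024-01-02T03:04:05
```

Adding something like HDF5Dataset(dtype) would then mean adding one more subclass, without touching Quantity.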
Parameters for DataTypes
Potentially those new types need parameters like dtype or type. Note that types can already have parameters: Reference and SectionReference, for example, take the referenced section definition as a parameter.
Unit and shape
But the unit and the shape should stay in the quantity definition. In all DataType operations, you will have access to the quantity definition anyway. Maybe DataType needs a field supports_shapes or something. The set, get, and (de)serialize operations may need to know whether they should manage list vs. scalar themselves or whether the data type is actually doing this. For something like Quantity(type=Datetime, shape=['*']), m_to_dict would create a list and would ask Datetime to serialize the elements. For Quantity(type=np.float64, shape=[1,2]), m_to_dict would just call serialize on NPArray and would expect it to serialize the whole array and not just an element. For data types that do not support shapes themselves, only scalars and lists would work; for higher-dimensional shapes we would throw an error.
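The shape handling described above could look roughly like this. It is a sketch under assumptions: supports_shapes is the field proposed in the text, while serialize_quantity and the NestedList type are hypothetical names. NestedList stands in for NPArray so that the example needs no numpy.

```python
from datetime import datetime


class DataType:
    # Proposed flag: True if the type handles whole shaped values itself.
    supports_shapes = False

    def serialize(self, value):
        return value


class Datetime(DataType):
    def serialize(self, value):
        return value.isoformat()


class NestedList(DataType):
    """Stand-in for NPArray: receives the whole shaped value at once."""

    supports_shapes = True

    def serialize(self, value):
        return [list(row) for row in value]


def serialize_quantity(data_type, shape, value):
    """What m_to_dict could do for a single quantity value."""
    if len(shape) == 0 or data_type.supports_shapes:
        # Scalar, or the type serializes the whole array by itself.
        return data_type.serialize(value)
    if len(shape) == 1:
        # The metainfo manages the list and asks the type for each element.
        return [data_type.serialize(item) for item in value]
    raise TypeError(
        f'{type(data_type).__name__} does not support shape {shape}; '
        'only scalars and lists work'
    )


dates = [datetime(2024, 1, 1), datetime(2024, 1, 2)]
print(serialize_quantity(Datetime(), ['*'], dates))
print(serialize_quantity(NestedList(), [1, 2], ((1.0, 2.0),)))
```

The same flag would have to be consulted in set, get, and deserialize, so a shared helper of this kind may be preferable to repeating the check in each operation.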
Backwards compatibility in type definitions
The QuantityType (the type for types) could duck-type and help with backwards compatibility. E.g. every time you use type=np.float64, QuantityType replaces it in its set_normalized function with NPArray(np.float64). Keep in mind that we also need backwards compatibility in how QuantityType serializes types. For example, an NPArray(np.float64) should still (de)serialize to {type_kind: 'numpy', type_data: 'float64'}, etc.
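A rough sketch of both compatibility directions, restricted to the numpy case because that is the only serialized form spelled out here. The function names normalize_type, serialize_type, and deserialize_type are hypothetical; in the real code this logic would presumably live in QuantityType.

```python
import numpy as np


class NPArray:
    def __init__(self, dtype):
        # np.dtype accepts np.float64, 'float64', and existing dtype objects.
        self.dtype = np.dtype(dtype)


def normalize_type(value):
    """What set_normalized could do with a type definition (assumed helper)."""
    if isinstance(value, NPArray):
        return value
    if isinstance(value, type) and issubclass(value, np.generic):
        # Legacy style: type=np.float64 is silently wrapped.
        return NPArray(value)
    raise TypeError(f'{value!r} is not handled in this sketch')


def serialize_type(data_type):
    # Keep the serialized form that existing data already uses.
    return {'type_kind': 'numpy', 'type_data': data_type.dtype.name}


def deserialize_type(data):
    if data.get('type_kind') == 'numpy':
        return NPArray(data['type_data'])
    raise TypeError(f'{data!r} is not handled in this sketch')


legacy = normalize_type(np.float64)
print(serialize_type(legacy))  # prints {'type_kind': 'numpy', 'type_data': 'float64'}
```

Other type kinds would need the same round trip; their serialized forms are not covered here.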
Duck-typing and type conversion
When we need to map the metainfo to other systems like Pydantic, Optimade, Mongo, etc., we often make use of MTypes to figure out if a quantity value is compatible with the respective foreign type. Here, MTypes provides lists of types that are called number, numpy, etc. Maybe DataType can define functions to implement this more explicitly:
from typing import Type, TypeVar

import numpy as np

T = TypeVar('T')


class DataType:
    def compatible_with(self, target_type: Type) -> bool:
        """
        Returns true if the given type is compatible. All compatible types can be used in `convert`.
        Also values in all compatible types can be assigned to quantities with self type.
        """
        return target_type == self

    def convert(self, target_type: Type[T], value) -> T:
        """
        Converts the given value into a value of the given compatible type.
        This will not assert if the given type is actually compatible.
        Use `compatible_with` to check.
        """
        return value


class NPArray(DataType):
    def __init__(self, dtype):
        # np.dtype accepts both np.float64 and np.dtype('float64'),
        # so that self.dtype.type is always available.
        self.dtype = np.dtype(dtype)

    def compatible_with(self, target_type):
        if self.dtype.type in [np.float64, np.float32]:
            return target_type == float
        if self.dtype.type in [np.int64, np.uint64]:
            return target_type == int
        return target_type == self.dtype.type

    def convert(self, target_type, value):
        if target_type == self.dtype.type:
            return value
        return target_type(value)
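For the mapping use case, a caller could try a list of foreign types and convert to the first compatible one. This is only a sketch: export_value is a hypothetical helper, and the Float class is a numpy-free stand-in for NPArray(np.float64).

```python
class DataType:
    def compatible_with(self, target_type) -> bool:
        return target_type == self

    def convert(self, target_type, value):
        return value


class Float(DataType):
    """Stand-in for NPArray(np.float64), kept free of numpy for this sketch."""

    def compatible_with(self, target_type) -> bool:
        return target_type is float

    def convert(self, target_type, value):
        return target_type(value)


def export_value(data_type, candidate_types, value):
    """Converts value to the first foreign type the data type is compatible with."""
    for target_type in candidate_types:
        if data_type.compatible_with(target_type):
            return data_type.convert(target_type, value)
    raise TypeError(f'no compatible type among {candidate_types}')


print(export_value(Float(), [str, float], 3))  # prints 3.0
```

This keeps the compatibility rules next to each data type instead of in central lists like MTypes.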
Smaller things:
- Maybe we also add more standard types, e.g. Pydantic(pydantic_model), Dataframe(...) for table data.
- Non standard data types should be moved to nomad.datamodel.metainfo. Ideally, the nomad.metainfo could be reduced to pure Python (no numpy, no pandas, no nomad.config). Ways to possibly inject dependencies are specialisations of DataType and Context.
- Cleanup: a concise way to define annotations
- Cleanup: remove "more" attributes
- Cleanup: remove/deprecate label property
- Cleanup: Remove unused submodules: benchmarks, legacy, generate
- Cleanup: Deprecate Category
- Cleanup: Remove Environments completely
- Split the package vertically (metainfo, extensions, annotations, context, datatypes) and not horizontally (metainfo, utils). This will be hard as a lot of stuff between MSection, Definition, DataType, Annotation, Context is cyclic by nature.
- Similar to the context and annotation implementations, the extensions should also go into nomad.datamodel.metainfo, where they are only imported if actually needed and they might depend on more than basic python packages.
- Also nexus should definitely move. First into nomad.datamodel.metainfo, but eventually into its own plugin.
Not everything has to be in one MR.