Add Feature for Descriptive Array Quantity
After discussions with Area A and B (https://github.com/FAIRmat-NFDI/AreaA-data_modeling_and_schemas/discussions/30 and https://github.com/FAIRmat-NFDI/AreaA-data_modeling_and_schemas/discussions/31), we realized the need for a descriptive array quantity within NOMAD.
Two separate quantities were identified. The first type (project name NumericalArray) would be a section/class/object containing an n-dimensional array with precomputed properties: mean, min, max, standard deviation, and shape (dimensions). This is analogous to a NumPy ndarray, where these values are calculated by the built-in methods mean(), min(), max(), and std() and the properties shape and ndim. Additionally, we discussed including quantiles (first ventile, first quartile, median, third quartile, 19th ventile), which can similarly be calculated (on the flattened array) with NumPy's quantile(array, q=(0.05, 0.25, 0.5, 0.75, 0.95)). An example would be:
import json
import numpy as np
mu, sigma = 1, 0.1 # mean and standard deviation
a = np.random.normal(mu, sigma, (20, 30, 10))
qs = (0.05, 0.25, 0.5, 0.75, 0.95)
quants = np.quantile(a, qs)
descriptors = {
    "dimensionality": a.ndim,
    "shape": a.shape,
    "mean": a.mean(),
    "min": a.min(),
    "max": a.max(),
    "standard_deviation": a.std(),
    "quantiles": {q: quant for q, quant in zip(qs, quants)},
}
print(json.dumps(descriptors, indent=2))
{
  "dimensionality": 3,
  "shape": [
    20,
    30,
    10
  ],
  "mean": 1.0002707071723846,
  "min": 0.6440913244294681,
  "max": 1.3667379386039438,
  "standard_deviation": 0.09958481036605629,
  "quantiles": {
    "0.05": 0.8351471083009838,
    "0.25": 0.931646179009207,
    "0.5": 0.9991245263293533,
    "0.75": 1.0693072252478522,
    "0.95": 1.1614793219827306
  }
}
In addition, we need a reduced preview of this array, which could be provided by the changes being made to the API. There could be three levels of access:
- Access the descriptors above
- Access a subset of the array
- Access the whole array
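The three access levels could be sketched roughly as follows. This is only an illustration under assumptions: the wrapper class and its method names (descriptors, subset, full) are hypothetical placeholders and do not describe the NOMAD API or the planned API changes.

```python
import numpy as np


class NumericalArray:
    """Hypothetical wrapper; all names here are illustrative assumptions."""

    QUANTILES = (0.05, 0.25, 0.5, 0.75, 0.95)

    def __init__(self, values):
        self._values = np.asarray(values, dtype=float)

    def descriptors(self):
        """Level 1: summary statistics only, no array data."""
        a = self._values
        quants = np.quantile(a, self.QUANTILES)
        return {
            "dimensionality": a.ndim,
            "shape": list(a.shape),
            "mean": float(a.mean()),
            "min": float(a.min()),
            "max": float(a.max()),
            "standard_deviation": float(a.std()),
            "quantiles": {str(q): float(v) for q, v in zip(self.QUANTILES, quants)},
        }

    def subset(self, index):
        """Level 2: a slice of the array, e.g. for a reduced preview."""
        return self._values[index]

    def full(self):
        """Level 3: the whole array."""
        return self._values


arr = NumericalArray(np.arange(24).reshape(2, 3, 4))
summary = arr.descriptors()                           # level 1
preview = arr.subset((0, slice(None), slice(0, 2)))   # level 2, shape (3, 2)
everything = arr.full()                               # level 3
```

In a real implementation the descriptors would presumably be computed once and stored, so that level 1 never has to load the array itself.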
The second type of array (project name ContextArray) would be a NumericalArray within some context: axes, units, uncertainty, and quantization. This would be analogous to xarray's DataArray. Here the axes would themselves be ContextArrays with units etc. The numerical values would be stored in NumericalArrays with their descriptors. This would allow us to search, for example, for the 19th ventile (max without outliers) of the transmission for measurements where a certain wavelength, lambda, was measured (lambda > min, lambda < max).
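A rough sketch of that search, using plain NumPy and a dataclass rather than any existing NOMAD or xarray type. The ContextArray fields, the helper names (make_measurement, covers), and the sample wavelength ranges are all assumptions made for illustration; uncertainty and quantization are omitted for brevity.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ContextArray:
    """Hypothetical container: values plus a unit and named axes."""

    name: str
    values: np.ndarray
    unit: str = ""
    axes: dict = field(default_factory=dict)  # axis name -> ContextArray

    def quantile(self, q):
        """Quantile of the flattened values (would be a stored descriptor)."""
        return float(np.quantile(self.values, q))


def make_measurement(lam_min, lam_max, seed):
    """Build a fake transmission spectrum over a wavelength axis (made-up data)."""
    rng = np.random.default_rng(seed)
    wavelength = ContextArray("wavelength", np.linspace(lam_min, lam_max, 50), unit="nm")
    return ContextArray(
        "transmission",
        rng.uniform(0.0, 1.0, 50),
        unit="1",
        axes={"wavelength": wavelength},
    )


def covers(measurement, lam):
    """True if lam lies strictly between the min and max of the wavelength axis."""
    axis = measurement.axes["wavelength"].values
    return bool(axis.min() < lam < axis.max())


measurements = [
    make_measurement(300.0, 800.0, seed=0),
    make_measurement(900.0, 1700.0, seed=1),
]

# 19th ventile of the transmission for every measurement that covers 550 nm
hits = [m.quantile(0.95) for m in measurements if covers(m, 550.0)]
```

Because the query only touches the min and max of the axis and one quantile of the values, it could in principle be answered from the stored descriptors alone, without loading any array.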
Additionally, we discussed that the values could be a reference to:
- Another data field
- An external file
- A virtual source (any combination or subset of the above)
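One way to picture these three reference kinds is a small resolver. The sketch below is purely hypothetical: the dictionary layout ("kind", "path", "parts", "index"), the resolve function, and the use of .npy files for external data are assumptions for illustration, not a proposed schema.

```python
import os
import tempfile

import numpy as np


def resolve(source, fields):
    """Return the array a (hypothetical) reference description points to."""
    kind = source["kind"]
    if kind == "field":
        # Reference to another data field, looked up by its path
        return fields[source["path"]]
    if kind == "file":
        # Reference to an external file (here assumed to be a .npy file)
        return np.load(source["path"])
    if kind == "virtual":
        # Combination of subsets of other sources, concatenated in order
        parts = [
            resolve(part["source"], fields)[part.get("index", slice(None))]
            for part in source["parts"]
        ]
        return np.concatenate(parts)
    raise ValueError(f"unknown source kind: {kind!r}")


fields = {"run/energies": np.arange(5.0)}  # made-up field path and data

with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "external.npy")
    np.save(file_path, np.arange(5.0, 10.0))

    virtual = {
        "kind": "virtual",
        "parts": [
            {"source": {"kind": "field", "path": "run/energies"}, "index": slice(1, 3)},
            {"source": {"kind": "file", "path": file_path}, "index": slice(0, 2)},
        ],
    }
    combined = resolve(virtual, fields)
```

Here combined holds two elements taken from the field followed by two taken from the file, which shows how a virtual source can mix subsets of the other two kinds.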