Outlier handling in search
As we do not enforce any strict quality measures for the processed entries, our data sometimes contains outlier values originating from invalid or interrupted calculations. One example is the geometry optimization final_energy_difference: this value is typically < 1e-5 eV, but some outliers have values of 1e10 eV. This is problematic for data presentation, because a single such value can dominate the range of a histogram or min/max aggregation.
We do not wish to remove or fix such calculations at the Archive level, as they should remain public and users should still be able to find and access them. What we can do is sanitize the aggregations for some fields in order to present reasonable data, e.g. in the GUI. Sanitizing the aggregations means excluding these outliers from histograms, min_max aggregations, etc., which makes the presentation much more meaningful.
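To illustrate the effect, here is a minimal sketch in plain Python of a min/max aggregation with and without sanitation. The BOUNDS table, the min_max helper and the numeric limits are assumptions made up for this example; they are not existing code, and the real limits would have to be chosen per quantity.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical per-quantity limits in eV. These numbers are illustrative
# only and are not taken from the actual metainfo or search annotations.
BOUNDS: Dict[str, Tuple[float, float]] = {
    "final_energy_difference": (-1e-3, 1e-3),
}


def min_max(
    values: Iterable[float], quantity: str, sanitize: bool = True
) -> Optional[Tuple[float, float]]:
    """Return (min, max), optionally ignoring values outside BOUNDS."""
    kept = list(values)
    if sanitize and quantity in BOUNDS:
        low, high = BOUNDS[quantity]
        # Outliers are skipped for the aggregation only; the underlying
        # entries are not modified or removed.
        kept = [v for v in kept if low <= v <= high]
    if not kept:
        return None
    return min(kept), max(kept)


energies = [2e-6, 8e-6, 5e-7, 1e10]  # the last value mimics a broken run
raw = min_max(energies, "final_energy_difference", sanitize=False)
clean = min_max(energies, "final_energy_difference", sanitize=True)
```

Without sanitation the maximum is the 1e10 eV outlier, while the sanitized range only spans the three physically reasonable values.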
TODO:
- Add new ES annotation option for min/max values per quantity.
- Add new API aggregation option: sanitize, which controls whether the min/max filtering is applied before the aggregation is built. This will only affect the aggregation, not the returned results. This way the GUI can perform the sanitation, and API users can control it as they see fit.
- Add an option for controlling the sanitation from the GUI?
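One possible shape for the sanitize option is sketched below. It builds the Elasticsearch aggregation body as a plain dict and, when sanitation is requested, wraps the histogram in a filter aggregation with a range query. Because the filter sits inside the aggregation rather than in the top-level query, the returned hits are unaffected. The function name build_histogram_agg, its parameters and the "agg"/"sanitized" keys are hypothetical and only serve this example; the bounds would come from the proposed per-quantity annotation.

```python
from typing import Any, Dict, Optional, Tuple


def build_histogram_agg(
    field: str,
    interval: float,
    bounds: Optional[Tuple[float, float]] = None,
    sanitize: bool = False,
) -> Dict[str, Any]:
    """Build an Elasticsearch aggregation body for a histogram on `field`.

    With sanitize=True and known bounds, the histogram is nested inside a
    filter aggregation so that only in-range values are bucketed.
    """
    histogram = {"histogram": {"field": field, "interval": interval}}
    if not sanitize or bounds is None:
        return {"agg": histogram}
    low, high = bounds
    return {
        "agg": {
            # The range filter applies to this aggregation only.
            "filter": {"range": {field: {"gte": low, "lte": high}}},
            "aggs": {"sanitized": histogram},
        }
    }


plain = build_histogram_agg("final_energy_difference", 1e-6)
sanitized = build_histogram_agg(
    "final_energy_difference", 1e-6, bounds=(-1e-3, 1e-3), sanitize=True
)
```

Keeping the flag on the aggregation lets the GUI request sanitized histograms by default, while API users who want the raw distribution can simply leave it off.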