Search suggestions improvement
The currently stored suggestion values are insufficient in some cases:
- The tokenizer splits words at certain breakpoints, and each part of the split is inserted as a suggestion input. Individual parts are therefore matched correctly, but nothing matches once the user keeps typing past a breakpoint: e.g. `PBE` will match `GGA_C_PBE_SOL` and `GGA_X_PBE_SOL`, but `PBE_SOL` will not match anything. The solution would be to still split by certain tokens, but always store the full remaining string after the split.
- Suggestions are not returned for all unique values, but only for unique suggestion values. This means that searching for `O2` will not return both `Mg2O2` and `Ni2O2`, but only one of them, since `O2` is a value that is stored by both suggestions and `skip_duplicates=True`. This cannot be fixed by simply setting `skip_duplicates=False`, as this would also return duplicate entries (if there are multiple `Mg2O2` entries, `Mg2O2` will be included twice in the suggestions). A way around this is to include the prefix in every individual token: e.g. `Ni2O2` would normally be split into two tokens, `Ni2O2` and `O2`, but to make the suggestion values unique we add the prefix after the token, so that the final suggestion tokens will be `Ni2O2` and `O2 Ni2`.
- Formula suggestions (and search) should be improved. There are a few alternative formula definitions (Hill, metal, etc.) that are widely used. Our search index should store only the "normalized" formula, which is the Hill formula, but the suggestions and search should still take other variations of the formula into account. This would be very similar to what we do with units: we store only one version, but allow users to work in another system by doing on-the-fly transformations. To do this, the suggestions returned by the suggestions endpoint should be augmented with formula variants. The search interface should then also translate formula searches into the Hill form before executing the search.
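The first two points could be sketched roughly as follows. This is only an illustration: the breakpoint pattern, the element pattern, and the function names `suffix_inputs` and `formula_inputs` are assumptions made for the example, not the existing tokenizer.

```python
import re

# Assumed breakpoint characters; the real tokenizer's split rules may differ.
BREAKPOINTS = re.compile(r"[_\-\s]+")
# Assumed pattern for one element symbol followed by an optional count.
ELEMENT = re.compile(r"[A-Z][a-z]?\d*")


def suffix_inputs(value: str) -> list[str]:
    """Return the full value plus the full remaining string after each breakpoint."""
    inputs = [value]
    for match in BREAKPOINTS.finditer(value):
        rest = value[match.end():]
        if rest and rest not in inputs:
            inputs.append(rest)
    return inputs


def formula_inputs(formula: str) -> list[str]:
    """Return the formula plus each later token with the skipped prefix appended.

    Appending the prefix keeps the inputs of different formulas distinct,
    so de-duplication no longer collapses e.g. Mg2O2 and Ni2O2 into one hit.
    """
    inputs = [formula]
    for match in ELEMENT.finditer(formula):
        if match.start() == 0:
            continue  # the first token is already covered by the full formula
        inputs.append(f"{formula[match.start():]} {formula[:match.start()]}")
    return inputs
```

With these assumptions, `suffix_inputs("GGA_C_PBE_SOL")` gives `["GGA_C_PBE_SOL", "C_PBE_SOL", "PBE_SOL", "SOL"]`, so typing `PBE_SOL` still prefix-matches, and `formula_inputs("Ni2O2")` gives `["Ni2O2", "O2 Ni2"]`, which stays distinct from the `"O2 Mg2"` input generated for `Mg2O2`.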
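For the formula point, translating a user-typed variant into the Hill form before searching might look like the sketch below. It assumes flat formulas without parentheses or charges, and that a count of 1 is omitted in the stored formula; `to_hill` is a hypothetical helper, not an existing function in the code base.

```python
import re
from collections import Counter

# One element symbol with an optional integer count (flat formulas only).
ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")


def to_hill(formula: str) -> str:
    """Reorder a flat formula into Hill order.

    Hill order: if carbon is present, C first, then H, then the remaining
    elements alphabetically; otherwise all elements alphabetically.
    """
    counts = Counter()
    for symbol, number in ELEMENT_COUNT.findall(formula):
        counts[symbol] += int(number) if number else 1

    if "C" in counts:
        order = ["C"]
        if "H" in counts:
            order.append("H")
        order += sorted(s for s in counts if s not in ("C", "H"))
    else:
        order = sorted(counts)

    # Assumption: a count of 1 is left implicit in the normalized formula.
    return "".join(f"{s}{counts[s] if counts[s] > 1 else ''}" for s in order)
```

Under these assumptions, a search for `O2Mg2` would be rewritten to `Mg2O2` and `H4C` to `CH4` before the query is executed, while the index itself keeps storing only the Hill form.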
Edited by Lauri Himanen