Search suggestions improvement
The currently stored suggestion values are insufficient in some cases:
- The tokenizer splits words at certain breakpoints, and each part of the split is inserted as a suggestion input. This causes individual parts to be matched correctly, but nothing matches once the user types past a breakpoint: e.g. PBE will match GGA_C_PBE_SOL and GGA_X_PBE_SOL, but PBE_SOL will not match anything. The solution would be to still split at certain tokens, but always store the full remaining string after the split.
- Suggestions are not returned for all unique values, but only for unique suggestion values. This means that searching for O2 will not return both formulas Mg2O2 and Ni2O2, but only one of them, since O2 is a value that is stored by both suggestions and skip_duplicates=True. This cannot be fixed by simply setting skip_duplicates=False, as this would also return duplicate entries (if there are multiple Mg2O2 entries, Mg2O2 will be included twice in the suggestions). A way around this is to include the prefix for all individual tokens (e.g. Ni2O2 would normally be split into two tokens, Ni2O2 and O2, but to make the suggestion values unique, we add the prefix after the token, so that the final suggestion tokens will be: Ni2O2, O2 Ni2).
- Formula suggestions (and search) should be improved. There are a few alternative formula definitions (Hill, metal, etc.) that are widely used. Our search index should store only the "normalized" formula, which is the Hill formula, but the suggestions and search should still take other variations of the formula into account. This would be very similar to what we do with units: we only store one version, but allow users to work in another system by doing on-the-fly transformations. To do this, the suggestions returned by the suggestions endpoint should be augmented with formula variants. The search interface should then also translate formula searches into the Hill form before executing the search.
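
The three points above could be sketched roughly as follows. This is only an illustration and not the actual implementation: the function names suggestion_inputs and to_hill, the split patterns, and the choice to keep the separator as part of the appended prefix are all assumptions made for the example. The first function stores the full remaining string after every split point and appends the preceding prefix to keep the inputs unique. The second reorders a simple formula (no parentheses or charges) into Hill order: carbon first, hydrogen second, the rest alphabetically, or purely alphabetically when there is no carbon.

```python
import re


def suggestion_inputs(value, split_pattern):
    """Return one suggestion input per split point (hypothetical helper).

    Each input is the full remaining string after the split, followed by
    the preceding prefix so that inputs from different values stay unique.
    """
    # Position 0 is always a start; every match end is a further start.
    starts = sorted({0} | {m.end() for m in re.finditer(split_pattern, value)})
    inputs = []
    for start in starts:
        if start >= len(value):
            continue  # nothing left after a trailing separator
        suffix, prefix = value[start:], value[:start]
        inputs.append(f"{suffix} {prefix}" if prefix else suffix)
    return inputs


def to_hill(formula):
    """Reorder a simple formula such as 'O2Mg2' into Hill order (hypothetical helper)."""
    counts = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(count or 1)
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    # Assumption: a count of 1 is left implicit in the normalized form.
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}" for e in order)


# Functional names split at underscores, formulas split before each element.
print(suggestion_inputs("GGA_C_PBE_SOL", "_"))
print(suggestion_inputs("Ni2O2", r"(?=[A-Z])"))
print(to_hill("O2Mg2"))
```

With inputs like these, a prefix query for PBE_SOL would hit the stored input "PBE_SOL GGA_C_", and the O2 inputs for Ni2O2 and Mg2O2 ("O2 Ni2" and "O2 Mg2") would no longer collapse into one suggestion when duplicates are skipped.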
Edited by Lauri Himanen