Search suggestions improvement
The currently stored suggestion values are insufficient in some cases:
The tokenizer splits words at certain breakpoints, and each part of the split is inserted as a suggestion input match. Individual parts are therefore matched correctly, but nothing matches once the user continues typing past a breakpoint: a single part of GGA_X_PBE_SOL matches, but PBE_SOL will not match anything. The solution would be to still split by certain tokens, but always store the full remaining string after the split.
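A minimal sketch of the "store the full remaining string" idea, assuming underscores, hyphens and whitespace are the breakpoints. The function name `suffix_suggestions` and the separator pattern are hypothetical and not taken from the existing tokenizer:

```python
import re
from typing import List


def suffix_suggestions(value: str, separators: str = r"[_\-\s]+") -> List[str]:
    """Return the full value plus the remaining string after each breakpoint."""
    suggestions = [value]
    for match in re.finditer(separators, value):
        rest = value[match.end():]
        if rest:  # skip an empty remainder caused by a trailing separator
            suggestions.append(rest)
    return suggestions


print(suffix_suggestions("GGA_X_PBE_SOL"))
# ['GGA_X_PBE_SOL', 'X_PBE_SOL', 'PBE_SOL', 'SOL']
```

With these inputs stored, a prefix match on PBE_SOL finds the suggestion input PBE_SOL, which still points back to the full value GGA_X_PBE_SOL.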
Suggestions are not returned for all unique values, but only for unique suggestion values. This means that searching for O2 will not return both Mg2O2 and Ni2O2, but only one of them, since O2 is a value that is stored by both suggestions and skip_duplicates=True is set. This cannot be fixed by simply setting skip_duplicates=False, as this would also return duplicate entries (if there are multiple Mg2O2 entries, Mg2O2 will be included twice in the suggestions). A way around this is to include the prefix for all individual tokens: Ni2O2 would normally be split into two tokens, Ni2 and O2, but to make the suggestion values unique, we add the prefix after the token, so that every stored suggestion value is unique to its formula.
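One possible reading of this scheme, sketched under the assumption that a formula is split into element-plus-count tokens and that "adding the prefix after the token" means appending the preceding tokens to the end of the suggestion value. The function name `formula_suggestions`, the regular expression and the exact output format are illustrative assumptions, not the stored format:

```python
import re
from typing import List

# Assumption: a token is an element symbol followed by an optional count.
TOKEN = re.compile(r"[A-Z][a-z]?\d*")


def formula_suggestions(formula: str) -> List[str]:
    """Build one suggestion value per token, with the preceding tokens appended."""
    tokens = TOKEN.findall(formula)
    return ["".join(tokens[i:] + tokens[:i]) for i in range(len(tokens))]


print(formula_suggestions("Ni2O2"))  # ['Ni2O2', 'O2Ni2']
print(formula_suggestions("Mg2O2"))  # ['Mg2O2', 'O2Mg2']
```

Both formulas still match a search for O2, but their suggestion values now differ, so skip_duplicates=True no longer collapses them, while repeated Mg2O2 entries continue to be deduplicated.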
Formula suggestions (and search) should be improved. There are a few alternative formula definitions (Hill, metal, etc.) that are widely used. Our search index should store only the "normalized" formula, which is the Hill formula, but the suggestions and search should still take into account other variations of the formula. This would be very similar to what we do with units: we only store one version, but allow the users to work in another system as well by doing on-the-fly transformations. To do this, the suggestions should be augmented with formula variants that are returned by the suggestions endpoint. The search interface should then also translate formula searches into the Hill form before executing the search.
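The translation step could look roughly like the sketch below. It applies the Hill convention (carbon first, hydrogen second, then the remaining elements alphabetically; fully alphabetical when there is no carbon) and only handles flat formulas without parentheses or charges. The function name `to_hill` is hypothetical:

```python
import re
from collections import Counter

# Element symbol followed by an optional integer count.
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def to_hill(formula: str) -> str:
    """Reorder a flat chemical formula into Hill notation."""
    counts = Counter()
    for symbol, number in ELEMENT.findall(formula):
        counts[symbol] += int(number) if number else 1
    symbols = sorted(counts)
    if "C" in counts:
        # With carbon present: C first, then H, then the rest alphabetically.
        rest = [s for s in symbols if s not in ("C", "H")]
        symbols = ["C"] + (["H"] if "H" in counts else []) + rest
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in symbols)


print(to_hill("O2Ni2"))  # Ni2O2
print(to_hill("H4C"))    # CH4
print(to_hill("NaCl"))   # ClNa
```

Running the user's input through such a function before querying would let the index keep a single normalized formula while still accepting other orderings.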