Search suggestions improvement

The currently stored suggestion values are insufficient in some cases:

  • The tokenizer splits words at certain breakpoints, and each part of the split is inserted as a suggestion input. Individual parts are therefore matched correctly, but nothing matches once the user types past a breakpoint: e.g. PBE will match GGA_C_PBE_SOL and GGA_X_PBE_SOL, but PBE_SOL will not match anything. The solution would be to still split at certain tokens, but always store the full remaining string after the split.
  • Suggestions are not returned for all unique values, but only for unique suggestion values. This means that searching for O2 will not return both Mg2O2 and Ni2O2, but only one of them, since O2 is a value that is stored by both suggestions and skip_duplicates=True. This cannot be fixed by simply setting skip_duplicates=False, as this would also return duplicate entries (if there are multiple Mg2O2 entries, Mg2O2 will be included twice in the suggestions). A way around this is to include the prefix with every individual token: e.g. Ni2O2 would normally be split into two tokens, Ni2O2 and O2, but to make the suggestion values unique, we append the prefix after the token, so that the final suggestion tokens will be "Ni2O2" and "O2 Ni2".
  • Formula suggestions (and search) should be improved. There are a few alternative formula definitions (Hill, metal, etc.) that are widely used. Our search index should store only the "normalized" formula, which is the Hill formula, but the suggestions and search should still take other variations of the formula into account. This would be very similar to what we do with units: we only store one version, but allow users to work in another system by doing on-the-fly transformations. To do this, the suggestions should be augmented with formula variants that are returned by the suggestions endpoint. The search interface should then also translate formula searches into the Hill form before executing the search.
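
The three proposals above can be sketched roughly as follows. This is only an illustration, not the actual implementation: the function names (suffix_inputs, formula_inputs, to_hill), the breakpoint pattern, and the simple formula regex (no parentheses, charges or fractional counts) are all assumptions made for the example. The Hill ordering also omits counts of 1, which may differ from the normalization we actually use.

```python
import re


def suffix_inputs(value, breakpoints=r"[_\s]+"):
    """Return the full value plus the full remaining string after each breakpoint.

    The breakpoint pattern is a placeholder; the real tokenizer may split differently.
    """
    inputs = [value]
    for match in re.finditer(breakpoints, value):
        rest = value[match.end():]
        if rest and rest not in inputs:
            inputs.append(rest)
    return inputs


def formula_inputs(formula):
    """Return one suggestion input per element position, made unique by
    appending the skipped prefix after the token (e.g. "O2 Ni2")."""
    parts = re.findall(r"[A-Z][a-z]?\d*", formula)
    inputs = []
    for i in range(len(parts)):
        token = "".join(parts[i:])
        prefix = "".join(parts[:i])
        inputs.append(f"{token} {prefix}" if prefix else token)
    return inputs


def to_hill(formula):
    """Reorder a simple formula into Hill order: C first, then H, then the
    remaining elements alphabetically; fully alphabetical if there is no carbon."""
    counts = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(count or 1)
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] != 1 else "") for e in order)


print(suffix_inputs("GGA_C_PBE_SOL"))
print(formula_inputs("Ni2O2"), formula_inputs("Mg2O2"))
print(to_hill("O2Ni2"))
```

With inputs like these, typing PBE_SOL is a prefix of a stored input, and typing O2 hits the distinct inputs "O2 Ni2" and "O2 Mg2", so skip_duplicates=True can stay enabled without hiding either formula. The to_hill step corresponds to the translation the search interface would apply before executing a formula query.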
Edited Jan 18, 2022 by Lauri Himanen