Using PubChem data
PubChem is an open chemistry database, and we have data in NOMAD that we want to associate with entries on PubChem. Therefore, some normalise functions (e.g. in one of our BaseSections) use the PubChem API. However, since this happens during the processing of individual entries, our servers can end up sending a large number of API calls to PubChem. This is inefficient and also breaks when we hit PubChem's "dynamic request throttling" (https://pubchem.ncbi.nlm.nih.gov/docs/dynamic-request-throttling).
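To illustrate the problem, here is a minimal sketch of how repeated lookups could be reduced with a cache and a backoff on throttling. It is not how the normalise functions currently work. `CachedCompoundLookup`, `ThrottledError`, and the injected `fetch` callable are hypothetical names; `fetch` stands in for whatever code performs the real PubChem request.

```python
import time
from typing import Any, Callable, Dict


class ThrottledError(Exception):
    """Hypothetical error: raised by `fetch` when the remote service throttles us."""


class CachedCompoundLookup:
    """Memoize lookups by name and back off exponentially when throttled."""

    def __init__(
        self,
        fetch: Callable[[str], Dict[str, Any]],
        max_retries: int = 3,
        base_delay: float = 1.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch            # assumption: performs the actual API request
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._sleep = sleep            # injectable so the sketch can be tested offline
        self._cache: Dict[str, Dict[str, Any]] = {}

    def get(self, name: str) -> Dict[str, Any]:
        key = name.strip().lower()
        if key in self._cache:
            # Cache hit: no request leaves our servers.
            return self._cache[key]
        for attempt in range(self._max_retries + 1):
            try:
                result = self._fetch(key)
                break
            except ThrottledError:
                if attempt == self._max_retries:
                    raise
                # Wait 1x, 2x, 4x, ... the base delay before retrying.
                self._sleep(self._base_delay * 2 ** attempt)
        self._cache[key] = result
        return result
```

Note that this cache lives in a single process. Since entries are processed independently, a real solution would need a cache shared across workers, which is part of why the options below look at a separate data source.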
In other cases (e.g. AFLOW prototypes), we simply create a file (or even a dict) in our source code with a clone of the database. However, PubChem has about 100 million entries, and this approach runs into several issues:
- does not fit into source code
- we probably should not ship the data with our images or the pip package
- querying the data is not a trivial problem anymore
We should investigate potential options:
- run an official PubChem clone (pro: ops friendly)
- implement and run our own simplified PubChem clone as a separately deployed service (pro: ops friendly)
- serve PubChem data as an API plugin
- query PubChem data directly from a database (con: does not work for Oasis unless we ship the data)
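As a rough illustration of the last option, the sketch below looks up a compound by InChIKey in a local, indexed table instead of calling the API. Everything here is an assumption made for the example: the `compound` table, its columns, the use of SQLite, and the helper names `build_demo_db` and `find_by_inchikey`. The two rows are only sample data; a real deployment would hold the full PubChem export in a proper database.

```python
import sqlite3
from typing import Any, Dict, Optional


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `compound` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compound ("
        "cid INTEGER PRIMARY KEY, inchikey TEXT NOT NULL, formula TEXT)"
    )
    # An index on the lookup key keeps queries fast even for very large tables.
    conn.execute("CREATE INDEX idx_compound_inchikey ON compound (inchikey)")
    conn.executemany(
        "INSERT INTO compound (cid, inchikey, formula) VALUES (?, ?, ?)",
        [
            (962, "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "H2O"),    # water
            (702, "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "C2H6O"),  # ethanol
        ],
    )
    conn.commit()
    return conn


def find_by_inchikey(
    conn: sqlite3.Connection, inchikey: str
) -> Optional[Dict[str, Any]]:
    """Return the CID and formula for an InChIKey, or None if it is unknown."""
    row = conn.execute(
        "SELECT cid, formula FROM compound WHERE inchikey = ?",
        (inchikey,),
    ).fetchone()
    if row is None:
        return None
    return {"cid": row[0], "formula": row[1]}
```

A normalise function could then call `find_by_inchikey(conn, key)` during processing without any outbound request, at the cost of having to provision and update the data.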
Misc:
- Question: what is the PubChem data license? E.g. can we clone the data, or even distribute the database?