Improved "ArchiveQuery"

The existing ArchiveQuery has some obvious flaws.

  • #682 (closed) describes a failure caused by an HTTP 502 response. This might be unavoidable when the API is under high load. ArchiveQuery should handle it instead of erroring out.
  • #679 (closed) describes a JSON decode error. This should not happen, but evidently can. ArchiveQuery should handle it instead of erroring out. Proper logging should also help to identify the cause (e.g. the specific calculation).
  • #680 (closed) describes that some required things are missing. This should be fixed in v1, which adds all required to the search.
  • The last point is implemented poorly, because references are treated like sub-sections and not followed
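
To make the first two points concrete, here is a minimal sketch of retrying on transient failures and on JSON decode errors while logging which request failed. It is not the existing ArchiveQuery code. `fetch_with_retry`, `TransientError`, the `fetch` callable and the `context` label are hypothetical names. In a real implementation `fetch` would wrap the actual HTTP call and raise `TransientError` on a 502.

```python
import asyncio
import json
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("archive_query_sketch")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. an HTTP 502."""


async def fetch_with_retry(
    fetch: Callable[[], Awaitable[str]],
    context: str,
    retries: int = 3,
    base_delay: float = 0.5,
) -> Any:
    """Call `fetch`, decode its JSON body, and retry on transient failures.

    Returns the decoded payload, or None once all attempts are exhausted.
    """
    for attempt in range(1, retries + 1):
        try:
            body = await fetch()
            return json.loads(body)
        except (TransientError, json.JSONDecodeError) as error:
            # Log which request failed so the cause can be traced later.
            logger.warning(
                "attempt %d/%d failed for %s: %r", attempt, retries, context, error
            )
            if attempt < retries:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    return None
```

Returning None instead of raising is only one option. Collecting the failed requests for a later retry pass would fit the same structure.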

Long-running queries will probably always run into problems. The ArchiveQuery should be reimplemented on the explicit premise that API calls can fail. As a consequence:

  • results should be cached explicitly, locally, and somewhat permanently
  • errors should actually be handled
  • the implementation should be more modern, e.g. with asyncio
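
One possible shape for the explicit, local, and somewhat permanent cache is sketched below, assuming that results are JSON-serializable and that a query can be represented as a plain dict. `DiskCache` and its file layout are hypothetical and not part of the current code base.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Optional


class DiskCache:
    """Hypothetical on-disk cache: one JSON file per query, keyed by a hash."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, query: dict) -> Path:
        # sort_keys makes the key independent of the dict's insertion order.
        key = hashlib.sha256(json.dumps(query, sort_keys=True).encode()).hexdigest()
        return self.directory / f"{key}.json"

    def get(self, query: dict) -> Optional[Any]:
        path = self._path(query)
        if not path.exists():
            return None
        try:
            return json.loads(path.read_text())
        except json.JSONDecodeError:
            # A corrupt cache file is treated as a miss, not as a fatal error.
            return None

    def put(self, query: dict, result: Any) -> None:
        path = self._path(query)
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(result))
        # Write to a temporary file first, then rename it into place.
        tmp.replace(path)
```

With such a cache, an interrupted long-running query could be restarted and would only have to fetch the results that are not on disk yet.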

Steps to take:

  • get familiar with asyncio
  • rework the ArchiveQuery implementation based on httpx + asyncio
  • evaluate how much parallelism (asyncio again) we can use in the archive query API
  • rework the API accordingly
  • discuss the documentation examples with luca/luigi and martin/simon to make them more meaningful (this should finally also address the bugs above)
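
For the parallelism question, a minimal sketch of bounding the number of concurrent requests with `asyncio.Semaphore` is shown below. `fetch_all`, `fetch_one` and `max_parallel` are hypothetical names. In the reworked implementation `fetch_one` would presumably wrap an httpx request, and a suitable value for `max_parallel` is exactly what the evaluation step would have to determine.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def fetch_all(
    entry_ids: List[str],
    fetch_one: Callable[[str], Awaitable[Any]],
    max_parallel: int = 4,
) -> List[Any]:
    """Fetch all entries with at most `max_parallel` requests in flight.

    Results come back in input order. A failed fetch yields the exception
    object in its slot instead of aborting the whole batch.
    """
    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded(entry_id: str) -> Any:
        async with semaphore:
            return await fetch_one(entry_id)

    return await asyncio.gather(
        *(bounded(entry_id) for entry_id in entry_ids),
        return_exceptions=True,
    )
```

Because `return_exceptions=True` keeps failures in the result list, the caller can decide per entry whether to retry, log, or skip.
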
Edited Feb 13, 2022 by Theodore Chang