ArchiveQuery sends too many requests.
The current ArchiveQuery implementation offers a `page_size` parameter. However, this is only used for the search that fetches the entry ids. When the ArchiveQuery is used to download entries, it downloads them one by one with individual requests, issued in parallel. This quickly hits the rate limit and causes lots of 503 errors. Currently it is not usable for the AI toolkit queries.
- The implementation needs to download `page_size` entries in one request, not just one. Maybe we need two `page_size` parameters: for the "fetch" a large page size is ok, for the "download" you might want a smaller one (see the first sketch below).
- Parallel requests must only be issued at a configurable rate, e.g. 10 per second, not faster (see the rate-limiting sketch below).
- The current implementation retries a configurable number of times, but uses no backoff. It should only retry after a few seconds (see the retry sketch below).
- The archive query should be tested with a substantial query during the install tests in CI/CD (see the test sketch below).
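
A minimal sketch of what two page-size parameters could look like on the client. Only `page_size` exists today; `download_page_size`, the example query, and the import path are assumptions for illustration.

```python
# Hypothetical sketch: separate page sizes for the metadata fetch and the
# archive download. `download_page_size` is the proposed parameter; the
# query and required fields are placeholders.
from nomad.client import ArchiveQuery

query = ArchiveQuery(
    query={'results.material.elements': ['Ti', 'O']},
    required={'run': {'calculation': '*'}},
    page_size=1000,           # large pages are fine for fetching entry ids
    download_page_size=100,   # proposed: smaller pages for downloading archives
)
```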
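One way the configurable request rate could be enforced, as a minimal asyncio sketch. The helper name and the idea of passing in prepared coroutines are assumptions, not part of the current client.

```python
# Minimal sketch: start at most `per_second` requests per second, but let
# the started requests run concurrently. The coroutines themselves (e.g.
# the actual HTTP calls) are assumed to be provided by the caller.
import asyncio


async def rate_limited_gather(coros, per_second=10):
    interval = 1.0 / per_second
    tasks = []
    for coro in coros:
        tasks.append(asyncio.create_task(coro))
        await asyncio.sleep(interval)  # throttle how fast new requests are issued
    return await asyncio.gather(*tasks)
```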
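A minimal sketch of retrying only after a delay instead of immediately. The `send` callable, the 503 check, and the constants are assumptions for illustration.

```python
# Minimal sketch: retry a limited number of times, waiting a few seconds
# between attempts and doubling the delay each time.
import asyncio


async def request_with_retry(send, url, retries=3, delay=5.0):
    response = None
    for attempt in range(retries + 1):
        response = await send(url)
        if response.status_code != 503:
            return response
        if attempt < retries:
            # wait before retrying instead of hammering the rate-limited API
            await asyncio.sleep(delay * 2 ** attempt)
    return response
```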
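A possible shape for the install test, assuming a `download(n)`-style interface on ArchiveQuery; the query content, entry count, and assertion are placeholders.

```python
# Sketch of an install/CI test that exercises ArchiveQuery with a
# substantial query and checks that archives actually come back.
from nomad.client import ArchiveQuery


def test_archive_query_substantial_download():
    query = ArchiveQuery(
        query={'results.material.elements': ['Si']},
        required={'run': {'calculation': '*'}},
        page_size=100,
    )
    results = query.download(500)  # download a few hundred entries
    assert len(results) > 0
```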