ArchiveQuery sends too many requests.
The current ArchiveQuery implementation offers a `page_size` parameter. However, this is only used for the search that fetches the entry ids. When the ArchiveQuery is used to download entries, it downloads them one by one with individual requests, issued in parallel. This quickly hits the rate limit and causes lots of 503 errors. Currently it is not usable for the AI toolkit queries.
- The implementation needs to download `page_size` entries in one request, not just one. Maybe we need two `page_size` parameters: for the "fetch" a large page size is ok, for the "download" you might want a smaller one (see the first sketch below).
- Parallel requests must only be issued at a configurable rate, e.g. 10 per second, not faster (see the rate-limiting sketch below).
- The current implementation retries a configurable number of times, but uses no backoff. It should only retry after a few seconds (see the retry sketch below).
- The archive query should be tested with a substantial query during the install tests in CI/CD (see the test sketch below).
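
A minimal sketch of what two page-size parameters could look like on the client. Only `page_size` exists today; `download_page_size`, the example query, and the import path are assumptions for illustration.

```python
# Hypothetical sketch: separate page sizes for the metadata fetch and the
# archive download. `download_page_size` is the proposed parameter; the
# query and required fields are placeholders.
from nomad.client import ArchiveQuery

query = ArchiveQuery(
    query={'results.material.elements': ['Ti', 'O']},
    required={'run': {'calculation': '*'}},
    page_size=1000,           # large pages are fine for fetching entry ids
    download_page_size=100,   # proposed: smaller pages for downloading archives
)
```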
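One way the configurable request rate could be enforced, as a minimal asyncio sketch. The helper name and the idea of passing in prepared coroutines are assumptions, not part of the current client.

```python
# Minimal sketch: start at most `per_second` requests per second, but let
# the started requests run concurrently. The coroutines themselves (e.g.
# the actual HTTP calls) are assumed to be provided by the caller.
import asyncio


async def rate_limited_gather(coros, per_second=10):
    interval = 1.0 / per_second
    tasks = []
    for coro in coros:
        tasks.append(asyncio.create_task(coro))
        await asyncio.sleep(interval)  # throttle how fast new requests are issued
    return await asyncio.gather(*tasks)
```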
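A minimal sketch of retrying only after a delay instead of immediately. The `send` callable, the 503 check, and the constants are assumptions for illustration.

```python
# Minimal sketch: retry a limited number of times, waiting a few seconds
# between attempts and doubling the delay each time.
import asyncio


async def request_with_retry(send, url, retries=3, delay=5.0):
    response = None
    for attempt in range(retries + 1):
        response = await send(url)
        if response.status_code != 503:
            return response
        if attempt < retries:
            # wait before retrying instead of hammering the rate-limited API
            await asyncio.sleep(delay * 2 ** attempt)
    return response
```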
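A possible shape for the install test, assuming a `download(n)`-style interface on ArchiveQuery; the query content, entry count, and assertion are placeholders.

```python
# Sketch of an install/CI test that exercises ArchiveQuery with a
# substantial query and checks that archives actually come back.
from nomad.client import ArchiveQuery


def test_archive_query_substantial_download():
    query = ArchiveQuery(
        query={'results.material.elements': ['Si']},
        required={'run': {'calculation': '*'}},
        page_size=100,
    )
    results = query.download(500)  # download a few hundred entries
    assert len(results) > 0
```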