Optimise clean up and match all activities in processing

  1. Optimise matching (68c728d8)
    • Parallelized file matching using ThreadPoolExecutor to leverage multiple CPU cores.
    • Implemented in-memory caching of existing entries for O(1) lookup during matching.
    • Optimized reset_entry_processing_status using direct MongoDB queries instead of Python-side iteration.
    • Consolidated status reset logic into the matching phase, removing redundant activity calls.

Overall, a 2-4x improvement in matching speed.

Locally:

  • For an upload with 10k tiny entries, the time drops from 16s to 4s.
  • For the perovskite dataset with 42k entries, the time drops from 7 mins to 3 mins.

On test deployment:

  • For the perovskite dataset with 42k entries, the time drops from 31 mins to 11 mins. (Files served via network storage tend to be relatively slow, so the speed-up from threads is very significant here.)
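The matching optimisations above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `match_one`, the cache layout, and the worker count are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def match_files(upload_files, existing_entries, match_one, max_workers=8):
    """Match files against existing entries in parallel.

    `existing_entries` is loaded once into a dict for O(1) lookups
    during matching, instead of querying the database per file.
    (Hypothetical sketch; names do not mirror the real code.)
    """
    # In-memory cache: entry_id -> entry, built once before matching.
    cache = {entry['entry_id']: entry for entry in existing_entries}

    def match(path):
        # `match_one` decides how `path` matches against the cached
        # entries; the cache avoids a per-file database round trip.
        return match_one(path, cache)

    # Threads help because matching is I/O bound (reading files,
    # especially over network storage), so multiple cores and
    # overlapping I/O both contribute to the speed-up.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(match, upload_files))
```

`pool.map` preserves input order, so results line up with `upload_files` even though matching runs concurrently.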
  1. Split clean up activity into batches (3a254a8a)
    • Refactored cleanup/indexing into a batched Temporal workflow to prevent memory/CPU spikes and timeouts on large uploads.
    • Improved reliability against Elasticsearch 429/503 errors by using smaller, retryable batches.
    • Added an inline fast path for small uploads (under 100 entries) to reduce orchestration overhead.
    • For an upload with 10k tiny entries, the time drops from 2 mins to 30s.
    • For the perovskite dataset with 42k entries, the original implementation times out (and probably also kills the Python process); now clean up finishes in 15 mins.
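The batching logic can be sketched as below. This is a simplified stand-in: in the real change each batch would run as a separate Temporal activity so it can retry independently; the function names, batch size, and inline threshold here are assumptions.

```python
def iter_batches(entry_ids, batch_size=1000):
    """Yield fixed-size batches so each cleanup step stays small,
    bounding memory/CPU use and keeping individual retries cheap."""
    for i in range(0, len(entry_ids), batch_size):
        yield entry_ids[i:i + batch_size]


def run_cleanup(entry_ids, cleanup_batch, inline_threshold=100):
    """Hypothetical orchestration: small uploads take an inline fast
    path with no batching overhead; large uploads are processed batch
    by batch. Returns the number of cleanup calls made."""
    if len(entry_ids) <= inline_threshold:
        # Fast path: a single inline call for small uploads.
        cleanup_batch(entry_ids)
        return 1
    calls = 0
    for batch in iter_batches(entry_ids):
        cleanup_batch(batch)
        calls += 1
    return calls
```

Smaller batches also mean that a transient Elasticsearch 429/503 error only forces a retry of one batch, not the whole cleanup.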
  1. Retry elastic index errors (487497f0)
    • Added exponential backoff and jitter for Elasticsearch bulk indexing retries on 429 and 503 errors.
    • Introduced new configuration parameters for fine-tuning bulk retry behavior (attempts, backoff intervals, and jitter).
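The retry behaviour can be illustrated with the sketch below. The error type, parameter names, and the bulk callable are hypothetical placeholders for the real Elasticsearch bulk helper and the configuration parameters this MR introduces.

```python
import random
import time

RETRYABLE_STATUSES = {429, 503}


class TransientIndexError(Exception):
    """Stand-in for an Elasticsearch error carrying an HTTP status."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def bulk_index_with_retry(bulk_fn, actions, max_attempts=5,
                          backoff_base=0.5, backoff_cap=30.0):
    """Retry a bulk indexing call on 429/503 with exponential backoff
    and full jitter. Non-retryable errors, or exhausting the attempt
    budget, re-raise the last error."""
    for attempt in range(max_attempts):
        try:
            return bulk_fn(actions)
        except TransientIndexError as error:
            if error.status not in RETRYABLE_STATUSES or attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential
            # bound, so concurrent retries do not hammer the cluster
            # in lockstep.
            delay = random.uniform(0, min(backoff_cap, backoff_base * 2 ** attempt))
            time.sleep(delay)
```

Jitter matters here because many batched cleanup activities may hit a 429 at the same moment; randomised delays spread their retries out instead of synchronising them.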
Edited by Ahmed Ilyas
