Optimize ArchiveReader read disk performance
The current implementation that uses blocks receives the entry id from external and uses binary search to search blocks to locate if the target entry is in the archive.
The _load_toc_block() method performs the following steps.
- compute the block number of the target entry
- load the corresponding block from the disk
- unpack entries and populate then into the self._toc dict
Consider an archive of n entries and, for simplicity, assume 1 entry per block so that entry and block are interchangable in the following context.
When search for each entry, the binary search requires O(log(n)) operations with O(log(n)) disk reads. Assume to read all n entries, then O(n*log(n)) disk reads are required. However, O(n) disk reads are required to read all entries.
To fix:
Maintain an internal list to record if the block has been read. If so, skip reading the block from disk. Reduce disk opeartion from O(n*log(n)) to O(n).