Support parsers creating multiple entries from the same file

Closes #761 (closed)

An entry (associated with some mainfile) can now have child entries. To support this, we introduce a new entry-level field: mainfile_key. Main entries have mainfile_key == None. Child entries have the same mainfile value as their main (parent) entry, plus a non-empty string as mainfile_key. Note, however, that most parsers will only produce a main entry without any child entries.

Both the main and the child entries are full-fledged entries, i.e. they are distinct objects in MongoDB and Elasticsearch, and each has its own archive file, its own metadata, and so on.

The combination (upload_id, mainfile, mainfile_key) uniquely identifies any entry. For every child entry, a main entry must exist (i.e. an entry with the same upload_id and mainfile, but with mainfile_key == None).
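
The identification rule and the "every child needs a main entry" constraint can be illustrated with a minimal sketch. The EntryKey class and the orphaned_children helper are hypothetical names made up for this illustration; they are not taken from the code base, and only the three field names come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class EntryKey:
    """Hypothetical identifier for an entry (illustration only)."""
    upload_id: str
    mainfile: str
    mainfile_key: Optional[str] = None  # None marks the main entry

    @property
    def is_child(self) -> bool:
        return self.mainfile_key is not None


def orphaned_children(keys: Iterable[EntryKey]) -> List[EntryKey]:
    """Return the child keys whose main entry is missing from `keys`."""
    keys = list(keys)
    # A main entry is identified by (upload_id, mainfile) with mainfile_key == None.
    mains = {(k.upload_id, k.mainfile) for k in keys if not k.is_child}
    return [
        k for k in keys
        if k.is_child and (k.upload_id, k.mainfile) not in mains
    ]
```

Because the dataclass is frozen, instances are hashable and compare by value, so two entries are the same exactly when all three fields match.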

Parsers signal that they want to create child entries by returning a set of keys, one for each child, from the is_mainfile function, instead of the boolean it has returned until now. Parsers that don't want to create any child entries can just return True, as before.
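
As a rough sketch of this contract, the example below uses a deliberately simplified is_mainfile signature (filename and decoded content only), which is an assumption and not the real parser interface. The ".multi" extension, the "SYSTEM <name>" line format, and the entry_keys helper are likewise invented for illustration.

```python
from typing import List, Optional, Set, Union


def is_mainfile(filename: str, content: str) -> Union[bool, Set[str]]:
    """Hypothetical matcher: False = no match, True = main entry only,
    set of keys = main entry plus one child entry per key."""
    if not filename.endswith(".multi"):  # made-up file format
        return False
    keys = set()
    for line in content.splitlines():
        parts = line.split()
        # Assumed format: each "SYSTEM <name>" line describes one child.
        if len(parts) >= 2 and parts[0] == "SYSTEM":
            keys.add(parts[1])
    return keys if keys else True


def entry_keys(match_result: Union[bool, Set[str]]) -> List[Optional[str]]:
    """Translate the return value into the mainfile_key values to create."""
    if isinstance(match_result, bool):
        return [None] if match_result else []
    # The main entry (None) always exists alongside its children.
    return [None] + sorted(match_result)
```

The caller has to distinguish a boolean from a collection explicitly; a plain truthiness check would not be enough to tell True apart from a non-empty set of keys.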

The parse function is called only once, for the main entry, and we pass an additional argument to it: child_archives, a dictionary of the format {mainfile_key: child_archive}.
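
The single parse call might look roughly like the sketch below. Archives are modelled as plain dictionaries here, which is an assumption made to keep the example self-contained; the real archive objects and the real parse signature may differ, and the "source" field is invented for illustration.

```python
from typing import Any, Dict, Optional

Archive = Dict[str, Any]  # stand-in for the real archive object (assumption)


def parse(
    mainfile: str,
    archive: Archive,
    child_archives: Optional[Dict[str, Archive]] = None,
) -> None:
    """Hypothetical parser: fills the main archive and every child archive
    in a single call."""
    archive["source"] = mainfile
    # child_archives maps mainfile_key -> child archive; it is absent or
    # empty for parsers that produce only a main entry.
    for key, child in (child_archives or {}).items():
        child["source"] = mainfile
        child["mainfile_key"] = key


# The caller prepares one empty child archive per key returned by is_mainfile.
main_archive: Archive = {}
children: Dict[str, Archive] = {key: {} for key in ("a", "b")}
parse("run.multi", main_archive, children)
```

Making child_archives optional keeps existing parsers, which only know about the main archive, callable without changes.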

Edited by David Sikter
