Provide features for data "curation"
Rumor has it that there are a lot of inconsistencies in the data. We need tools to find those.
Especially for finding duplicates in the raw upload data; apparently there are duplicates in the aflow lib imports.
There are different angles on how to do this (see the sketch after this list):
- hashes on "uploads" (resonable? Does this make any sense for (aflow) imports where zips are aggregation of multiple calcs?)
- hashes on mainfiles (is this enough, given minor changes in source files that might be introduced during aflow export, etc.?)
- hashes on metadata (what's the right granularity?)
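
To make the three angles concrete, here is a minimal Python sketch of how such hashes could be computed. Everything in it is an assumption for illustration: the on-disk directory layout, the metadata dict, and the chosen metadata keys are hypothetical placeholders, not existing APIs.

```python
# A minimal sketch of the three hashing angles, assuming uploads are
# directories of files on disk and metadata is available as a plain dict.
# All names and metadata keys below are hypothetical.
import hashlib
import json
from pathlib import Path


def hash_file(path: Path) -> str:
    """Hash a single file's contents (e.g. a mainfile) in chunks."""
    h = hashlib.sha256()
    with path.open('rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()


def hash_upload(upload_dir: Path) -> str:
    """Hash all files of an upload in a stable (sorted) order, so the same
    set of calcs yields the same hash regardless of how a zip aggregated
    them."""
    h = hashlib.sha256()
    for path in sorted(upload_dir.rglob('*')):
        if path.is_file():
            h.update(path.relative_to(upload_dir).as_posix().encode())
            h.update(hash_file(path).encode())
    return h.hexdigest()


def hash_metadata(metadata: dict,
                  keys=('formula', 'code_name', 'total_energy')) -> str:
    """Hash a chosen subset of metadata; the keys define the granularity
    and are placeholders here."""
    selected = {k: metadata.get(k) for k in keys}
    payload = json.dumps(selected, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()
```

Duplicate detection would then just group entries by hash value. Sorting the file paths makes the upload hash independent of archive ordering, but note that the mainfile and upload hashes would still change with any byte-level difference, which is exactly the open question about minor changes introduced during aflow export.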