I've done some thinking about how this would actually look in the future. I imagine we will at some point want an export dialog in the GUI, on the upload overview page. There it should be possible to specify export options and choose a destination. The destination could be a zip file for download, or another repo. Then I imagine there should also be an import button on the uploads page, for uploading and importing bundle files.
I think a first step could be to implement a very rudimentary export/import dialog (only for published uploads). Basic bundle functionality is already there and should work (as long as the upload doesn't contain custom schemas), so it should be simple to do, and I think it will make testing a lot easier. The dialogs could have some warning text to alert the user that this is under development, or we could have a config switch for enabling them if we want to hide them for now. What do you think, @mscheidg?
This is not just about putting an interface on the existing bundle. We need to change the scope from just "reproducing an upload on another NOMAD installation" to having a format for self-contained data.
Export/import is another good use-case for the bundle. I guess this should also be put into the CLI.
I disagree about the bundle; the current bundle only works for very simple cases and these are not the typical FAIRmat cases.
we cannot rely on the built-in schema, as installations will run different versions. All this hash stuff was added so that we do not have to rely on all details of the built-in schema. Also, custom schemas are important.
we need to support references that point outside the bundle
we need configurability for the contents of a bundle
Going for interface features too early might impede the necessary improvements on the bundle. I would suggest to start from scratch on the bundle. Also, code-wise, this should get its own module. It should be implemented and tested independently of the processing or any API/UI interfaces.
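To make the "own module" point concrete, here is a rough sketch of what such a standalone bundle module could look like. All names are placeholders (not existing code); the point is just that export/import logic lives outside processing and the API/UI, and can be unit tested on its own:

```python
# Hypothetical sketch only: a self-contained bundle module that can be tested
# without the processing system or any API/UI code. All names are placeholders.
import json
import zipfile


class BundleExporter:
    def __init__(self, upload_id: str, include_raw_files: bool = True,
                 include_archive_files: bool = True):
        self.upload_id = upload_id
        self.include_raw_files = include_raw_files
        self.include_archive_files = include_archive_files

    def export(self, path: str) -> None:
        '''Write the bundle for self.upload_id to a zip file at path.'''
        with zipfile.ZipFile(path, 'w') as bundle:
            bundle.writestr('bundle.json', json.dumps(self._bundle_metadata()))
            # ... add raw/archive files here, depending on the include_* flags

    def _bundle_metadata(self) -> dict:
        # collect upload metadata, used schemas, external references, ...
        raise NotImplementedError


class BundleImporter:
    def import_bundle(self, path: str) -> str:
        '''Read a bundle zip and create the corresponding upload; return its id.'''
        raise NotImplementedError
```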
Well, the old strategy was to reprocess the raw files, so in that sense it works, even if the system schema is different. I was not after any fancy gui for the export and import at this point, but even having just an import and export button, without any settings, would be convenient for dev purposes. It would make it much easier to play around with the system.
All in all, I'd say it's not that big a difference from what we already have implemented: it's mostly a matter of changing the file structure of the bundle and adding some more data. I really don't see why we would start completely from scratch; I think much of the code can be reused. But we need to sit down and agree on exactly how the new bundle should look.
Also, I'm not so sure what you mean by adding "a directory of all external references"?
Maybe from "scratch" is the wrong wording. Of course we can re-use what is there. But the bundle has to support more to fit FAIRmat. We cannot reprocess because we might not transfer the raw files and we cannot rely on the schema being the same everywhere.
For development, calling import/export from the CLI should be the easiest to implement interface.
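As a sketch of how thin that could be (assuming the click-based nomad CLI; the command names, options and the exporter/importer classes are made up for illustration):

```python
# Dev-only CLI sketch, assuming click. Command names, options and the
# BundleExporter/BundleImporter classes are placeholders, not existing commands.
import click


@click.group()
def bundle():
    '''Export/import upload bundles (development convenience).'''


@bundle.command()
@click.argument('upload_id')
@click.option('--out', '-o', default=None, help='Target zip file (default: <upload_id>.zip).')
def export(upload_id, out):
    out = out or f'{upload_id}.zip'
    BundleExporter(upload_id).export(out)  # hypothetical class, see sketch above
    click.echo(f'wrote {out}')


@bundle.command(name='import')
@click.argument('path')
def import_bundle(path):
    upload_id = BundleImporter().import_bundle(path)  # hypothetical class
    click.echo(f'imported upload {upload_id}')
```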
Either we rewrite the archives, replacing the references, or we add a directory/map of all references that allows translating them on the fly. Anyhow, when an upload moves between installations, references to other uploads have to be handled somehow.
I think we can add the installation into the metadata, or somewhere the context can reach, so that when resolving /uploads/.. type references it is possible to check against the archive's installation: if it matches the local oasis url the reference is local, otherwise it is rewritten as a full url, which is then fetched from that installation.
That way there is no need to find all references in the archives being exported. Would that be the minimum-effort approach?
I also don't think we need to replace any references in the archive files. It should be enough if we keep track of what the original source repo url is (and I think we should keep track of this anyway). Then, when we encounter an installation-relative url in an archive that has been imported from somewhere else, we just need to make sure we resolve it with respect to the original repo rather than the current one (except for references to schemas, since those are expected to be transferred with the archive and should therefore also be fetchable from the current repo). Since repo urls will most probably change every now and then and require translation anyway, I think it makes sense to avoid replacing links. Rewriting the archives on the fly when creating a bundle would be quite a lot of extra work and would make the process slower.
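To illustrate what I mean, a simplified sketch (the metadata field name is just an assumption, and the schema exception discussed above is left out):

```python
# Sketch of resolving installation-relative references against the upload's
# source installation. 'source_installation_url' is a hypothetical metadata
# field that would be written when a bundle is imported.
from urllib.parse import urljoin


def resolve_reference(ref: str, upload_metadata: dict, local_api_url: str) -> str:
    '''Turn a reference like '/uploads/<id>/archive/<entry_id>#/...' into a
    fetchable url. If the upload was imported from another installation,
    resolve against that installation instead of the local one.'''
    if ref.startswith(('http://', 'https://')):
        return ref  # already absolute, nothing to do

    source_url = upload_metadata.get('source_installation_url')  # hypothetical field
    base = source_url if source_url and source_url != local_api_url else local_api_url
    return urljoin(base.rstrip('/') + '/', ref.lstrip('/'))
```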
I've read up on these standards now. The BagIt standard seems simple and makes good sense to use. The current metadata file, bundle.json, could probably also be replaced by one that is compliant with the RO-Crate format. It would mean a bit of overhead, and since the standard is very flexible, an RO-Crate compatible layout could be implemented in many different ways. But we also wanted the bundle to be similar to how we store data in the file system. There are many ways to go about this, so I think we need to have a discussion and decide exactly how the bundle and the files on disk should look in the future, so we're on the same page.
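For reference, roughly how a BagIt-style bundle could be laid out. The top-level files are what the BagIt spec prescribes; everything under data/ is just one possible guess at the payload layout:

```
<upload_id>.bundle/
├── bagit.txt                   # BagIt declaration (version + encoding)
├── bag-info.txt                # bag-level metadata, roughly what bundle.json holds today
├── manifest-sha256.txt         # checksums for everything under data/
└── data/                       # payload: close to the upload's on-disk layout
    ├── ro-crate-metadata.json  # optional RO-Crate description (placement is a guess)
    ├── raw/...
    └── archive/...
```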
Different thread. @mnakh ran into problems with #1020 (closed) (the new search dialog for references), where we want to be able to find inheriting sections. He's using the list of sections in metadata/sections, but it only contains the actually used sections, not their parent sections. If it also contained the parent sections, we think it could be used to find inheriting sections via a query. For this issue with the new bundle creation, it would also be useful if there were a quick way to get the list of all schemas an entry uses, without needing to open and parse anything (but then I'd want to have the hashes as well). So @mscheidg, what do you think about extending metadata/sections to include the parents, and possibly also the hashes? Is that a possibility?
It would solve finding inheriting sections in an efficient way, so there is no need to traverse through all "nomad.datamodel.datamodel.EntryArchive" entries. It would solve not only the inheriting sections, but also issue #1160 (closed): getting everything derived from the general class "nomad.datamodel.data.ArchiveSection".
@thchang what do you think? Is this a change that could be implemented easily/soonishly? Just extending the list to also include the base classes of the used definitions (recursively), and to include the hashes when the config option write_definition_id_to_archive is set?
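For illustration, what I'm picturing is roughly the following. The attribute/method names (qualified_name, base_sections, definition_id) are best guesses and would need to be adapted to what the metainfo actually exposes:

```python
# Sketch of the proposed extension: collect the qualified names of all used
# section definitions plus their base sections, recursively, optionally with
# definition hashes. Metainfo attribute names here are assumptions.
def all_section_names(used_definitions, with_ids=False):
    result = {}

    def add(definition):
        name = definition.qualified_name()
        if name in result:
            return  # already visited, avoids repeating shared base classes
        result[name] = definition.definition_id if with_ids else None
        # recurse into parents, e.g. down to nomad.datamodel.data.ArchiveSection
        for base in getattr(definition, 'base_sections', None) or []:
            add(base)

    for definition in used_definitions:
        add(definition)

    return result if with_ids else sorted(result)
```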