From 90002052e7f0f12f8129cabebe6aadac965e93d3 Mon Sep 17 00:00:00 2001
From: Markus Scheidgen <markus.scheidgen@gmail.com>
Date: Fri, 22 Jan 2021 08:29:36 +0100
Subject: [PATCH] Added more information about uploading data to the docs.

---
 docs/index.rst  |   2 +-
 docs/upload.md  | 166 ++++++++++++++++++++++++++++++++++++++++++++++++
 docs/upload.rst |  53 ----------------
 3 files changed, 167 insertions(+), 54 deletions(-)
 create mode 100644 docs/upload.md
 delete mode 100644 docs/upload.rst

diff --git a/docs/index.rst b/docs/index.rst
index 96c39c3b9b..81f9ec561f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -8,7 +8,7 @@ and infrastructure with a simplyfied architecture and consolidated code base.
    :maxdepth: 1
 
    introduction.md
-   upload.rst
+   upload.md
    api.md
    client/client.rst
    metainfo.rst
diff --git a/docs/upload.md b/docs/upload.md
new file mode 100644
index 0000000000..33051ec44f
--- /dev/null
+++ b/docs/upload.md
@@ -0,0 +1,166 @@
+# How to upload data
+
+To contribute your data to the repository, please log in to our [upload page](https://nomad-lab.eu/prod/rae/gui/uploads)
+(you need to register first if you do not have a NOMAD account yet).
+
+*A note for returning NOMAD users!* We revised the upload process with browser-based upload
+alongside new shell commands. The new Upload page allows you to monitor upload processing
+and verify processing results before publishing your data to the Repository.
+
+The [upload page](https://nomad-lab.eu/prod/rae/gui/uploads) acts as a staging area for your data. It allows you to
+upload data, to supervise the processing of your data, and to examine all metadata that
+NOMAD extracts from your uploads. The data on the upload page will be private and can be
+deleted again. If you are satisfied with our processing, you can publish the data.
+Only then will the data become publicly available, and it can no longer be deleted.
+You will always be able to access, share, and download your data. You may curate your data
+and create datasets to give them a hierarchical structure. These functions are available
+from the Your data page by selecting and editing data.
+
+You should upload many files at the same time by creating .zip or .tar files of your folder structures.
+Ideally, input and output files are accompanied by relevant auxiliary files. NOMAD will
+consider everything within a single directory as related.
+
+Once published, data cannot be erased. Linking a corrected version to a corresponding older
+one ("erratum") will be possible soon. Files from an improved calculation, even for the
+same material, will be handled as a new entry.
+
+You can publish data as being open access or restricted for up to three years (with embargo).
+For the latter you may choose with whom you want to share your data. We strongly support the
+idea of open access and thus suggest imposing as few restrictions as possible from the very
+beginning. In case of open access data, all uploaded files are downloadable by any user.
+Additional information, e.g. pointing to publications or how your data should be cited,
+can be provided after the upload. DOIs can also be requested. The restriction on data
+can be lifted at any time. You cannot restrict data that was published as open access.
+
+Unless published without an embargo, all your information will be private and only visible
+to you (or NOMAD users you explicitly shared your data with). Viewing private data will
+always require a login.
+
+By uploading, you confirm authorship of the uploaded calculations. Co-authors must be specified
+after the upload process. This procedure is very much analogous to the submission of a
+publication to a scientific journal.
+
+Upload of data is free of charge.
+
+## Limits
+
+The following limitations apply to uploading:
+
+- One upload cannot exceed 32 GB in size
+- Only 10 unpublished uploads are allowed per user
+
+## On the supported codes
+
+NOMAD interprets your files. It checks each file and recognizes whether it is the
+main output file of one of the supported codes. NOMAD creates an entry for this *mainfile*
+that represents the respective data of this code run, experiment, etc. NOMAD only
+shows data for such recognized entries. If your upload does not contain any files that
+NOMAD recognizes, it will be shown as empty and no data can be published.
+
+However, all files that are associated with a recognized *mainfile* by residing in the
+same directory will be presented as *auxiliary* files alongside the entry represented
+by the *mainfile*.
+
+### A note for VASP users
+
+On the handling of **POTCAR** files: NOMAD takes care of it; you don't
+need to worry about it. We understand that according to your VASP license, POTCAR files are
+not supposed to be visible to the public. Thus, in agreement with Georg Kresse, NOMAD will
+extract the most important information of POTCAR files and store it in the files named
+`POTCAR.stripped`. These files can be accessed and downloaded by anyone, while the original
+POTCAR files are only available to the uploader and assigned co-authors.
+This is done automatically; you don't need to do anything.
+
+## Preparing an upload file
+
+You can upload .zip and .tar.gz files to NOMAD. The directory structure within can
+be arbitrary. Keep in mind that files in a single directory are all associated (see above).
+Ideally, keep only the files of a single code run, experiment, etc. (or of closely related ones)
+in one directory.
+
+You should not place files in additional archives within the upload file. NOMAD will not
+extract nested archives, such as zips within zips.
+
+## Uploading large amounts of data
+
+This problem is manifold. The remainder of this section discusses the following topics:
+
+- NOMAD restrictions about upload size and number of unpublished simultaneous uploads
+- Managing metadata (comments, references, co-authors, datasets) for a large number of entries
+- Safely transferring the data to NOMAD
+
+### General strategy
+
+Before you attempt to upload large amounts of data, do some experiments with a representative
+and small subset of your data. Use this to simulate a larger upload,
+checking and editing it the normal way. You do not have to publish this test upload;
+simply delete it instead of publishing it once you are satisfied with the results.
+
+Ask for assistance. [Contact us](https://nomad-lab.eu/about/contact) in advance. This will
+allow us to react to your specific situation and, if necessary, prepare additional measures.
+
+Allow enough time before you need your data to be published. Adding several hundred
+GB to NOMAD isn't a trivial feat and will take some time and effort from all sides.
+
+### Upload restrictions
+
+The upload restrictions are necessary to keep NOMAD data in manageable chunks and we cannot
+simply grant exceptions to these rules.
+
+This means you have to split your data into 32 GB uploads. Uploading these files, observing
+the processing, and publishing the data can be automated through NOMAD APIs.
+
+When splitting your data, it is important to not split sub-directories if they represent
+all files of a single entry. NOMAD can only bundle those related files to an entry if
+they are part of the same upload (and directory). Therefore, there is no single recipe to
+follow and a script to split your data will depend on how your data is organized.
+
+### Avoid additional operations on your data
+
+Changing the metadata of a large number of entries can be expensive and will also mean
+more work with our APIs. A simpler solution is to add the metadata directly to your uploads.
+This way, NOMAD can pick it up automatically and no further action is required.
+
+Each NOMAD upload can contain a `nomad.json` file at the root. This file can contain
+metadata that you want to apply to all your entries. Here is an example:
+
+```
+{
+    "comment": "Data from a cool research project",
+    "references": ["http://archivex.org/mypaper"],
+    "co_authors": [
+        "<co-author-ids>",
+        "<co-author-ids>"
+    ],
+    "datasets": [
+        "<dataset-id>"
+    ],
+    "entries": {
+        "path/to/calcs/vasp.xml": {
+            "comment": "An entry-specific comment."
+        }
+    }
+}
+```
+
+Another measure is to directly publish your data upon upload. After performing some
+smaller test uploads, you should consider skipping our staging and publishing the upload
+right away. This can save you some time and additional API calls. The upload endpoint
+has a parameter `publish_directly`. You can modify the upload command
+that you get from the upload page like this:
+
+```
+curl "http://nomad-lab.eu/prod/rae/api/uploads/?token=<your-token>&publish_directly=true" -T <local_file>
+```
+
+### Safe transfer of files
+
+HTTP makes it easy for you to upload files via browser and curl, but it is not an
+ideal protocol for the stable transfer of large or numerous files. Alternatively, we can organize
+a separate manual file transfer to our servers. We will put your prepared upload
+files (.zip or .tar.gz) on a predefined path on the NOMAD servers. NOMAD allows you to *"upload"*
+files directly from its servers via an additional `local_path` parameter:
+
+```
+curl -X PUT "http://nomad-lab.eu/prod/rae/api/uploads/?token=<your-token>&local_path=<path-to-upload-file>"
+```
diff --git a/docs/upload.rst b/docs/upload.rst
deleted file mode 100644
index 1b4b931c13..0000000000
--- a/docs/upload.rst
+++ /dev/null
@@ -1,53 +0,0 @@
-==================
-How to upload data
-==================
-
-To contribute your data to the repository, please, login to our `upload page <../gui/uploads>`_
-(you need to register first, if you do not have a NOMAD account yet).
-
-*A note for returning NOMAD users!* We revised the upload process with browser based upload
-alongside new shell commands. The new Upload page allows you to monitor upload processing
-and verify processing results before publishing your data to the Repository.
-
-The `upload page <../gui/uploads>`_ acts as a staging area for your data. It allows you to
-upload data, to supervise the processing of your data, and to examine all metadata that
-NOMAD extracts from your uploads. The data on the upload page will be private and can be
-deleted again. If you are satisfied with our processing, you can publish the data.
-Only then, data will become publicly available and cannot be deleted anymore.
-You will always be able to access, share, and download your data. You may curate your data
-and create datasets to give them a hierarchical structure. These functions are available
-from the Your data page by selecting and editing data.
-
-You should upload many files at the same time by creating .zip or .tar files of your folder structures.
-Ideally, input and output files are accompanied by relevant auxiliary files. NOMAD will
-consider everything within a single directory as related.
-
-**A note for VASP users** on the handling of **POTCAR** files: NOMAD takes care of it; you don't
-need to worry about it. We understand that according to your VASP license, POTCAR files are
-not supposed to be visible to the public. Thus, in agreement with Georg Kresse, NOMAD will
-extract the most important information of POTCAR files and store it in the files named
-``POTCAR.stripped``. These files can be assessed and downloaded by anyone, while the original
-POTCAR files are only available to the uploader and assigned co-authors.
-This is done automatically; you don't need to do anything.
-
-Once published, data cannot be erased. Linking a corrected version to a corresponding older
-one ("erratum") will be possible soon. Files from an improved calculation, even for the
-same material, will be handled as a new entry.
-
-You can publish data as being open access or restricted for up to three years (with embargo).
-For the latter you may choose with whom you want to share your data. We strongly support the
-idea of open access and thus suggest to impose as few restrictions as possible from the very
-beginning. In case of open access data, all uploaded files are downloadable by any user.
-Additional information, e.g. pointing to publications or how your data should be cited,
-can be provided after the upload. Also DOIs can be requested. The restriction on data
-can be lifted at any time. You cannot restrict data that was published as open access.
-
-Unless published without an embargo, all your information will be private and only visible
-to you (or NOMAD users you explicitly shared your data with). Viewing private data will
-always require a login.
-
-By uploading you confirm authorship of the uploaded calculations. Co-authors must be specified
-after the upload process. This procedure is very much analogous to the submission of a
-publication to a scientific journal.
-
-Upload of data is free of charge.
\ No newline at end of file
-- 
GitLab