From 19c408bfea09c865a90752a4866e966c70818cad Mon Sep 17 00:00:00 2001
From: Markus Scheidgen <markus.scheidgen@gmail.com>
Date: Tue, 19 Mar 2024 18:15:28 +0100
Subject: [PATCH] Added documentation for downloading files and data with curl.

Changelog: Added
---
 docs/howto/programmatic/api.md      |  33 +++++-
 docs/howto/programmatic/download.md | 158 ++++++++++++++++++++++++++++
 mkdocs.yml                          |   1 +
 nomad/mkdocs.py                     |   2 +-
 4 files changed, 192 insertions(+), 2 deletions(-)
 create mode 100644 docs/howto/programmatic/download.md

diff --git a/docs/howto/programmatic/api.md b/docs/howto/programmatic/api.md
index e74317bdcf..8cc57926fc 100644
--- a/docs/howto/programmatic/api.md
+++ b/docs/howto/programmatic/api.md
@@ -451,4 +451,35 @@ response = requests.post(
 
 {{ doc_snippet('archive-required') }}
 
-{{ metainfo_data() }}
\ No newline at end of file
+{{ metainfo_data() }}
+
+## Limits
+
+The API allows you to send many requests in parallel and thus to put a lot of load
+on NOMAD servers. Since this can accidentally or deliberately reduce the service
+quality for others, we have to enforce a few limits.
+
+- *rate limit*: you can only run a certain number of requests at the same time
+- *rate limit*: you can only run a certain number of requests per second
+- *api limit*: many API endpoints will enforce a maximum page size
+
+If you get responses with the HTTP code **503 Service Unavailable**, you are hitting
+a rate limit and cannot use the service until you are back within our limits. Consider
+sending fewer requests over a longer time frame.
+
+Rate limits are enforced based on your IP address. Please note that when you and your
+colleagues share a single external IP from within a local network, e.g.
+via [NAT](https://en.wikipedia.org/wiki/Network_address_translation),
+you also share the rate limits.
+Depending on the NOMAD installation, these limits can be as low as 30 requests per second
+or 10 concurrent requests.
+
+Consider using endpoints that allow you to retrieve full
+pages of resources, instead of endpoints that force you to access resources one at a time.
+See also the sections on [types of data](#different-kinds-of-data) and [pagination](#pagination).
+
+However, pagination also has its limits and you might ask for pages that are too large.
+If you get responses in the 400 range, e.g. **422 Unprocessable Content** or **400 Bad Request**,
+you have probably hit an api limit. Those responses are typically accompanied by an error message
+in the response body that will inform you about the limit, e.g. the maximum allowed
+page size.
\ No newline at end of file
diff --git a/docs/howto/programmatic/download.md b/docs/howto/programmatic/download.md
new file mode 100644
index 0000000000..7a585519bc
--- /dev/null
+++ b/docs/howto/programmatic/download.md
@@ -0,0 +1,158 @@
+A common use-case for the NOMAD API is to download large amounts of NOMAD data.
+In this how-to guide, we use curl and API endpoints
+that stream .zip files to download many resources with a single request directly from
+the command line.
+
+## Prerequisites
+
+Here is some background information to understand the examples better.
+
+### curl
+
+curl is a widely supported command-line tool that sends HTTP requests and retrieves
+the responses. It provides a simple and efficient
+way to interact with REST APIs and lets you specify the necessary headers, parameters,
+and authentication details. Whether you need to download files, retrieve JSON data, or access
+other resources, curl is a flexible solution for programmatically
+fetching data from REST APIs.
+
+### Raw files vs processed data
+
+We are covering two types of resources: *raw files* and *processed data*.
+The former are organized into uploads and subdirectories. The organization depends
+on how the author provided the files.
+The latter is organized by entries.
Each NOMAD entry has corresponding structured data.
+
+Endpoints that target raw files typically contain `raw`, e.g. `uploads/<id>/raw`
+or `entries/raw/query`. Endpoints that target processed data contain `archive`
+(because we call the entirety of all processed data the NOMAD Archive), e.g.
+`entries/<id>/archive` or `entries/archive/query`.
+
+### Entry vs upload
+
+API endpoints for data download target either *entries* or *uploads*. For both types
+of entities, endpoints for raw files and processed data (as well as searchable metadata)
+exist. API endpoint paths start with the entity, e.g. `uploads/<id>/raw` or `entries/<id>/raw`.
+
+## Download a whole upload
+
+Let's assume you want to download an entire upload. In this example the upload id is
+`wW45wJKiREOYTY0ARuknkA`.
+
+```sh
+curl -X GET "{{ nomad_url() }}/v1/uploads/wW45wJKiREOYTY0ARuknkA/raw" -o download.zip
+```
+
+This will create a `download.zip` file in the current folder. The zip file will contain
+the raw file directory of the upload.
+
+The `uploads/<id>/raw` endpoint used here is only available for published uploads. For those,
+all raw files have already been
+packed into a zip file and this endpoint simply lets you download it. This is the simplest
+and most reliable download implementation.
+
+Alternatively, you can download specific files or sub-directories. This method is available
+for all uploads, including unpublished ones.
+
+```sh
+curl -X GET "{{ nomad_url() }}/v1/uploads/wW45wJKiREOYTY0ARuknkA/raw/?compress=true" -o download.zip
+```
+
+This endpoint looks very similar, but is implemented very differently. Note that we
+append an empty path `/` to the end of the URL, plus a query parameter `compress=true`.
+The path can be replaced with any directory or file path in the upload; `/` denotes the
+whole upload. The query parameter says that we want to download the whole directory
+as a zip file, instead of an individual file.
The server then traverses all files and
+creates a zip file on the fly.
+
+## Download a whole dataset
+
+Now let's assume that you want to download all raw files that are associated with
+all the entries of an entire dataset. In this example the dataset DOI is
+`10.17172/NOMAD/2023.11.17-2`.
+
+```sh
+curl -X POST "{{ nomad_url() }}/v1/entries/raw/query" \
+-H 'Content-Type: application/json' \
+-d '{
+    "query": {
+        "datasets.doi": "10.17172/NOMAD/2023.11.17-2"
+    }
+}' \
+-o download.zip
+```
+
+This time, we use the `entries/raw/query` endpoint that is based on entries and not on uploads. Here, we
+select entries with a query. In the example, we query for the dataset DOI, but you
+can replace this with any NOMAD search query (look out for the `<>` symbol on the
+[search interface]({{ nomad_url() }}/../gui/search/entries)). The zip file will contain all raw files from all the
+directories that contain the mainfile of one of the entries that match the query.
+
+This might not necessarily download all uploaded files. Alternatively, you can use a query to
+get all upload ids and then use the method from the previous section:
+
+```sh
+curl -X POST "{{ nomad_url() }}/v1/entries/query" \
+-H 'Content-Type: application/json' \
+-d '{
+    "query": {
+        "datasets.doi": "10.17172/NOMAD/2023.11.17-2"
+    },
+    "pagination": {
+        "page_size": 0
+    },
+    "aggregations": {
+        "upload_ids": {
+            "terms": {
+                "quantity": "upload_id"
+            }
+        }
+    }
+}'
+```
+
+The last command will print JSON data that contains all the upload ids. It uses
+the `entries/query` endpoint that allows you to query NOMAD's search.
+It does not return any results (`page_size: 0`),
+but performs an aggregation over all search results and collects the upload ids
+from all entries.
+
+## Download some processed data for a whole dataset
+
+Similar to raw files, you can also download processed data. This is also an
+entry-based operation driven by a query.
This time we also specify a `required` section
+to state which parts of the processed data we are interested in:
+
+```sh
+curl -X POST "{{ nomad_url() }}/v1/entries/archive/download/query" \
+-H 'Content-Type: application/json' \
+-d '{
+    "query": {
+        "datasets.doi": "10.17172/NOMAD/2023.11.17-2"
+    },
+    "required": {
+        "metadata": {
+            "entry_id": "*",
+            "mainfile": "*",
+            "upload_id": "*"
+        },
+        "results": {
+            "material": "*"
+        },
+        "run": {
+            "system[-1]": {
+                "atoms": "*"
+            }
+        }
+    }
+}' \
+-o download.zip
+```
+
+Here we use the `entries/archive/download/query` endpoint. The result is a zip file
+with one JSON file per entry. There are no directories and the files are named
+`<entry-id>.json`. To associate the JSON files with entries, you should require
+information that tells you more about the entries, e.g. `required.metadata.mainfile`.
+
+See also the [How to access processed data](./archive_query.md) how-to guide.
+
diff --git a/mkdocs.yml b/mkdocs.yml
index 32ccbffaf2..56b2fa02ef 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -21,6 +21,7 @@ nav:
       - Use NORTH: howto/manage/north.md
     - Programmatic use:
       - Use the API: howto/programmatic/api.md # TODO separate into How-to and Explanation/Reference
+      - Download data: howto/programmatic/download.md
       - Publish data using python: howto/programmatic/publish_python.md
       - Install nomad-lab: howto/programmatic/pythonlib.md
       - Access processed data: howto/programmatic/archive_query.md
diff --git a/nomad/mkdocs.py b/nomad/mkdocs.py
index 068ebe6b28..e39fa49bc7 100644
--- a/nomad/mkdocs.py
+++ b/nomad/mkdocs.py
@@ -209,7 +209,7 @@ def define_env(env):
     @env.macro
     def nomad_url():  # pylint: disable=unused-variable
         # TODO Fix the configuration during build time.
-        return 'https://nomad-lab.eu/prod/v1/staging/api'
+        return 'https://nomad-lab.eu/prod/v1/api'
         # return config.api_url()
 
     @env.macro
-- 
GitLab