Commit 10217838 authored by Markus Scheidgen's avatar Markus Scheidgen

Merge branch 'v1.0.3' into 'master'

Merge for release v1.0.3

Closes #730

See merge request !580
parents a83a76d8 d61f9514
Pipeline #124344 skipped with stage
......@@ -81,14 +81,14 @@ python tests:
stage: test
image: $TEST_IMAGE
services:
- name: rabbitmq:3.7.17
- name: rabbitmq:3.9.13
alias: rabbitmq
- name: docker.elastic.co/elasticsearch/elasticsearch:6.8.15
alias: elastic
# fix issue with running elastic in gitlab ci runner:
# https://gitlab.com/gitlab-org/gitlab-ce/issues/42214
command: [ "bin/elasticsearch", "-Ediscovery.type=single-node" ]
- name: mongo:4
- name: mongo:5.0.6
alias: mongo
variables:
RABBITMQ_ERLANG_COOKIE: SWQOKODSQALRPCLNMEQG
......
......@@ -46,6 +46,12 @@ contributing, and API reference.
Omitted versions are plain bugfix releases with only minor changes and fixes.
### v1.0.3
- refactored DCAT to use fast api, added DOIs
- refactored ArchiveQuery client
- documentation and fixes for Oasis with keycloak
- many minor GUI bugfixes
### v1.0.0
- new search interface
- new v1 API (entries, materials, upload, datasets, sync)
......@@ -244,4 +250,4 @@ The first production version of nomad@fairdi as the upload API and gui for NOMAD
### v0.4.2
- bugfixes regarding the migration
- better migration configurability and reproducibility
- scales to multi node kubernetes deployment
\ No newline at end of file
- scales to multi node kubernetes deployment
# Query and Access Processed Data
The `ArchiveQuery` allows you to search for entries and access their parsed and processed *archive* data
at the same time. Furthermore, all data is accessible through a convenient Python interface
based on the [NOMAD metainfo](archive.md) rather than plain JSON.
## Basic Usage
To define a query, one can, for example, write
```python
from nomad.client.archive import ArchiveQuery
query = ArchiveQuery(query={}, required={}, page_size=10, results_max=10000)
```
Note that the above example uses an empty query. Constructing the query object does not retrieve any data by itself. To access the desired data, users need to perform two operations manually: fetch and download.
### Fetch
The fetch process is carried out **synchronously**. Users can call
```python
number_of_entries = query.fetch()
```
to fetch up to `results_max` entries. An indicative number `n` can be provided via `fetch(n)`. Since each upload may
contain a different number of entries, the fetch process guarantees that at least `n` entries will be fetched; the
exact number depends on `page_size`, which determines how many uploads are in each page, and is always capped by
`results_max`. The method returns the exact number of qualified entries. Meanwhile, the qualified uploads and their
IDs are stored in an internal list. To check all qualified upload IDs, call the `upload_list()` method to return the
full list.
```python
print(query.upload_list())
```
If applicable, it is possible to fetch a large number of entries first and then perform a second fetch, using one of
the upload IDs from the first fetch result as the `after` argument, so that a middle segment of the storage can be downloaded.
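To make the interplay of `n`, `page_size`, and `results_max` concrete, here is a standalone sketch of the paging logic, using hypothetical upload IDs and entry counts and no server at all (this mimics, but is not, the actual client implementation):

```python
def fetch_pages(server_uploads, page_size, n, results_max):
    """Sketch: fetch pages of `page_size` uploads until at least `n` entries
    are covered, capped at `results_max`."""
    fetched, total = [], 0
    for start in range(0, len(server_uploads), page_size):
        page = server_uploads[start:start + page_size]
        fetched.extend(page)
        total += sum(count for _, count in page)
        if total >= min(n, results_max):
            break
    return fetched, total

# hypothetical upload IDs and per-upload entry counts
uploads = [('upload_a', 50), ('upload_b', 30), ('upload_c', 40), ('upload_d', 20)]
fetched, total = fetch_pages(uploads, page_size=2, n=60, results_max=10000)
print(total)  # 80: the first page already covers the requested 60 entries
```

Because whole pages of whole uploads are fetched, the total can overshoot `n`, which is exactly the "at least `n` entries" guarantee described above.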
### Download
After fetching the qualified uploads, the desired data can be downloaded **asynchronously**. One can call
```python
results = query.download()
```
to download up to `results_max` entries. The downloaded results are returned as a list. Alternatively, it is possible to
just download a portion of previously fetched entries at a single time. For example,
```python
# previously fetched for example 1000 entries
# but only download the first 100 (approx.) entries
results = query.download(100)
```
The same `download(n)` method can be called repeatedly. If there are not enough fetched entries left, new entries will be
fetched automatically. If no more entries are available, the returned result list is empty. For example,
```python
total_results = []
while True:
    result = query.download(100)
    if len(result) == 0:
        break
    total_results.extend(result)
```
There is no retry mechanism in the download process. If an upload fails to download due to a server error, it is
kept in the list; successfully downloaded uploads are removed from the list.
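This bookkeeping can be pictured with a small sketch, where the hypothetical `try_download` stands in for the real network call:

```python
# hypothetical upload IDs and entry counts
pending = [('upload_a', 526), ('upload_b', 4), ('upload_c', 12)]

def try_download(upload_id):
    # pretend 'upload_b' hits a server error
    return upload_id != 'upload_b'

# successfully downloaded uploads are removed from the pending list,
# failed ones are kept so that a later download() call can pick them up
still_pending = [(uid, n) for uid, n in pending if not try_download(uid)]
print(still_pending)  # [('upload_b', 4)]
```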
## A Complete Rundown
Here we show a valid query and acquire data from the server.
We first define the desired query and construct the query object, limiting the maximum number of entries to 10000
with 10 uploads per page.
```python
from nomad.client.archive import ArchiveQuery

required = {
    'workflow': {
        'calculation_result_ref': {
            'energy': '*',
            'system_ref': {
                'chemical_composition_reduced': '*'
            }
        }
    }
}
query = {
    'results.method.simulation.program_name': 'VASP',
    'results.material.elements': ['Ti']
}
query = ArchiveQuery(query=query, required=required, page_size=10, results_max=10000)
```
Let's fetch some entries.
```python
query.fetch(1000)
print(query.upload_list())
```
If we print the upload list, it would be
```text
[('-19NlAwxTCCXb6YT9Plifw', 526), ('-2ewONNGTZ68zuTQ6zrRZw', 4), ('-3LrFBvFQtCtmEp3Hy15EA', 12), ('-3ofEqLvSZiqo59vtf-TAQ', 4), ('-4W-jogwReafpdva4ELdrw', 32), ('-BLVfvlJRWawtHyuUWvP_g', 68), ('-Dm30DqRQX6pZUbJYwHUmw', 320), ('-Jfjp-lZSjqyaph2chqZfw', 6), ('-K2QS7s4QiqRg6nMPqzaTw', 82), ('-Li36ZXhQPucJvkd8yzYoA', 10)]
```
So upload `-19NlAwxTCCXb6YT9Plifw` has 526 qualified entries, upload `-2ewONNGTZ68zuTQ6zrRZw` has 4 qualified entries,
and so on. The sum of the above entry counts gives 1064 entries in total, as shown in the terminal message.
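As a quick sanity check, the counts printed alongside the upload IDs can be summed up in plain Python:

```python
# the (upload_id, entry_count) tuples as printed by upload_list()
upload_list = [('-19NlAwxTCCXb6YT9Plifw', 526), ('-2ewONNGTZ68zuTQ6zrRZw', 4),
               ('-3LrFBvFQtCtmEp3Hy15EA', 12), ('-3ofEqLvSZiqo59vtf-TAQ', 4),
               ('-4W-jogwReafpdva4ELdrw', 32), ('-BLVfvlJRWawtHyuUWvP_g', 68),
               ('-Dm30DqRQX6pZUbJYwHUmw', 320), ('-Jfjp-lZSjqyaph2chqZfw', 6),
               ('-K2QS7s4QiqRg6nMPqzaTw', 82), ('-Li36ZXhQPucJvkd8yzYoA', 10)]
total = sum(count for _, count in upload_list)
print(total)  # 1064
```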
Now data can be downloaded.
```python
result = query.download(100)
print(f'Downloaded {len(result)} entries.') # Downloaded 526 entries.
```
Since the first upload has 526 entries, they will be downloaded in this call. The list would have the first upload
removed as it has been downloaded.
```text
[('-2ewONNGTZ68zuTQ6zrRZw', 4), ('-3LrFBvFQtCtmEp3Hy15EA', 12), ('-3ofEqLvSZiqo59vtf-TAQ', 4), ('-4W-jogwReafpdva4ELdrw', 32), ('-BLVfvlJRWawtHyuUWvP_g', 68), ('-Dm30DqRQX6pZUbJYwHUmw', 320), ('-Jfjp-lZSjqyaph2chqZfw', 6), ('-K2QS7s4QiqRg6nMPqzaTw', 82), ('-Li36ZXhQPucJvkd8yzYoA', 10)]
```
It is possible to download more data.
```python
result = query.download(300)
print(f'Downloaded {len(result)} entries.') # Downloaded 440 entries.
```
The first six uploads will be downloaded to meet the requested (at least) 300 entries, which total 440 entries. What's
left in the list would be
```text
[('-Jfjp-lZSjqyaph2chqZfw', 6), ('-K2QS7s4QiqRg6nMPqzaTw', 82), ('-Li36ZXhQPucJvkd8yzYoA', 10)]
```
We perform one more download call to illustrate that the fetch process is performed automatically.
```python
result = query.download(100)
print(f'Downloaded {len(result)} entries.') # Downloaded 102 entries.
```
In the above, we request an additional 100 entries; however, the list contains only `6+82+10=98` entries, so the fetch
process is called to fetch new entries from the server. You will see the following messages in the terminal.
```text
Fetching remote uploads...
787 entries are qualified and added to the download list.
Downloading required data...
Downloaded 102 entries.
[('-NiRWNGjS--JtFoEnYrCfg', 8), ('-OcPUKZtS6u3lXlkWBM4qg', 129), ('-PA35e2ZRsq4AdDfBU4M_g', 14), ('-TG77dGiSTyrDAFNqTKa6Q', 366), ('-VzlPYtnS4q1tSl3NOmlCw', 178), ('-XeqzVqwSMCJFwhvDqWs8A', 14), ('-Y7gwnleQI6Q61jp024fXQ', 16), ('-Zf1RO1MQXegYVTbFybtQQ', 8), ('-Zm4S9VGRdOX-kbF1J5lOA', 50)]
```
## Argument List
The following arguments are acceptable for `ArchiveQuery`.
- `owner` : `str` The scope of data to access. Default: `'visible'`
- `query` : `dict` The API query. The class performs no validation, so users must make sure the provided query is
valid. Otherwise, the server will return an error message.
- `required` : `dict` The required quantities.
- `url` : `str` The database URL. It can point to your local database. The official NOMAD database is used by
default if no valid URL is defined. Default: `http://nomad-lab.eu/prod/v1/api`
- `after` : `str` The data can be thought of as being stored in a sequential list. Each upload has a unique ID;
if `after` is not provided, the query always starts from the first upload. One can choose to query the uploads in the
middle of storage by assigning a proper value to `after`.
- `results_max` : `int` Determines how many entries to download. Note that each upload may have multiple entries.
- `page_size` : `int` Page size, i.e., the number of uploads per page.
- `username` : `str` Username for authentication.
- `password` : `str` Password for authentication.
- `retry` : `int` In the case of server errors, the fetch process is automatically retried every `sleep_time` seconds.
This argument limits the maximum number of retries.
- `sleep_time` : `float` The interval between fetch retries.
\ No newline at end of file
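To make the `retry`/`sleep_time` semantics concrete, here is a generic retry loop in the same spirit. This is a sketch, not the actual `ArchiveQuery` internals; `flaky_fetch` is a made-up stand-in for a server call:

```python
import time

def fetch_with_retry(fetch_once, retry=4, sleep_time=4.0):
    """Sketch: retry a failing fetch up to `retry` times, sleeping
    `sleep_time` seconds between attempts."""
    for attempt in range(retry):
        try:
            return fetch_once()
        except ConnectionError:
            if attempt == retry - 1:
                raise
            time.sleep(sleep_time)

# simulate a server that fails twice, then succeeds
attempts = {'n': 0}

def flaky_fetch():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise ConnectionError('server error')
    return 42

result = fetch_with_retry(flaky_fetch, retry=5, sleep_time=0.0)
print(result)  # 42
```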
# Run NOMAD parser locally
If you install `nomad-lab[parsers]`, you can use the NOMAD parsers locally on your computer.
To use the NOMAD parsers from the command line, you can use the parse CLI command. The parse command will automatically match the right parser to your code output file and run the parser. There are two output formats, `--show-metadata` (a JSON representation of the basic metadata) and `--show-archive` (a JSON representation of the full parse results).
```sh
nomad parser --show-archive <path-to-your-mainfile-code-output-file>
```
You can also use the NOMAD parsers within Python, as shown below. This will give you the parse results as metainfo objects to conveniently analyze the results in Python. See metainfo for more details on how to use the metainfo in Python.
```python
import sys
from nomad.client import parse, normalize_all
# match and run the parser
archive = parse(sys.argv[1])
# run all normalizers
normalize_all(archive)
# get the 'main section' section_run as a metainfo object
section_run = archive.run[0]
# get the same data as JSON serializable Python dict
python_dict = section_run.m_to_dict()
```
You can also clone a parser project to debug or fix a parser:
```sh
git clone https://github.com/nomad-coe/nomad-parser-vasp.git
cd nomad-parser-vasp
git checkout metainfo-refactor
python -m nomad.cli parser --show-archive <path-to-your-vasp-file>
```
Our parsers are hosted on GitHub in the [nomad-coe](https://github.com/nomad-coe) organization. They are typically named `nomad-parser-<code-name>`. The parser version
that fits the NOMAD v1 metainfo schema is typically in the `metainfo-refactor` branch.
Run the CLI with `python -m nomad.cli` to automatically include the current working directory
in the Python path. This will use the cloned parser code over the installed parser code.
\ No newline at end of file
......@@ -28,7 +28,7 @@ your OASIS and the central NOMAD user management and to allow your users to uplo
Your machine needs to be accessible under this hostname from the public internet. The host
name needs to be registered in the central NOMAD in order to configure the central user-
management correctly.
- Your NOMAD account should act as an admin account for your OASIS. This account must be declared
- You need to have a NOMAD account that acts as an *admin account* for your OASIS. This account must be declared
to the central NOMAD as an OASIS admin in order to give it the necessary rights in the central user management.
- You must know your NOMAD user-id. This information has to be provided by us.
......@@ -36,20 +36,22 @@ Please [write us](mailto:support@nomad-lab.eu) to register your NOMAD account as
admin and to register your hostname. Please replace the indicated configuration items with
the right information.
In principle, you can also run your own user management. This is not yet documented.
The central user management makes synchronizing data between NOMAD installations easier, and we generally recommend using the central system.
But in principle, you can also run your own user management. See the section on
[your own user management](#provide-and-connect-your-own-user-management).
## Docker and docker compose
### Pre-requisites
NOMAD software is distributed as a set of docker containers and there are also other services required
that can be run with docker. Further, we use docker-compose to setup
all necessary container in the simplest way possible.
NOMAD software is distributed as a set of docker containers, and the other required services can also be run with docker.
Further, we use docker-compose to set up all necessary containers in the simplest way possible.
You will need a single computer, with **docker** and **docker-compose** installed.
You will need a single computer, with **docker** and **docker-compose** installed. Refer
to the official [docker](https://docs.docker.com/engine/install/) and [docker-compose](https://docs.docker.com/compose/install/)
documentation for installation instructions.
The following will run all necessary services with docker. These comprise: a **mongodb**
The following will run all necessary services with docker. These comprise: a **mongo**
database, an **elasticsearch**, a **rabbitmq** distributed task queue, the NOMAD **app**,
NOMAD **worker**, and NOMAD **gui**. In this [introduction](index.md#architecture),
you will learn what each service does and why it is necessary.
......@@ -57,8 +59,7 @@ you will learn what each service does and why it is necessary.
### Configuration overview
All docker containers are configured via docker-compose and the respective `docker-compose.yaml` file.
Further, we will need to mount some configuration files to configure the NOMAD services within
their respective containers.
Further, we will need to mount some configuration files to configure the NOMAD services within their respective containers.
There are three files to configure:
......@@ -195,7 +196,7 @@ client:
services:
api_host: '<your-host>'
api_prefix: '/nomad-oasis'
api_base_path: '/nomad-oasis'
admin_user_id: '<your admin user id>'
keycloak:
......@@ -225,7 +226,6 @@ You need to change the following:
A few things to notice:
- Be secretive about your admin credentials; make sure this file is not publicly readable.
- We will use your hostname as `deployment_id`. When you publish uploads from your Oasis to the
central NOMAD, this will be added as upload metadata and allows one to see where the upload came
from.
......@@ -291,13 +291,23 @@ A few things to notice:
- It configures the base path (`nomad-oasis`) at multiple places. It needs to be changed, if you use a different base path.
- You can use the server for additional content if you like.
- `client_max_body_size` sets a limit to the possible upload size.
- If you operate the GUI container behind another proxy, keep in mind that your proxy should not buffer requests/responses to allow streaming of large requests/responses for `../api/uploads` and `../api/raw`.
### gunicorn
Gunicorn is the WSGI-server that runs the nomad app. Consult the
[gunicorn documentation](https://docs.gunicorn.org/en/stable/configure.html) for
configuration options.
You can add an additional reverse proxy in front or modify the nginx in the docker-compose.yaml
to [support https](http://nginx.org/en/docs/http/configuring_https_servers.html).
If you operate the GUI container behind another proxy, keep in mind that your proxy should
not buffer requests/responses to allow streaming of large requests/responses for `api/v1/uploads` and `api/v1/.*/download`.
An nginx reverse proxy location on such an additional reverse proxy could have these directives to ensure the correct
HTTP headers and to allow download and upload of large files:
```nginx
client_max_body_size 35g;
proxy_set_header Host $host;
proxy_pass_request_headers on;
proxy_buffering off;
proxy_request_buffering off;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_pass http://<your-oasis-host>/nomad-oasis;
```
### Running NOMAD
......@@ -310,7 +320,7 @@ docker-compose pull
In the beginning and to simplify debugging, it is recommended to start the services separately:
```sh
docker-compose up -d mongodb elastic rabbitmq
docker-compose up -d mongo elastic rabbitmq
docker-compose up app worker gui
```
......@@ -356,6 +366,121 @@ If you want to report problems with your OASIS. Please provide the logs for
- nomad_oasis_worker
- nomad_oasis_gui
### Provide and connect your own user management
NOMAD uses [keycloak](https://www.keycloak.org/) for its user management, in two ways. First, user authentication uses
the OpenID Connect/OAuth interfaces provided by keycloak. Second, NOMAD uses the keycloak realm-management API to get
a list of existing users. Keycloak is highly customizable, and numerous options exist to connect keycloak to existing
identity providers.
This tutorial assumes that you have some understanding of what keycloak is and how it works.
In the following, we provide basic installation steps for running your own keycloak in
the NOMAD Oasis docker-compose. First, add a keycloak service to the `docker-compose.yaml`:
```yaml
services:
  # keycloak user management
  keycloak:
    restart: always
    image: jboss/keycloak:16.1.1
    container_name: nomad_oasis_keycloak
    environment:
      - PROXY_ADDRESS_FORWARDING=true
      - KEYCLOAK_FRONTEND_URL=http://<your-host>/keycloak/auth
    volumes:
      - keycloak:/opt/jboss/keycloak/standalone/data
    # Uncomment to get access to the admin console.
    # ports:
    #   - 8080:8080
```
Also add links to the `keycloak` service in the `app` and `worker` service:
```yaml
services:
  app:
    links:
      - keycloak
  worker:
    links:
      - keycloak
```
A few notes:
- The environment variables on the keycloak service allow using keycloak behind the nginx proxy with the path prefix `keycloak`.
- By default, keycloak will use a simple H2 file database stored in the given volume. Keycloak offers many other options to connect SQL databases.
- We will use keycloak with our nginx proxy here, but you can also host-bind the port `8080` to access keycloak directly.
Second, we add a keycloak location to the nginx config:
```nginx
location /keycloak {
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    rewrite /keycloak/(.*) /$1 break;
    proxy_pass http://keycloak:8080;
}
```
```
A few notes:
- Again, we are using `keycloak` as a path prefix. We configure the headers to allow
keycloak to pick up the rewritten url.
Third, we modify the keycloak configuration in the `nomad.yaml`:
```yaml
services:
  admin_user_id: 'a9e97ae9-7568-4b44-bb9f-7a7c6be898e8'
keycloak:
  server_url: 'http://keycloak:8080/auth'
  public_server_url: 'http://<your-host>/keycloak/auth'
  realm_name: nomad
  username: 'admin'
  password: 'password'
  oasis: true
```
A few notes:
- There are two URLs to configure for keycloak. The `server_url` is used by the NOMAD
services to communicate directly with keycloak within the docker network. The `public_server_url`
is used by the UI to perform the authentication flow.
- The particular `admin_user_id` is the Oasis admin user in the provided example realm
configuration. See below.
As a last step, we need to configure keycloak. First, run the keycloak service and
update nginx with the new config.
```sh
docker-compose up keycloak
docker exec nomad_oasis_gui nginx -s reload
```
If you open `http://<yourhost>/keycloak/auth` in a browser, you will see that there are no
admin users yet. Second, we need to create a keycloak admin account:
```sh
docker exec nomad_oasis_keycloak /opt/jboss/keycloak/bin/add-user-keycloak.sh -u admin -p <PASSWORD>
docker restart nomad_oasis_keycloak
```
Give it a second to restart. Afterwards, you can log in to the admin console at `http://<yourhost>/keycloak/auth`.
Keycloak uses `realms` to manage users and clients. We need to create a realm that NOMAD
can use. We prepared an example realm with the necessary NOMAD client, an Oasis admin,
and a test user. You can create a new realm through the admin console. Select this file
to import our example configuration.
A few notes on the realm configuration:
- Realm and client settings are almost all default keycloak settings.
- You should change the password of the admin user in the nomad realm.
- The admin user in the nomad realm has the additional `view-users` client role for `realm-management`
assigned. This is important, because NOMAD will use this user to retrieve the list of possible
users for managing co-authors and reviewers on NOMAD uploads.
- The realm has one client `nomad_public`. This has a basic configuration. You might
want to adapt this to your own policies. In particular you can alter the valid redirect URIs to
your own host.
## Base Linux (without docker)
### Pre-requisites
......
# Using the Python library
# Install the Python library
NOMAD provides a Python package called `nomad-lab`.
## Install
The package is hosted on [pypi](https://pypi.org/project/nomad-lab/)
and you can install it with *pip* (or conda).
......@@ -42,163 +40,3 @@ The various extras have the following meaning:
- *dev*, additional tools that are necessary to develop NOMAD
- *all*, all of the above
## Access parsed NOMAD data with `ArchiveQuery`
The `ArchiveQuery` allows you to search for entries and access their parsed *archive* data
at the same time. Furthermore, all data is accessible through a convenient Python interface
based on the [NOMAD metainfo](archive.md) rather than plain JSON.
Here is an example:
```py
query = ArchiveQuery(
    query={
        'results.method.simulation.program_name': 'VASP',
        'results.material.elements': ['Ti', 'O'],
        'results.method.simulation.geometry_optimization': {
            'convergence_tolerance_energy_difference:lt': 1e-22
        }
    },
    required={
        'workflow': {
            'calculation_result_ref': {
                'energy': '*',
                'system_ref': {
                    'chemical_composition_reduced': '*'
                }
            }
        }
    },
    parallel=10,
    max=100)
```
This instantiates an `ArchiveQuery`. You can print some details about the query:
```py
print(query)
```
This gives you a general overview of the query: for example, which search was performed on
the NOMAD API, how many entries were found, or what has already been downloaded.
```py
Query: {
    "and": [
        {
            "results.method.simulation.program_name": "VASP",
            "results.material.elements": [
                "Ti",
                "O"
            ],
            "results.method.simulation.geometry_optimization": {
                "convergence_tolerance_energy_difference:lt": 1e-22
            }
        },
        {
            "quantities": [
                "run.system.chemical_composition_reduced",
                "run.calculation.system_ref",
                "run.calculation.energy",
                "workflow",
                "workflow.calculation_result_ref"
            ]
        }
    ]
}
Total number of entries that fulfil the query: 252
Number queried entries: 252
Number of entries loaded in the last api call: 70
Bytes loaded in the last api call: 53388
Bytes loaded from this query: 53388
Number of downloaded entries: 70
Number of made api calls: 1
```
This `ArchiveQuery` does not download all archive data immediately. More and more data will be
downloaded as you iterate through the query:
```py
for result in query:
    calc = result.workflow[0].calculation_result_ref
    formula = calc.system_ref.chemical_composition_reduced
    total_energy = calc.energy.total.value.to(units.eV)
    print(f'{formula}: {total_energy}')
```
The resulting output can look like this:
```
O10K2Ti3La2: -136.76387842 electron_volt
Li2O10Ti3La2: -139.15455203 electron_volt
O8Ti4: -107.30373862 electron_volt
O8Ca2Ti4: -116.52240913000001 electron_volt
...
```
Let's discuss the used `ArchiveQuery` parameters:
- `query`, this is an arbitrary API query as discussed in the under [Queries in the API section](api.md#queries).
- `required`, this optional parameter allows you to specify which parts of an archive you require. This is also
described under [Access archives in API section](api.md#access-archives).
- `per_page`, this optional parameter allows you to specify how many results should be downloaded at once. For mass download of many results, we recommend ~100. If you are only interested in the first results a lower number may increase performance.
- `max`, with this optional parameter, we limit the maximum number of entries that are downloaded to avoid accidentally iterating through a result set of unknown and potentially large size.
- `owner` and `auth`, allow you to access private data or to specify that you only want to
query your own data. See also [owner](api.md#owner) and [auth](api.md#authentication) in the API section. Here is an example with authentication:
```py
from nomad.client import ArchiveQuery, Auth

query = ArchiveQuery(
    owner='user',
    required={
        'run': {
            'system[-1]': '*'
        }
    },
    authentication=Auth(user='yourusername', password='yourpassword'))
```
The archive query object behaves like a Python list: you can use indices and ranges to select results. Each result is a Python object whose attributes are
determined by NOMAD's schema, [the metainfo](archive.md).
This energy value is a number with an attached unit (Joule), which can be converted to something else (e.g. eV). {{ metainfo_data() }}
The created query object keeps all results in memory. Keep this in mind when you are accessing a large number of query results.
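For reference, the arithmetic behind a Joule-to-eV conversion looks as follows. The actual unit handling in NOMAD comes from the metainfo/pint machinery; this sketch only shows the underlying math, with a made-up energy value:

```python
EV_IN_JOULE = 1.602176634e-19  # exact value per the 2019 SI redefinition

def joule_to_ev(value_in_joule):
    """Convert an energy from Joule to electron volt."""
    return value_in_joule / EV_IN_JOULE

energy_j = 2.18e-17  # hypothetical total energy in Joule
print(f'{joule_to_ev(energy_j):.2f} eV')  # 136.06 eV
```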
## Use NOMAD parser locally
If you install `nomad-lab[parsers]`, you can use the NOMAD parsers locally on your computer.
To use the NOMAD parsers from the command line, you can use the parse CLI command. The parse command will automatically match the right parser to your code output file and run the parser. There are two output formats, `--show-metadata` (a JSON representation of the basic metadata) and `--show-archive` (a JSON representation of the full parse results).
```sh
nomad parser --show-archive <path-to-your-mainfile-code-output-file>
```
You can also use the NOMAD parsers within Python, as shown below. This will give you the parse results as metainfo objects to conveniently analyze the results in Python. See metainfo for more details on how to use the metainfo in Python.
```python
import sys
from nomad.client import parse, normalize_all