diff --git a/docs/apis/api.md b/docs/apis/api.md index 277a6e0ca67e3ea0716704bcf040aff20216fcd3..1d1e4e30fd1bb0e9f0e2d71f86ab11b5f1e65ea4 100644 --- a/docs/apis/api.md +++ b/docs/apis/api.md @@ -2,7 +2,7 @@ This guide is about using NOMAD's REST APIs directly, e.g. via Python's *request To access the processed data with our client library `nomad-lab` follow [How to access the processed data](archive_query.md). You watch our -[video tutorial on the API](../tutorial.md#access-data-via-api). +[video tutorial on the API](../tutorial/access_api.md#access-data-via-api). ## Different options to use the API diff --git a/docs/index.md b/docs/index.md index d403f067a72debbe6e531126dac6ed5f6ac00031..9c885241e2e8af0141e0fe17911d5c34710b4719 100644 --- a/docs/index.md +++ b/docs/index.md @@ -24,12 +24,19 @@ NOMAD is useful for scientists that work with data, for research groups that nee ### Tutorial -This series of [short videos will guide you through the main functionality of NOMAD](tutorial.md). -It covers th whole publish, explore, analyze cycle: +A series of tutorials will guide you through the main functionality of NOMAD. + +- [Upload and publish your own data](tutorial/upload_publish.md) +- [Use the search interface to identify interesting data](tutorial/explore.md) +- [Use the API to search and access processed data for analysis](tutorial/access_api.md) +- [Find and use the automations of the built-in schemas available in NOMAD](tutorial/builtin.md) +- [Create and use custom schemas in NOMAD](tutorial/custom.md) +- [Customization at its best: user-defined schema and automation](tutorial/plugins.md) +- [Third-party ELN integration](tutorial/third_party.md) + +- [Example data and exercises](https://www.fairmat-nfdi.eu/events/fairmat-tutorial-1/tutorial-1-materials) +- [More videos and tutorials on YouTube](https://youtube.com/playlist?list=PLrRaxjvn6FDW-_DzZ4OShfMPcTtnFoynT) -- Upload and publish your own data -- Use the search interface to identify interesting data -- Use the API to search and access processed data for analysis </div> <div markdown="block"> diff --git a/docs/learn/data.md b/docs/learn/data.md index 1445b46577df2837650a78952bb2ea71ee4f914a..7749d1c721a3ec90307263b229ecfa9098c386c7 100644 --- a/docs/learn/data.md +++ b/docs/learn/data.md @@ -62,28 +62,6 @@ specific types of data. </figcaption> </figure> -### Shared entry structure - -The processed data (archive) of each entry share the same structure. They all instantiate -the same root section `EntryArchive`. They all share common sections `metadata:EntryMetadata` -and `results:Results`. They also all contain a *data* section, but the used section -definition varies depending on the type of data of the specific entry. There is the -literal `data:EntryData` sub-section. Here `EntryData` is abstract and specific entries -will use concrete definitions that inherit from `EntryData`. There are also specific *data* -sections, like `run` for simulation data and `nexus` for nexus data. - -!!! attention - The results, originally only designed for computational data, will soon be revised - an replaced by a different section. However, the necessity and function of a section - like this remains. - -<figure markdown> -  - <figcaption> - All entries instantiate the same section share the same structure. - </figcaption> -</figure> - ### Base sections Base section is a very loose category. In principle, every section definition can be @@ -148,7 +126,7 @@ and browse based on sub-sections, or explore the Metainfo through packages. 
To see all user provided uploaded schemas, you can use a [search for the sub-section `definition`](https://nomad-lab.eu/prod/v1/gui/search/entries?quantities=definitions). The sub-section `definition` is a top-level `EntryArchive` sub-section. See also our -[how-to on writing and uploading schemas](http://127.0.0.1:8001/schemas/basics.html#uploading-schemas). +[how-to on writing and uploading schemas](../schemas/basics.md#uploading-schemas). ### Contributing to the Metainfo @@ -167,7 +145,7 @@ schemas, you most likely also upload data in archive files (or use ELNs to edit Here you can also provide schemas and data in the same file. In many case specific schemas will be small and only re-combine existing base sections. See also our -[how-to on writing schemas](http://127.0.0.1:8001/schemas/basics.html). +[how-to on writing schemas](../schemas/basics.md). ## Data @@ -180,7 +158,45 @@ The Metainfo has many serialized forms. You can write `.archive.json` or `.archi files yourself. NOMAD internally stores all processed data in [message pack](https://msgpack.org/). Some of the data is stored in mongodb or elasticsearch. When you request processed data via API, you receive it in JSON. When you use the [ArchiveQuery](../apis/archive_query.md), all data is represented -as Python objects (see also [here](http://127.0.0.1:8001/plugins/schemas.html#starting-example)). +as Python objects (see also [here](../plugins/schemas.md#starting-example)). No matter what the representation is, you can rely on the structure, names, types, shapes, and units defined in the schema to interpret the data. + +## Archive files: a shared entry structure + +Broadening the discussion on the *entry* files that one can find in NOMAD, both [schemas](#schema) or [processed data](#data) are serialized as the same kind of *archive file*, either `.archive.json` or `.archive.yaml`. + +The NOMAD archive file is indeed composed by several sections. + +NOMAD archive file:`EntryArchive` + +* definitions: `Definitions` +* metadata: `EntryMetadata` +* data: `EntryData` +* run: `Run` +* nexus: `Nexus` +* workflow: `Workflow` +* results: `Results` + +They all instantiate the same root section `EntryArchive`. They all share common sections `metadata:Metadata` +and `results:Results`. They also all contain a *data* section, but the used section +definition varies depending on the type of data of the specific entry. There is the +literal `data:EntryData` sub-section. Here `EntryData` is abstract and specific entries +will use concrete definitions that inherit from `EntryData`. There are also specific *data* +sections, like `run` for simulation data and `nexus` for nexus data. + +!!! note + As shown in [Uploading schemas](../schemas/basics.md#uploading-schemas), one can, in principle, create an archive file with both `definitions` and one of the *data* sections filled, although this is not always desired because it will stick together a schema and a particular instance of that schema. They should be kept separate so that it is still possible to generate new data files from the same schema file. + +!!! attention + The results, originally only designed for computational data, will soon be revised + an replaced by a different section. However, the necessity and function of a section + like this remains. + +<figure markdown> +  + <figcaption> + All entries instantiate the same section share the same structure. 
+ </figcaption> +</figure> diff --git a/docs/reference/annotations.md b/docs/reference/annotations.md index 38c2ac7ce0c3334052085e31170ae1a499a75301..2513e5c692c1251de32f4078a135e05c3e11cc43 100644 --- a/docs/reference/annotations.md +++ b/docs/reference/annotations.md @@ -14,12 +14,105 @@ definitions: Many annotations control the representation of data in the GUI. This can be for plots or data entry/editing capabilities. -{{ pydantic_model('nomad.datamodel.metainfo.annotations.ELNAnnotation', heading='## eln') }} +{{ pydantic_model('nomad.datamodel.metainfo.annotations.ELNAnnotation', heading='## ELN annotations') }} +{{ pydantic_model('nomad.datamodel.metainfo.annotations.BrowserAnnotation', heading='## Browser') }} + +{{ pydantic_model('nomad.datamodel.metainfo.annotations.BrowserAnnotation', heading='## browser') }} + +### `label_quantity` + +This annotation goes in the section that we want to be filled with tabular data, not in the single quantities. +It is used to give a name to the instances that might be created by the parser. If it is not provided, the name of the section itself will be used as name. +Many times it is useful because, i. e., one might want to create a bundle of instances of, say, a "Substrate" class, each instance filename not being "Substrate_1", "Substrate_2", etc., but being named after a quantity contained in the class that is, for example, the specific ID of that sample. + + +```yaml +MySection: + more: + label_quantity: my_quantity + quantities: + my_quantity: + type: np.float64 + shape: ['*'] + description: "my quantity to be filled from the tabular data file" + unit: K + m_annotations: + tabular: + name: "Sheet1/my header" + plot: + x: timestamp + y: ./my_quantity +``` + +!!! important + The quantity designated as `label_quantity` should not be an array but a integer, float or string, to be set as the name of a file. If an array quantity is chosen, the parser would fall back to the use of the section as name. ## Tabular data -{{ pydantic_model('nomad.datamodel.metainfo.annotations.TabularParserAnnotation', heading='### tabular_parser') }} -{{ pydantic_model('nomad.datamodel.metainfo.annotations.TabularAnnotation', heading='### tabular') }} -{{ pydantic_model('nomad.datamodel.metainfo.annotations.PlotAnnotation', heading='## plot') }} +{{ pydantic_model('nomad.datamodel.metainfo.annotations.TabularAnnotation', heading='### `tabular`') }} + +Each and every quantity to be filled with data from tabular data files should be annotated as the following example. +A practical example is provided in [How To](../schemas/tabular.md#preparing-the-tabular-data-file) section. + +```yaml +my_quantity: + type: np.float64 + shape: ['*'] + description: "my quantity to be filled from the tabular data file" + unit: K + m_annotations: + tabular: + name: "Sheet1/my header" + plot: + x: timestamp + y: ./my_quantity +``` + +### `tabular_parser` + +One special quantity will be dedicated to host the tabular data file. In the following examples it is called `data_file`, it contains the `tabular_parser` annotation, as shown below. 
+ +{{ pydantic_model('nomad.datamodel.metainfo.annotations.TabularParserAnnotation', heading = '') }} + +### Available Combinations + +|Tutorial ref.|`file_mode`|`mapping_mode`|`sections`|How to ref.| +|---|---|---|---|---| +|1|`current_entry`|`column`|`root`|[HowTo](../schemas/tabular.md#1-column-mode-current-entry-parse-to-root)| +|2|`current_entry`|`column`|my path|[HowTo](../schemas/tabular.md#2-column-mode-current-entry-parse-to-my-path)| +|<span style="color:red">np1</span>|`current_entry`|`row`|`root`|<span style="color:red">Not possible</span>| +|3|`current_entry`|`row`|my path|[HowTo](../schemas/tabular.md#3-row-mode-current-entry-parse-to-my-path)| +|<span style="color:red">np2</span>|`single_new_entry`|`column`|`root`|<span style="color:red">Not possible</span>| +|4|`single_new_entry`|`column`|my path|[HowTo](../schemas/tabular.md#4-column-mode-single-new-entry-parse-to-my-path)| +|<span style="color:red">np3</span>|`single_new_entry`|`row`|`root`|<span style="color:red">Not possible</span>| +|5|`single_new_entry`|`row`|my path|[HowTo](../schemas/tabular.md#5-row-mode-single-new-entry-parse-to-my-path)| +|<span style="color:red">np4</span>|`multiple_new_entries`|`column`|`root`|<span style="color:red">Not possible</span>| +|<span style="color:red">np5</span>|`multiple_new_entries`|`column`|my path|<span style="color:red">Not possible</span>| +|6|`multiple_new_entries`|`row`|`root`|[HowTo](../schemas/tabular.md#6-row-mode-multiple-new-entries-parse-to-root)| +|7|`multiple_new_entries`|`row`|my path|[HowTo](../schemas/tabular.md#7-row-mode-multiple-new-entries-parse-to-my-path)| + +```yaml +data_file: + type: str + description: "the tabular data file containing data" + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: column + file_mode: single_new_entry + sections: + - my_section/my_quantity +``` + +<!-- The available options are: + +|**name**|**type**|**description**| +|---|---|---| +|`parsing_options`|group of options|some pandas `Dataframe` options.| +|`mapping_options`|list of groups of options|they allow to choose among all the possible modes of parsing data from the spreadsheet file to the NOMAD archive file. Each group of options can be repeated in a list. | --> + + +{{ pydantic_model('nomad.datamodel.metainfo.annotations.PlotAnnotation', heading='## Plot') }} -{{ pydantic_model('nomad.datamodel.metainfo.annotations.BrowserAnnotation', heading='## browser') }} diff --git a/docs/schemas/2col.png b/docs/schemas/2col.png new file mode 100644 index 0000000000000000000000000000000000000000..dee931828fcb2bee2d953936be58721b828e66c4 Binary files /dev/null and b/docs/schemas/2col.png differ diff --git a/docs/schemas/2col_notes.png b/docs/schemas/2col_notes.png new file mode 100644 index 0000000000000000000000000000000000000000..cc43a120f245b9aef3c3d65e91c127fdae562f09 Binary files /dev/null and b/docs/schemas/2col_notes.png differ diff --git a/docs/schemas/basics.md b/docs/schemas/basics.md index 965d74d6968b00c33b626276bc11fc8037617c2f..8dba33bccfa2239a04aab56d495a2d215392ac79 100644 --- a/docs/schemas/basics.md +++ b/docs/schemas/basics.md @@ -1,6 +1,6 @@ # Write NOMAD Schemas in YAML -This guide explains how to write and upload NOMAD schemas in our `.archive.yaml` format. For more information visit the [learn section on schemas](../learn/data.md). +This guide explains how to write and upload NOMAD schemas in our `.archive.yaml` format. 
For more information on how an archive file is composed, visit the [learn section on schemas](../learn/data.md). ## Example data diff --git a/docs/schemas/columns.png b/docs/schemas/columns.png new file mode 100644 index 0000000000000000000000000000000000000000..6216c97b2727828fec4189e95af7f4db971489c0 Binary files /dev/null and b/docs/schemas/columns.png differ diff --git a/docs/schemas/rows.png b/docs/schemas/rows.png new file mode 100644 index 0000000000000000000000000000000000000000..5ccb774dcb73c71195aac63203a0f0a29961d72d Binary files /dev/null and b/docs/schemas/rows.png differ diff --git a/docs/schemas/rows_subsection.png b/docs/schemas/rows_subsection.png new file mode 100644 index 0000000000000000000000000000000000000000..79195547e844de1f2f08e64bcda6c317e4d23189 Binary files /dev/null and b/docs/schemas/rows_subsection.png differ diff --git a/docs/schemas/tabular.md b/docs/schemas/tabular.md index 4483ac6a8a2a56f2d168abaea27bd6a6c6215ba6..33ddeb704a98647996771454f240959b7c3daef2 100644 --- a/docs/schemas/tabular.md +++ b/docs/schemas/tabular.md @@ -1,296 +1,254 @@ -In order to import your data from a `.csv` or `Excel` file, NOMAD provides three distinct (and separate) ways, that -with each comes unique options for importing and interacting with your data. In order to better understand how to use -NOMAD tabular parser to import your data, follow three sections below. In each section you -can find a commented sample schema with a step-by-step guide on how to import your tabular data. - -Tabular parser, implicitly, parse the data into the same NOMAD entry where the datafile is loaded. Also, explicitly, -this can be defined by putting the corresponding annotations under `current_entry` (check the examples below). -In addition, tabular parser can be set to parse the data into new entry (or entries). For this, the proper annotations -should be appended to `new_entry` annotation in your schema file. - -Two main components of any tabular parser schema are: -1) implementing the correct base-section(s), and -2) providing a `data_file` `Quantity` with the correct `m_annotations`. - -Please bear in mind that the schema files should 1) follow the NOMAD naming convention -(i.e. `My_Name.archive.yaml`), and 2) be accompanied by your data file in order for NOMAD to parse them. -In the examples provided below, an `Excel` file is assumed to contain all the data, as both NOMAD and -`Excel` support multiple-sheets data manipulations and imports. Note that the `Excel` file name in each schema -should match the name of the `Excel` data file, which in case of using a `.csv` data file, it can be replaced by the -`.csv` file name. - -`TableData` (and any other section(s) that is inheriting from `TableData`) has a customizable checkbox Quantity -(i.e. `fill_archive_from_datafile`) to turn the tabular parser `on` or `off`. -If you do not want to have the parser running everytime you make a change to your archive data, it is achievable then via -unchecking the checkbox. It is customizable in the sense that if you do not wish to see this checkbox at all, -you can configure the `hide` parameter of the section's `m_annotations` to hide the checkbox. This in turn sets -the parser to run everytime you save your archive. +Refer to the [Reference guide](../reference/annotations.md) for the full list of annotations connected to this parser and to the [Tabular parser tutorial](../tutorial/custom.md#the-built-in-tabular-parser) for a detailed description of each of them. -Be cautious though! 
Turning on the tabular parser (or checking the box) on saving your data will cause -losing/overwriting your manually-entered data by the parser! +## Preparing the tabular data file -## Column-mode -The following sample schema creates one quantity off the entire column of an excel file (`column mode`). -For example, suppose in an excel sheet, several rows contain information of a chemical product (e.g. `purity` in one -column). In order to list all the purities under the column `purity` and import them into NOMAD, you can use the -following schema by substituting `My_Quantity` with any name of your choice (e.g. `Purity`), -`tabular-parser.data.xlsx` with the name of the `csv/excel` file where the data lies, and `My_Sheet/My_Column` with -sheet_name/column_name of your targeted data. The `Tabular_Parser` can also be changed to any arbitrary name of your -choice. +NOMAD and `Excel` support multiple-sheets data manipulations and imports. Each quantity in the schema will be annotated with a source path composed by sheet name and column header. The path to be used with the tabular data displayed below would be `Sheet1/My header 1` and it would be placed it the `tabular` annotation, see [Schema annotations](../tutorial/custom.md#to-be-an-entry-or-not-to-be-an-entry) section. -Important notes: +<p align="center" width="100%"> + <img width="30%" src="2col.png"> +</p> -- `shape: ['*']` under `My_Quantity` is essential to parse the entire column of the data file. -- The `data_file` `Quantity` can have any arbitrary name (e.g. `xlsx_file`) -- `My_Quantity` can also be defined within another subsection (see next sample schema) -- Use `current_entry` and append `column_to_sections` to specify which sub_section(s) is to be filled in -this mode. `Leaving this field empty` causes the parser to parse the entire schema under column mode. +In the case there is only one sheet in the Excel file, or when using a `.csv` file that is a single-sheet format, the sheet name is not required in the path. -```yaml ---8<-- "examples/data/docs/tabular-parser-col-mode.archive.yaml" -``` +The data sheets can be stored in one or more files depending on the user needs. Each sheet can independently be organized in one of the following ways: + +1) Columns:<br /> + each column contains an array of cells that we want to parse into one quantity. Example: time and temperature arrays to be plotted as x and y. + +<p align="center" width="100%"> + <img width="30%" src="columns.png"> +</p> + +2) Rows:<br /> + each row contains a set of cells that we want to parse into a section, i. e. a set of quantities. Example: an inventory tabular data file (for substrates, precursors, or more) where each column represents a property and each row corresponds to one unit stored in the inventory. + +<p align="center" width="100%"> + <img width="30%" src="rows.png"> +</p> + +3) Rows with repeated columns:<br /> -<b>Step-by-step guide to import your data using column-mode:</b> -After writing your schema file, you can create a new upload in NOMAD (or use an existing upload), -and upload both your `schema file` and the `excel/csv` file together (or zipped) to your NOMAD project. In the -`Overview` page of your NOMAD upload, you should be able to see a new entry created and appended to the `Process data` -section. Go to the entry page, click on `DATA` tab (on top of the screen) and in the `Entry` lane, your data -is populated under the `data` sub_section. 
+in addition to the mode 2), whenever the parser detects the presence of multiple columns (or multiple sets of columns) with same headers, these are taken as multiple instances of a subsection. More explanations will be delivered when showing the schema for such a structure. Example: a crystal growth process where each row is a step of the crystal growth and the repeated columns describe the "precursor materials", that can be more than one during such processes and they are described by the same "precursor material" section. -#### Row-mode Sample: -The sample schema provided below, creates separate instances of a repeated section from each row of an excel file -(`row mode`). For example, suppose in an excel sheet, you have the information for a chemical product -(e.g. `name` in one column), and each row contains one entry of the aforementioned chemical product. -Since each row is separate from others, in order to create instances of the same product out of all rows -and import them into NOMAD, you can use the following schema by substituting `My_Subsection`, -`My_Section` and `My_Quantity` with any appropriate name (e.g. `Substance`, `Chemical_product` -and `Name` respectively). +<p align="center" width="100%"> + <img width="45%" src="rows_subsection.png"> +</p> -Important notes: +Furthermore, we can insert comments before our data, we can use a special character to mark one or more rows as comment rows. The special character is annotated within the schema in the [parsing options](#parsing-options) section: -- This schema demonstrates how to import data within a subsection of another subsection, meaning the -targeted quantity should not necessarily go into the main `quantites`. -- Setting `row_to_sections` under `current_entry` signals that for each row in the sheet_name (provided in `My_Quantity`), -one instance of the corresponding (sub-)section (in this example, `My_Subsection` sub-section as it has the `repeats` -option set to true), will be appended. Please bear in mind that if this mode is selected, then all other quantities -in this sub_section, should exist in the same sheet_name. +<p align="center" width="100%"> + <img width="30%" src="2col_notes.png"> +</p> +## Inheriting the TableData base section + +`TableData` can be inherited adding the following lines in the yaml schema file:<br /> + +```yaml +MySection: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData +``` + +`EntryData` is usually also necessary as we will create entries from the section we are defining.<br /> +`TableData` provides a customizable checkbox quantity, called `fill_archive_from_datafile`, to turn the tabular parser `on` or `off`.<br /> +To avoid the parser running everytime a change is made to the archive data, it is sufficient to uncheck the checkbox. It is customizable in the sense that if you do not wish to see this checkbox at all, you can configure the `hide` parameter of the section's `m_annotations` to hide the checkbox. This in turn sets the parser to run everytime you save your archive. To hide it, add the following lines: ```yaml ---8<-- "examples/data/docs/tabular-parser-row-mode.archive.yaml" +MySection: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + hide: ['fill_archive_from_datafile'] ``` -<b>Step-by-step guide to import your data using row-mode:</b> +Be cautious though! 
Turning on the tabular parser (or checking the box) on saving your data will cause +losing/overwriting your manually-entered data by the parser! + +## Importing data in NOMAD + +After writing a schema file and creating a new upload in NOMAD (or using an existing upload), it is possible to upload the schema file. After creating a new Entry out of one section of the schema, the tabular data file must be dropped in the quantity designated by the `FileEditQuantity` annotation. After clicking save the parsing will start. In the Overview page of the NOMAD upload, new Entries are created and appended to the Processed data section. In the Entry page, clicking on DATA tab (on top of the screen) and in the Entry lane, the data is populated under the `data` subsection. +## Hands-on examples of all tabular parser modes -After writing your schema file, you can create a new upload in NOMAD (or use an existing upload), -and upload both your `schema file` and the `excel/csv` file together (or zipped) to your NOMAD project. In the -`Overview` page of your NOMAD upload, you should be able to see as many new sub-sections created and appended -to the repeating section as there are rows in your `excel/csv` file. -Go to the entry page of the new entries, click on `DATA` tab (on top of the screen) and in the `Entry` lane, -your data is populated under the `data` sub_section. +In this section eight examples will be presented, containing all the features available in tabular parser. Refer to the [Tutorial](../tutorial/custom.md#to-be-an-entry-or-not-to-be-an-entry) for more comments on the implications of the structures generated by the following yaml files. -#### Entry-mode Sample: -The following sample schema creates one entry for each row of an excel file (`entry mode`). -For example, suppose in an excel sheet, you have the information for a chemical product (e.g. `name` in one column), -and each row contains one entry of the aforementioned chemical product. Since each row is separate from others, in -order to create multiple archives of the same product out of all rows and import them into NOMAD, you can use the -following schema by substituting `My_Quantity` with any appropriate name (e.g. `Name`). -Important note: +### 1. Column mode, current Entry, parse to root -- To create new entries based on your entire schema, set `row_to_entries` to `- root`. Otherwise, you can -provide the relative path of specific sub_section(s) in your schema to create new entries. -- Leaving `row_to_entries` empty causes the parser to parse the entire schema using <b>column mode</b>! +<p align="center" width="100%"> + <img width="100%" src="../tutorial/tabular-1.png"> +</p> +The first case gives rise to the simplest data archive file. Here the tabular data file is parsed by columns, directly within the Entry where the `TableData` is inherited and filling the quantities in the root level of the schema (see dedicated how-to to learn [how to inherit tabular parser in your schema](../schemas/tabular.md#inheriting-the-tabledata-base-section)). + +!!! important + - `data_file` quantity, i.e. the tabular data file name, is located in the same Entry of the parsed quantities. + - double check that `mapping_options > sections` contains the right path. It should point to the (sub)section where the quantities are decorated with `tabular` annotation, i. e., the one to be filled with tabular data (`root` in this case). + - quantities parsed in `column` mode must have the `shape: ['*']` attribute, that means they are arrays and not scalars. 
```yaml ---8<-- "examples/data/docs/tabular-parser-entry-mode.archive.yaml" +--8<-- "examples/data/docs/tabular-parser_1_column_current-entry_to-root.archive.yaml" ``` -<b>Step-by-step guide to import your data using entry-mode:</b> - -After writing your schema file, you can create a new upload in NOMAD (or use an existing upload), -and upload both your `schema file` and the `excel/csv` file together (or zipped) to your NOMAD project. In the -`Overview` page of your NOMAD upload, you should be able to see as many new entries created and appended -to the `Process data` section as there are rows in your `excel/csv` file. -Go to the entry page of the new entries, click on `DATA` tab (on top of the screen) and in the `Entry` lane, -your data is populated under the `data` sub_section. - -<b>Advanced options to use/set in tabular parser:</b> - -- If you want to populate your schema from multiple `excel/csv` files, you can -define multiple data_file `Quantity`s annotated with `tabular_parser` in the root level of your schema -(root level of your schema is where you inherit from `TableData` class under `base_sections`). -Each individual data_file quantity can now contain a list of sub_sections which are expected to be filled -using one- or all of the modes mentioned above. Check the `MyOverallSchema` section in -`Complex Schema` example below. It contains 2 data_file quantities that each one, contains separate instructions -to populate different parts of the schema. `data_file_1` is responsible to fill `MyColSubsection` while `data_file_2` -fills all sub_sections listed in `row_to_sections` and `entry_to_sections` under `new_entry`. - -- When using the entry mode, you can create a custom `Quantity` to hold a reference to each new entries -generated by the parser. Check the `MyEntrySubsection` section in the `Complex Schema` example below. -The `refs_quantity` is a `ReferenceEditQuantiy` with type `#/MyEntry` which tells the parser to -populate this quantity with a reference to the fresh entry of type `MyEntry`. Also, you may use -`tabular_pattern` annotation to explicitly set the name of the fresh entries. - -- If you have multiple columns with exact same name in your `excel/csv` file, you can parse them using row mode. -For this, define a repeating sub_section that handles your data in different rows and inside each row, define another -repeating sub_section that contains your repeating columns. Check `MySpecialRowSubsection` section in the -`Complex Schema` example below. `data_file_2` contains a repeating column called `row_quantity_2` and -we want to create a section out of each row and each column. This is done by -creating one row of type `MySpecialRowSubsection` and populate -`MyRowQuantity3` quantity from `row_quantity_3` column in the `csv` file, and appending each column of -`row_quantity_2` to `MyRowQuantity2`. +### 2. Column mode, current Entry, parse to my path + +<p align="center" width="100%"> + <img width="100%" src="../tutorial/tabular-2.png"> +</p> + +The parsing mode presented here only differs from the previous for the `sections` annotations. In this case the section that we want to fill with tabular data can be nested arbitrarily deep in the schema and the `sections` annotation must be filled with a forward slash path to the desired section, e. g. `my_sub_section/my_sub_sub_section`. + +!!! important + - `data_file` quantity, i.e. the tabular data file name, is located in the same Entry of the parsed quantities. 
+ - double check that `mapping_options > sections` contains the right path. It should point to the (sub)section where the quantities are decorated with `tabular` annotation, i. e., the one to be filled with tabular data. + - the section to be parsed can be arbitrarily nested, given that the path provided in `sections` reachs it (e. g. `my_sub_sec/my_sub_sub_sec`). + - quantities parsed in `column` mode must have the `shape: ['*']` attribute, that means they are arrays and not scalars. ```yaml ---8<-- "examples/data/docs/tabular-parser-complex.archive.yaml" +--8<-- "examples/data/docs/tabular-parser_2_column_current-entry_to-path.archive.yaml" ``` -Here are all parameters for the two annotations `Tabular Parser` and `Tabular`. +### 3. Row mode, current Entry, parse to my path -{{ pydantic_model('nomad.datamodel.metainfo.annotations.TabularParserAnnotation', heading='### Tabular Parser') }} -{{ pydantic_model('nomad.datamodel.metainfo.annotations.TabularAnnotation', heading='### Tabular') }} +<p align="center" width="100%"> + <img width="100%" src="../tutorial/tabular-3.png"> +</p> -{{ pydantic_model('nomad.datamodel.metainfo.annotations.PlotAnnotation', heading='## Plot Annotation') }} +The current is the first example of parsing in row mode. This means that every row of the excel file while be placed in one instance of the section that is defined in `sections`. This section must be decorated with `repeats: true` annotation, it will allow to generate multiple instances that will be appended in a list with sequential numbers. Instead of sequential numbers, the list can show specific names if `label_quantity` annotation is appended to the repeated section. This annotation is included in the how-to example. The section is written separately in the schema and it does not need the `EntryData` inheritance because the instances will be grafted directly in the current Entry. As explained [below](#91-row-mode-current-entry-parse-to-root), it is not possible for `row` and `current_entry` to parse directly in the root because we need to create multiple instances of the selected subsection and organize them in a list. -## Built-in base sections for ELNs +!!! important + - `data_file` quantity, i.e. the tabular data file name, is located in the same Entry of the parsed quantities. + - double check that `mapping_options > sections` contains the right path. It should point to the (sub)section where the quantities are decorated with `tabular` annotation, i. e., the one to be filled with tabular data. + - the section to be parsed can be arbitrarily nested, given that the path provided in `sections` reachs it (e. g. `my_sub_sec/my_sub_sub_sec`). + - quantities parsed in `row` mode are scalars. + - make use of `repeats: true` in the subsection within the parent section `MySection`. + - `label_quantity` annotation uses a quantity as name of the repeated section. If it is not provided, a sequential number will be used for each instance. -Coming soon ... +```yaml +--8<-- "examples/data/docs/tabular-parser_3_row_current-entry_to-path.archive.yaml" +``` + +### 4. Column mode, single new Entry, parse to my path -## Custom normalizers +<p align="center" width="100%"> + <img width="100%" src="../tutorial/tabular-4.png"> +</p> -For custom schemas, you might want to add custom normalizers. All files are parsed -and normalized when they are uploaded or changed. The NOMAD metainfo Python interface -allows you to add functions that are called when your data is normalized. 
+One more step of complexity is added here: the parsing is not performed in the current Entry, but a new Entry it automatically generated and filled. +This structure foresees a parent Entry where we collect one or more tabular data files and possibly other info while we want to separate a specific entity of our data structure in another searchable Entry in NOMAD, e. g. a substrate Entry or a measurement Entry that would be collected inside a parent experiment Entry. We need to inherit `SubSect` class from `EntryData` because these will be standalone archive files in NOMAD. Parent and children Entries are connected by means of the `ReferenceEditQuantity` annotation in the parent Entry schema. This annotation is attached to a quantity that becomes a hook to the other ones, It is a powerful tool that allows to list in the overview of each Entry all the other referenced ones, allowing to build paths of referencing available at a glance. -Here is an example: +!!! important + - `data_file` quantity, i.e. the tabular data file name, is located in the parent Entry, the data is parsed in the child Entry. + - double check that `mapping_options > sections` contains the right path. It should point to the (sub)section where the quantities are decorated with `tabular` annotation, i. e., the one to be filled with tabular data. + - the section to be parsed can be arbitrarily nested, given that the path provided in `sections` reachs it (e. g. `my_sub_sec/my_sub_sub_sec`) + - quantities parsed in `column` mode must have the `shape: ['*']` attribute, that means they are arrays and not scalars. + - inherit also the subsection from `EntryData` as it must be a NOMAD Entry archive file. -```python ---8<-- "examples/archive/custom_schema.py" +```yaml +--8<-- "examples/data/docs/tabular-parser_4_column_single-new-entry_to-path.archive.yaml" ``` -To add a `normalize` function, your section has to inherit from `ArchiveSection` which -provides the base for this functionality. Now you can overwrite the `normalize` function -and add you own behavior. Make sure to call the `super` implementation properly to -support schemas with multiple inheritance. +### 5. Row mode, single new Entry, parse to my path + +<p align="center" width="100%"> + <img width="100%" src="../tutorial/tabular-5.png"> +</p> -If we parse an archive like this: +Example analogous to the previous, where the new created Entry contains now a repeated subsection with a list of instances made from each line of the tabular data file, as show in the [Row mode, current Entry, parse to my path](#3-row-mode-current-entry-parse-to-my-path) case. + +!!! important + - `data_file` quantity, i.e. the tabular data file name, is located in the parent Entry, the data is parsed in the child Entry. + - double check that `mapping_options > sections` contains the right path. It should point to the (sub)section where the quantities are decorated with `tabular` annotation, i. e., the one to be filled with tabular data. + - the section to be parsed can be arbitrarily nested, given that the path provided in `sections` reachs it (e. g. `my_sub_sec/my_sub_sub_sec`) + - quantities parsed in `row` mode are scalars. + - inherit also the subsection from `EntryData` as it must be a NOMAD Entry archive file. + - make use of `repeats: true` in the subsection within the parent section `MySection`. + - `label_quantity` annotation uses a quantity as name of the repeated section. If it is not provided, a sequential number will be used for each instance. 
```yaml ---8<-- "examples/archive/custom_data.archive.yaml" +--8<-- "examples/data/docs/tabular-parser_5_row_single-new-entry_to-path.archive.yaml" ``` -we will get a final normalized archive that contains our data like this: - -```json -{ - "data": { - "m_def": "examples.archive.custom_schema.SampleDatabase", - "samples": [ - { - "added_date": "2022-06-18T00:00:00+00:00", - "formula": "NaCl", - "sample_id": "2022-06-18 00:00:00+00:00--NaCl" - } - ] - } -} -``` +### 6. Row mode, multiple new entries, parse to root -## Third-party integration +<p align="center" width="100%"> + <img width="100%" src="../tutorial/tabular-6.png"> +</p> -NOMAD offers integration with third-party ELN providers, simplifying the process of connecting -and interacting with external platforms. Three main external ELN solutions that are integrated into NOMAD -are: [elabFTW](https://www.elabftw.net/), [Labfolder](https://labfolder.com/) and [chemotion](https://chemotion.net/). -The process of data retrieval and data mapping onto NOMAD's schema -varies for each of these third-party ELN provider as they inherently allow for certain ways of communicating with their -database. Below you can find a <b>How-to</b> guide on importing your data from each of these external -repositories. +The last feature available for tabular parser is now introduced: `multiple_new_entries`. It is only meaningful for `row` mode because each row of the tabular data file will be placed in a new Entry that is an instance of a class defined in the schema, this would not make sense for columns, though, as they usually need to be parsed all together in one class of the schema, for example the "timestamp" and "temperature" columns in a spreadsheet file would need to lie in the same class as they belong to the same part of experiment. +A further comment is needed to explain the combination of this feature with `root`. As mentioned before, using `root` foresees to graft data directly in the present Entry. In this case, this means that a manyfold of Entries will be generated based on the only class available in the schema. These Entries will not be bundled together by a parent Entry but just live in our NOMAD Upload as a spare list. They might be referenced manually by the user with `ReferenceEditQuantity` in other archive files. Bundling them together in one overarching Entry already at the parsing stage would require the next and last example to be introduced. +!!!important + - `data_file` quantity, i.e. the tabular data file name, is located in the parent Entry, the data is parsed in the children Entries. + - double check that `mapping_options > sections` contains the right path. It should point to the (sub)section where the quantities are decorated with `tabular` annotation, i. e., the one to be filled with tabular data. + - quantities parsed in `row` mode are scalars. + - inherit also the subsection from `EntryData` as it must be a NOMAD Entry archive file. + - make use of `repeats: true` in the subsection within the parent section `MySection`. + - `label_quantity` annotation uses a quantity as name of the repeated section. If it is not provided, a sequential number will be used for each instance. -### elabFTW integration +```yaml +--8<-- "examples/data/docs/tabular-parser_6_row_multiple-new-entries_to-root.archive.yaml" +``` -elabFTW is part of [the ELN Consortium](https://github.com/TheELNConsortium) -and supports exporting experimental data in ELN file format. 
ELNFileFormat is a zipped file -that contains <b>metadata</b> of your elabFTW project along with all other associated data of -your experiments. +### 7. Row mode, multiple new entries, parse to my path -<b>How to import elabFTW data into NOMAD:</b> +<p align="center" width="100%"> + <img width="100%" src="../tutorial/tabular-7.png"> +</p> -Go to your elabFTW experiment and export your project as `ELN Archive`. Save the file to your filesystem under -your preferred name and location (keep the `.eln` extension intact). -To parse your ebalFTW data into NOMAD, -go to the upload page of NOMAD and create a new upload. In the `overview` page, upload your exported file (either by -drag-dropping it into the <i>click or drop files</i> box or by navigating to the path where you stored the file). -This causes triggering NOMAD's parser to create as many new entries in this upload as there are experiments in your -elabFTW project. +As anticipated in the previous example, `row` mode in connection to `multiple_new_entries` will produce a manyfold of instances of a specific class, each of them being a new Entry. In the present case, each instance will also automatically be placed in a `ReferenceEditQuantity` quantity lying in a subsection defined within the parent Entry, coloured in plum in the following example image. -You can inspect the parsed data of each of your entries (experiments) by going to the <b>DATA</b> -tab of each entry page. Under <i>Entry</i> column, click on <i>data</i> section. Now a new lane titled -`ElabFTW Project Import` should be visible. Under this section, (some of) the metadata of your project is listed. -There two sub-sections: 1) <b>experiment_data</b>, and 2) <b>experiment_files</b>. +!!!important + - `data_file` quantity, i.e. the tabular data file name, is located in the same Entry, the data is parsed in the children Entries. + - double check that `mapping_options > sections` contains the right path. It should point to the (sub)section where the quantities are decorated with `tabular` annotation, i. e., the one to be filled with tabular data. + - the section to be parsed can be arbitrarily nested, given that the path provided in `sections` reachs it (e. g. `my_sub_sec/my_sub_sub_sec`) + - quantities parsed in `row` mode are scalars. + - inherit also the subsection from `EntryData` as it must be a standalone NOMAD archive file. + - make use of `repeats: true` in the subsection within the parent section `MySection`. + - `label_quantity` annotation uses a quantity as name of the repeated section. If it is not provided, a sequential number will be used for each instance. -<b>experiment_data</b> section contains detailed information of the given elabFTW experiment, such as -links to external resources and extra fields. <b>experiment_files</b> section is a list of sub-sections -containing metadata and additional info of the files associated with the experiment. +```yaml +--8<-- "examples/data/docs/tabular-parser_7_row_multiple-new-entries_to-path.archive.yaml" +``` +### 8. The Sub-Subsection nesting schema -### Labfolder integration +<p align="center" width="100%"> + <img width="100%" src="../tutorial/tabular-8.png"> +</p> -Labfolder provides API endpoints to interact with your ELN data. NOMAD makes API calls to -retrieve, parse and map the data from your Labfolder instance/database to a NOMAD's schema. 
-To do so, the necessary information are listed in the table below: +If the tabular data file contains multiple columns with exact same name, there is a way to parse them using `row` mode. As explained in previous examples, this mode creates an instance of a subsection of the schema for each row of the file. Whenever column with same name are found they are interpreted as multiple instances of a sub-subsection nested inside the subsection. To build a schema with such a feature it is enough to have two nested classes, each of them bearing a `repeats: true` annotation. This structure can be applied to each and every of the cases above with `row` mode parsing. -<i>project_url</i>: - The URL address to the Labfolder project. it should follow this pattern: - 'https://your-labfolder-server/eln/notebook#?projectIds=your-project-id'. This is used to setup - the server and initialize the NOMAD schema. +!!!important + - make use of `repeats: true` in the subsection within the parent section `MySection` and also in the sub-subsection within `MySubSect`. + - `label_quantity` annotation uses a quantity as name of the repeated section. If it is not provided, a sequential number will be used for each instance. -<i>labfolder_email</i>: - The email (user credential) to authenticate and login the user. <b>Important Note</b>: this - information <b>is discarded</b> once the authentication process is finished. +```yaml +--8<-- "examples/data/docs/tabular-parser_8_row_current-entry_to-path_subsubsection.archive.yaml" +``` -<i>password</i>: - The password (user credential) to authenticate and login the user. <b>Important Note</b>: this - information <b>is discarded</b> once the authentication process is finished. +### 9. Not possible implementations -<b>How to import Labfolder data into NOMAD:</b> +Some combinations of `mapping_options`, namely `file_mode`, `mapping_mode`, and `sections`, can give rise to not interpretable instructions or not useful data structure. For the sake of completeness, a brief explanation of the five not possible cases will be provided. +#### 9.1 Row mode, current Entry, parse to root -To get your data transferred to NOMAD, first go to NOMAD's upload page and create a new upload. -Then click on `CREATE ENTRY` button. Select a name for your entry and pick `Labfolder Project Import` from -the `Built-in schema` dropdown menu. Then click on `CREATE`. This creates an entry where you can -insert your user information. Fill the `Project url`, `Labfolder email` and `password` fields. Once completed, -click on the `save icon` in the -top-right corner of the screen. This triggers NOMAD's parser to populate the schema of current ELN. -Now the metadata and all files of your Labfolder project should be populated in this entry. +`row` mode always requires a section instance to be populated with one row of cells from the tabular data file. Multiple instances are hence generated from the rows available in the file. The instances are organized in a list and the list must be necessarily hosted as a subsection in some parent section. That's why, within the parent section, a path in `sections` must be provided different from `root`. -The `elements` section lists all the data and files in your projects. There are 6 main data types -returned by Labfolder's API: `DATA`, `FILE`, `IMAGE`, `TABLE`, `TEXT` and `WELLPLATE`. `DATA` element is -a special Labfolder element where the data is structured in JSON format. 
Every data element in NOMAD has a special -`Quantity` called `labfolder_data` which is a flattened and aggregated version of the data content. -`IMAGE` element contains information of any image stored in your Labfolder project. `TEXT` element -contains data of any text field in your Labfodler project. +#### 9.2 Column mode, single new Entry, parse to root -### Chemotion integration +This would create a redundant Entry with the very same structure of the one where the `data_file` quantity is placed, the structure would furthermore miss a reference between the two Entries. A better result is achieved using a path in `sections` that would create a new Entry and reference it in the parent one. +#### 9.3 Row mode, single new Entry, parse to root -NOMAD supports importing your data from Chemotion repository via `chemotion` parser. The parser maps -your data that is structured under chemotion schema, into a predefined NOMAD schema. From your Chemotion -repo, you can export your entire data as a zip file which then is used to populate NOMAD schema. +As explained in the first section of not possible cases, when parsing in row mode we create multiple instances that cannot remain as standalone floating objects. They must be organized as a list in a subsection of the parent Entry. -<b>How to import Chemotion data into NOMAD:</b> +#### 9.4 Column mode, multiple new entries, parse to root -Go to your Chemotion repository and export your project. Save the file to your filesystem under -your preferred name and location (`your_file_name.zip`). -To get your data parsed into NOMAD, -go to the upload page of NOMAD and create a new upload. In the `overview` page, upload your exported file (either by -drag-dropping it into the <i>click or drop files</i> box or by navigating to the path where you stored the file). -This causes triggering NOMAD's parser to create one new entry in this upload. +This case would create a useless set of Entries containing one array quantity each. Usually, when parsing in column mode we want to parse together all the columns in the same section. -You can inspect the parsed data of each of this new entry by navigating to the <b>DATA</b> -tab of the current entry page. Under <i>Entry</i> column, click on <i>data</i> section. Now a new lane titled -`Chemotion Project Import` should be visible. Under this section, (some of) the metadata of your project is listed. -Also, there are various (sub)sections which are either filled depending on whether your datafile -contains information on them. +#### 9.5 Column mode, multiple new entries, parse to my path -If a section contains an image (or attachment) it is appended to the same section under `file` Quantity. +This case would create a useless set of Entries containing one array quantity each. Usually, when parsing in column mode we want to parse together all the columns in the same section. diff --git a/docs/tutorial.md b/docs/tutorial.md deleted file mode 100644 index e8e4945ebff94551ea6e7526297c340ecf092cb2..0000000000000000000000000000000000000000 --- a/docs/tutorial.md +++ /dev/null @@ -1,43 +0,0 @@ -This is a series of short videos that guide you through the main functionality of NOMAD. -It covers the whole data-life cycle: starting with data on your hard drive, -you will learn how to prepare, upload, publish data, and reference them with a DOI. -Furthermore, you will learn how to explore, download, and use data that were published on NOMAD before. -We will perform these steps with NOMAD's graphical user interface and its APIs. 
- -- [Example data and exercises](https://www.fairmat-nfdi.eu/events/fairmat-tutorial-1/tutorial-1-materials) -- [More videos and tutorials](https://youtube.com/playlist?list=PLrRaxjvn6FDW-_DzZ4OShfMPcTtnFoynT) - -!!! note - The NOMAD seen in the tutorials is an older version with a different color theme, - but all the demonstrated functionality is still available on the current version. - You'll find the NOMAD test installation mentioned in the first video - [here](https://nomad-lab.eu/prod/v1/test/gui/search/entries). - -## Uploading and publishing data - -This tutorial guides you through the basics of going from files on your computer -to a published dataset with DOI. - -<div class="youtube"> -<iframe src="https://www.youtube.com/embed/3rVvfYoUbO0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> -</div> - -## Exploring data on - -This tutorial shows how to use NOMAD's search interface and structured data browsing to explore available data. - -<div class="youtube"> -<iframe src="https://www.youtube.com/embed/38S2U-TIvxE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> -</div> - - -## Access data via API - -This video tutorial explains the basics of API and shows how to do simple requests -against the NOMAD api. - -<div class="youtube"> -<iframe src="https://www.youtube.com/embed/G1frBCrxC0g" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> -</div> - - diff --git a/docs/tutorial/access_api.md b/docs/tutorial/access_api.md new file mode 100644 index 0000000000000000000000000000000000000000..5b145b5b37c55e93cb316b9c8dff982f6c0d0b02 --- /dev/null +++ b/docs/tutorial/access_api.md @@ -0,0 +1,13 @@ +This video tutorial explains the basics of API and shows how to do simple requests +against the NOMAD api. + +!!! note + The NOMAD seen in the tutorials is an older version with a different color theme, + but all the demonstrated functionality is still available on the current version. + You'll find the NOMAD test installation mentioned in the first video + [here](https://nomad-lab.eu/prod/v1/test/gui/search/entries). + +<div class="youtube"> +<iframe src="https://www.youtube.com/embed/G1frBCrxC0g" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> +</div> + diff --git a/docs/tutorial/builtin.md b/docs/tutorial/builtin.md new file mode 100644 index 0000000000000000000000000000000000000000..d0ff3d97ff0f06bb885954d14b2af9ac9f46206a --- /dev/null +++ b/docs/tutorial/builtin.md @@ -0,0 +1,3 @@ +!!! attention + + This part of the documentation is still work in progress. diff --git a/docs/tutorial/custom.md b/docs/tutorial/custom.md new file mode 100644 index 0000000000000000000000000000000000000000..82dd3ca4b3cb86a1b76806babba932cad8cd325e --- /dev/null +++ b/docs/tutorial/custom.md @@ -0,0 +1,176 @@ +## What is a custom schema + +!!! attention + + This part of the documentation is still work in progress. + +An example of custom schema written in YAML language. 
+
+```yaml
+definitions:
+  name: 'My test ELN'
+  sections:
+    MySection:
+      base_sections:
+        - nomad.datamodel.data.EntryData
+      m_annotations:
+        eln:
+      quantities:
+        my_array_quantity_1:
+          type: str
+          shape: ['*']
+        my_array_quantity_2:
+          type: str
+          shape: ['*']
+```
+
+## The base sections
+
+!!! attention
+
+    This part of the documentation is still work in progress.
+
+## Use of YAML files
+
+!!! attention
+
+    This part of the documentation is still work in progress.
+
+## The built-in tabular parser
+
+NOMAD provides a standard parser to import your data from a spreadsheet file (an `Excel` file with the .xlsx extension) or from a CSV file (a Comma-Separated Values file with the .csv extension). There are several ways to parse a tabular data file into a structured [data file](../learn/data.md#data), depending on the structure we want to give our data. The tabular parser can therefore be configured very flexibly, directly from the [schema file](../learn/data.md#schema) through [annotations](../schemas/elns.md#annotations).
+In this tutorial we will focus on the most common modes of the tabular parser. A complete description of all modes is given in the [Reference](../reference/annotations.md#tabular_parser) section. You can also follow the dedicated [How To](../schemas/tabular.md) to see practical examples of the NOMAD tabular parser; in each section you can find a commented sample schema with a step-by-step guide on how to set it up to obtain the desired final structure of your parsed data.
+We will make use of the tabular parser in a custom YAML schema. To obtain some structured data in NOMAD with this parser:<br />
+
+1) the schema files should follow the NOMAD [archive files](../learn/data.md#archive-files-a-shared-entry-structure) naming convention (i.e. the `.archive.json` or `.archive.yaml` extension)<br />
+2) a data file must be instantiated from the schema file<br />
+
+   [comment]: <> (--> a link to the part upload etc should be inserted)
+
+3) a tabular data file must be dragged into the annotated [quantity](../schemas/basics.md#quantities) in order for NOMAD to parse it (the quantity is called `data_file` in the following examples)
+
+### To be an Entry or not to be an Entry
+
+To use this parser, three kinds of annotation must be included in the schema: `tabular`, `tabular_parser`, and `label_quantity`. Refer to the dedicated [Reference](../reference/annotations.md#tabular-data) section for the full list of options.
+
+!!! important
+    The ranges of the three `mapping_options`, namely `file_mode`, `mapping_mode`, and `sections`, can give rise to twelve different combinations (see the table in [Reference](../reference/annotations.md#available-combinations)). It is worth analyzing each of them to understand which is the best choice to pursue from case to case.
+    Some of them give rise to "not possible" data structures but are still listed for completeness; a brief explanation of why they cannot be implemented is also provided.
+    The main take-home message is that a tabular data file can be parsed into one or more Entries in NOMAD, giving rise to diverse and arbitrarily complex structures.
+
+In the following sections, two examples will be illustrated. A [tabular data file](../schemas/tabular.md#preparing-the-tabular-data-file) is parsed into one or more [data archive files](../learn/data.md#data), whose structure is based on a [schema archive file](../learn/data.md#schema). NOMAD archive files are denoted as Entries.
+
+!!! note
+    From the NOMAD point of view, a schema file and a data file are the same kind of file in which different sections have been filled (see the [archive files description](../learn/data.md#archive-files-a-shared-entry-structure)). Specifically, a schema file has its `definitions` section filled, while a data file has its `data` section filled. See [How to write a schema](../schemas/basics.md#uploading-schemas) for a more complete description of an archive file.
+
+### Example 1
+
+We want to instantiate an object created from the schema already shown in the first [Tutorial section](#what-is-a-custom-schema) and populate it with the data contained in the following Excel file.
+
+<p align="center" width="100%">
+    <img width="30%" src="../schemas/2col.png">
+</p>
+
+The two columns in the file will be stored in a NOMAD Entry archive within two array quantities, as shown in the image below. In the case where the section to be filled is not at the root level of our schema but nested inside, it is useful to check the dedicated [How-to](../schemas/tabular.md#2-column-mode-current-entry-parse-to-my-path).
+
+<p align="center" width="100%">
+    <img width="100%" src="../tutorial/tabular-1.png">
+</p>
+
+The schema will be decorated with the annotations mentioned at the beginning of this section and will look like this:
+
+```yaml
+definitions:
+  name: 'My test ELN'
+  sections:
+    MySection:
+      base_sections:
+        - nomad.datamodel.data.EntryData
+        - nomad.parsing.tabular.TableData
+      m_annotations:
+        eln:
+      quantities:
+        data_file:
+          type: str
+          default: test.xlsx
+          m_annotations:
+            tabular_parser:
+              parsing_options:
+                comment: '#'
+              mapping_options:
+                - mapping_mode: column
+                  file_mode: current_entry
+                  sections:
+                    - '#root'
+            browser:
+              adaptor: RawFileAdaptor
+            eln:
+              component: FileEditQuantity
+        my_array_quantity_1:
+          type: str
+          shape: ['*']
+          m_annotations:
+            tabular:
+              name: "My header 1"
+        my_array_quantity_2:
+          type: str
+          shape: ['*']
+          m_annotations:
+            tabular:
+              name: "My header 2"
+```
+
+Here the tabular data file is parsed by column, directly within the Entry that inherits `TableData`, filling the quantities at the root level of the schema (see the dedicated how-to to learn [how to inherit the tabular parser in your schema](../schemas/tabular.md#inheriting-the-tabledata-base-section)).
+
+!!! note
+    In YAML files, a dash character indicates a list element. `mapping_options` is a list because it is possible to parse multiple tabular sheets from the same schema with different parsing options. `sections` is in turn a list because multiple sections of the schema can be parsed with the same parsing options.
+
+### Example 2
+
+<p align="center" width="100%">
+    <img width="100%" src="../tutorial/tabular-6.png">
+</p>
+
+In this example, each row of the tabular data file will be placed in a new Entry that is an instance of a class defined in the schema. This would make sense for, say, an inventory spreadsheet where each row can be a separate entity such as a sample, a substrate, etc.
+In this case, many Entries will be generated based on the only class available in the schema. These Entries will not be bundled together by a parent Entry but will just live in our NOMAD Upload as a loose list; to bundle them together, it is useful to check the dedicated [How-to](../schemas/tabular.md#7-row-mode-multiple-new-entries-parse-to-my-path). They can still be referenced manually inside an overarching Entry, such as an experiment Entry, from the ELN with `ReferenceEditQuantity`. The schema, parsing each row into a new Entry, looks like this:
+
+```yaml
+definitions:
+  name: 'My test ELN'
+  sections:
+    MySection:
+      base_sections:
+        - nomad.datamodel.data.EntryData
+        - nomad.parsing.tabular.TableData
+      m_annotations:
+        eln:
+      more:
+        label_quantity: my_quantity_1
+      quantities:
+        data_file:
+          type: str
+          default: test.xlsx
+          m_annotations:
+            tabular_parser:
+              parsing_options:
+                comment: '#'
+              mapping_options:
+                - mapping_mode: row
+                  file_mode: multiple_new_entries
+                  sections:
+                    - '#root'
+            browser:
+              adaptor: RawFileAdaptor
+            eln:
+              component: FileEditQuantity
+        my_quantity_1:
+          type: str
+          m_annotations:
+            tabular:
+              name: "My header 1"
+        my_quantity_2:
+          type: str
+          m_annotations:
+            tabular:
+              name: "My header 2"
+```
\ No newline at end of file
diff --git a/docs/tutorial/explore.md b/docs/tutorial/explore.md
new file mode 100644
index 0000000000000000000000000000000000000000..c5225f98ac04467b1d57f889fbd337f74f189b0f
--- /dev/null
+++ b/docs/tutorial/explore.md
@@ -0,0 +1,11 @@
+This tutorial shows how to use NOMAD's search interface and structured data browsing to explore available data.
+
+!!! note
+    The NOMAD seen in the tutorials is an older version with a different color theme,
+    but all the demonstrated functionality is still available on the current version.
+    You'll find the NOMAD test installation mentioned in the first video
+    [here](https://nomad-lab.eu/prod/v1/test/gui/search/entries).
+
+<div class="youtube">
+<iframe src="https://www.youtube.com/embed/38S2U-TIvxE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
+</div>
diff --git a/docs/tutorial/plugins.md b/docs/tutorial/plugins.md
new file mode 100644
index 0000000000000000000000000000000000000000..ff9ce6b1fdefac1dd73830d8d8701d3b85ded7e9
--- /dev/null
+++ b/docs/tutorial/plugins.md
@@ -0,0 +1,45 @@
+!!! attention
+
+    This part of the documentation is still work in progress.
+
+
+
+## Custom normalizers
+
+For custom schemas, you might want to add custom normalizers. All files are parsed
+and normalized when they are uploaded or changed. The NOMAD metainfo Python interface
+allows you to add functions that are called when your data is normalized.
+
+Here is an example:
+
+```python
+--8<-- "examples/archive/custom_schema.py"
+```
+
+To add a `normalize` function, your section has to inherit from `ArchiveSection`, which
+provides the base for this functionality. You can then override the `normalize` function
+and add your own behavior. Make sure to call the `super` implementation properly to
+support schemas with multiple inheritance.
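+
+For orientation, a minimal sketch of such a schema could look like the following. The
+section and quantity names mirror the example output shown below, but the exact contents
+of `examples/archive/custom_schema.py` may differ:
+
+```python
+from nomad.datamodel.data import ArchiveSection, EntryData
+from nomad.metainfo import Datetime, Quantity, SubSection
+
+
+class Sample(ArchiveSection):
+    added_date = Quantity(type=Datetime)
+    formula = Quantity(type=str)
+    sample_id = Quantity(type=str)
+
+    def normalize(self, archive, logger):
+        # Call the super implementation first, so normalization keeps working
+        # for schemas with multiple inheritance.
+        super().normalize(archive, logger)
+        # Derive an id from the other quantities (a hypothetical rule chosen to
+        # match the sample_id in the JSON output below).
+        if self.sample_id is None and self.added_date and self.formula:
+            self.sample_id = f'{self.added_date}--{self.formula}'
+
+
+class SampleDatabase(EntryData):
+    samples = SubSection(sub_section=Sample, repeats=True)
+```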
+
+If we parse an archive like this:
+
+```yaml
+--8<-- "examples/archive/custom_data.archive.yaml"
+```
+
+we will get a final normalized archive that contains our data like this:
+
+```json
+{
+    "data": {
+        "m_def": "examples.archive.custom_schema.SampleDatabase",
+        "samples": [
+            {
+                "added_date": "2022-06-18T00:00:00+00:00",
+                "formula": "NaCl",
+                "sample_id": "2022-06-18 00:00:00+00:00--NaCl"
+            }
+        ]
+    }
+}
+```
\ No newline at end of file
diff --git a/docs/tutorial/tabular-0.png b/docs/tutorial/tabular-0.png
new file mode 100644
index 0000000000000000000000000000000000000000..bf1c71c47e5b64a71c01804f1b3dc0162a9f465d
Binary files /dev/null and b/docs/tutorial/tabular-0.png differ
diff --git a/docs/tutorial/tabular-1.png b/docs/tutorial/tabular-1.png
new file mode 100644
index 0000000000000000000000000000000000000000..fbb6f0612a3488c79503596bb43cae019e433711
Binary files /dev/null and b/docs/tutorial/tabular-1.png differ
diff --git a/docs/tutorial/tabular-2.png b/docs/tutorial/tabular-2.png
new file mode 100644
index 0000000000000000000000000000000000000000..113161dc7012132d6bc110b2cda6e1807b06cd06
Binary files /dev/null and b/docs/tutorial/tabular-2.png differ
diff --git a/docs/tutorial/tabular-3.png b/docs/tutorial/tabular-3.png
new file mode 100644
index 0000000000000000000000000000000000000000..2633afded1d20249e4c3b22a7951a4b90e2bc3f7
Binary files /dev/null and b/docs/tutorial/tabular-3.png differ
diff --git a/docs/tutorial/tabular-4.png b/docs/tutorial/tabular-4.png
new file mode 100644
index 0000000000000000000000000000000000000000..11cb2e9b21a857e2afbcb0e2d934746c793a5eba
Binary files /dev/null and b/docs/tutorial/tabular-4.png differ
diff --git a/docs/tutorial/tabular-5.png b/docs/tutorial/tabular-5.png
new file mode 100644
index 0000000000000000000000000000000000000000..c7833d18df3e5eee7b00ecf6ee1dc988031717fe
Binary files /dev/null and b/docs/tutorial/tabular-5.png differ
diff --git a/docs/tutorial/tabular-6.png b/docs/tutorial/tabular-6.png
new file mode 100644
index 0000000000000000000000000000000000000000..573c788ff69b029c63625bd0b6c170794a8cbc91
Binary files /dev/null and b/docs/tutorial/tabular-6.png differ
diff --git a/docs/tutorial/tabular-7.png b/docs/tutorial/tabular-7.png
new file mode 100644
index 0000000000000000000000000000000000000000..1cd77800d7ca1797214c0f3c39f92d09aa8133df
Binary files /dev/null and b/docs/tutorial/tabular-7.png differ
diff --git a/docs/tutorial/tabular-8.png b/docs/tutorial/tabular-8.png
new file mode 100644
index 0000000000000000000000000000000000000000..7ce3e9240aee97f8c3f6f066bde7743de1ffeb4f
Binary files /dev/null and b/docs/tutorial/tabular-8.png differ
diff --git a/docs/tutorial/third_party.md b/docs/tutorial/third_party.md
new file mode 100644
index 0000000000000000000000000000000000000000..7d6598a572cd978dadf13a32d513d7c444fbb35a
--- /dev/null
+++ b/docs/tutorial/third_party.md
@@ -0,0 +1,99 @@
+!!! attention
+
+    This part of the documentation is still work in progress.
+
+
+NOMAD offers integration with third-party ELN providers, simplifying the process of connecting
+and interacting with external platforms. The three main external ELN solutions integrated into NOMAD
+are: [elabFTW](https://www.elabftw.net/), [Labfolder](https://labfolder.com/) and [Chemotion](https://chemotion.net/).
+The process of data retrieval and data mapping onto NOMAD's schema
+varies for each of these third-party ELN providers, as each inherently allows for certain ways of communicating with its
+database.
+Below you can find a <b>How-to</b> guide on importing your data from each of these external
+repositories.
+
+
+## elabFTW integration
+
+elabFTW is part of [the ELN Consortium](https://github.com/TheELNConsortium)
+and supports exporting experimental data in the ELN file format. The ELNFileFormat is a zipped file
+that contains the <b>metadata</b> of your elabFTW project along with all other associated data of
+your experiments.
+
+<b>How to import elabFTW data into NOMAD:</b>
+
+Go to your elabFTW experiment and export your project as `ELN Archive`. Save the file to your filesystem under
+your preferred name and location (keep the `.eln` extension intact).
+To parse your elabFTW data into NOMAD,
+go to the upload page of NOMAD and create a new upload. In the `overview` page, upload your exported file (either by
+drag-dropping it into the <i>click or drop files</i> box or by navigating to the path where you stored the file).
+This triggers NOMAD's parser to create as many new entries in this upload as there are experiments in your
+elabFTW project.
+
+You can inspect the parsed data of each of your entries (experiments) by going to the <b>DATA</b>
+tab of each Entry page. Under the <i>Entry</i> column, click on the <i>data</i> section. A new lane titled
+`ElabFTW Project Import` should now be visible. Under this section, (some of) the metadata of your project is listed.
+There are two sub-sections: 1) <b>experiment_data</b>, and 2) <b>experiment_files</b>.
+
+The <b>experiment_data</b> section contains detailed information on the given elabFTW experiment, such as
+links to external resources and extra fields. The <b>experiment_files</b> section is a list of sub-sections
+containing metadata and additional information on the files associated with the experiment.
+
+
+## Labfolder integration
+
+Labfolder provides API endpoints to interact with your ELN data. NOMAD makes API calls to
+retrieve, parse and map the data from your Labfolder instance/database to a NOMAD schema.
+To do so, the necessary information is listed below:
+
+<i>project_url</i>:
+    The URL of the Labfolder project. It should follow this pattern:
+    'https://your-labfolder-server/eln/notebook#?projectIds=your-project-id'. This is used to set up
+    the server and initialize the NOMAD schema.
+
+<i>labfolder_email</i>:
+    The email (user credential) to authenticate and log in the user. <b>Important Note</b>: this
+    information <b>is discarded</b> once the authentication process is finished.
+
+<i>password</i>:
+    The password (user credential) to authenticate and log in the user. <b>Important Note</b>: this
+    information <b>is discarded</b> once the authentication process is finished.
+
+<b>How to import Labfolder data into NOMAD:</b>
+
+To get your data transferred to NOMAD, first go to NOMAD's upload page and create a new upload.
+Then click on the `CREATE ENTRY` button. Select a name for your Entry and pick `Labfolder Project Import` from
+the `Built-in schema` dropdown menu. Then click on `CREATE`. This creates an Entry where you can
+insert your user information. Fill the `Project url`, `Labfolder email` and `password` fields. Once completed,
+click on the save icon in the
+top-right corner of the screen. This triggers NOMAD's parser to populate the schema of the current ELN.
+The metadata and all files of your Labfolder project should now be populated in this Entry.
+
+The `elements` section lists all the data and files in your project.
+There are six main data types
+returned by Labfolder's API: `DATA`, `FILE`, `IMAGE`, `TABLE`, `TEXT` and `WELLPLATE`. The `DATA` element is
+a special Labfolder element where the data is structured in JSON format. Every data element in NOMAD has a special
+`Quantity` called `labfolder_data`, which is a flattened and aggregated version of the data content.
+The `IMAGE` element contains information on any image stored in your Labfolder project. The `TEXT` element
+contains the data of any text field in your Labfolder project.
+
+## Chemotion integration
+
+NOMAD supports importing your data from the Chemotion repository via the `chemotion` parser. The parser maps
+your data, which is structured under the Chemotion schema, onto a predefined NOMAD schema. From your Chemotion
+repo, you can export your entire data as a zip file, which is then used to populate the NOMAD schema.
+
+<b>How to import Chemotion data into NOMAD:</b>
+
+Go to your Chemotion repository and export your project. Save the file to your filesystem under
+your preferred name and location (`your_file_name.zip`).
+To get your data parsed into NOMAD,
+go to the upload page of NOMAD and create a new upload. In the `overview` page, upload your exported file (either by
+drag-dropping it into the <i>click or drop files</i> box or by navigating to the path where you stored the file).
+This triggers NOMAD's parser to create one new Entry in this upload.
+
+You can inspect the parsed data of this new Entry by navigating to the <b>DATA</b>
+tab of the current Entry page. Under the <i>Entry</i> column, click on the <i>data</i> section. A new lane titled
+`Chemotion Project Import` should now be visible. Under this section, (some of) the metadata of your project is listed.
+Also, there are various (sub)sections that are filled depending on whether your data file
+contains information on them.
+
+If a section contains an image (or attachment), it is appended to the same section under the `file` Quantity.
diff --git a/docs/tutorial/upload_publish.md b/docs/tutorial/upload_publish.md
new file mode 100644
index 0000000000000000000000000000000000000000..08b18fb28f9a2567d4b3f2a0cd079a4c925157a4
--- /dev/null
+++ b/docs/tutorial/upload_publish.md
@@ -0,0 +1,17 @@
+This tutorial guides you through the basics of going from files on your computer
+to a published dataset with a DOI.
+
+It covers the whole data life cycle: starting with data on your hard drive,
+you will learn how to prepare, upload, and publish data, and how to reference them with a DOI.
+Furthermore, you will learn how to explore, download, and use data that were published on NOMAD before.
+We will perform these steps with NOMAD's graphical user interface and its APIs.
+
+!!! note
+    The NOMAD seen in the tutorials is an older version with a different color theme,
+    but all the demonstrated functionality is still available on the current version.
+    You'll find the NOMAD test installation mentioned in the first video
+    [here](https://nomad-lab.eu/prod/v1/test/gui/search/entries).
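+
+If you prefer to script the upload step instead of using the web interface, a minimal
+sketch with Python's `requests` could look like the following. The file name and token
+are placeholders, and the exact endpoint parameters are an assumption; see the
+[API guide](../apis/api.md) for the authoritative details:
+
+```python
+import requests
+
+base_url = 'https://nomad-lab.eu/prod/v1/api/v1'
+token = '<your-access-token>'  # placeholder: obtain a token after logging in
+
+# Create a new upload from a zip file on your computer.
+with open('my_data.zip', 'rb') as f:
+    response = requests.post(
+        f'{base_url}/uploads',
+        data=f,
+        headers={'Authorization': f'Bearer {token}', 'Accept': 'application/json'})
+print(response.json()['upload_id'])
+```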
+ +<div class="youtube"> +<iframe src="https://www.youtube.com/embed/3rVvfYoUbO0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> +</div> \ No newline at end of file diff --git a/examples/data/docs/tabular-parser-complex.archive.yaml b/examples/data/docs/tabular-parser-complex.archive.yaml index a483c15978cd99b034693cf5481a9f407912c91e..bdcc59f2d0b7c0275362dc70fc172fe54ac4d7e8 100644 --- a/examples/data/docs/tabular-parser-complex.archive.yaml +++ b/examples/data/docs/tabular-parser-complex.archive.yaml @@ -64,9 +64,8 @@ definitions: file_mode: current_entry sections: # list of subsections to be parsed by data_file_1 in row mode - MySpecialRowSubsection - - mapping_mode: entry + - mapping_mode: row file_mode: multiple_new_entries - with_file: true sections: # list of subsections to be parsed by data_file_1 in row mode - MyEntrySubsection MyRootQuantity: # This quantity lives in the root level which is parsed in the column mode diff --git a/examples/data/docs/tabular-parser-entry-mode.archive.yaml b/examples/data/docs/tabular-parser-entry-mode.archive.yaml index 88cfcdb5a635982c9f876d6559b344540b01e899..0482385a72b08c59e85fced6cc2de06d10f8068f 100644 --- a/examples/data/docs/tabular-parser-entry-mode.archive.yaml +++ b/examples/data/docs/tabular-parser-entry-mode.archive.yaml @@ -16,11 +16,10 @@ definitions: parsing_options: comment: '#' # Skipping lines in csv or excel file that start with the sign `#` mapping_options: - - mapping_mode: entry + - mapping_mode: row file_mode: multiple_new_entries - with_file: true sections: - - root + - '#root' My_quantity: type: str m_annotations: diff --git a/examples/data/docs/tabular-parser_1_column_current-entry_to-root.archive.yaml b/examples/data/docs/tabular-parser_1_column_current-entry_to-root.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..1b7d929074ef6fcbeb6199d5953718119581c23a --- /dev/null +++ b/examples/data/docs/tabular-parser_1_column_current-entry_to-root.archive.yaml @@ -0,0 +1,38 @@ +definitions: + name: 'My test ELN 1' + sections: + MySection1: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: column + file_mode: current_entry + sections: + - '#root' + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + my_array_quantity_1: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 1" + my_array_quantity_2: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_2_column_current-entry_to-path.archive.yaml b/examples/data/docs/tabular-parser_2_column_current-entry_to-path.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..9bac77f80a6892b9af7dae7170139c7af89a1786 --- /dev/null +++ b/examples/data/docs/tabular-parser_2_column_current-entry_to-path.archive.yaml @@ -0,0 +1,45 @@ +definitions: + name: 'My test ELN 2' + sections: + MySection2: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - 
mapping_mode: column + file_mode: current_entry + sections: + - my_sub_section_2 + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + sub_sections: + my_sub_section_2: + section: '#/MySubSection2' + MySubSection2: + m_annotations: + eln: + quantities: + my_array_quantity_1: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 1" + my_array_quantity_2: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_3_row_current-entry_to-path.archive.yaml b/examples/data/docs/tabular-parser_3_row_current-entry_to-path.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..6fe345133787b42a5a262dd9fedfcdaca38c9531 --- /dev/null +++ b/examples/data/docs/tabular-parser_3_row_current-entry_to-path.archive.yaml @@ -0,0 +1,46 @@ +definitions: + name: 'My test ELN 3' + sections: + MySection3: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: row + file_mode: current_entry + sections: + - my_repeated_sub_section_3 + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + sub_sections: + my_repeated_sub_section_3: + repeats: true + section: '#/MySubSection3' + MySubSection3: + m_annotations: + eln: + more: + label_quantity: my_quantity_1 + quantities: + my_quantity_1: + type: str + m_annotations: + tabular: + name: "My header 1" + my_quantity_2: + type: str + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_4_column_single-new-entry_to-path.archive.yaml b/examples/data/docs/tabular-parser_4_column_single-new-entry_to-path.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..2acd3991a678be8bd651095b2acd4eee94417f03 --- /dev/null +++ b/examples/data/docs/tabular-parser_4_column_single-new-entry_to-path.archive.yaml @@ -0,0 +1,55 @@ +definitions: + name: 'My test ELN 4' + sections: + MySection4: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: column + file_mode: single_new_entry + sections: + - my_subsection_4 + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + sub_sections: + my_subsection_4: + section: + m_annotations: + eln: + quantities: + my_ref_quantity: + type: '#/MySubSection4' + m_annotations: + eln: + component: ReferenceEditQuantity + MySubSection4: + base_sections: + - nomad.datamodel.data.EntryData + m_annotations: + eln: + quantities: + my_array_quantity_1: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 1" + my_array_quantity_2: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_5_row_single-new-entry_to-path.archive.yaml b/examples/data/docs/tabular-parser_5_row_single-new-entry_to-path.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..bd5f910dd3d5be679c8e22ba8a83eb2095eff8ad --- /dev/null +++ b/examples/data/docs/tabular-parser_5_row_single-new-entry_to-path.archive.yaml @@ -0,0 
+1,59 @@ +definitions: + name: 'My test ELN 5' + sections: + MySection5: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: row + file_mode: single_new_entry + sections: + - my_subsection_5/my_repeated_sub_section + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + sub_sections: + my_subsection_5: + section: + m_annotations: + eln: + quantities: + my_ref_quantity: + type: '#/MySubSection5' + m_annotations: + eln: + component: ReferenceEditQuantity + MySubSection5: + base_sections: + - nomad.datamodel.data.EntryData + m_annotations: + eln: + more: + label_quantity: my_quantity_1 + sub_sections: + my_repeated_sub_section: + repeats: true + section: + quantities: + my_quantity_1: + type: str + m_annotations: + tabular: + name: "My header 1" + my_quantity_2: + type: str + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_6_row_multiple-new-entries_to-root.archive.yaml b/examples/data/docs/tabular-parser_6_row_multiple-new-entries_to-root.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..d90e5a3910383e4579047a59e3443a36bbec429e --- /dev/null +++ b/examples/data/docs/tabular-parser_6_row_multiple-new-entries_to-root.archive.yaml @@ -0,0 +1,38 @@ +definitions: + name: 'My test ELN 6' + sections: + MySection6: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + more: + label_quantity: my_quantity_1 + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: row + file_mode: multiple_new_entries + sections: + - '#root' + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + my_quantity_1: + type: str + m_annotations: + tabular: + name: "My header 1" + my_quantity_2: + type: str + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_7_row_multiple-new-entries_to-path.archive.yaml b/examples/data/docs/tabular-parser_7_row_multiple-new-entries_to-path.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..4fa7ac96f330584dc9754ef31c19ce5123e5416c --- /dev/null +++ b/examples/data/docs/tabular-parser_7_row_multiple-new-entries_to-path.archive.yaml @@ -0,0 +1,56 @@ +definitions: + name: 'My test ELN 7' + sections: + MySection7: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: row + file_mode: multiple_new_entries + sections: + - my_repeated_sub_section_7 + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + sub_sections: + my_repeated_sub_section_7: + repeats: true + section: + m_annotations: + eln: + quantities: + my_ref_quantity: + type: '#/MySubSection7' + m_annotations: + eln: + component: ReferenceEditQuantity + MySubSection7: + base_sections: + - nomad.datamodel.data.EntryData + m_annotations: + eln: + more: + label_quantity: my_quantity_1 + quantities: + my_quantity_1: + type: str + m_annotations: + tabular: + name: "My 
header 1" + my_quantity_2: + type: str + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_8_row_current-entry_to-path_subsubsection.archive.yaml b/examples/data/docs/tabular-parser_8_row_current-entry_to-path_subsubsection.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..23b7b67891f593a833bdcbb311e6c3e5fee0971f --- /dev/null +++ b/examples/data/docs/tabular-parser_8_row_current-entry_to-path_subsubsection.archive.yaml @@ -0,0 +1,53 @@ +definitions: + name: 'My test ELN 8' + sections: + MySection8: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: row + file_mode: current_entry + sections: + - my_repeated_sub_section_8 + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + sub_sections: + my_repeated_sub_section_8: + repeats: true + section: '#/MySubSection8' + MySubSection8: + m_annotations: + eln: + more: + label_quantity: my_quantity_1 + quantities: + my_quantity_1: + type: str + m_annotations: + tabular: + name: "My header 1" + sub_sections: + my_repeated_sub_sub_section: + repeats: true + section: + more: + label_quantity: my_quantity_2 + quantities: + my_quantity_2: + type: str + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_np1_row_current-entry_to-root.archive.yaml b/examples/data/docs/tabular-parser_np1_row_current-entry_to-root.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..2d75bbb9f074e844291c654597b885521077a261 --- /dev/null +++ b/examples/data/docs/tabular-parser_np1_row_current-entry_to-root.archive.yaml @@ -0,0 +1,40 @@ +definitions: + name: 'My test ELN np1' + sections: + MySection: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: row + file_mode: current_entry + sections: + - '#root' + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + my_array_quantity_1: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 1" + my_array_quantity_2: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 2" +data: + m_def: MySection \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_np2_column_single-new-entry_to-root.archive.yaml b/examples/data/docs/tabular-parser_np2_column_single-new-entry_to-root.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..a4221f146c2f312b9a1735a891b48034cb36ad27 --- /dev/null +++ b/examples/data/docs/tabular-parser_np2_column_single-new-entry_to-root.archive.yaml @@ -0,0 +1,38 @@ +definitions: + name: 'My test ELN np2' + sections: + MySection: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: column + file_mode: single_new_entry + sections: + - '#root' + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + 
my_array_quantity_1: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 1" + my_array_quantity_2: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_np3_row_single-new-entry_to-root.archive.yaml b/examples/data/docs/tabular-parser_np3_row_single-new-entry_to-root.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..26c7c14431c3e5f456ac894816540d7044f1153c --- /dev/null +++ b/examples/data/docs/tabular-parser_np3_row_single-new-entry_to-root.archive.yaml @@ -0,0 +1,36 @@ +definitions: + name: 'My test ELN np3' + sections: + MySection: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: row + file_mode: single_new_entry + sections: + - '#root' + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + my_quantity_1: + type: str + m_annotations: + tabular: + name: "My header 1" + my_quantity_2: + type: str + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_np4_column_multiple-new-entries_to-root.archive.yaml b/examples/data/docs/tabular-parser_np4_column_multiple-new-entries_to-root.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..25b08d0ec343ba2946e5a6c7a1c58324aae600db --- /dev/null +++ b/examples/data/docs/tabular-parser_np4_column_multiple-new-entries_to-root.archive.yaml @@ -0,0 +1,38 @@ +definitions: + name: 'My test ELN np4' + sections: + MySection: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: column + file_mode: multiple_new_entries + sections: + - '#root' + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + my_array_quantity_1: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 1" + my_array_quantity_2: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/tabular-parser_np5_column_multiple-new-entries_to-path.archive.yaml b/examples/data/docs/tabular-parser_np5_column_multiple-new-entries_to-path.archive.yaml new file mode 100644 index 0000000000000000000000000000000000000000..61a111b700eecdee11a95f268e45c1680f4a9387 --- /dev/null +++ b/examples/data/docs/tabular-parser_np5_column_multiple-new-entries_to-path.archive.yaml @@ -0,0 +1,56 @@ +definitions: + name: 'My test ELN np5' + sections: + MySection: + base_sections: + - nomad.datamodel.data.EntryData + - nomad.parsing.tabular.TableData + m_annotations: + eln: + quantities: + data_file: + type: str + default: test.xlsx + m_annotations: + tabular_parser: + parsing_options: + comment: '#' + mapping_options: + - mapping_mode: column + file_mode: multiple_new_entries + sections: + - my_repeated_sub_section + browser: + adaptor: RawFileAdaptor + eln: + component: FileEditQuantity + sub_sections: + my_repeated_sub_section: + repeats: true + section: + m_annotations: + eln: + quantities: + my_ref_quantity: + type: '#/MySubSect' + m_annotations: + eln: + component: ReferenceEditQuantity + 
MySubSect: + base_sections: + - nomad.datamodel.data.EntryData + m_annotations: + eln: + quantities: + my_array_quantity_1: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 1" + my_array_quantity_2: + type: str + shape: ['*'] + m_annotations: + tabular: + name: "My header 2" \ No newline at end of file diff --git a/examples/data/docs/test.xlsx b/examples/data/docs/test.xlsx new file mode 100644 index 0000000000000000000000000000000000000000..9d5e536a3181ec30673abbaaa08c7e7ca8e8da74 Binary files /dev/null and b/examples/data/docs/test.xlsx differ diff --git a/examples/data/light_eln/schema.archive.yaml b/examples/data/light_eln/schema.archive.yaml index 6214e83cab43d7f79b49325b8cb9102d6ac58bed..4265c785cdc7e097de1959b0cb56c8c8b409e58a 100644 --- a/examples/data/light_eln/schema.archive.yaml +++ b/examples/data/light_eln/schema.archive.yaml @@ -54,7 +54,7 @@ definitions: base_section: nomad.datamodel.metainfo.eln.Process quantities: instrument: - type: Instrument + type: '#/Instrument' m_annotations: eln: component: ReferenceEditQuantity @@ -65,8 +65,9 @@ definitions: template: processes: pvd_evaporation: {} + eln: base_sections: - - 'nomad.datamodel.metainfo.eln.Sample' + - 'nomad.datamodel.metainfo.basesections.CompositeSystem' - 'nomad.datamodel.data.EntryData' quantities: name: @@ -88,7 +89,7 @@ definitions: eln: component: AutocompleteEditQuantity # Allows to edit enums with an auto complete text form field chemicals: - type: Chemical # Types can also be other sections. This allows to reference a different section. + type: '#/Chemical' # Types can also be other sections. This allows to reference a different section. shape: ['*'] m_annotations: eln: @@ -154,8 +155,14 @@ definitions: # quantities in this section (and sub_section) with the column # data of .csv or .xlsx files. There is also a mode option that by default, is set to column. 
tabular_parser: - sep: '\t' - comment: '#' + parsing_options: + comment: '#' + sep: '\t' + mapping_options: + - mapping_mode: column + file_mode: current_entry + sections: + - '#root' browser: adaptor: RawFileAdaptor # Allows to navigate to files in the data browser eln: @@ -175,7 +182,8 @@ definitions: unit: mbar m_annotations: eln: - defaultDisplayUnit: mbar + # component: NumberEditQuantity + # defaultDisplayUnit: mbar ## MUST NOT BE AN ARRAY FOR THIS https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/932 tabular: name: Vacuum Pressure1 plot: diff --git a/gui/tests/artifacts.js b/gui/tests/artifacts.js index fee74ce10e84f3fd4c0394cad75cd23fd267f197..44f78a6e9d66a69f322851c58a8df312b9eda448 100644 --- a/gui/tests/artifacts.js +++ b/gui/tests/artifacts.js @@ -62367,55 +62367,6 @@ window.nomadArtifacts = { "m_def": "nomad.metainfo.metainfo.Section", "m_parent_index": 0, "m_parent_sub_section": "section_definitions", - "name": "TableRow", - "description": "Represents the data in one row of a table.", - "base_sections": [ - "/packages/14/section_definitions/1" - ], - "quantities": [ - { - "m_def": "nomad.metainfo.metainfo.Quantity", - "m_parent_index": 0, - "m_parent_sub_section": "quantities", - "name": "table_ref", - "description": "A reference to the table that this row is contained in.", - "type": { - "type_kind": "reference", - "type_data": "/packages/13/section_definitions/1" - } - } - ] - }, - { - "m_def": "nomad.metainfo.metainfo.Section", - "m_parent_index": 1, - "m_parent_sub_section": "section_definitions", - "name": "Table", - "description": "Represents a table with many rows.", - "base_sections": [ - "/packages/14/section_definitions/1" - ], - "quantities": [ - { - "m_def": "nomad.metainfo.metainfo.Quantity", - "m_parent_index": 0, - "m_parent_sub_section": "quantities", - "name": "row_refs", - "description": "References that connect to each row. 
Each row is stored in it individual entry.", - "type": { - "type_kind": "reference", - "type_data": "/packages/13/section_definitions/0" - }, - "shape": [ - "*" - ] - } - ] - }, - { - "m_def": "nomad.metainfo.metainfo.Section", - "m_parent_index": 2, - "m_parent_sub_section": "section_definitions", "name": "TableData", "description": "", "base_sections": [ diff --git a/mkdocs.yml b/mkdocs.yml index 681e00384cd6d055c7b58f1059c24e823e9c9393..3ca7ae09d0ce3d391dce21df22ee122b26067352 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -4,18 +4,25 @@ site_description: | site_author: The NOMAD Authors nav: - Home: index.md - - Tutorial: tutorial.md + - Tutorial: + - Uploading and publishing data: tutorial/upload_publish.md + - Exploring data on: tutorial/explore.md + - Access data via API: tutorial/access_api.md + - Built in schemas: tutorial/builtin.md + - Custom schemas: tutorial/custom.md + - Plugins: tutorial/plugins.md + - Third-party integration: tutorial/third_party.md - How-to guides: - Data Management: - How to upload/publish data: data/upload.md - How to use ELNs: data/eln.md - How to explore data: data/explore.md - How to use NORTH: data/north.md - - Schemas: + - Customize Schemas: - How to write a schema: schemas/basics.md - How to define ELNs: schemas/elns.md - How to use base sections: schemas/base_sections.md - - How to define tabular data: schemas/tabular.md + - How to use tabular parser: schemas/tabular.md - How to define workflows: schemas/workflows.md - How to reference hdf5: schemas/hdf5.md - Programming interfaces: diff --git a/nomad/datamodel/metainfo/annotations.py b/nomad/datamodel/metainfo/annotations.py index a015178c3b2de22913bfa86d9a49cc1001ee19d7..efcaf7b0b3aa500f5ed8ce68d084e5912fecf401 100644 --- a/nomad/datamodel/metainfo/annotations.py +++ b/nomad/datamodel/metainfo/annotations.py @@ -332,11 +332,10 @@ class BrowserAnnotation(AnnotationModel): class TabularMode(str, Enum): row = 'row' column = 'column' - entry = 'entry' class TabularParsingOptions(BaseModel): - skiprows: int = Field(None, description='Number of rows to skip') + skiprows: Union[List[int], int] = Field(None, description='Number of rows to skip') sep: str = Field(None, description='Character identifier of a separator') comment: str = Field(None, description='Character identifier of a commented line') separator: str = Field(None, description='Alias for `sep`') @@ -381,9 +380,6 @@ class TabularMappingOptions(BaseModel): `single_new_entry`: Creating a new entry and processing the data into this new NOMAD entry.<br/> `multiple_new_entries`: Creating many new entries and processing the data into these new NOMAD entries.<br/> ''') - with_file: bool = Field(False, description=''' - A boolean variable to set creation of new file(s) for each new entry. By default it is set to false - ''') sections: List[str] = Field(None, description=''' A `list` of paths to the (sub)sections where the tabular quantities are to be filled from the data extracted from the tabular file. @@ -419,14 +415,13 @@ class TabularParserAnnotation(AnnotationModel): nested sub-sections. The targeted sub-sections, will be considered when mapping table rows to quantities. 
Has to be used to annotate the quantity that holds the path to the `.csv` or excel file.<br/> `file_mode`: The character used to separate cells (specific to csv files).<br/> - `with_file`: A boolean variable to dump the processed/parsed data into a ascii-formatted YAML/JSON file.<br/> `sections`: The character denoting the commented lines.<br/> ''') class TabularAnnotation(AnnotationModel): ''' - Allows to map a quantity to a row of a tabular data-file. Should only be used + Allows to map a quantity to a row or a column of a spreadsheet data-file. Should only be used in conjunction with `tabular_parser`. ''' diff --git a/nomad/parsing/tabular.py b/nomad/parsing/tabular.py index 9f525b739d2a4e879f1b76d8f231a2f8e2c54ac8..60bbdcc3bf31d86681c7ef1d33d89b205dae35b9 100644 --- a/nomad/parsing/tabular.py +++ b/nomad/parsing/tabular.py @@ -16,8 +16,7 @@ # limitations under the License. # import os -import time -from typing import List, Dict, Callable, Set, Any, Tuple, Iterator, Union, Iterable, cast +from typing import List, Dict, Callable, Set, Any, Tuple, Iterator, Union, Iterable import pandas as pd from memoization import cached @@ -30,11 +29,9 @@ import yaml from nomad import utils from nomad.parsing import MatchingParser from nomad.units import ureg -from nomad.datamodel import EntryArchive, EntryMetadata -from nomad.datamodel.data import ArchiveSection, EntryData +from nomad.datamodel.data import ArchiveSection from nomad.metainfo import Section, Quantity, Package, Reference, MSection, Property -from nomad.metainfo.metainfo import MetainfoError, SubSection, MProxy, SectionProxy -from nomad.datamodel.context import Context +from nomad.metainfo.metainfo import MetainfoError, SubSection, MProxy from nomad.datamodel.metainfo.annotations import TabularAnnotation, TabularParserAnnotation, TabularFileModeEnum, \ TabularMode from nomad.metainfo.util import MSubSectionList @@ -47,7 +44,12 @@ from nomad.utils import generate_entry_id m_package = Package() -def create_archive(entry_dict, context, file_name, file_type): +root_mapping = { + 'root': '#root' +} + + +def create_archive(entry_dict, context, file_name, file_type, logger): if not context.raw_path_exists(file_name): with context.raw_file(file_name, 'w') as outfile: if file_type == 'json': @@ -55,6 +57,10 @@ def create_archive(entry_dict, context, file_name, file_type): elif file_type == 'yaml': yaml.dump(entry_dict, outfile) context.upload.process_updated_raw_file(file_name, allow_modify=True) + else: + logger.error( + f'{file_name} archive file already exists.' + f'If you intend to reprocess the older archive file, remove the existing one and run reprocessing again.') def traverse_to_target_data_file(section, path_list: List[str]): @@ -68,20 +74,6 @@ def traverse_to_target_data_file(section, path_list: List[str]): raise MetainfoError(f'The path {temp} in path_to_data_file does not exist') -class TableRow(EntryData): - ''' Represents the data in one row of a table. ''' - table_ref = Quantity( - type=Reference(SectionProxy('Table')), - description='A reference to the table that this row is contained in.') - - -class Table(EntryData): - ''' Represents a table with many rows. ''' - row_refs = Quantity( - type=Reference(TableRow.m_def), shape=['*'], - description='References that connect to each row. Each row is stored in it individual entry.') - - class TabularParserError(Exception): ''' Tabular-parser related errors. 
''' pass @@ -106,7 +98,9 @@ class TableData(ArchiveSection): for quantity_def in self.m_def.all_quantities.values(): annotation = quantity_def.m_get_annotations('tabular_parser') annotation = annotation[0] if isinstance(annotation, list) else annotation - if annotation: + # this normalizer potentially creates new archives,.to avoid recursive call of the normalizer created by + # this normalizer, we check for if the parent is already parsed + if annotation and not archive.data.m_parent.metadata.entry_name: self.tabular_parser(quantity_def, archive, logger, annotation) def tabular_parser(self, quantity_def: Quantity, archive, logger, annotation: TabularParserAnnotation): @@ -138,30 +132,6 @@ class TableData(ArchiveSection): with archive.m_context.raw_file(data_file) as f: data = read_table_data(data_file, f, **parsing_options) - # Checking for any quantities in the root level of the TableData that is - # supposed to be filled from the excel file - for quantity_name, quantity in self.m_def.all_properties.items(): - if isinstance(quantity, Quantity) and getattr(self, quantity_name) is None and \ - quantity.m_get_annotations('tabular') is not None: - col_data = quantity.m_get_annotations('tabular').name - if '/' in col_data: - # extract the sheet & col names if there is a '/' in the 'name' - sheet_name, col_name = col_data.split('/') - if sheet_name not in list(data): - continue - try: - df = pd.DataFrame.from_dict(data.loc[0, sheet_name]) - self.m_set(quantity, np.array(df.loc[:, col_name])) - except Exception: - continue - else: - # Otherwise, assume the sheet_name is the first sheet of Excel/csv - try: - df = pd.DataFrame.from_dict(data.iloc[0, 0]) - self.m_set(quantity, np.array(df.loc[:, col_data])) - except Exception: - continue - mapping_options = annotation.mapping_options if mapping_options: for mapping_option in mapping_options: @@ -170,72 +140,56 @@ class TableData(ArchiveSection): mapping_mode = mapping_option.mapping_mode column_sections = mapping_option.sections if mapping_mode == TabularMode.column else None row_sections = mapping_option.sections if mapping_mode == TabularMode.row else None - entry_sections = mapping_option.sections if mapping_mode == TabularMode.entry else None - with_file = mapping_option.with_file except Exception: - raise TabularParserError("Couldn't extract the list of mapping_options. Double-check the mapping_options") + raise TabularParserError( + "Couldn't extract the list of mapping_options. 
Double-check the mapping_options") if file_mode == TabularFileModeEnum.current_entry: + # Checking for any quantities in the root level of the TableData that is + # supposed to be filled from the excel file + for quantity_name, quantity in self.m_def.all_properties.items(): + if isinstance(quantity, Quantity) and getattr(self, quantity_name) is None and \ + quantity.m_get_annotations('tabular') is not None: + col_data = quantity.m_get_annotations('tabular').name + if '/' in col_data: + # extract the sheet & col names if there is a '/' in the 'name' + sheet_name, col_name = col_data.split('/') + if sheet_name not in list(data): + continue + try: + df = pd.DataFrame.from_dict(data.loc[0, sheet_name]) + self.m_set(quantity, np.array(df.loc[:, col_name])) + except Exception: + continue + else: + # Otherwise, assume the sheet_name is the first sheet of Excel/csv + try: + df = pd.DataFrame.from_dict(data.iloc[0, 0]) + self.m_set(quantity, np.array(df.loc[:, col_data])) + except Exception: + continue if column_sections: _parse_column_mode(self, column_sections, data, logger=logger) if row_sections: _parse_row_mode(self, row_sections, data, logger) if file_mode == TabularFileModeEnum.multiple_new_entries: - for entry_section in entry_sections: - if entry_section == 'root': - self._parse_entry_mode(data, self.m_def, archive, with_file, mode='root', logger=logger) + for row_section in row_sections: + if row_section == root_mapping['root']: + self._parse_entry_mode(data, self.m_def, archive, is_root=root_mapping['root'], logger=logger) else: - entry_section_list = entry_section.split('/') + entry_section_list = row_section.split('/') entry_section_instance = create_subsection( self.m_def.all_properties[entry_section_list.pop(0)], entry_section_list) - self._parse_entry_mode(data, entry_section_instance, archive, with_file, logger=logger) + self._parse_entry_mode(data, entry_section_instance, archive, logger=logger) if file_mode == TabularFileModeEnum.single_new_entry: - section = self.m_def.all_properties[row_sections[0].split('/')[0]].sub_section.section_cls() if column_sections: - parse_columns(data, section) + self._parse_single_new_entry(parse_columns, data, column_sections, archive, logger) if row_sections: - # If there is a match, then remove the matched sections from row_sections so the main entry - # does not populate the matched row_section - is_quantity_def = False - for quantity_def in section.m_def.all_quantities.values(): - if isinstance(quantity_def.type, Reference): - try: - section_to_entry = quantity_def.type.target_section_def.section_cls() - is_quantity_def = True - except AttributeError: - continue - if not is_quantity_def: - raise TabularParserError( - f"No reference quantity is defined in {row_sections[0].split('/')[0]} section.") - matched_rows = [re.sub(r"^.*?\/", "", row) for row in row_sections] - _parse_row_mode(section_to_entry, matched_rows, data, logger) - - child_archive = EntryArchive( - data=section_to_entry, - m_context=archive.m_context, - metadata=EntryMetadata(upload_id=archive.m_context.upload_id, entry_name=section.m_def.name)) - - filename = f'{section.m_def.name}.archive.yaml' - create_archive(child_archive.m_to_dict(), archive.m_context, filename, 'yaml') - - child_entry_id = generate_entry_id(archive.m_context.upload_id, filename, None) - entry_id_dct = {filename: child_entry_id} - ref_quantity_proxy = MProxy( - m_proxy_value=f'../upload/archive/{child_entry_id}#/data', - m_proxy_context=self.m_context) - section.m_set(quantity_def, ref_quantity_proxy) - - 
if hasattr(self, row_sections[0].split('/')[0]): - setattr(self, row_sections[0].split('/')[0], None) - self.m_add_sub_section( - self.m_def.all_properties[row_sections[0].split('/')[0]], section, -1) - - if not with_file: - _delete_entry_file(entry_id_dct, archive, logger=logger) + self._parse_single_new_entry(_parse_row_mode, data, row_sections, archive, logger) else: parse_columns(data, self) @@ -248,10 +202,63 @@ class TableData(ArchiveSection): except AttributeError: self.fill_archive_from_datafile = False - def _parse_entry_mode(self, data, subsection_def, archive, with_file, mode=None, logger=None): + def _parse_single_new_entry(self, parser, data, section_list, archive, logger): + for single_entry_section in section_list: + target_section_str = single_entry_section.split('/')[0] if '/' in single_entry_section else single_entry_section + if target_section_str == root_mapping['root']: + target_section = self.m_def.section_cls() + section_to_entry = target_section + elif target_section_str in single_entry_section: + target_section = self.m_def.all_properties[target_section_str].sub_section.section_cls() + section_to_entry = target_section + is_quantity_def = False + for quantity_def in target_section.m_def.all_quantities.values(): + if isinstance(quantity_def.type, Reference): + try: + section_to_entry = quantity_def.type.target_section_def.section_cls() + is_quantity_def = True + except AttributeError: + continue + if not is_quantity_def: + pass + # raise TabularParserError( + # f"To create a new entry from {target_section_str}, it should be of type Reference.") + if parser.__code__.co_argcount == 2: + parser(data, section_to_entry) + else: + # If there is a match, then remove the matched sections from row_sections so the main entry + # does not populate the matched row_section + matched_rows = [re.sub(r"^.*?\/", "", single_entry_section)] + parser(section_to_entry, matched_rows, data, logger) + entry_name = set_entry_name(quantity_def, target_section, 0) + + from nomad.datamodel import EntryArchive, EntryMetadata + + child_archive = EntryArchive( + data=section_to_entry, + m_context=archive.m_context, + metadata=EntryMetadata(upload_id=archive.m_context.upload_id, entry_name=entry_name)) + + filename = f'{target_section.m_def.name}.archive.yaml' + create_archive(child_archive.m_to_dict(), archive.m_context, filename, 'yaml', logger) + + child_entry_id = generate_entry_id(archive.m_context.upload_id, filename, None) + + if is_quantity_def: + ref_quantity_proxy = MProxy( + m_proxy_value=f'../upload/archive/{child_entry_id}#/data', + m_proxy_context=self.m_context) + target_section.m_set(quantity_def, ref_quantity_proxy) + + if hasattr(self, single_entry_section.split('/')[0]): + setattr(self, single_entry_section.split('/')[0], None) + self.m_add_sub_section( + self.m_def.all_properties[single_entry_section.split('/')[0]], target_section, -1) + + def _parse_entry_mode(self, data, subsection_def, archive, is_root=False, logger=None): section = None is_referenced_section = False - if mode: + if is_root: section = subsection_def quantity_def = subsection_def child_sections = parse_table(data, subsection_def, logger=logger) @@ -261,6 +268,7 @@ class TableData(ArchiveSection): try: section = quantity_def.type.target_section_def.section_cls is_referenced_section = True + break except AttributeError: continue if not section: @@ -286,17 +294,23 @@ class TableData(ArchiveSection): except AttributeError: pass - entry_id_dct: Dict[str, str] = {} + # creating new entries for each new 
child_archives + from nomad.datamodel import EntryArchive, EntryMetadata + + # if mode is #root when creating multiple new entries, then append the first child to the current entry, + # and create new ones from second child onwards + if is_root: + first_child = child_sections.pop(0) + first_child_entry_name = archive.metadata.mainfile.split('.archive') + self.m_update_from_dict(first_child.m_to_dict()) + for index, child_section in enumerate(child_sections): filename = f"{mainfile_name}_{index}.entry_data.archive.{file_type}" - try: - entry_name: str = quantity_def.m_get_annotations('entry_name', None) - if entry_name.startswith('#'): - entry_name = f"{getattr(child_section, entry_name.split('/')[-1])}_{index}" - else: - entry_name = f"{quantity_def.m_get_annotations('entry_name', None)}_{index}" - except Exception: - entry_name = f"{quantity_def.name}_{index}" + if is_root: + entry_name: str = f'{first_child_entry_name[0]}_{index + 1}.archive{first_child_entry_name[1]}' + else: + entry_name: str = set_entry_name(quantity_def, child_section, index) + try: child_archive = EntryArchive( data=child_section, @@ -304,11 +318,10 @@ class TableData(ArchiveSection): metadata=EntryMetadata(upload_id=archive.m_context.upload_id, entry_name=entry_name)) except Exception: raise TabularParserError('New entries could not be generated.') - create_archive(child_archive.m_to_dict(), archive.m_context, filename, file_type) + create_archive(child_archive.m_to_dict(), archive.m_context, filename, file_type, logger=logger) if is_referenced_section: child_entry_id = generate_entry_id(archive.m_context.upload_id, filename, None) - entry_id_dct.update({filename: child_entry_id}) ref_quantity_proxy = MProxy( m_proxy_value=f'../upload/archive/{child_entry_id}#/data', m_proxy_context=self.m_context) @@ -317,41 +330,33 @@ class TableData(ArchiveSection): self.m_add_sub_section(subsection_def, section_ref, -1) - if not with_file: - _delete_entry_file(entry_id_dct, archive, logger=logger) - m_package.__init_metainfo__() +def set_entry_name(quantity_def, child_section, index) -> str: + if name := child_section.m_def.more.get('label_quantity', None): + entry_name = f"{child_section[name]}_{index}" + elif isinstance(quantity_def.type, Reference): + entry_name = f"{quantity_def.type._target_section_def.name}_{index}" + else: + entry_name = f"{quantity_def.name}_{index}" + return entry_name + + def _parse_column_mode(main_section, list_of_columns, data, logger=None): for column_section in list_of_columns: - try: - column_section_list = column_section.split('/') - section = create_subsection( - main_section.m_def.all_properties[column_section_list.pop(0)], - column_section_list).sub_section.section_cls() - except Exception: - logger.error( - f'{column_section} sub_section does not exist. 
There might be a problem in schema definition') - parse_columns(data, section) - setattr(main_section, column_section, section) - - -def _delete_entry_file(entry_id_dct, archive, logger=None): - from nomad.processing import Entry, ProcessStatus - while entry_id_dct: - tmp: Dict[str, str] = {} - for filename, entry_id in entry_id_dct.items(): - if Entry.objects(entry_id=entry_id).first().process_status == ProcessStatus.RUNNING: - tmp.update({filename: entry_id}) - else: - try: - os.remove(os.path.join(archive.m_context.upload.upload_files.external_os_path, 'raw', filename)) - except Exception: - logger.warning(f'Failed to remove archive {filename}.') - entry_id_dct = tmp - time.sleep(.5) + if column_section != root_mapping['root']: + try: + column_section_list = column_section.split('/') + section = create_subsection( + main_section.m_def.all_properties[column_section_list.pop(0)], + column_section_list).sub_section.section_cls() + except Exception: + logger.error( + f'{column_section} sub_section does not exist. There might be a problem in schema definition') + parse_columns(data, section) + setattr(main_section, column_section, section) def append_section_to_subsection(main_section, section_name: str, source_section: MSection): @@ -379,8 +384,11 @@ def _parse_row_mode(main_section, row_sections, data, logger): for section_name in section_names: section_name_list = section_name.split('/') section_name_str = section_name_list[0] - target_sub_section = main_section.m_def.all_properties[section_name_str] - section_def = target_sub_section.sub_section + try: + target_sub_section = main_section.m_def.all_properties[section_name_str] + section_def = target_sub_section.sub_section + except Exception: + raise TabularParserError('row-mode failed to parse the list of subsections') if not list_of_visited_sections.count(section_name_str): list_of_visited_sections.append(section_name_str) @@ -425,68 +433,63 @@ def _create_column_to_quantity_mapping(section_def: Section): annotation = annotation[0] if isinstance(annotation, list) else annotation if annotation and annotation.name: col_name = annotation.name - else: - col_name = quantity.name - if len(path) > 0: - col_name = f'{".".join([item[0].name for item in path])}.{col_name}' - if col_name in mapping: - raise MetainfoError( - f'The schema has non unique column names. {col_name} exists twice. ' - f'Column names must be unique, to be used for tabular parsing.') + if col_name in mapping: + raise MetainfoError( + f'The schema has non unique column names. {col_name} exists twice. ' + f'Column names must be unique, to be used for tabular parsing.') - def set_value( - section: MSection, value, section_path_to_top_subsection=[], path=path, quantity=quantity, - annotation: TabularAnnotation = annotation): + def set_value( + section: MSection, value, section_path_to_top_subsection=[], path=path, quantity=quantity, + annotation: TabularAnnotation = annotation): - for sub_section, section_def in path: - next_section = None - try: - next_section = section.m_get_sub_section(sub_section, -1) - except (KeyError, IndexError): - pass - if not next_section: - next_section = section_def.section_cls() - section.m_add_sub_section(sub_section, next_section, -1) - section = next_section - - if annotation and annotation.unit: - value *= ureg(annotation.unit) - - # NaN values are not supported in the metainfo. Set as None - # which means that they are not stored. 
+        try:
+            target_sub_section = main_section.m_def.all_properties[section_name_str]
+            section_def = target_sub_section.sub_section
+        except Exception:
+            raise TabularParserError('Row mode failed to parse the list of sub-sections.')
 
         if not list_of_visited_sections.count(section_name_str):
             list_of_visited_sections.append(section_name_str)
@@ -425,68 +433,63 @@ def _create_column_to_quantity_mapping(section_def: Section):
             annotation = annotation[0] if isinstance(annotation, list) else annotation
             if annotation and annotation.name:
                 col_name = annotation.name
-            else:
-                col_name = quantity.name
-                if len(path) > 0:
-                    col_name = f'{".".join([item[0].name for item in path])}.{col_name}'
-            if col_name in mapping:
-                raise MetainfoError(
-                    f'The schema has non unique column names. {col_name} exists twice. '
-                    f'Column names must be unique, to be used for tabular parsing.')
+                if col_name in mapping:
+                    raise MetainfoError(
+                        f'The schema has non-unique column names. {col_name} exists twice. '
+                        f'Column names must be unique to be used for tabular parsing.')
 
-            def set_value(
-                    section: MSection, value, section_path_to_top_subsection=[], path=path, quantity=quantity,
-                    annotation: TabularAnnotation = annotation):
+                def set_value(
+                        section: MSection, value, section_path_to_top_subsection=[], path=path, quantity=quantity,
+                        annotation: TabularAnnotation = annotation):
 
-                for sub_section, section_def in path:
-                    next_section = None
-                    try:
-                        next_section = section.m_get_sub_section(sub_section, -1)
-                    except (KeyError, IndexError):
-                        pass
-                    if not next_section:
-                        next_section = section_def.section_cls()
-                        section.m_add_sub_section(sub_section, next_section, -1)
-                    section = next_section
-
-                if annotation and annotation.unit:
-                    value *= ureg(annotation.unit)
-
-                # NaN values are not supported in the metainfo. Set as None
-                # which means that they are not stored.
-                if isinstance(value, float) and math.isnan(value):
-                    value = None
-
-                if isinstance(value, (int, float, str, pd.Timestamp)):
-                    value = np.array([value])
-
-                if value is not None:
-                    if len(value.shape) == 1 and len(quantity.shape) == 0:
-                        if len(value) == 1:
-                            value = value[0]
-                        elif len(value) == 0:
-                            value = None
-                        else:
+                    for sub_section, section_def in path:
+                        next_section = None
+                        try:
+                            next_section = section.m_get_sub_section(sub_section, -1)
+                        except (KeyError, IndexError):
+                            pass
+                        if not next_section:
+                            next_section = section_def.section_cls()
+                            section.m_add_sub_section(sub_section, next_section, -1)
+                        section = next_section
+
+                    if annotation and annotation.unit:
+                        value *= ureg(annotation.unit)
+
+                    # NaN values are not supported in the metainfo. Set as None
+                    # which means that they are not stored.
+                    if isinstance(value, float) and math.isnan(value):
+                        value = None
+
+                    if isinstance(value, (int, float, str, pd.Timestamp)):
+                        value = np.array([value])
+
+                    if value is not None:
+                        if len(value.shape) == 1 and len(quantity.shape) == 0:
+                            if len(value) == 1:
+                                value = value[0]
+                            elif len(value) == 0:
+                                value = None
+                            else:
+                                raise MetainfoError(
+                                    f'The shape of {quantity.name} does not match the given data.')
+                        elif len(value.shape) != len(quantity.shape):
                             raise MetainfoError(
                                 f'The shape of {quantity.name} does not match the given data.')
-                    elif len(value.shape) != len(quantity.shape):
-                        raise MetainfoError(
-                            f'The shape of {quantity.name} does not match the given data.')
-
-                section.m_set(quantity, value)
-                _section_path_list: List[str] = list(_get_relative_path(section))
-                _section_path_str: str = '/'.join(_section_path_list)
-                section_path_to_top_subsection.append(_section_path_str)
-            mapping[col_name] = set_value
+                    section.m_set(quantity, value)
+                    _section_path_list: List[str] = list(_get_relative_path(section))
+                    _section_path_str: str = '/'.join(_section_path_list)
+                    section_path_to_top_subsection.append(_section_path_str)
+                mapping[col_name] = set_value
 
         for sub_section in section_def.all_sub_sections.values():
             if sub_section in properties:
                 continue
             next_base_section = sub_section.sub_section
             properties.add(sub_section)
-            for sub_section_section in next_base_section.all_inheriting_sections + [next_base_section]:
-                add_section_def(sub_section_section, path + [(sub_section, sub_section_section,)])
+            add_section_def(next_base_section, path + [(sub_section, next_base_section,)])
 
     add_section_def(section_def, [])
     return mapping
@@ -586,8 +589,10 @@ def parse_table(pd_dataframe, section_def: Section, logger):
                     logger.error(
                         'could not parse cell',
                         details=dict(row=row_index, column=col_name), exc_info=e)
-            if col_index > 0:
+            if col_index > 0 and temp_quantity_path_container[0].split('/')[1:]:
                 path_quantities_to_top_subsection.update(temp_quantity_path_container)
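+            # A repeated column is only valid if it maps into a repeating
+            # sub-section; a plain top-level quantity cannot hold several rows.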
+            elif col_index > 0 and not temp_quantity_path_container[0].split('/')[1:]:
+                raise TabularParserError(
+                    f'There is a repeated column {column} that is not placed into a '
+                    f'subsection in the schema. Please fix the schema.')
         except Exception as e:
             logger.error('could not parse row', details=dict(row=row_index), exc_info=e)
@@ -612,7 +617,11 @@ def _strip_whitespaces_from_df_columns(df):
     transformed_column_names: Dict[str, str] = {}
     for col_name in list(df.columns):
         cleaned_col_name = col_name.strip().split('.')[0]
-        if count := list(transformed_column_names.values()).count(cleaned_col_name):
+        # Substring matching also counts columns already renamed to
+        # f'{cleaned_col_name}.{n}', so each repetition gets the next suffix.
+        count = 0
+        for transformed_name in transformed_column_names.values():
+            if cleaned_col_name in transformed_name:
+                count += 1
+        if count:
             transformed_column_names.update({col_name: f'{cleaned_col_name}.{count}'})
         else:
             transformed_column_names.update({col_name: col_name.strip()})
@@ -636,7 +645,7 @@ def _append_subsections_from_section(section_name: List[str], target_section: MS
 def read_table_data(
         path, file_or_path=None,
-        comment: str = None, sep: str = None, skiprows: int = None, separator: str = None):
+        comment: str = None, sep: str = None, skiprows: Union[List[int], int] = None, separator: str = None):
     import pandas as pd
 
     df = pd.DataFrame()
@@ -709,47 +718,10 @@ class TabularDataParser(MatchingParser):
             data = pd.DataFrame.from_dict(data.iloc[0, 0])
         return [str(item) for item in range(0, data.shape[0])]
 
-    def parse(
-            self, mainfile: str, archive: EntryArchive, logger=None,
-            child_archives: Dict[str, EntryArchive] = None
-    ):
+    def parse(self, logger=None, **kwargs):
         if logger is None:
             logger = utils.get_logger(__name__)
-
-        # We use mainfile to check the files existence in the overall fs,
-        # and archive.metadata.mainfile to get an upload/raw relative schema_file
-        schema_file = self._get_schema(mainfile, archive.metadata.mainfile)
-        if schema_file is None:
-            logger.error('Tabular data file without schema.', details=(
-                'For a tabular file like name.schema.archive.csv, there has to be an '
-                'uploaded schema like schema.archive.yaml'))
-            return
-
-        try:
-            schema_archive = cast(Context, archive.m_context).load_raw_file(
-                schema_file, archive.metadata.upload_id, None)
-            package = schema_archive.definitions
-            section_def = package.section_definitions[0]
-        except Exception as e:
-            logger.error('Could not load schema', exc_info=e)
-            return
-
-        if TableRow.m_def not in section_def.base_sections:
-            logger.error('Schema for tabular data must inherit from TableRow.')
-            return
-
-        annotation: TabularParserAnnotation = section_def.m_get_annotations('tabular_parser')
-        kwargs = annotation.dict(include={'comment', 'sep', 'skiprows'}) if annotation else {}
-        data = read_table_data(mainfile, **kwargs)
-        child_sections = parse_table(data, section_def, logger=logger)
-        assert len(child_archives) == len(child_sections)
-
-        table = Table()
-        archive.data = table
-
-        child_section_refs: List[MSection] = []
-        for child_archive, child_section in zip(child_archives.values(), child_sections):
-            child_archive.data = child_section
-            child_section_refs.append(child_section)
-            child_section.table_ref = table
-        table.row_refs = child_section_refs
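+        # The standalone tabular parser is kept only as a stub; tabular files
+        # are now handled by the TableData base section during normalization.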
+        logger.error('''
+            You are trying to use the legacy tabular parser. It is now encapsulated in the built-in TableData.
+            ''')
+        return
diff --git a/tests/data/parsers/tabular/my_schema.archive.yaml b/tests/data/parsers/tabular/my_schema.archive.yaml
index e7b45871e7018a39f2429efe798061d5b649785b..d73f5ce865754ee0f3f160700931d1f6199370e9 100644
--- a/tests/data/parsers/tabular/my_schema.archive.yaml
+++ b/tests/data/parsers/tabular/my_schema.archive.yaml
@@ -12,11 +12,10 @@ definitions:
           parsing_options:
             comment: '#'
           mapping_options:
-          - mapping_mode: entry
+          - mapping_mode: row
             file_mode: multiple_new_entries
-            with_file: true
             sections:
-            - root
+            - '#root'
       my_quantity_1:
         type: str
         m_annotations:
diff --git a/tests/data/test_examples.py b/tests/data/test_examples.py
index 7420ed14259e12999e3a9c83c663f638dcfe1355..3dde25dfd32449acb93e540fb94190731427b9e8 100644
--- a/tests/data/test_examples.py
+++ b/tests/data/test_examples.py
@@ -65,21 +65,83 @@ def test_sample_tabular(mainfile, assert_xpaths, raw_files, no_warn):
         assert archive.m_xpath(xpath) is not None
 
 
-@pytest.mark.parametrize('test_files', [
+@pytest.mark.parametrize('test_files, number_of_entries', [
     pytest.param([
         'examples/data/docs/tabular-parser-entry-mode.archive.yaml',
         'examples/data/docs/tabular-parser-entry-mode.xlsx'
-    ], id='simple_entry_mode'),
+    ], 5, id='simple_entry_mode'),
     pytest.param([
         'examples/data/docs/tabular-parser-complex.archive.yaml',
         'examples/data/docs/data_file_1.csv',
         'examples/data/docs/data_file_2.csv'
-    ], id='complex_entry_mode')
+    ], 6, id='complex_entry_mode')
 ])
-def test_sample_entry_mode(test_files, mongo, test_user, raw_files, monkeypatch, proc_infra):
+def test_sample_entry_mode(mongo, test_user, raw_files, monkeypatch, proc_infra, test_files, number_of_entries):
     upload = _create_upload('test_upload_id', test_user.user_id, test_files)
     assert upload is not None
-    assert upload.processed_entries_count == 6
+    assert upload.processed_entries_count == number_of_entries
 
     for entry in Entry.objects(upload_id='test_upload_id'):
         assert entry.process_status == ProcessStatus.SUCCESS
+
+
+@pytest.mark.parametrize('test_files, status', [
+    pytest.param([
+        'examples/data/docs/tabular-parser_1_column_current-entry_to-root.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_1'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_2_column_current-entry_to-path.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_2'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_3_row_current-entry_to-path.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_3'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_4_column_single-new-entry_to-path.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_4'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_5_row_single-new-entry_to-path.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_5'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_6_row_multiple-new-entries_to-root.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_6'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_7_row_multiple-new-entries_to-path.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_7'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_8_row_current-entry_to-path_subsubsection.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_8'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_np1_row_current-entry_to-root.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_9'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_np2_column_single-new-entry_to-root.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_10'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_np3_row_single-new-entry_to-root.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_11'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_np4_column_multiple-new-entries_to-root.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_12'),
+    pytest.param([
+        'examples/data/docs/tabular-parser_np5_column_multiple-new-entries_to-path.archive.yaml',
+        'examples/data/docs/test.xlsx'
+    ], ProcessStatus.SUCCESS, id='test_13'),
+])
+def test_tabular_doc_examples(mongo, test_user, raw_files, monkeypatch, proc_infra, test_files, status):
+    upload = _create_upload('test_upload_id', test_user.user_id, test_files)
+    assert upload is not None
+
+    for entry in Entry.objects(upload_id='test_upload_id'):
+        assert entry.process_status == status
diff --git a/tests/parsing/test_tabular.py b/tests/parsing/test_tabular.py
index 19819e28143f94b9ab2de025ea3d6b505221e3db..224d906e23ac1996e6c3faf32edac20aef8b0bad 100644
--- a/tests/parsing/test_tabular.py
+++ b/tests/parsing/test_tabular.py
@@ -310,7 +310,7 @@ def test_tabular_entry_mode(mongo, test_user, raw_files, monkeypatch, proc_infra
     upload.process_upload()
     upload.block_until_complete()
     assert upload is not None
-    assert upload.processed_entries_count == 2
+    assert upload.processed_entries_count == 1
 
     for entry in Entry.objects(upload_id='test_upload_id'):
         assert entry.process_status == ProcessStatus.SUCCESS
@@ -477,10 +477,21 @@ def test_tabular_row_mode(raw_files, monkeypatch, test_case, section_placeholder
               tabular_parser:
                 parsing_options:
                   comment: '#'
+                mapping_options:
+                - mapping_mode: column
+                  file_mode: current_entry
+                  sections:
+                  - '#root'
           header_0:
+            m_annotations:
+              tabular:
+                name: header_0
             type: str
           header_1:
             type: str
+            m_annotations:
+              tabular:
+                name: header_1
       data:
         m_def: MyTable
         data_file: test.my_schema.archive.csv
@@ -651,7 +662,7 @@
         strip('''
             header_0,header_1
            a,b
-            '''), id='checkoing checkbox'
+            '''), id='checking checkbox'
        ),
        pytest.param(
            strip('''
@@ -722,7 +733,7 @@ def test_tabular_checkbox(raw_files, monkeypatch, schema, content):
     assert main_archive.data.fill_archive_from_datafile is True
     main_archive.data.MySubsection[0].header_0 = 'c'
     run_normalize(main_archive)
-    assert main_archive.data.MySubsection[0].header_0 == 'a'
+    assert main_archive.data.MySubsection[0].header_0 == 'c'
 
 
 def get_files(schema=None, content=None):