diff --git a/docs/schema/elns.md b/docs/schema/elns.md
index afeb1296ff7058923fd74afc061e85cb33b387da..f72abe92fb086f04c38ab9d7f661ca4af906b653 100644
--- a/docs/schema/elns.md
+++ b/docs/schema/elns.md
@@ -38,7 +38,6 @@ NOMAD's upload page:
 --8<-- "examples/data/eln/schema.archive.yaml"
 ```
 
-
 ## ELN Annotations
 
 The `eln` annotations can contain the following keys:
@@ -53,6 +52,69 @@ ELN edit annotations and components [here]({{ nomad_url() }}/../gui/dev/editquantity).
 
 ## Tabular Annotations
 
+To import your data from a `.csv` or `Excel` file, NOMAD provides three distinct (and separate) ways, each of
+which comes with unique options for importing and interacting with your data. To better understand how to use
+NOMAD parsers to import your data, three commented sample schemas are presented below. Each section below also
+follows and extends a general example, explained next. The two main components of any tabular parser schema are
+1) using the correct base section and 2) providing a `data_file` quantity with the correct `m_annotations`
+(except for the entry mode). Please bear in mind that the schema files should 1) follow the NOMAD naming convention
+(i.e. `My_Name.archive.yaml`), and 2) be accompanied by your data file in order for NOMAD to parse them.
+In the examples provided below, an `Excel` file is assumed to contain all the data, as both NOMAD and
+`Excel` support imports and manipulation of multi-sheet data. Note that the `Excel` file name in each schema
+should match the name of the `Excel` data file; if a `.csv` data file is used instead, it can be replaced by the
+`.csv` file name.
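+
+For orientation, the skeleton shared by the column- and row-mode schemas below looks roughly as follows (a
+minimal sketch with placeholder names only; the commented sample schemas in the following sections are the
+complete, working versions):
+
+```yaml
+definitions:
+  sections:
+    My_Section:
+      base_sections:
+        - nomad.parsing.tabular.TableData  # base section that activates the tabular parser
+      quantities:
+        data_file:
+          type: str
+          m_annotations:
+            tabular_parser: {}  # options such as `comment`, `sep` and `mode` go here
+```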
+
+The following sample schema creates one quantity from the entire column of an excel file (`col mode`).
+For example, suppose several rows of an excel sheet contain information about a chemical product (e.g. `purity` in one
+column). In order to list all the purities under the column `purity` and import them into NOMAD, you can use the
+following schema by substituting `My_Quantity` with any name of your choice (e.g. `Purity`),
+`tabular-parser.data.xlsx` with the name of the csv/excel file where the data lies, and `My_Sheet/My_Column` with the
+sheet_name/column_name of your targeted data. The `Tabular_Parser` section name is also arbitrary and can be changed.
+
+Important notes:
+
+- `shape: ['*']` under `My_Quantity` is essential to parse the entire column of the data file.
+- `My_Quantity` can also be defined within another subsection (see the next schema sample).
+
+```yaml
+--8<-- "examples/data/docs/tabular-parser-col-mode.archive.yaml"
+```
+
+The sample schema provided below creates separate instances of a repeated section from each row of an excel file
+(`row mode`). For example, suppose an excel sheet holds information about a chemical product
+(e.g. `name` in one column), and each row contains one entry of that chemical product.
+Since each row is separate from the others, in order to create instances of the same product out of all rows
+and import them into NOMAD, you can use the following schema by substituting `My_Subsection`,
+`My_Section` and `My_Quantity` with any appropriate names (e.g. `Substance`, `Chemical_product`
+and `Name` respectively).
+
+Important notes:
+
+- This schema demonstrates how to import data within a subsection of another subsection, meaning the
+targeted quantity does not necessarily have to go into the main `quantities`.
+- Setting `mode` to `row` signals that for each row in the sheet_name (provided in `My_Quantity`),
+one instance of the corresponding (sub-)section (in this example, the `My_Subsection` sub-section, as it has the
+`repeats` option set to true) will be appended. Please bear in mind that if this mode is selected, all other
+quantities should exist in the same sheet_name.
+
+```yaml
+--8<-- "examples/data/docs/tabular-parser-row-mode.archive.yaml"
+```
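+
+To make the row-mode result concrete: assuming `My_Sheet` contains a `My_Column` with two rows, the populated
+archive would conceptually hold two instances of the repeating section (a sketch of the resulting structure,
+not a literal archive serialization):
+
+```yaml
+data:
+  My_Subsection:
+    My_Section:
+      - My_Quantity: <value from row 1>
+      - My_Quantity: <value from row 2>
+```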
+
+The following sample schema creates one entry for each row of an excel file (`entry mode`).
+For example, suppose an excel sheet holds information about a chemical product (e.g. `name` in one column),
+and each row contains one entry of that chemical product. Since each row is separate from the others, in
+order to create multiple archives of the same product out of all rows and import them into NOMAD, you can use the
+following schema by substituting `My_Quantity` with any appropriate name (e.g. `Name`).
+
+Important note:
+
+- In entry mode, the convention for reading data from a csv/excel file is to provide only the column name; the
+data are assumed to exist in the first sheet.
+
+```yaml
+--8<-- "examples/data/docs/tabular-parser-entry-mode.archive.yaml"
+```
+
 Tabular annotation accepts the following keys:
 
 {{ get_schema_doc('tabular') }}
@@ -64,7 +126,7 @@ Plot annotation is a wrapper for the [plotly](https://plotly.com) library. One can use it to generate plots,
 which can be customized by using plotly commands. See
 [plot examples]({{ nomad_url() }}/../gui/dev/plot).
 
-## Build-in base sections for ELNs
+## Built-in base sections for ELNs
 
 Coming soon ...
 
diff --git a/docs/schema/suggestions.yaml b/docs/schema/suggestions.yaml
index 4e084654d75a99701ed66459f1f963c1e20cbd0a..aa25e364b891587ba344ca8eaeabdf3639c66ce3 100644
--- a/docs/schema/suggestions.yaml
+++ b/docs/schema/suggestions.yaml
@@ -4,9 +4,13 @@ m_annotations:
   plot: "plot annotations"
 
 tabular:
-  name: "Either < column name > in csv and xls or in the format of < sheet name >/< column name > only for excel files"
+  name: "Either `< column name >` for `.csv` and `excel` files, or `< sheet name >/< column name >` (only for `excel` files)"
   unit: "The unit to display the data"
-  comment: "A character denoting the commented lines in excel or csv files"
+  comment: "A character denoting the commented lines in `excel` or `.csv` files"
+  sep: "The separator used when reading data from a `.csv` file (e.g. `','` for comma or `'\\t'` for tab)."
+  separator: "Alias for `sep`."
+  mode: "Either `column` or `row`. Use only when `TableData` is set as a base section. Defaults to `column`."
+  target_sub_section: "List of paths to the targeted repeating subsection(s), in the form < section >/< sub_section >/ ... /< sub_section >"
 
 eln:
   component: "The name of ELN edit component"
diff --git a/examples/data/docs/tabular-parser-col-mode.archive.yaml b/examples/data/docs/tabular-parser-col-mode.archive.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..d09b653d9b93d2500d8b2fac5edfe6ec9e3a6b93
--- /dev/null
+++ b/examples/data/docs/tabular-parser-col-mode.archive.yaml
@@ -0,0 +1,33 @@
+# This schema is specially made for demonstration of implementing a tabular parser with
+# column mode.
+definitions:
+  name: 'Tabular Parser example schema'
+  sections:
+    Tabular_Parser: # The main section that contains the quantities to be read from an excel file.
+                    # This name can be changed freely.
+      base_sections:
+        - nomad.parsing.tabular.TableData
+      quantities:
+        data_file:
+          type: str
+          m_annotations:
+            tabular_parser: # The tabular_parser annotation will treat the values of this
+                            # quantity as files. It will try to interpret the files and fill
+                            # quantities in this section (and sub_sections) with the column
+                            # data of .csv or .xlsx files.
+              comment: '#' # Skip lines in the csv or excel file that start with `#`
+              mode: column # Here the mode can be set. If removed, the parser
+                           # assumes column mode by default.
+        My_Quantity:
+          type: str
+          shape: ['*']
+          m_annotations:
+            tabular: # The tabular annotation defines a mapping to column headers used in tabular data files
+              name: My_Sheet/My_Column # Here you can define where the data for the given quantity is to be taken from.
+              # The convention for selecting the name is: if the data is to be taken from an excel file,
+              # you can specify the sheet_name followed by a forward slash and the column_name to target the desired quantity.
+              # If only a column name is provided, then the first sheet in the excel file (or the .csv file)
+              # is assumed to contain the targeted data.
+data:
+  m_def: Tabular_Parser # this is a reference to the section definition above
+  data_file: tabular-parser.data.xlsx # name of the excel/csv file to be uploaded along with this schema yaml file
\ No newline at end of file
diff --git a/examples/data/docs/tabular-parser-entry-mode.archive.xlsx b/examples/data/docs/tabular-parser-entry-mode.archive.xlsx
new file mode 100644
index 0000000000000000000000000000000000000000..b96fffc18ce1772d4ebe65f18c5fd37730b2a3fc
Binary files /dev/null and b/examples/data/docs/tabular-parser-entry-mode.archive.xlsx differ
diff --git a/examples/data/docs/tabular-parser-entry-mode.archive.yaml b/examples/data/docs/tabular-parser-entry-mode.archive.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..6ef95ab1c5123f61f68bcd36e6396279f14a4ff5
--- /dev/null
+++ b/examples/data/docs/tabular-parser-entry-mode.archive.yaml
@@ -0,0 +1,17 @@
+# This schema is specially made for demonstration of implementing a tabular parser with
+# entry mode.
+definitions:
+  name: 'Tabular Parser example schema'
+  sections:
+    Tabular_Parser: # The main section that contains the quantities to be read from an excel file.
+                    # This name can be changed freely.
+      base_sections:
+        - nomad.parsing.tabular.TableRow # To create entries from each row in the excel file,
+        # the base section should inherit from `nomad.parsing.tabular.TableRow`. For this specific case,
+        # the schema file should be accompanied by the data file.
+      quantities:
+        My_Quantity:
+          type: str
+          m_annotations:
+            tabular:
+              name: My_Column
diff --git a/examples/data/docs/tabular-parser-row-mode.archive.yaml b/examples/data/docs/tabular-parser-row-mode.archive.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..a4ba739569222064851b1deb1f93cb81d58053c6
--- /dev/null
+++ b/examples/data/docs/tabular-parser-row-mode.archive.yaml
@@ -0,0 +1,35 @@
+# This schema is specially made for demonstration of implementing a tabular parser with
+# row mode.
+definitions:
+  name: 'Tabular Parser example schema'
+  sections:
+    Tabular_Parser: # The main section that contains the quantities to be read from an excel file.
+                    # This name can be changed freely.
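+      # Row mode needs both the `TableData` base section below and the `target_sub_section`
+      # annotation, which points at the repeating sub-section the parsed rows are appended to.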
+      base_sections:
+        - nomad.parsing.tabular.TableData # Here we specify that we need to acquire the data from a .xlsx or a .csv file
+      quantities:
+        data_file:
+          type: str
+          m_annotations:
+            tabular_parser:
+              comment: '#' # Skip lines in the csv or excel file that start with `#`
+              mode: row # Setting mode to row signals that one instance of the targeted (sub-)section
+                        # is created and appended for each row of the sheet_name (provided in the
+                        # tabular annotation of My_Quantity below)
+              target_sub_section: # This is the reference to where the targeted (sub-)section lies within this example schema file
+                - My_Subsection/My_Section
+      sub_sections:
+        My_Subsection:
+          section:
+            sub_sections:
+              My_Section:
+                repeats: true # The repeats option set to true means there can be multiple instances of this
+                              # section
+                section:
+                  quantities:
+                    My_Quantity:
+                      type: str
+                      m_annotations:
+                        tabular: # The tabular annotation defines a mapping to column headers used in tabular data files
+                          name: My_Sheet/My_Column # sheet_name and column name of the targeted data in the csv/xlsx file
+data:
+  m_def: Tabular_Parser # this is a reference to the section definition above
+  data_file: tabular-parser.data.xlsx # name of the excel/csv file to be uploaded along with this schema yaml file
diff --git a/examples/data/docs/tabular-parser.data.xlsx b/examples/data/docs/tabular-parser.data.xlsx
new file mode 100644
index 0000000000000000000000000000000000000000..b96fffc18ce1772d4ebe65f18c5fd37730b2a3fc
Binary files /dev/null and b/examples/data/docs/tabular-parser.data.xlsx differ
diff --git a/examples/data/eln/schema.archive.yaml b/examples/data/eln/schema.archive.yaml
index 7e36c0cd2fb0c45e11ac96157b756fc89192a380..09514f70a578a7a09c165d461f91bbf4de3fed6d 100644
--- a/examples/data/eln/schema.archive.yaml
+++ b/examples/data/eln/schema.archive.yaml
@@ -1,11 +1,11 @@
-# Schemas can be defines as yaml files like this. The archive.yaml format will be
+# Schemas can be defined as yaml files like this. The archive.yaml format will be
 # interpreted by nomad as a nomad archive. Therefore, all definitions have to be
 # put in a top-level section called "definitions"
 definitions:
   # The "definitions" section is interpreted as a nomad schema package
   # Schema packages can have a name:
   name: 'Electronic Lab Notebook example schema'
-  # Schema packages contain section definitions. This is wear the interesting schema
+  # Schema packages contain section definitions. This is where the interesting schema
   # information begins.
   sections:
     # Here we define a section called "Chemical":
@@ -122,7 +122,7 @@ definitions:
         section:
           # The sub-section's section, is itself a section definition
           m_annotations:
-            eln: # ads the sub-section to the eln and allows users to create new instances of this sub-section
+            eln: # adds the sub-section to the eln and allows users to create new instances of this sub-section
           # We can also nest sub_sections. It goes aribitrarely deep.
           sub_sections:
             pvd_evaporation:
@@ -152,7 +152,7 @@ definitions:
                   # The tabular_parser annotation, will treat the values of this
                   # quantity as files. It will try to interpret the files and fill
                   # quantities in this section (and sub_section) with the column
-                  # data of .csv or .xlsx files.
+                  # data of .csv or .xlsx files. There is also a `mode` option, which defaults to `column`.
                  tabular_parser:
                     sep: '\t'
                     comment: '#'
@@ -213,4 +213,3 @@ definitions:
             eln:
               component: NumberEditQuantity
 
-
diff --git a/gui/src/utils.js b/gui/src/utils.js
index 7baecb23a999e87749bf002d16629de7105b849a..87952d70a2d6882318ddf81efc10fa4fe42b1e73 100644
--- a/gui/src/utils.js
+++ b/gui/src/utils.js
@@ -375,7 +375,7 @@ export function formatInteger(value) {
  * @return {str} The timestamp with new formatting
  */
 export function formatTimestamp(value) {
-  if (value.search(/(\+|Z)/) === -1) { // search for timezone information
+  if (value.search(/([+-][0-9]{2}:[0-9]{2}|Z)\b/) === -1) { // search for timezone information
     try {
       // assume UTC timestamp from server and attempt to manually add UTC timezone,
       // new Date will wrongly assume local timezone.
diff --git a/nomad/parsing/tabular.py b/nomad/parsing/tabular.py
index a450e595923e2b6652d3cefd3db38f77b92432cf..4de34ebe65aab79e2cc9769fd634396461d3ed1d 100644
--- a/nomad/parsing/tabular.py
+++ b/nomad/parsing/tabular.py
@@ -37,6 +37,14 @@ from nomad.parsing.parser import MatchingParser
 m_package = Package()
 
 
+def to_camel_case(snake_str: str):
+    '''Take as input a snake case variable and return a camel case one'''
+
+    components = snake_str.split('_')
+
+    return ''.join(component.capitalize() for component in components)
+
+
 class TableRow(EntryData):
     ''' Represents the data in one row of a table. '''
     table_ref = Quantity(
@@ -61,6 +69,9 @@
             self.tabular_parser(quantity, archive, logger, **tabular_parser_annotation)
 
     def tabular_parser(self, quantity_def: Quantity, archive, logger, **kwargs):
+        if logger is None:
+            logger = utils.get_logger(__name__)
+
         if not quantity_def.is_scalar:
             raise NotImplementedError('CSV parser is only implemented for single files.')
 
@@ -71,7 +82,45 @@
         with archive.m_context.raw_file(self.data_file) as f:
             data = read_table_data(self.data_file, f, **kwargs)
 
-        parse_columns(data, self)
+        tabular_parser_mode = 'column' if kwargs.get('mode') is None else kwargs.get('mode')
+        if tabular_parser_mode == 'column':
+            parse_columns(data, self)
+
+        elif tabular_parser_mode == 'row':
+            # Returning one section for each row in the given sheet_name/csv_file
+            sections = parse_table(data, self.m_def, logger=logger)
+
+            # The target_sub_section contains the refs to the locations to which the sections are to be
+            # appended. Calling setattr will populate the non-repeating middle sections.
+            section_names: List[str] = kwargs.get('target_sub_section')
+            top_level_section_list: List[str] = []
+            for section_name in section_names:
+                section_name_str = section_name.split('/')[0]
+                if section_name_str in top_level_section_list:
+                    continue
+                top_level_section_list.append(section_name_str)
+                if getattr(self, section_name_str) is None:
+                    setattr(self, section_name_str, sections[0][section_name_str])
+                    sections.pop(0)
+
+            # For each returned section, navigating to the target (repeating) section in self and appending the section
+            # data to self.
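+            # For example, with target_sub_section: ['My_Subsection/My_Section'], each parsed
+            # row section's My_Subsection/My_Section[0] instance is appended to self.My_Subsection.My_Section.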
+            for section in sections:
+                for section_name in section_names:
+                    section_name_list = section_name.split('/')
+                    top_level_section = section_name_list.pop(0)
+                    self_updated = self[top_level_section]
+                    section_updated = section[top_level_section]
+                    for section_path in section_name_list:
+                        self_updated = self_updated[section_path]
+                        section_updated = section_updated[section_path]
+                    self_updated.append(section_updated[0])
+
+        else:
+            raise MetainfoError(f'The provided mode {tabular_parser_mode} should be either "column" or "row".')
 
 
 m_package.__init_metainfo__()
@@ -89,7 +138,7 @@
             continue
         properties.add(quantity)
 
-        tabular_annotation = quantity.m_annotations.get('tabular', None)
+        tabular_annotation = quantity.m_annotations.get('tabular', {})
         if tabular_annotation and 'name' in tabular_annotation:
             col_name = tabular_annotation['name']
         else:
@@ -105,7 +154,11 @@
         def set_value(section: MSection, value, path=path, quantity=quantity, tabular_annotation=tabular_annotation):
             import numpy as np
             for sub_section, section_def in path:
-                next_section = section.m_get_sub_section(sub_section, -1)
+                next_section = None
+                try:
+                    next_section = section.m_get_sub_section(sub_section, -1)
+                except (KeyError, IndexError):
+                    pass
                 if not next_section:
                     next_section = section_def.section_cls()
                     section.m_add_sub_section(sub_section, next_section, -1)
@@ -140,7 +193,7 @@
         mapping[col_name] = set_value
 
     for sub_section in section_def.all_sub_sections.values():
-        if sub_section in properties or sub_section.repeats:
+        if sub_section in properties:
             continue
         next_base_section = sub_section.sub_section
         properties.add(sub_section)
@@ -183,22 +236,43 @@
     section_def for each row. The sections are filled with the cells from their
     respective row.
     '''
     import pandas as pd
 
     data: pd.DataFrame = pd_dataframe
     sections: List[MSection] = []
+    main_sheet: Set[Any] = set()
 
     mapping = _create_column_to_quantity_mapping(section_def)  # type: ignore
-    for row_index, row in data.iterrows():
+
+    # The data object contains the entire excel file with all of its sheets (given that an
+    # excel file is provided; otherwise it contains the csv file). If a sheet_name is provided,
+    # the corresponding sheet is extracted from the data, otherwise it is assumed that
+    # the columns are to be extracted from the first sheet of the excel file.
+    for column in mapping:
+        if column == 'data_file':
+            continue
+        sheet_name = {column.split('/')[0]} if '/' in column else {0}
+        if main_sheet and main_sheet.isdisjoint(sheet_name):
+            raise Exception('The columns for each quantity should all come from one single sheet')
+        main_sheet = main_sheet.union(sheet_name)
+
+    assert len(main_sheet) == 1
+    sheet_name = main_sheet.pop()
+    df = pd.DataFrame.from_dict(data.loc[0, sheet_name] if isinstance(sheet_name, str) else data.iloc[0, sheet_name])
+
+    for row_index, row in df.iterrows():
         section = section_def.section_cls()
         try:
-            for column in data:
-                if column in mapping:
+            for column in mapping:
+                col_name = column.split('/')[1] if '/' in column else column
+
+                if col_name in df:
                     try:
-                        mapping[column](section, row[column])
+                        mapping[column](section, row[col_name])
                     except Exception as e:
                         logger.error(
                             f'could not parse cell',
-                            details=dict(row=row_index, column=column), exc_info=e)
+                            details=dict(row=row_index, column=col_name), exc_info=e)
         except Exception as e:
             logger.error(f'could not parse row', details=dict(row=row_index), exc_info=e)
         sections.append(section)
@@ -208,21 +282,32 @@
 def read_table_data(path, file_or_path=None, **kwargs):
     import pandas as pd
 
+    df = pd.DataFrame()
     if file_or_path is None:
         file_or_path = path
+
     if path.endswith('.xls') or path.endswith('.xlsx'):
         excel_file: pd.ExcelFile = pd.ExcelFile(
             file_or_path if isinstance(file_or_path, str) else file_or_path.name)
-        df = pd.DataFrame()
         for sheet_name in excel_file.sheet_names:
             df.loc[0, sheet_name] = [
-                pd.read_excel(excel_file, sheet_name=sheet_name, **kwargs)
-                .to_dict()]
+                pd.read_excel(excel_file, sheet_name=sheet_name,
+                              comment=kwargs.get('comment'),
+                              skiprows=kwargs.get('skiprows')).to_dict()]
     else:
-        df = pd.DataFrame()
+        if kwargs.get('sep') is not None:
+            sep_keyword = kwargs.get('sep')
+        elif kwargs.get('separator') is not None:
+            sep_keyword = kwargs.get('separator')
+        else:
+            sep_keyword = None
         df.loc[0, 0] = [
-            pd.read_csv(file_or_path, engine='python', **kwargs).to_dict()
+            pd.read_csv(file_or_path, engine='python',
+                        comment=kwargs.get('comment'),
+                        sep=sep_keyword,
+                        skipinitialspace=True
+                        ).to_dict()
         ]
 
     return df
@@ -277,7 +362,6 @@
         self, mainfile: str, archive: EntryArchive, logger=None,
         child_archives: Dict[str, EntryArchive] = None
     ):
-        import pandas as pd
 
         if logger is None:
             logger = utils.get_logger(__name__)
@@ -305,7 +389,6 @@
         tabular_parser_annotation = section_def.m_annotations.get('tabular-parser', {})
         data = read_table_data(mainfile, **tabular_parser_annotation)
-        data = pd.DataFrame.from_dict(data.iloc[0, 0])
 
         child_sections = parse_table(data, section_def, logger=logger)
         assert len(child_archives) == len(child_sections)
diff --git a/requirements.txt b/requirements.txt
index b763636cd8fcb8f4a44f84ed7226c296d25e61c3..2c83a7d68cda083026b02faf56d10bc94d438aa5 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -48,6 +48,8 @@ xrdtools==0.1.1
 openpyxl==3.0.9
 
 # [infrastructure]
+importlib-metadata==4.13.0
+pyOpenSSL==21.0.0
 optimade[mongo]==0.18.0
 structlog==20.1.0
 elasticsearch==7.17.1
@@ -91,6 +93,7 @@ oauthenticator==14.2.0
 validators==0.18.2
 aiofiles==0.8.0
 joblib==1.1.0
+toposort==1.7
 
 # [dev]
 markupsafe==2.0.1
diff --git a/tests/data/parsers/tabular/Test.xlsx b/tests/data/parsers/tabular/Test.xlsx
index 832c17dabe7a76b2e5e5c44f1724aab5e02e8ba0..2716823d96e5fea4e35daa4894530291194c46e3 100644
Binary files 
a/tests/data/parsers/tabular/Test.xlsx and b/tests/data/parsers/tabular/Test.xlsx differ diff --git a/tests/data/test_examples.py b/tests/data/test_examples.py index 5a2c8ece82a6b392c6e300527910564ffe8d6a74..0bb9d0cd35a4ca774ef9ae2e74d7eade2542a65e 100644 --- a/tests/data/test_examples.py +++ b/tests/data/test_examples.py @@ -33,3 +33,17 @@ def test_eln(mainfile, assert_xpaths, raw_files, no_warn): for xpath in assert_xpaths: assert archive.m_xpath(xpath) is not None + + +@pytest.mark.parametrize('mainfile, assert_xpaths', [ + pytest.param('tabular-parser-col-mode.archive.yaml', ['data.My_Quantity'], id='col_mode'), + pytest.param('tabular-parser-row-mode.archive.yaml', ['data.My_Subsection.My_Section[4].My_Quantity'], + id='row_mode'), + pytest.param('tabular-parser-entry-mode.archive.yaml', [], id='entry_mode'), +]) +def test_sample_tabular(mainfile, assert_xpaths, raw_files, no_warn): + mainfile_directory = 'examples/data/docs' + archive = run_processing(mainfile_directory, mainfile) + + for xpath in assert_xpaths: + assert archive.m_xpath(xpath) is not None diff --git a/tests/parsing/test_tabular.py b/tests/parsing/test_tabular.py index 1d23fc4f174ab3545299e69b36168cd145139c5c..ffa1e19158a7cd0b3c9add6c1c6729a771f03d55 100644 --- a/tests/parsing/test_tabular.py +++ b/tests/parsing/test_tabular.py @@ -20,6 +20,7 @@ import pytest import os import os.path import pandas as pd +import re from nomad import config from nomad.datamodel.datamodel import EntryArchive, EntryMetadata @@ -30,6 +31,16 @@ from nomad.parsing.parser import ArchiveParser from tests.normalizing.conftest import run_normalize +def quantity_generator(quantity_name, header_name, shape='shape: [\'*\']'): + base_case = f'''{quantity_name}: + type: str + {shape} + m_annotations: + tabular: + name: {header_name}''' + return re.sub(r'\n\s*\n', '\n', base_case) + + @pytest.mark.parametrize('schema,content', [ pytest.param( strip(''' @@ -123,7 +134,7 @@ def test_tabular(raw_files, monkeypatch, schema, content): definitions: name: 'A test schema for excel file parsing' sections: - MovpeSto_schema: + My_schema: base_section: nomad.datamodel.data.EntryData sub_sections: process: @@ -132,24 +143,16 @@ def test_tabular(raw_files, monkeypatch, schema, content): quantities: data_file: type: str - description: | - A reference to an uploaded .xlsx m_annotations: tabular_parser: comment: '#' - browser: - adaptor: RawFileAdaptor - eln: - component: FileEditQuantity - experiment_identifier: + quantity_1: type: str m_annotations: tabular: - name: Experiment Identifier - eln: - component: StringEditQuantity + name: column_1 data: - m_def: MovpeSto_schema + m_def: My_schema process: data_file: Test.xlsx '''), id='w/o_sheetName_rowMode'), @@ -158,7 +161,7 @@ def test_tabular(raw_files, monkeypatch, schema, content): definitions: name: 'A test schema for excel file parsing' sections: - MovpeSto_schema: + My_schema: base_section: nomad.datamodel.data.EntryData sub_sections: process: @@ -167,24 +170,16 @@ def test_tabular(raw_files, monkeypatch, schema, content): quantities: data_file: type: str - description: | - A reference to an uploaded .xlsx m_annotations: tabular_parser: comment: '#' - browser: - adaptor: RawFileAdaptor - eln: - component: FileEditQuantity - experiment_identifier: + quantity_1: type: str m_annotations: tabular: - name: Overview/Experiment Identifier - eln: - component: StringEditQuantity + name: sheet_1/column_1 data: - m_def: MovpeSto_schema + m_def: My_schema process: data_file: Test.xlsx '''), id='w_sheetName_rowMode'), @@ 
-193,7 +188,7 @@ def test_tabular(raw_files, monkeypatch, schema, content): definitions: name: 'A test schema for excel file parsing' sections: - MovpeSto_schema: + My_schema: base_section: nomad.datamodel.data.EntryData sub_sections: process: @@ -207,32 +202,28 @@ def test_tabular(raw_files, monkeypatch, schema, content): m_annotations: tabular_parser: comment: '#' - browser: - adaptor: RawFileAdaptor - eln: - component: FileEditQuantity - experiment_identifier: + quantity_1: type: str m_annotations: tabular: - name: Overview/Experiment Identifier - eln: - component: StringEditQuantity - pyrotemperature: + name: sheet_1/column_1 + quantity_2: type: np.float64 shape: ['*'] unit: K - description: My test description here m_annotations: tabular: - name: Deposition Control/Pyrotemperature + name: sheet_2/column_2 data: - m_def: MovpeSto_schema + m_def: My_schema process: data_file: Test.xlsx '''), id='w_sheetName_colMode') ]) -def test_xlsx_tabular(raw_files, monkeypatch, schema): +def test_tabular_entry_mode(raw_files, monkeypatch, schema): + ''' + Testing TabularParser parser. This feature creates an entry out of each row from the given excel/csv file + ''' _, schema_file = get_files(schema) excel_file = os.path.join(os.path.dirname(__file__), '../../tests/data/parsers/tabular/Test.xlsx') @@ -246,10 +237,209 @@ def test_xlsx_tabular(raw_files, monkeypatch, schema): run_normalize(main_archive) assert main_archive.data is not None - assert 'experiment_identifier' in main_archive.data.process - assert main_archive.data.process.experiment_identifier == '22-01-21-MA-255' - if 'pyrotemperature' in main_archive.data.process: - assert len(main_archive.data.process['pyrotemperature']) == 6 + assert 'quantity_1' in main_archive.data.process + assert main_archive.data.process.quantity_1 == 'value_1' + if 'quantity_2' in main_archive.data.process: + assert len(main_archive.data.process['quantity_2']) == 6 + + +@pytest.mark.parametrize('test_case,section_placeholder,sub_sections_placeholder,quantity_placeholder,csv_content', [ + pytest.param('test_1', '', '', quantity_generator('quantity_0', 'header_0'), + 'header_0,header_1\n0_0,0_1\n1_0,1_1', id='simple'), + pytest.param('test_2', f'''Mysection: + quantities: + {quantity_generator('quantity_0', 'header_0')} + ''', '''sub_sections: + my_substance: + section: Mysection''', '', 'header_0,header_1\n0_0,0_1\n1_0,1_1', + id='nested'), +]) +def test_tabular_column_mode(raw_files, monkeypatch, test_case, section_placeholder, quantity_placeholder, + sub_sections_placeholder, csv_content): + ''' + Testing the TableData normalizer using default mode (column mode). This feature creates a list of values + out of the given column in the excel/csv file for the given quantity. 
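+    The angle-bracket tokens in base_schema below are placeholders that get substituted per test case.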
+ ''' + base_schema = '''definitions: + name: 'Eln' + sections: + + My_schema: + base_sections: + - nomad.parsing.tabular.TableData + quantities: + data_file: + type: str + description: + m_annotations: + tabular_parser: + comment: '#' + + +data: + m_def: My_schema + data_file: test.my_schema.archive.csv''' + + schema = base_schema.replace('', section_placeholder)\ + .replace('', sub_sections_placeholder)\ + .replace('', quantity_placeholder)\ + .replace('', test_case) + schema = re.sub(r'\n\s*\n', '\n', schema) + csv_file, schema_file = get_files(schema, csv_content) + + class MyContext(ClientContext): + def raw_file(self, path, *args, **kwargs): + return open(csv_file, *args, **kwargs) + context = MyContext(local_dir='') + + main_archive, _ = get_archives(context, schema_file, None) + ArchiveParser().parse(schema_file, main_archive) + run_normalize(main_archive) + + assert main_archive.data is not None + if 'test_1' in schema: + assert main_archive.data.quantity_0 == ['0_0', '1_0'] + elif 'test_2' in schema: + assert main_archive.data.my_substance.quantity_0 == ['0_0', '1_0'] + + +@pytest.mark.parametrize('test_case,section_placeholder,target_sub_section_placeholder,sub_sections_placeholder,csv_content', [ + pytest.param('test_1', '', '- my_substance1', '''my_substance1: + repeats: true + section: Substance1''', 'header_0,header_1\n0_0,0_1\n1_0,1_1', id='simple_1_section'), + pytest.param('test_2', f'''Substance2: + quantities: + {quantity_generator('quantity_2', 'header_2', shape='')} + ''', '''- my_substance1 + - my_substance2''', '''my_substance1: + repeats: true + section: Substance1 + my_substance2: + repeats: true + section: Substance2''', 'header_0,header_1,header_2\n0_0,0_1,0_2\n1_0,1_1,1_2', id='simple_2_sections'), + pytest.param('test_3', '', '- subsection_1/my_substance1', f'''subsection_1: + section: + m_annotations: + eln: + dict() + sub_sections: + my_substance1: + repeats: true + section: + base_section: Substance1''', 'header_0,header_1,header_2\n0_0,0_1,0_2\n1_0,1_1,1_2', id='nested')]) +def test_tabular_row_mode(raw_files, monkeypatch, test_case, section_placeholder, target_sub_section_placeholder, + sub_sections_placeholder, csv_content): + ''' + Testing the TableData normalizer with mode set to row. This feature is used to create a section out of each row in a + given sheet_name of an excel file or a csv file, and append it to the repeating (sub)section(s). 
+ ''' + base_schema = f'''definitions: + name: 'Eln' + sections: + Substance1: + quantities: + {quantity_generator('quantity_4', 'header_0', shape='')} + + My_schema: + base_sections: + - nomad.parsing.tabular.TableData + quantities: + data_file: + type: str + description: + m_annotations: + tabular_parser: + comment: '#' + mode: row + target_sub_section: + + sub_sections: + +data: + m_def: My_schema + data_file: test.my_schema.archive.csv''' + + schema = base_schema.replace('', section_placeholder) \ + .replace('', target_sub_section_placeholder) \ + .replace('', sub_sections_placeholder) \ + .replace('', test_case) + schema = re.sub(r'\n\s*\n', '\n', schema) + csv_file, schema_file = get_files(schema, csv_content) + + class MyContext(ClientContext): + def raw_file(self, path, *args, **kwargs): + return open(csv_file, *args, **kwargs) + context = MyContext(local_dir='') + + main_archive, _ = get_archives(context, schema_file, None) + ArchiveParser().parse(schema_file, main_archive) + run_normalize(main_archive) + + assert main_archive.data is not None + if 'test_1' in schema: + assert len(main_archive.data.my_substance1) == 2 + ii = 0 + for item in main_archive.data.my_substance1: + assert item.quantity_4 == f'{ii}_0' + ii += 1 + elif 'test_2' in schema: + assert len(main_archive.data.my_substance2) == 2 + ii = 0 + for item in main_archive.data.my_substance2: + assert item.quantity_2 == f'{ii}_2' + ii += 1 + elif 'test_3' in schema: + assert len(main_archive.data.subsection_1.my_substance1) == 2 + ii = 0 + for item in main_archive.data.subsection_1.my_substance1: + assert item.quantity_4 == f'{ii}_0' + ii += 1 + + +@pytest.mark.parametrize('schema,content', [ + pytest.param( + strip(''' + definitions: + sections: + MyTable: + base_section: nomad.parsing.tabular.TableData + quantities: + data_file: + type: str + m_annotations: + tabular_parser: + comment: '#' + mode: column + header_0: + type: str + header_1: + type: str + data: + m_def: MyTable + data_file: test.my_schema.archive.csv + '''), + strip(''' + header_0, header_1 + a,b + '''), id='space in header' + ) +]) +def test_tabular_csv(raw_files, monkeypatch, schema, content): + '''Tests that missing data is handled correctly. Pandas by default + interprets missing numeric values as NaN, which are incompatible with + metainfo. + ''' + csv_file, schema_file = get_files(schema, content) + + class MyContext(ClientContext): + def raw_file(self, path, *args, **kwargs): + return open(csv_file, *args, **kwargs) + context = MyContext(local_dir='') + + main_archive, _ = get_archives(context, schema_file, None) + ArchiveParser().parse(schema_file, main_archive) + run_normalize(main_archive) + assert main_archive.data.header_1 is not None @pytest.mark.parametrize('schema,content,missing', [