nomad-lab / nomad-FAIR

Commit 1dffd0ee
Authored 1 year ago by Alvin Noe Ladines

Fix HDF5 ref docs

Changelog: Fixed

Parent: b86979e6
Merge request: !1795 "Fix HDF5 ref docs"
Showing 4 changed files with 36 additions and 39 deletions:

- docs/howto/customization/hdf5.md (+33 −35)
- docs/howto/overview.md (+1 −1)
- mkdocs.yml (+1 −1)
- nomad/datamodel/hdf5.py (+1 −2)
docs/howto/customization/hdf5.md (+33 −35)
````diff
-# How to handle large quantities with HDF5
+# How to use HDF5 to handle large quantities
 
-The NOMAD schemas and processed data system is designed to describe and manage
+The NOMAD schemas and processed data system are designed to describe and manage
 intricate hierarchies of connected data. This is ideal for metadata and lots of small
 data quantities, but does not work for large quantities. Quantities are atomic and
 are always managed as a whole; there is currently no functionality to stream or
@@ -8,27 +8,26 @@ splice large quantities. Consequently, tools that produce or work with such data
 cannot scale.
 
 To address the issue, the option to use auxiliary storage systems optimized for large
-data is implemented. In the following we discuss two ways to write large datasets to HDF5.
-The first is the use of the quantity type `HDF5Reference` and second is the addition of
-quantity annotation.
+data is implemented. In the following we discuss two quantity types to enable the writing
+of large datasets to HDF5: `HDF5Reference` and `HDF5Dataset`. These are defined in
+`nomad.datamodel.hdf5`.
 
 ## HDF5Reference
 
 HDF5Reference is a metainfo quantity type intended to reference datasets in external raw
-HDF5 files. This can also be used to write large data into an HDF5 file following the
-structure of the nomad archive. In following example schema, we define two HDF5Reference
-quantities to illustrate these functionalities.
+HDF5 files. It is assumed that the dataset exists in an HDF5 file and the reference
+is assigned to this quantity. Static methods to read from and write to an HDF5 file are
+implemented. The following example illustrates how to use these.
 
 ```python
 from nomad.datamodel import ArchiveSection
-from nomad.metainfo import HDF5Reference
+from nomad.datamodel.hdf5 import HDF5Reference
 
 
 class LargeData(ArchiveSection):
-    value_external = Quantity(type=HDF5Reference)
     value = Quantity(type=HDF5Reference)
 ```
 
-The writing and reading of quantity values to and from an HDF5 file occurs during
+The writing and reading of quantity values to and from an HDF5 file occur during
 processing. For illustration purposes, we mock this by creating `ServerContext`. Furthermore,
 we use this section definition for the `data` sub-section of EntryArchive.
````
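The docs mock processing by creating a `ServerContext`; that setup sits in the collapsed context of the next hunk. As orientation, a minimal sketch of what it plausibly looks like, assuming `ServerContext` from `nomad.datamodel.context` and `Upload` from `nomad.processing` (the ids are illustrative and match the serialized references quoted in the docs):

```python
from nomad.datamodel import EntryArchive, EntryMetadata
from nomad.datamodel.context import ServerContext  # assumption: server-side context class
from nomad.processing import Upload                # assumption: upload record for the context

# Mock the processing environment; 'test_upload' and 'test_entry' match the
# reference strings shown in the docs' examples.
context = ServerContext(Upload(upload_id='test_upload'))
archive = EntryArchive(
    m_context=context,
    metadata=EntryMetadata(upload_id='test_upload', entry_id='test_entry'),
    data=LargeData(),
)
```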
````diff
@@ -51,43 +50,42 @@ archive = EntryArchive(
     data=LargeData(),
 )
 
-archive.data.value_external = 'external.h5#/path/to/data'
-archive.data.value = np.eye(5)
-archive.data.value
-# '/uploads/test_upload/archive/test_entry#/data/value'
+data = np.eye(3)
+path = 'external.h5#path/to/data'
+HDF5Reference.write_dataset(archive, data, path)
+archive.data.value = path
+HDF5Reference.read_dataset(archive, path)
+array([[1., 0., 0.],
+       [0., 1., 0.],
+       [0., 0., 1.]])
 ```
````
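The `path` argument packs the HDF5 file name and the in-file dataset location into a single string separated by `#`. A rough h5py equivalent of the round trip above, as a sketch of the semantics rather than the library's actual implementation:

```python
import h5py
import numpy as np

# 'external.h5#path/to/data' splits at '#' into a file part and a dataset path.
with h5py.File('external.h5', 'a') as f:
    f['path/to/data'] = np.eye(3)       # roughly what write_dataset does

with h5py.File('external.h5', 'r') as f:
    value = f['path/to/data'][...]      # roughly what read_dataset returns
```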
````diff
-For `value_external`, we assign a reference to a dataset `/path/to/data` in a raw HDF5
-file `external.h5` in the same upload. This will simply store this reference and will not
-write it to another HDF5 file. To reference a file in another upload, follow the
-same form for [reference values](basics.md#different-forms-of-references) e.g.
-`/uploads/<upload_id>/raw/large_data.hdf5#group/large_field`
-
-In contrast, when assigning an array to `value`, this is written to an HDF5 extension of
-the entry archive and serialized as `/uploads/test_upload/archive/test_entry#/data/value`.
-The structure of the HDF5 file will be the same as that of the archive.
+We use `write_dataset` to write our data into a raw HDF5 file in `test_upload` with the
+filename and dataset location in `path`. Additionally, archive is required to resolve the
+upload metadata. We then assign the reference to the dataset to `value`. To reference a
+file in another upload, follow the same form for
+[reference values](basics.md#different-forms-of-references) e.g.
+`/uploads/<upload_id>/raw/large_data.hdf5#group/large_field`.
 
 !!! important
     When reassigning a different value for an HDF5 archive quantity, it is necessary that the data
     attributes (shape and type) are preserved.
 
-## Existing quantities for large arrays
+To read a dataset, use `read_dataset` and provide a reference. This will return the value
+cast in the type of the dataset.
 
-For existing quantity definitions which one uses for large arrays, it is also possible
-to write the data to the HDF5 representation of the archive. This can be done by adding
-a `serialization` annotation to the quantity definition.
+## HDF5Dataset
+
+To use HDF5 storage for archive quantities, one should use `HDF5Dataset`.
 
 ```python
-from nomad.datamodel.metainfo.annotations import HDF5SerializationAnnotation
+from nomad.datamodel.hdf5 import HDF5Dataset
 
 
 class LargeData(ArchiveSection):
-    value = Quantity(type=np.float64)
-    value.m_annotations = dict(serialization=HDF5SerializationAnnotation())
+    value = Quantity(type=HDF5Dataset)
 ```
 
-Upon serialization, the assigned value will also be written to the archive HDF5 file.
-However, the value will remain in memory. This is the difference compared to HDF5Rerence
-where the value is immediately written to an HDF5 file and serialized as reference.
-During serialization, one also needs to provide the archive context in order to resolve
-the reference.
+The assigned value will also be written to the archive HDF5 file and serialized as
+`/uploads/test_upload/archive/test_entry#/data/value`.
 
 ```python
 archive.data.value = np.ones(3)
````
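To make the `HDF5Dataset` behaviour concrete, a short continuation under the same mocked-context assumption as above; treating `m_to_dict` as the serialization entry point is an assumption, and the reference string is the one quoted in the docs:

```python
import numpy as np

# The assigned value stays an in-memory array after assignment.
archive.data.value = np.ones(3)

# Serializing the archive writes the array into the archive's HDF5 file;
# the quantity then appears in the serialized output as a reference string,
# e.g. '/uploads/test_upload/archive/test_entry#/data/value'.
serialized = archive.m_to_dict()
```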
docs/howto/overview.md (+1 −1)
```diff
@@ -65,12 +65,12 @@ Customize NOMAD, write plugins, and tailor NOMAD Oasis.
 - [Use base sections](customization/base_sections.md)
 - [Parse tabular data](customization/tabular.md)
 - [Define workflows](customization/workflows.md)
-- [Reference hdf5 files](customization/hdf5.md)
 - [Write plugins](customization/plugins.md)
 - [Write a python schema](customization/schemas.md)
 - [Write a parser](customization/parsers.md)
 - [Write a normalizer](customization/normalizers.md)
 - [Work with units](customization/units.md)
+- [Use HDF5 to handle large quantities](customization/hdf5.md)
 
 </div>
 <div markdown="block">
```
mkdocs.yml (+1 −1)
```diff
@@ -39,12 +39,12 @@ nav:
     - Use base sections: howto/customization/base_sections.md
     - Parse tabular data: howto/customization/tabular.md
     - Define workflows: howto/customization/workflows.md
-    - Handle large quantities: howto/customization/hdf5.md
     - Write plugins: howto/customization/plugins.md
     - Write a schema plugin: howto/customization/schemas.md
     - Write a parser: howto/customization/parsers.md
     - Write a normalizer: howto/customization/normalizers.md
     - Work with units: howto/customization/units.md
+    - Use HDF5 to handle large quantities: howto/customization/hdf5.md
   - Development:
     - Get started: howto/develop/setup.md
     - Navigate the code: howto/develop/code.md
```
nomad/datamodel/hdf5.py (+1 −2)
```diff
@@ -47,7 +47,7 @@ def read_hdf5_dataset(hdf5_file: h5py.File, path: str) -> h5py.Dataset:
     )[match['path']]
 
 
-def write_hdf5_dataset(value: Any, hdf5_file: h5py.File, path: str) -> str:
+def write_hdf5_dataset(value: Any, hdf5_file: h5py.File, path: str) -> None:
     """
     Write data to HDF5 file.
     """
@@ -59,7 +59,6 @@ def write_hdf5_dataset(value: Any, hdf5_file: h5py.File, path: str) -> str:
         dtype=value.dtype if hasattr(value, 'dtype') else None,
    )
     dataset[...] = value.magnitude if hasattr(value, 'magnitude') else value
-    return dataset
 
 
 class _HDF5Reference(DataType):
```
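For readability, a sketch of `write_hdf5_dataset` as it plausibly reads after this change. The lines between the signature and the visible context are collapsed above, so the `create_dataset` call and the shape handling are inferred, not quoted from the source:

```python
from typing import Any

import h5py


def write_hdf5_dataset(value: Any, hdf5_file: h5py.File, path: str) -> None:
    """
    Write data to HDF5 file.
    """
    # Assumption: the collapsed lines derive shape/dtype before creating the
    # dataset; only the dtype expression and the final write are visible above.
    dataset = hdf5_file.create_dataset(
        path,
        shape=value.shape if hasattr(value, 'shape') else (),
        dtype=value.dtype if hasattr(value, 'dtype') else None,
    )
    # Unwrap pint quantities (objects carrying a `.magnitude`) before writing.
    dataset[...] = value.magnitude if hasattr(value, 'magnitude') else value
    # After this commit the function returns None rather than the dataset handle.
```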