The NoMaD Laboratory and Big-Data Analytics: Extracting hidden information from repositories of computational materials science
Fawzi Mohamed^1^, Luca M. Ghiringhelli^1^, Christian Carbogno^1^, Claudia Draxl^1^,^2^, Alessandro De Vita^3^, Daan Frenkel^4^, Francesc Illas^5^, Risto Nieminen^6^, Angel Rubio^7^,^8^, Kristian Sommer Thygesen^9^, and Matthias Scheffler^1^ (*)
^1^ Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin
^2^ Humboldt-Universität zu Berlin, Physics Department and IRIS Adlershof, Zum Großen Windkanal 6, D-12489 Berlin
^3^ King’s College London, Department of Physics, Strand, London WC2R 2LS, United Kingdom
^4^ Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
^5^ Universitat de Barcelona, Departament de Química Física & IQTCUB, c/Martí i Franquès 1, E-08028 Barcelona
^6^ Aalto University, COMP Centre of Excellence, Department of Applied Physics, PO Box 11100, FI-00076 Aalto, Espoo
^7^ Max Planck Institute for the Structure and Dynamics of Matter, Luruper Chaussee 149, D-22761 Hamburg
^8^ University of the Basque Country CFM CSIC-UPV/EHU-MPC and DIPC, Nano-Bio Spectroscopy group and ETSF Scientific Development Centre Department of Materials Physics, Avenida de Tolosa 72, E-20018 Donostia
^9^ Technical University of Denmark, CAMD, Physics Department, Anker Engelundsvej 1, Bld. 307, DK-2800 Kgs. Lyngby
Initiatives like the NoMaD Repository [1] give access to the raw data of computational materials-science studies performed with a variety of codes. To take advantage of the wealth of information hidden in these very heterogeneous, open-access data, we present the base layer of the NoMaD Lab [2], an infrastructure for performing advanced big-data analytics and complex queries over this kind of data.
First, a translation layer converts the data into a standardized format stored as an HDF5 or JSON file. This file uses a flexible, easily extended classification system to describe the stored data. As a consequence, all data are stored in a uniform, robust and extensible representation.
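The translation step can be illustrated with a minimal Python sketch. It is not the NoMaD implementation: the code names (`code_a`, `code_b`), the raw keys, the metadata names in `META_INFO` and the choice of eV as the common energy unit are all hypothetical, chosen only to show how heterogeneous outputs could be mapped onto one classified, extensible representation and written as JSON.

```python
import json

# Hypothetical classification system: every stored quantity must be declared
# here with a kind and a unit. Extending the format means adding an entry.
META_INFO = {
    "program_name": {"kind": "string", "unit": None},
    "energy_total": {"kind": "float", "unit": "eV"},
    "atom_labels": {"kind": "list", "unit": None},
}

HARTREE_IN_EV = 27.211386245988  # CODATA 2018 conversion factor

# Invented per-code mappings from raw output keys to classified names.
CODE_MAPPINGS = {
    "code_a": {
        "keys": {"E_tot": "energy_total", "species": "atom_labels"},
        "energy_unit": "Ha",
    },
    "code_b": {
        "keys": {"TotalEnergy": "energy_total", "atoms": "atom_labels"},
        "energy_unit": "eV",
    },
}


def normalize(code, raw):
    """Translate one raw, code-specific result into the common representation."""
    mapping = CODE_MAPPINGS[code]
    record = {"program_name": code}
    for raw_key, value in raw.items():
        name = mapping["keys"].get(raw_key)
        if name is None:
            continue  # quantities without a classification entry are skipped
        if name == "energy_total" and mapping["energy_unit"] == "Ha":
            value = value * HARTREE_IN_EV  # convert to the common unit
        record[name] = value
    undeclared = set(record) - set(META_INFO)
    if undeclared:
        raise KeyError(f"undeclared metadata: {sorted(undeclared)}")
    return record


records = [
    normalize("code_a", {"E_tot": -1.0, "species": ["H", "H"]}),
    normalize("code_b", {"TotalEnergy": -31.5, "atoms": ["H", "H"], "debug": 1}),
]
normalized_json = json.dumps(records, indent=2, sort_keys=True)
```

After this step both records expose the same keys and units regardless of the originating code, which is what allows later queries to treat them uniformly.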
Then the reactive (responsive, resilient, message-driven and dynamically resizable) application that started the standardization may complete the data by calculating derived quantities. Finally, Apache Flink [3] (a fast, large-scale data-processing engine) is used to perform complex queries efficiently on the extracted data.
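As a rough illustration of these two stages, the following Python sketch first adds derived quantities to normalized records and then runs a group-and-aggregate query over them. It is a plain-Python stand-in and does not use Apache Flink; the record layout, the derived fields (`formula`, `energy_per_atom`) and the numerical values are invented for the example.

```python
from collections import Counter

# Invented normalized records (toy values, not real calculation results).
records = [
    {"program_name": "code_a", "atom_labels": ["Si", "Si"], "energy_total": -10.0},
    {"program_name": "code_b", "atom_labels": ["Si", "Si"], "energy_total": -12.0},
    {"program_name": "code_b", "atom_labels": ["Na", "Cl"], "energy_total": -6.0},
]


def add_derived(record):
    """Return a copy of the record completed with derived quantities."""
    enriched = dict(record)
    counts = Counter(record["atom_labels"])
    # Alphabetically ordered formula, e.g. ["Na", "Cl"] becomes "Cl1Na1".
    enriched["formula"] = "".join(f"{el}{n}" for el, n in sorted(counts.items()))
    enriched["energy_per_atom"] = record["energy_total"] / len(record["atom_labels"])
    return enriched


def lowest_energy_by_formula(raw_records):
    """Group records by formula and keep the one with the lowest energy per atom."""
    best = {}
    for rec in map(add_derived, raw_records):
        current = best.get(rec["formula"])
        if current is None or rec["energy_per_atom"] < current["energy_per_atom"]:
            best[rec["formula"]] = rec
    return best


result = lowest_energy_by_formula(records)
```

In a distributed engine the same query would be expressed as a map over the records followed by a keyed reduction, so that the grouping can be split across workers.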
Illustrative examples of the queries and data analytics enabled by the NoMaD Lab, and of its scalability, are presented.
(*) Work done in collaboration with the NoMaD team