Automatic, physical data extraction from scientific publications for application to generative molecular design in computational materials discovery

Ronaldo Giro; Mohab Elkaref; Hsianghan Hsu; Nathan Herr; Geeth de Mel; Mathias B Steiner

ORAL

Abstract

One of the major barriers to applying artificial intelligence (AI) to materials design and discovery is the lack of training data for machine-learning models. Despite the recent emergence of public data repositories in materials science, data formats are not standardized and the searchability of application-specific data sets is limited. In contrast, published papers contain vast amounts of structured data tables. In this contribution, we will present a method and research tool that allows the annotation and automatic extraction of physical and chemical data tables from document files. The necessary configuration steps include: (i) defining a corpus of papers that are relevant to the discovery application of interest; (ii) reviewing and selecting the extracted tables and converting the files; and (iii) transforming the materials’ names into a machine-readable string format. With these steps completed, we obtain an integrated data table of materials properties that is used for training the AI models. In our research, we have used this method to collect about 500 data entries with the following polymer properties: CO₂ permeability and CO₂/N₂ selectivity. Currently, the number of data entries we have extracted is limited by the number of documents in the corpus. Finally, we discuss our initial results obtained with AI models trained on the extracted data tables for designing high-performance membranes for carbon dioxide capture and separation.
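
As an illustration only, the final step (merging reviewed tables from several papers into one integrated data table keyed by a machine-readable material string) might look like the minimal Python sketch below. It is not the authors' tool: the `Record` type, the `integrate` function, the column names, the lookup dictionary, and the placeholder polymer names, strings, and numbers are all assumptions made for this example. The abstract does not name the string format or the units, so neither is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One row of the integrated table (hypothetical schema)."""
    material: str              # machine-readable string for the material
    co2_permeability: float    # unit not specified in the abstract
    co2_n2_selectivity: float  # dimensionless ratio
    source: str                # which document the row came from


def normalize(name: str) -> str:
    """Lower-case a material name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def integrate(tables, name_to_string):
    """Merge per-paper tables into one list of records.

    Rows whose material name has no known machine-readable string are
    returned separately so that they can be reviewed by hand.
    """
    records, unresolved = [], []
    for source, rows in tables.items():
        for row in rows:
            string = name_to_string.get(normalize(row["name"]))
            if string is None:
                unresolved.append((source, row["name"]))
                continue
            records.append(
                Record(
                    material=string,
                    co2_permeability=float(row["co2_permeability"]),
                    co2_n2_selectivity=float(row["co2_n2_selectivity"]),
                    source=source,
                )
            )
    return records, unresolved


# Placeholder inputs: names, strings, and numbers are invented for the demo
# and are not measured values.
tables = {
    "paper_1": [
        {"name": "Polymer  A", "co2_permeability": "10.0", "co2_n2_selectivity": "20.0"},
        {"name": "polymer b", "co2_permeability": "5.5", "co2_n2_selectivity": "30.0"},
    ],
    "paper_2": [
        {"name": "Polymer C", "co2_permeability": "1.0", "co2_n2_selectivity": "40.0"},
    ],
}
name_to_string = {"polymer a": "STRING_A", "polymer b": "STRING_B"}

records, unresolved = integrate(tables, name_to_string)
```

Keeping unresolved names in a separate list, rather than dropping them silently, mirrors the manual review step described in (ii): a person can then decide how to map each remaining name.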

March 6, 2023, 1:00 PM – March 6, 2023, 1:12 PM

Presenters

Ronaldo Giro

IBM Research - Brazil

Authors

Ronaldo Giro

IBM Research - Brazil

Mohab Elkaref

IBM Research - UK

Hsianghan Hsu

IBM Research - Tokyo

Nathan Herr

IBM Research - UK

Geeth de Mel

IBM Research - UK

Mathias B Steiner

IBM Research - Brazil