Automatic, physical data extraction from scientific publications for application to generative molecular design in computational materials discovery

ORAL

Abstract

One of the major barriers to applying artificial intelligence (AI) in materials design and discovery is the lack of training data for machine-learning models. Despite the recent emergence of public data repositories in materials science, data formats are not standardized and the searchability of application-specific data sets is limited. This contrasts with the vast amount of structured data available in tables in published papers. In this contribution, we present a method and research tool for annotating and automatically extracting physical and chemical data tables from document files. The necessary configuration steps are: (i) defining a corpus of papers relevant to the discovery application of interest; (ii) reviewing and selecting the extracted tables and converting the files; and (iii) transforming the materials’ names into a machine-readable string format. With these steps completed, we obtain an integrated data table of materials properties that is used for training the AI models. In our research, we have used this method to collect about 500 data entries for two polymer properties: CO2 permeability and CO2/N2 selectivity. Currently, the number of data entries we have extracted is limited by the number of documents in the corpus. Finally, we discuss initial results obtained with AI models trained on the extracted data tables for designing high-performance membranes for carbon dioxide capture and separation.
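As a rough illustration of how extracted tables might be combined into one integrated data table, the following minimal Python sketch merges rows from several tables after normalizing material names and attaching a machine-readable string. It is not the authors' tool. The column names, the alias list, the lookup table, and the use of SMILES-like repeat-unit strings are all assumptions made for this example (the abstract does not name the string format), and the numeric values are placeholders, not measurements.

```python
import csv
import io

# Hypothetical lookup from a canonical polymer name to a machine-readable string.
# SMILES-like repeat-unit notation is an assumption; the abstract names no format.
NAME_TO_STRING = {
    "polyethylene": "[*]CC[*]",
    "polydimethylsiloxane": "[*][Si](C)(C)O[*]",
}

# Hypothetical aliases as they might appear in different papers.
ALIASES = {"pe": "polyethylene", "pdms": "polydimethylsiloxane"}


def normalize_name(raw):
    """Lowercase, collapse whitespace, and resolve known aliases."""
    key = " ".join(raw.strip().lower().split())
    return ALIASES.get(key, key)


def integrate(tables):
    """Merge rows from several extracted tables into one record per material."""
    merged = {}
    for text in tables:
        for row in csv.DictReader(io.StringIO(text)):
            name = normalize_name(row.pop("material"))
            record = merged.setdefault(
                name,
                {"material": name, "structure": NAME_TO_STRING.get(name)},
            )
            for column, value in row.items():
                if value not in (None, ""):
                    record[column] = float(value)
    return list(merged.values())


# Placeholder tables standing in for two extracted, reviewed tables.
TABLE_A = "material,co2_permeability_barrer\nPDMS,1.0\nPolyethylene,2.0\n"
TABLE_B = "material,co2_n2_selectivity\npolydimethylsiloxane,3.0\n"

rows = integrate([TABLE_A, TABLE_B])
for r in rows:
    print(r)
```

In this sketch, a material reported under different names in two papers ends up as a single record, and a property missing from every source table is simply absent from that record rather than filled in.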

Presenters

  • Ronaldo Giro

    IBM Research - Brazil

Authors

  • Ronaldo Giro

    IBM Research - Brazil

  • Mohab Elkaref

    IBM Research - UK

  • Hsianghan Hsu

    IBM Research - Tokyo

  • Nathan Herr

    IBM Research - UK

  • Geeth de Mel

    IBM Research - UK

  • Mathias B Steiner

    IBM Research - Brazil