Constructing Superconductivity and Magnetism Databases using Large Language Models

ORAL

Abstract

Large language models can effectively process natural language and extract key information from even highly technical text. We use this capability to process several thousand materials science and condensed matter papers and extract material parameters and experimental techniques. This work encompasses the entire large language model stack. First, we describe the pipeline for converting PDFs to plain text and the problems that arise in correcting text produced by optical character recognition. We also discuss extending default tokenizers to cover more technical text, including commonly used abbreviations, characters, and chemical formulas. Different prompting strategies are discussed, and we carefully examine errors in material parameter extraction. The effectiveness of the pipeline is tested on a human-labelled superconducting material database, which also provides a convenient source of training data. Finally, we compare several large language models of different sizes and examine fine-tuning of the models in order to speed up inference.
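
As an illustration of how extraction error could be measured against a human-labelled database, the following is a minimal sketch and not the authors' evaluation code. The function name `score_extraction`, the 5% relative tolerance, and the toy critical-temperature values are all assumptions made for this example.

```python
def score_extraction(extracted, reference, rel_tol=0.05):
    """Score extracted {material: value} pairs against human labels.

    A value counts as correct when the material appears in the reference
    and the extracted number lies within rel_tol (relative) of the label.
    Hypothetical helper for illustration only.
    """
    correct = 0
    for material, value in extracted.items():
        truth = reference.get(material)
        if truth is not None and abs(value - truth) <= rel_tol * abs(truth):
            correct += 1
    # Precision: fraction of extracted entries that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: fraction of labelled entries that were recovered correctly.
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Toy critical temperatures in kelvin (illustrative values, not from the paper).
reference = {"MgB2": 39.0, "YBa2Cu3O7": 92.0, "Nb3Sn": 18.3}
extracted = {"MgB2": 39.0, "YBa2Cu3O7": 90.0, "NbTi": 10.0}

precision, recall = score_extraction(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A relative tolerance is used here because reported parameters often differ slightly between sources; an exact-match criterion would be a stricter alternative.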

Publication: Planned paper: Constructing Superconductivity and Magnetism Databases using Large Language Models

Presenters

  • Louis D Primeau

    University of Tennessee

Authors

  • Louis D Primeau

    University of Tennessee

  • Yang Zhang

    University of Tennessee

  • Adrian Del Maestro

    University of Tennessee, University of Tennessee-Knoxville