Constructing Superconductivity and Magnetism Databases using Large Language Models
ORAL
Abstract
Large language models can effectively process natural language and extract key information from even highly technical text. We use this capability to process several thousand materials science and condensed matter papers in order to obtain material parameters and experimental techniques. This work encompasses the entire large language model stack: first, we build a pipeline for extracting plain text from PDFs and address problems in correcting text produced by optical character recognition. We also discuss extending default tokenizers to cover more technical text, including commonly used abbreviations, special characters, and chemical formulas. Different prompting strategies are compared. We carefully examine errors in material parameter extraction. The effectiveness of the pipeline is tested against a human-labelled superconducting material database, which also provides a convenient source of training data. Finally, we compare several large language models of different sizes and fine-tune the models in order to speed up inference.
Publication: Planned paper: Constructing Superconductivity and Magnetism Databases using Large Language Models
Presenters
-
Louis D Primeau
University of Tennessee
Authors
-
Louis D Primeau
University of Tennessee
-
Yang Zhang
University of Tennessee
-
Adrian Del Maestro
University of Tennessee, University of Tennessee-Knoxville