Automated classification of nuclear science literature in NucScholar using Natural Language Processing (NLP)

Char Juin Chin

ORAL

Abstract

Researchers and evaluators who process nuclear bibliographic information start from the Nuclear Science References (NSR) database, a platform of critical importance to the nuclear data pipeline. NucScholar seeks to use NLP to make NSR more effective by automatically categorizing papers by subject matter, identifying keywords, and extracting data. This work evaluates how well different NLP techniques classify nuclear science papers as either experimental or theoretical. A sample of papers was preprocessed and vectorized using Latent Semantic Analysis (LSA) and doc2vec models, and classification algorithms such as decision trees were then applied to the resulting vectors. Logistic regression on doc2vec vectors performed best, with an accuracy above 85%, whereas the clustering algorithm underperformed regardless of how the input vectors were generated. This work contributes to the development of NucScholar, a new NLP-based engine for the automated classification of nuclear science literature.
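The vectorize-then-classify pipeline described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn and uses Latent Semantic Analysis (TF-IDF followed by truncated SVD) feeding a logistic regression classifier. The toy corpus, the function name `build_lsa_classifier`, and all parameter values are invented for illustration and are not taken from NucScholar. A doc2vec variant would replace the first two stages with a document-embedding model (for example gensim's Doc2Vec) and is not shown here.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration only (not NSR data).
docs = [
    "We measured the cross section with a germanium detector array",
    "The beam experiment measured gamma ray yields from the target",
    "Detector measurements of neutron capture yields on a thick target",
    "The half-life was measured with a detector after beam irradiation",
    "Shell model calculations predict the level scheme",
    "We calculate binding energies with density functional theory",
    "A theoretical model predicts energy levels from an effective interaction",
    "Ab initio theory calculations of the interaction and binding energies",
]
labels = ["experimental"] * 4 + ["theoretical"] * 4


def build_lsa_classifier(n_components=2, seed=0):
    """Hypothetical helper: TF-IDF -> truncated SVD (LSA) -> logistic regression."""
    return make_pipeline(
        TfidfVectorizer(stop_words="english"),  # bag-of-words with TF-IDF weights
        TruncatedSVD(n_components=n_components, random_state=seed),  # LSA projection
        LogisticRegression(),  # binary classifier on the LSA vectors
    )


model = build_lsa_classifier()
model.fit(docs, labels)

# Predictions on the training documents only show the pipeline runs end to end;
# a real evaluation would use held-out papers.
predictions = model.predict(docs)
probabilities = model.predict_proba(docs)
```

With a real sample of papers, accuracy would be estimated on a held-out split or by cross-validation rather than on the training documents, and the number of LSA components would be far larger than the value used for this toy corpus.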

Oct. 29, 2022, 11:30 AM – Oct. 29, 2022, 11:42 AM

Publication: Planned paper: The NucScholar project: an AI-powered archiving and search engine for nuclear-science literature

Presenters

Char Juin Chin

UC Berkeley

Authors

Char Juin Chin

UC Berkeley