Large-Scale Signal Name Harmonization for Long-Term Fusion Experiment Archives Using Language Models

Griffin Hon; Peter Steiner; Azarakhsh Jalalvand; Egemen Kolemen

POSTER

Abstract

Across decades of DIII-D tokamak operations, the names of hundreds of diagnostics have changed because of evolving experimental setups, personnel turnover, and undocumented naming conventions. These inconsistencies, which often involve minor edits, abbreviation shifts, or formatting anomalies, have made consistent signal retrieval across time nearly impossible. This creates a major bottleneck in compiling data to train robust multimodal AI systems for diagnostic reconstruction, prediction, and optimization, which require coherent signal inputs spanning thousands of experiments stored in the MDSplus data management system (Fredian & Stillerman, 2002).

We present a scalable machine learning-based pipeline for canonicalizing diagnostic and actuator names across historical MDSplus files. The method begins with high-throughput preprocessing and alphanumeric token normalization. Signal names are then embedded using pretrained transformer models such as MiniLM (Wang et al., 2020), clustered by semantic similarity using the Facebook AI Similarity Search (FAISS) library (Douze et al., 2024), and filtered using token consistency checks. The most recent name in each cluster becomes the canonical identifier.
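The pipeline stages above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: character-trigram count vectors stand in for the transformer embeddings, a brute-force pairwise loop stands in for the FAISS similarity search, and the token consistency check is assumed to mean "numeric tokens must agree". The records, the shot numbers used to decide recency, the threshold value, and all function names are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical records: (signal name as archived, last shot in which it appears).
RECORDS = [
    ("ECE_TE_01", 150000),
    ("ece-te01", 165000),
    ("ECE_TEMP_1", 190000),
    ("ECE_TE_02", 190000),
    ("ip_", 150000),
    ("IP", 190000),
]

SIM_THRESHOLD = 0.5  # assumed value; would need tuning on real data


def normalize(name):
    """Lowercase, split into letter and digit tokens, drop leading zeros."""
    tokens = re.findall(r"[a-z]+|\d+", name.lower())
    return [str(int(t)) if t.isdigit() else t for t in tokens]


def embed(tokens):
    """Stand-in embedding: counts of character trigrams of the joined tokens."""
    text = "^" + "".join(tokens) + "$"
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def numbers_match(t1, t2):
    """Token consistency check (assumed form): numeric tokens must agree."""
    return [t for t in t1 if t.isdigit()] == [t for t in t2 if t.isdigit()]


def build_canonical_map(records, threshold=SIM_THRESHOLD):
    """Cluster similar names and map each one to the newest name in its cluster."""
    tokens = [normalize(name) for name, _ in records]
    vectors = [embed(t) for t in tokens]
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair that is similar enough and passes the consistency check.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            similar = cosine(vectors[i], vectors[j]) >= threshold
            if similar and numbers_match(tokens[i], tokens[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)

    canonical = {}
    for members in clusters.values():
        newest_name = max(members, key=lambda r: r[1])[0]  # most recent name wins
        for name, _ in members:
            canonical[name] = newest_name
    return canonical


CANONICAL = build_canonical_map(RECORDS)
```

In this toy data, the three channel-1 ECE variants collapse onto the newest spelling, while `ECE_TE_02` stays separate because its numeric token differs, even though its trigram similarity to the others exceeds the threshold. The pairwise loop scales quadratically, which is why an approximate or indexed nearest-neighbor search is the natural choice at archive scale.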

Initial tests on 50,000 randomly sampled signals yield high-precision clusters and expose recurring renaming patterns. The resulting name map enables standardized signal retrieval and supports multimodal AI pipelines and temporal models for fusion experiments.

Presenters

Griffin Hon

Authors

Griffin Hon
Peter Steiner, Princeton University
Azarakhsh Jalalvand, Princeton University
Egemen Kolemen, Princeton University