Large-Scale Signal Name Harmonization for Long-Term Fusion Experiment Archives Using Language Models

POSTER

Abstract

Across decades of DIII-D tokamak operations, the names of hundreds of diagnostics have changed because of evolving experimental setups, personnel turnover, and undocumented naming conventions. These inconsistencies, which often involve minor edits, abbreviation shifts, or formatting anomalies, have made consistent signal retrieval across time nearly impossible. The result is a major bottleneck in compiling data to train robust multimodal AI systems for diagnostic reconstruction, prediction, and optimization, all of which require coherent signal inputs spanning thousands of experiments stored in the MDSplus data system (Fredian & Stillerman, 2002).

We present a scalable machine-learning-based pipeline for canonicalizing diagnostic and actuator names across historical MDSplus files. The method begins with high-throughput preprocessing and alphanumeric token normalization. Signal names are then embedded using pretrained transformer models such as MiniLM (Wang et al., 2020), clustered by semantic similarity with the Facebook AI Similarity Search (FAISS) library (Douze et al., 2024), and filtered using token consistency checks. The most recent name in each cluster becomes the canonical identifier.
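The stages described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. To stay runnable with the standard library alone, it swaps character trigram vectors in for the MiniLM embeddings and an exhaustive pairwise cosine comparison in for the FAISS similarity search. The signal names, shot numbers, the 0.5 similarity threshold, the rule that numeric tokens must match exactly, and the use of the latest shot number to decide which name is "most recent" are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def normalize(name):
    """Uppercase, split into letter runs and digit runs, drop leading zeros."""
    tokens = re.findall(r"[A-Z]+|[0-9]+", name.upper())
    return [str(int(t)) if t.isdigit() else t for t in tokens]


def embed(tokens, n=3):
    """Unit-length character n-gram vector (stand-in for a MiniLM embedding)."""
    text = "^" + "".join(tokens) + "$"
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    norm = math.sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def tokens_consistent(a, b):
    """Assumed consistency check: numeric tokens (e.g. channel numbers) must match."""
    def digits(tokens):
        return [t for t in tokens if t.isdigit()]
    return digits(a) == digits(b)


def canonicalize(records, threshold=0.5):
    """Map every name in (name, shot) records to the newest name in its cluster."""
    last_seen = {}
    for name, shot in records:
        last_seen[name] = max(shot, last_seen.get(name, shot))
    names = list(last_seen)
    tokens = {n: normalize(n) for n in names}
    vectors = {n: embed(tokens[n]) for n in names}

    # Union-find over names; the O(n^2) loop stands in for a FAISS neighbor search.
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            similar = cosine(vectors[a], vectors[b]) >= threshold
            if similar and tokens_consistent(tokens[a], tokens[b]):
                parent[find(a)] = find(b)

    # The name with the highest shot number in each cluster becomes canonical.
    newest = {}
    for n in names:
        root = find(n)
        if root not in newest or last_seen[n] > last_seen[newest[root]]:
            newest[root] = n
    return {n: newest[find(n)] for n in names}


# Hypothetical signal names and shot numbers, for illustration only.
records = [
    ("ECE_CH01", 120000), ("ecech1", 150000), ("ECE_CHAN_1", 165000),
    ("ECE-CH-1", 180000), ("ECE_CH02", 175000), ("ECE-CH-2", 181000),
    ("IP", 181000),
]
mapping = canonicalize(records)
for old, new in mapping.items():
    print(f"{old} -> {new}")
```

In this toy example the channel-1 and channel-2 variants are close in the trigram space, so the numeric-token check is what keeps them in separate clusters. At archive scale the pairwise loop would be replaced by an approximate or exact nearest-neighbor index.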

Initial tests on 50,000 randomly sampled signals show high-precision clustering and reveal recurring renaming patterns. The resulting name map enables standardized signal retrieval and supports multimodal AI pipelines and temporal models for fusion experiments.

Presenters

  • Griffin Hon

Authors

  • Griffin Hon

  • Peter Steiner

    Princeton University

  • Azarakhsh Jalalvand

    Princeton University

  • Egemen Kolemen

    Princeton University