Investigation of the Inter-Rater Reliability between ChatGPT and Human Raters in Qualitative Analysis

ORAL

Abstract

Qualitative analysis in science education is typically limited to small datasets because it is time-intensive. Moreover, establishing the reliability of the findings requires the services of a second human rater. Artificial intelligence tools such as ChatGPT could potentially substitute for human raters if they can be shown to achieve high reliability relative to human ratings. This study investigated the inter-rater reliability of ChatGPT in rating audio transcripts that had been coded manually in an earlier study. Participants were 14 undergraduate student groups from a university in the midwestern United States who discussed problem-solving strategies for a project. We used prompt engineering techniques to have ChatGPT replicate the coding process described by the author of the earlier study, and we calculated Cohen's kappa to quantify the inter-rater reliability between ChatGPT and the human rater. We present our preliminary findings, which show satisfactory levels of reliability, suggesting that qualitative researchers can leverage AI tools like ChatGPT to analyze large datasets efficiently.
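As a minimal illustration of the reliability calculation mentioned above, the sketch below computes Cohen's kappa, defined as (p_o - p_e) / (1 - p_e) where p_o is observed agreement and p_e is chance agreement, between a human rater's codes and ChatGPT's codes for the same transcript segments. The code categories and ratings shown are hypothetical placeholders, and scikit-learn's cohen_kappa_score is used here only as one common implementation; it is not necessarily the tool used in this study.

    # Minimal sketch: Cohen's kappa between a human rater and ChatGPT.
    # The codes below are hypothetical placeholders, not data from this study.
    from sklearn.metrics import cohen_kappa_score

    # One code per transcript segment, assigned independently by each rater.
    human_codes   = ["planning", "evaluating", "planning", "executing", "evaluating"]
    chatgpt_codes = ["planning", "evaluating", "executing", "executing", "evaluating"]

    kappa = cohen_kappa_score(human_codes, chatgpt_codes)
    print(f"Cohen's kappa: {kappa:.2f}")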

Presenters

  • Nikhil Borse

Purdue University - West Lafayette

Authors

  • Nikhil Borse

Purdue University - West Lafayette

  • Sean Savage

Purdue University - West Lafayette

  • Ravishankar Chatta Subramaniam

Purdue University - West Lafayette

  • N. Sanjay Rebello

Purdue University - West Lafayette