Finding Structure in the ArXiv

ORAL

Abstract

We applied machine learning techniques to the full text of arXiv articles and report a meaningful low-dimensional representation of this large dataset. Using word2vec, Google's open-source implementation of the continuous skip-gram model, we map the vocabulary used in scientific articles to a Euclidean vector space that preserves semantic and syntactic relationships between words. This representation allowed us to develop techniques for automatically characterizing articles, finding similar articles and authors, and segmenting articles into their relevant sections, among other applications.
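One downstream use the abstract mentions, finding similar articles, can be sketched in a few lines: represent each article by the mean of its word vectors and rank articles by cosine similarity. This is a minimal toy illustration, not the authors' actual pipeline; the hand-picked 3-dimensional vectors below stand in for real word2vec embeddings trained on the arXiv full text.

```python
import numpy as np

# Hypothetical toy word vectors; in practice these would come from
# a word2vec model trained on the arXiv corpus.
vecs = {
    "quantum": np.array([1.0, 0.1, 0.0]),
    "entanglement": np.array([0.9, 0.2, 0.1]),
    "neural": np.array([0.0, 1.0, 0.2]),
    "network": np.array([0.1, 0.9, 0.3]),
}

def article_vector(words):
    """Represent an article as the mean of its known word vectors."""
    return np.mean([vecs[w] for w in words if w in vecs], axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a1 = article_vector(["quantum", "entanglement"])
a2 = article_vector(["neural", "network"])
a3 = article_vector(["quantum", "network"])

# a1 is closer to a3 (they share "quantum") than to a2.
```

In the averaged-vector space, articles sharing vocabulary end up near each other, which is the basis for similarity search over articles and authors.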

Authors

  • Alexander Alemi

    Cornell University

  • Ricky Chachra

    Cornell University

  • Paul Ginsparg

    Cornell University

  • James Sethna

    Cornell University