Finding Structure in the ArXiv

ORAL

Abstract

We applied machine learning techniques to the full text of arXiv articles and report a meaningful low-dimensional representation of this large dataset. Using word2vec, Google's open-source implementation of the continuous skip-gram model, we map the vocabulary used in scientific articles to a Euclidean vector space that preserves semantic and syntactic relationships between words. This representation allowed us to develop techniques for automatically characterizing articles, finding similar articles and authors, and segmenting articles into their relevant sections, among other applications.
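One downstream use the abstract mentions, finding similar articles, can be sketched in a few lines: represent each article by the mean of its word vectors and rank articles by cosine similarity. This is a minimal toy illustration, not the authors' actual pipeline; the hand-picked 3-dimensional vectors below stand in for real word2vec embeddings trained on the arXiv full text.

```python
import numpy as np

# Hypothetical toy word vectors; in practice these would come from
# a word2vec model trained on the arXiv corpus.
vecs = {
    "quantum": np.array([1.0, 0.1, 0.0]),
    "entanglement": np.array([0.9, 0.2, 0.1]),
    "neural": np.array([0.0, 1.0, 0.2]),
    "network": np.array([0.1, 0.9, 0.3]),
}

def article_vector(words):
    """Represent an article as the mean of its known word vectors."""
    return np.mean([vecs[w] for w in words if w in vecs], axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a1 = article_vector(["quantum", "entanglement"])
a2 = article_vector(["neural", "network"])
a3 = article_vector(["quantum", "network"])

# a1 is closer to a3 (they share "quantum") than to a2.
```

In the averaged-vector space, articles sharing vocabulary end up near each other, which is the basis for similarity search over articles and authors.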

Authors

  • Alexander Alemi

    Cornell University

  • Ricky Chachra

    Cornell University

  • Paul Ginsparg

    Cornell University

  • James Sethna

    Cornell University