“Efficient Estimation of Word Representations in Vector Space” was submitted to arXiv in January 2013 by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean of Google. It introduced word2vec, a method for learning dense numerical representations of words, now called word embeddings, that became a staple of natural language processing.
The core idea is that a word’s meaning can be inferred from the company it keeps. word2vec trains a simple, shallow neural network on enormous amounts of plain text with a self-supervised task: predict a word from its neighbors (the continuous bag-of-words, or CBOW, architecture), or predict the neighbors from a word (the skip-gram architecture). No human labels are needed, because the text itself supplies the targets. After training, each word is represented by a vector of a few hundred numbers, and words used in similar contexts end up near each other in that space.
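To make the self-supervised task concrete, here is a minimal sketch of how training examples can be cut out of raw text with a sliding window. It only builds the (context, word) examples; it does not train a network. The function name `training_pairs`, the `window` parameter, and the sample sentence are assumptions made for illustration and do not come from the paper.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, center_word) examples from a list of tokens.

    The context is up to `window` words on each side of the center word.
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        yield context, center


# Illustrative sentence, not data from the paper.
tokens = "the quick brown fox jumps".split()
examples = list(training_pairs(tokens, window=1))

# "Predict a word from its neighbors": the context is the input,
# the center word is the target.
word_from_neighbors = examples

# "Predict the neighbors from a word": the center word is the input,
# each neighbor is a separate target.
neighbors_from_word = [
    (center, neighbor)
    for context, center in examples
    for neighbor in context
]
```

Every example comes straight from the text, which is why no human labeling is required: the words themselves serve as both inputs and targets.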
The striking and widely quoted result was that these vectors captured semantic and syntactic relationships as simple vector arithmetic. The vector for “king” minus “man” plus “woman” landed near the vector for “queen,” and similar analogies held for capitals and countries and for verb tenses. The paper’s other contribution was efficiency: the architectures were simple enough to train on billions of words, and the authors report learning high-quality vectors from a 1.6-billion-word data set in less than a day. That made good embeddings practical well beyond large research labs.
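The analogy arithmetic can be sketched in a few lines. The vectors below are hand-made three-dimensional toys chosen so the example works; they are not trained embeddings, and real word2vec vectors have a few hundred dimensions. The helper names `cosine` and `analogy` are hypothetical. Excluding the three query words from the candidates is an assumption here, though it is a common convention when evaluating analogies.

```python
import math

# Hand-made toy vectors (hypothetical, NOT trained embeddings).
# The dimensions loosely stand for: royalty, maleness, femaleness.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c).

    The query words a, b, and c are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

With trained embeddings the same search runs over the whole vocabulary, and the nearest remaining vector is taken as the answer.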
The main limitation is that word2vec assigns each word a single fixed vector, regardless of context, so it cannot distinguish the “bank” of a river from the “bank” that holds money. That limitation was addressed a few years later by context-aware models built on the Transformer, such as BERT, which produce a different representation for a word depending on the sentence around it. Even so, word2vec marked the moment embeddings became a mainstream tool.