Penn Treebank

The Penn Treebank is a corpus of English text annotated with part-of-speech tags and full syntactic structure, built at the University of Pennsylvania by Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. The most widely used release, Treebank-3 (LDC catalog number LDC99T42, 1999), contains over a million words of annotated material, including Wall Street Journal articles from 1989, Brown Corpus text, Switchboard telephone conversations, and ATIS material. Of the Wall Street Journal material, 2,499 stories were given detailed syntactic bracketing.

Its bracketing style was designed so that predicate-argument structure could be extracted from the parse trees, which made the corpus directly useful for training and evaluating parsers. For roughly two decades, “the Penn Treebank” was effectively synonymous with the standard parsing benchmark: a particular train/development/test split of the Wall Street Journal sections became the fixed comparison point that nearly every syntactic parser was measured against.
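To make the bracketing idea concrete, here is a minimal sketch of reading one bracketed tree into a nested structure and recovering its words. The function names `parse_brackets` and `leaves` are hypothetical, the example sentence is invented for illustration and is not taken from the corpus, and the sketch assumes every opening bracket is followed by a label (actual treebank files may wrap each tree in an extra unlabeled bracket, which this does not handle).

```python
def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumption: every "(" is immediately followed by a label token.
    """
    # Pad brackets with spaces so a plain split() yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]          # constituent label or part-of-speech tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                     # consume the closing ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of the tree in sentence order."""
    _label, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Invented example sentence, not a corpus excerpt.
tree = parse_brackets("(S (NP (DT The) (NN cat)) (VP (VBD sat)))")
print(tree[0])        # label of the root constituent
print(leaves(tree))   # words in order
```

Because the tree is kept as nested tuples, a parser's output in the same format can be compared against the hand-annotated tree constituent by constituent, which is the basis for benchmark-style evaluation.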

The same Wall Street Journal text also became a standard language-modeling benchmark, with results reported as perplexity, and was a common testbed for recurrent and LSTM language models in the years before large web-scale corpora and Transformers. For business readers, the Penn Treebank is a clear case of how expensive, expert hand-annotation, with linguists bracketing sentences one at a time, created a shared measuring stick that an entire research community could build on for years.
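For readers unfamiliar with the perplexity figure mentioned above, the sketch below shows the standard definition: the exponential of the average negative log-probability a model assigns to each token of held-out text. The function name `perplexity` and the input values are illustrative assumptions, not output from any real model or from the corpus.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Computes exp(-(1/N) * sum(log p_i)); lower is better.
    """
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / n)


# Hypothetical model that gives every token probability 1/4:
# it is as uncertain as a uniform choice among 4 words,
# so its perplexity comes out at approximately 4.
uniform_guess = [math.log(0.25)] * 3
print(perplexity(uniform_guess))
```

Read this way, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words at each position.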
