Language Corpora

Corpora are available in the linguistics department office. Contact Nishi to check out a listed corpus. Please note that you must sign the user consent form before using some corpora.

ACL/DCI

The ACL Data Collection Initiative disc contains text from: the Wall Street Journal, copyright 1987, 1988, 1989, provided by Dow Jones, Inc.; the Collins English Dictionary, copyright 1979, William Collins Sons Co., Ltd.; scientific abstracts provided by the U.S. Department of Energy; and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania, copyright 1990, 1991, University of Pennsylvania.
• LDC: LDC93T1
• Language: English
• Media: 1 CD
• Consent form: acldci.pdf
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93T1

American National Corpus, First Release (2003)

The first 10 million words of the 100-million-word American National Corpus project. The release is not balanced, and its POS tagging, encoded in XML, has not been hand-validated. Headers include information such as domain, subdomain, subject, audience, and medium.
• Language: American English
• Genre: balanced
• Type: Written & Spoken
• Size: 10 million words (portion of the planned 100-million-word corpus)
• Media: 1 CD
• Consent form: anc.pdf
http://americannationalcorpus.org/

BLLIP 1987-89 WSJ Corpus Release 1

The WSJ Corpus was compiled by the Brown Laboratory for Linguistic Information Processing (BLLIP) and is POS-tagged and parsed in Treebank style. It covers the three-year Wall Street Journal (WSJ) collection from the ACL/DCI corpus, 1987-1989.
• LDC: LDC2000T43
• Language: English
• Genre: News
• Type: Written
• Size: 30 million words
• Media: 2 CD
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T43

CHILDES

CHILDES (TalkBank) is a collection of parsed language data available free of charge to all researchers. The CHILDES project is led by Brian MacWhinney. All users must agree to the CHILDES user agreement. Older versions of CHILDES were transcribed in the CHAT format; an XML version is now available.
• Language: Various languages
• Type: Spoken
• Media: online; DVD (downloaded xml version)
• Consent form: childes.pdf
http://childes.psy.cmu.edu/

English Gigaword

The English Gigaword corpus is a collection of previously published English corpora. The main sources are the Agence France Presse English Service (170 million words), the Associated Press Worldstream English Service (530 million words), The New York Times Newswire Service (910 million words), and the Xinhua News Agency English Service (131 million words).
• LDC: LDC2003T05
• Language: English
• Genre: News
• Type: Written
• Size: 1.7 billion words
• Media: 1 DVD (only bigram data)
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05

Google 5-gram (Web 1T 5-gram Version 1)

This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
• LDC: LDC2006T13
• Language: English
• Genre: web
• Type: written
• Size: 1 trillion words
• Media: 6 DVD
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
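
To make the idea of n-grams with frequency counts concrete, here is a minimal Python sketch that counts unigrams through five-grams in a token list. The function name `ngram_counts` and the toy sentence are assumptions made for illustration; they are not part of the Web 1T distribution and do not reflect its file format.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Toy input, invented for illustration (not taken from the Web 1T data).
tokens = "the cat sat on the mat and the cat slept".split()
counts = ngram_counts(tokens)

print(counts[1][("the",)])        # frequency of the unigram "the"
print(counts[2][("the", "cat")])  # frequency of the bigram "the cat"
print(len(counts[5]))             # number of distinct five-grams
```

A language model built on such counts would typically estimate the probability of a word from the frequency of the n-gram that ends in it, relative to the frequency of the shorter n-gram that precedes it.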

ICLE v.2

International Corpus of Learner English v2.
Installed on several computers within the Program. Software and license are in the possession of the GC's Information Technology Dept.

Mandarin Chinese News Text

The Mandarin News Corpus includes text from various journalistic sources: newspaper text from Renmin Ribao (People's Daily), radio scripts from China Radio International, and newswire text from the Xinhua newswire service.
• LDC: LDC95T13
• Language: Chinese (Mandarin)
• Genre: News
• Type: Written
• Size: 250 million GB-encoded text characters
• Media: 1 CD
• Consent form: mandarin.pdf
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T13

Reuters Corpora (2004)

Reuters Corpus Vol.1 (810,000 news stories) and Vol.2 (multilingual/parallel corpus in Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish)
• Language: English and 15 other languages
• Genre: News
• Type: Written (parallel)
• Size: unknown (810,000 news stories; about 600M)
• Media: 3 CD
• Consent form: reuters.pdf
http://trec.nist.gov/data/reuters/reuters.html

Santa Barbara Corpus of Spoken American English Part-II

Santa Barbara Corpus of Spoken American English Part-II is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.
• LDC: LDC2003S06
• Language: English
• Genre: balanced
• Type: Spoken
• Size: 6 hours
• Media: 1 DVD
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S06

The British National Corpus (BNC)

A 100-million-word collection of spoken and written British English. The sources are carefully selected from various genres to represent a good sample of present-day British English.
• Language: British English
• Genre: Balanced
• Type: Written (90m) & Spoken (10m)
• Size: 100 million words
• Media: missing
• Consent form: bnc.pdf
http://www.natcorp.ox.ac.uk/

The Brown Corpus

The Brown Corpus was compiled by W. N. Francis and H. Kucera at Brown University in 1976.
• Language: English
• Genre: Balanced
• Type: Written
• Size: 1 million words
• Media: 1 CD

The Lancaster-Oslo/Bergen (LOB) corpus

The LOB corpus is a 1-million-word corpus of British English (in a format parallel to the Brown Corpus), compiled by Stig Johansson, Eric Atwell, Roger Garside, and Geoffrey Leech in 1986 (POS-tagged).
• Language: British English
• Genre: Balanced
• Type: Written
• Size: 1 million words
• Media: missing
http://nora.hd.uib.no/whatis.html

U.S. Names Database (with frequency information)

The Census Bureau receives numerous requests to supply information on name frequency. In an effort to comply with those requests, the Census Bureau has embarked on a names list project involving a tabulation of names from the 1990 Census. These files contain only the frequency of a given name, no specific individual information.
• Language: English
• Genre: English names
• Size: 2M
• Media: CD; online
http://www.census.gov/genealogy/names/names_files.html

WordNet

WordNet is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
• Language: English
• Genre: Balanced
• Type: NA
• Size: 11M (compressed)
• Media: online; CD
http://wordnet.princeton.edu/
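
The synset structure described above can be pictured with a toy Python model. This is only a sketch under stated assumptions: the `Synset` class, the helper functions, and the three example entries are hypothetical and hand-written for illustration. They are neither WordNet data nor WordNet's own API.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous words expressing one concept (toy model)."""
    name: str
    lemmas: frozenset
    hypernyms: list = field(default_factory=list)  # links to more general synsets


def hypernym_path(synset):
    """Follow the first hypernym link upward until none is left."""
    path = [synset.name]
    while synset.hypernyms:
        synset = synset.hypernyms[0]
        path.append(synset.name)
    return path


def synsets_for(word, inventory):
    """Return every synset in the inventory that contains the word."""
    return [s for s in inventory if word in s.lemmas]


# Hand-made entries for illustration only.
animal = Synset("animal", frozenset({"animal", "creature"}))
canine = Synset("canine", frozenset({"canine", "canid"}), [animal])
dog = Synset("dog", frozenset({"dog", "domestic dog"}), [canine])
inventory = [animal, canine, dog]

print(hypernym_path(dog))
print([s.name for s in synsets_for("creature", inventory)])
```

Navigating the network amounts to following links such as `hypernyms` from one synset to another, which is what makes the resource convenient for computing relatedness between words.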

More Corpora

The following corpora are often used in corpus-based research or have been used in our students' research.

  • American National Corpus, Second Release
    ANC Second Release contains over 20 million words: 10+ million words added in the Second Release, and a new corrected and validated version of the 11 million word ANC First Release. The Second Release also contains software for searching and retrieving multiple stand-off annotations.
    • LDC: LDC2005T35
    • Language: English
    • Genre: Balanced
    • Type: Written & Spoken
    • Size: NA
    • Media: 2 DVD
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35
 
  • CELEX2
    The CELEX corpus contains ASCII versions of the CELEX lexical database of English (v.2.5), Dutch (v.3.1), and German (v.2.0). For each language, the CD-ROM contains detailed information on: orthography (variations in spelling, hyphenation); phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress); morphology (derivational and compositional structure, inflectional paradigms); syntax (word class, word class-specific subcategorizations, argument structures); and word frequency (summed word and lemma counts, based on recent and representative text corpora).
    • LDC: LDC96L14
    • Language: English, Dutch, and German
    • Genre: lexical database
    • Type: NA
    • Size: NA
    • Media: CD
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14
 
  • FrameNet
    The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results. The major product of this work, the FrameNet lexical database, currently contains more than 8,900 lexical units, more than 6,100 of which are fully annotated, in more than 625 semantic frames, exemplified in more than 135,000 annotated sentences. It has gone through three releases, and is now in use by hundreds of researchers, teachers, and students around the world. Active research projects are now seeking to produce comparable frame-semantic lexicons for other languages and to devise means of automatically labeling running text with semantic frame information.
    • Language: English
    • Genre: Balanced
    • Type: NA
    • Size: NA
    • Media: online
    http://framenet.icsi.berkeley.edu/
 
  • Kyoto Text Corpus
    Japanese newspaper corpus. About 20,000 news sources from Asahi-shinbun, January 1-17, 1995. The text was tagged and parsed automatically with the Japanese tagging system JUMAN and the Japanese parser KNP, then corrected manually.
    • Language: Japanese
    • Genre: News
    • Type: Written
    • Size: 20,000 news
    • Media: 1 CD
    http://www.kc.t.u-tokyo.ac.jp/nl-resource/corpus.html
 
  • NAIST Text Corpus version 1.0 beta
    The tag information for the Kyoto corpus release 3.0. The tagged information includes (a) the relation between the surface case markers (ga, wo, ni) and predicate, (b) the relation between the surface case marker and NPs and (c) the NP relations.
    • Language: Japanese
    • Genre: News
    • Type: Written
    • Size: 40,000 sentences
    • Media: online
    http://cl.naist.jp/~ryu-i/coreference_tag.html
 
  • NICT JLE Corpus
    A 2-million-word learner corpus (second-language corpus), compiled and manually tagged (for POS and grammar errors) by the National Institute of Information and Communications Technology, Japan.
    • Language: Japanese/English
    • Genre: general
    • Type: spoken
    • Size: 2 million words (from 1,300 Japanese EFL speakers)
    • Media: 1 CD
 
  • Proposition Bank I
    This is a semantic annotation of the Wall Street Journal section of Treebank-2. More specifically, each verb occurring in the Treebank has been treated as a semantic predicate and the surrounding text has been annotated for arguments and adjuncts of the predicate. The verbs have also been tagged with coarse grained senses and with inflectional information. This work was done in the Computer and Information Sciences Department at the University of Pennsylvania.
    • LDC: LDC2004T14
    • Language: English
    • Genre: news (WSJ)
    • Type: Written
    • Size: 5.5M
    • Media: online; CD
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14
 
  • SAID (Syntactically Annotated Idiom Dataset)
    The purpose of this corpus is to provide data for investigating the structural configurations in which English idioms are typically found. The assumption was that, since idioms are phrasal lexical items (PLIs), they would have idiosyncratic structural properties. For studying the structural properties of phrasal lexical items, the data is more useful if it is syntactically annotated.
    • LDC: LDC2003T10
    • Language: English
    • Genre: Idioms
    • Type: Written
    • Size: NA
    • Media: NA
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T10
 
  • Santa Barbara Corpus of Spoken American English Part-I
    The Santa Barbara Corpus of Spoken American English is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.
    • LDC: LDC2000S85
    • Language: English
    • Genre: Balanced
    • Type: Spoken
    • Size: unknown
    • Media: 3 CD
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000S85
 
  • Switchboard-1 Release 2
    The Switchboard-1 Telephone Speech Corpus (LDC97S62) was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-3. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set and all copies of the first pressing have been distributed.
    • LDC: LDC97S62
    • Language: English
    • Genre: Balanced
    • Type: Spoken
    • Size: unknown
    • Media: 23 CD
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62
 
  • Switchboard-2 Phase II
    SWB-2 Phase II consists of 4,472 five-minute telephone conversations involving 679 participants. This corpus was collected by the Linguistic Data Consortium (LDC) in support of a project on Speaker Recognition sponsored by the U.S. Department of Defense.
    • LDC: LDC99S79
    • Language: English
    • Genre: Balanced
    • Type: Spoken
    • Size: about 370 hours
    • Media: 6 DVD
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99S79
 
  • The Corpus of Spontaneous Japanese
    CSJ, or Corpus of Spontaneous Japanese, is a large-scale annotated corpus of spontaneous Japanese. CSJ is an outcome of Japan's national priority-area research project known as Spontaneous Speech: Corpus and Processing Technology (1999-2003) supported by the Ministry of Education, Culture, Sports, Science and Technology. This is a collaborative work of the National Institute for Japanese Language (NIJLA), the Communications Research Laboratory (CRL), and the Tokyo Institute of Technology (TITech). The project supervisor is professor Sadaoki Furui of TITech.
    • LDC: NA
    • Language: Japanese
    • Genre: Not known
    • Type: spoken
    • Size: Not known
    • Media: Not known
    • Consent form: CSJ.pdf
    http://www.kokken.go.jp/katsudo/seika/corpus/public/
 
  • TimeBank 1.2
    The TimeBank Corpus contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times. The annotation follows the TimeML 1.2.1 specification. TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships. For a detailed description of TimeML, see the TimeML 1.2.1 Specification and Guidelines.
    • LDC: LDC2006T08
    • Language: English
    • Genre: NA
    • Type: NA
    • Size: NA
    • Media: CD; online
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08
 
  • TIMIT Acoustic-Phonetic Continuous Speech Corpus
    The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST).
    • LDC: LDC93S1
    • Language: English
    • Genre: Balanced
    • Type: Spoken
    • Size: unknown
    • Media: 1 CD
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1
 
  • Treebank-3 (Penn TreeBank 3)
    The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file via ftp and provide the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.
    • LDC: LDC99T42
    • Language: English
    • Genre: News
    • Type: Written
    • Size: unknown
    • Media: 1 CD
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42
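
The TimeML tag types named in the TimeBank 1.2 entry above (TIMEX3, EVENT, SIGNAL) can be read with Python's standard XML parser. The sketch below assumes inline XML annotation; the sample sentence and its attribute values are invented for illustration and are not copied from TimeBank or from the TimeML specification.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment in the spirit of TimeML; not taken from TimeBank.
sample = (
    '<s>The company <EVENT eid="e1">announced</EVENT> the merger '
    '<SIGNAL sid="s1">on</SIGNAL> <TIMEX3 tid="t1">Monday</TIMEX3>.</s>'
)

root = ET.fromstring(sample)

# Count how often each annotation tag occurs (excluding the wrapper element).
tag_counts = Counter(el.tag for el in root.iter() if el is not root)

# Collect the annotated event and time expressions.
events = [el.text for el in root.iter("EVENT")]
times = [el.text for el in root.iter("TIMEX3")]

# Recover the running text with all annotation stripped.
plain_text = "".join(root.itertext())

print(tag_counts)
print(events, times)
print(plain_text)
```

Link annotations, which relate events and times to one another, would be read the same way by iterating over the corresponding link elements and looking up the identifiers they reference.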

Penn Parsed Corpora of Historical English (PPCHE)

The Penn Corpora of Historical English, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE), are running texts and text samples of British English prose across its history - from the earliest Middle English documents up to the First World War. The texts come in three forms: simple text, part-of-speech tagged text and syntactically annotated text. The syntactic annotation (parsing) permits searching not only for words and word sequences, but also for syntactic structure. All of the annotation has been carefully checked by expert human annotators for accuracy and consistency. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language, and they are publicly available to individuals, research groups and libraries.
 
• Language: English
• Genre: Historical
• Type: Written
• Size: Unknown
• Media: CD and online
direct link to online resource
More information on the PPCHE website.