Language Corpora
A selection of language corpora is available on physical media to students in the Linguistics programs and can be checked out from the program office with appropriate approval.
- ANC (First release) Linguistic Data Consortium
- The Brown Corpus
- Mandarin Chinese News Text Corpus
- Santa Barbara Corpus of Spoken American English (Part II)
- Web 1T 5-gram (version 1)
- Emile Corpus
- Linguistic Data Consortium (LDC): Contact Prof. Kyle Gorman
Please contact Nishi Bissoondial, Assistant Program Officer, to request access to these materials.
In addition, the following corpora have been recommended by program students and faculty and are available from external sources. Some corpora are available online and/or may require submission of a consent form for access.
The ACL Data Collection Initiative disc contains text from: Wall Street Journal, copyright 1987, 1988, 1989, provided by Dow Jones, Inc.; the Collins English Dictionary, Copyright 1979, William Collins Sons Co., Ltd.; scientific abstracts provided by the U.S. Department of Energy and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania, copyright 1990,1991, University of Pennsylvania.
- LDC: LDC93T1
- Language: English
- Media: 1 CD
Consent form: acldci.pdf
Linguistic Data Consortium
The first 10 million words of the 100-million-word American National Corpus project. Non-balanced, non-hand-validated POS tagging in XML. Headers include such information as domain, subdomain, subject, audience, and medium.
- Language: American English
- Genre: balanced
- Type: Written & Spoken
- Size: 10 million words (portion of the planned 100-million word corpus)
- Media: 1 CD
Consent form: anc.pdf
American National Corpus
ANC Second Release contains over 20 million words: 10+ million words added in the Second Release, and a new corrected and validated version of the 11 million word ANC First Release. The Second Release also contains software for searching and retrieving multiple stand-off annotations.
• LDC: LDC2005T35
• Language: English
• Genre: Balanced
• Type: Written & Spoken
• Size: NA
• Media: 2 DVD
Linguistic Data Consortium
The WSJ Corpus was compiled by the Brown Laboratory for Linguistic Information Processing (BLLIP) and is POS-tagged and parsed in Treebank style. It covers the three-year Wall Street Journal (WSJ) collection from the ACL/DCI corpus, from 1987 to 1989.
• LDC: LDC2000T43
• Language: English
• Genre: News
• Type: Written
• Size: 30 million words
• Media: 2 CD
Linguistic Data Consortium
A 100-million-word collection of spoken and written British English. The sources are carefully selected from various genres to represent a good sample of present-day British English.
• Language: British English
• Genre: Balanced
• Type: Written (90m) & Spoken (10m)
• Size: 100 million words
• Media: missing
• Consent form: bnc.pdf
British National Corpus
The CELEX corpus contains ASCII versions of the CELEX lexical database of English (v.2.5), Dutch (v.3.1), and German (v.2.0). For each language, the CD-ROM contains detailed information on: orthography (variations in spelling, hyphenation); phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress); morphology (derivational and compositional structure, inflectional paradigms); syntax (word class, word class-specific subcategorizations, argument structures); word frequency (summed word and lemma counts, based on recent and representative text corpora)
• LDC: LDC96L14
• Language: English, Dutch, and German
• Genre: lexical database
• Type: NA
• Size: NA
• Media: CD
Linguistic Data Consortium
CHILDES (TalkBank) is a collection of parsed language data available to all researchers for free. The CHILDES project is led by Brian MacWhinney. All users of CHILDES must agree to the project's user agreement. The older version of CHILDES was tagged in CHAT, but an XML version is now available.
• Language: Various languages
• Type: Spoken
• Media: online; DVD (downloaded xml version)
• Consent form: childes.pdf
CHILDES
The Corpus of Contemporary American English (COCA) is the largest freely available corpus of English, and the only large and balanced corpus of American English. The corpus contains more than 450 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2012, and it is updated regularly.
The page that specifically describes the n-grams is n-grams.info.
N-gram data: Up to 155 million distinct strings -- searchable by word form and part of speech, and also by lemma.
Access the corpus at English-Corpora.org
CSJ, or Corpus of Spontaneous Japanese, is a large-scale annotated corpus of spontaneous Japanese. CSJ is an outcome of Japan's national priority-area research project known as Spontaneous Speech: Corpus and Processing Technology (1999-2003), supported by the Ministry of Education, Culture, Sports, Science and Technology. It is a collaborative work of the National Institute for Japanese Language (NIJLA), the Communications Research Laboratory (CRL), and the Tokyo Institute of Technology (TITech). The project supervisor is Professor Sadaoki Furui of TITech.
• LDC: NA
• Language: Japanese
• Genre: Not known
• Type: spoken
• Size: Not known
• Media: Not known
• Consent form: CSJ.pdf
The English Gigaword corpus is a collection of previously published English corpora. The main sources are the Agence France-Presse English Service (170 million words), the Associated Press Worldstream English Service (530 million), the New York Times Newswire Service (910 million), and the Xinhua News Agency English Service (131 million).
- LDC: LDC2003T05
- Language: English
- Genre: News
- Type: Written
- Size: 1.7 billion words
- Media: 1 DVD (only bigram data)
The Berkeley FrameNet project is creating an online lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results. The major product of this work, the FrameNet lexical database, currently contains more than 8,900 lexical units, more than 6,100 of which are fully annotated, in more than 625 semantic frames, exemplified in more than 135,000 annotated sentences. It has gone through three releases and is now in use by hundreds of researchers, teachers, and students around the world. Active research projects are now seeking to produce comparable frame-semantic lexicons for other languages and to devise means of automatically labeling running text with semantic frame information.
- Language: English
- Genre: Balanced
- Type: NA
- Size: NA
- Media: online
This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses. Counts: tokens, 1,024,908,267,229; sentences, 95,119,665,584; unigrams, 13,588,391; bigrams, 314,843,401; trigrams, 977,069,902; four-grams, 1,313,818,354; five-grams, 1,176,470,663.
- LDC: LDC2006T13
- Language: English
- Genre: web
- Type: written
- Size: 1 trillion words
- Media: 6 DVD
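As a rough illustration of what "n-grams and their observed frequency counts" means, the following minimal Python sketch counts unigrams through five-grams in a toy sentence. The function name `count_ngrams` and the whitespace tokenization are assumptions made for this example; they are not the tokenizer or file format used in the Web 1T release.

```python
from collections import Counter

def count_ngrams(tokens, max_n=5):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

# Toy input, split on whitespace (an assumption, not the Web 1T tokenizer).
tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens)

print(counts[1][("the",)])        # frequency of the unigram "the"
print(counts[2][("the", "cat")])  # frequency of the bigram "the cat"
print(len(counts[5]))             # number of distinct five-grams
```

On a corpus of web scale the same idea applies, but the counts would normally be read from the distributed files rather than recomputed.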
International Corpus of Learner English v2.
Installed on several computers within the Program. Software and license are in the possession of the GC's Information Technology Dept.
Japanese newspaper corpus. About 20,000 news sources from Asahi-shinbun, January 1-17, 1995. Tagging and parsing were done automatically with the Japanese tagging system JUMAN and the Japanese parser KNP, then manually corrected.
- Language: Japanese
- Genre: News
- Type: Written
- Size: 20,000 news
- Media: 1 CD
The Mandarin News Corpus includes text from various journalistic sources: newspaper text from Renmin Ribao (People's Daily), radio scripts from China Radio International, and newswire text from the Xinhua newswire service.
• LDC: LDC95T13
• Language: Chinese (Mandarin)
• Genre: News
• Type: Written
• Size: 250 million GB-encoded text characters
• Media: 1 CD
• Consent form: mandarin.pdf
The tag information for the Kyoto corpus release 3.0. The tagged information includes (a) the relation between the surface case markers (ga, wo, ni) and predicate, (b) the relation between the surface case marker and NPs and (c) the NP relations.
- Language: Japanese
- Genre: News
- Type: Written
- Size: 40,000 sentences
- Media: online
A 2-million-word learner corpus (second language corpus), compiled and manually tagged (for POS and grammar errors) by the National Institute of Information and Communications Technology, Japan.
- Language: Japanese/English
- Genre: general
- Type: spoken
- Size: 2 million words (by 1,300 Japanese EFL speakers)
- Media: 1 CD
The Penn Corpora of Historical English, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE), are running texts and text samples of British English prose across its history - from the earliest Middle English documents up to the First World War. The texts come in three forms: simple text, part-of-speech tagged text and syntactically annotated text. The syntactic annotation (parsing) permits searching not only for words and word sequences, but also for syntactic structure. All of the annotation has been carefully checked by expert human annotators for accuracy and consistency. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language, and they are publicly available to individuals, research groups and libraries.
- Language: English
- Genre: Historical
- Type: Written
- Size: Unknown
- Media: CD and online
More information on the PPCHE website.
This is a semantic annotation of the Wall Street Journal section of Treebank-2. More specifically, each verb occurring in the Treebank has been treated as a semantic predicate and the surrounding text has been annotated for arguments and adjuncts of the predicate. The verbs have also been tagged with coarse grained senses and with inflectional information. This work was done in the Computer and Information Sciences Department at the University of Pennsylvania.
- LDC: LDC2004T14
- Language: English
- Genre: news (WSJ)
- Type: Written
- Size: 5.5M
- Media: online; CD
Reuters Corpus Vol.1 (810,000 news stories) and Vol.2 (multilingual/parallel corpus in Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish)
• Language: English and 15 other languages
• Genre: News
• Type: Written (parallel)
• Size: unknown (810,000 news stories; about 600M)
• Media: 3 CD
• Consent form: reuters.pdf
Reuters Corpora
The purpose of this corpus is to provide data for investigating the structural configurations in which English idioms are typically found. The assumption was that, since idioms are phrasal lexical items (PLIs), they would have idiosyncratic structural properties. For studying the structural properties of phrasal lexical items, the data is more useful if it is syntactically annotated.
- LDC: LDC2003T10
- Language: English
- Genre: Idioms
- Type: Written
- Size: NA
- Media: NA
The Santa Barbara Corpus of Spoken American English is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.
- LDC: LDC2000S85
- Language: English
- Genre: Balanced
- Type: Spoken
- Size: unknown
- Media: 3 CD
Santa Barbara Corpus of Spoken American English Part-II is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.
- LDC: LDC2003S06
- Language: English
- Genre: balanced
- Type: Spoken
- Size: 6 hours
- Media: 1 DVD
The Switchboard-1 Telephone Speech Corpus (LDC97S62) was originally collected by Texas Instruments in 1990-91, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-93. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set, and all copies of the first pressing have been distributed.
- LDC: LDC97S62
- Language: English
- Genre: Balanced
- Type: Spoken
- Size: unknown
- Media: 23 CD
SWB-2 Phase II consists of 4,472 five-minute telephone conversations involving 679 participants. This corpus was collected by the Linguistic Data Consortium (LDC) in support of a project on Speaker Recognition sponsored by the U.S. Department of Defense.
- LDC: LDC99S79
- Language: English
- Genre: Balanced
- Type: Spoken
- Size: about 370 hours
- Media: 6 DVD
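The stated size can be sanity-checked from the conversation count. The short Python sketch below assumes every conversation runs exactly the nominal five minutes, which is a simplification; actual recording lengths may differ.

```python
# Figures from the description above; the exact five-minute length
# per conversation is an assumption made for this estimate.
conversations = 4472
minutes_each = 5

total_hours = conversations * minutes_each / 60
print(round(total_hours, 1))
```

The result is close to the "about 370 hours" listed under Size.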
The TimeBank Corpus contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times. The annotation follows the TimeML 1.2.1 specification. TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships. For a detailed description of TimeML, see the TimeML 1.2.1 Specification and Guidelines.
- LDC: LDC2006T08
- Language: English
- Genre: NA
- Type: NA
- Size: NA
- Media: CD; online
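To give a sense of how the tag types named above appear in annotated text, here is a minimal Python sketch that extracts TIMEX3, EVENT, and SIGNAL elements from a small XML string. The fragment is hand-written for illustration and heavily simplified; it is not taken from TimeBank, and the helper name `tag_texts` is hypothetical. Consult the TimeML 1.2.1 specification for the full attribute inventory.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment in the style of TimeML (not TimeBank data).
SAMPLE = (
    '<TimeML>'
    'The company <EVENT eid="e1" class="OCCURRENCE">announced</EVENT> the deal '
    '<SIGNAL sid="s1">on</SIGNAL> '
    '<TIMEX3 tid="t1" type="DATE" value="1998-02-13">February 13</TIMEX3>.'
    '</TimeML>'
)

def tag_texts(xml_string, tag):
    """Return (attributes, text) pairs for every element with the given tag."""
    root = ET.fromstring(xml_string)
    return [(dict(el.attrib), el.text) for el in root.iter(tag)]

for tag in ("EVENT", "SIGNAL", "TIMEX3"):
    for attrs, text in tag_texts(SAMPLE, tag):
        print(tag, attrs, text)
```

Link elements, which relate events and times to one another, are omitted here to keep the sketch short.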
The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST).
- LDC: LDC93S1
- Language: English
- Genre: Balanced
- Type: Spoken
- Size: unknown
- Media: 1 CD
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file via ftp and provide the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.
- LDC: LDC99T42
- Language: English
- Genre: News
- Type: Written
- Size: unknown
- Media: 1 CD
The Census Bureau receives numerous requests to supply information on name frequency. In an effort to comply with those requests, the Census Bureau has embarked on a names list project involving a tabulation of names from the 1990 Census. These files contain only the frequency of a given name, no specific individual information.
- Language: English
- Genre: English names
- Size: 2M
- Media: CD
WordNet is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
- Language: English
- Genre: Balanced
- Type: NA
- Size: 11M (compressed)
- Media: online; CD