Linguistic Data Consortium

351 to 399 of 399 Results

GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 Aug 17, 2015 Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2015, "GALE Phase 3 Arabic Broadcast Conversation Speech Part 1", https://hdl.handle.net/11272.1/AB2/IDQ7EF, Abacus Data Network, V1 Introduction GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Moroc...
GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 Aug 17, 2015 Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2015, "GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1", https://hdl.handle.net/11272.1/AB2/KNH6XN, Abacus Data Network, V1 Introduction GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and M...
GALE Phase 3 and 4 Arabic Newswire Parallel Text Aug 17, 2015 Chen, Song; Krug, Gary; Strassel, Stephanie, 2015, "GALE Phase 3 and 4 Arabic Newswire Parallel Text", https://hdl.handle.net/11272.1/AB2/LJCNZH, Abacus Data Network, V1 Introduction GALE Phase 3 and 4 Arabic Newswire Parallel Text was developed by the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation)...
The Walking Around Corpus Jul 15, 2015 Brennan, Susan; Schuhmann, Katharina; Batres, Karla, 2015, "The Walking Around Corpus", https://hdl.handle.net/11272.1/AB2/CBYRGL, Abacus Data Network, V1 Introduction The Walking Around Corpus was developed by Stony Brook University and is comprised of approximately 33 hours of navigational telephone dialogues from 72 speakers (36 speaker pairs). Participants were Stony Brook University students who identified themselves as native...
English News Text Treebank: Penn Treebank Revised Jul 15, 2015 Bies, Ann; Mott, Justin; Warner, Colin, 2015, "English News Text Treebank: Penn Treebank Revised", https://hdl.handle.net/11272.1/AB2/3NDFMN, Abacus Data Network, V1 Introduction English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the Penn Treebank annotation of Wall Street Journal...
TS Wikipedia Jul 15, 2015 Sezer, Taner; Sezer, Türker, 2015, "TS Wikipedia", https://hdl.handle.net/11272.1/AB2/UZFK6X, Abacus Data Network, V1 Introduction TS Wikipedia is a collection of approximately 1.6 million processed Turkish Wikipedia pages. The data is tokenized and includes part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams.
RST Signalling Corpus Jun 15, 2015 Das, Debopam; Taboada, Maite; McFetridge, Paul, 2015, "RST Signalling Corpus", https://hdl.handle.net/11272.1/AB2/ER6VE1, Abacus Data Network, V1 Introduction RST Signalling Corpus was developed at Simon Fraser University and contains annotations for signalling information added to RST Discourse Treebank (LDC2002T07). RST Discourse Treebank (RST-DT) is a collection of English news texts annotated for rhetorical relations u...
GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences Jun 15, 2015 Song, Zhiyi; Krug, Gary; Strassel, Stephanie, 2015, "GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences", https://hdl.handle.net/11272.1/AB2/X1AKBI, Abacus Data Network, V1 Introduction GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences was developed by the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploit...
2006 CoNLL Shared Task - Ten Languages Jun 15, 2015 Bulgarian Academy of Sciences; Eberhard-Karls-Universität; Copenhagen Business School; Danish Society for Language and Literature; University of Groningen; Universität Potsdam; Universität des Saarlandes; Universität Stuttgart; Eberhard-Karls-Universität Tübingen; University of Southern Denmark; SINTEF Telcom & Informatics; Jožef Stefan Institute; Charles University; The Fran Ramovš Institute for the Slovenian Language; University of Barcelona; Uppsala University; Växjö University; Middle East Technical University, 2015, "2006 CoNLL Shared Task - Ten Languages", https://hdl.handle.net/11272.1/AB2/GXTB93, Abacus Data Network, V1 Introduction 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese,...
CIEMPIESS Jun 15, 2015 Mena, Carlos, 2015, "CIEMPIESS", https://hdl.handle.net/11272.1/AB2/PVKAVZ, Abacus Data Network, V1 Introduction CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of appr...
2006 CoNLL Shared Task - Arabic & Czech Jun 15, 2015 Charles University, 2015, "2006 CoNLL Shared Task - Arabic & Czech", https://hdl.handle.net/11272.1/AB2/UOAWYV, Abacus Data Network, V1 Introduction 2006 CoNLL Shared Task - Arabic & Czech consists of Arabic and Czech dependency treebanks used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. LDC also released 2006 CoNLL Shared Task - Ten Languages (LDC2015T11). This corpus is cross liste...
SenSem Lexicons May 15, 2015 Fernández, Ana; Vázquez, Gloria, 2015, "SenSem Lexicons", https://hdl.handle.net/11272.1/AB2/FOSRY6, Abacus Data Network, V1 Introduction SenSem (Sentence Semantics) Lexicons was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida a...
GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 May 15, 2015 Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2015, "GALE Phase 3 Chinese Broadcast Conversation Speech Part 2", https://hdl.handle.net/11272.1/AB2/NQFDD7, Abacus Data Network, V1 Introduction GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 112 hours of Mandarin Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong University of Scie...
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 May 15, 2015 Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2015, "GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2", https://hdl.handle.net/11272.1/AB2/BS5SXB, Abacus Data Network, V1 Introduction GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 112 hours of Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong University...
GALE Phase 3 and 4 Arabic Broadcast News Parallel Text Apr 20, 2015 Song, Zhiyi; Krug, Gary; Strassel, Stephanie, 2015, "GALE Phase 3 and 4 Arabic Broadcast News Parallel Text", https://hdl.handle.net/11272.1/AB2/KQYKZB, Abacus Data Network, V1 Introduction GALE Phase 3 and 4 Arabic Broadcast News Parallel Text was developed by the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploita...
Mandarin-English Code-Switching in South-East Asia Apr 15, 2015 Nanyang Technological University; Universiti Sains Malaysia, 2015, "Mandarin-English Code-Switching in South-East Asia", https://hdl.handle.net/11272.1/AB2/WNRRBV, Abacus Data Network, V1 Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia in Singapore and Malaysia, respectively. It is comprised of approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with...
Mandarin Chinese Phonetic Segmentation and Tone Apr 15, 2015 Yuan, Jiahong; Ryant, Neville; Liberman, Mark, 2015, "Mandarin Chinese Phonetic Segmentation and Tone", https://hdl.handle.net/11272.1/AB2/HW9PE3, Abacus Data Network, V1 Mandarin Chinese Phonetic Segmentation and Tone was developed by the Linguistic Data Consortium (LDC) and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation and tone labels separated into training and test sets. The utterances were derived from 1997 Mand...
GALE Chinese-English Parallel Aligned Treebank -- Training Mar 16, 2015 Li, Xuansong; Grimes, Stephen; Strassel, Stephanie; Ma, Xiaoyi; Xue, Nianwen; Marcus, Mitch; Taylor, Anne, 2015, "GALE Chinese-English Parallel Aligned Treebank -- Training", https://hdl.handle.net/11272.1/AB2/R6YEEW, Abacus Data Network, V1 GALE Chinese-English Parallel Aligned Treebank -- Training was developed by the Linguistic Data Consortium (LDC) and contains 229,249 tokens of word aligned Chinese and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Glo...
GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text Mar 16, 2015 Song, Zhiyi; Krug, Gary; Strassel, Stephanie, 2015, "GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text", https://hdl.handle.net/11272.1/AB2/NE8B9E, Abacus Data Network, V1 Introduction GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text was developed by the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language...
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 Feb 16, 2015 Li, Xuansong; Grimes, Stephen; Strassel, Stephanie, 2015, "GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3", https://hdl.handle.net/11272.1/AB2/NWG5BA, Abacus Data Network, V1 Introduction GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 was developed by the Linguistic Data Consortium (LDC) and contains 242,020 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as t...
Avocado Research Email Collection Feb 16, 2015 Oard, Douglas; Webber, William; Kirsch, David; Golitsynskiy, Sergey, 2015, "Avocado Research Email Collection", https://hdl.handle.net/11272.1/AB2/VTOAZW, Abacus Data Network, V1 Introduction Avocado Research Email Collection consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as "Avocado". Most of the accounts are those of Avocado employees; the remainder represent shared accounts such as "Le...
GALE Phase 2 Arabic Broadcast News Transcripts Part 2 Jan 15, 2015 Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2015, "GALE Phase 2 Arabic Broadcast News Transcripts Part 2", https://hdl.handle.net/11272.1/AB2/HXVHDH, Abacus Data Network, V1 Introduction GALE Phase 2 Arabic Broadcast News Transcripts Part 2 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Moroc...
SenSem Databank Jan 15, 2015 Fernández, Ana; Vázquez, Gloria, 2015, "SenSem Databank", https://hdl.handle.net/11272.1/AB2/R1OKLN, Abacus Data Network, V1 Introduction SenSem (Sentence Semantics) Databank was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida a...
GALE Phase 2 Arabic Broadcast News Speech Part 2 Jan 15, 2015 Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2015, "GALE Phase 2 Arabic Broadcast News Speech Part 2", https://hdl.handle.net/11272.1/AB2/OH8SGQ, Abacus Data Network, V1 Introduction GALE Phase 2 Arabic Broadcast News Speech Part 2 was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase...
OntoNotes Release 5.0 Oct 16, 2013 Weischedel, Ralph; Palmer, Martha; Marcus, Mitchell; Hovy, Eduard; Pradhan, Sameer; Ramshaw, Lance; Xue, Nianwen; Taylor, Ann; Kaufman, Jeff; Franchini, Michelle; El-Bachouti, Mohammed; Belvin, Robert; Houston, Ann, 2013, "OntoNotes Release 5.0", https://hdl.handle.net/11272.1/AB2/MKJJ2R, Abacus Data Network, V1 OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project was...
Arabic Gigaword Fifth Edition Oct 21, 2011 Parker, Robert; Graff, David; Chen, Ke; Kong, Junbo; Maeda, Kazuaki, 2011, "Arabic Gigaword Fifth Edition", https://hdl.handle.net/11272.1/AB2/CP764S, Abacus Data Network, V1 Arabic Gigaword Fifth Edition, Linguistic Data Consortium (LDC) catalog number LDC2011T11 and ISBN 1-58563-595-2, was produced by LDC. It is a comprehensive archive of newswire text data that has been acquired from Arabic news sources by LDC at the University of Pennsylvania. Ara...
NIST 2005 Open Machine Translation (OpenMT) Evaluation Aug 18, 2010 NIST Multimodal Information Group, 2010, "NIST 2005 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/OUIAZM, Abacus Data Network, V1 NIST 2005 Open Machine Translation (OpenMT) Evaluation, Linguistic Data Consortium (LDC) catalog number LDC2010T14 and ISBN 1-58563-556-1, is a package containing source data, reference translations, and scoring software used in the NIST 2005 OpenMT evaluation. It is designed to...
NIST 2004 Open Machine Translation (OpenMT) Evaluation Jul 19, 2010 Linguistic Data Consortium, 2010, "NIST 2004 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/N36V4C, Abacus Data Network, V1 NIST 2004 Open Machine Translation (OpenMT) Evaluation is a package containing source data, reference translations, and scoring software used in the NIST 2004 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. The package was com...
FactBank 1.0 Sep 15, 2009 Sauri, Roser; Pustejovsky, James, 2009, "FactBank 1.0", https://hdl.handle.net/11272.1/AB2/ZEECTZ, Abacus Data Network, V1 FactBank 1.0, Linguistic Data Consortium (LDC) catalog number LDC2009T23 and ISBN 1-58563-522-7, consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to w...
Global Yoruba Lexical Database v. 1.0 Dec 19, 2008 Awoyale, Yiwola, 2008, "Global Yoruba Lexical Database v. 1.0", https://hdl.handle.net/11272.1/AB2/WMU7KJ, Abacus Data Network, V1 Introduction The Global Yoruba Lexical Database v. 1.0 is a set of related dictionaries providing definitions and translations for over 450,000 words from the Yoruba language and its variants: Standard Yoruba (over 368,000 words), Gullah (over 3,600 words), Lucumí (over 8,000 wor...
The New York Times Annotated Corpus Oct 17, 2008 Sandhaus, Evan, 2008, "The New York Times Annotated Corpus", https://hdl.handle.net/11272.1/AB2/GZC6PL, Abacus Data Network, V1 The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online productio...
STC-TIMIT 1.0 Mar 19, 2008 Morales, Nicholas, 2008, "STC-TIMIT 1.0", https://hdl.handle.net/11272.1/AB2/XWIVLE, Abacus Data Network, V1 This file contains documentation for STC-TIMIT 1.0, Linguistic Data Consortium (LDC) catalog number LDC2008S03 and ISBN 1-58563-468-9. STC-TIMIT 1.0 is a telephone version of TIMIT Acoustic Phonetic Continuous Speech Corpus, LDC93S1 (TIMIT). TIMIT contains broadband recordings of...
Mandarin Affective Speech Jul 17, 2007 Yang, Yingchun; Wu, Zhaohui; Wu, Tian; Li, Dongdong, 2007, "Mandarin Affective Speech", https://hdl.handle.net/11272.1/AB2/USGIFG, Abacus Data Network, V1 Mandarin Affective Speech is a database of emotional speech consisting of audio recordings and corresponding transcripts collected in 2005 at the Advance Computing and System Laboratory, College of Computer Science and Technology, Zhejiang University, Hangzhou, People's Republic...
ISI Arabic-English Automatically Extracted Parallel Text Feb 20, 2007 Munteanu, Dragos Stefan; Marcu, Daniel, 2007, "ISI Arabic-English Automatically Extracted Parallel Text", https://hdl.handle.net/11272.1/AB2/QOOTEO, Abacus Data Network, V1 This distribution contains a corpus of Arabic-English parallel sentences, which were extracted automatically from two monolingual corpora: Arabic Gigaword Second Edition (LDC2006T02) and English Gigaword Second Edition (LDC2005T12). The data was extracted from news articles publi...
Fisher English Training Speech Part 1 Transcripts Dec 15, 2004 Cieri, Christopher; Graff, David; Kimball, Owen; Miller, Dave; Walker, Kevin, 2004, "Fisher English Training Speech Part 1 Transcripts", https://hdl.handle.net/11272.1/AB2/2NDQPL, Abacus Data Network, V1 Fisher English Training Speech Part 1 Transcripts represents the first half of a collection of conversational telephone speech (CTS) that was created at LDC in 2003. It contains time-aligned transcript data for 5,850 complete conversations, each lasting up to 10 minutes. In addit...
Fisher English Training Speech Part 1 Speech Dec 15, 2004 Cieri, Christopher; Graff, David; Kimball, Owen; Miller, Dave; Walker, Kevin, 2004, "Fisher English Training Speech Part 1 Speech", https://hdl.handle.net/11272.1/AB2/KST6JM, Abacus Data Network, V1 Fisher English Training Speech Part 1 Speech represents the first half of a collection of conversational telephone speech (CTS) that was created at the LDC during 2003. It contains 5,850 audio files, each one containing a full conversation of up to 10 minutes. Additional informat...
Arabic English Parallel News Part 1 Oct 24, 2004 Several (sic), 2004, "Arabic English Parallel News Part 1", https://hdl.handle.net/11272.1/AB2/AWOGQE, Abacus Data Network, V1 This corpus contains Arabic news stories and their English translations LDC collected via Ummah Press Service from January 2001 to September 2004. It totals 8,439 story pairs, 68,685 sentence pairs, 2M Arabic words and 2.5M English words. The corpus is aligned at sentence level....
Arabic News Translation Text Part 1 Sep 23, 2004 Ma, Xiaoyi; Zakhary, Dalal; Bamba, Moussa, 2004, "Arabic News Translation Text Part 1", https://hdl.handle.net/11272.1/AB2/OXMNRV, Abacus Data Network, V1 Arabic News Translation Text Part 1 was produced by Linguistic Data Consortium (LDC) catalog number LDC2004T17 and ISBN 1-58563-307-0. To support the development of automatic machine translation systems, the LDC was sponsored to solicit English translations for a single set of Ar...
Korean Telephone Conversations Transcripts May 16, 2003 Ko, Eon-Suk; Han, Na-Rae; Strassel, Stephanie; Martey, Nii, 2003, "Korean Telephone Conversations Transcripts", https://hdl.handle.net/11272.1/AB2/NLHMOC, Abacus Data Network, V1 Korean Telephone Conversations Transcripts was produced by Linguistic Data Consortium (LDC) catalog number LDC2003T08 and ISBN 1-58563-264-3. The telephone conversations on which these transcripts are based were originally recorded as part of the CALLFRIEND project. The CALLFRIEN...
RST Discourse Treebank Feb 21, 2002 Carlson, Lynn; Marcu, Daniel; Okurowski, Mary Ellen, 2002, "RST Discourse Treebank", https://hdl.handle.net/11272.1/AB2/T4O5YK, Abacus Data Network, V1 Rhetorical Structure Theory (RST) Discourse Treebank was developed by researchers at the Information Sciences Institute (University of Southern California), the US Department of Defense and the Linguistic Data Consortium (LDC). It consists of 385 Wall Street Journal articles from...
HTIMIT Jan 1, 1998 Reynolds, Douglas, 1998, "HTIMIT", https://hdl.handle.net/11272.1/AB2/HO3TZV, Abacus Data Network, V1 The HTIMIT corpus is a re-recording of a subset of the TIMIT corpus through different telephone handsets. The aim was to create a corpus for the study of telephone transducer effects on speech which minimized confounding factors, such as variable telephone channels and background...
FFMTIMIT Jan 1, 1996 Garofolo, John; Lamel, Lori; Fisher, William; Fiscus, Jonathan; Pallett, David; Dahlgren, Nancy; Zue, Victor, 1996, "FFMTIMIT", https://hdl.handle.net/11272.1/AB2/MJ60CA, Abacus Data Network, V1 The FFMTIMIT corpus contains the previously unreleased secondary microphone waveforms for the TIMIT Acoustic-Phonetic Continuous Speech corpus. The primary microphone waveforms, which were recorded using a close-talking noise-cancelling head-mounted Sennheiser microphone (model H...
JEIDA/JCSD-Channel 1 Control Words Jan 1, 1996 Hamaker, Jonathan; Duncan, Richard; Picone, Joe; Itahashi, Shuichi, 1996, "JEIDA/JCSD-Channel 1 Control Words", https://hdl.handle.net/11272.1/AB2/L4QJD1, Abacus Data Network, V1 The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of the Institute for Signal and Information Processing at Mississippi State University.
CALLHOME Spanish Lexicon Jan 1, 1996 Garrett, Susan; Morton, Tom; McLemore, Cynthia, 1996, "CALLHOME Spanish Lexicon", https://hdl.handle.net/11272.1/AB2/YRJRSK, Abacus Data Network, V1 The CALLHOME Spanish collection includes a lexical component. The CALLHOME Spanish Lexicon consists of 45,582 words and contains separate information fields with phonological, morphological and frequency information for each word. The token coverage by the LDC Spanish lexicon of...
CELEX2 Jan 1, 1996 Baayen, R; Piepenbrock, R; Gulikers, L, 1996, "CELEX2", https://hdl.handle.net/11272.1/AB2/WLSRWH, Abacus Data Network, V1 This corpus contains ASCII versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.0). CELEX was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institu...
CTIMIT Jan 1, 1996 George, E. Bryan; Brown, Kathy; Birnbaum, Martha; Macon, Michael, 1996, "CTIMIT", https://hdl.handle.net/11272.1/AB2/DPIQCD, Abacus Data Network, V1 The CTIMIT corpus is a cellular-bandwidth adjunct to the TIMIT Acoustic Phonetic Continuous Speech Corpus (NIST Speech Disc CD1-1.1/NTIS Pb91-505065, October 1990). The corpus was contributed by Lockheed-Martin Sanders to the LDC for distribution on CD-ROM media. The CTIMIT read...
NTIMIT Jan 1, 1993 Fisher, William; Doddington, George; Goudie-Marshall, Kathleen; Jankowski, Charles; Kalyanswamy, Ashok; Basson, Sara; Spitz, Judith, 1993, "NTIMIT", https://hdl.handle.net/11272.1/AB2/AXQJUZ, Abacus Data Network, V1 The NTIMIT corpus was developed by the NYNEX Science and Technology Speech Communication Group to provide a telephone bandwidth adjunct to TIMIT. NTIMIT was collected by transmitting all 6,300 original TIMIT recordings through a telephone handset and over various channels in the...
TIMIT Acoustic-Phonetic Continuous Speech (MS-WAV version) Jan 1, 1993 Garofolo, John; Lamel, Lori; Fisher, William; Fiscus, Jonathan; Pallett, David; Dahlgren, Nancy; Zue, Victor, 1993, "TIMIT Acoustic-Phonetic Continuous Speech (MS-WAV version)", https://hdl.handle.net/11272.1/AB2/BU0KGP, Abacus Data Network, V1 This version of the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) has all the waveform files formatted with ms-wav / RIFF headers, to make the corpus more accessible to a wider audience. The TIMIT corpus of read speech is designed to provide speech data for acoustic-...
TIMIT Acoustic-Phonetic Continuous Speech Corpus Jan 1, 1993 Garofolo, John; Lamel, Lori; Fisher, William; Fiscus, Jonathan; Pallett, David; Dahlgren, Nancy; Zue, Victor, 1993, "TIMIT Acoustic-Phonetic Continuous Speech Corpus", https://hdl.handle.net/11272.1/AB2/SWVENO, Abacus Data Network, V1 The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each r...

GALE Phase 3 Arabic Broadcast Conversation Speech Part 1

Aug 17, 2015

Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2015, "GALE Phase 3 Arabic Broadcast Conversation Speech Part 1", https://hdl.handle.net/11272.1/AB2/IDQ7EF, Abacus Data Network, V1

Introduction GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Moroc...

GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1

Aug 17, 2015

Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2015, "GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1", https://hdl.handle.net/11272.1/AB2/KNH6XN, Abacus Data Network, V1

Introduction GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and M...

GALE Phase 3 and 4 Arabic Newswire Parallel Text

Aug 17, 2015

Chen, Song; Krug, Gary; Strassel, Stephanie, 2015, "GALE Phase 3 and 4 Arabic Newswire Parallel Text", https://hdl.handle.net/11272.1/AB2/LJCNZH, Abacus Data Network, V1

Introduction GALE Phase 3 and 4 Arabic Newswire Parallel Text was developed by the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation)...

The Walking Around Corpus

Jul 15, 2015

Brennan, Susan; Schuhmann, Katharina; Batres, Karla, 2015, "The Walking Around Corpus", https://hdl.handle.net/11272.1/AB2/CBYRGL, Abacus Data Network, V1

Introduction The Walking Around Corpus was developed by Stony Brook University and is comprised of approximately 33 hours of navigational telephone dialogues from 72 speakers (36 speaker pairs). Participants were Stony Brook University students who identified themselves as native...

English News Text Treebank: Penn Treebank Revised

Jul 15, 2015

Bies, Ann; Mott, Justin; Warner, Colin, 2015, "English News Text Treebank: Penn Treebank Revised", https://hdl.handle.net/11272.1/AB2/3NDFMN, Abacus Data Network, V1

Introduction English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the Penn Treebank annotation of Wall Street Journal...

TS Wikipedia

Jul 15, 2015

Sezer, Taner; Sezer, Türker, 2015, "TS Wikipedia", https://hdl.handle.net/11272.1/AB2/UZFK6X, Abacus Data Network, V1

Introduction TS Wikipedia is a collection of approximately 1.6 million processed Turkish Wikipedia pages. The data is tokenized and includes part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams.

RST Signalling Corpus

Jun 15, 2015

Das, Debopam; Taboada, Maite; McFetridge, Paul, 2015, "RST Signalling Corpus", https://hdl.handle.net/11272.1/AB2/ER6VE1, Abacus Data Network, V1

Introduction RST Signalling Corpus was developed at Simon Fraser University and contains annotations for signalling information added to RST Discourse Treebank (LDC2002T07). RST Discourse Treebank (RST-DT) is a collection of English news texts annotated for rhetorical relations u...

GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences

Jun 15, 2015

Song, Zhiyi; Krug, Gary; Strassel, Stephanie, 2015, "GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences", https://hdl.handle.net/11272.1/AB2/X1AKBI, Abacus Data Network, V1

Introduction GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences was developed by the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploit...

2006 CoNLL Shared Task - Ten Languages

Jun 15, 2015

Bulgarian Academy of Sciences; Eberhard-Karls-Universität; Copenhagen Business School; Danish Society for Language and Literature; University of Groningen; Universität Potsdam; Universität des Saarlandes; Universität Stuttgart; Eberhard-Karls-Universität Tübingen; University of Southern Denmark; SINTEF Telcom & Informatics; Jožef Stefan Institute; Charles University; The Fran Ramovš Institute for the Slovenian Language; University of Barcelona; Uppsala University; Växjŏ University; Middle East Technical University; Bulgarian Academy of Sciences; Eberhard-Karls-Universität; Copenhagen Business School; Danish Society for Language and Literature; University of Groningen; Universität Potsdam; Universität des Saarlandes; Universität Stuttgart; Eberhard-Karls-Universität Tübingen; University of Southern Denmark; SINTEF Telcom & Informatics; Jožef Stefan Institute; Charles University; The Fran Ramovš Institute for the Slovenian Language; University of Barcelona; Uppsala University; Växjŏ University; Middle East Technical University, 2015, "2006 CoNLL Shared Task - Ten Languages", https://hdl.handle.net/11272.1/AB2/GXTB93, Abacus Data Network, V1

Introduction 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese,...

CIEMPIESS

Jun 15, 2015

Mena, Carlos, 2015, "CIEMPIESS", https://hdl.handle.net/11272.1/AB2/PVKAVZ, Abacus Data Network, V1

Introduction CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of appr...

2006 CoNLL Shared Task - Arabic & Czech

Jun 15, 2015

Charles University, 2015, "2006 CoNLL Shared Task - Arabic & Czech", https://hdl.handle.net/11272.1/AB2/UOAWYV, Abacus Data Network, V1

Introduction 2006 CoNLL Shared Task - Arabic & Czech consists of Arabic and Czech dependency treebanks used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. LDC also released 2006 CoNLL Shared Task - Ten Languages (LDC2015T11). This corpus is cross liste...

SenSem Lexicons

May 15, 2015

Fernández, Ana; Vázquez, Gloria, 2015, "SenSem Lexicons", https://hdl.handle.net/11272.1/AB2/FOSRY6, Abacus Data Network, V1

Introduction SenSem (Sentence Semantics) Lexicons was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida a...

GALE Phase 3 Chinese Broadcast Conversation Speech Part 2

May 15, 2015

Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2015, "GALE Phase 3 Chinese Broadcast Conversation Speech Part 2", https://hdl.handle.net/11272.1/AB2/NQFDD7, Abacus Data Network, V1

Introduction GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 112 hours of Mandarin Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong Kong University of Scie...

GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2

May 15, 2015

Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2015, "GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2", https://hdl.handle.net/11272.1/AB2/BS5SXB, Abacus Data Network, V1

Introduction GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 112 hours of Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong Kong University...

GALE Phase 3 and 4 Arabic Broadcast News Parallel Text

Apr 20, 2015

Song, Zhiyi; Krug, Gary; Strassel, Stephanie, 2015, "GALE Phase 3 and 4 Arabic Broadcast News Parallel Text", https://hdl.handle.net/11272.1/AB2/KQYKZB, Abacus Data Network, V1

Introduction GALE Phase 3 and 4 Arabic Broadcast News Parallel Text was developed by the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploita...

Mandarin-English Code-Switching in South-East Asia

Apr 15, 2015

Nanyang Technological University; Universiti Sains Malaysia, 2015, "Mandarin-English Code-Switching in South-East Asia", https://hdl.handle.net/11272.1/AB2/WNRRBV, Abacus Data Network, V1

Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia in Singapore and Malaysia, respectively. It is comprised of approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with...

Mandarin Chinese Phonetic Segmentation and Tone

Apr 15, 2015

Yuan, Jiahong; Ryant, Neville; Liberman, Mark, 2015, "Mandarin Chinese Phonetic Segmentation and Tone", https://hdl.handle.net/11272.1/AB2/HW9PE3, Abacus Data Network, V1

Mandarin Chinese Phonetic Segmentation and Tone was developed by the Linguistic Data Consortium (LDC) and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation and tone labels separated into training and test sets. The utterances were derived from 1997 Mand...

GALE Chinese-English Parallel Aligned Treebank -- Training

Mar 16, 2015

Li, Xuansong; Grimes, Stephen; Strassel, Stephanie; Ma, Xiaoyi; Xue, Nianwen; Marcus, Mitch; Taylor, Anne, 2015, "GALE Chinese-English Parallel Aligned Treebank -- Training", https://hdl.handle.net/11272.1/AB2/R6YEEW, Abacus Data Network, V1

GALE Chinese-English Parallel Aligned Treebank -- Training was developed by the Linguistic Data Consortium (LDC) and contains 229,249 tokens of word aligned Chinese and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Glo...

GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text

Mar 16, 2015

Song, Zhiyi; Krug, Gary; Strassel, Stephanie, 2015, "GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text", https://hdl.handle.net/11272.1/AB2/NE8B9E, Abacus Data Network, V1

Introduction GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text was developed by the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language...

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3

Feb 16, 2015

Li, Xuansong; Grimes, Stephen; Strassel, Stephanie, 2015, "GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3", https://hdl.handle.net/11272.1/AB2/NWG5BA, Abacus Data Network, V1

Introduction GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 was developed by the Linguistic Data Consortium (LDC) and contains 242,020 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as t...

Avocado Research Email Collection

Feb 16, 2015

Oard, Douglas; Webber, William; Kirsch, David; Golitsynskiy, Sergey, 2015, "Avocado Research Email Collection", https://hdl.handle.net/11272.1/AB2/VTOAZW, Abacus Data Network, V1

Introduction Avocado Research Email Collection consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as "Avocado". Most of the accounts are those of Avocado employees; the remainder represent shared accounts such as "Le...

GALE Phase 2 Arabic Broadcast News Transcripts Part 2

Jan 15, 2015

Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2015, "GALE Phase 2 Arabic Broadcast News Transcripts Part 2", https://hdl.handle.net/11272.1/AB2/HXVHDH, Abacus Data Network, V1

Introduction GALE Phase 2 Arabic Broadcast News Transcripts Part 2 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Moroc...

SenSem Databank

Jan 15, 2015

Fernández, Ana; Vázquez, Gloria, 2015, "SenSem Databank", https://hdl.handle.net/11272.1/AB2/R1OKLN, Abacus Data Network, V1

Introduction SenSem (Sentence Semantics) Databank was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida a...

GALE Phase 2 Arabic Broadcast News Speech Part 2

Jan 15, 2015

Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2015, "GALE Phase 2 Arabic Broadcast News Speech Part 2", https://hdl.handle.net/11272.1/AB2/OH8SGQ, Abacus Data Network, V1

Introduction GALE Phase 2 Arabic Broadcast News Speech Part 2 was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase...

OntoNotes Release 5.0

Oct 16, 2013

Weischedel, Ralph; Palmer, Martha; Marcus, Mitchell; Hovy, Eduard; Pradhan, Sameer; Ramshaw, Lance; Xue, Nianwen; Taylor, Ann; Kaufman, Jeff; Franchini, Michelle; El-Bachouti, Mohammed; Belvin, Robert; Houston, Ann, 2013, "OntoNotes Release 5.0", https://hdl.handle.net/11272.1/AB2/MKJJ2R, Abacus Data Network, V1

OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project was...

Arabic Gigaword Fifth Edition

Oct 21, 2011

Parker, Robert; Graff, David; Chen, Ke; Kong, Junbo; Maeda, Kazuaki, 2011, "Arabic Gigaword Fifth Edition", https://hdl.handle.net/11272.1/AB2/CP764S, Abacus Data Network, V1

Arabic Gigaword Fifth Edition, Linguistic Data Consortium (LDC) catalog number LDC2011T11 and ISBN 1-58563-595-2, was produced by LDC. It is a comprehensive archive of newswire text data that has been acquired from Arabic news sources by LDC at the University of Pennsylvania. Ara...

NIST 2005 Open Machine Translation (OpenMT) Evaluation

Aug 18, 2010

NIST Multimodal Information Group, 2010, "NIST 2005 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/OUIAZM, Abacus Data Network, V1

NIST 2005 Open Machine Translation (OpenMT) Evaluation, Linguistic Data Consortium (LDC) catalog number LDC2010T14 and ISBN 1-58563-556-1, is a package containing source data, reference translations, and scoring software used in the NIST 2005 OpenMT evaluation. It is designed to...

NIST 2004 Open Machine Translation (OpenMT) Evaluation

Jul 19, 2010

Linguistic Data Consortium, 2010, "NIST 2004 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/N36V4C, Abacus Data Network, V1

NIST 2004 Open Machine Translation (OpenMT) Evaluation is a package containing source data, reference translations, and scoring software used in the NIST 2004 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. The package was com...

FactBank 1.0

Sep 15, 2009

Sauri, Roser; Pustejovsky, James, 2009, "FactBank 1.0", https://hdl.handle.net/11272.1/AB2/ZEECTZ, Abacus Data Network, V1

FactBank 1.0, Linguistic Data Consortium (LDC) catalog number LDC2009T23 and ISBN 1-58563-522-7, consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to w...

Global Yoruba Lexical Database v. 1.0

Dec 19, 2008

Awoyale, Yiwola, 2008, "Global Yoruba Lexical Database v. 1.0", https://hdl.handle.net/11272.1/AB2/WMU7KJ, Abacus Data Network, V1

Introduction The Global Yoruba Lexical Database v. 1.0 is a set of related dictionaries providing definitions and translations for over 450,000 words from the Yoruba language and its variants: Standard Yoruba (over 368,000 words), Gullah (over 3,600 words), Lucumí (over 8,000 wor...

The New York Times Annotated Corpus

Oct 17, 2008

Sandhaus, Evan, 2008, "The New York Times Annotated Corpus", https://hdl.handle.net/11272.1/AB2/GZC6PL, Abacus Data Network, V1

The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online productio...

STC-TIMIT 1.0

Mar 19, 2008

Morales, Nicholas, 2008, "STC-TIMIT 1.0", https://hdl.handle.net/11272.1/AB2/XWIVLE, Abacus Data Network, V1

This file contains documentation for STC-TIMIT 1.0, Linguistic Data Consortium (LDC) catalog number LDC2008S03 and ISBN 1-58563-468-9. STC-TIMIT 1.0 is a telephone version of TIMIT Acoustic Phonetic Continuous Speech Corpus, LDC93S1 (TIMIT). TIMIT contains broadband recordings of...

Mandarin Affective Speech

Jul 17, 2007

Yang, Yingchun; Wu, Zhaohui; Wu, Tian; Li, Dongdong, 2007, "Mandarin Affective Speech", https://hdl.handle.net/11272.1/AB2/USGIFG, Abacus Data Network, V1

Mandarin Affective Speech is a database of emotional speech consisting of audio recordings and corresponding transcripts collected in 2005 at the Advance Computing and System Laboratory, College of Computer Science and Technology, Zhejiang University, Hangzhou, People's Republic...

ISI Arabic-English Automatically Extracted Parallel Text

Feb 20, 2007

Munteanu, Dragos Stefan; Marcu, Daniel, 2007, "ISI Arabic-English Automatically Extracted Parallel Text", https://hdl.handle.net/11272.1/AB2/QOOTEO, Abacus Data Network, V1

This distribution contains a corpus of Arabic-English parallel sentences, which were extracted automatically from two monolingual corpora: Arabic Gigaword Second Edition (LDC2006T02) and English Gigaword Second Edition (LDC2005T12). The data was extracted from news articles publi...

Fisher English Training Speech Part 1 Transcripts

Dec 15, 2004

Cieri, Christopher; Graff, David; Kimball, Owen; Miller, Dave; Walker, Kevin, 2004, "Fisher English Training Speech Part 1 Transcripts", https://hdl.handle.net/11272.1/AB2/2NDQPL, Abacus Data Network, V1

Fisher English Training Speech Part 1 Transcripts represents the first half of a collection of conversational telephone speech (CTS) that was created at LDC in 2003. It contains time-aligned transcript data for 5,850 complete conversations, each lasting up to 10 minutes. In addit...

Fisher English Training Speech Part 1 Speech

Dec 15, 2004

Cieri, Christopher; Graff, David; Kimball, Owen; Miller, Dave; Walker, Kevin, 2004, "Fisher English Training Speech Part 1 Speech", https://hdl.handle.net/11272.1/AB2/KST6JM, Abacus Data Network, V1

Fisher English Training Speech Part 1 Speech represents the first half of a collection of conversational telephone speech (CTS) that was created at the LDC during 2003. It contains 5,850 audio files, each one containing a full conversation of up to 10 minutes. Additional informat...

Arabic English Parallel News Part 1

Oct 24, 2004

Several (sic), 2004, "Arabic English Parallel News Part 1", https://hdl.handle.net/11272.1/AB2/AWOGQE, Abacus Data Network, V1

This corpus contains Arabic news stories and their English translations LDC collected via Ummah Press Service from January 2001 to September 2004. It totals 8,439 story pairs, 68,685 sentence pairs, 2M Arabic words and 2.5M English words. The corpus is aligned at sentence level....

Arabic News Translation Text Part 1

Sep 23, 2004

Ma, Xiaoyi; Zakhary, Dalal; Bamba, Moussa, 2004, "Arabic News Translation Text Part 1", https://hdl.handle.net/11272.1/AB2/OXMNRV, Abacus Data Network, V1

Arabic News Translation Text Part 1, Linguistic Data Consortium (LDC) catalog number LDC2004T17 and ISBN 1-58563-307-0, was produced by LDC. To support the development of automatic machine translation systems, the LDC was sponsored to solicit English translations for a single set of Ar...

Korean Telephone Conversations Transcripts

May 16, 2003

Ko, Eon-Suk; Han, Na-Rae; Strassel, Stephanie; Martey, Nii, 2003, "Korean Telephone Conversations Transcripts", https://hdl.handle.net/11272.1/AB2/NLHMOC, Abacus Data Network, V1

Korean Telephone Conversations Transcripts, Linguistic Data Consortium (LDC) catalog number LDC2003T08 and ISBN 1-58563-264-3, was produced by LDC. The telephone conversations on which these transcripts are based were originally recorded as part of the CALLFRIEND project. The CALLFRIEN...

RST Discourse Treebank

Feb 21, 2002

Carlson, Lynn; Marcu, Daniel; Okurowski, Mary Ellen, 2002, "RST Discourse Treebank", https://hdl.handle.net/11272.1/AB2/T4O5YK, Abacus Data Network, V1

Rhetorical Structure Theory (RST) Discourse Treebank was developed by researchers at the Information Sciences Institute (University of Southern California), the US Department of Defense and the Linguistic Data Consortium (LDC). It consists of 385 Wall Street Journal articles from...

HTIMIT

Jan 1, 1998

Reynolds, Douglas, 1998, "HTIMIT", https://hdl.handle.net/11272.1/AB2/HO3TZV, Abacus Data Network, V1

The HTIMIT corpus is a re-recording of a subset of the TIMIT corpus through different telephone handsets. The aim was to create a corpus for the study of telephone transducer effects on speech which minimized confounding factors, such as variable telephone channels and background...

FFMTIMIT

Jan 1, 1996

Garofolo, John; Lamel, Lori; Fisher, William; Fiscus, Jonathan; Pallett, David; Dahlgren, Nancy; Zue, Victor, 1996, "FFMTIMIT", https://hdl.handle.net/11272.1/AB2/MJ60CA, Abacus Data Network, V1

The FFMTIMIT corpus contains the previously unreleased secondary microphone waveforms for the TIMIT Acoustic-Phonetic Continuous Speech corpus. The primary microphone waveforms, which were recorded using a close-talking noise-cancelling head-mounted Sennheiser microphone (model H...

JEIDA/JCSD-Channel 1 Control Words

Jan 1, 1996

Hamaker, Jonathan; Duncan, Richard; Picone, Joe; Itahashi, Shuichi, 1996, "JEIDA/JCSD-Channel 1 Control Words", https://hdl.handle.net/11272.1/AB2/L4QJD1, Abacus Data Network, V1

The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of the Institute for Signal and Information Processing at Mississippi State University.

CALLHOME Spanish Lexicon

Jan 1, 1996

Garrett, Susan; Morton, Tom; McLemore, Cynthia, 1996, "CALLHOME Spanish Lexicon", https://hdl.handle.net/11272.1/AB2/YRJRSK, Abacus Data Network, V1

The CALLHOME Spanish collection includes a lexical component. The CALLHOME Spanish Lexicon consists of 45,582 words and contains separate information fields with phonological, morphological and frequency information for each word. The token coverage by the LDC Spanish lexicon of...

CELEX2

Jan 1, 1996

Baayen, R; Piepenbrock, R; Gulikers, L, 1996, "CELEX2", https://hdl.handle.net/11272.1/AB2/WLSRWH, Abacus Data Network, V1

This corpus contains ASCII versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.0). CELEX was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institu...

CTIMIT

Jan 1, 1996

George, E. Bryan; Brown, Kathy; Birnbaum, Martha; Macon, Michael, 1996, "CTIMIT", https://hdl.handle.net/11272.1/AB2/DPIQCD, Abacus Data Network, V1

The CTIMIT corpus is a cellular-bandwidth adjunct to the TIMIT Acoustic Phonetic Continuous Speech Corpus (NIST Speech Disc CD1-1.1/NTIS Pb91-505065, October 1990). The corpus was contributed by Lockheed-Martin Sanders to the LDC for distribution on CD-ROM media. The CTIMIT read...

NTIMIT

Jan 1, 1993

Fisher, William; Doddington, George; Goudie-Marshall, Kathleen; Jankowski, Charles; Kalyanswamy, Ashok; Basson, Sara; Spitz, Judith, 1993, "NTIMIT", https://hdl.handle.net/11272.1/AB2/AXQJUZ, Abacus Data Network, V1

The NTIMIT corpus was developed by the NYNEX Science and Technology Speech Communication Group to provide a telephone bandwidth adjunct to TIMIT. NTIMIT was collected by transmitting all 6,300 original TIMIT recordings through a telephone handset and over various channels in the...

TIMIT Acoustic-Phonetic Continuous Speech (MS-WAV version)

Jan 1, 1993

Garofolo, John; Lamel, Lori; Fisher, William; Fiscus, Jonathan; Pallett, David; Dahlgren, Nancy; Zue, Victor, 1993, "TIMIT Acoustic-Phonetic Continuous Speech (MS-WAV version)", https://hdl.handle.net/11272.1/AB2/BU0KGP, Abacus Data Network, V1

This version of the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) has all the waveform files formatted with ms-wav / RIFF headers, to make the corpus more accessible to a wider audience. The TIMIT corpus of read speech is designed to provide speech data for acoustic-...

TIMIT Acoustic-Phonetic Continuous Speech Corpus

Jan 1, 1993

Garofolo, John; Lamel, Lori; Fisher, William; Fiscus, Jonathan; Pallett, David; Dahlgren, Nancy; Zue, Victor, 1993, "TIMIT Acoustic-Phonetic Continuous Speech Corpus", https://hdl.handle.net/11272.1/AB2/SWVENO, Abacus Data Network, V1

The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each r...
