Linguistic Data Consortium

1,651 to 1,700 of 1,819 Results

USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition Jun 17, 2019 Ramabhadran, Bhuvana; Gustman, Samuel; Byrne, William; Hajič, Jan; Oard, Douglas; Olsson, J. Scott; Picheny, Michael; Psutka, Josef, 2019, "USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition", https://hdl.handle.net/11272.1/AB2/SGOMWO, Abacus Data Network, V1 USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition, LDC Catalog Number LDC2019S11 and ISBN 1-58563-889-7, was developed by IBM as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. This edition augments USC-SFI MALACH Interviews...
CIEMPIESS Experimentation May 15, 2019 Mena, Carlos Daniel Hernández, 2019, "CIEMPIESS Experimentation", https://hdl.handle.net/11272.1/AB2/DUUYQV, Abacus Data Network, V1 CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Experimentation was developed by the social service program "Desarrollo de Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) at the National Autonomous Univer...
TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 May 15, 2019 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014", https://hdl.handle.net/11272.1/AB2/ZZMOPP, Abacus Data Network, V1 TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 201...
Multi-Language Conversational Telephone Speech 2011 -- English Group May 15, 2019 Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2019, "Multi-Language Conversational Telephone Speech 2011 -- English Group", https://hdl.handle.net/11272.1/AB2/ACDWDL, Abacus Data Network, V1 Multi-Language Conversational Telephone Speech 2011 – English Group was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 18 hours of telephone speech in two general varieties of English: American and South Asian. The data were collected primaril...
BOLT Egyptian-English Word Alignment -- Discussion Forum Training Apr 15, 2019 Li, Xuansong; Peterson, Katherine; Grimes, Stephen; Strassel, Stephanie, 2019, "BOLT Egyptian-English Word Alignment -- Discussion Forum Training", https://hdl.handle.net/11272.1/AB2/AR1QCS, Abacus Data Network, V1 BOLT Egyptian-English Word Alignment – Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 400,448 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations. The DARPA BOLT (Broad Operat...
Chinese Abstract Meaning Representation 1.0 Apr 15, 2019 Li, Bin; Wen, Yuan; Song, Li; Dai, Rubing; Qu, Weiguang; Xue, Nianwen, 2019, "Chinese Abstract Meaning Representation 1.0", https://hdl.handle.net/11272.1/AB2/TT5KRI, Abacus Data Network, V1 Chinese Abstract Meaning Representation was developed by Brandeis University and Nanjing Normal University and is comprised of semantic representations of a set of Chinese sentences from Chinese Treebank 8.0 (LDC2013T21). Abstract Meaning Representation (AMR) captures "who is doi...
Penn Discourse Treebank Version 3.0 Mar 15, 2019 Prasad, Rashmi; Webber, Bonnie; Lee, Alan; Joshi, Aravind, 2019, "Penn Discourse Treebank Version 3.0", https://hdl.handle.net/11272.1/AB2/SUU9CB, Abacus Data Network, V1 Penn Discourse Treebank (PDTB) Version 3.0 is the third release in the Penn Discourse Treebank project, the goal of which is to annotate the Wall Street Journal (WSJ) section of Treebank-2 (LDC95T7) with discourse relations. Penn Discourse Treebank Version 2 (LDC2008T05) contains...
CALLFRIEND Egyptian Arabic Second Edition Mar 15, 2019 Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2019, "CALLFRIEND Egyptian Arabic Second Edition", https://hdl.handle.net/11272.1/AB2/4LCUFC, Abacus Data Network, V1 CALLFRIEND Egyptian Arabic Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 25 hours of unscripted telephone conversations between native speakers of Egyptian Arabic. This second edition updates the audio files to wav format, simp...
VAST Chinese Speech and Transcripts Mar 15, 2019 Tracey, Jennifer; Strassel, Stephanie; Kuster, Neil, 2019, "VAST Chinese Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/OE8XTX, Abacus Data Network, V1 VAST Chinese Speech and Transcripts was developed by the Linguistic Data Consortium (LDC) for the VAST (Video Annotation for Speech Technologies) project and is comprised of approximately 29 hours of Mandarin Chinese audio extracted from amateur video content harvested from the w...
DEFT Chinese Committed Belief Annotation Feb 15, 2019 Tracey, Jennifer; Arrigo, Michael; Kuster, Neil; Strassel, Stephanie, 2019, "DEFT Chinese Committed Belief Annotation", https://hdl.handle.net/11272.1/AB2/EGZOQ9, Abacus Data Network, V1 DEFT Chinese Committed Belief Annotation was developed by the Linguistic Data Consortium (LDC) and consists of approximately 83,000 tokens of Chinese discussion forum text annotated for “committed belief,” which marks the level of commitment displayed by the author to the truth o...
Multilingual ATIS Feb 15, 2019 Upadhyay, Shyam; Hakkani-Tur, Dilek; Tur, Gokhan; Rastogi, Abhinav, 2019, "Multilingual ATIS", https://hdl.handle.net/11272.1/AB2/AGMWIU, Abacus Data Network, V1 Multilingual ATIS was developed by Google Inc. and consists of 5,871 utterances from ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26) annotated and translated into Hindi and Turkish. The ATIS (Air Travel Information Services) collection was develope...
Multi-Language Conversational Telephone Speech 2011 -- Arabic Group Feb 15, 2019 Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2019, "Multi-Language Conversational Telephone Speech 2011 -- Arabic Group", https://hdl.handle.net/11272.1/AB2/A5UT97, Abacus Data Network, V1 Multi-Language Conversational Telephone Speech 2011 – Arabic Group was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 117 hours of telephone speech in distinct dialects of colloquial Arabic: Iraqi, Levantine and Maghrebi. The data were collect...
SRI Speech-Based Collaborative Learning Corpus Jan 15, 2019 Richey, Colleen; D'Angelo, Cynthia; Alozie, Nonye; Bratt, Harry; Shriberg, Elizabeth, 2019, "SRI Speech-Based Collaborative Learning Corpus", https://hdl.handle.net/11272.1/AB2/YJWBEU, Abacus Data Network, V1 SRI Speech-Based Collaborative Learning Corpus was developed by SRI International and is comprised of approximately 120 hours of English speech from 134 US middle school students working collaboratively. The data set also contains orthographic transcriptions, manual annotation of...
TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 Jan 15, 2019 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015", https://hdl.handle.net/11272.1/AB2/LCPM63, Abacus Data Network, V1 TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2014 and 2015...
BOLT Arabic Discussion Forum Parallel Training Data Jan 15, 2019 Song, Zhiyi; Tracey, Jennifer; Walker, Christopher; Strassel, Stephanie, 2019, "BOLT Arabic Discussion Forum Parallel Training Data", https://hdl.handle.net/11272.1/AB2/CZR6SG, Abacus Data Network, V1 BOLT Arabic Discussion Forum Parallel Training Data was developed by the Linguistic Data Consortium (LDC) and consists of 1,169,599 tokens of Egyptian Arabic discussion forum data collected for the DARPA BOLT program along with their corresponding English translations. The BOLT (...
HUB5 Mandarin Telephone Speech and Transcripts Second Edition Dec 17, 2018 Linguistic Data Consortium, 2018, "HUB5 Mandarin Telephone Speech and Transcripts Second Edition", https://hdl.handle.net/11272.1/AB2/2JAJJE, Abacus Data Network, V1 HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by the Linguistic Data Consortium (LDC) in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC...
TAC Relation Extraction Dataset Dec 15, 2018 Zhong, Victor; Zhang, Yuhao; Chen, Danqi; Angeli, Gabor; Manning, Christopher, 2018, "TAC Relation Extraction Dataset", https://hdl.handle.net/11272.1/AB2/SOYGGB, Abacus Data Network, V1 TAC Relation Extraction Dataset (TACRED) was developed by The Stanford NLP Group and is a large-scale relation extraction dataset with 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014....
IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a Nov 15, 2018 Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Hammond, Simon; Harper, Mary; Kaiser-Schatzlein, Alice; Melot, Jennifer; Paget, Shelley; Ray, Jessica; Rytting, Anton; Shen, Sinney; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2018, "IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a", https://hdl.handle.net/11272.1/AB2/OTDPUV, Abacus Data Network, V1 Introduction IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Telugu conversational and scripted telephone speech collected in 2013...
BOLT Egyptian Arabic Treebank - Discussion Forum Nov 15, 2018 Maamouri, Mohamed; Bies, Ann; Kulick, Seth; Krouna, Sondos; Tabassi, Dalila; Ciul, Michael, 2018, "BOLT Egyptian Arabic Treebank - Discussion Forum", https://hdl.handle.net/11272.1/AB2/CAA0JW, Abacus Data Network, V1 BOLT Egyptian Arabic Treebank – Discussion Forum was developed by the Linguistic Data Consortium (LDC) and consists of Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation. The DARPA BOLT (Broad Operational Lang...
Avatar Education Portuguese Nov 15, 2018 Maciel, Alexandre M. A.; Rodrigues, Rodrigo L.; Barbosa, Danilo S., 2018, "Avatar Education Portuguese", https://hdl.handle.net/11272.1/AB2/BSQ4NP, Abacus Data Network, V1 Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant d...
TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 Oct 15, 2018 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2018, "TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014", https://hdl.handle.net/11272.1/AB2/B3R0J4, Abacus Data Network, V1 TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Slot Filling evaluation track conducted from 2009 to 2014...
IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a Sep 17, 2018 Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Hefright, Brook; Kozlov, Kirill; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Phillips, Josh; Walter, Marle; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2018, "IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a", https://hdl.handle.net/11272.1/AB2/KGA4ZX, Abacus Data Network, V1 Introduction IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kazakh conversational and scripted telephone speech collected in 2013...
HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation Sep 17, 2018 Morris, Amanda; Strassel, Stephanie; Li, Xuansong; Antonishek, Brian; Fiscus, Jonathan G., 2018, "HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation", https://hdl.handle.net/11272.1/AB2/XNNWD1, Abacus Data Network, V1 Introduction HAVIC MED Event E051-E060 – Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 53 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and related techn...
Multi-Language Conversational Telephone Speech 2011 -- Spanish Sep 17, 2018 Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2018, "Multi-Language Conversational Telephone Speech 2011 -- Spanish", https://hdl.handle.net/11272.1/AB2/9Q4DIQ, Abacus Data Network, V1 Introduction Multi-Language Conversational Telephone Speech 2011 – Spanish was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 23 hours of telephone speech in Spanish. The data were collected primarily to support research and technology evaluat...
BOLT Information Retrieval Comprehensive Training and Evaluation Sep 17, 2018 Griffitt, Kira; Strassel, Stephanie, 2018, "BOLT Information Retrieval Comprehensive Training and Evaluation", https://hdl.handle.net/11272.1/AB2/EDRQLG, Abacus Data Network, V1 Introduction BOLT Information Retrieval Comprehensive Training and Evaluation was developed by the Linguistic Data Consortium (LDC) and consists of all data produced in support of the Information Retrieval (IR) task within the DARPA Broad Operational Language Translation (BOLT) P...
CIEMPIESS Balance Aug 15, 2018 Hernández Mena, Carlos Daniel, 2018, "CIEMPIESS Balance", https://hdl.handle.net/11272.1/AB2/JWRYUR, Abacus Data Network, V1 CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Balance was developed by the Development of Speech Technologies program at the School of Engineering at the National Autonomous University of Mexico (UNAM) and consists...
2011 NIST Language Recognition Evaluation Test Set Aug 15, 2018 Greenberg, Craig; Martin, Alvin; Graff, David; Walker, Kevin; Jones, Karen; Strassel, Stephanie, 2018, "2011 NIST Language Recognition Evaluation Test Set", https://hdl.handle.net/11272.1/AB2/0ZCWPS, Abacus Data Network, V1 2011 NIST Language Recognition Evaluation Test Set contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation. It consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by the Linguis...
BOLT English SMS/Chat Aug 15, 2018 Song, Zhiyi; Fore, Dana; Strassel, Stephanie; Lee, Haejoong; Wright, Jonathan, 2018, "BOLT English SMS/Chat", https://hdl.handle.net/11272.1/AB2/RNIGFD, Abacus Data Network, V1 BOLT English SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of English. The corpus contains 18,429 co...
IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b Jul 18, 2018 Bills, Aric; Conners, Thomas; Corris, Miriam; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Kaiser-Schatzlein, Alice; Melot, Jennifer; Paget, Shelley; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne; Viswanath, Arun, 2018, "IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b", https://hdl.handle.net/11272.1/AB2/8245NT, Abacus Data Network, V1 Introduction IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 350 hours of Tamil conversational and scripted telephone speech collected in 2012 an...
CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition Jul 16, 2018 Linguistic Data Consortium, 2018, "CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition", https://hdl.handle.net/11272.1/AB2/88OSWL, Abacus Data Network, V1 CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 24 hours of unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in mainland China. This se...
RATS Language Identification Jul 15, 2018 Linguistic Data Consortium, 2018, "RATS Language Identification", https://hdl.handle.net/11272.1/AB2/UP3WJC, Abacus Data Network, V1 RATS Language Identification was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments. The corpus was created to provide...
TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 Jun 15, 2018 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2018, "TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013", https://hdl.handle.net/11272.1/AB2/SRPNPS, Abacus Data Network, V1 TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP English Entity Linking tasks in 2009, 2010, 2011, 2012, and 201...
BOLT Chinese SMS/Chat Jun 15, 2018 Song, Zhiyi; Fore, Dana; Strassel, Stephanie; Lee, Haejoong; Wright, Jonathan, 2018, "BOLT Chinese SMS/Chat", https://hdl.handle.net/11272.1/AB2/MMNPUR, Abacus Data Network, V1 BOLT Chinese SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Chinese. The corpus contains 14,877 co...
Multi-Language Conversational Telephone Speech 2011 -- Central European Jun 15, 2018 Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2018, "Multi-Language Conversational Telephone Speech 2011 -- Central European", https://hdl.handle.net/11272.1/AB2/Y1F6XQ, Abacus Data Network, V1 Multi-Language Conversational Telephone Speech 2011 – Central European was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 44 hours of telephone speech in two distinct language varieties of Central Europe: Czech and Slovak. The data were collec...
Rhythm and Pitch May 15, 2018 Dilley, Laura C.; Breen, Mara; Brown, Meredith; Gibson, Edward, 2018, "Rhythm and Pitch", https://hdl.handle.net/11272.1/AB2/JDLPMX, Abacus Data Network, V1 Rhythm and Pitch contains approximately 27 minutes of spontaneous English conversations and radio news stories annotated with the Rhythm and Pitch (RaP) scheme. Speech data for annotation was taken from two corpora released by LDC, CALLHOME American English Speech (LDC97S42) and...
GALE Phase 4 Arabic Broadcast News Transcripts May 15, 2018 Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2018, "GALE Phase 4 Arabic Broadcast News Transcripts", https://hdl.handle.net/11272.1/AB2/DN3EXL, Abacus Data Network, V1 GALE Phase 4 Arabic Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by the Linguistic Data Consortium (LDC), MediaNet, Tunis, Tunisia...
GALE Phase 4 Arabic Broadcast News Speech May 15, 2018 Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2018, "GALE Phase 4 Arabic Broadcast News Speech", https://hdl.handle.net/11272.1/AB2/ODSQZW, Abacus Data Network, V1 GALE Phase 4 Arabic Broadcast News Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by LDC and MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the...
H2, E2, ERK1 Children's Writing Apr 16, 2018 Berkling, Kay, 2018, "H2, E2, ERK1 Children's Writing", https://hdl.handle.net/11272.1/AB2/7GXGKW, Abacus Data Network, V1 Introduction H2, E2, ERK1 Children’s Writing was developed by the Cooperative State University Baden-Württemberg, University of Education. It consists of approximately 2,000 texts written over four months by 173 German school children age six through eleven years. The data in thi...
TRAD Arabic-French Parallel Text -- Newsgroup Apr 16, 2018 Linguistic Data Consortium, 2018, "TRAD Arabic-French Parallel Text -- Newsgroup", https://hdl.handle.net/11272.1/AB2/0DET8M, Abacus Data Network, V1 Introduction TRAD Arabic-French Parallel Text – Newsgroup was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03). The PEA-TRAD p...
SPADE Mar 15, 2018 Arase, Yuki; Tsujii, Junichi, 2018, "SPADE", https://hdl.handle.net/11272.1/AB2/V6GR5J, Abacus Data Network, V1 SPADE (Syntactic Phrase Alignment Dataset for Evaluation) consists of annotated parse trees and alignment on English sentential paraphrases extracted from machine translation evaluation corpora and separated into development and test sets. Reference translations from machine tran...
LORELEI Somali Representative Language Pack - Monolingual and Parallel Text Mar 15, 2018 Tracey, Jennifer; Graff, David; Strassel, Stephanie; Ma, Xiaoyi; Wright, Jonathan, 2018, "LORELEI Somali Representative Language Pack - Monolingual and Parallel Text", https://hdl.handle.net/11272.1/AB2/75GGBX, Abacus Data Network, V1 LORELEI Somali Representative Language Pack - Monolingual and Parallel Text was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 13 million words of monolingual Somali text, approximately 800,000 of which are translated into English. Another 100...
TAC KBP Comprehensive English Source Corpora 2009-2014 Feb 16, 2018 Ellis, Joe; Getman, Jeremy; Graff, David; Strassel, Stephanie, 2018, "TAC KBP Comprehensive English Source Corpora 2009-2014", https://hdl.handle.net/11272.1/AB2/VC89SM, Abacus Data Network, V1 Introduction TAC KBP Comprehensive English Source Corpora 2009-2014 was developed by the Linguistic Data Consortium (LDC) and contains the 3,877,207 English source documents used in support of the TAC KBP tasks from 2009-2014. Text Analysis Conference (TAC) is a series of worksho...
LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text Feb 16, 2018 Tracey, Jennifer; Graff, David; Strassel, Stephanie; Ma, Xiaoyi; Wright, Jonathan, 2018, "LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text", https://hdl.handle.net/11272.1/AB2/5TNZPX, Abacus Data Network, V1 Introduction LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text was developed by the Linguistic Data Consortium and is comprised of approximately 25 million words of monolingual Amharic text, approximately 600,000 of which are translated into English. An...
Multi-Language Conversational Telephone Speech 2011 -- Central Asian Feb 16, 2018 Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2018, "Multi-Language Conversational Telephone Speech 2011 -- Central Asian", https://hdl.handle.net/11272.1/AB2/YW9PX3, Abacus Data Network, V1 Introduction Multi-Language Conversational Telephone Speech 2011 – Central Asian was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 37 hours of telephone speech in three distinct language varieties of Central Asia: Dari, Farsi and Pashto. The...
DIRHA English WSJ Audio Jan 16, 2018 Ravanelli, Mirco; Cristoforetti, Luca; Omologo, Maurizio, 2018, "DIRHA English WSJ Audio", https://hdl.handle.net/11272.1/AB2/8WSEVY, Abacus Data Network, V1 Introduction DIRHA English WSJ Audio was developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. It is comprised of approximately 85...
DEFT Spanish Treebank Jan 16, 2018 Taulé, Mariona; Martí, Maria Antonia; Bies, Ann; Garí, Aina; Nofre, Montserrat; Song, Zhiyi; Strassel, Stephanie; Ellis, Joe, 2018, "DEFT Spanish Treebank", https://hdl.handle.net/11272.1/AB2/Z3OEWX, Abacus Data Network, V1 Introduction DEFT Spanish Treebank was developed by the Linguistic Data Consortium (LDC) and the Language and Computation Center (CLiC), University of Barcelona. It contains treebank annotation of international Spanish newswire text and Latin American Spanish discussion forum dat...
TRAD Chinese-French Parallel Text -- Blog Jan 16, 2018 Linguistic Data Consortium; ELDA, 2018, "TRAD Chinese-French Parallel Text -- Blog", https://hdl.handle.net/11272.1/AB2/ATYE6I, Abacus Data Network, V1 Introduction TRAD Chinese-French Parallel Text – Blog was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Chinese words from GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06). The PEA-TRAD project (Translat...
GALE Phase 4 Chinese Broadcast News Transcripts Dec 15, 2017 Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2017, "GALE Phase 4 Chinese Broadcast News Transcripts", https://hdl.handle.net/11272.1/AB2/KTVMHA, Abacus Data Network, V1 Introduction GALE Phase 4 Chinese Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 134 hours of Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST...
GALE Phase 4 Chinese Broadcast News Speech Dec 15, 2017 Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2017, "GALE Phase 4 Chinese Broadcast News Speech", https://hdl.handle.net/11272.1/AB2/4ADDAM, Abacus Data Network, V1 Introduction GALE Phase 4 Chinese Broadcast News Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 134 hours of Mandarin Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong...
CIEMPIESS Light Nov 17, 2017 Mena, Carlos Daniel Hernández; Herrera, Abel, 2017, "CIEMPIESS Light", https://hdl.handle.net/11272.1/AB2/JXHBRG, Abacus Data Network, V1 CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Light was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximate...


SRI Speech-Based Collaborative Learning Corpus

Jan 15, 2019

Richey, Colleen; D'Angelo, Cynthia; Alozie, Nonye; Bratt, Harry; Shriberg, Elizabeth, 2019, "SRI Speech-Based Collaborative Learning Corpus", https://hdl.handle.net/11272.1/AB2/YJWBEU, Abacus Data Network, V1

SRI Speech-Based Collaborative Learning Corpus was developed by SRI International and is comprised of approximately 120 hours of English speech from 134 US middle school students working collaboratively. The data set also contains orthographic transcriptions, manual annotation of...

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015

Jan 15, 2019

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015", https://hdl.handle.net/11272.1/AB2/LCPM63, Abacus Data Network, V1

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2014 and 2015...

BOLT Arabic Discussion Forum Parallel Training Data

Jan 15, 2019

Song, Zhiyi; Tracey, Jennifer; Walker, Christopher; Strassel, Stephanie, 2019, "BOLT Arabic Discussion Forum Parallel Training Data", https://hdl.handle.net/11272.1/AB2/CZR6SG, Abacus Data Network, V1

BOLT Arabic Discussion Forum Parallel Training Data was developed by the Linguistic Data Consortium (LDC) and consists of 1,169,599 tokens of Egyptian Arabic discussion forum data collected for the DARPA BOLT program along with their corresponding English translations. The BOLT (...

HUB5 Mandarin Telephone Speech and Transcripts Second Edition

Dec 17, 2018

Linguistic Data Consortium, 2018, "HUB5 Mandarin Telephone Speech and Transcripts Second Edition", https://hdl.handle.net/11272.1/AB2/2JAJJE, Abacus Data Network, V1

HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by the Linguistic Data Consortium (LDC) in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC...

TAC Relation Extraction Dataset

Dec 15, 2018

Zhong, Victor; Zhang, Yuhao; Chen, Danqi; Angeli, Gabor; Manning, Christopher, 2018, "TAC Relation Extraction Dataset", https://hdl.handle.net/11272.1/AB2/SOYGGB, Abacus Data Network, V1

TAC Relation Extraction Dataset (TACRED) was developed by The Stanford NLP Group and is a large-scale relation extraction dataset with 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014....

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a

Nov 15, 2018

Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Hammond, Simon; Harper, Mary; Kaiser-Schatzlein, Alice; Melot, Jennifer; Paget, Shelley; Ray, Jessica; Rytting, Anton; Shen, Sinney; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2018, "IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a", https://hdl.handle.net/11272.1/AB2/OTDPUV, Abacus Data Network, V1

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Telugu conversational and scripted telephone speech collected in 2013...

BOLT Egyptian Arabic Treebank - Discussion Forum

Nov 15, 2018

Maamouri, Mohamed; Bies, Ann; Kulick, Seth; Krouna, Sondos; Tabassi, Dalila; Ciul, Michael, 2018, "BOLT Egyptian Arabic Treebank - Discussion Forum", https://hdl.handle.net/11272.1/AB2/CAA0JW, Abacus Data Network, V1

BOLT Egyptian Arabic Treebank – Discussion Forum was developed by the Linguistic Data Consortium (LDC) and consists of Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation. The DARPA BOLT (Broad Operational Lang...

Avatar Education Portuguese

Nov 15, 2018

Maciel, Alexandre M. A.; Rodrigues, Rodrigo L.; Barbosa, Danilo S., 2018, "Avatar Education Portuguese", https://hdl.handle.net/11272.1/AB2/BSQ4NP, Abacus Data Network, V1

Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant d...

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014

Oct 15, 2018

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2018, "TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014", https://hdl.handle.net/11272.1/AB2/B3R0J4, Abacus Data Network, V1

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Slot Filling evaluation track conducted from 2009 to 2014...

IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a

Sep 17, 2018

Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Hefright, Brook; Kozlov, Kirill; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Phillips, Josh; Walter, Marle; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2018, "IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a", https://hdl.handle.net/11272.1/AB2/KGA4ZX, Abacus Data Network, V1

IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kazakh conversational and scripted telephone speech collected in 2013...

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation

Sep 17, 2018

Morris, Amanda; Strassel, Stephanie; Li, Xuansong; Antonishek, Brian; Fiscus, Jonathan G., 2018, "HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation", https://hdl.handle.net/11272.1/AB2/XNNWD1, Abacus Data Network, V1

HAVIC MED Event E051-E060 – Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 53 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and related techn...

Multi-Language Conversational Telephone Speech 2011 -- Spanish

Sep 17, 2018

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2018, "Multi-Language Conversational Telephone Speech 2011 -- Spanish", https://hdl.handle.net/11272.1/AB2/9Q4DIQ, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – Spanish was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 23 hours of telephone speech in Spanish. The data were collected primarily to support research and technology evaluat...

BOLT Information Retrieval Comprehensive Training and Evaluation

Sep 17, 2018

Griffitt, Kira; Strassel, Stephanie, 2018, "BOLT Information Retrieval Comprehensive Training and Evaluation", https://hdl.handle.net/11272.1/AB2/EDRQLG, Abacus Data Network, V1

BOLT Information Retrieval Comprehensive Training and Evaluation was developed by the Linguistic Data Consortium (LDC) and consists of all data produced in support of the Information Retrieval (IR) task within the DARPA Broad Operational Language Translation (BOLT) P...

CIEMPIESS Balance

Aug 15, 2018

Hernández Mena, Carlos Daniel, 2018, "CIEMPIESS Balance", https://hdl.handle.net/11272.1/AB2/JWRYUR, Abacus Data Network, V1

CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Balance was developed by the Development of Speech Technologies program at the School of Engineering at the National Autonomous University of Mexico (UNAM) and consists...

2011 NIST Language Recognition Evaluation Test Set

Aug 15, 2018

Greenberg, Craig; Martin, Alvin; Graff, David; Walker, Kevin; Jones, Karen; Strassel, Stephanie, 2018, "2011 NIST Language Recognition Evaluation Test Set", https://hdl.handle.net/11272.1/AB2/0ZCWPS, Abacus Data Network, V1

2011 NIST Language Recognition Evaluation Test Set contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation. It consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by the Linguis...

BOLT English SMS/Chat

Aug 15, 2018

Song, Zhiyi; Fore, Dana; Strassel, Stephanie; Lee, Haejoong; Wright, Jonathan, 2018, "BOLT English SMS/Chat", https://hdl.handle.net/11272.1/AB2/RNIGFD, Abacus Data Network, V1

BOLT English SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of English. The corpus contains 18,429 co...

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b

Jul 18, 2018

Bills, Aric; Conners, Thomas; Corris, Miriam; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Kaiser-Schatzlein, Alice; Melot, Jennifer; Paget, Shelley; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne; Viswanath, Arun, 2018, "IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b", https://hdl.handle.net/11272.1/AB2/8245NT, Abacus Data Network, V1

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 350 hours of Tamil conversational and scripted telephone speech collected in 2012 an...

CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition

Jul 16, 2018

Linguistic Data Consortium, 2018, "CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition", https://hdl.handle.net/11272.1/AB2/88OSWL, Abacus Data Network, V1

CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 24 hours of unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in mainland China. This se...

RATS Language Identification

Jul 15, 2018

Linguistic Data Consortium, 2018, "RATS Language Identification", https://hdl.handle.net/11272.1/AB2/UP3WJC, Abacus Data Network, V1

RATS Language Identification was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments. The corpus was created to provide...

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013

Jun 15, 2018

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2018, "TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013", https://hdl.handle.net/11272.1/AB2/SRPNPS, Abacus Data Network, V1

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP English Entity Linking tasks in 2009, 2010, 2011, 2012, and 201...

BOLT Chinese SMS/Chat

Jun 15, 2018

Song, Zhiyi; Fore, Dana; Strassel, Stephanie; Lee, Haejoong; Wright, Jonathan, 2018, "BOLT Chinese SMS/Chat", https://hdl.handle.net/11272.1/AB2/MMNPUR, Abacus Data Network, V1

BOLT Chinese SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Chinese. The corpus contains 14,877 co...

Multi-Language Conversational Telephone Speech 2011 -- Central European

Jun 15, 2018

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2018, "Multi-Language Conversational Telephone Speech 2011 -- Central European", https://hdl.handle.net/11272.1/AB2/Y1F6XQ, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – Central European was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 44 hours of telephone speech in two distinct language varieties of Central Europe: Czech and Slovak. The data were collec...

Rhythm and Pitch

May 15, 2018

Dilley, Laura C.; Breen, Mara; Brown, Meredith; Gibson, Edward, 2018, "Rhythm and Pitch", https://hdl.handle.net/11272.1/AB2/JDLPMX, Abacus Data Network, V1

Rhythm and Pitch contains approximately 27 minutes of spontaneous English conversations and radio news stories annotated with the Rhythm and Pitch (RaP) scheme. Speech data for annotation was taken from two corpora released by LDC, CALLHOME American English Speech (LDC97S42) and...

GALE Phase 4 Arabic Broadcast News Transcripts

May 15, 2018

Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2018, "GALE Phase 4 Arabic Broadcast News Transcripts", https://hdl.handle.net/11272.1/AB2/DN3EXL, Abacus Data Network, V1

GALE Phase 4 Arabic Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by the Linguistic Data Consortium (LDC), MediaNet, Tunis, Tunisia...

GALE Phase 4 Arabic Broadcast News Speech

May 15, 2018

Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2018, "GALE Phase 4 Arabic Broadcast News Speech", https://hdl.handle.net/11272.1/AB2/ODSQZW, Abacus Data Network, V1

GALE Phase 4 Arabic Broadcast News Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by LDC and MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the...

H2, E2, ERK1 Children's Writing

Apr 16, 2018

Berkling, Kay, 2018, "H2, E2, ERK1 Children's Writing", https://hdl.handle.net/11272.1/AB2/7GXGKW, Abacus Data Network, V1

H2, E2, ERK1 Children’s Writing was developed by the Cooperative State University Baden-Württemberg, University of Education. It consists of approximately 2,000 texts written over four months by 173 German school children age six through eleven years. The data in thi...

TRAD Arabic-French Parallel Text -- Newsgroup

Apr 16, 2018

Linguistic Data Consortium, 2018, "TRAD Arabic-French Parallel Text -- Newsgroup", https://hdl.handle.net/11272.1/AB2/0DET8M, Abacus Data Network, V1

TRAD Arabic-French Parallel Text – Newsgroup was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03). The PEA-TRAD p...

SPADE

Mar 15, 2018

Arase, Yuki; Tsujii, Junichi, 2018, "SPADE", https://hdl.handle.net/11272.1/AB2/V6GR5J, Abacus Data Network, V1

SPADE (Syntactic Phrase Alignment Dataset for Evaluation) consists of annotated parse trees and alignment on English sentential paraphrases extracted from machine translation evaluation corpora and separated into development and test sets. Reference translations from machine tran...

LORELEI Somali Representative Language Pack - Monolingual and Parallel Text

Mar 15, 2018

Tracey, Jennifer; Graff, David; Strassel, Stephanie; Ma, Xiaoyi; Wright, Jonathan, 2018, "LORELEI Somali Representative Language Pack - Monolingual and Parallel Text", https://hdl.handle.net/11272.1/AB2/75GGBX, Abacus Data Network, V1

LORELEI Somali Representative Language Pack - Monolingual and Parallel Text was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 13 million words of monolingual Somali text, approximately 800,000 of which are translated into English. Another 100...

TAC KBP Comprehensive English Source Corpora 2009-2014

Feb 16, 2018

Ellis, Joe; Getman, Jeremy; Graff, David; Strassel, Stephanie, 2018, "TAC KBP Comprehensive English Source Corpora 2009-2014", https://hdl.handle.net/11272.1/AB2/VC89SM, Abacus Data Network, V1

TAC KBP Comprehensive English Source Corpora 2009-2014 was developed by the Linguistic Data Consortium (LDC) and contains the 3,877,207 English source documents used in support of the TAC KBP tasks from 2009-2014. Text Analysis Conference (TAC) is a series of worksho...

LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text

Feb 16, 2018

Tracey, Jennifer; Graff, David; Strassel, Stephanie; Ma, Xiaoyi; Wright, Jonathan, 2018, "LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text", https://hdl.handle.net/11272.1/AB2/5TNZPX, Abacus Data Network, V1

LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text was developed by the Linguistic Data Consortium and is comprised of approximately 25 million words of monolingual Amharic text, approximately 600,000 of which are translated into English. An...

Multi-Language Conversational Telephone Speech 2011 -- Central Asian

Feb 16, 2018

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2018, "Multi-Language Conversational Telephone Speech 2011 -- Central Asian", https://hdl.handle.net/11272.1/AB2/YW9PX3, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – Central Asian was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 37 hours of telephone speech in three distinct language varieties of Central Asia: Dari, Farsi and Pashto. The...

DIRHA English WSJ Audio

Jan 16, 2018

Ravanelli, Mirco; Cristoforetti, Luca; Omologo, Maurizio, 2018, "DIRHA English WSJ Audio", https://hdl.handle.net/11272.1/AB2/8WSEVY, Abacus Data Network, V1

DIRHA English WSJ Audio was developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. It is comprised of approximately 85...

DEFT Spanish Treebank

Jan 16, 2018

Taulé, Mariona; Martí, Maria Antonia; Bies, Ann; Garí, Aina; Nofre, Montserrat; Song, Zhiyi; Strassel, Stephanie; Ellis, Joe, 2018, "DEFT Spanish Treebank", https://hdl.handle.net/11272.1/AB2/Z3OEWX, Abacus Data Network, V1

DEFT Spanish Treebank was developed by the Linguistic Data Consortium (LDC) and the Language and Computation Center (CLiC), University of Barcelona. It contains treebank annotation of international Spanish newswire text and Latin American Spanish discussion forum dat...

TRAD Chinese-French Parallel Text -- Blog

Jan 16, 2018

Linguistic Data Consortium; ELDA, 2018, "TRAD Chinese-French Parallel Text -- Blog", https://hdl.handle.net/11272.1/AB2/ATYE6I, Abacus Data Network, V1

TRAD Chinese-French Parallel Text – Blog was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Chinese words from GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06). The PEA-TRAD project (Translat...

GALE Phase 4 Chinese Broadcast News Transcripts

Dec 15, 2017

Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2017, "GALE Phase 4 Chinese Broadcast News Transcripts", https://hdl.handle.net/11272.1/AB2/KTVMHA, Abacus Data Network, V1

GALE Phase 4 Chinese Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 134 hours of Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST...

GALE Phase 4 Chinese Broadcast News Speech

Dec 15, 2017

Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2017, "GALE Phase 4 Chinese Broadcast News Speech", https://hdl.handle.net/11272.1/AB2/4ADDAM, Abacus Data Network, V1

GALE Phase 4 Chinese Broadcast News Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 134 hours of Mandarin Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong...

CIEMPIESS Light

Nov 17, 2017

Hernández Mena, Carlos Daniel; Herrera, Abel, 2017, "CIEMPIESS Light", https://hdl.handle.net/11272.1/AB2/JXHBRG, Abacus Data Network, V1

CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Light was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximate...
