Linguistic Data Consortium


DEFT Chinese Committed Belief Annotation

Feb 15, 2019

Tracey, Jennifer; Arrigo, Michael; Kuster, Neil; Strassel, Stephanie, 2019, "DEFT Chinese Committed Belief Annotation", https://hdl.handle.net/11272.1/AB2/EGZOQ9, Abacus Data Network, V1

DEFT Chinese Committed Belief Annotation was developed by the Linguistic Data Consortium (LDC) and consists of approximately 83,000 tokens of Chinese discussion forum text annotated for “committed belief,” which marks the level of commitment displayed by the author to the truth o...

Multilingual ATIS

Feb 15, 2019

Upadhyay, Shyam; Hakkani-Tur, Dilek; Tur, Gokhan; Rastogi, Abhinav, 2019, "Multilingual ATIS", https://hdl.handle.net/11272.1/AB2/AGMWIU, Abacus Data Network, V1

Multilingual ATIS was developed by Google Inc. and consists of 5,871 utterances from ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26) annotated and translated into Hindi and Turkish. The ATIS (Air Travel Information Services) collection was develope...

Multi-Language Conversational Telephone Speech 2011 -- Arabic Group

Feb 15, 2019

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2019, "Multi-Language Conversational Telephone Speech 2011 -- Arabic Group", https://hdl.handle.net/11272.1/AB2/A5UT97, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – Arabic Group was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 117 hours of telephone speech in distinct dialects of colloquial Arabic: Iraqi, Levantine and Maghrebi. The data were collect...

SRI Speech-Based Collaborative Learning Corpus

Jan 15, 2019

Richey, Colleen; D'Angelo, Cynthia; Alozie, Nonye; Bratt, Harry; Shriberg, Elizabeth, 2019, "SRI Speech-Based Collaborative Learning Corpus", https://hdl.handle.net/11272.1/AB2/YJWBEU, Abacus Data Network, V1

SRI Speech-Based Collaborative Learning Corpus was developed by SRI International and is comprised of approximately 120 hours of English speech from 134 US middle school students working collaboratively. The data set also contains orthographic transcriptions, manual annotation of...

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015

Jan 15, 2019

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015", https://hdl.handle.net/11272.1/AB2/LCPM63, Abacus Data Network, V1

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2014 and 2015...

BOLT Arabic Discussion Forum Parallel Training Data

Jan 15, 2019

Song, Zhiyi; Tracey, Jennifer; Walker, Christopher; Strassel, Stephanie, 2019, "BOLT Arabic Discussion Forum Parallel Training Data", https://hdl.handle.net/11272.1/AB2/CZR6SG, Abacus Data Network, V1

BOLT Arabic Discussion Forum Parallel Training Data was developed by the Linguistic Data Consortium (LDC) and consists of 1,169,599 tokens of Egyptian Arabic discussion forum data collected for the DARPA BOLT program along with their corresponding English translations. The BOLT (...

HUB5 Mandarin Telephone Speech and Transcripts Second Edition

Dec 17, 2018

Linguistic Data Consortium, 2018, "HUB5 Mandarin Telephone Speech and Transcripts Second Edition", https://hdl.handle.net/11272.1/AB2/2JAJJE, Abacus Data Network, V1

HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by the Linguistic Data Consortium (LDC) in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC...

TAC Relation Extraction Dataset

Dec 15, 2018

Zhong, Victor; Zhang, Yuhao; Chen, Danqi; Angeli, Gabor; Manning, Christopher, 2018, "TAC Relation Extraction Dataset", https://hdl.handle.net/11272.1/AB2/SOYGGB, Abacus Data Network, V1

TAC Relation Extraction Dataset (TACRED) was developed by The Stanford NLP Group and is a large-scale relation extraction dataset with 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014....

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a

Nov 15, 2018

Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Hammond, Simon; Harper, Mary; Kaiser-Schatzlein, Alice; Melot, Jennifer; Paget, Shelley; Ray, Jessica; Rytting, Anton; Shen, Sinney; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2018, "IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a", https://hdl.handle.net/11272.1/AB2/OTDPUV, Abacus Data Network, V1

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Telugu conversational and scripted telephone speech collected in 2013...

BOLT Egyptian Arabic Treebank - Discussion Forum

Nov 15, 2018

Maamouri, Mohamed; Bies, Ann; Kulick, Seth; Krouna, Sondos; Tabassi, Dalila; Ciul, Michael, 2018, "BOLT Egyptian Arabic Treebank - Discussion Forum", https://hdl.handle.net/11272.1/AB2/CAA0JW, Abacus Data Network, V1

BOLT Egyptian Arabic Treebank – Discussion Forum was developed by the Linguistic Data Consortium (LDC) and consists of Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation. The DARPA BOLT (Broad Operational Lang...

Avatar Education Portuguese

Nov 15, 2018

Maciel, Alexandre M. A.; Rodrigues, Rodrigo L.; Barbosa, Danilo S., 2018, "Avatar Education Portuguese", https://hdl.handle.net/11272.1/AB2/BSQ4NP, Abacus Data Network, V1

Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant d...

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014

Oct 15, 2018

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2018, "TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014", https://hdl.handle.net/11272.1/AB2/B3R0J4, Abacus Data Network, V1

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Slot Filling evaluation track conducted from 2009 to 2014...

IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a

Sep 17, 2018

Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Hefright, Brook; Kozlov, Kirill; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Phillips, Josh; Walter, Marle; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2018, "IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a", https://hdl.handle.net/11272.1/AB2/KGA4ZX, Abacus Data Network, V1

IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kazakh conversational and scripted telephone speech collected in 2013...

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation

Sep 17, 2018

Morris, Amanda; Strassel, Stephanie; Li, Xuansong; Antonishek, Brian; Fiscus, Jonathan G., 2018, "HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation", https://hdl.handle.net/11272.1/AB2/XNNWD1, Abacus Data Network, V1

HAVIC MED Event E051-E060 – Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 53 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and related techn...

Multi-Language Conversational Telephone Speech 2011 -- Spanish

Sep 17, 2018

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2018, "Multi-Language Conversational Telephone Speech 2011 -- Spanish", https://hdl.handle.net/11272.1/AB2/9Q4DIQ, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – Spanish was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 23 hours of telephone speech in Spanish. The data were collected primarily to support research and technology evaluat...

BOLT Information Retrieval Comprehensive Training and Evaluation

Sep 17, 2018

Griffitt, Kira; Strassel, Stephanie, 2018, "BOLT Information Retrieval Comprehensive Training and Evaluation", https://hdl.handle.net/11272.1/AB2/EDRQLG, Abacus Data Network, V1

BOLT Information Retrieval Comprehensive Training and Evaluation was developed by the Linguistic Data Consortium (LDC) and consists of all data produced in support of the Information Retrieval (IR) task within the DARPA Broad Operational Language Translation (BOLT) P...

CIEMPIESS Balance

Aug 15, 2018

Hernández Mena, Carlos Daniel, 2018, "CIEMPIESS Balance", https://hdl.handle.net/11272.1/AB2/JWRYUR, Abacus Data Network, V1

CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Balance was developed by the Development of Speech Technologies program at the School of Engineering at the National Autonomous University of Mexico (UNAM) and consists...

2011 NIST Language Recognition Evaluation Test Set

Aug 15, 2018

Greenberg, Craig; Martin, Alvin; Graff, David; Walker, Kevin; Jones, Karen; Strassel, Stephanie, 2018, "2011 NIST Language Recognition Evaluation Test Set", https://hdl.handle.net/11272.1/AB2/0ZCWPS, Abacus Data Network, V1

2011 NIST Language Recognition Evaluation Test Set contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation. It consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by the Linguis...

BOLT English SMS/Chat

Aug 15, 2018

Song, Zhiyi; Fore, Dana; Strassel, Stephanie; Lee, Haejoong; Wright, Jonathan, 2018, "BOLT English SMS/Chat", https://hdl.handle.net/11272.1/AB2/RNIGFD, Abacus Data Network, V1

BOLT English SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of English. The corpus contains 18,429 co...

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b

Jul 18, 2018

Bills, Aric; Conners, Thomas; Corris, Miriam; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Kaiser-Schatzlein, Alice; Melot, Jennifer; Paget, Shelley; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne; Viswanath, Arun, 2018, "IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b", https://hdl.handle.net/11272.1/AB2/8245NT, Abacus Data Network, V1

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 350 hours of Tamil conversational and scripted telephone speech collected in 2012 an...

CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition

Jul 16, 2018

Linguistic Data Consortium, 2018, "CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition", https://hdl.handle.net/11272.1/AB2/88OSWL, Abacus Data Network, V1

CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 24 hours of unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in mainland China. This se...

RATS Language Identification

Jul 15, 2018

Linguistic Data Consortium, 2018, "RATS Language Identification", https://hdl.handle.net/11272.1/AB2/UP3WJC, Abacus Data Network, V1

RATS Language Identification was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments. The corpus was created to provide...

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013

Jun 15, 2018

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2018, "TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013", https://hdl.handle.net/11272.1/AB2/SRPNPS, Abacus Data Network, V1

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP English Entity Linking tasks in 2009, 2010, 2011, 2012, and 201...

BOLT Chinese SMS/Chat

Jun 15, 2018

Song, Zhiyi; Fore, Dana; Strassel, Stephanie; Lee, Haejoong; Wright, Jonathan, 2018, "BOLT Chinese SMS/Chat", https://hdl.handle.net/11272.1/AB2/MMNPUR, Abacus Data Network, V1

BOLT Chinese SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Chinese. The corpus contains 14,877 co...

Multi-Language Conversational Telephone Speech 2011 -- Central European

Jun 15, 2018

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2018, "Multi-Language Conversational Telephone Speech 2011 -- Central European", https://hdl.handle.net/11272.1/AB2/Y1F6XQ, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – Central European was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 44 hours of telephone speech in two distinct language varieties of Central Europe: Czech and Slovak. The data were collec...

Rhythm and Pitch

May 15, 2018

Dilley, Laura C.; Breen, Mara; Brown, Meredith; Gibson, Edward, 2018, "Rhythm and Pitch", https://hdl.handle.net/11272.1/AB2/JDLPMX, Abacus Data Network, V1

Rhythm and Pitch contains approximately 27 minutes of spontaneous English conversations and radio news stories annotated with the Rhythm and Pitch (RaP) scheme. Speech data for annotation was taken from two corpora released by LDC, CALLHOME American English Speech (LDC97S42) and...

GALE Phase 4 Arabic Broadcast News Transcripts

May 15, 2018

Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2018, "GALE Phase 4 Arabic Broadcast News Transcripts", https://hdl.handle.net/11272.1/AB2/DN3EXL, Abacus Data Network, V1

GALE Phase 4 Arabic Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by the Linguistic Data Consortium (LDC), MediaNet, Tunis, Tunisia...

GALE Phase 4 Arabic Broadcast News Speech

May 15, 2018

Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2018, "GALE Phase 4 Arabic Broadcast News Speech", https://hdl.handle.net/11272.1/AB2/ODSQZW, Abacus Data Network, V1

GALE Phase 4 Arabic Broadcast News Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by LDC and MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the...

H2, E2, ERK1 Children's Writing

Apr 16, 2018

Berkling, Kay, 2018, "H2, E2, ERK1 Children's Writing", https://hdl.handle.net/11272.1/AB2/7GXGKW, Abacus Data Network, V1

H2, E2, ERK1 Children’s Writing was developed by the Cooperative State University Baden-Württemberg, University of Education. It consists of approximately 2,000 texts written over four months by 173 German school children age six through eleven years. The data in thi...

TRAD Arabic-French Parallel Text -- Newsgroup

Apr 16, 2018

Linguistic Data Consortium, 2018, "TRAD Arabic-French Parallel Text -- Newsgroup", https://hdl.handle.net/11272.1/AB2/0DET8M, Abacus Data Network, V1

TRAD Arabic-French Parallel Text – Newsgroup was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03). The PEA-TRAD p...

SPADE

Mar 15, 2018

Arase, Yuki; Tsujii, Junichi, 2018, "SPADE", https://hdl.handle.net/11272.1/AB2/V6GR5J, Abacus Data Network, V1

SPADE (Syntactic Phrase Alignment Dataset for Evaluation) consists of annotated parse trees and alignment on English sentential paraphrases extracted from machine translation evaluation corpora and separated into development and test sets. Reference translations from machine tran...

LORELEI Somali Representative Language Pack - Monolingual and Parallel Text

Mar 15, 2018

Tracey, Jennifer; Graff, David; Strassel, Stephanie; Ma, Xiaoyi; Wright, Jonathan, 2018, "LORELEI Somali Representative Language Pack - Monolingual and Parallel Text", https://hdl.handle.net/11272.1/AB2/75GGBX, Abacus Data Network, V1

LORELEI Somali Representative Language Pack - Monolingual and Parallel Text was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 13 million words of monolingual Somali text, approximately 800,000 of which are translated into English. Another 100...

TAC KBP Comprehensive English Source Corpora 2009-2014

Feb 16, 2018

Ellis, Joe; Getman, Jeremy; Graff, David; Strassel, Stephanie, 2018, "TAC KBP Comprehensive English Source Corpora 2009-2014", https://hdl.handle.net/11272.1/AB2/VC89SM, Abacus Data Network, V1

TAC KBP Comprehensive English Source Corpora 2009-2014 was developed by the Linguistic Data Consortium (LDC) and contains the 3,877,207 English source documents used in support of the TAC KBP tasks from 2009-2014. Text Analysis Conference (TAC) is a series of worksho...

LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text

Feb 16, 2018

Tracey, Jennifer; Graff, David; Strassel, Stephanie; Ma, Xiaoyi; Wright, Jonathan, 2018, "LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text", https://hdl.handle.net/11272.1/AB2/5TNZPX, Abacus Data Network, V1

LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text was developed by the Linguistic Data Consortium and is comprised of approximately 25 million words of monolingual Amharic text, approximately 600,000 of which are translated into English. An...

Multi-Language Conversational Telephone Speech 2011 -- Central Asian

Feb 16, 2018

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2018, "Multi-Language Conversational Telephone Speech 2011 -- Central Asian", https://hdl.handle.net/11272.1/AB2/YW9PX3, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – Central Asian was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 37 hours of telephone speech in three distinct language varieties of Central Asia: Dari, Farsi and Pashto. The...

DIRHA English WSJ Audio

Jan 16, 2018

Ravanelli, Mirco; Cristoforetti, Luca; Omologo, Maurizio, 2018, "DIRHA English WSJ Audio", https://hdl.handle.net/11272.1/AB2/8WSEVY, Abacus Data Network, V1

DIRHA English WSJ Audio was developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. It is comprised of approximately 85...

DEFT Spanish Treebank

Jan 16, 2018

Taulé, Mariona; Martí, Maria Antonia; Bies, Ann; Garí, Aina; Nofre, Montserrat; Song, Zhiyi; Strassel, Stephanie; Ellis, Joe, 2018, "DEFT Spanish Treebank", https://hdl.handle.net/11272.1/AB2/Z3OEWX, Abacus Data Network, V1

DEFT Spanish Treebank was developed by the Linguistic Data Consortium (LDC) and the Language and Computation Center (CLiC), University of Barcelona. It contains treebank annotation of international Spanish newswire text and Latin American Spanish discussion forum dat...

TRAD Chinese-French Parallel Text -- Blog

Jan 16, 2018

Linguistic Data Consortium; ELDA, 2018, "TRAD Chinese-French Parallel Text -- Blog", https://hdl.handle.net/11272.1/AB2/ATYE6I, Abacus Data Network, V1

TRAD Chinese-French Parallel Text – Blog was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Chinese words from GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06). The PEA-TRAD project (Translat...

GALE Phase 4 Chinese Broadcast News Transcripts

Dec 15, 2017

Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2017, "GALE Phase 4 Chinese Broadcast News Transcripts", https://hdl.handle.net/11272.1/AB2/KTVMHA, Abacus Data Network, V1

GALE Phase 4 Chinese Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 134 hours of Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST...

GALE Phase 4 Chinese Broadcast News Speech

Dec 15, 2017

Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2017, "GALE Phase 4 Chinese Broadcast News Speech", https://hdl.handle.net/11272.1/AB2/4ADDAM, Abacus Data Network, V1

GALE Phase 4 Chinese Broadcast News Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 134 hours of Mandarin Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong...

CIEMPIESS Light

Nov 17, 2017

Hernández Mena, Carlos Daniel; Herrera, Abel, 2017, "CIEMPIESS Light", https://hdl.handle.net/11272.1/AB2/JXHBRG, Abacus Data Network, V1

CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Light was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximate...

TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014

Nov 17, 2017

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2017, "TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014", https://hdl.handle.net/11272.1/AB2/XOE0NF, Abacus Data Network, V1

TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014 was developed by the Linguistic Data Consortium and contains training and evaluation data produced in support of the TAC KBP Chinese Cross-lingual Entity Linking tasks in 2011, 201...

Ancient Chinese Corpus

Oct 18, 2017

Chen, Xiaohe; Li, Bin; Feng, Minxuan; Xu, Chao; Xu, Runhua; Shi, Min; Yu, Lili; Xiao, Lei; Wang, Qingqing, 2017, "Ancient Chinese Corpus", https://hdl.handle.net/11272.1/AB2/4HYBFE, Abacus Data Network, V1

Ancient Chinese Corpus was developed at Nanjing Normal University. It contains word-segmented and part-of-speech tagged text from Zuozhuan, an ancient Chinese work believed to date from the Warring States Period (475-221 BC). Zuozhuan is a commentary on the Chunqiu, a history of...

MWE-Aware English Dependency Corpus 2.0

Oct 18, 2017

Kato, Akihiko; Shindo, Hiroyuki; Matsumoto, Yuji, 2017, "MWE-Aware English Dependency Corpus 2.0", https://hdl.handle.net/11272.1/AB2/GKYOY9, Abacus Data Network, V1

MWE-Aware English Dependency Corpus Version 2.0 was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from OntoNotes Release 5.0 (LDC20...

RATS Keyword Spotting

Oct 18, 2017

Graff, David; Ma, Xiaoyi; Strassel, Stephanie; Walker, Kevin; Jones, Karen, 2017, "RATS Keyword Spotting", https://hdl.handle.net/11272.1/AB2/IFVKNB, Abacus Data Network, V1

RATS Keyword Spotting was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,100 hours of Levantine Arabic and Farsi conversational telephone speech with automatic and manual annotation of speech segments, transcripts and keywords generated from...

English Web Treebank Propbank

Oct 18, 2017

O'Gorman, Tim; Conger, Katherine; Palmer, Martha, 2017, "English Web Treebank Propbank", https://hdl.handle.net/11272.1/AB2/Q8LILM, Abacus Data Network, V1

English Web Treebank Propbank, LDC Catalog Number LDC2017T15 and ISBN 1-58563-818-8, was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) and provides predicate-argument structure annotation for English Web Treebank (LDC2012T...

Multi-Language Conversational Telephone Speech 2011 -- South Asian

Oct 15, 2017

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2017, "Multi-Language Conversational Telephone Speech 2011 -- South Asian", https://hdl.handle.net/11272.1/AB2/JPGPJM, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – South Asian was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 118 hours of telephone speech in five distinct language varieties of South Asia (i.e. the Indian sub-continent): Bengali, Hind...

SRI-FRTIV

Sep 14, 2017

Shriberg, Elizabeth; Kathol, Andreas; Graciarena, Martin; Bratt, Harry; Kajarekar, Sachin; Jameel, Huda; Richey, Colleen; Goodman, Fred, 2017, "SRI-FRTIV", https://hdl.handle.net/11272.1/AB2/YONFH9, Abacus Data Network, V1

SRI-FRTIV (Five-way Recorded Toastmaster Intrinsic Variation) was developed by SRI International in 2007-2008 and is comprised of approximately 232 hours of English speech from thirty-four speakers who were members of Toastmaster clubs. Participants were asked to speak at three d...

2015-2016 CoNLL Shared Task

Sep 14, 2017

Xue, Nianwen; Ng, Hwee Tou; Pradhan, Sameer; Rutherford, Attapol T.; Webber, Bonnie; Wang, Chuan; Wang, Hong Min; Prasad, Rashmi, 2017, "2015-2016 CoNLL Shared Task", https://hdl.handle.net/11272.1/AB2/TSNLNO, Abacus Data Network, V1

2015-2016 CoNLL Shared Task, LDC Catalog Number LDC2017T13 and ISBN 1-58563-812-9, contains the Chinese and English training, development and test data for the 2015 and 2016 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation which focused on shal...

GALE Phase 4 Arabic Broadcast Conversation Speech

Aug 15, 2017

Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie, 2017, "GALE Phase 4 Arabic Broadcast Conversation Speech", https://hdl.handle.net/11272.1/AB2/XFDC1A, Abacus Data Network, V1

GALE Phase 4 Arabic Broadcast Conversation Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 75 hours of Arabic broadcast conversation speech collected in 2008 and 2009 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Ph...
