Linguistic Data Consortium

201 to 250 of 399 Results

BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training Sep 2, 2021 Li, Xuansong; Grimes, Stephen; Strassel, Stephanie, 2021, "BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training", https://hdl.handle.net/11272.1/AB2/XACS3U, Abacus Data Network, V1 BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training was developed by the Linguistic Data Consortium (LDC) and consists of 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations. The DA...
IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b Jun 11, 2021 Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Gann, Ketty; Harper, Mary; Kazi, Michael; Le, Hanh; Malyska, Nicolas; Melot, Jennifer; Phillips, Josh; Ray, Jessica; Roomi, Bergul; Rytting, Anton; Strahan, Tania E., 2019, "IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b", https://hdl.handle.net/11272.1/AB2/U1H3H7, Abacus Data Network, V1 IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Amharic conversational and scripted telephone speech collect...
TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 Jun 9, 2021 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2020, "TAC KBP English Event Argument - Training and Evaluation Data 2014-2015", https://hdl.handle.net/11272.1/AB2/TTCGFJ, Abacus Data Network, V1 TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the 2014 TAC KBP English Event Argument Extraction Pilot and Evalua...
Machine Reading Phase 1 IC Training Data Jun 9, 2021 Simpson, Heather; Strassel, Stephanie; Wright, Jonathan; Griffitt, Kira, 2020, "Machine Reading Phase 1 IC Training Data", https://hdl.handle.net/11272.1/AB2/7GZ3YJ, Abacus Data Network, V1 Machine Reading Phase 1 IC Training Data was developed by the Linguistic Data Consortium and contains 248 English source documents and 116 standoff annotation files used in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program. The Ma...
BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training Jun 9, 2021 Li, Xuansong; Grimes, Stephen; Strassel, Stephanie, 2020, "BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training", https://hdl.handle.net/11272.1/AB2/ZZOGLK, Abacus Data Network, V1 BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training was developed by the Linguistic Data Consortium (LDC) and consists of 153,171 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate...
Chinese CogBank Jun 9, 2021 Li, Bin; Yin, Siqi; Xu, Jie; Song, Li; Feng, Minxuan, 2020, "Chinese CogBank", https://hdl.handle.net/11272.1/AB2/XQKHRG, Abacus Data Network, V1 Chinese CogBank is a database of cognitive properties of Chinese words intended for use in metaphor understanding and generation. It consists of 232,497 "word-property" pairs, which are comprised of 83,104 words and 100,195 properties. Each "word-property" t...
BOLT English Treebank - SMS/Chat Jun 9, 2021 Bies, Ann; Mott, Justin; Warner, Colin; Kulick, Seth, 2021, "BOLT English Treebank - SMS/Chat", https://hdl.handle.net/11272.1/AB2/TMECTL, Abacus Data Network, V1 BOLT English Treebank - SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of English SMS and text chat data with part-of-speech and syntactic structure annotation. The DARPA BOLT (Broad Operational Language Translation) program deve...
ATIS - Seven Languages Jun 9, 2021 Mansour, Saab; Haider, Batool, 2021, "ATIS - Seven Languages", https://hdl.handle.net/11272.1/AB2/1TL7TE, Abacus Data Network, V1 ATIS - Seven Languages was developed by Amazon Web Services, Inc. and consists of 5,871 English utterances from ATIS (Air Travel Information Services) corpora, specifically ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26), tran...
LORELEI Akan Representative Language Pack Jun 9, 2021 Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2021, "LORELEI Akan Representative Language Pack", https://hdl.handle.net/11272.1/AB2/78MZYO, Abacus Data Network, V1 LORELEI Akan Representative Language Pack consists of Akan monolingual text, Akan-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program. The LORELEI (Lo...
Nautilus Speaker Characterization Dec 17, 2019 Gallardo, Laura Fernández, 2019, "Nautilus Speaker Characterization", https://hdl.handle.net/11272.1/AB2/JR6VMZ, Abacus Data Network, V1 Nautilus Speaker Characterization was developed at the Technical University of Berlin and is comprised of approximately 155 hours of conversational speech from 300 German speakers aged 18 to 35 years (126 males and 174 females) with no marked dialect or accent, recorded in an aco...
TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 Nov 15, 2019 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017", https://hdl.handle.net/11272.1/AB2/KQWRTL, Abacus Data Network, V1 TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 was developed by the Linguistic Data Consortium (LDC) and contains Chinese, English and Spanish data produced in support of the TAC KBP Cold Start evaluation track conducted from 2012 to 2017. This includes source docum...
DEFT English Committed Belief Annotation Nov 15, 2019 Tracey, Jennifer; Arrigo, Michael; Strassel, Stephanie, 2019, "DEFT English Committed Belief Annotation", https://hdl.handle.net/11272.1/AB2/WY5NZN, Abacus Data Network, V1 DEFT English Committed Belief Annotation was developed by the Linguistic Data Consortium (LDC) and consists of approximately 950,000 words of English discussion forum text annotated for “committed belief,” which marks the level of commitment displayed by the author to the truth o...
CALLFRIEND American English-Non-Southern Dialect Second Edition Nov 15, 2019 Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2019, "CALLFRIEND American English-Non-Southern Dialect Second Edition", https://hdl.handle.net/11272.1/AB2/OBLYDI, Abacus Data Network, V1 CALLFRIEND American English-Non-Southern Dialect Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 26 hours of unscripted telephone conversations between native speakers of non-Southern dialects of American English. This second edi...
Polish Speech Database Oct 15, 2019 Szwelnik, Tomasz; Kawalec, Jacek; Gutowska, Dorota, 2019, "Polish Speech Database", https://hdl.handle.net/11272.1/AB2/GNGZEI, Abacus Data Network, V1 Polish Speech Database was developed by VoiceLab. It consists of 263,424 utterances of Polish speech data from 200 speakers, totaling approximately 280 hours, and corresponding transcripts. Data collection was performed in Poland. Speakers were asked to record themselves for at l...
2016 NIST Speaker Recognition Evaluation Test Set Oct 15, 2019 Greenberg, Craig; Sadjadi, Omid; Kheyrkhah, Timothee; Jones, Karen; Walker, Kevin; Strassel, Stephanie; Graff, David, 2019, "2016 NIST Speaker Recognition Evaluation Test Set", https://hdl.handle.net/11272.1/AB2/WJ2G5L, Abacus Data Network, V1 2016 NIST Speaker Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains approximately 340 hours of short segments of Tagalog, Cantonese, Cebuano and Mandarin telephone speech us...
BOLT English Treebank - Discussion Forum Oct 15, 2019 Bies, Ann; Mott, Justin; Warner, Colin; Kulick, Seth, 2019, "BOLT English Treebank - Discussion Forum", https://hdl.handle.net/11272.1/AB2/9OA0DB, Abacus Data Network, V1 BOLT English Treebank - Discussion Forum was developed by the Linguistic Data Consortium (LDC) and consists of English web discussion forum data with part-of-speech and syntactic structure annotations. The DARPA BOLT (Broad Operational Language Translation) program developed mach...
CALLFRIEND Canadian French Second Edition Sep 16, 2019 Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2019, "CALLFRIEND Canadian French Second Edition", https://hdl.handle.net/11272.1/AB2/PPNHVC, Abacus Data Network, V1 CALLFRIEND Canadian French Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Canadian French. This second edition updates the audio files to wav format, simp...
BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training Sep 16, 2019 Li, Xuansong; Grimes, Stephen; Strassel, Stephanie, 2019, "BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training", https://hdl.handle.net/11272.1/AB2/TJO8RI, Abacus Data Network, V1 BOLT Chinese-English Word Alignment and Tagging – SMS/Chat Training was developed by the Linguistic Data Consortium (LDC) and consists of 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations. The DARPA BOLT (Broad Operational...
Machine Reading Phase 1 NFL Scoring Training Data Sep 16, 2019 Simpson, Heather; Strassel, Stephanie; Wright, Jonathan; Griffitt, Kira, 2019, "Machine Reading Phase 1 NFL Scoring Training Data", https://hdl.handle.net/11272.1/AB2/AZSUUC, Abacus Data Network, V1 Machine Reading Phase 1 NFL Scoring Training Data was developed by the Linguistic Data Consortium (LDC) and contains 110 US NFL (National Football League) scoring source documents and 110 standoff annotation files used in the DARPA (Defense Advanced Research Projects Agency) Mach...
Multi-Language Conversational Telephone Speech 2011 -- East Asian Aug 15, 2019 Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2019, "Multi-Language Conversational Telephone Speech 2011 -- East Asian", https://hdl.handle.net/11272.1/AB2/3MKZES, Abacus Data Network, V1 Multi-Language Conversational Telephone Speech 2011 – East Asian was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 19 hours of telephone speech in two distinct languages of East Asia: Thai and Lao. The data were collected primarily to support...
IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c Aug 15, 2019 Adams, Nikki; Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Gann, Ketty; Harper, Mary; Kaiser-Schatzlein, Alice; Kazi, Michael; Malyska, Nicolas; Melot, Jennifer; Onaka, Akiko; Paget, Shelley; Ray, Jessica; Richardson, Fred; Rytting, Anton; Shen, Sinney, 2019, "IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c", https://hdl.handle.net/11272.1/AB2/39RDNJ, Abacus Data Network, V1 IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 207 hours of Igbo conversational and scripted telephone speech collected in 2014 and 2015 along wit...
TAC KBP Evaluation Source Corpora 2016-2017 Aug 15, 2019 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Evaluation Source Corpora 2016-2017", https://hdl.handle.net/11272.1/AB2/JDNLHX, Abacus Data Network, V1 TAC KBP Evaluation Source Corpora 2016-2017 was developed by the Linguistic Data Consortium (LDC) and contains the 180,003 Chinese, English and Spanish source documents used in support of all TAC KBP evaluation tracks conducted in 2016 and 2017. Text Analysis Conference (TAC) is...
Corpus of Conversational Persian Transcripts Aug 15, 2019 Mohammadi, Ariana Negar, 2019, "Corpus of Conversational Persian Transcripts", https://hdl.handle.net/11272.1/AB2/VPL800, Abacus Data Network, V1 Corpus of Conversational Persian Transcripts consists of transcripts from approximately 20 hours of naturally occurring informal conversations in the Tehrani dialect of Iranian Persian. The corresponding speech is not included in this release. Data This corpus is extracted from 1...
First DIHARD Challenge Development - Eight Sources Jul 19, 2019 Linguistic Data Consortium, 2019, "First DIHARD Challenge Development - Eight Sources", https://hdl.handle.net/11272.1/AB2/XA6BRY, Abacus Data Network, V1 First DIHARD Challenge Development - Eight Sources was developed by the Linguistic Data Consortium (LDC) and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. The First DIHARD Cha...
First DIHARD Challenge Evaluation - Nine Sources Jul 15, 2019 Linguistic Data Consortium, 2019, "First DIHARD Challenge Evaluation - Nine Sources", https://hdl.handle.net/11272.1/AB2/HGTUHY, Abacus Data Network, V1 First DIHARD Challenge Evaluation - Nine Sources was developed by the Linguistic Data Consortium (LDC) and contains approximately 18 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. The First DIHARD Chall...
Phrase Detectives Corpus Version 2 Jul 15, 2019 Chamberlain, Jon; Paun, Silviu; Yu, Juntao; Kruschwitz, Udo; Poesio, Massimo, 2019, "Phrase Detectives Corpus Version 2", https://hdl.handle.net/11272.1/AB2/6GWBA8, Abacus Data Network, V1 Phrase Detectives Corpus Version 2 was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 407,000 tokens across 537 documents anaphorically-annotated by the Phrase Detectives Game, an online interactive...
The DKU-JNU-EMA Electromagnetic Articulography Database Jul 15, 2019 Qin, Xiaoyi; Liu, Xinzhong; Cai, Zexin; Li, Ming, 2019, "The DKU-JNU-EMA Electromagnetic Articulography Database", https://hdl.handle.net/11272.1/AB2/D9PQFH, Abacus Data Network, V1 The DKU-JNU-EMA Electromagnetic Articulography Database was developed by Duke Kunshan University and Jinan University and contains approximately 10 hours of articulography and speech data in Mandarin, Cantonese, Hakka, and Teochew Chinese from two to seven native speakers for eac...
First DIHARD Challenge Evaluation - SEEDLingS Jul 15, 2019 Ryant, Neville; Liberman, Mark; Fiumara, James; Cieri, Christopher, 2019, "First DIHARD Challenge Evaluation - SEEDLingS", https://hdl.handle.net/11272.1/AB2/XH4KVV, Abacus Data Network, V1 First DIHARD Challenge Evaluation - SEEDLingS was developed by Duke University and the Linguistic Data Consortium (LDC) and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge. Th...
DEFT Spanish Committed Belief Annotation Jun 17, 2019 Tracey, Jennifer; Arrigo, Michael; Strassel, Stephanie, 2019, "DEFT Spanish Committed Belief Annotation", https://hdl.handle.net/11272.1/AB2/HWOJGE, Abacus Data Network, V1 DEFT Spanish Committed Belief Annotation was developed by the Linguistic Data Consortium (LDC) and consists of approximately 67,000 tokens of Spanish discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth o...
First DIHARD Challenge Development - SEEDLingS Jun 17, 2019 Ryant, Neville; Liberman, Mark; Fiumara, James; Cieri, Christopher, 2019, "First DIHARD Challenge Development - SEEDLingS", https://hdl.handle.net/11272.1/AB2/KXC76R, Abacus Data Network, V1 First DIHARD Challenge Development - SEEDLingS was developed by Duke University and the Linguistic Data Consortium (LDC) and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge. T...
USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition Jun 17, 2019 Ramabhadran, Bhuvana; Gustman, Samuel; Byrne, William; Hajič, Jan; Oard, Douglas; Olsson, J. Scott; Picheny, Michael; Psutka, Josef, 2019, "USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition", https://hdl.handle.net/11272.1/AB2/SGOMWO, Abacus Data Network, V1 USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition, LDC Catalog Number LDC2019S11 and ISBN 1-58563-889-7, was developed by IBM as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. This edition augments USC-SFI MALACH Interviews...
CIEMPIESS Experimentation May 15, 2019 Mena, Carlos Daniel Hernández, 2019, "CIEMPIESS Experimentation", https://hdl.handle.net/11272.1/AB2/DUUYQV, Abacus Data Network, V1 CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Experimentation was developed by the social service program "Desarrollo de Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) at the National Autonomous Univer...
TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 May 15, 2019 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014", https://hdl.handle.net/11272.1/AB2/ZZMOPP, Abacus Data Network, V1 TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 201...
Multi-Language Conversational Telephone Speech 2011 -- English Group May 15, 2019 Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2019, "Multi-Language Conversational Telephone Speech 2011 -- English Group", https://hdl.handle.net/11272.1/AB2/ACDWDL, Abacus Data Network, V1 Multi-Language Conversational Telephone Speech 2011 – English Group was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 18 hours of telephone speech in two general varieties of English: American and South Asian. The data were collected primaril...
BOLT Egyptian-English Word Alignment -- Discussion Forum Training Apr 15, 2019 Li, Xuansong; Peterson, Katherine; Grimes, Stephen; Strassel, Stephanie, 2019, "BOLT Egyptian-English Word Alignment -- Discussion Forum Training", https://hdl.handle.net/11272.1/AB2/AR1QCS, Abacus Data Network, V1 BOLT Egyptian-English Word Alignment – Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 400,448 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations. The DARPA BOLT (Broad Operat...
Chinese Abstract Meaning Representation 1.0 Apr 15, 2019 Li, Bin; Wen, Yuan; Song, Li; Dai, Rubing; Qu, Weiguang; Xue, Nianwen, 2019, "Chinese Abstract Meaning Representation 1.0", https://hdl.handle.net/11272.1/AB2/TT5KRI, Abacus Data Network, V1 Chinese Abstract Meaning Representation was developed by Brandeis University and Nanjing Normal University and is comprised of semantic representations of a set of Chinese sentences from Chinese Treebank 8.0 (LDC2013T21). Abstract Meaning Representation (AMR) captures "who is doi...
Penn Discourse Treebank Version 3.0 Mar 15, 2019 Prasad, Rashmi; Webber, Bonnie; Lee, Alan; Joshi, Aravind, 2019, "Penn Discourse Treebank Version 3.0", https://hdl.handle.net/11272.1/AB2/SUU9CB, Abacus Data Network, V1 Penn Discourse Treebank (PDTB) Version 3.0 is the third release in the Penn Discourse Treebank project, the goal of which is to annotate the Wall Street Journal (WSJ) section of Treebank-2 (LDC95T7) with discourse relations. Penn Discourse Treebank Version 2 (LDC2008T05) contains...
CALLFRIEND Egyptian Arabic Second Edition Mar 15, 2019 Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2019, "CALLFRIEND Egyptian Arabic Second Edition", https://hdl.handle.net/11272.1/AB2/4LCUFC, Abacus Data Network, V1 CALLFRIEND Egyptian Arabic Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 25 hours of unscripted telephone conversations between native speakers of Egyptian Arabic. This second edition updates the audio files to wav format, simp...
VAST Chinese Speech and Transcripts Mar 15, 2019 Tracey, Jennifer; Strassel, Stephanie; Kuster, Neil, 2019, "VAST Chinese Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/OE8XTX, Abacus Data Network, V1 VAST Chinese Speech and Transcripts was developed by the Linguistic Data Consortium (LDC) for the VAST (Video Annotation for Speech Technologies) project and is comprised of approximately 29 hours of Mandarin Chinese audio extracted from amateur video content harvested from the w...
DEFT Chinese Committed Belief Annotation Feb 15, 2019 Tracey, Jennifer; Arrigo, Michael; Kuster, Neil; Strassel, Stephanie, 2019, "DEFT Chinese Committed Belief Annotation", https://hdl.handle.net/11272.1/AB2/EGZOQ9, Abacus Data Network, V1 DEFT Chinese Committed Belief Annotation was developed by the Linguistic Data Consortium (LDC) and consists of approximately 83,000 tokens of Chinese discussion forum text annotated for “committed belief,” which marks the level of commitment displayed by the author to the truth o...
Multilingual ATIS Feb 15, 2019 Upadhyay, Shyam; Hakkani-Tur, Dilek; Tur, Gokhan; Rastogi, Abhinav, 2019, "Multilingual ATIS", https://hdl.handle.net/11272.1/AB2/AGMWIU, Abacus Data Network, V1 Multilingual ATIS was developed by Google Inc. and consists of 5,871 utterances from ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26) annotated and translated into Hindi and Turkish. The ATIS (Air Travel Information Services) collection was develope...
Multi-Language Conversational Telephone Speech 2011 -- Arabic Group Feb 15, 2019 Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2019, "Multi-Language Conversational Telephone Speech 2011 -- Arabic Group", https://hdl.handle.net/11272.1/AB2/A5UT97, Abacus Data Network, V1 Multi-Language Conversational Telephone Speech 2011 – Arabic Group was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 117 hours of telephone speech in distinct dialects of colloquial Arabic: Iraqi, Levantine and Maghrebi. The data were collect...
SRI Speech-Based Collaborative Learning Corpus Jan 15, 2019 Richey, Colleen; D'Angelo, Cynthia; Alozie, Nonye; Bratt, Harry; Shriberg, Elizabeth, 2019, "SRI Speech-Based Collaborative Learning Corpus", https://hdl.handle.net/11272.1/AB2/YJWBEU, Abacus Data Network, V1 SRI Speech-Based Collaborative Learning Corpus was developed by SRI International and is comprised of approximately 120 hours of English speech from 134 US middle school students working collaboratively. The data set also contains orthographic transcriptions, manual annotation of...
TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 Jan 15, 2019 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015", https://hdl.handle.net/11272.1/AB2/LCPM63, Abacus Data Network, V1 TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2014 and 2015...
BOLT Arabic Discussion Forum Parallel Training Data Jan 15, 2019 Song, Zhiyi; Tracey, Jennifer; Walker, Christopher; Strassel, Stephanie, 2019, "BOLT Arabic Discussion Forum Parallel Training Data", https://hdl.handle.net/11272.1/AB2/CZR6SG, Abacus Data Network, V1 BOLT Arabic Discussion Forum Parallel Training Data was developed by the Linguistic Data Consortium (LDC) and consists of 1,169,599 tokens of Egyptian Arabic discussion forum data collected for the DARPA BOLT program along with their corresponding English translations. The BOLT (...
HUB5 Mandarin Telephone Speech and Transcripts Second Edition Dec 17, 2018 Linguistic Data Consortium, 2018, "HUB5 Mandarin Telephone Speech and Transcripts Second Edition", https://hdl.handle.net/11272.1/AB2/2JAJJE, Abacus Data Network, V1 HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by the Linguistic Data Consortium (LDC) in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC...
TAC Relation Extraction Dataset Dec 15, 2018 Zhong, Victor; Zhang, Yuhao; Chen, Danqi; Angeli, Gabor; Manning, Christopher, 2018, "TAC Relation Extraction Dataset", https://hdl.handle.net/11272.1/AB2/SOYGGB, Abacus Data Network, V1 TAC Relation Extraction Dataset (TACRED) was developed by The Stanford NLP Group and is a large-scale relation extraction dataset with 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014....
IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a Nov 15, 2018 Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Hammond, Simon; Harper, Mary; Kaiser-Schatzlein, Alice; Melot, Jennifer; Paget, Shelley; Ray, Jessica; Rytting, Anton; Shen, Sinney; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2018, "IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a", https://hdl.handle.net/11272.1/AB2/OTDPUV, Abacus Data Network, V1 IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Telugu conversational and scripted telephone speech collected in 2013...
BOLT Egyptian Arabic Treebank - Discussion Forum Nov 15, 2018 Maamouri, Mohamed; Bies, Ann; Kulick, Seth; Krouna, Sondos; Tabassi, Dalila; Ciul, Michael, 2018, "BOLT Egyptian Arabic Treebank - Discussion Forum", https://hdl.handle.net/11272.1/AB2/CAA0JW, Abacus Data Network, V1 BOLT Egyptian Arabic Treebank – Discussion Forum was developed by the Linguistic Data Consortium (LDC) and consists of Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation. The DARPA BOLT (Broad Operational Lang...
Avatar Education Portuguese Nov 15, 2018 Maciel, Alexandre M. A.; Rodrigues, Rodrigo L.; Barbosa, Danilo S., 2018, "Avatar Education Portuguese", https://hdl.handle.net/11272.1/AB2/BSQ4NP, Abacus Data Network, V1 Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant d...

BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training

Sep 2, 2021

Li, Xuansong; Grimes, Stephen; Strassel, Stephanie, 2021, "BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training", https://hdl.handle.net/11272.1/AB2/XACS3U, Abacus Data Network, V1

Abstract Introduction BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training was developed by the Linguistic Data Consortium (LDC) and consists of 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations. The DA...

IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b

Jun 11, 2021

Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Gann, Ketty; Harper, Mary; Kazi, Michael; Le, Hanh; Malyska, Nicolas; Melot, Jennifer; Phillips, Josh; Ray, Jessica; Roomi, Bergul; Rytting, Anton; Strahan, Tania E., 2019, "IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b", https://hdl.handle.net/11272.1/AB2/U1H3H7, Abacus Data Network, V1

Abstract Introduction IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Amharic conversational and scripted telephone speech collect...

TAC KBP English Event Argument - Training and Evaluation Data 2014-2015

Jun 9, 2021

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2020, "TAC KBP English Event Argument - Training and Evaluation Data 2014-2015", https://hdl.handle.net/11272.1/AB2/TTCGFJ, Abacus Data Network, V1

Abstract Introduction TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the 2014 TAC KBP English Event Argument Extraction Pilot and Evalua...

Machine Reading Phase 1 IC Training Data

Jun 9, 2021

Simpson, Heather; Strassel, Stephanie; Wright, Jonathan; Griffitt, Kira, 2020, "Machine Reading Phase 1 IC Training Data", https://hdl.handle.net/11272.1/AB2/7GZ3YJ, Abacus Data Network, V1

Abstract Introduction Machine Reading Phase 1 IC Training Data was developed by the Linguistic Data Consortium and contains 248 English source documents and 116 standoff annotation files used in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program. The Ma...

BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training

Jun 9, 2021

Li, Xuansong; Grimes, Stephen; Strassel, Stephanie, 2020, "BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training", https://hdl.handle.net/11272.1/AB2/ZZOGLK, Abacus Data Network, V1

Abstract Introduction BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training was developed by the Linguistic Data Consortium (LDC) and consists of 153,171 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate...

Chinese CogBank

Jun 9, 2021

Li, Bin; Yin, Siqi; Xu, Jie; Song, Li; Feng, Minxuan, 2020, "Chinese CogBank", https://hdl.handle.net/11272.1/AB2/XQKHRG, Abacus Data Network, V1

Abstract Introduction Chinese CogBank is a database of cognitive properties of Chinese words intended for use in metaphor understanding and generation. It consists of 232,497 "word-property" pairs, which are comprised of 83,104 words and 100,195 properties. Each "word-property" t...

BOLT English Treebank - SMS/Chat

Jun 9, 2021

Bies, Ann; Mott, Justin; Warner, Colin; Kulick, Seth, 2021, "BOLT English Treebank - SMS/Chat", https://hdl.handle.net/11272.1/AB2/TMECTL, Abacus Data Network, V1

Abstract Introduction BOLT English Treebank - SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of English SMS and text chat data with part-of-speech and syntactic structure annotation. The DARPA BOLT (Broad Operational Language Translation) program deve...

ATIS - Seven Languages

Jun 9, 2021

Mansour, Saab; Haider, Batool, 2021, "ATIS - Seven Languages", https://hdl.handle.net/11272.1/AB2/1TL7TE, Abacus Data Network, V1

Abstract Introduction ATIS - Seven Languages was developed by Amazon Web Services, Inc. and consists of 5,871 English utterances from ATIS (Air Travel Information Services) corpora, specifically ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26), tran...

LORELEI Akan Representative Language Pack

Jun 9, 2021

Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2021, "LORELEI Akan Representative Language Pack", https://hdl.handle.net/11272.1/AB2/78MZYO, Abacus Data Network, V1

LORELEI Akan Representative Language Pack consists of Akan monolingual text, Akan-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program. The LORELEI (Lo...

Nautilus Speaker Characterization

Dec 17, 2019

Gallardo, Laura Fernández, 2019, "Nautilus Speaker Characterization", https://hdl.handle.net/11272.1/AB2/JR6VMZ, Abacus Data Network, V1

Nautilus Speaker Characterization was developed at the Technical University of Berlin and is comprised of approximately 155 hours of conversational speech from 300 German speakers aged 18 to 35 years (126 males and 174 females) with no marked dialect or accent, recorded in an aco...

TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017

Nov 15, 2019

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017", https://hdl.handle.net/11272.1/AB2/KQWRTL, Abacus Data Network, V1

TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 was developed by the Linguistic Data Consortium (LDC) and contains Chinese, English and Spanish data produced in support of the TAC KBP Cold Start evaluation track conducted from 2012 to 2017. This includes source docum...

DEFT English Committed Belief Annotation

Nov 15, 2019

Tracey, Jennifer; Arrigo, Michael; Strassel, Stephanie, 2019, "DEFT English Committed Belief Annotation", https://hdl.handle.net/11272.1/AB2/WY5NZN, Abacus Data Network, V1

DEFT English Committed Belief Annotation was developed by the Linguistic Data Consortium (LDC) and consists of approximately 950,000 words of English discussion forum text annotated for “committed belief,” which marks the level of commitment displayed by the author to the truth o...

CALLFRIEND American English-Non-Southern Dialect Second Edition

Nov 15, 2019

Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2019, "CALLFRIEND American English-Non-Southern Dialect Second Edition", https://hdl.handle.net/11272.1/AB2/OBLYDI, Abacus Data Network, V1

CALLFRIEND American English-Non-Southern Dialect Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 26 hours of unscripted telephone conversations between native speakers of non-Southern dialects of American English. This second edi...

Polish Speech Database

Oct 15, 2019

Szwelnik, Tomasz; Kawalec, Jacek; Gutowska, Dorota, 2019, "Polish Speech Database", https://hdl.handle.net/11272.1/AB2/GNGZEI, Abacus Data Network, V1

Polish Speech Database was developed by VoiceLab. It consists of 263,424 utterances of Polish speech data from 200 speakers, totaling approximately 280 hours, and corresponding transcripts. Data collection was performed in Poland. Speakers were asked to record themselves for at l...

2016 NIST Speaker Recognition Evaluation Test Set

Oct 15, 2019

Greenberg, Craig; Sadjadi, Omid; Kheyrkhah, Timothee; Jones, Karen; Walker, Kevin; Strassel, Stephanie; Graff, David, 2019, "2016 NIST Speaker Recognition Evaluation Test Set", https://hdl.handle.net/11272.1/AB2/WJ2G5L, Abacus Data Network, V1

2016 NIST Speaker Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains approximately 340 hours of short segments of Tagalog, Cantonese, Cebuano and Mandarin telephone speech us...

BOLT English Treebank - Discussion Forum

Oct 15, 2019

Bies, Ann; Mott, Justin; Warner, Colin; Kulick, Seth, 2019, "BOLT English Treebank - Discussion Forum", https://hdl.handle.net/11272.1/AB2/9OA0DB, Abacus Data Network, V1

BOLT English Treebank - Discussion Forum was developed by the Linguistic Data Consortium (LDC) and consists of English web discussion forum data with part-of-speech and syntactic structure annotations. The DARPA BOLT (Broad Operational Language Translation) program developed mach...

CALLFRIEND Canadian French Second Edition

Sep 16, 2019

Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2019, "CALLFRIEND Canadian French Second Edition", https://hdl.handle.net/11272.1/AB2/PPNHVC, Abacus Data Network, V1

CALLFRIEND Canadian French Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Canadian French. This second edition updates the audio files to wav format, simp...

BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training

Sep 16, 2019

Li, Xuansong; Grimes, Stephen; Strassel, Stephanie, 2019, "BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training", https://hdl.handle.net/11272.1/AB2/TJO8RI, Abacus Data Network, V1

BOLT Chinese-English Word Alignment and Tagging – SMS/Chat Training was developed by the Linguistic Data Consortium (LDC) and consists of 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations. The DARPA BOLT (Broad Operational...

Machine Reading Phase 1 NFL Scoring Training Data

Sep 16, 2019

Simpson, Heather; Strassel, Stephanie; Wright, Jonathan; Griffitt, Kira, 2019, "Machine Reading Phase 1 NFL Scoring Training Data", https://hdl.handle.net/11272.1/AB2/AZSUUC, Abacus Data Network, V1

Machine Reading Phase 1 NFL Scoring Training Data was developed by the Linguistic Data Consortium (LDC) and contains 110 US NFL (National Football League) scoring source documents and 110 standoff annotation files used in the DARPA (Defense Advanced Research Projects Agency) Mach...

Multi-Language Conversational Telephone Speech 2011 -- East Asian

Aug 15, 2019

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2019, "Multi-Language Conversational Telephone Speech 2011 -- East Asian", https://hdl.handle.net/11272.1/AB2/3MKZES, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – East Asian was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 19 hours of telephone speech in two distinct languages of East Asia: Thai and Lao. The data were collected primarily to support...

IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c

Aug 15, 2019

Adams, Nikki; Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Gann, Ketty; Harper, Mary; Kaiser-Schatzlein, Alice; Kazi, Michael; Malyska, Nicolas; Melot, Jennifer; Onaka, Akiko; Paget, Shelley; Ray, Jessica; Richardson, Fred; Rytting, Anton; Shen, Sinney, 2019, "IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c", https://hdl.handle.net/11272.1/AB2/39RDNJ, Abacus Data Network, V1

IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 207 hours of Igbo conversational and scripted telephone speech collected in 2014 and 2015 along wit...

TAC KBP Evaluation Source Corpora 2016-2017

Aug 15, 2019

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Evaluation Source Corpora 2016-2017", https://hdl.handle.net/11272.1/AB2/JDNLHX, Abacus Data Network, V1

TAC KBP Evaluation Source Corpora 2016-2017 was developed by the Linguistic Data Consortium (LDC) and contains the 180,003 Chinese, English and Spanish source documents used in support of all TAC KBP evaluation tracks conducted in 2016 and 2017. Text Analysis Conference (TAC) is...

Corpus of Conversational Persian Transcripts

Aug 15, 2019

Mohammadi, Ariana Negar, 2019, "Corpus of Conversational Persian Transcripts", https://hdl.handle.net/11272.1/AB2/VPL800, Abacus Data Network, V1

Corpus of Conversational Persian Transcripts consists of transcripts from approximately 20 hours of naturally occurring informal conversations in the Tehrani dialect of Iranian Persian. The corresponding speech is not included in this release. Data This corpus is extracted from 1...

First DIHARD Challenge Development - Eight Sources

Jul 19, 2019

Linguistic Data Consortium, 2019, "First DIHARD Challenge Development - Eight Sources", https://hdl.handle.net/11272.1/AB2/XA6BRY, Abacus Data Network, V1

First DIHARD Challenge Development - Eight Sources was developed by the Linguistic Data Consortium (LDC) and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. The First DIHARD Cha...

First DIHARD Challenge Evaluation - Nine Sources

Jul 15, 2019

Linguistic Data Consortium, 2019, "First DIHARD Challenge Evaluation - Nine Sources", https://hdl.handle.net/11272.1/AB2/HGTUHY, Abacus Data Network, V1

First DIHARD Challenge Evaluation - Nine Sources was developed by the Linguistic Data Consortium (LDC) and contains approximately 18 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. The First DIHARD Chall...

Phrase Detectives Corpus Version 2

Jul 15, 2019

Chamberlain, Jon; Paun, Silviu; Yu, Juntao; Kruschwitz, Udo; Poesio, Massimo, 2019, "Phrase Detectives Corpus Version 2", https://hdl.handle.net/11272.1/AB2/6GWBA8, Abacus Data Network, V1

Phrase Detectives Corpus Version 2 was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 407,000 tokens across 537 documents anaphorically-annotated by the Phrase Detectives Game, an online interactive...

The DKU-JNU-EMA Electromagnetic Articulography Database

Jul 15, 2019

Qin, Xiaoyi; Liu, Xinzhong; Cai, Zexin; Li, Ming, 2019, "The DKU-JNU-EMA Electromagnetic Articulography Database", https://hdl.handle.net/11272.1/AB2/D9PQFH, Abacus Data Network, V1

The DKU-JNU-EMA Electromagnetic Articulography Database was developed by Duke Kunshan University and Jinan University and contains approximately 10 hours of articulography and speech data in Mandarin, Cantonese, Hakka, and Teochew Chinese from two to seven native speakers for eac...

First DIHARD Challenge Evaluation - SEEDLingS

Jul 15, 2019

Ryant, Neville; Liberman, Mark; Fiumara, James; Cieri, Christopher, 2019, "First DIHARD Challenge Evaluation - SEEDLingS", https://hdl.handle.net/11272.1/AB2/XH4KVV, Abacus Data Network, V1

First DIHARD Challenge Evaluation - SEEDLingS was developed by Duke University and the Linguistic Data Consortium (LDC) and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge. Th...

DEFT Spanish Committed Belief Annotation

Jun 17, 2019

Tracey, Jennifer; Arrigo, Michael; Strassel, Stephanie, 2019, "DEFT Spanish Committed Belief Annotation", https://hdl.handle.net/11272.1/AB2/HWOJGE, Abacus Data Network, V1

DEFT Spanish Committed Belief Annotation was developed by the Linguistic Data Consortium (LDC) and consists of approximately 67,000 tokens of Spanish discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth o...

First DIHARD Challenge Development - SEEDLingS

Jun 17, 2019

Ryant, Neville; Liberman, Mark; Fiumara, James; Cieri, Christopher, 2019, "First DIHARD Challenge Development - SEEDLingS", https://hdl.handle.net/11272.1/AB2/KXC76R, Abacus Data Network, V1

First DIHARD Challenge Development - SEEDLingS was developed by Duke University and the Linguistic Data Consortium (LDC) and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge. T...

USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition

Jun 17, 2019

Ramabhadran, Bhuvana; Gustman, Samuel; Byrne, William; Hajič, Jan; Oard, Douglas; Olsson, J. Scott; Picheny, Michael; Psutka, Josef, 2019, "USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition", https://hdl.handle.net/11272.1/AB2/SGOMWO, Abacus Data Network, V1

USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition, LDC Catalog Number LDC2019S11 and ISBN 1-58563-889-7, was developed by IBM as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. This edition augments USC-SFI MALACH Interviews...

CIEMPIESS Experimentation

May 15, 2019

Mena, Carlos Daniel Hernández, 2019, "CIEMPIESS Experimentation", https://hdl.handle.net/11272.1/AB2/DUUYQV, Abacus Data Network, V1

CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Experimentation was developed by the social service program "Desarrollo de Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) at the National Autonomous Univer...

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014

May 15, 2019

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014", https://hdl.handle.net/11272.1/AB2/ZZMOPP, Abacus Data Network, V1

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 201...

Multi-Language Conversational Telephone Speech 2011 -- English Group

May 15, 2019

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2019, "Multi-Language Conversational Telephone Speech 2011 -- English Group", https://hdl.handle.net/11272.1/AB2/ACDWDL, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – English Group was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 18 hours of telephone speech in two general varieties of English: American and South Asian. The data were collected primaril...

BOLT Egyptian-English Word Alignment -- Discussion Forum Training

Apr 15, 2019

Li, Xuansong; Peterson, Katherine; Grimes, Stephen; Strassel, Stephanie, 2019, "BOLT Egyptian-English Word Alignment -- Discussion Forum Training", https://hdl.handle.net/11272.1/AB2/AR1QCS, Abacus Data Network, V1

BOLT Egyptian-English Word Alignment – Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 400,448 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations. The DARPA BOLT (Broad Operat...

Chinese Abstract Meaning Representation 1.0

Apr 15, 2019

Li, Bin; Wen, Yuan; Song, Li; Dai, Rubing; Qu, Weiguang; Xue, Nianwen, 2019, "Chinese Abstract Meaning Representation 1.0", https://hdl.handle.net/11272.1/AB2/TT5KRI, Abacus Data Network, V1

Chinese Abstract Meaning Representation was developed by Brandeis University and Nanjing Normal University and is comprised of semantic representations of a set of Chinese sentences from Chinese Treebank 8.0 (LDC2013T21). Abstract Meaning Representation (AMR) captures "who is doi...

Penn Discourse Treebank Version 3.0

Mar 15, 2019

Prasad, Rashmi; Webber, Bonnie; Lee, Alan; Joshi, Aravind, 2019, "Penn Discourse Treebank Version 3.0", https://hdl.handle.net/11272.1/AB2/SUU9CB, Abacus Data Network, V1

Penn Discourse Treebank (PDTB) Version 3.0 is the third release in the Penn Discourse Treebank project, the goal of which is to annotate the Wall Street Journal (WSJ) section of Treebank-2 (LDC95T7) with discourse relations. Penn Discourse Treebank Version 2 (LDC2008T05) contains...

CALLFRIEND Egyptian Arabic Second Edition

Mar 15, 2019

Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2019, "CALLFRIEND Egyptian Arabic Second Edition", https://hdl.handle.net/11272.1/AB2/4LCUFC, Abacus Data Network, V1

CALLFRIEND Egyptian Arabic Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 25 hours of unscripted telephone conversations between native speakers of Egyptian Arabic. This second edition updates the audio files to wav format, simp...

VAST Chinese Speech and Transcripts

Mar 15, 2019

Tracey, Jennifer; Strassel, Stephanie; Kuster, Neil, 2019, "VAST Chinese Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/OE8XTX, Abacus Data Network, V1

VAST Chinese Speech and Transcripts was developed by the Linguistic Data Consortium (LDC) for the VAST (Video Annotation for Speech Technologies) project and is comprised of approximately 29 hours of Mandarin Chinese audio extracted from amateur video content harvested from the w...

DEFT Chinese Committed Belief Annotation

Feb 15, 2019

Tracey, Jennifer; Arrigo, Michael; Kuster, Neil; Strassel, Stephanie, 2019, "DEFT Chinese Committed Belief Annotation", https://hdl.handle.net/11272.1/AB2/EGZOQ9, Abacus Data Network, V1

DEFT Chinese Committed Belief Annotation was developed by the Linguistic Data Consortium (LDC) and consists of approximately 83,000 tokens of Chinese discussion forum text annotated for “committed belief,” which marks the level of commitment displayed by the author to the truth o...

Multilingual ATIS

Feb 15, 2019

Upadhyay, Shyam; Hakkani-Tur, Dilek; Tur, Gokhan; Rastogi, Abhinav, 2019, "Multilingual ATIS", https://hdl.handle.net/11272.1/AB2/AGMWIU, Abacus Data Network, V1

Multilingual ATIS was developed by Google Inc. and consists of 5,871 utterances from ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26) annotated and translated into Hindi and Turkish. The ATIS (Air Travel Information Services) collection was develope...

Multi-Language Conversational Telephone Speech 2011 -- Arabic Group

Feb 15, 2019

Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie, 2019, "Multi-Language Conversational Telephone Speech 2011 -- Arabic Group", https://hdl.handle.net/11272.1/AB2/A5UT97, Abacus Data Network, V1

Multi-Language Conversational Telephone Speech 2011 – Arabic Group was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 117 hours of telephone speech in distinct dialects of colloquial Arabic: Iraqi, Levantine and Maghrebi. The data were collect...

SRI Speech-Based Collaborative Learning Corpus

Jan 15, 2019

Richey, Colleen; D'Angelo, Cynthia; Alozie, Nonye; Bratt, Harry; Shriberg, Elizabeth, 2019, "SRI Speech-Based Collaborative Learning Corpus", https://hdl.handle.net/11272.1/AB2/YJWBEU, Abacus Data Network, V1

SRI Speech-Based Collaborative Learning Corpus was developed by SRI International and is comprised of approximately 120 hours of English speech from 134 US middle school students working collaboratively. The data set also contains orthographic transcriptions, manual annotation of...

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015

Jan 15, 2019

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2019, "TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015", https://hdl.handle.net/11272.1/AB2/LCPM63, Abacus Data Network, V1

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2014 and 2015...

BOLT Arabic Discussion Forum Parallel Training Data

Jan 15, 2019

Song, Zhiyi; Tracey, Jennifer; Walker, Christopher; Strassel, Stephanie, 2019, "BOLT Arabic Discussion Forum Parallel Training Data", https://hdl.handle.net/11272.1/AB2/CZR6SG, Abacus Data Network, V1

BOLT Arabic Discussion Forum Parallel Training Data was developed by the Linguistic Data Consortium (LDC) and consists of 1,169,599 tokens of Egyptian Arabic discussion forum data collected for the DARPA BOLT program along with their corresponding English translations. The BOLT (...

HUB5 Mandarin Telephone Speech and Transcripts Second Edition

Dec 17, 2018

Linguistic Data Consortium, 2018, "HUB5 Mandarin Telephone Speech and Transcripts Second Edition", https://hdl.handle.net/11272.1/AB2/2JAJJE, Abacus Data Network, V1

HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by the Linguistic Data Consortium (LDC) in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC...

TAC Relation Extraction Dataset

Dec 15, 2018

Zhong, Victor; Zhang, Yuhao; Chen, Danqi; Angeli, Gabor; Manning, Christopher, 2018, "TAC Relation Extraction Dataset", https://hdl.handle.net/11272.1/AB2/SOYGGB, Abacus Data Network, V1

TAC Relation Extraction Dataset (TACRED) was developed by The Stanford NLP Group and is a large-scale relation extraction dataset with 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014....

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a

Nov 15, 2018

Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Hammond, Simon; Harper, Mary; Kaiser-Schatzlein, Alice; Melot, Jennifer; Paget, Shelley; Ray, Jessica; Rytting, Anton; Shen, Sinney; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2018, "IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a", https://hdl.handle.net/11272.1/AB2/OTDPUV, Abacus Data Network, V1

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Telugu conversational and scripted telephone speech collected in 2013...

BOLT Egyptian Arabic Treebank - Discussion Forum

Nov 15, 2018

Maamouri, Mohamed; Bies, Ann; Kulick, Seth; Krouna, Sondos; Tabassi, Dalila; Ciul, Michael, 2018, "BOLT Egyptian Arabic Treebank - Discussion Forum", https://hdl.handle.net/11272.1/AB2/CAA0JW, Abacus Data Network, V1

BOLT Egyptian Arabic Treebank – Discussion Forum was developed by the Linguistic Data Consortium (LDC) and consists of Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation. The DARPA BOLT (Broad Operational Lang...

Avatar Education Portuguese

Nov 15, 2018

Maciel, Alexandre M. A.; Rodrigues, Rodrigo L.; Barbosa, Danilo S., 2018, "Avatar Education Portuguese", https://hdl.handle.net/11272.1/AB2/BSQ4NP, Abacus Data Network, V1

Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant d...
