Linguistic Data Consortium

151 to 200 of 403 Results

Global TIMIT Mandarin Chinese-Guanzhong Dialect Mar 18, 2022 Jiang, Yue; Zhan, Juhong; Han, Hongjian; Xu, Zuohao; Zhou, Haiyan; Yuan, Jiahong; Liberman, Mark, 2022, "Global TIMIT Mandarin Chinese-Guanzhong Dialect", https://hdl.handle.net/11272.1/AB2/MFTAUQ, Abacus Data Network, V1 Abstract Introduction Global TIMIT Mandarin Chinese-Guanzhong Dialect was developed by the Linguistic Data Consortium and Xi'an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Sh...
Global TIMIT Learner Simple English Mar 18, 2022 Ding, Hongwei; Liao, Sishi; Zhan, Yuqing; Feng, Hui; He, Wenchao; Hu, Xiaoyan; Wu, Yu; Yuan, Jiahong; Liberman, Mark, 2022, "Global TIMIT Learner Simple English", https://hdl.handle.net/11272.1/AB2/NMUWWH, Abacus Data Network, V1 Abstract Introduction Global TIMIT Learner Simple English was developed by the Linguistic Data Consortium and Shanghai Jiao Tong University and consists of approximately 12 hours of L1 and L2 English read speech and transcripts. The Global TIMIT project aimed to create a series o...
Global TIMIT Learner Treebank English Mar 18, 2022 Luan, Huan; Wang, Yanhong; Feng, Hui; He, Wenchao; Hu, Xiaoyan; Wu, Yu; Yuan, Jiahong; Liberman, Mark, 2022, "Global TIMIT Learner Treebank English", https://hdl.handle.net/11272.1/AB2/A2ZRDI, Abacus Data Network, V1 Abstract Introduction Global TIMIT Learner Treebank English was developed by the Linguistic Data Consortium and LAIX Inc. and consists of approximately 24 hours of L1 and L2 English read speech and transcripts. The Global TIMIT project aimed to create a series of corpora in a var...
CALLFRIEND American English-Southern Dialect Second Edition Mar 18, 2022 Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2022, "CALLFRIEND American English-Southern Dialect Second Edition", https://hdl.handle.net/11272.1/AB2/O0EZK5, Abacus Data Network, V1 Abstract Introduction CALLFRIEND American English-Southern Dialect Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Southern dialects of American English. This second edition updates the au...
CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition Mar 18, 2022 Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2022, "CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition", https://hdl.handle.net/11272.1/AB2/AT8NRM, Abacus Data Network, V1 Abstract Introduction CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 27 hours of unscripted telephone conversations between native speakers of the Taiwan dialect of Mandarin Chinese. Th...
Chinese Lexical Resources for Gender, Number, Animacy Mar 18, 2022 Chen, Song; Yuan, Jiahong; Ma, Xiaoyi; Strassel, Stephanie, 2022, "Chinese Lexical Resources for Gender, Number, Animacy", https://hdl.handle.net/11272.1/AB2/2CSZDM, Abacus Data Network, V1 Abstract Introduction Chinese Lexical Resources for Gender, Number, Animacy was developed by the Linguistic Data Consortium (LDC) and consists of gender, number, and animacy lexicons produced in support of the DARPA DEFT program. Gender, number and animacy are lexical indicators...
GALE Phase 4 Chinese Broadcast News Transcripts Mar 18, 2022 Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2022, "GALE Phase 4 Chinese Broadcast News Transcripts", https://hdl.handle.net/11272.1/AB2/TVASI8, Abacus Data Network, V1 Abstract Introduction GALE Phase 4 Chinese Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 134 hours of Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technolo...
Columbia Games Corpus Mar 18, 2022 Hirschberg, Julia; Gravano, Agustin; Benus, Stefan; Ward, Gregory; Sneed German, Elisa, 2022, "Columbia Games Corpus", https://hdl.handle.net/11272.1/AB2/TPZYOR, Abacus Data Network, V1 Abstract Introduction Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation along with corresponding orthographic...
Corpus of Law, Academic, and News Mar 18, 2022 Mohammadi, Ariana Negar, 2022, "Corpus of Law, Academic, and News", https://hdl.handle.net/11272.1/AB2/VMWYC0, Abacus Data Network, V1 Abstract Introduction Corpus of Law, Academic, and News consists of 400 Persian documents divided into three genres: legal, academic, and news. The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constituti...
Penn Parsed Corpora of Historical English Mar 18, 2022 Kroch, Anthony, 2022, "Penn Parsed Corpora of Historical English", https://hdl.handle.net/11272.1/AB2/NWMKHI, Abacus Data Network, V1 Abstract Introduction Penn Parsed Corpora of Historical English was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (19...
Global TIMIT Mandarin Chinese-Guanzhong Dialect Mar 18, 2022 Jiang, Yue; Zhan, Juhong; Han, Hongjian; Xu, Zuohao; Zhou, Haiyan; Yuan, Jiahong; Liberman, Mark, 2022, "Global TIMIT Mandarin Chinese-Guanzhong Dialect", https://hdl.handle.net/11272.1/AB2/FF5DX5, Abacus Data Network, V1 Abstract Introduction Global TIMIT Mandarin Chinese-Guanzhong Dialect was developed by the Linguistic Data Consortium and Xi'an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Sh...
IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b Mar 18, 2022 Bills, Aric; Conners, Thomas; David, Anne; Cruz, Luanne Dela; Dubinski, Eyal; Fiscus, Jonathan G.; Gann, Ketty; Harper, Mary; Kazi, Michael; Le, Hanh; Malyska, Nicolas; Melot, Jennifer; Ray, Jessica; Richardson, Fred; Rytting, Anton; Zwanenburg, Jacqui, 2022, "IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b", https://hdl.handle.net/11272.1/AB2/BBDKDK, Abacus Data Network, V1 Abstract Introduction IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Javanese conversational and scripted telephone speech colle...
IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c Mar 18, 2022 Andresen, Lucy; Bills, Aric; Brugman, Claudia; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Gann, Ketty; Harper, Mary; Kazi, Michael; Le, Hanh; Malyska, Nicolas; Maurillo, Arlene; Melot, Jennifer; Paget, Shelley; Prebble, Jane Elizabeth; Ray, Jessica; Richardson, Fred; Rytting, Anton; Shen, Sinney, 2022, "IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c", https://hdl.handle.net/11272.1/AB2/C2XGCW, Abacus Data Network, V1 Abstract Introduction IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 198 hours of Guarani conversational and scripted telephone speech collect...
IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b Mar 18, 2022 Benowitz, Daniel; Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Hefright, Brook; Le, Hanh; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Shen, Sinney; Smith, Rosanna; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2022, "IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b", https://hdl.handle.net/11272.1/AB2/5MR7Z2, Abacus Data Network, V1 Abstract Introduction IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 210 hours of Lithuanian conversational and scripted telephone speech c...
2007 CoNLL Shared Task - Arabic & English Mar 18, 2022 Linguistic Data Consortium, 2022, "2007 CoNLL Shared Task - Arabic & English", https://hdl.handle.net/11272.1/AB2/X7AEOJ, Abacus Data Network, V1 Abstract Introduction 2007 CoNLL Shared Task - Arabic & English consists of dependency treebanks in two languages used as part of the CoNLL 2007 shared task on multi-lingual dependency parsing and domain adaptation. The languages covered in this release are Arabic and English. LD...
2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish Mar 18, 2022 University of the Basque Country; Technical University of Catalunya; Charles University; Middle East Technical University; Sabanci University, 2022, "2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish", https://hdl.handle.net/11272.1/AB2/R8ZR6Q, Abacus Data Network, V1 Abstract Introduction 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish consists of dependency treebanks in four languages used as part of the CoNLL 2007 shared task on multi-lingual dependency parsing and domain adaptation. The languages covered in this release are: Basq...
IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e Mar 18, 2022 Bills, Aric; Conners, Thomas; Corris, Miriam; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Heighway, Melanie; Kozlov, Kirill; Malyska, Nicolas; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2022, "IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e", https://hdl.handle.net/11272.1/AB2/CTDWII, Abacus Data Network, V1 Abstract Introduction IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 200 hours of Tok Pisin conversational and scripted telephone speech col...
IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a Mar 18, 2022 Bills, Aric; Conners, Thomas; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Heighway, Melanie; Lin, Willa; Melot, Jennifer; Paget, Shelley; Ray, Jessica; Roomi, Bergul; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne; Zwanenburg, Jacqui, 2022, "IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a", https://hdl.handle.net/11272.1/AB2/HRUQMM, Abacus Data Network, V1 Abstract Introduction IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kurmanji Kurdish conversational and scripted teleph...
IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e Mar 18, 2022 Adams, Nikki; Bills, Aric; Conners, Thomas; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Lin, Willa; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne; Wong, Jamie, 2022, "IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e", https://hdl.handle.net/11272.1/AB2/SJQNLO, Abacus Data Network, V1 Abstract Introduction IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 211 hours of Zulu conversational and scripted telephone speech collected in...
IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b Mar 18, 2022 Andrus, Tony; Bills, Aric; Conners, Thomas; Crabb, Erin Smith; Dubinski, Eyal; Fiscus, Jonathan G.; Gillies, Breanna; Harper, Mary; Hazen, T. J.; Hefright, Brook; Jarrett, Amy; Le, Hanh; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2022, "IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b", https://hdl.handle.net/11272.1/AB2/O4K5VU, Abacus Data Network, V1 Abstract Introduction IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Haitian Creole conversational and scripted telephone...
Audiovisual Database of Spoken American English Mar 18, 2022 Richie, Carolyn; Warburton, Sarah; Carter, Megan, 2022, "Audiovisual Database of Spoken American English", https://hdl.handle.net/11272.1/AB2/8KIBXB, Abacus Data Network, V1 Abstract Introduction The Audiovisual Database of Spoken American English, Linguistic Data Consortium (LDC) catalog number LDC2009V01 and ISBN 1-58563-496-4, was developed at Butler University, Indianapolis, IN in 2007 for use by a variety of researchers to evaluate speech prod...
HKUST Mandarin Telephone Transcript Data, Part 1 Mar 18, 2022 Fung, Pascale; Huang, Shudong; Graff, David, 2022, "HKUST Mandarin Telephone Transcript Data, Part 1", https://hdl.handle.net/11272.1/AB2/UOHG3I, Abacus Data Network, V1 Abstract Introduction HKUST Mandarin Telephone Transcript Data Part 1 was developed by Hong Kong University of Science and Technology (HKUST) and contains transcripts for 897 telephone conversations in Mandarin Chinese. In 2004 HKUST was contracted to collect and transcribe 200 h...
HKUST Mandarin Telephone Speech, Part 1 Mar 18, 2022 Fung, Pascale; Huang, Shudong; Graff, David, 2022, "HKUST Mandarin Telephone Speech, Part 1", https://hdl.handle.net/11272.1/AB2/TKM8OR, Abacus Data Network, V1 Abstract Introduction HKUST Mandarin Telephone Speech, Part 1 was developed by Hong Kong University of Science and Technology (HKUST) and contains approximately 149 hours of conversational telephone speech (CTS) in Mandarin. Given that Standard Mandarin is not the native dialect...
LORELEI Kinyarwanda Incident Language Pack Feb 7, 2022 Tracey, Jennifer; Graff, David; Strassel, Stephanie; Arrigo, Michael; Wright, Jonathan; Bies, Ann, 2022, "LORELEI Kinyarwanda Incident Language Pack", https://hdl.handle.net/11272.1/AB2/P1OIX0, Abacus Data Network, V1 Abstract Introduction LORELEI Kinyarwanda Incident Language Pack was developed by the Linguistic Data Consortium and is comprised of approximately 11.9 million words of Kinyarwanda monolingual text, 35,000 words of English monolingual text, 3.4 million words of parallel and compa...
2017 NIST OpenSAT Pilot - SSSF Feb 7, 2022 Byers, Frederick, 2022, "2017 NIST OpenSAT Pilot - SSSF", https://hdl.handle.net/11272.1/AB2/PTU0AQ, Abacus Data Network, V1 Abstract Introduction 2017 NIST OpenSAT Pilot - SSSF was developed by NIST (National Institute of Standards and Technology) and contains approximately one hour of operational speech data, transcripts and annotation files used in the speech activity detection, automatic speech rec...
BOLT English Translation Treebank - Chinese SMS/Chat Feb 7, 2022 Bies, Ann; Mott, Justin; Warner, Colin; Kulick, Seth, 2022, "BOLT English Translation Treebank - Chinese SMS/Chat", https://hdl.handle.net/11272.1/AB2/JBOOKU, Abacus Data Network, V1 Abstract Introduction BOLT English Translation Treebank - Chinese SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of SMS and chat text data translated from Chinese to English and annotated for part-of-speech and syntactic structure. The DARPA BOLT (Bro...
GALE Phase 3 Arabic Broadcast News Transcripts Part 2 Jan 24, 2022 Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2017, "GALE Phase 3 Arabic Broadcast News Transcripts Part 2", https://hdl.handle.net/11272.1/AB2/VM5MOD, Abacus Data Network, V2 Introduction GALE Phase 3 Arabic Broadcast News Transcripts Part 2 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 128 hours of Arabic broadcast news speech collected in 2007 by the Linguistic Data Consortium (LDC), MediaNet, Tun...
BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech Dec 2, 2021 Palmer, Martha; Hwang, Jena D.; Mansouri, Aous; Bonial, Claire; O'Gorman, Tim; Gung, James, 2021, "BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech", https://hdl.handle.net/11272.1/AB2/YS81IR, Abacus Data Network, V1 Abstract Introduction BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) and consists of propbank annotation on Egyp...
Second DIHARD Challenge Development - Eleven Sources Dec 2, 2021 Ryant, Neville; Liberman, Mark; Fiumara, James; Cieri, Christopher, 2021, "Second DIHARD Challenge Development - Eleven Sources", https://hdl.handle.net/11272.1/AB2/CBFPZO, Abacus Data Network, V1 Abstract Introduction Second DIHARD Challenge Development - Eleven Sources was developed by LDC and contains approximately 22 hours of English and Chinese speech data along with corresponding annotations used in support of the Second DIHARD Challenge. The DIHARD Challenges are a...
BOLT Egyptian Arabic Treebank - SMS/Chat Nov 18, 2021 Maamouri, Mohamed; Bies, Ann; Kulick, Seth; Krouna, Sondos; Tabassi, Dalila; Ciul, Michael, 2021, "BOLT Egyptian Arabic Treebank - SMS/Chat", https://hdl.handle.net/11272.1/AB2/1DSLOX, Abacus Data Network, V1 Abstract Introduction BOLT Egyptian Arabic Treebank - SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of Egyptian Arabic SMS/Chat data with part-of-speech annotation, morphology, and syntactic tree annotation. The DARPA BOLT (Broad Operational Language...
UCLA Speaker Variability Database Nov 18, 2021 Keating, Patricia; Kreiman, Jody; Alwan, Abeer; Chong, Adam; Lee, Yoonjeong, 2021, "UCLA Speaker Variability Database", https://hdl.handle.net/11272.1/AB2/CIIVXT, Abacus Data Network, V1 Abstract Introduction UCLA Speaker Variability Database was developed by UCLA Speech Processing and Auditory Perception Laboratory and is comprised of approximately 34 hours of English speech and orthographic transcripts. This corpus was designed to sample variability in speaking...
Switchboard-1 Release 2 Oct 26, 2021 Godfrey, John J.; Holliman, Edward, 2021, "Switchboard-1 Release 2", https://hdl.handle.net/11272.1/AB2/VTPSCK, Abacus Data Network, V1 Abstract Introduction The Switchboard-1 Telephone Speech Corpus (LDC97S62) consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by...
Wikipedia Spanish Speech and Transcripts Oct 14, 2021 Mena, Carlos Daniel Hernández; Ruiz, Iván Vladimir Meza, 2021, "Wikipedia Spanish Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/L05NFF, Abacus Data Network, V1 Abstract Introduction Wikipedia Spanish Speech and Transcripts consists of approximately 25 hours of Spanish read speech and transcripts. The read text was taken from the Spanish version of WikiProject Spoken Wikipedia, referred to as Wikipedia Grabada. The transcripts were devel...
BOLT Egyptian Arabic SMS/Chat Parallel Training Data Oct 14, 2021 Tracey, Jennifer; Delgado, Dana; Chen, Song; Strassel, Stephanie, 2021, "BOLT Egyptian Arabic SMS/Chat Parallel Training Data", https://hdl.handle.net/11272.1/AB2/WXML9A, Abacus Data Network, V1 Abstract Introduction BOLT Egyptian Arabic SMS/Chat Parallel Training Data was developed by the Linguistic Data Consortium (LDC) and consists of approximately 723,000 tokens of Egyptian Arabic SMS/Chat data collected for the DARPA BOLT program along with their corresponding Engli...
Classical Arabic Dictionary Oct 14, 2021 Alsheddi, Abeer, 2021, "Classical Arabic Dictionary", https://hdl.handle.net/11272.1/AB2/FQ7PIS, Abacus Data Network, V1 Abstract Introduction Classical Arabic Dictionary consists of approximately one hundred million words of Arabic collected from texts dating between 431 and 1104 CE, principally books and essays, along with word occurrences, source documents and related metadata. Data The dictiona...
IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b Oct 1, 2021 Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Gann, Ketty; Harper, Mary; Kazi, Michael; Lim, Lynn-Li; Malyska, Nicolas; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Shen, Sinney; Smith, Rosanna, 2021, "IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b", https://hdl.handle.net/11272.1/AB2/IFBL6A, Abacus Data Network, V1 Abstract Introduction IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Halh Mongolian conversational and scripted telephone speec...
IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d Sep 29, 2021 Andresen, Jess; Bills, Aric; Conners, Thomas; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Kozlov, Kirill; Malyska, Nicolas; Melot, Jennifer; Morrison, Michelle; Phillips, Josh; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne; Wong, Jamie, 2021, "IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d", https://hdl.handle.net/11272.1/AB2/TNSSDU, Abacus Data Network, V2 Abstract Introduction IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 350 hours of Swahili conversational and scripted telephone speech collect...
LORELEI Oromo Incident Language Pack Sep 29, 2021 Tracey, Jennifer; Graff, David; Strassel, Stephanie; Arrigo, Michael; Wright, Jonathan; Bies, Ann, 2021, "LORELEI Oromo Incident Language Pack", https://hdl.handle.net/11272.1/AB2/EH7NXF, Abacus Data Network, V1 Abstract Introduction LORELEI Oromo Incident Language Pack was developed by the Linguistic Data Consortium and is comprised of approximately 3.9 million words of Oromo monolingual text, 25,000 words of English monolingual text, 135,000 words of parallel and comparable Oromo-Engli...
Database of Word Level Statistics - Mandarin Sep 3, 2021 Neergaard, Karl David; Xu, Hongzhi; Huang, Chu-Ren, 2021, "Database of Word Level Statistics - Mandarin", https://hdl.handle.net/11272.1/AB2/VJDPA0, Abacus Data Network, V1 Abstract Introduction Database of Word Level Statistics - Mandarin was developed by The Hong Kong Polytechnic University. It provides lexical characteristics of a descriptive and statistical nature for words and nonwords of Mandarin Chinese. It is designed for researchers particu...
Abstract Meaning Representation (AMR) Annotation Release 3.0 Sep 3, 2021 Knight, Kevin; Badarau, Bianca; Baranescu, Laura; Bonial, Claire; Bardocz, Madalina; Griffitt, Kira; Hermjakob, Ulf; Marcu, Daniel; Palmer, Martha; O'Gorman, Tim; Schneider, Nathan, 2021, "Abstract Meaning Representation (AMR) Annotation Release 3.0", https://hdl.handle.net/11272.1/AB2/82CVJF, Abacus Data Network, V1 Abstract Introduction Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Ins...
Penn Discourse Treebank Version 2.0 - German Translation Sep 3, 2021 Sluyter-Gaethje, Henny; Bourgonje, Peter; Stede, Manfred, 2021, "Penn Discourse Treebank Version 2.0 - German Translation", https://hdl.handle.net/11272.1/AB2/1AXWBN, Abacus Data Network, V1 Abstract Introduction Penn Discourse Treebank Version 2.0 - German Translation was developed at the University of Potsdam's Applied Computational Linguistics group and consists of approximately one million tokens derived from Penn Discourse Treebank Version 2.0 (LDC2008T05). This...
TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010 Sep 3, 2021 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2021, "TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010", https://hdl.handle.net/11272.1/AB2/VAZOSD, Abacus Data Network, V1 Abstract Introduction TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010 was developed by the Linguistic Data Consortium and contains training and evaluation data produced in support of the 2010 TAC KBP Surprise Slot Filling track, the only y...
TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014 Sep 3, 2021 Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2021, "TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014", https://hdl.handle.net/11272.1/AB2/MRZALN, Abacus Data Network, V1 Abstract Introduction TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014 was developed by the Linguistic Data Consortium and contains training and evaluation data produced in support of the 2013 and 2014 TAC KBP Sentiment Slot Filling tracks....
X-SRL: Parallel Cross-lingual Semantic Role Labeling Sep 3, 2021 Daza, Angel; Frank, Anette, 2021, "X-SRL: Parallel Cross-lingual Semantic Role Labeling", https://hdl.handle.net/11272.1/AB2/DNOJP9, Abacus Data Network, V1 Abstract Introduction X-SRL: Parallel Cross-lingual Semantic Role Labeling was developed by Heidelberg University, Department of Computational Linguistics and the Leibniz Institute for the German Language (IDS). It consists of approximately three million words of German, French a...
ESPADA Sep 3, 2021 Arase, Yuki; Tsujii, Junichi, 2021, "ESPADA", https://hdl.handle.net/11272.1/AB2/ANSK9Z, Abacus Data Network, V1 Abstract Introduction ESPADA (Extended Syntactic Phrase Alignment DAtaset) consists of annotated parse trees and alignment on English sentential paraphrases extracted from machine translation evaluation corpora. It extends SPADE (LDC2018T09) by adding new annotated data for train...
BOLT Chinese SMS/Chat Parallel Training Data Sep 3, 2021 Tracey, Jennifer; Delgado, Dana; Chen, Song; Strassel, Stephanie, 2021, "BOLT Chinese SMS/Chat Parallel Training Data", https://hdl.handle.net/11272.1/AB2/O3JTA9, Abacus Data Network, V1 Abstract Introduction BOLT Chinese SMS/Chat Parallel Training Data was developed by the Linguistic Data Consortium and consists of approximately 1.8 million tokens of Chinese SMS/Chat data collected for the DARPA BOLT program along with their corresponding English translations. Th...
Chinese Abstract Meaning Representation 2.0 Sep 3, 2021 Li, Bin; Xiao, Liming; Liu, Yihuan; Wen, Yuan; Song, Li; Chun, Jayeol; Feng, Minxuan; Zhou, Junsheng; Qu, Weiguang; Xue, Nianwen, 2021, "Chinese Abstract Meaning Representation 2.0", https://hdl.handle.net/11272.1/AB2/LVQEZJ, Abacus Data Network, V1 Abstract Introduction Chinese Abstract Meaning Representation (CAMR) 2.0 was developed by Brandeis University and Nanjing Normal University and is comprised of semantic representations of a set of approximately 20,000 Chinese sentences from Chinese Treebank (CTB) 8.0 (LDC2013T21)...
BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech Sep 3, 2021 Agarwal, Nitin; Francini, Michelle; Kappler, Michelle; Micciulla, Linnea; Pradhan, Sameer; Ramshaw, Lance, 2021, "BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech", https://hdl.handle.net/11272.1/AB2/DXWM3B, Abacus Data Network, V1 Abstract Introduction BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on Egyptian Arabic discussion forum (DF), SMS/Chat and conversational tele...
LibriVox Spanish Sep 2, 2021 Mena, Carlos Daniel Hernández, 2021, "LibriVox Spanish", https://hdl.handle.net/11272.1/AB2/AHBO1C, Abacus Data Network, V1 Abstract Introduction LibriVox Spanish consists of approximately 73 hours of Spanish read speech and transcripts. The audio data was taken from Spanish audiobooks developed by LibriVox, a non-profit project that creates audiobooks from public domain works. The transcripts were de...
Global TIMIT Mandarin Chinese Sep 2, 2021 Ding, Hongwei; Liao, Sishi; Zhan, Yuqing; Yuan, Jiahong; Liberman, Mark, 2021, "Global TIMIT Mandarin Chinese", https://hdl.handle.net/11272.1/AB2/2CCXH8, Abacus Data Network, V1 Abstract Introduction Global TIMIT Mandarin Chinese was developed by the Linguistic Data Consortium and Shanghai Jiao Tong University and consists of approximately five hours of read speech and transcripts in Mandarin Chinese. The Global TIMIT project aimed to create a series of...

Global TIMIT Mandarin Chinese-Guanzhong Dialect

Mar 18, 2022

Jiang, Yue; Zhan, Juhong; Han, Hongjian; Xu, Zuohao; Zhou, Haiyan; Yuan, Jiahong; Liberman, Mark, 2022, "Global TIMIT Mandarin Chinese-Guanzhong Dialect", https://hdl.handle.net/11272.1/AB2/MFTAUQ, Abacus Data Network, V1

Abstract Introduction Global TIMIT Mandarin Chinese-Guanzhong Dialect was developed by the Linguistic Data Consortium and Xi'an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Sh...

Global TIMIT Learner Simple English

Mar 18, 2022

Ding, Hongwei; Liao, Sishi; Zhan, Yuqing; Feng, Hui; He, Wenchao; Hu, Xiaoyan; Wu, Yu; Yuan, Jiahong; Liberman, Mark, 2022, "Global TIMIT Learner Simple English", https://hdl.handle.net/11272.1/AB2/NMUWWH, Abacus Data Network, V1

Abstract Introduction Global TIMIT Learner Simple English was developed by the Linguistic Data Consortium and Shanghai Jiao Tong University and consists of approximately 12 hours of L1 and L2 English read speech and transcripts. The Global TIMIT project aimed to create a series o...

Global TIMIT Learner Treebank English

Mar 18, 2022

Luan, Huan; Wang, Yanhong; Feng, Hui; He, Wenchao; Hu, Xiaoyan; Wu, Yu; Yuan, Jiahong; Liberman, Mark, 2022, "Global TIMIT Learner Treebank English", https://hdl.handle.net/11272.1/AB2/A2ZRDI, Abacus Data Network, V1

Abstract Introduction Global TIMIT Learner Treebank English was developed by the Linguistic Data Consortium and LAIX Inc. and consists of approximately 24 hours of L1 and L2 English read speech and transcripts. The Global TIMIT project aimed to create a series of corpora in a var...

CALLFRIEND American English-Southern Dialect Second Edition

Mar 18, 2022

Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2022, "CALLFRIEND American English-Southern Dialect Second Edition", https://hdl.handle.net/11272.1/AB2/O0EZK5, Abacus Data Network, V1

Abstract Introduction CALLFRIEND American English-Southern Dialect Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Southern dialects of American English. This second edition updates the au...

CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition

Mar 18, 2022

Canavan, Alexandra; Zipperlen, George; Bartlett, John, 2022, "CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition", https://hdl.handle.net/11272.1/AB2/AT8NRM, Abacus Data Network, V1

Abstract Introduction CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 27 hours of unscripted telephone conversations between native speakers of the Taiwan dialect of Mandarin Chinese. Th...

Chinese Lexical Resources for Gender, Number, Animacy

Mar 18, 2022

Chen, Song; Yuan, Jiahong; Ma, Xiaoyi; Strassel, Stephanie, 2022, "Chinese Lexical Resources for Gender, Number, Animacy", https://hdl.handle.net/11272.1/AB2/2CSZDM, Abacus Data Network, V1

Abstract Introduction Chinese Lexical Resources for Gender, Number, Animacy was developed by the Linguistic Data Consortium (LDC) and consists of gender, number, and animacy lexicons produced in support of the DARPA DEFT program. Gender, number and animacy are lexical indicators...

GALE Phase 4 Chinese Broadcast News Transcripts

Mar 18, 2022

Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2022, "GALE Phase 4 Chinese Broadcast News Transcripts", https://hdl.handle.net/11272.1/AB2/TVASI8, Abacus Data Network, V1

Abstract Introduction GALE Phase 4 Chinese Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 134 hours of Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technolo...

Columbia Games Corpus

Mar 18, 2022

Hirschberg, Julia; Gravano, Agustin; Benus, Stefan; Ward, Gregory; Sneed German, Elisa, 2022, "Columbia Games Corpus", https://hdl.handle.net/11272.1/AB2/TPZYOR, Abacus Data Network, V1

Abstract Introduction Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation along with corresponding orthographic...

Corpus of Law, Academic, and News

Mar 18, 2022

Mohammadi, Ariana Negar, 2022, "Corpus of Law, Academic, and News", https://hdl.handle.net/11272.1/AB2/VMWYC0, Abacus Data Network, V1

Abstract Introduction Corpus of Law, Academic, and News consists of 400 Persian documents divided into three genres: legal, academic, and news. The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constituti...

Penn Parsed Corpora of Historical English

Mar 18, 2022

Kroch, Anthony, 2022, "Penn Parsed Corpora of Historical English", https://hdl.handle.net/11272.1/AB2/NWMKHI, Abacus Data Network, V1

Abstract Introduction Penn Parsed Corpora of Historical English was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (19...

IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b

Mar 18, 2022

Bills, Aric; Conners, Thomas; David, Anne; Cruz, Luanne Dela; Dubinski, Eyal; Fiscus, Jonathan G.; Gann, Ketty; Harper, Mary; Kazi, Michael; Le, Hanh; Malyska, Nicolas; Melot, Jennifer; Ray, Jessica; Richardson, Fred; Rytting, Anton; Zwanenburg, Jacqui, 2022, "IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b", https://hdl.handle.net/11272.1/AB2/BBDKDK, Abacus Data Network, V1

Abstract Introduction IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Javanese conversational and scripted telephone speech colle...

IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c

Mar 18, 2022

Andresen, Lucy; Bills, Aric; Brugman, Claudia; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Gann, Ketty; Harper, Mary; Kazi, Michael; Le, Hanh; Malyska, Nicolas; Maurillo, Arlene; Melot, Jennifer; Paget, Shelley; Prebble, Jane Elizabeth; Ray, Jessica; Richardson, Fred; Rytting, Anton; Shen, Sinney, 2022, "IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c", https://hdl.handle.net/11272.1/AB2/C2XGCW, Abacus Data Network, V1

Abstract Introduction IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 198 hours of Guarani conversational and scripted telephone speech collect...

IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b

Mar 18, 2022

Benowitz, Daniel; Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Hefright, Brook; Le, Hanh; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Shen, Sinney; Smith, Rosanna; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2022, "IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b", https://hdl.handle.net/11272.1/AB2/5MR7Z2, Abacus Data Network, V1

Abstract Introduction IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 210 hours of Lithuanian conversational and scripted telephone speech c...

2007 CoNLL Shared Task - Arabic & English

Mar 18, 2022

Linguistic Data Consortium, 2022, "2007 CoNLL Shared Task - Arabic & English", https://hdl.handle.net/11272.1/AB2/X7AEOJ, Abacus Data Network, V1

Abstract Introduction 2007 CoNLL Shared Task - Arabic & English consists of dependency treebanks in two languages used as part of the CoNLL 2007 shared task on multi-lingual dependency parsing and domain adaptation. The languages covered in this release are Arabic and English. LD...

2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish

Mar 18, 2022

University of the Basque Country; Technical University of Catalunya; Charles University; Middle East Technical University; Sabanci University, 2022, "2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish", https://hdl.handle.net/11272.1/AB2/R8ZR6Q, Abacus Data Network, V1

Abstract Introduction 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish consists of dependency treebanks in four languages used as part of the CoNLL 2007 shared task on multi-lingual dependency parsing and domain adaptation. The languages covered in this release are: Basq...

IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e

Mar 18, 2022

Bills, Aric; Conners, Thomas; Corris, Miriam; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Heighway, Melanie; Kozlov, Kirill; Malyska, Nicolas; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2022, "IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e", https://hdl.handle.net/11272.1/AB2/CTDWII, Abacus Data Network, V1

Abstract Introduction IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 200 hours of Tok Pisin conversational and scripted telephone speech col...

IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a

Mar 18, 2022

Bills, Aric; Conners, Thomas; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Heighway, Melanie; Lin, Willa; Melot, Jennifer; Paget, Shelley; Ray, Jessica; Roomi, Bergul; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne; Zwanenburg, Jacqui, 2022, "IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a", https://hdl.handle.net/11272.1/AB2/HRUQMM, Abacus Data Network, V1

Abstract Introduction IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kurmanji Kurdish conversational and scripted teleph...

IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e

Mar 18, 2022

Adams, Nikki; Bills, Aric; Conners, Thomas; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Lin, Willa; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne; Wong, Jamie, 2022, "IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e", https://hdl.handle.net/11272.1/AB2/SJQNLO, Abacus Data Network, V1

Abstract Introduction IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 211 hours of Zulu conversational and scripted telephone speech collected in...

IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b

Mar 18, 2022

Andrus, Tony; Bills, Aric; Conners, Thomas; Crabb, Erin Smith; Dubinski, Eyal; Fiscus, Jonathan G.; Gillies, Breanna; Harper, Mary; Hazen, T. J.; Hefright, Brook; Jarrett, Amy; Le, Hanh; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne, 2022, "IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b", https://hdl.handle.net/11272.1/AB2/O4K5VU, Abacus Data Network, V1

Abstract Introduction IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Haitian Creole conversational and scripted telephone...

Audiovisual Database of Spoken American English

Mar 18, 2022

Richie, Carolyn; Warburton, Sarah; Carter, Megan, 2022, "Audiovisual Database of Spoken American English", https://hdl.handle.net/11272.1/AB2/8KIBXB, Abacus Data Network, V1

Abstract Introduction The Audiovisual Database of Spoken American English, Linguistic Data Consortium (LDC) catalog number LDC2009V01 and ISBN 1-58563-496-4, was developed at Butler University, Indianapolis, IN in 2007 for use by a variety of researchers to evaluate speech prod...

HKUST Mandarin Telephone Transcript Data, Part 1

Mar 18, 2022

Fung, Pascale; Huang, Shudong; Graff, David, 2022, "HKUST Mandarin Telephone Transcript Data, Part 1", https://hdl.handle.net/11272.1/AB2/UOHG3I, Abacus Data Network, V1

Abstract Introduction HKUST Mandarin Telephone Transcript Data Part 1 was developed by Hong Kong University of Science and Technology (HKUST) and contains transcripts for 897 telephone conversations in Mandarin Chinese. In 2004 HKUST was contracted to collect and transcribe 200 h...

HKUST Mandarin Telephone Speech, Part 1

Mar 18, 2022

Fung, Pascale; Huang, Shudong; Graff, David, 2022, "HKUST Mandarin Telephone Speech, Part 1", https://hdl.handle.net/11272.1/AB2/TKM8OR, Abacus Data Network, V1

Abstract Introduction HKUST Mandarin Telephone Speech, Part 1 was developed by Hong Kong University of Science and Technology (HKUST) and contains approximately 149 hours of conversational telephone speech (CTS) in Mandarin. Given that Standard Mandarin is not the native dialect...

LORELEI Kinyarwanda Incident Language Pack

Feb 7, 2022

Tracey, Jennifer; Graff, David; Strassel, Stephanie; Arrigo, Michael; Wright, Jonathan; Bies, Ann, 2022, "LORELEI Kinyarwanda Incident Language Pack", https://hdl.handle.net/11272.1/AB2/P1OIX0, Abacus Data Network, V1

Abstract Introduction LORELEI Kinyarwanda Incident Language Pack was developed by the Linguistic Data Consortium and is comprised of approximately 11.9 million words of Kinyarwanda monolingual text, 35,000 words of English monolingual text, 3.4 million words of parallel and compa...

2017 NIST OpenSAT Pilot - SSSF

Feb 7, 2022

Byers, Frederick, 2022, "2017 NIST OpenSAT Pilot - SSSF", https://hdl.handle.net/11272.1/AB2/PTU0AQ, Abacus Data Network, V1

Abstract Introduction 2017 NIST OpenSAT Pilot - SSSF was developed by NIST (National Institute of Standards and Technology) and contains approximately one hour of operational speech data, transcripts and annotation files used in the speech activity detection, automatic speech rec...

BOLT English Translation Treebank - Chinese SMS/Chat

Feb 7, 2022

Bies, Ann; Mott, Justin; Warner, Colin; Kulick, Seth, 2022, "BOLT English Translation Treebank - Chinese SMS/Chat", https://hdl.handle.net/11272.1/AB2/JBOOKU, Abacus Data Network, V1

Abstract Introduction BOLT English Translation Treebank - Chinese SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of SMS and chat text data translated from Chinese to English and annotated for part-of-speech and syntactic structure. The DARPA BOLT (Bro...

GALE Phase 3 Arabic Broadcast News Transcripts Part 2

Jan 24, 2022

Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki, 2017, "GALE Phase 3 Arabic Broadcast News Transcripts Part 2", https://hdl.handle.net/11272.1/AB2/VM5MOD, Abacus Data Network, V2

Introduction GALE Phase 3 Arabic Broadcast News Transcripts Part 2 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 128 hours of Arabic broadcast news speech collected in 2007 by the Linguistic Data Consortium (LDC), MediaNet, Tun...

BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Dec 2, 2021

Palmer, Martha; Hwang, Jena D.; Mansouri, Aous; Bonial, Claire; O'Gorman, Tim; Gung, James, 2021, "BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech", https://hdl.handle.net/11272.1/AB2/YS81IR, Abacus Data Network, V1

Abstract Introduction BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) and consists of propbank annotation on Egyp...

Second DIHARD Challenge Development - Eleven Sources

Dec 2, 2021

Ryant, Neville; Liberman, Mark; Fiumara, James; Cieri, Christopher, 2021, "Second DIHARD Challenge Development - Eleven Sources", https://hdl.handle.net/11272.1/AB2/CBFPZO, Abacus Data Network, V1

Abstract Introduction Second DIHARD Challenge Development - Eleven Sources was developed by LDC and contains approximately 22 hours of English and Chinese speech data along with corresponding annotations used in support of the Second DIHARD Challenge. The DIHARD Challenges are a...

BOLT Egyptian Arabic Treebank - SMS/Chat

Nov 18, 2021

Maamouri, Mohamed; Bies, Ann; Kulick, Seth; Krouna, Sondos; Tabassi, Dalila; Ciul, Michael, 2021, "BOLT Egyptian Arabic Treebank - SMS/Chat", https://hdl.handle.net/11272.1/AB2/1DSLOX, Abacus Data Network, V1

Abstract Introduction BOLT Egyptian Arabic Treebank - SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of Egyptian Arabic SMS/Chat data with part-of-speech annotation, morphology, and syntactic tree annotation. The DARPA BOLT (Broad Operational Language...

UCLA Speaker Variability Database

Nov 18, 2021

Keating, Patricia; Kreiman, Jody; Alwan, Abeer; Chong, Adam; Lee, Yoonjeong, 2021, "UCLA Speaker Variability Database", https://hdl.handle.net/11272.1/AB2/CIIVXT, Abacus Data Network, V1

Abstract Introduction UCLA Speaker Variability Database was developed by UCLA Speech Processing and Auditory Perception Laboratory and is comprised of approximately 34 hours of English speech and orthographic transcripts. This corpus was designed to sample variability in speaking...

Switchboard-1 Release 2

Oct 26, 2021

Godfrey, John J.; Holliman, Edward, 2021, "Switchboard-1 Release 2", https://hdl.handle.net/11272.1/AB2/VTPSCK, Abacus Data Network, V1

Abstract Introduction The Switchboard-1 Telephone Speech Corpus (LDC97S62) consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by...

Wikipedia Spanish Speech and Transcripts

Oct 14, 2021

Mena, Carlos Daniel Hernández; Ruiz, Iván Vladimir Meza, 2021, "Wikipedia Spanish Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/L05NFF, Abacus Data Network, V1

Abstract Introduction Wikipedia Spanish Speech and Transcripts consists of approximately 25 hours of Spanish read speech and transcripts. The read text was taken from the Spanish version of WikiProject Spoken Wikipedia, referred to as Wikipedia Grabada. The transcripts were devel...

BOLT Egyptian Arabic SMS/Chat Parallel Training Data

Oct 14, 2021

Tracey, Jennifer; Delgado, Dana; Chen, Song; Strassel, Stephanie, 2021, "BOLT Egyptian Arabic SMS/Chat Parallel Training Data", https://hdl.handle.net/11272.1/AB2/WXML9A, Abacus Data Network, V1

Abstract Introduction BOLT Egyptian Arabic SMS/Chat Parallel Training Data was developed by the Linguistic Data Consortium (LDC) and consists of approximately 723,000 tokens of Egyptian Arabic SMS/Chat data collected for the DARPA BOLT program along with their corresponding Engli...

Classical Arabic Dictionary

Oct 14, 2021

Alsheddi, Abeer, 2021, "Classical Arabic Dictionary", https://hdl.handle.net/11272.1/AB2/FQ7PIS, Abacus Data Network, V1

Abstract Introduction Classical Arabic Dictionary consists of approximately one hundred million words of Arabic collected from texts dating between 431 and 1104 CE, principally books and essays, along with word occurrences, source documents and related metadata. Data The dictiona...

IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b

Oct 1, 2021

Bills, Aric; Conners, Thomas; David, Anne; Dubinski, Eyal; Fiscus, Jonathan G.; Gann, Ketty; Harper, Mary; Kazi, Michael; Lim, Lynn-Li; Malyska, Nicolas; Melot, Jennifer; Ray, Jessica; Rytting, Anton; Shen, Sinney; Smith, Rosanna, 2021, "IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b", https://hdl.handle.net/11272.1/AB2/IFBL6A, Abacus Data Network, V1

Abstract Introduction IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Halh Mongolian conversational and scripted telephone speec...

IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d

Sep 29, 2021

Andresen, Jess; Bills, Aric; Conners, Thomas; Dubinski, Eyal; Fiscus, Jonathan G.; Harper, Mary; Kozlov, Kirill; Malyska, Nicolas; Melot, Jennifer; Morrison, Michelle; Phillips, Josh; Ray, Jessica; Rytting, Anton; Shen, Wade; Silber, Ronnie; Tzoukermann, Evelyne; Wong, Jamie, 2021, "IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d", https://hdl.handle.net/11272.1/AB2/TNSSDU, Abacus Data Network, V2

Abstract Introduction IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 350 hours of Swahili conversational and scripted telephone speech collect...

LORELEI Oromo Incident Language Pack

Sep 29, 2021

Tracey, Jennifer; Graff, David; Strassel, Stephanie; Arrigo, Michael; Wright, Jonathan; Bies, Ann, 2021, "LORELEI Oromo Incident Language Pack", https://hdl.handle.net/11272.1/AB2/EH7NXF, Abacus Data Network, V1

Abstract Introduction LORELEI Oromo Incident Language Pack was developed by the Linguistic Data Consortium and is comprised of approximately 3.9 million words of Oromo monolingual text, 25,000 words of English monolingual text, 135,000 words of parallel and comparable Oromo-Engli...

Database of Word Level Statistics - Mandarin

Sep 3, 2021

Neergaard, Karl David; Xu, Hongzhi; Huang, Chu-Ren, 2021, "Database of Word Level Statistics - Mandarin", https://hdl.handle.net/11272.1/AB2/VJDPA0, Abacus Data Network, V1

Abstract Introduction Database of Word Level Statistics - Mandarin was developed by The Hong Kong Polytechnic University. It provides lexical characteristics of a descriptive and statistical nature for words and nonwords of Mandarin Chinese. It is designed for researchers particu...

Abstract Meaning Representation (AMR) Annotation Release 3.0

Sep 3, 2021

Knight, Kevin; Badarau, Bianca; Baranescu, Laura; Bonial, Claire; Bardocz, Madalina; Griffitt, Kira; Hermjakob, Ulf; Marcu, Daniel; Palmer, Martha; O'Gorman, Tim; Schneider, Nathan, 2021, "Abstract Meaning Representation (AMR) Annotation Release 3.0", https://hdl.handle.net/11272.1/AB2/82CVJF, Abacus Data Network, V1

Abstract Introduction Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Ins...

Penn Discourse Treebank Version 2.0 - German Translation

Sep 3, 2021

Sluyter-Gaethje, Henny; Bourgonje, Peter; Stede, Manfred, 2021, "Penn Discourse Treebank Version 2.0 - German Translation", https://hdl.handle.net/11272.1/AB2/1AXWBN, Abacus Data Network, V1

Abstract Introduction Penn Discourse Treebank Version 2.0 - German Translation was developed at the University of Potsdam's Applied Computational Linguistics group and consists of approximately one million tokens derived from Penn Discourse Treebank Version 2.0 (LDC2008T05). This...

TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010

Sep 3, 2021

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2021, "TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010", https://hdl.handle.net/11272.1/AB2/VAZOSD, Abacus Data Network, V1

Abstract Introduction TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010 was developed by the Linguistic Data Consortium and contains training and evaluation data produced in support of the 2010 TAC KBP Surprise Slot Filling track, the only y...

TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014

Sep 3, 2021

Ellis, Joe; Getman, Jeremy; Strassel, Stephanie, 2021, "TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014", https://hdl.handle.net/11272.1/AB2/MRZALN, Abacus Data Network, V1

Abstract Introduction TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014 was developed by the Linguistic Data Consortium and contains training and evaluation data produced in support of the 2013 and 2014 TAC KBP Sentiment Slot Filling tracks....

X-SRL: Parallel Cross-lingual Semantic Role Labeling

Sep 3, 2021

Daza, Angel; Frank, Anette, 2021, "X-SRL: Parallel Cross-lingual Semantic Role Labeling", https://hdl.handle.net/11272.1/AB2/DNOJP9, Abacus Data Network, V1

Abstract Introduction X-SRL: Parallel Cross-lingual Semantic Role Labeling was developed by Heidelberg University, Department of Computational Linguistics and the Leibniz Institute for the German Language (IDS). It consists of approximately three million words of German, French a...

ESPADA

Sep 3, 2021

Arase, Yuki; Tsujii, Junichi, 2021, "ESPADA", https://hdl.handle.net/11272.1/AB2/ANSK9Z, Abacus Data Network, V1

Abstract Introduction ESPADA (Extended Syntactic Phrase Alignment DAtaset) consists of annotated parse trees and alignment on English sentential paraphrases extracted from machine translation evaluation corpora. It extends SPADE (LDC2018T09) by adding new annotated data for train...

BOLT Chinese SMS/Chat Parallel Training Data

Sep 3, 2021

Tracey, Jennifer; Delgado, Dana; Chen, Song; Strassel, Stephanie, 2021, "BOLT Chinese SMS/Chat Parallel Training Data", https://hdl.handle.net/11272.1/AB2/O3JTA9, Abacus Data Network, V1

Abstract Introduction BOLT Chinese SMS/Chat Parallel Training Data was developed by the Linguistic Data Consortium and consists of approximately 1.8 million tokens of Chinese SMS/Chat data collected for the DARPA BOLT program along with their corresponding English translations. Th...

Chinese Abstract Meaning Representation 2.0

Sep 3, 2021

Li, Bin; Xiao, Liming; Liu, Yihuan; Wen, Yuan; Song, Li; Chun, Jayeol; Feng, Minxuan; Zhou, Junsheng; Qu, Weiguang; Xue, Nianwen, 2021, "Chinese Abstract Meaning Representation 2.0", https://hdl.handle.net/11272.1/AB2/LVQEZJ, Abacus Data Network, V1

Abstract Introduction Chinese Abstract Meaning Representation (CAMR) 2.0 was developed by Brandeis University and Nanjing Normal University and is comprised of semantic representations of a set of approximately 20,000 Chinese sentences from Chinese Treebank (CTB) 8.0 (LDC2013T21)...

BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Sep 3, 2021

Agarwal, Nitin; Francini, Michelle; Kappler, Michelle; Micciulla, Linnea; Pradhan, Sameer; Ramshaw, Lance, 2021, "BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech", https://hdl.handle.net/11272.1/AB2/DXWM3B, Abacus Data Network, V1

Abstract Introduction BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on Egyptian Arabic discussion forum (DF), SMS/Chat and conversational tele...

LibriVox Spanish

Sep 2, 2021

Mena, Carlos Daniel Hernández, 2021, "LibriVox Spanish", https://hdl.handle.net/11272.1/AB2/AHBO1C, Abacus Data Network, V1

Abstract Introduction LibriVox Spanish consists of approximately 73 hours of Spanish read speech and transcripts. The audio data was taken from Spanish audiobooks developed by LibriVox, a non-profit project that creates audiobooks from public domain works. The transcripts were de...

Global TIMIT Mandarin Chinese

Sep 2, 2021

Ding, Hongwei; Liao, Sishi; Zhan, Yuqing; Yuan, Jiahong; Liberman, Mark, 2021, "Global TIMIT Mandarin Chinese", https://hdl.handle.net/11272.1/AB2/2CCXH8, Abacus Data Network, V1

Abstract Introduction Global TIMIT Mandarin Chinese was developed by the Linguistic Data Consortium and Shanghai Jiao Tong University and consists of approximately five hours of read speech and transcripts in Mandarin Chinese. The Global TIMIT project aimed to create a series of...
