Linguistic Data Consortium

1 to 50 of 411 Results

KAIROS Schema Learning Complex Event Annotation Sep 19, 2025 Chen, Song; Tracey, Jennifer; Bies, Ann; Caruso, Christopher; Strassel, Stephanie, 2025, "KAIROS Schema Learning Complex Event Annotation", https://hdl.handle.net/11272.1/AB2/Y1KPTS, Abacus Data Network, V1 Abstract Introduction KAIROS Schema Learning Complex Event Annotation was developed by the Linguistic Data Consortium (LDC) to support the DARPA KAIROS program. It contains English and Spanish text, audio, video and image data labeled for 93 real-world complex events (CEs) with e...
LoReHLT Uzbek Representative Language Pack Aug 19, 2025 Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Delgado, Dana; Arrigo, Michael, 2025, "LoReHLT Uzbek Representative Language Pack", https://hdl.handle.net/11272.1/AB2/VM5TBL, Abacus Data Network, V1 Abstract Introduction LoReHLT Uzbek Representative Language Pack consists of Uzbek monolingual text, Uzbek-English parallel text, annotations, audio recordings, supplemental resources and related software tools developed by the Linguistic Data Consortium for LoReHLT, a companion...
Chinese Sentence Pattern Structure Treebank Aug 18, 2025 Peng, Weiming; Zhao, Min; He, Jing; Song, Yuchen; Song, Tianbao; Guo, Dongdong; Sun, Jingbo; Zhu, Shuqin; Zhang, Yinbin; Wei, Zuntian; Hu, Jiajia; Song, Jihua; Sui, Zhifang; Wang, Ning, 2025, "Chinese Sentence Pattern Structure Treebank", https://hdl.handle.net/11272.1/AB2/QZUMNU, Abacus Data Network, V1 Abstract Introduction Chinese Sentence Pattern Structure Treebank (the SPS Treebank) was developed at Beijing Normal University and Peking University. It contains 5,016 sentences and 119,627 tokens syntactically annotated following the concept of sentence constituent analysis whi...
BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations Aug 18, 2025 Tracey, Jennifer; Chen, Song; Delgado, Dana; Strassel, Stephanie, 2025, "BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations", https://hdl.handle.net/11272.1/AB2/LGXOHL, Abacus Data Network, V1 Abstract Introduction BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations was developed by the Linguistic Data Consortium (LDC) and consists of transcripts and their corresponding English translations for 93 hours of conversational telephone speech...
IWSLT 2022-2023 Shared Task Training, Development and Test Set Aug 14, 2025 Arrigo, Michael; Delgado, Dana; Strassel, Stephanie; Graff, David, 2025, "IWSLT 2022-2023 Shared Task Training, Development and Test Set", https://hdl.handle.net/11272.1/AB2/ONUJ54, Abacus Data Network, V1 Abstract Introduction IWSLT 2022 - 2023 Shared Task Training, Development and Test Set was developed by the Linguistic Data Consortium (LDC). It contains 210 hours of Tunisian Arabic conversational telephone speech, transcripts and their English translations covering 175 hours of...
AnnoDIFP Session Audio and Transcripts Aug 14, 2025 Cieri, Christopher; Fiumara, James; Walker, Kevin; Liberman, Mark; Ryant, Neville, 2025, "AnnoDIFP Session Audio and Transcripts", https://hdl.handle.net/11272.1/AB2/OGBCJ9, Abacus Data Network, V1 Abstract Introduction AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) Session Audio and Transcripts was developed by the Linguistic Data Consortium (LDC), the Florida Institute of Technology (FIT), and the University of New Haven (UNH) to support algorith...
BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio Aug 14, 2025 Tracey, Jennifer; Graff, David; Chen, Song; Strassel, Stephanie, 2025, "BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio", https://hdl.handle.net/11272.1/AB2/1BGPSO, Abacus Data Network, V1 Abstract Introduction BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio was developed by the Linguistic Data Consortium (LDC) and consists of approximately 93 hours of speech from 236 unscripted telephone conversations between native speakers of the Mandarin Chinese di...
Penn Parsed Corpora of Historical English Second Release Jul 23, 2025 Kroch, Anthony; Santorini, Beatrice; Taylor, Ann; Diertani, Ariel, 2025, "Penn Parsed Corpora of Historical English Second Release", https://hdl.handle.net/11272.1/AB2/E4NMWX, Abacus Data Network, V1 Abstract Introduction Penn Parsed Corpora of Historical English Second Release was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the Firs...
MATERIAL Kazakh-English Language Pack Jun 9, 2025 Bekkozhanova, Gulnar; Bills, Aric; Chouder, Sarra; Jaralve, Vanessa; Corey, Cassian; Dubinski, Eyal; Ellis, Corinna; Gibby, Paul; Kazi, Michael; Lam, Julie; Le, Hanh; Malyska, Nicolas; Marcucci, Giorgia; Marvi, Sarah; McConnell, Sara; Melot, Jennifer; Mensch, Alyssa; Morrison, Michelle; Paget, Shelley; Ramizo, Katerina; Richardson, Frederick; Roberts, Annette; Rubino, Carl; Sarseke, Gulnar; Taubayev, Zharas, 2025, "MATERIAL Kazakh-English Language Pack", https://hdl.handle.net/11272.1/AB2/5G61UB, Abacus Data Network, V1 Abstract Introduction MATERIAL Kazakh-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 57 hours of K...
2015 NIST Language Recognition Evaluation Test Set Apr 29, 2025 Greenberg, Craig; Sadjadi, Omid; Graff, David; Walker, Kevin; Jones, Karen; Caruso, Christopher; Strassel, Stephanie; Wright, Jonathan, 2025, "2015 NIST Language Recognition Evaluation Test Set", https://hdl.handle.net/11272.1/AB2/TPVLOA, Abacus Data Network, V1 Abstract Introduction 2015 NIST Language Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It contains the evaluation test set for the 2015 NIST Language Recognition Evaluation, app...
DEFT Spanish Light and Rich ERE Annotation Apr 29, 2025 Chen, Song; Mott, Justin; Strassel, Stephanie, 2025, "DEFT Spanish Light and Rich ERE Annotation", https://hdl.handle.net/11272.1/AB2/WMSO8E, Abacus Data Network, V1 Abstract Introduction DEFT Spanish Light and Rich ERE Annotation was developed by the Linguistic Data Consortium (LDC) and consists of 158 Spanish discussion forum and newswire documents annotated for entities, relations and events (ERE). DARPA's Deep Exploration and Filtering of...
The Xi’an Multi-Language Learner Corpus Apr 29, 2025 Zhang, Xiao; Zhang, Ling; Dang, Tian; Feng, Yuanzhao; Ji, Yujing; Jiang, Xiaohui; Kang, Zhewen; Lu, Yan; Nie, Wen; Ren, Hanyu; Wang, Canjun; Wang, Jiayi; Wang, Yu; Wu, Chen; Wu, Mei; Xu, Tingting; Yang, Ruhai; Zhao, Kai; Zhao, Ran; Zhou, Quanjie; Zhu, Lei, 2025, "The Xi’an Multi-Language Learner Corpus", https://hdl.handle.net/11272.1/AB2/KEPEYK, Abacus Data Network, V1 Abstract Introduction The Xi’an Multi-Language Learner Corpus was developed by Xi'an International Studies University (XISU). It is comprised of 526 argumentative essays in 15 languages by Chinese L1 university students studying second languages, along with student metadata and w...
LORELEI Hungarian Representative Language Pack Apr 3, 2025 Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2025, "LORELEI Hungarian Representative Language Pack", https://hdl.handle.net/11272.1/AB2/6G8DZZ, Abacus Data Network, V1 Abstract Introduction LORELEI Hungarian Representative Language Pack consists of Hungarian monolingual text, Hungarian-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program....
Abstract Meaning Representation 3.0 - Machine Translations Apr 3, 2025 Vanroy, Bram, 2025, "Abstract Meaning Representation 3.0 - Machine Translations", https://hdl.handle.net/11272.1/AB2/TKRDFD, Abacus Data Network, V1 Abstract Introduction Abstract Meaning Representation 3.0 - Machine Translations was developed by the Center for Computational Linguistics at KU Leuven in the HORIZON2020 project SignON. It is an automatic translation of a subset of sentences from Abstract Meaning Representation...
AIDA Scenario 3 Practice Topic Source Data and Annotation Apr 3, 2025 Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2025, "AIDA Scenario 3 Practice Topic Source Data and Annotation", https://hdl.handle.net/11272.1/AB2/KAFV5Q, Abacus Data Network, V1 Abstract Introduction AIDA Scenario 3 Practice Topic Source Data and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of English, Russian and Spanish web documents (text, video, image) and annotations. The DARPA AIDA (Active Interpretation of Disp...
ASpIRE Development and Development Test Sets Apr 1, 2025 Linguistic Data Consortium; Appen Pty Ltd., 2025, "ASpIRE Development and Development Test Sets", https://hdl.handle.net/11272.1/AB2/YS9IIX, Abacus Data Network, V1 Abstract Introduction ASpIRE Development and Development Test Sets was developed for the Automatic Speech recognition In Reverberant Environments (ASpIRE) Challenge sponsored by IARPA (the Intelligence Advanced Research Projects Activity). It contains approximately 226 hours of En...
MATERIAL Georgian-English Language Pack Mar 28, 2025 Asatiani, Sandro; Bills, Aric; Brunckhorst, Rachael; Chouder, Sarra; Corey, Cassian; Dubinski, Eyal; Ellis, Corinna; Gibby, Paul; Kalkhitashvili, Tamar; Kazi, Michael; Tong, Audrey; Lam, Julie; Le, Hanh; Malyska, Nicolas; Marcucci, Giorgia; Marvi, Sarah; McConnell, Sara; Melot, Jennifer; Mensch, Alyssa; Morrison, Michelle; Paget, Shelley; Richardson, Frederick; Roberts, Annette; Rubino, Carl; Samushia, Lela, 2025, "MATERIAL Georgian-English Language Pack", https://hdl.handle.net/11272.1/AB2/H5DHYO, Abacus Data Network, V1 Abstract Introduction MATERIAL Georgian-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 79 hours of...
MATERIAL Farsi-English Language Pack Mar 28, 2025 Bills, Aric; Chouder, Sarra; Corey, Cassian; Davoodian, Marjan; Dubinski, Eyal; Ellis, Corinna; Farnam, Reza; Gibby, Paul; Hartwig, Luke; Kalnins, Dagmara; Kazi, Michael; Lam, Julie; Le, Hanh; Malyska, Nicolas; Marvi, Sarah; McConnell, Sara; Melot, Jennifer; Mensch, Alyssa; Moore, Alex; Morrison, Michelle; Paget, Shelley; Richardson, Frederick; Roberts, Annette; Rubino, Carl; Moaddel, Marjan Sadeghi, 2025, "MATERIAL Farsi-English Language Pack", https://hdl.handle.net/11272.1/AB2/WLFTJ6, Abacus Data Network, V1 Abstract Introduction MATERIAL Farsi-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 61 hours of Fa...
MATERIAL Somali-English Language Pack Mar 28, 2025 Abdi, Zeinab; Ali, Zahra; Bills, Aric; Bishop, Judith; Boyle, Anne; Chouder, Sarra; Clair, Nathaniel; Conners, Tom; Corey, Cassian; Dubinski, Eyal; Ellis, Corinna; Fernando, Jess; Gibby, Paul; Abdi, Farah H; Hammond, Simon; Hubert, Maxime; Kaiser-Schatzlein, Alice; Kazi, Michael; Lam, Julie; Lazar, Rosie; Le, Hanh; Levot, Michael; Malyska, Nicolas; Melot, Jennifer; Mensch, Alyssa; Omar, Abdulkadir Arale; Paget, Shelley; Richardson, Frederick; Rubino, Carl; Samko, Bern; Sanders, Gregory; Soh, Stephanie; Strahan, Tania E.; Taylor, Jonathan; Thompson, Brian; Tong, Audrey; Tong, Richard; Yelle, Julie; Yu, Jennifer; Zavorin, Ilya, 2025, "MATERIAL Somali-English Language Pack", https://hdl.handle.net/11272.1/AB2/2FKSLF, Abacus Data Network, V1 Abstract Introduction MATERIAL Somali-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 80 hours of S...
MATERIAL Bulgarian-English Language Pack Mar 28, 2025 Bills, Aric; Bishop, Judith; Boyle, Anne; Chouder, Sarra; Clair, Nathaniel; Conners, Tom; Corey, Cassian; Cronin, Kristina; Dubinski, Eyal; Ellis, Corinna; Gibby, Paul; Hammond, Simon; Hidalgo, Guia; Kaiser-Schatzlein, Alice; Kalnins, Dagmara; Kazi, Michael; Lam, Julie; Lazar, Rosie; Le, Hanh; Malyska, Nicolas; Medel, Olivia; Melot, Jennifer; Mensch, Alyssa; Moore, Alex; Morrison, Michelle; Paget, Shelley; Raymer, Alston; Richardson, Fred; Ridgway, Hristina; Roberts, Annette; Rubino, Carl; Saw, Kenneth; Shen, Sinney; Soh, Stephanie; Taylor, Jonathan; Thompson, Brian; Tong, Audrey; Tong, Richard; Williams, Mariana; Yelle, Julie; Yu, Jennifer; Zavora, Yoanna; Zavorin, Ilya, 2025, "MATERIAL Bulgarian-English Language Pack", https://hdl.handle.net/11272.1/AB2/WCU3PV, Abacus Data Network, V1 Abstract Introduction MATERIAL Bulgarian-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 78 hours o...
Samrómur Synthetic Feb 3, 2025 Hernández Mena, Carlos Daniel; Örnólfsson, Gunnar Thor; Gudnason, Jon, 2025, "Samrómur Synthetic", https://hdl.handle.net/11272.1/AB2/DZUB82, Abacus Data Network, V1 Abstract Introduction Samrómur Synthetic was developed by the Language and Voice Lab, Reykjavik University and contains 72 hours of Icelandic synthetic speech, transcripts and metadata. Data Source sentences were extracted from the Samrómur platform, comprised of texts and transc...
Ravnursson Faroese Speech and Transcripts Feb 3, 2025 Hernández Mena, Carlos Daniel; Simonsen, Annika; Gudnason, Jon, 2025, "Ravnursson Faroese Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/OBXEAK, Abacus Data Network, V1 Abstract Introduction Ravnursson Faroese Speech and Transcripts contains 109 hours of Faroese prompted speech from 433 speakers (249 female, 184 male), corresponding transcripts and speaker metadata. It is an extract from the Basic Language Resource Kit 1.0 (BLARK 1.0) developed...
L2-KSU Native and Non-Native Arabic Speech Feb 3, 2025 Alrashoudi, Norah; AlKhalifa, Hend; Alotaibi, Yousef Ajami, 2025, "L2-KSU Native and Non-Native Arabic Speech", https://hdl.handle.net/11272.1/AB2/N7YZP8, Abacus Data Network, V1 Abstract Introduction L2-KSU Native and Non-Native Arabic Speech was developed by King Saud University (KSU) and contains approximately six hours of Modern Standard Arabic read speech from 80 subjects, along with transcripts and speaker metadata. Data The speech data was collecte...
Iraqi Arabic - English Lexical Database Feb 3, 2025 Maamouri, Mohamed; Graff, David, 2025, "Iraqi Arabic - English Lexical Database", https://hdl.handle.net/11272.1/AB2/EUPXQD, Abacus Data Network, V1 Abstract Introduction Iraqi Arabic - English Lexical Database was developed by the Linguistic Data Consortium (LDC). It contains six interrelated tables presenting over 67,000 Iraqi Arabic words as orthographic forms in Arabic script and pronunciation forms in International Phone...
LORELEI Yoruba Representative Language Pack Jan 21, 2025 Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2025, "LORELEI Yoruba Representative Language Pack", https://hdl.handle.net/11272.1/AB2/ATPB58, Abacus Data Network, V1 Abstract Introduction LORELEI Yoruba Representative Language Pack (LDC2024T10) consists of Yoruba monolingual text, Yoruba-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI progr...
MultiTACRED Jan 21, 2025 Hennig, Leonhard; Thomas, Philippe; Möller, Sebastian, 2025, "MultiTACRED", https://hdl.handle.net/11272.1/AB2/GIEQ7J, Abacus Data Network, V1 Abstract Introduction MultiTACRED was developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab and is a machine translation of TAC Relation Extraction Dataset (LDC2018T24) (TACRED) into twelve languages with projected entity an...
RST Continuity Corpus Jan 21, 2025 Das, Debopam; Egg, Markus, 2025, "RST Continuity Corpus", https://hdl.handle.net/11272.1/AB2/YSIB2J, Abacus Data Network, V1 Abstract Introduction RST Continuity Corpus was developed at Åbo Akademi University and Humboldt-Universität zu Berlin and contains annotations for continuity dimensions added to RST Discourse Treebank (LDC2002T07). RST Discourse Treebank is a collection of English news texts fro...
First-Year Law Students' Court Memoranda Oct 25, 2024 Larson, Brian N., 2024, "First-Year Law Students' Court Memoranda", https://hdl.handle.net/11272.1/AB2/CC9MT6, Abacus Data Network, V1 Abstract Introduction First-Year Law Students' Court Memoranda consists of 197 English law student writing samples of legal briefs annotated for certain characteristics along with accompanying survey responses by the student writers. The briefs were created in a law school writin...
Samrómur Queries Icelandic Speech 1.0 Oct 25, 2024 Hedström, Staffan; Fong, Judy; Þórhallsdóttir, Ragnheiður; Mollberg, David; Guðmundsson, Smári Freyr; Jónsson, Ólafur Helgi; Þorsteinsdóttir, Sunneva; Magnusdottir, Eydis Huld; Gudnason, Jon, 2024, "Samrómur Queries Icelandic Speech 1.0", https://hdl.handle.net/11272.1/AB2/DGPHQR, Abacus Data Network, V1 Abstract Introduction Samrómur Queries Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 20 hours of Icelandic prompted queries from 3,809 speakers represent...
TRAD Arabic-French Parallel Text -- Newswire Oct 25, 2024 Linguistic Data Consortium; ELDA, 2024, "TRAD Arabic-French Parallel Text -- Newswire", https://hdl.handle.net/11272.1/AB2/48BBWO, Abacus Data Network, V1 Abstract Introduction TRAD Arabic-French Parallel Text -- Newswire was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 20,000 Arabic words from NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21). The...
TRAD Chinese-French Parallel Text -- Broadcast News Oct 25, 2024 Linguistic Data Consortium; ELDA, 2024, "TRAD Chinese-French Parallel Text -- Broadcast News", https://hdl.handle.net/11272.1/AB2/IZFPYW, Abacus Data Network, V1 Abstract Introduction TRAD Chinese-French Parallel Text -- Broadcast News was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 30,000 Chinese characters from GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3...
2007 CoNLL Shared Task - Greek, Hungarian & Italian Oct 25, 2024 Dipartimento di Informatica of the University of Pisa; ILC-CNR; Institute for Language and Speech Processing; Institute of Informatics at the University of Szeged; Institute of Linguistics at the Hungarian Academy of Sciences; Morphologic Ltd., 2024, "2007 CoNLL Shared Task - Greek, Hungarian & Italian", https://hdl.handle.net/11272.1/AB2/JLYA64, Abacus Data Network, V1 Abstract Introduction 2007 CoNLL Shared Task - Greek, Hungarian & Italian consists of dependency treebanks in three languages used as part of the CoNLL 2007 shared task on multi-lingual dependency parsing and domain adaptation. The languages covered in this release are: Greek, Hu...
Vehicle City Voices Corpus – Part I Oct 25, 2024 Britt, Erica, 2024, "Vehicle City Voices Corpus – Part I", https://hdl.handle.net/11272.1/AB2/8XVBZS, Abacus Data Network, V1 Abstract Introduction Vehicle City Voices Corpus – Part I was developed at the University of Michigan-Flint, and is an ongoing oral history project and survey of English language variation in Flint, Michigan. It contains approximately 16 hours of speech with corresponding transcr...
CHM150 Oct 25, 2024 Hernández Mena, Carlos Daniel; Herrera, Abel, 2024, "CHM150", https://hdl.handle.net/11272.1/AB2/UWURFR, Abacus Data Network, V1 Abstract Introduction CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcri...
Arabic Learner Corpus Oct 25, 2024 Alfaifi, Abdullah; Atwell, Eric, 2024, "Arabic Learner Corpus", https://hdl.handle.net/11272.1/AB2/DPQWPU, Abacus Data Network, V1 Abstract Introduction Arabic Learner Corpus was developed at the University of Leeds and consists of written essays and spoken recordings by Arabic learners collected in Saudi Arabia in 2012 and 2013. The corpus includes 282,732 words in 1,585 materials, produced by 942 students...
BabyEars Affective Vocalizations Oct 25, 2024 Slaney, Malcolm; McRoberts, Gerald; Scheirer, Jocelyn, 2024, "BabyEars Affective Vocalizations", https://hdl.handle.net/11272.1/AB2/VK52W9, Abacus Data Network, V1 Abstract Introduction BabyEars Affective Vocalizations was developed by Malcolm Slaney, Gerald McRoberts, and Jocelyn Scheirer. It contains approximately 22 minutes of spontaneous English speech by 12 adults interacting with their infant children, for a total of 509 infant-direct...
Second Language University Speech Intelligibility Corpus Oct 25, 2024 Kang, Okim; Hirschi, Kevin; Looney, Stephen D.; Hansen, John H. L., 2024, "Second Language University Speech Intelligibility Corpus", https://hdl.handle.net/11272.1/AB2/QHVV2O, Abacus Data Network, V1 Abstract Introduction Second Language University Speech Intelligibility Corpus was developed by Northern Arizona University, The Pennsylvania State University, and The University of Texas at Dallas. It contains 10.5 hours of English speech by 66 international faculty and universi...
AIDA Scenario 2 Practice Topic Annotation Sep 17, 2024 Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2024, "AIDA Scenario 2 Practice Topic Annotation", https://hdl.handle.net/11272.1/AB2/BFKQTZ, Abacus Data Network, V1 Abstract Introduction AIDA Scenario 2 Practice Topic Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of annotations for 29 English, Russian and Spanish web documents (text, image and video) from AIDA Scenario 2 Practice Topic Source Data (LDC2024...
Dialogs Re-Enacted Across Languages Sep 17, 2024 Ward, Nigel G.; Avila, Jonathan E.; Rivas, Emilia; Marco, Divette, 2024, "Dialogs Re-Enacted Across Languages", https://hdl.handle.net/11272.1/AB2/XRMWND, Abacus Data Network, V1 Abstract Introduction Dialogs Re-Enacted Across Languages was developed at the University of Texas at El Paso. It contains approximately 17 hours of conversational speech in English and Spanish by 129 unique bilingual speakers, specifically, short fragments extracted from spontan...
Diaspora Tibetan Speech Sep 17, 2024 Geissler, Christopher; Babinski, Sarah; Shaw, Jason, 2024, "Diaspora Tibetan Speech", https://hdl.handle.net/11272.1/AB2/OPZ58Z, Abacus Data Network, V1 Abstract Introduction Diaspora Tibetan Speech was developed at Yale University. It contains approximately 28 hours of Tibetan elicited speech by 73 speakers from the diaspora Tibetan community in Kathmandu, Nepal, along with transcripts, elicitation materials and speaker demograp...
LORELEI Uyghur Incident Language Pack Sep 17, 2024 Tracey, Jennifer; Strassel, Stephanie; Arrigo, Michael; Wright, Jonathan; Graff, David; Bies, Ann, 2024, "LORELEI Uyghur Incident Language Pack", https://hdl.handle.net/11272.1/AB2/VRJN4A, Abacus Data Network, V1 Abstract Introduction LORELEI Uyghur Incident Language Pack (LDC2024T07) was developed by the Linguistic Data Consortium and consists of approximately 28 million words of Uyghur monolingual text, 500,000 words of English monolingual text, 3.3 million words of parallel and compara...
Call My Net 1 Jul 30, 2024 Jones, Karen; Walker, Kevin; Graff, David; Wright, Jonathan; Strassel, Stephanie, 2024, "Call My Net 1", https://hdl.handle.net/11272.1/AB2/RJMIEI, Abacus Data Network, V1 Abstract Introduction Call My Net 1 was developed by the Linguistic Data Consortium and contains 364 hours of conversational telephone speech in four languages (Tagalog, Cebuano, Cantonese and Mandarin) collected in 2015 from 221 native speakers located in the Philippines and Chi...
Automatic Content Extraction for Portuguese Jul 30, 2024 Cunha, Luís Filipe; Silvano, Purificação; Campos, Ricardo; Jorge, Alípio, 2024, "Automatic Content Extraction for Portuguese", https://hdl.handle.net/11272.1/AB2/5VRIQB, Abacus Data Network, V1 Abstract Introduction Automatic Content Extraction for Portuguese (LDC2024T05) was developed at INESC TEC - Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência and consists of automatic Brazilian Portuguese and European Portuguese translations of the English...
LoReHLT Hausa Representative Language Pack May 13, 2024 Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Griffitt, Kira; Ryant, Neville; Kulick, Seth; Delgado, Dana; Arrigo, Michael, 2024, "LoReHLT Hausa Representative Language Pack", https://hdl.handle.net/11272.1/AB2/7MWKZC, Abacus Data Network, V1 Abstract Introduction LoReHLT Hausa Representative Language Pack consists of Hausa monolingual text, Hausa-English parallel text, annotations, amateur web audio recordings, supplemental resources and related software tools developed by the Linguistic Data Consortium for LoReHLT,...
AIDA Scenario 2 Practice Topic Source Data May 13, 2024 Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2024, "AIDA Scenario 2 Practice Topic Source Data", https://hdl.handle.net/11272.1/AB2/TXAWUL, Abacus Data Network, V1 Abstract Introduction AIDA Scenario 2 Practice Topic Source Data was developed by the Linguistic Data Consortium (LDC) and is comprised of 1500 root documents, including text, image, and video, from English, Russian, and Spanish web sources. The DARPA AIDA (Active Interpretation...
RATS Low Speech Density May 13, 2024 Walker, Kevin; Graff, David; Ma, Xiaoyi; Strassel, Stephanie; Jones, Karen, 2024, "RATS Low Speech Density", https://hdl.handle.net/11272.1/AB2/CXVUXZ, Abacus Data Network, V1 Abstract Introduction RATS Low Speech Density was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 87 hours of English, Levantine Arabic, Farsi, Pashto and Urdu speech and non-speech samples. The recordings were assembled by concatenating a rand...
LORELEI Farsi Representative Language Pack Mar 28, 2024 Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2024, "LORELEI Farsi Representative Language Pack", https://hdl.handle.net/11272.1/AB2/UMEVGY, Abacus Data Network, V1 Abstract Introduction LORELEI Farsi Representative Language Pack consists of Farsi monolingual text, Farsi-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program. The LORELEI...
AIDA Scenario 1 Practice Topic Annotation Mar 28, 2024 Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2024, "AIDA Scenario 1 Practice Topic Annotation", https://hdl.handle.net/11272.1/AB2/XPPJWR, Abacus Data Network, V1 Abstract Introduction AIDA Scenario 1 Practice Topic Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of annotations for 212 English, Russian and Ukrainian web documents (text, image and video) from AIDA Scenario 1 Practice Topic Source Data (LDC2...
KASET - Kurmanji and Sorani Kurdish Speech and Transcripts Mar 28, 2024 Delgado, Dana; Walker, Kevin; Strassel, Stephanie; Graff, David; Caruso, Christopher, 2024, "KASET - Kurmanji and Sorani Kurdish Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/ODAGYC, Abacus Data Network, V1 Abstract Introduction KASET - Kurmanji and Sorani Kurdish Speech and Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of approximately 147 hours of telephone conversations (289 recordings) and broadcast news (410 recordings) in two Kurdish dialects:...
TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 Jan 11, 2024 Tracey, Jennifer; Strassel, Stephanie; Arrigo, Michael, 2024, "TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017", https://hdl.handle.net/11272.1/AB2/OM2WHS, Abacus Data Network, V1 Abstract Introduction TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 (LDC2023T13) was developed by the Linguistic Data Consortium and contains training and evaluation data produced in support of the 2016 and 2017 TAC KBP Belief and Sentiment (...

Penn Parsed Corpora of Historical English Second Release

Jul 23, 2025

Kroch, Anthony; Santorini, Beatrice; Taylor, Ann; Diertani, Ariel, 2025, "Penn Parsed Corpora of Historical English Second Release", https://hdl.handle.net/11272.1/AB2/E4NMWX, Abacus Data Network, V1

Abstract Introduction Penn Parsed Corpora of Historical English Second Release was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the Firs...

MATERIAL Kazakh-English Language Pack

Jun 9, 2025

Bekkozhanova, Gulnar; Bills, Aric; Chouder, Sarra; Jaralve, Vanessa; Corey, Cassian; Dubinski, Eyal; Ellis, Corinna; Gibby, Paul; Kazi, Michael; Lam, Julie; Le, Hanh; Malyska, Nicolas; Marcucci, Giorgia; Marvi, Sarah; McConnell, Sara; Melot, Jennifer; Mensch, Alyssa; Morrison, Michelle; Paget, Shelley; Ramizo, Katerina; Richardson, Frederick; Roberts, Annette; Rubino, Carl; Sarseke, Gulnar; Taubayev, Zharas, 2025, "MATERIAL Kazakh-English Language Pack", https://hdl.handle.net/11272.1/AB2/5G61UB, Abacus Data Network, V1

Abstract Introduction MATERIAL Kazakh-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 57 hours of K...

2015 NIST Language Recognition Evaluation Test Set

Apr 29, 2025

Greenberg, Craig; Sadjadi, Omid; Graff, David; Walker, Kevin; Jones, Karen; Caruso, Christopher; Strassel, Stephanie; Wright, Jonathan, 2025, "2015 NIST Language Recognition Evaluation Test Set", https://hdl.handle.net/11272.1/AB2/TPVLOA, Abacus Data Network, V1

Abstract Introduction 2015 NIST Language Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It contains the evaluation test set for the 2015 NIST Language Recognition Evaluation, app...

DEFT Spanish Light and Rich ERE Annotation

Apr 29, 2025

Chen, Song; Mott, Justin; Strassel, Stephanie, 2025, "DEFT Spanish Light and Rich ERE Annotation", https://hdl.handle.net/11272.1/AB2/WMSO8E, Abacus Data Network, V1

Abstract Introduction DEFT Spanish Light and Rich ERE Annotation was developed by the Linguistic Data Consortium (LDC) and consists of 158 Spanish discussion forum and newswire documents annotated for entities, relations and events (ERE). DARPA's Deep Exploration and Filtering of...

The Xi’an Multi-Language Learner Corpus

Apr 29, 2025

Zhang, Xiao; Zhang, Ling; Dang, Tian; Feng, Yuanzhao; Ji, Yujing; Jiang, Xiaohui; Kang, Zhewen; Lu, Yan; Nie, Wen; Ren, Hanyu; Wang, Canjun; Wang, Jiayi; Wang, Yu; Wu, Chen; Wu, Mei; Xu, Tingting; Yang, Ruhai; Zhao, Kai; Zhao, Ran; Zhou, Quanjie; Zhu, Lei, 2025, "The Xi’an Multi-Language Learner Corpus", https://hdl.handle.net/11272.1/AB2/KEPEYK, Abacus Data Network, V1

Abstract Introduction The Xi’an Multi-Language Learner Corpus was developed by Xi’an International Studies University (XISU). It is comprised of 526 argumentative essays in 15 languages by Chinese L1 university students studying second languages, along with student metadata and w...

LORELEI Hungarian Representative Language Pack

Apr 3, 2025

Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2025, "LORELEI Hungarian Representative Language Pack", https://hdl.handle.net/11272.1/AB2/6G8DZZ, Abacus Data Network, V1

Abstract Introduction LORELEI Hungarian Representative Language Pack consists of Hungarian monolingual text, Hungarian-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program....

Abstract Meaning Representation 3.0 - Machine Translations

Apr 3, 2025

Vanroy, Bram, 2025, "Abstract Meaning Representation 3.0 - Machine Translations", https://hdl.handle.net/11272.1/AB2/TKRDFD, Abacus Data Network, V1

Abstract Introduction Abstract Meaning Representation 3.0 - Machine Translations was developed by the Center for Computational Linguistics at KU Leuven in the HORIZON2020 project SignON. It is an automatic translation of a subset of sentences from Abstract Meaning Representation...

AIDA Scenario 3 Practice Topic Source Data and Annotation

Apr 3, 2025

Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2025, "AIDA Scenario 3 Practice Topic Source Data and Annotation", https://hdl.handle.net/11272.1/AB2/KAFV5Q, Abacus Data Network, V1

Abstract Introduction AIDA Scenario 3 Practice Topic Source Data and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of English, Russian and Spanish web documents (text, video, image) and annotations. The DARPA AIDA (Active Interpretation of Disp...

ASpIRE Development and Development Test Sets

Apr 1, 2025

Linguistic Data Consortium; Appen Pty Ltd., 2025, "ASpIRE Development and Development Test Sets", https://hdl.handle.net/11272.1/AB2/YS9IIX, Abacus Data Network, V1

Abstract Introduction ASpIRE Development and Development Test Sets was developed for the Automatic Speech recognition In Reverberant Environments (ASpIRE) Challenge sponsored by IARPA (the Intelligence Advanced Research Projects Activity). It contains approximately 226 hours of En...

MATERIAL Georgian-English Language Pack

Mar 28, 2025

Asatiani, Sandro; Bills, Aric; Brunckhorst, Rachael; Chouder, Sarra; Corey, Cassian; Dubinski, Eyal; Ellis, Corinna; Gibby, Paul; Kalkhitashvili, Tamar; Kazi, Michael; Tong, Audrey; Lam, Julie; Le, Hanh; Malyska, Nicolas; Marcucci, Giorgia; Marvi, Sarah; McConnell, Sara; Melot, Jennifer; Mensch, Alyssa; Morrison, Michelle; Paget, Shelley; Richardson, Frederick; Roberts, Annette; Rubino, Carl; Samushia, Lela, 2025, "MATERIAL Georgian-English Language Pack", https://hdl.handle.net/11272.1/AB2/H5DHYO, Abacus Data Network, V1

Abstract Introduction MATERIAL Georgian-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 79 hours of...

MATERIAL Farsi-English Language Pack

Mar 28, 2025

Bills, Aric; Chouder, Sarra; Corey, Cassian; Davoodian, Marjan; Dubinski, Eyal; Ellis, Corinna; Farnam, Reza; Gibby, Paul; Hartwig, Luke; Kalnins, Dagmara; Kazi, Michael; Lam, Julie; Le, Hanh; Malyska, Nicolas; Marvi, Sarah; McConnell, Sara; Melot, Jennifer; Mensch, Alyssa; Moore, Alex; Morrison, Michelle; Paget, Shelley; Richardson, Frederick; Roberts, Annette; Rubino, Carl; Moaddel, Marjan Sadeghi, 2025, "MATERIAL Farsi-English Language Pack", https://hdl.handle.net/11272.1/AB2/WLFTJ6, Abacus Data Network, V1

Abstract Introduction MATERIAL Farsi-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 61 hours of Fa...

MATERIAL Somali-English Language Pack

Mar 28, 2025

Abdi, Zeinab; Ali, Zahra; Bills, Aric; Bishop, Judith; Boyle, Anne; Chouder, Sarra; Clair, Nathaniel; Conners, Tom; Corey, Cassian; Dubinski, Eyal; Ellis, Corinna; Fernando, Jess; Gibby, Paul; Abdi, Farah H; Hammond, Simon; Hubert, Maxime; Kaiser-Schatzlein, Alice; Kazi, Michael; Lam, Julie; Lazar, Rosie; Le, Hanh; Levot, Michael; Malyska, Nicolas; Melot, Jennifer; Mensch, Alyssa; Omar, Abdulkadir Arale; Paget, Shelley; Richardson, Frederick; Rubino, Carl; Samko, Bern; Sanders, Gregory; Soh, Stephanie; Strahan, Tania E.; Taylor, Jonathan; Thompson, Brian; Tong, Audrey; Tong, Richard; Yelle, Julie; Yu, Jennifer; Zavorin, Ilya, 2025, "MATERIAL Somali-English Language Pack", https://hdl.handle.net/11272.1/AB2/2FKSLF, Abacus Data Network, V1

Abstract Introduction MATERIAL Somali-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 80 hours of S...

MATERIAL Bulgarian-English Language Pack

Mar 28, 2025

Bills, Aric; Bishop, Judith; Boyle, Anne; Chouder, Sarra; Clair, Nathaniel; Conners, Tom; Corey, Cassian; Cronin, Kristina; Dubinski, Eyal; Ellis, Corinna; Gibby, Paul; Hammond, Simon; Hidalgo, Guia; Kaiser-Schatzlein, Alice; Kalnins, Dagmara; Kazi, Michael; Lam, Julie; Lazar, Rosie; Le, Hanh; Malyska, Nicolas; Medel, Olivia; Melot, Jennifer; Mensch, Alyssa; Moore, Alex; Morrison, Michelle; Paget, Shelley; Raymer, Alston; Richardson, Fred; Ridgway, Hristina; Roberts, Annette; Rubino, Carl; Saw, Kenneth; Shen, Sinney; Soh, Stephanie; Taylor, Jonathan; Thompson, Brian; Tong, Audrey; Tong, Richard; Williams, Mariana; Yelle, Julie; Yu, Jennifer; Zavora, Yoanna; Zavorin, Ilya, 2025, "MATERIAL Bulgarian-English Language Pack", https://hdl.handle.net/11272.1/AB2/WCU3PV, Abacus Data Network, V1

Abstract Introduction MATERIAL Bulgarian-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 78 hours o...

Samrómur Synthetic

Feb 3, 2025

Hernández Mena, Carlos Daniel; Örnólfsson, Gunnar Thor; Gudnason, Jon, 2025, "Samrómur Synthetic", https://hdl.handle.net/11272.1/AB2/DZUB82, Abacus Data Network, V1

Abstract Introduction Samrómur Synthetic was developed by the Language and Voice Lab, Reykjavik University and contains 72 hours of Icelandic synthetic speech, transcripts and metadata. Data Source sentences were extracted from the Samrómur platform, comprised of texts and transc...

Ravnursson Faroese Speech and Transcripts

Feb 3, 2025

Hernández Mena, Carlos Daniel; Simonsen, Annika; Gudnason, Jon, 2025, "Ravnursson Faroese Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/OBXEAK, Abacus Data Network, V1

Abstract Introduction Ravnursson Faroese Speech and Transcripts contains 109 hours of Faroese prompted speech from 433 speakers (249 female, 184 male), corresponding transcripts and speaker metadata. It is an extract from the Basic Language Resource Kit 1.0 (BLARK 1.0) developed...

L2-KSU Native and Non-Native Arabic Speech

Feb 3, 2025

Alrashoudi, Norah; AlKhalifa, Hend; Alotaibi, Yousef Ajami, 2025, "L2-KSU Native and Non-Native Arabic Speech", https://hdl.handle.net/11272.1/AB2/N7YZP8, Abacus Data Network, V1

Abstract Introduction L2-KSU Native and Non-Native Arabic Speech was developed by King Saud University (KSU) and contains approximately six hours of Modern Standard Arabic read speech from 80 subjects, along with transcripts and speaker metadata. Data The speech data was collecte...

Iraqi Arabic - English Lexical Database

Feb 3, 2025

Maamouri, Mohamed; Graff, David, 2025, "Iraqi Arabic - English Lexical Database", https://hdl.handle.net/11272.1/AB2/EUPXQD, Abacus Data Network, V1

Abstract Introduction Iraqi Arabic - English Lexical Database was developed by the Linguistic Data Consortium (LDC). It contains six interrelated tables presenting over 67,000 Iraqi Arabic words as orthographic forms in Arabic script and pronunciation forms in International Phone...

LORELEI Yoruba Representative Language Pack

Jan 21, 2025

Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2025, "LORELEI Yoruba Representative Language Pack", https://hdl.handle.net/11272.1/AB2/ATPB58, Abacus Data Network, V1

Abstract Introduction LORELEI Yoruba Representative Language Pack (LDC2024T10) consists of Yoruba monolingual text, Yoruba-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI progr...

MultiTACRED

Jan 21, 2025

Hennig, Leonhard; Thomas, Philippe; Möller, Sebastian, 2025, "MultiTACRED", https://hdl.handle.net/11272.1/AB2/GIEQ7J, Abacus Data Network, V1

Abstract Introduction MultiTACRED was developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab and is a machine translation of TAC Relation Extraction Dataset (LDC2018T24) (TACRED) into twelve languages with projected entity an...

RST Continuity Corpus

Jan 21, 2025

Das, Debopam; Egg, Markus, 2025, "RST Continuity Corpus", https://hdl.handle.net/11272.1/AB2/YSIB2J, Abacus Data Network, V1

Abstract Introduction RST Continuity Corpus was developed at Åbo Akademi University and Humboldt-Universität zu Berlin and contains annotations for continuity dimensions added to RST Discourse Treebank (LDC2002T07). RST Discourse Treebank is a collection of English news texts fro...

First-Year Law Students' Court Memoranda

Oct 25, 2024

Larson, Brian N., 2024, "First-Year Law Students' Court Memoranda", https://hdl.handle.net/11272.1/AB2/CC9MT6, Abacus Data Network, V1

Abstract Introduction First-Year Law Students' Court Memoranda consists of 197 English law student writing samples of legal briefs annotated for certain characteristics along with accompanying survey responses by the student writers. The briefs were created in a law school writin...

Samrómur Queries Icelandic Speech 1.0

Oct 25, 2024

Hedström, Staffan; Fong, Judy; Þórhallsdóttir, Ragnheiður; Mollberg, David; Guðmundsson, Smári Freyr; Jónsson, Ólafur Helgi; Þorsteinsdóttir, Sunneva; Magnusdottir, Eydis Huld; Gudnason, Jon, 2024, "Samrómur Queries Icelandic Speech 1.0", https://hdl.handle.net/11272.1/AB2/DGPHQR, Abacus Data Network, V1

Abstract Introduction Samrómur Queries Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 20 hours of Icelandic prompted queries from 3,809 speakers represent...

TRAD Arabic-French Parallel Text -- Newswire

Oct 25, 2024

Linguistic Data Consortium; ELDA, 2024, "TRAD Arabic-French Parallel Text -- Newswire", https://hdl.handle.net/11272.1/AB2/48BBWO, Abacus Data Network, V1

Abstract Introduction TRAD Arabic-French Parallel Text -- Newswire was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 20,000 Arabic words from NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21). The...

TRAD Chinese-French Parallel Text -- Broadcast News

Oct 25, 2024

Linguistic Data Consortium; ELDA, 2024, "TRAD Chinese-French Parallel Text -- Broadcast News", https://hdl.handle.net/11272.1/AB2/IZFPYW, Abacus Data Network, V1

Abstract Introduction TRAD Chinese-French Parallel Text -- Broadcast News was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 30,000 Chinese characters from GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3...

2007 CoNLL Shared Task - Greek, Hungarian & Italian

Oct 25, 2024

Dipartimento di Informatica of the University of Pisa; ILC-CNR; Institute for Language and Speech Processing; Institute of Informatics at the University of Szeged; Institute of Linguistics at the Hungarian Academy of Sciences; Morphologic Ltd., 2024, "2007 CoNLL Shared Task - Greek, Hungarian & Italian", https://hdl.handle.net/11272.1/AB2/JLYA64, Abacus Data Network, V1

Abstract Introduction 2007 CoNLL Shared Task - Greek, Hungarian & Italian consists of dependency treebanks in three languages used as part of the CoNLL 2007 shared task on multi-lingual dependency parsing and domain adaptation. The languages covered in this release are: Greek, Hu...

Vehicle City Voices Corpus – Part I

Oct 25, 2024

Britt, Erica, 2024, "Vehicle City Voices Corpus – Part I", https://hdl.handle.net/11272.1/AB2/8XVBZS, Abacus Data Network, V1

Abstract Introduction Vehicle City Voices Corpus – Part I was developed at the University of Michigan-Flint, and is an ongoing oral history project and survey of English language variation in Flint, Michigan. It contains approximately 16 hours of speech with corresponding transcr...

CHM150

Oct 25, 2024

Hernández Mena, Carlos Daniel; Herrera, Abel, 2024, "CHM150", https://hdl.handle.net/11272.1/AB2/UWURFR, Abacus Data Network, V1

Abstract Introduction CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcri...

Arabic Learner Corpus

Oct 25, 2024

Alfaifi, Abdullah; Atwell, Eric, 2024, "Arabic Learner Corpus", https://hdl.handle.net/11272.1/AB2/DPQWPU, Abacus Data Network, V1

Abstract Introduction Arabic Learner Corpus was developed at the University of Leeds and consists of written essays and spoken recordings by Arabic learners collected in Saudi Arabia in 2012 and 2013. The corpus includes 282,732 words in 1,585 materials, produced by 942 students...

BabyEars Affective Vocalizations

Oct 25, 2024

Slaney, Malcolm; McRoberts, Gerald; Scheirer, Jocelyn, 2024, "BabyEars Affective Vocalizations", https://hdl.handle.net/11272.1/AB2/VK52W9, Abacus Data Network, V1

Abstract Introduction BabyEars Affective Vocalizations was developed by Malcolm Slaney, Gerald McRoberts, and Jocelyn Scheirer. It contains approximately 22 minutes of spontaneous English speech by 12 adults interacting with their infant children, for a total of 509 infant-direct...

Second Language University Speech Intelligibility Corpus

Oct 25, 2024

Kang, Okim; Hirschi, Kevin; Looney, Stephen D.; Hansen, John H. L., 2024, "Second Language University Speech Intelligibility Corpus", https://hdl.handle.net/11272.1/AB2/QHVV2O, Abacus Data Network, V1

Abstract Introduction Second Language University Speech Intelligibility Corpus was developed by Northern Arizona University, The Pennsylvania State University, and The University of Texas at Dallas. It contains 10.5 hours of English speech by 66 international faculty and universi...

AIDA Scenario 2 Practice Topic Annotation

Sep 17, 2024

Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2024, "AIDA Scenario 2 Practice Topic Annotation", https://hdl.handle.net/11272.1/AB2/BFKQTZ, Abacus Data Network, V1

Abstract Introduction AIDA Scenario 2 Practice Topic Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of annotations for 29 English, Russian and Spanish web documents (text, image and video) from AIDA Scenario 2 Practice Topic Source Data (LDC2024...

Dialogs Re-Enacted Across Languages

Sep 17, 2024

Ward, Nigel G.; Avila, Jonathan E.; Rivas, Emilia; Marco, Divette, 2024, "Dialogs Re-Enacted Across Languages", https://hdl.handle.net/11272.1/AB2/XRMWND, Abacus Data Network, V1

Abstract Introduction Dialogs Re-Enacted Across Languages was developed at the University of Texas at El Paso. It contains approximately 17 hours of conversational speech in English and Spanish by 129 unique bilingual speakers, specifically, short fragments extracted from spontan...

Diaspora Tibetan Speech

Sep 17, 2024

Geissler, Christopher; Babinski, Sarah; Shaw, Jason, 2024, "Diaspora Tibetan Speech", https://hdl.handle.net/11272.1/AB2/OPZ58Z, Abacus Data Network, V1

Abstract Introduction Diaspora Tibetan Speech was developed at Yale University. It contains approximately 28 hours of Tibetan elicited speech by 73 speakers from the diaspora Tibetan community in Kathmandu, Nepal, along with transcripts, elicitation materials and speaker demograp...

LORELEI Uyghur Incident Language Pack

Sep 17, 2024

Tracey, Jennifer; Strassel, Stephanie; Arrigo, Michael; Wright, Jonathan; Graff, David; Bies, Ann, 2024, "LORELEI Uyghur Incident Language Pack", https://hdl.handle.net/11272.1/AB2/VRJN4A, Abacus Data Network, V1

Abstract Introduction LORELEI Uyghur Incident Language Pack (LDC2024T07) was developed by the Linguistic Data Consortium and consists of approximately 28 million words of Uyghur monolingual text, 500,000 words of English monolingual text, 3.3 million words of parallel and compara...

Call My Net 1

Jul 30, 2024

Jones, Karen; Walker, Kevin; Graff, David; Wright, Jonathan; Strassel, Stephanie, 2024, "Call My Net 1", https://hdl.handle.net/11272.1/AB2/RJMIEI, Abacus Data Network, V1

Abstract Introduction Call My Net 1 was developed by the Linguistic Data Consortium and contains 364 hours of conversational telephone speech in four languages (Tagalog, Cebuano, Cantonese and Mandarin) collected in 2015 from 221 native speakers located in the Philippines and Chi...

Automatic Content Extraction for Portuguese

Jul 30, 2024

Cunha, Luís Filipe; Silvano, Purificação; Campos, Ricardo; Jorge, Alípio, 2024, "Automatic Content Extraction for Portuguese", https://hdl.handle.net/11272.1/AB2/5VRIQB, Abacus Data Network, V1

Abstract Introduction Automatic Content Extraction for Portuguese (LDC2024T05) was developed at INESC TEC - Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência and consists of automatic Brazilian Portuguese and European Portuguese translations of the English...

LoReHLT Hausa Representative Language Pack

May 13, 2024

Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Griffitt, Kira; Ryant, Neville; Kulick, Seth; Delgado, Dana; Arrigo, Michael, 2024, "LoReHLT Hausa Representative Language Pack", https://hdl.handle.net/11272.1/AB2/7MWKZC, Abacus Data Network, V1

Abstract Introduction LoReHLT Hausa Representative Language Pack consists of Hausa monolingual text, Hausa-English parallel text, annotations, amateur web audio recordings, supplemental resources and related software tools developed by the Linguistic Data Consortium for LoReHLT,...

AIDA Scenario 2 Practice Topic Source Data

May 13, 2024

Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2024, "AIDA Scenario 2 Practice Topic Source Data", https://hdl.handle.net/11272.1/AB2/TXAWUL, Abacus Data Network, V1

Abstract Introduction AIDA Scenario 2 Practice Topic Source Data was developed by the Linguistic Data Consortium (LDC) and is comprised of 1500 root documents, including text, image, and video, from English, Russian, and Spanish web sources. The DARPA AIDA (Active Interpretation...

RATS Low Speech Density

May 13, 2024

Walker, Kevin; Graff, David; Ma, Xiaoyi; Strassel, Stephanie; Jones, Karen, 2024, "RATS Low Speech Density", https://hdl.handle.net/11272.1/AB2/CXVUXZ, Abacus Data Network, V1

Abstract Introduction RATS Low Speech Density was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 87 hours of English, Levantine Arabic, Farsi, Pashto and Urdu speech and non-speech samples. The recordings were assembled by concatenating a rand...

LORELEI Farsi Representative Language Pack

Mar 28, 2024

Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2024, "LORELEI Farsi Representative Language Pack", https://hdl.handle.net/11272.1/AB2/UMEVGY, Abacus Data Network, V1

Abstract Introduction LORELEI Farsi Representative Language Pack consists of Farsi monolingual text, Farsi-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program. The LORELEI...

AIDA Scenario 1 Practice Topic Annotation

Mar 28, 2024

Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2024, "AIDA Scenario 1 Practice Topic Annotation", https://hdl.handle.net/11272.1/AB2/XPPJWR, Abacus Data Network, V1

Abstract Introduction AIDA Scenario 1 Practice Topic Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of annotations for 212 English, Russian and Ukrainian web documents (text, image and video) from AIDA Scenario 1 Practice Topic Source Data (LDC2...

KASET - Kurmanji and Sorani Kurdish Speech and Transcripts

Mar 28, 2024

Delgado, Dana; Walker, Kevin; Strassel, Stephanie; Graff, David; Caruso, Christopher, 2024, "KASET - Kurmanji and Sorani Kurdish Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/ODAGYC, Abacus Data Network, V1

Abstract Introduction KASET - Kurmanji and Sorani Kurdish Speech and Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of approximately 147 hours of telephone conversations (289 recordings) and broadcast news (410 recordings) in two Kurdish dialects:...

TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017

Jan 11, 2024

Tracey, Jennifer; Strassel, Stephanie; Arrigo, Michael, 2024, "TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017", https://hdl.handle.net/11272.1/AB2/OM2WHS, Abacus Data Network, V1

Abstract Introduction TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 (LDC2023T13) was developed by the Linguistic Data Consortium and contains training and evaluation data produced in support of the 2016 and 2017 TAC KBP Belief and Sentiment (...
