Linguistic Data Consortium

1 to 50 of 365 Results

LORELEI Farsi Representative Language Pack Mar 28, 2024 Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2024, "LORELEI Farsi Representative Language Pack", https://hdl.handle.net/11272.1/AB2/UMEVGY, Abacus Data Network, V1 Abstract Introduction LORELEI Farsi Representative Language Pack consists of Farsi monolingual text, Farsi-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program. The LORELEI...
AIDA Scenario 1 Practice Topic Annotation Mar 28, 2024 Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2024, "AIDA Scenario 1 Practice Topic Annotation", https://hdl.handle.net/11272.1/AB2/XPPJWR, Abacus Data Network, V1 Abstract Introduction AIDA Scenario 1 Practice Topic Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of annotations for 212 English, Russian and Ukrainian web documents (text, image and video) from AIDA Scenario 1 Practice Topic Source Data (LDC2...
KASET - Kurmanji and Sorani Kurdish Speech and Transcripts Mar 28, 2024 Delgado, Dana; Walker, Kevin; Strassel, Stephanie; Graff, David; Caruso, Christopher, 2024, "KASET - Kurmanji and Sorani Kurdish Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/ODAGYC, Abacus Data Network, V1 Abstract Introduction KASET - Kurmanji and Sorani Kurdish Speech and Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of approximately 147 hours of telephone conversations (289 recordings) and broadcast news (410 recordings) in two Kurdish dialects:...
TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 Jan 11, 2024 Tracey, Jennifer; Strassel, Stephanie; Arrigo, Michael, 2024, "TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017", https://hdl.handle.net/11272.1/AB2/OM2WHS, Abacus Data Network, V1 Abstract Introduction TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 (LDC2023T13) was developed by the Linguistic Data Consortium and contains training and evaluation data produced in support of the 2016 and 2017 TAC KBP Belief and Sentiment (...
Kasdi-Merbah (University) Emotional Database in Arabic Speech Jan 11, 2024 Belhadj, Mourad; Bendellali, Ilham; Lakhdari, Elalia, 2024, "Kasdi-Merbah (University) Emotional Database in Arabic Speech", https://hdl.handle.net/11272.1/AB2/Y4LDPA, Abacus Data Network, V1 Abstract Introduction Kasdi-Merbah Emotional Database in Arabic Speech was developed by the University of Kasdi Merbah Ouargla. The corpus contains two hours of Modern Standard Arabic prompted speech from 500 speakers (254 female, 246 male) representing 5,000 utterances. Data Spe...
AIDA Scenario 1 and 2 Reference Knowledge Base Dec 5, 2023 Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2023, "AIDA Scenario 1 and 2 Reference Knowledge Base", https://hdl.handle.net/11272.1/AB2/YTF9AB, Abacus Data Network, V1 Abstract Introduction AIDA Scenario 1 and 2 Reference Knowledge Base was developed by the Linguistic Data Consortium (LDC) and contains the English knowledge base (KB) used for all AIDA entity linking annotation in Scenario 1 (Russia-Ukraine Relations) and Scenario 2 (Crisis in V...
REMIX Telephone Collection Dec 5, 2023 Graff, David; Jones, Karen; Strassel, Stephanie; Walker, Kevin, 2023, "REMIX Telephone Collection", https://hdl.handle.net/11272.1/AB2/VJPGYX, Abacus Data Network, V1 Abstract Introduction REMIX Telephone Collection was developed by the Linguistic Data Consortium (LDC) and contains 320 hours of English conversational telephone speech from 358 speakers who had completed all tasks in one of the previous LDC Mixer collections, specifically, Mixer...
AIDA Scenario 1 Practice Topic Source Data Dec 5, 2023 Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2023, "AIDA Scenario 1 Practice Topic Source Data", https://hdl.handle.net/11272.1/AB2/M4QWGV, Abacus Data Network, V1 Abstract Introduction AIDA Scenario 1 Practice Topic Source Data was developed by the Linguistic Data Consortium (LDC) and is comprised of 1511 documents (text, image, and video) from English, Russian, and Ukrainian web sources. The DARPA AIDA (Active Interpretation of Disparate...
CALLFRIEND Russian Text Oct 17, 2023 Miller, David; Walker, Kevin; Graff, David; Canavan, Alexandra, 2023, "CALLFRIEND Russian Text", https://hdl.handle.net/11272.1/AB2/BNFFSZ, Abacus Data Network, V1 Abstract Introduction CALLFRIEND Russian Text (LDC2023T09) was developed by the Linguistic Data Consortium and consists of transcripts for approximately 48 hours of telephone conversations (100 recordings) between native Russian speakers. The calls were recorded in 1999 as part o...
2019 OpenSAT Public Safety Communications Simulation Oct 17, 2023 Delgado, Dana; Jones, Karen; Walker, Kevin; Strassel, Stephanie; Caruso, Christopher; Graff, David, 2023, "2019 OpenSAT Public Safety Communications Simulation", https://hdl.handle.net/11272.1/AB2/BOXO5O, Abacus Data Network, V1 Abstract Introduction 2019 OpenSAT Public Safety Communications Simulation was developed by the Linguistic Data Consortium (LDC) and contains approximately 141 hours of speech recordings and transcripts used in the National Institute of Standards and Technology (NIST)...
CALLFRIEND Russian Speech Oct 16, 2023 Miller, David; Walker, Kevin; Graff, David; Canavan, Alexandra, 2023, "CALLFRIEND Russian Speech", https://hdl.handle.net/11272.1/AB2/NGRVVO, Abacus Data Network, V1 Abstract Introduction CALLFRIEND Russian Speech (LDC2023S08) was developed by the Linguistic Data Consortium (LDC) and consists of approximately 48 hours of telephone conversations (100 recordings) between native speakers of Russian. The calls were recorded in 1999 as part of the...
KAFD: Arabic Font Database Aug 29, 2023 Luqman, Hamzah; Mahmoud, Sabri; Awaida, Sameh, 2016, "KAFD: Arabic Font Database", https://hdl.handle.net/11272.1/AB2/A0JPYM, Abacus Data Network, V2 Introduction KAFD: Arabic Font Database was developed by King Fahd University of Petroleum & Minerals and Qassim University. It is comprised of approximately 2.5 million scanned Arabic printed pages in a variety of fonts, sizes and resolutions along with corresponding transcripts...
Noisy TIMIT Speech Aug 29, 2023 Abdulaziz, Azhar; Kepuska, Veton, 2017, "Noisy TIMIT Speech", https://hdl.handle.net/11272.1/AB2/FFFXT2, Abacus Data Network, V2 Introduction Noisy TIMIT Speech was developed by the Florida Institute of Technology and contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with different additive noise levels. Only the audio has been modified;...
UCLA High-Speed Laryngeal Video and Audio Aug 29, 2023 Chen, Gang; Neubauer, Juergen; Garellek, Marc; Samlan, Robin; Gerratt, Bruce R.; Kreiman, Jody; Alwan, Abeer, 2017, "UCLA High-Speed Laryngeal Video and Audio", https://hdl.handle.net/11272.1/AB2/OWLHMG, Abacus Data Network, V2 UCLA High-Speed Laryngeal Video and Audio was developed by UCLA Speech Processing and Auditory Perception Laboratory and is comprised of high-speed laryngeal video recordings of the vocal folds and synchronized audio recordings from nine subjects collected between April 2012 and...
CHiME2 WSJ0 Aug 29, 2023 Vincent, Emmanuel; Barker, Jon; Watanabe, Shinji; Le Roux, Jonathan; Nesta, Francesco; Matassoni, Marco, 2017, "CHiME2 WSJ0", https://hdl.handle.net/11272.1/AB2/IUB8PD, Abacus Data Network, V2 CHiME2 WSJ0 was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 166 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-...
BOLT English Discussion Forums Aug 29, 2023 Tracey, Jennifer; Lee, Haejoong; Strassel, Stephanie, 2017, "BOLT English Discussion Forums", https://hdl.handle.net/11272.1/AB2/VDFID2, Abacus Data Network, V2 BOLT English Discussion Forums was developed by the Linguistic Data Consortium (LDC) and consists of 830,440 discussion forum threads in English harvested from the Internet using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational Language Translati...
BOLT Arabic Discussion Forums Aug 29, 2023 Tracey, Jennifer; Lee, Haejoong; Strassel, Stephanie; Ismael, Safa, 2018, "BOLT Arabic Discussion Forums", https://hdl.handle.net/11272.1/AB2/DP4INP, Abacus Data Network, V2 BOLT Arabic Discussion Forums was developed by the Linguistic Data Consortium (LDC) and consists of 813,080 discussion forum threads in Egyptian Arabic harvested from the Internet using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational Language Tr...
Concretely Annotated New York Times Aug 29, 2023 Ferraro, Francis; Thomas, Max; Wolfe, Travis; R. Gormley, Matthew; Harman, Craig; Van Durme, Benjamin, 2018, "Concretely Annotated New York Times", https://hdl.handle.net/11272.1/AB2/VA98GM, Abacus Data Network, V2 Introduction Concretely Annotated New York Times was developed by Johns Hopkins University’s Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically-generated syntactic, semantic and coreference annotations to The New York Times Annot...
Concretely Annotated English Gigaword Aug 29, 2023 Ferraro, Francis; Thomas, Max; Gormley, Matthew R.; Wolfe, Travis; Harman, Craig; Van Durme, Benjamin, 2018, "Concretely Annotated English Gigaword", https://hdl.handle.net/11272.1/AB2/NQCDFR, Abacus Data Network, V2 Concretely Annotated English Gigaword was developed by Johns Hopkins University’s Human Language Technology Center of Excellence (JHU). It adds multiple kinds and instances of automatically-generated syntactic, semantic and coreference annotations to English Gigaword Fifth Editio...
HAVIC MED Progress Test -- Videos, Metadata and Annotation Aug 29, 2023 Morris, Amanda; Strassel, Stephanie; Li, Xuansong; Antonishek, Brian; Fiscus, Jonathan G., 2019, "HAVIC MED Progress Test -- Videos, Metadata and Annotation", https://hdl.handle.net/11272.1/AB2/QYTBMD, Abacus Data Network, V2 HAVIC MED Progress Test – Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,650 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and related technologies, LDC...
2010 NIST Speaker Recognition Evaluation Test Set Aug 29, 2023 Greenberg, Craig; Martin, Alvin; Graff, David; Brandschain, Linda; Walker, Kevin, 2017, "2010 NIST Speaker Recognition Evaluation Test Set", https://hdl.handle.net/11272.1/AB2/2CPM3O, Abacus Data Network, V2 Introduction 2010 NIST Speaker Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and speech recorded over a microphone chann...
CHiME3 Aug 29, 2023 Barker, Jon; Marxer, Ricard; Vincent, Emmanuel; Watanabe, Shinji, 2017, "CHiME3", https://hdl.handle.net/11272.1/AB2/HGHM4U, Abacus Data Network, V2 Introduction CHiME3 was developed as part of The 3rd CHiME Speech Separation and Recognition Challenge and contains approximately 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio. The CHiME Challenges focus on distant-mic...
AISHELL-1 Aug 29, 2023 Bu, Hui, 2018, "AISHELL-1", https://hdl.handle.net/11272.1/AB2/2WMDTT, Abacus Data Network, V2 AISHELL-1 was developed by Beijing Shell Shell Technology Co., Ltd. It contains approximately 520 hours of Chinese Mandarin speech from 400 speakers recorded simultaneously on three different devices with associated transcripts. The goal of the collection was to support speech re...
Mixer 4 and 5 Speech Aug 29, 2023 Brandschain, Linda; Walker, Kevin; Graff, David; Cieri, Christopher; Neely, Abby; Mirghafori, Nikki; Peskin, Barbara; Godfrey, Jack; Strassel, Stephanie; Goodman, Fred; Doddington, George R.; King, Mike, 2021, "Mixer 4 and 5 Speech", https://hdl.handle.net/11272.1/AB2/LU0TQ8, Abacus Data Network, V2 Abstract Introduction Mixer 4 and 5 Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 14,185 hours of audio recordings of conversational telephone speech, interviews, elicitation exercises and transcript readings involving 616 distinct...
RATS Speaker Identification Aug 29, 2023 Graff, David; Ma, Xiaoyi; Strassel, Stephanie; Walker, Kevin; Jones, Karen, 2021, "RATS Speaker Identification", https://hdl.handle.net/11272.1/AB2/BZYHPS, Abacus Data Network, V2 Abstract Introduction RATS Speaker Identification was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 1,900 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotations of speech segments. The audio w...
HAVIC MED Training Data -- Videos, Metadata and Annotation Aug 29, 2023 Morris, Amanda; Strassel, Stephanie; Li, Xuansong; Antonishek, Brian; Fiscus, Jonathan G., 2022, "HAVIC MED Training Data -- Videos, Metadata and Annotation", https://hdl.handle.net/11272.1/AB2/TQLGAR, Abacus Data Network, V2 Abstract Introduction HAVIC MED Training Data -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 2,100 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and re...
HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation Aug 29, 2023 Li, Xuansong; Strassel, Stephanie; Jones, Karen; Antonishek, Brian; Fiscus, Jonathan G., 2022, "HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation", https://hdl.handle.net/11272.1/AB2/SXVGS7, Abacus Data Network, V2 Abstract Introduction HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,800 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and rel...
KHATT: Handwritten Arabic Text Aug 29, 2023 Mahmoud, Sabri; Ahmad, Irfan; Al-Khatib, Wasfi; Alshayeb, Mohammad; Parvez, Mohammad; Märgner, Volker; Fink, Gernot, 2015, "KHATT: Handwritten Arabic Text", https://hdl.handle.net/11272.1/AB2/PL0DHA, Abacus Data Network, V2 Introduction KHATT: Handwritten Arabic Text was developed by King Fahd University of Petroleum & Minerals, Technical University of Dortmund and Braunschweig University of Technology. It is comprised of scanned Arabic handwriting from 1,000 distinct male and female writers represe...
The Subglottal Resonances Database Aug 25, 2023 Alwan, Abeer; Lulich, Steven; Sommers, Mitchell, 2015, "The Subglottal Resonances Database", https://hdl.handle.net/11272.1/AB2/R82KKG, Abacus Data Network, V2 Introduction The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American En...
RATS Speech Activity Detection Aug 25, 2023 Walker, Kevin; Ma, Xiaoyi; Graff, David; Strassel, Stephanie; Sessa, Stephanie; Jones, Karen, 2015, "RATS Speech Activity Detection", https://hdl.handle.net/11272.1/AB2/1UISJ7, Abacus Data Network, V2 Introduction RATS Speech Activity Detection was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech seg...
MASRI Synthetic Aug 18, 2023 Hernández Mena, Carlos Daniel; Gatt, Albert; Borg, Claudia; DeMarco, Andrea; van der Plas, Lonneke, 2023, "MASRI Synthetic", https://hdl.handle.net/11272.1/AB2/WBPJBV, Abacus Data Network, V1 Abstract Introduction MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team at the University of Malta and consists of approximately 99 hours of synthesized Maltese speech. Data Source sentences were extracted from the Maltese Language Resource...
MyST Children's Conversational Speech Aug 18, 2023 Pradhan, Sameer; Cole, Ronald Allan; Ward, Wayne, 2023, "MyST Children's Conversational Speech", https://hdl.handle.net/11272.1/AB2/QUHJRW, Abacus Data Network, V1 Abstract Introduction MyST (My Science Tutor) Children's Conversational Speech was developed by Boulder Learning Inc. It is comprised of approximately 470 hours of English speech from 1371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science in...
LORELEI Indonesian Representative Language Pack Aug 17, 2023 Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2023, "LORELEI Indonesian Representative Language Pack", https://hdl.handle.net/11272.1/AB2/JLEISQ, Abacus Data Network, V1 Abstract Introduction LORELEI Indonesian Representative Language Pack consists of Indonesian monolingual text, Indonesian-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium (LDC) for the DARPA LORELEI...
Althingi Parliamentary Speech Aug 17, 2023 Helgadóttir, Inga Rún; Kjaran, Róbert; Nikulásdóttir, Anna Björk; Gudnason, Jon, 2023, "Althingi Parliamentary Speech", https://hdl.handle.net/11272.1/AB2/NIG304, Abacus Data Network, V1 Abstract Introduction Althingi Parliamentary Speech consists of approximately 542 hours of recorded speech from Althingi, the Icelandic Parliament, along with corresponding transcripts, a pronunciation dictionary and two language models. Speeches date from 2005-2016. This dataset...
LORELEI Thai Representative Language Pack Aug 17, 2023 Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2023, "LORELEI Thai Representative Language Pack", https://hdl.handle.net/11272.1/AB2/GCBMNV, Abacus Data Network, V1 Abstract Introduction LORELEI Thai Representative Language Pack (LDC2023T08) consists of Thai monolingual text, Thai-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium (LDC) for the DARPA LORELEI progr...
Mixer 7 Spanish Speech Aug 17, 2023 Brandschain, Linda; Walker, Kevin; Graff, David, 2023, "Mixer 7 Spanish Speech", https://hdl.handle.net/11272.1/AB2/CYMBUE, Abacus Data Network, V1 Abstract Introduction Mixer 7 Spanish Speech (LDC2023S04) was developed by the Linguistic Data Consortium (LDC) and contains 9,600 hours of audio recordings of interviews, transcript readings and conversational telephone speech involving 191 distinct native Spanish speakers. This...
Moroccan Arabic - English Lexical Database Aug 17, 2023 Maamouri, Mohamed; Graff, David, 2023, "Moroccan Arabic - English Lexical Database", https://hdl.handle.net/11272.1/AB2/E8N63E, Abacus Data Network, V1 Abstract Introduction Moroccan Arabic - English Lexical Database was developed by the Linguistic Data Consortium (LDC). It is comprised of a set of five interrelated tables presenting each Moroccan Arabic word as an orthographic form in Arabic script and a pronunciation form in I...
Samrómur Children Icelandic Speech 1.0 Aug 17, 2023 Hernández Mena, Carlos Daniel; Borsky, Michal; Mollberg, David; Guðmundsson, Smári Freyr; Hedström, Staffan; Pálsson, Ragnar; Jónsson, Ólafur Helgi; Þorsteinsdóttir, Sunneva; Guðmundsdóttir, Jóhanna Vigdís; Magnusdottir, Eydis Huld; Þórhallsdóttir, Ragnheiður; Gudnason, Jon, 2023, "Samrómur Children Icelandic Speech 1.0", https://hdl.handle.net/11272.1/AB2/LKGTIU, Abacus Data Network, V1 Abstract Introduction Samrómur Children Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 131 hours of Icelandic prompted speech from 3,175 speakers (childre...
Samrómur Icelandic Speech 1.0 Aug 17, 2023 Mollberg, David; Jónsson, Ólafur Helgi; Þorsteinsdóttir, Sunneva; Guðmundsdóttir, Jóhanna Vigdís; Steingrimsson, Steinthor; Magnusdottir, Eydis Huld; Fong, Judy; Borsky, Michal; Gudnason, Jon, 2023, "Samrómur Icelandic Speech 1.0", https://hdl.handle.net/11272.1/AB2/JXQH5C, Abacus Data Network, V1 Abstract Introduction Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 145 hours of Icelandic prompted speech from 8,392 speakers representing 100,...
Spoken Digits in Hindi and Indian English Aug 17, 2023 Sen Bhattacharya, Basabdatta; Subramanian, Aiswarya; Chatterjee, Purbayan; Dey, Sounak, 2023, "Spoken Digits in Hindi and Indian English", https://hdl.handle.net/11272.1/AB2/VQQK0O, Abacus Data Network, V1 Abstract Introduction Spoken Digits in Hindi and Indian English was developed by the Birla Institute of Technology and Science Pilani. It contains approximately two hours of speech comprised of spoken digits from one to ten in Hindi and English with regional accents from across I...
Second DIHARD Challenge Development - SEEDLingS Aug 17, 2023 Ryant, Neville; Liberman, Mark; Fiumara, James; Cieri, Christopher, 2023, "Second DIHARD Challenge Development - SEEDLingS", https://hdl.handle.net/11272.1/AB2/PKMDCL, Abacus Data Network, V1 Abstract Introduction Second DIHARD Challenge Development - SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Second DIHARD Challenge. This relea...
Columbia Games Corpus Aug 17, 2023 Hirschberg, Julia; Gravano, Agustin; Benus, Stefan; Ward, Gregory; German, Elisa Sneed, 2023, "Columbia Games Corpus", https://hdl.handle.net/11272.1/AB2/TGPSBO, Abacus Data Network, V1 Abstract Introduction Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation along with corresponding orthographic...
Second DIHARD Challenge Evaluation - SEEDLingS Jul 24, 2023 Ryant, Neville; Liberman, Mark; Fiumara, James; Cieri, Christopher, 2023, "Second DIHARD Challenge Evaluation - SEEDLingS", https://hdl.handle.net/11272.1/AB2/CXOTQ3, Abacus Data Network, V1 Abstract Introduction Second DIHARD Challenge Evaluation - SEEDLingS was developed by Duke University and the Linguistic Data Consortium (LDC) and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Sec...
Ethnobotanical Research and Language Documentation of Nahuatl Jul 24, 2023 Amith, Jonathan D.; Alcántara, Amelia Domínguez; Osollo, Hermelindo Salazar; Castañeda, Ceferino Salgado; Salgado, Eleuterio Gorostiza, 2023, "Ethnobotanical Research and Language Documentation of Nahuatl", https://hdl.handle.net/11272.1/AB2/EEHKAK, Abacus Data Network, V1 Abstract Introduction Ethnobotanical Research and Language Documentation of Nahuatl consists of approximately 190 hours of field recordings collected in the Sierra Nororiental and Sierra Norte regions of Puebla, Mexico. The corpus contains audio and video recordings of native Nah...
2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge Jun 21, 2023 Greenberg, Craig; Sadjadi, Omid; Singer, Elliot; Walker, Kevin; Jones, Karen; Caruso, Christopher; Wright, Jonathan; Strassel, Stephanie, 2023, "2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge", https://hdl.handle.net/11272.1/AB2/JEG5RH, Abacus Data Network, V1 Abstract Introduction 2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains approximately 635 hours of Tunisian Arabic telephone recordings fo...
Hong Kong Parallel Text Jun 20, 2023 Ma, Xiaoyi, 2023, "Hong Kong Parallel Text", https://hdl.handle.net/11272.1/AB2/MX5PAM, Abacus Data Network, V1 Abstract Introduction Hong Kong Parallel Text was developed by the Linguistic Data Consortium (LDC) and contains data from three sub-corpora, namely Hong Kong Hansards Parallel Text, Hong Kong Laws Parallel Text and Hong Kong News Parallel Text. Hong Kong Hansards Parallel Text c...
NIST 2008 Open Machine Translation (OpenMT) Evaluation Jun 20, 2023 NIST Multimodal Information Group, 2023, "NIST 2008 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/YEK10L, Abacus Data Network, V1 Abstract Introduction NIST 2008 Open Machine Translation (OpenMT) Evaluation, Linguistic Data Consortium (LDC) catalog number LDC2010T21 and ISBN 1-58563-567-7, is a package containing source data, reference translations and scoring software used in the NIST 2008 OpenMT evaluatio...
NIST 2006 Open Machine Translation (OpenMT) Evaluation Jun 20, 2023 NIST Multimodal Information Group, 2023, "NIST 2006 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/6UBB7S, Abacus Data Network, V1 Abstract Introduction NIST 2006 Open Machine Translation (OpenMT) Evaluation, Linguistic Data Consortium (LDC) catalog number LDC2010T17 and ISBN 1-58563-562-6, is a package containing source data, reference translations and scoring software used in the NIST 2006 OpenMT evaluatio...
NIST 2003 Open Machine Translation (OpenMT) Evaluation Jun 20, 2023 NIST Multimodal Information Group, 2023, "NIST 2003 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/ZH4VPY, Abacus Data Network, V1 Abstract Introduction NIST 2003 Open Machine Translation (OpenMT) Evaluation is a package containing source data, reference translations, and scoring software used in the NIST 2003 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems...
NIST 2002 Open Machine Translation (OpenMT) Evaluation Jun 16, 2023 NIST Multimodal Information Group, 2023, "NIST 2002 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/AO1F7Z, Abacus Data Network, V1 Abstract Introduction NIST 2002 Open Machine Translation (OpenMT) Evaluation is a package containing source data, reference translations, and scoring software used in the NIST 2002 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems...

LORELEI Farsi Representative Language Pack

Mar 28, 2024

Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2024, "LORELEI Farsi Representative Language Pack", https://hdl.handle.net/11272.1/AB2/UMEVGY, Abacus Data Network, V1

Abstract Introduction LORELEI Farsi Representative Language Pack consists of Farsi monolingual text, Farsi-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program. The LORELEI...

AIDA Scenario 1 Practice Topic Annotation

Mar 28, 2024

Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2024, "AIDA Scenario 1 Practice Topic Annotation", https://hdl.handle.net/11272.1/AB2/XPPJWR, Abacus Data Network, V1

Abstract Introduction AIDA Scenario 1 Practice Topic Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of annotations for 212 English, Russian and Ukrainian web documents (text, image and video) from AIDA Scenario 1 Practice Topic Source Data (LDC2...

KASET - Kurmanji and Sorani Kurdish Speech and Transcripts

Mar 28, 2024

Delgado, Dana; Walker, Kevin; Strassel, Stephanie; Graff, David; Caruso, Christopher, 2024, "KASET - Kurmanji and Sorani Kurdish Speech and Transcripts", https://hdl.handle.net/11272.1/AB2/ODAGYC, Abacus Data Network, V1

Abstract Introduction KASET - Kurmanji and Sorani Kurdish Speech and Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of approximately 147 hours of telephone conversations (289 recordings) and broadcast news (410 recordings) in two Kurdish dialects:...

TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017

Jan 11, 2024

Tracey, Jennifer; Strassel, Stephanie; Arrigo, Michael, 2024, "TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017", https://hdl.handle.net/11272.1/AB2/OM2WHS, Abacus Data Network, V1

Abstract Introduction TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 (LDC2023T13) was developed by the Linguistic Data Consortium and contains training and evaluation data produced in support of the 2016 and 2017 TAC KBP Belief and Sentiment (...

Kasdi-Merbah (University) Emotional Database in Arabic Speech

Jan 11, 2024

Belhadj, Mourad; Bendellali, Ilham; Lakhdari, Elalia, 2024, "Kasdi-Merbah (University) Emotional Database in Arabic Speech", https://hdl.handle.net/11272.1/AB2/Y4LDPA, Abacus Data Network, V1

Abstract Introduction Kasdi-Merbah Emotional Database in Arabic Speech was developed by the University of Kasdi Merbah Ouargla. The corpus contains two hours of Modern Standard Arabic prompted speech from 500 speakers (254 female, 246 male) representing 5,000 utterances. Data Spe...

AIDA Scenario 1 and 2 Reference Knowledge Base

Dec 5, 2023

Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2023, "AIDA Scenario 1 and 2 Reference Knowledge Base", https://hdl.handle.net/11272.1/AB2/YTF9AB, Abacus Data Network, V1

Abstract Introduction AIDA Scenario 1 and 2 Reference Knowledge Base was developed by the Linguistic Data Consortium (LDC) and contains the English knowledge base (KB) used for all AIDA entity linking annotation in Scenario 1 (Russia-Ukraine Relations) and Scenario 2 (Crisis in V...

REMIX Telephone Collection

Dec 5, 2023

Graff, David; Jones, Karen; Strassel, Stephanie; Walker, Kevin, 2023, "REMIX Telephone Collection", https://hdl.handle.net/11272.1/AB2/VJPGYX, Abacus Data Network, V1

Abstract Introduction REMIX Telephone Collection was developed by the Linguistic Data Consortium (LDC) and contains 320 hours of English conversational telephone speech from 358 speakers who had completed all tasks in one of the previous LDC Mixer collections, specifically, Mixer...

AIDA Scenario 1 Practice Topic Source Data

Dec 5, 2023

Tracey, Jennifer; Strassel, Stephanie; Getman, Jeremy; Bies, Ann; Griffitt, Kira; Graff, David; Caruso, Christopher, 2023, "AIDA Scenario 1 Practice Topic Source Data", https://hdl.handle.net/11272.1/AB2/M4QWGV, Abacus Data Network, V1

Abstract Introduction AIDA Scenario 1 Practice Topic Source Data was developed by the Linguistic Data Consortium (LDC) and is comprised of 1511 documents (text, image, and video) from English, Russian, and Ukrainian web sources. The DARPA AIDA (Active Interpretation of Disparate...

CALLFRIEND Russian Text

Oct 17, 2023

Miller, David; Walker, Kevin; Graff, David; Canavan, Alexandra, 2023, "CALLFRIEND Russian Text", https://hdl.handle.net/11272.1/AB2/BNFFSZ, Abacus Data Network, V1

Abstract Introduction CALLFRIEND Russian Text (LDC2023T09) was developed by the Linguistic Data Consortium and consists of transcripts for approximately 48 hours of telephone conversations (100 recordings) between native Russian speakers. The calls were recorded in 1999 as part o...

2019 OpenSAT Public Safety Communications Simulation

Oct 17, 2023

Delgado, Dana; Jones, Karen; Walker, Kevin; Strassel, Stephanie; Caruso, Christopher; Graff, David, 2023, "2019 OpenSAT Public Safety Communications Simulation", https://hdl.handle.net/11272.1/AB2/BOXO5O, Abacus Data Network, V1

Abstract Introduction 2019 OpenSAT Public Safety Communications Simulation was developed by the Linguistic Data Consortium (LDC) and contains approximately 141 hours of speech recordings and transcripts used in the National Institute of Standards and Technology (NIST)...

CALLFRIEND Russian Speech

Oct 16, 2023

Miller, David; Walker, Kevin; Graff, David; Canavan, Alexandra, 2023, "CALLFRIEND Russian Speech", https://hdl.handle.net/11272.1/AB2/NGRVVO, Abacus Data Network, V1

Abstract Introduction CALLFRIEND Russian Speech (LDC2023S08) was developed by the Linguistic Data Consortium (LDC) and consists of approximately 48 hours of telephone conversations (100 recordings) between native speakers of Russian. The calls were recorded in 1999 as part of the...

KAFD: Arabic Font Database

Aug 29, 2023

Luqman, Hamzah; Mahmoud, Sabri; Awaida, Sameh, 2016, "KAFD: Arabic Font Database", https://hdl.handle.net/11272.1/AB2/A0JPYM, Abacus Data Network, V2

Introduction KAFD: Arabic Font Database was developed by King Fahd University of Petroleum & Minerals and Qassim University. It is comprised of approximately 2.5 million scanned Arabic printed pages in a variety of fonts, sizes and resolutions along with corresponding transcripts...

Noisy TIMIT Speech

Aug 29, 2023

Abdulaziz, Azhar; Kepuska, Veton, 2017, "Noisy TIMIT Speech", https://hdl.handle.net/11272.1/AB2/FFFXT2, Abacus Data Network, V2

Introduction Noisy TIMIT Speech was developed by the Florida Institute of Technology and contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with different additive noise levels. Only the audio has been modified;...

UCLA High-Speed Laryngeal Video and Audio

Aug 29, 2023

Chen, Gang; Neubauer, Juergen; Garellek, Marc; Samlan, Robin; Gerratt, Bruce R.; Kreiman, Jody; Alwan, Abeer, 2017, "UCLA High-Speed Laryngeal Video and Audio", https://hdl.handle.net/11272.1/AB2/OWLHMG, Abacus Data Network, V2

UCLA High-Speed Laryngeal Video and Audio was developed by UCLA Speech Processing and Auditory Perception Laboratory and is comprised of high-speed laryngeal video recordings of the vocal folds and synchronized audio recordings from nine subjects collected between April 2012 and...

CHiME2 WSJ0

Aug 29, 2023

Vincent, Emmanuel; Barker, Jon; Watanabe, Shinji; Le Roux, Jonathan; Nesta, Francesco; Matassoni, Marco, 2017, "CHiME2 WSJ0", https://hdl.handle.net/11272.1/AB2/IUB8PD, Abacus Data Network, V2

CHiME2 WSJ0 was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 166 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-...

BOLT English Discussion Forums

Aug 29, 2023

Tracey, Jennifer; Lee, Haejoong; Strassel, Stephanie, 2017, "BOLT English Discussion Forums", https://hdl.handle.net/11272.1/AB2/VDFID2, Abacus Data Network, V2

BOLT English Discussion Forums was developed by the Linguistic Data Consortium (LDC) and consists of 830,440 discussion forum threads in English harvested from the Internet using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational Language Translati...

BOLT Arabic Discussion Forums

Aug 29, 2023

Tracey, Jennifer; Lee, Haejoong; Strassel, Stephanie; Ismael, Safa, 2018, "BOLT Arabic Discussion Forums", https://hdl.handle.net/11272.1/AB2/DP4INP, Abacus Data Network, V2

BOLT Arabic Discussion Forums was developed by the Linguistic Data Consortium (LDC) and consists of 813,080 discussion forum threads in Egyptian Arabic harvested from the Internet using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational Language Tr...

Concretely Annotated New York Times

Aug 29, 2023

Ferraro, Francis; Thomas, Max; Wolfe, Travis; Gormley, Matthew R.; Harman, Craig; Van Durme, Benjamin, 2018, "Concretely Annotated New York Times", https://hdl.handle.net/11272.1/AB2/VA98GM, Abacus Data Network, V2

Introduction Concretely Annotated New York Times was developed by Johns Hopkins University’s Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically-generated syntactic, semantic and coreference annotations to The New York Times Annot...

Concretely Annotated English Gigaword

Aug 29, 2023

Ferraro, Francis; Thomas, Max; Gormley, Matthew R.; Wolfe, Travis; Harman, Craig; Van Durme, Benjamin, 2018, "Concretely Annotated English Gigaword", https://hdl.handle.net/11272.1/AB2/NQCDFR, Abacus Data Network, V2

Concretely Annotated English Gigaword was developed by Johns Hopkins University’s Human Language Technology Center of Excellence (JHU). It adds multiple kinds and instances of automatically-generated syntactic, semantic and coreference annotations to English Gigaword Fifth Editio...

HAVIC MED Progress Test -- Videos, Metadata and Annotation

Aug 29, 2023

Morris, Amanda; Strassel, Stephanie; Li, Xuansong; Antonishek, Brian; Fiscus, Jonathan G., 2019, "HAVIC MED Progress Test -- Videos, Metadata and Annotation", https://hdl.handle.net/11272.1/AB2/QYTBMD, Abacus Data Network, V2

HAVIC MED Progress Test -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,650 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and related technologies, LDC...

2010 NIST Speaker Recognition Evaluation Test Set

Aug 29, 2023

Greenberg, Craig; Martin, Alvin; Graff, David; Brandschain, Linda; Walker, Kevin, 2017, "2010 NIST Speaker Recognition Evaluation Test Set", https://hdl.handle.net/11272.1/AB2/2CPM3O, Abacus Data Network, V2

Introduction 2010 NIST Speaker Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and speech recorded over a microphone chann...

CHiME3

Aug 29, 2023

Barker, Jon; Marxer, Ricard; Vincent, Emmanuel; Watanabe, Shinji, 2017, "CHiME3", https://hdl.handle.net/11272.1/AB2/HGHM4U, Abacus Data Network, V2

Introduction CHiME3 was developed as part of The 3rd CHiME Speech Separation and Recognition Challenge and contains approximately 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio. The CHiME Challenges focus on distant-mic...

AISHELL-1

Aug 29, 2023

Bu, Hui, 2018, "AISHELL-1", https://hdl.handle.net/11272.1/AB2/2WMDTT, Abacus Data Network, V2

AISHELL-1 was developed by Beijing Shell Shell Technology Co., Ltd. It contains approximately 520 hours of Chinese Mandarin speech from 400 speakers recorded simultaneously on three different devices with associated transcripts. The goal of the collection was to support speech re...

Mixer 4 and 5 Speech

Aug 29, 2023

Brandschain, Linda; Walker, Kevin; Graff, David; Cieri, Christopher; Neely, Abby; Mirghafori, Nikki; Peskin, Barbara; Godfrey, Jack; Strassel, Stephanie; Goodman, Fred; Doddington, George R.; King, Mike, 2021, "Mixer 4 and 5 Speech", https://hdl.handle.net/11272.1/AB2/LU0TQ8, Abacus Data Network, V2

Abstract Introduction Mixer 4 and 5 Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 14,185 hours of audio recordings of conversational telephone speech, interviews, elicitation exercises and transcript readings involving 616 distinct...

RATS Speaker Identification

Aug 29, 2023

Graff, David; Ma, Xiaoyi; Strassel, Stephanie; Walker, Kevin; Jones, Karen, 2021, "RATS Speaker Identification", https://hdl.handle.net/11272.1/AB2/BZYHPS, Abacus Data Network, V2

Abstract Introduction RATS Speaker Identification was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 1,900 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotations of speech segments. The audio w...

HAVIC MED Training Data -- Videos, Metadata and Annotation

Aug 29, 2023

Morris, Amanda; Strassel, Stephanie; Li, Xuansong; Antonishek, Brian; Fiscus, Jonathan G., 2022, "HAVIC MED Training Data -- Videos, Metadata and Annotation", https://hdl.handle.net/11272.1/AB2/TQLGAR, Abacus Data Network, V2

Abstract Introduction HAVIC MED Training Data -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 2,100 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and re...

HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation

Aug 29, 2023

Li, Xuansong; Strassel, Stephanie; Jones, Karen; Antonishek, Brian; Fiscus, Jonathan G., 2022, "HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation", https://hdl.handle.net/11272.1/AB2/SXVGS7, Abacus Data Network, V2

Abstract Introduction HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,800 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and rel...

KHATT: Handwritten Arabic Text

Aug 29, 2023

Mahmoud, Sabri; Ahmad, Irfan; Al-Khatib, Wasfi; Alshayeb, Mohammad; Parvez, Mohammad; Märgner, Volker; Fink, Gernot, 2015, "KHATT: Handwritten Arabic Text", https://hdl.handle.net/11272.1/AB2/PL0DHA, Abacus Data Network, V2

Introduction KHATT: Handwritten Arabic Text was developed by King Fahd University of Petroleum & Minerals, Technical University of Dortmund and Braunschweig University of Technology. It is comprised of scanned Arabic handwriting from 1,000 distinct male and female writers represe...

The Subglottal Resonances Database

Aug 25, 2023

Alwan, Abeer; Lulich, Steven; Sommers, Mitchell, 2015, "The Subglottal Resonances Database", https://hdl.handle.net/11272.1/AB2/R82KKG, Abacus Data Network, V2

Introduction The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American En...

RATS Speech Activity Detection

Aug 25, 2023

Walker, Kevin; Ma, Xiaoyi; Graff, David; Strassel, Stephanie; Sessa, Stephanie; Jones, Karen, 2015, "RATS Speech Activity Detection", https://hdl.handle.net/11272.1/AB2/1UISJ7, Abacus Data Network, V2

Introduction RATS Speech Activity Detection was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech seg...

MASRI Synthetic

Aug 18, 2023

Hernández Mena, Carlos Daniel; Gatt, Albert; Borg, Claudia; DeMarco, Andrea; van der Plas, Lonneke, 2023, "MASRI Synthetic", https://hdl.handle.net/11272.1/AB2/WBPJBV, Abacus Data Network, V1

Abstract Introduction MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team at the University of Malta and consists of approximately 99 hours of synthesized Maltese speech. Data Source sentences were extracted from the Maltese Language Resource...

MyST Children's Conversational Speech

Aug 18, 2023

Pradhan, Sameer; Cole, Ronald Allan; Ward, Wayne, 2023, "MyST Children's Conversational Speech", https://hdl.handle.net/11272.1/AB2/QUHJRW, Abacus Data Network, V1

Abstract Introduction MyST (My Science Tutor) Children's Conversational Speech was developed by Boulder Learning Inc. It is comprised of approximately 470 hours of English speech from 1371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science in...

LORELEI Indonesian Representative Language Pack

Aug 17, 2023

Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2023, "LORELEI Indonesian Representative Language Pack", https://hdl.handle.net/11272.1/AB2/JLEISQ, Abacus Data Network, V1

Abstract Introduction LORELEI Indonesian Representative Language Pack consists of Indonesian monolingual text, Indonesian-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium (LDC) for the DARPA LORELEI...

Althingi Parliamentary Speech

Aug 17, 2023

Helgadóttir, Inga Rún; Kjaran, Róbert; Nikulásdóttir, Anna Björk; Gudnason, Jon, 2023, "Althingi Parliamentary Speech", https://hdl.handle.net/11272.1/AB2/NIG304, Abacus Data Network, V1

Abstract Introduction Althingi Parliamentary Speech consists of approximately 542 hours of recorded speech from Althingi, the Icelandic Parliament, along with corresponding transcripts, a pronunciation dictionary and two language models. Speeches date from 2005-2016. This dataset...

LORELEI Thai Representative Language Pack

Aug 17, 2023

Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2023, "LORELEI Thai Representative Language Pack", https://hdl.handle.net/11272.1/AB2/GCBMNV, Abacus Data Network, V1

Abstract Introduction LORELEI Thai Representative Language Pack (LDC2023T08) consists of Thai monolingual text, Thai-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium (LDC) for the DARPA LORELEI progr...

Mixer 7 Spanish Speech

Aug 17, 2023

Brandschain, Linda; Walker, Kevin; Graff, David, 2023, "Mixer 7 Spanish Speech", https://hdl.handle.net/11272.1/AB2/CYMBUE, Abacus Data Network, V1

Abstract Introduction Mixer 7 Spanish Speech (LDC2023S04) was developed by the Linguistic Data Consortium (LDC) and contains 9,600 hours of audio recordings of interviews, transcript readings and conversational telephone speech involving 191 distinct native Spanish speakers. This...

Moroccan Arabic - English Lexical Database

Aug 17, 2023

Maamouri, Mohamed; Graff, David, 2023, "Moroccan Arabic - English Lexical Database", https://hdl.handle.net/11272.1/AB2/E8N63E, Abacus Data Network, V1

Abstract Introduction Moroccan Arabic - English Lexical Database was developed by the Linguistic Data Consortium (LDC). It is comprised of a set of five interrelated tables presenting each Moroccan Arabic word as an orthographic form in Arabic script and a pronunciation form in I...

Samrómur Children Icelandic Speech 1.0

Aug 17, 2023

Hernández Mena, Carlos Daniel; Borsky, Michal; Mollberg, David; Guðmundsson, Smári Freyr; Hedström, Staffan; Pálsson, Ragnar; Jónsson, Ólafur Helgi; Þorsteinsdóttir, Sunneva; Guðmundsdóttir, Jóhanna Vigdís; Magnusdottir, Eydis Huld; Þórhallsdóttir, Ragnheiður; Gudnason, Jon, 2023, "Samrómur Children Icelandic Speech 1.0", https://hdl.handle.net/11272.1/AB2/LKGTIU, Abacus Data Network, V1

Abstract Introduction Samrómur Children Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 131 hours of Icelandic prompted speech from 3,175 speakers (childre...

Samrómur Icelandic Speech 1.0

Aug 17, 2023

Mollberg, David; Jónsson, Ólafur Helgi; Þorsteinsdóttir, Sunneva; Guðmundsdóttir, Jóhanna Vigdís; Steingrimsson, Steinthor; Magnusdottir, Eydis Huld; Fong, Judy; Borsky, Michal; Gudnason, Jon, 2023, "Samrómur Icelandic Speech 1.0", https://hdl.handle.net/11272.1/AB2/JXQH5C, Abacus Data Network, V1

Abstract Introduction Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 145 hours of Icelandic prompted speech from 8,392 speakers representing 100,...

Spoken Digits in Hindi and Indian English

Aug 17, 2023

Sen Bhattacharya, Basabdatta; Subramanian, Aiswarya; Chatterjee, Purbayan; Dey, Sounak, 2023, "Spoken Digits in Hindi and Indian English", https://hdl.handle.net/11272.1/AB2/VQQK0O, Abacus Data Network, V1

Abstract Introduction Spoken Digits in Hindi and Indian English was developed by the Birla Institute of Technology and Science Pilani. It contains approximately two hours of speech comprised of spoken digits from one to ten in Hindi and English with regional accents from across I...

Second DIHARD Challenge Development - SEEDLingS

Aug 17, 2023

Ryant, Neville; Liberman, Mark; Fiumara, James; Cieri, Christopher, 2023, "Second DIHARD Challenge Development - SEEDLingS", https://hdl.handle.net/11272.1/AB2/PKMDCL, Abacus Data Network, V1

Abstract Introduction Second DIHARD Challenge Development - SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Second DIHARD Challenge. This relea...

Columbia Games Corpus

Aug 17, 2023

Hirschberg, Julia; Gravano, Agustin; Benus, Stefan; Ward, Gregory; German, Elisa Sneed, 2023, "Columbia Games Corpus", https://hdl.handle.net/11272.1/AB2/TGPSBO, Abacus Data Network, V1

Abstract Introduction Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation along with corresponding orthographic...

Second DIHARD Challenge Evaluation - SEEDLingS

Jul 24, 2023

Ryant, Neville; Liberman, Mark; Fiumara, James; Cieri, Christopher, 2023, "Second DIHARD Challenge Evaluation - SEEDLingS", https://hdl.handle.net/11272.1/AB2/CXOTQ3, Abacus Data Network, V1

Abstract Introduction Second DIHARD Challenge Evaluation - SEEDLingS was developed by Duke University and the Linguistic Data Consortium (LDC) and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Sec...

Ethnobotanical Research and Language Documentation of Nahuatl

Jul 24, 2023

Amith, Jonathan D.; Alcántara, Amelia Domínguez; Osollo, Hermelindo Salazar; Castañeda, Ceferino Salgado; Salgado, Eleuterio Gorostiza, 2023, "Ethnobotanical Research and Language Documentation of Nahuatl", https://hdl.handle.net/11272.1/AB2/EEHKAK, Abacus Data Network, V1

Abstract Introduction Ethnobotanical Research and Language Documentation of Nahuatl consists of approximately 190 hours of field recordings collected in the Sierra Nororiental and Sierra Norte regions of Puebla, Mexico. The corpus contains audio and video recordings of native Nah...

2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge

Jun 21, 2023

Greenberg, Craig; Sadjadi, Omid; Singer, Elliot; Walker, Kevin; Jones, Karen; Caruso, Christopher; Wright, Jonathan; Strassel, Stephanie, 2023, "2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge", https://hdl.handle.net/11272.1/AB2/JEG5RH, Abacus Data Network, V1

Abstract Introduction 2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains approximately 635 hours of Tunisian Arabic telephone recordings fo...

Hong Kong Parallel Text

Jun 20, 2023

Ma, Xiaoyi, 2023, "Hong Kong Parallel Text", https://hdl.handle.net/11272.1/AB2/MX5PAM, Abacus Data Network, V1

Abstract Introduction Hong Kong Parallel Text was developed by the Linguistic Data Consortium (LDC) and contains data from three sub-corpora, namely Hong Kong Hansards Parallel Text, Hong Kong Laws Parallel Text and Hong Kong News Parallel Text. Hong Kong Hansards Parallel Text c...

NIST 2008 Open Machine Translation (OpenMT) Evaluation

Jun 20, 2023

NIST Multimodal Information Group, 2023, "NIST 2008 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/YEK10L, Abacus Data Network, V1

Abstract Introduction NIST 2008 Open Machine Translation (OpenMT) Evaluation, Linguistic Data Consortium (LDC) catalog number LDC2010T21 and ISBN 1-58563-567-7, is a package containing source data, reference translations and scoring software used in the NIST 2008 OpenMT evaluatio...

NIST 2006 Open Machine Translation (OpenMT) Evaluation

Jun 20, 2023

NIST Multimodal Information Group, 2023, "NIST 2006 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/6UBB7S, Abacus Data Network, V1

Abstract Introduction NIST 2006 Open Machine Translation (OpenMT) Evaluation, Linguistic Data Consortium (LDC) catalog number LDC2010T17 and ISBN 1-58563-562-6, is a package containing source data, reference translations and scoring software used in the NIST 2006 OpenMT evaluatio...

NIST 2003 Open Machine Translation (OpenMT) Evaluation

Jun 20, 2023

NIST Multimodal Information Group, 2023, "NIST 2003 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/ZH4VPY, Abacus Data Network, V1

Abstract Introduction NIST 2003 Open Machine Translation (OpenMT) Evaluation is a package containing source data, reference translations, and scoring software used in the NIST 2003 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems...

NIST 2002 Open Machine Translation (OpenMT) Evaluation

Jun 16, 2023

NIST Multimodal Information Group, 2023, "NIST 2002 Open Machine Translation (OpenMT) Evaluation", https://hdl.handle.net/11272.1/AB2/AO1F7Z, Abacus Data Network, V1

Abstract Introduction NIST 2002 Open Machine Translation (OpenMT) Evaluation is a package containing source data, reference translations, and scoring software used in the NIST 2002 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems...
