Join | Mozilla Data Collective

Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access 470+ high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

Community

Malayalam Time-Aligned Speech Corpus

A Malayalam speech dataset containing 100 audio files with time-aligned .srt transcriptions from 5 speakers.

License: CC-BY-NC-4.0

Locale: mal

Task: ASR

Format: WAV, SRT

Size: 1.50 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part3

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 1.33 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part2

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 8.07 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part1

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 7.68 GB

Community

TODa: Tamazight Open Dataset

Welcome to the Tamazight Open Dataset (TODa), a groundbreaking open-source project dedicated to preserving and advancing the Tamazight language. With its extensive collection of linguistic data, TODa stands as a pioneering collaborative project for Tamazight <=> English translation, specifically designed for Natural Language Processing applications. TODa's unique approach combines both semantic and syntactic categorization methods, offering a rich representation of words in their various contexts and forms. The dataset encompasses a comprehensive collection of linguistic elements, including detailed verb conjugations across different tenses, noun variations, and an extensive compilation of translated expressions that capture the language's nuances. What sets TODa apart is its inclusive approach to Tamazight's writing systems. The dataset thoughtfully incorporates Latin alphabets, acknowledging and preserving the diverse writing traditions practiced across Amazigh communities. This dual-script approach ensures broader accessibility and cultural authenticity. Our vision is to establish TODa as the cornerstone resource for Tamazight Natural Language Processing. Through this meticulously curated dataset, we strive to empower developers and researchers to create innovative NLP solutions that authentically serve the Amazigh-speaking community. We take pride in our current progress, yet acknowledge that language documentation is an evolving journey. We actively encourage participation from the Amazigh technology community to contribute their expertise in expanding and refining the dataset. Through collaborative effort, we can create a robust foundation for technological innovations that honor and advance Amazigh linguistic heritage.

License: CC-BY-4.0

Locale: zgh

Task: NLP

Format: CSV

Size: 3.27 MB

Community

TTS Balinese Language

This TTS dataset contains Balinese language used in daily activities.

License: CC-BY-SA-4.0

Locale: ban

Task: TTS

Format: WEBM, TSV

Size: 301.05 MB

Community

Kokoro Speech Dataset

Kokoro Speech Dataset is a public domain Japanese speech dataset. (https://github.com/kaiidams/Kokoro-Speech-Dataset)

License: libribox

Locale: ja

Task: TTS

Format: FLAC

Size: 3.98 GB

Community

Sundanese TTS

This dataset uses the Priangan dialect of West Java with Indonesian code-mixing and code-switching.

License: CC-BY-SA-4.0

Locale: sun

Task: TTS

Format: WEBM, TSV

Size: 298.10 MB

MDC Community Concierge

Bangor Miami Spanish-English Corpus

Spanish-English bilingual speech corpus with 35 hours of recorded audio and 240,000 words.

License: GPL-3.0

Locale: es-US, en-US

Task: ASR

Format: MP3, CHA, TSV

Size: 1.12 GB

Keblagh e Azergi

Elkhani Hazargi Literature Corpus

Hazargi literary corpus (~0.5M tokens) of poetry, folklore, and prose texts representing Hazara linguistic and cultural heritage.

License: CC-BY-NC-4.0

Locale: haz

Task: NLP

Format: TXT

Size: 2.46 MB

Aim Foundation

Dari Literature Corpus by Anjuman e Adabi Nayestan

A ~1 M-token Dari (Afghan Persian) literary corpus compiled by Anjuman e Adabi Nayestan, covering prose, poetry, and cultural texts in Perso-Arabic script.

License: CC-BY-NC-4.0

Locale: prs

Task: NLP

Format: TXT

Size: 12.67 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Wordlist

The IBT Torwali Wordlist contains approximately 20,000 unique entries in Torwali (ISO 639-3: trw), an under-documented Indo-Aryan language spoken in northern Pakistan. The dataset comprises standardized lexical entries covering core vocabulary, function words, and culturally salient terms, with consistent orthography and normalization suitable for linguistic and computational use. Entries are aligned with English and Urdu glosses and include part-of-speech tags.

License: CC-BY-SA-4.0

Locale: trw

Task: NLP

Format: CSV

Size: 312.87 KB

IT'S EASY TO UPLOAD & CONTROL YOUR DATA

Upload your dataset

Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it. You can share openly, using existing licenses, or you can build your own.

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be - not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.

How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data to everyone or only to some types of downloaders, set custom constraints, and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and held to legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at [email protected].

Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation - the non-profit, movement-building, and philanthropy arm of Mozilla.