EspanX/language-resources

 
 

Language Resources and Tools

Datasets and scripts for basic natural language and speech processing.

This is not an official Google product.

Natural Languages

Directory Language Available
af Afrikaans
bn Bengali / Bangla
hi_ur Hindi & Urdu
is Icelandic
jv Javanese
km Khmer
lo Lao
my Burmese / Myanmar
ne Nepali
si Sinhala
su Sundanese
xh Xhosa
zu Zulu

Tools

This repository includes a few tools for working with the natural language datasets. The tools are written in C++ and Python and are built with Bazel. To compile and use them, install a recent version of Bazel (release 0.4.5 or later is required).
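
Before building, it can help to confirm that the installed Bazel meets the 0.4.5 minimum stated above. The following is a minimal sketch, not part of this repository: it assumes `bazel` is on your `PATH` and that `bazel version` prints a line starting with `Build label:`. The helper names `parse_version`, `meets_minimum`, and `installed_bazel_version` are hypothetical and exist only for this illustration.

```python
import re
import subprocess

# Minimum Bazel release stated in this README.
MIN_BAZEL = (0, 4, 5)


def parse_version(text):
    """Extract the version tuple from `bazel version` output, or None."""
    match = re.search(r"Build label:\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum=MIN_BAZEL):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(version), len(minimum))
    padded_version = tuple(version) + (0,) * (width - len(version))
    padded_minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return padded_version >= padded_minimum


def installed_bazel_version():
    """Run `bazel version`; return None if Bazel is missing or fails."""
    try:
        result = subprocess.run(
            ["bazel", "version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_version(result.stdout)


if __name__ == "__main__":
    found = installed_bazel_version()
    if found is None:
        print("Bazel not found or its version could not be parsed.")
    elif meets_minimum(found):
        print("Bazel", ".".join(map(str, found)), "meets the minimum.")
    else:
        print("Bazel", ".".join(map(str, found)), "is older than 0.4.5.")
```

Versions are compared as integer tuples rather than strings so that, for example, 0.10.0 is correctly treated as newer than 0.4.5.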

Open-Source Audio Data

Resource Link
Sinhala TTS recordings (~3K) http://www.openslr.org/30/
TTS recordings for four South African languages (af, st, tn, xh) http://www.openslr.org/32/
Large Javanese ASR training data set (~185K) http://www.openslr.org/35/
Large Sundanese ASR training data set (~220K) http://www.openslr.org/36/
High quality TTS data for Bengali languages http://www.openslr.org/37/
High quality TTS data for Javanese http://www.openslr.org/41/
High quality TTS data for Khmer http://www.openslr.org/42/
High quality TTS data for Nepali http://www.openslr.org/43/
High quality TTS data for Sundanese http://www.openslr.org/44/
Large Sinhala ASR training data set http://www.openslr.org/52/
Large Bengali ASR training data set http://www.openslr.org/53/
Large Nepali ASR training data set http://www.openslr.org/54/

Other reading resources

SLTU 2016 Tutorial - https://sites.google.com/site/sltututorial/overview

License

Unless otherwise noted, all original files are licensed under the Apache License, Version 2.0.

Where specifically noted, some datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

The directory third_party/ contains third-party works, which we are including under the respective licenses of the upstream projects. See third_party/README.md for further details.
