AnotaNusa - Preserving Indonesian Local Languages Through Data Annotation

Inspiration

With over 700 local languages spoken across its archipelago, Indonesia is a nation of remarkable linguistic diversity, an essential part of its national identity embodied in the motto "Bhinneka Tunggal Ika" (Unity in Diversity). Yet in the digital age, this heritage is increasingly at risk: most of these languages are critically underrepresented in technology and media, widening a digital divide that threatens both their preservation and the cultural connections they sustain.

Pioneering initiatives such as NusaCrowd, IndoNLG, NusaX, and NusaWrites have made valuable contributions, but they also show how difficult and resource-intensive it is to create high-quality, labeled datasets for local languages. That shortage of labeled data is the core bottleneck for training advanced AI and NLP systems like Cendol and SahabatAI: for the vast majority of local languages, labeled data is extremely limited or entirely absent, making it nearly impossible to build robust systems that can preserve and promote these languages in the digital era.

What it does

AnotaNusa is a collaborative, crowdsourced web platform built to drive the creation of high-quality, culturally aware NLP datasets for all of Indonesia's languages. It operates as a two-sided platform that directly addresses the critical data bottleneck hindering the development of truly localized AI systems.

For Creators (Researchers and Developers):

  • Easily design and launch annotation projects for various AI and NLP tasks
  • Access qualified, native local language speakers without logistical burden
  • Obtain high-quality, culturally authentic datasets

For Contributors/Annotators (Indonesian Speakers):

  • Register and select tasks in their native local language(s)
  • Contribute linguistic expertise through intuitive interfaces
  • Receive fair, direct compensation for their work

Key Features:

1. Core NLP Tasks Annotation

  • Text Classification: Simple interface for sentiment and emotion labeling with majority voting for quality assurance
  • Text-to-Text Generation: Human-powered translation, summarization, and conversational AI training data creation
  • Text Ranking: Drag-and-drop interface for ranking AI outputs from best to worst
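The majority-voting quality check for classification labels can be sketched as follows. The names (`AnnotationVote`, `resolveLabel`) and the agreement threshold are illustrative assumptions, not the platform's actual API:

```typescript
// Minimal sketch of majority-vote label aggregation, assuming each
// classification item collects labels from several independent annotators.
// Type and function names here are hypothetical.

interface AnnotationVote {
  annotatorId: string;
  label: string; // e.g. "positive" | "negative" | "neutral"
}

// Returns the winning label if it exceeds the agreement threshold,
// otherwise null (the item goes back for more annotations or review).
function resolveLabel(
  votes: AnnotationVote[],
  minAgreement = 0.5
): string | null {
  if (votes.length === 0) return null;

  // Tally votes per label.
  const counts = new Map<string, number>();
  for (const v of votes) {
    counts.set(v.label, (counts.get(v.label) ?? 0) + 1);
  }

  // Find the most frequent label.
  let best: string | null = null;
  let bestCount = 0;
  for (const [label, count] of counts) {
    if (count > bestCount) {
      best = label;
      bestCount = count;
    }
  }

  // Only accept a label with strict-majority agreement; ties resolve to null.
  return bestCount / votes.length > minAgreement ? best : null;
}
```

For example, two "positive" votes against one "negative" resolve to "positive", while a 1-1 split resolves to null and triggers further annotation.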

2. Direct Contributor Reward System

  • Transparent per-task payment system
  • Fair compensation for local language speakers
  • Sustainable, community-driven data creation model
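A transparent per-task payout could be computed along these lines; the field names and the approval flow are assumptions for illustration, not the deployed billing logic:

```typescript
// Illustrative sketch of per-task contributor payout, assuming each
// project sets a fixed reward per approved annotation.

interface CompletedTask {
  projectId: string;
  approved: boolean;  // passed the project's quality check
  rewardIdr: number;  // reward in Indonesian rupiah, set by the project creator
}

// A contributor's payout is the sum of rewards for approved tasks,
// so the amount is simple enough for contributors to verify themselves.
function contributorPayout(tasks: CompletedTask[]): number {
  return tasks
    .filter((t) => t.approved)
    .reduce((total, t) => total + t.rewardIdr, 0);
}
```

Keeping the rule this simple (a flat, published rate per approved task) is what makes the system transparent: the contributor can recompute their own earnings from their task history.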

How we built it

Our development process leveraged modern tooling and AI-assisted development.

Tech Stack:

  • Frontend & Backend: Next.js (full-stack framework)
  • Database: Firebase Firestore (NoSQL database)
  • Hosting: Vercel

AI-Assisted Development Tools:

  • Cursor: AI-powered code editor for enhanced productivity
  • v0.dev: AI-powered UI component generation and rapid prototyping
  • Various generative AI models: code generation, debugging, and development assistance

The platform features specialized annotation workflows designed to support key NLP tasks and mobilize a nationwide effort to create high-quality, labeled datasets. We implemented intuitive interfaces for text classification, text-to-text generation, and text ranking, each optimized for a different type of linguistic annotation work.
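The annotation workflows revolve around a few core entities. As a rough sketch of how the Firestore documents might be shaped (field names and types are assumptions, not the deployed schema):

```typescript
// Hypothetical document shapes for the two main Firestore collections,
// expressed as TypeScript interfaces.

type TaskType = "classification" | "text2text" | "ranking";

interface Project {
  id: string;
  creatorId: string;
  language: string;          // e.g. "jv" (Javanese), "su" (Sundanese)
  taskType: TaskType;
  rewardPerTaskIdr: number;  // per-task reward shown to contributors
}

interface Annotation {
  projectId: string;
  annotatorId: string;
  itemId: string;             // the text item being annotated
  payload: string | string[]; // a label, generated text, or a ranked order
  submittedAt: number;        // Unix timestamp (ms)
}

// Example documents under the assumed schema:
const project: Project = {
  id: "p1",
  creatorId: "u1",
  language: "jv",
  taskType: "ranking",
  rewardPerTaskIdr: 750,
};

const annotation: Annotation = {
  projectId: project.id,
  annotatorId: "u2",
  itemId: "item-42",
  payload: ["output-b", "output-a", "output-c"], // best to worst
  submittedAt: Date.now(),
};
```

A single `payload` field keeps the three task types in one collection: classification stores a label string, text-to-text stores generated text, and ranking stores an ordered array of output IDs.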

Challenges we ran into

1. Time Management

  • Working within tight hackathon deadlines while building a comprehensive platform
  • Balancing feature development with quality assurance

2. Team Constraints (-1 member)

  • One team member fell sick and couldn't attend, reducing our team to just 3 members
  • Had to redistribute workload and responsibilities on the fly

3. Continuous Brainstorming and Alignment

  • Ensuring all team members were aligned on the vision and implementation
  • Avoiding time waste through effective communication and decision-making
  • Balancing ambitious goals with realistic implementation timelines

Accomplishments that we're proud of

Despite working with only 3 team members, we successfully delivered a comprehensive solution to preserve low-resource languages in Indonesia. Our key accomplishments include:

1. Complete Platform Development

  • Built a fully functional two-sided platform connecting researchers with native speakers
  • Implemented all core NLP annotation features (text classification, text-to-text generation, text ranking)

2. Scalable Architecture

  • Created a sustainable, community-driven workflow for data generation
  • Designed transparent reward systems to motivate contributor participation

3. Cultural Authenticity Focus

  • Ensured the platform captures not just grammatical correctness but cultural authenticity
  • Built tools that leverage deep cultural knowledge of native speakers

4. Technical Innovation

  • Successfully integrated modern AI-assisted development tools
  • Created intuitive interfaces that make complex annotation tasks accessible

While our solution isn't perfect (no hackathon project is), we're proud of delivering a platform with all the core features needed to address Indonesia's linguistic diversity challenge.

What we learned

1. The Power of AI-Assisted Development

  • Modern tools like Cursor and v0.dev can significantly accelerate development
  • Generative AI can help bridge skill gaps and enhance productivity

2. Community-Driven Solutions Work

  • Crowdsourcing can be an effective approach to large-scale data collection
  • Fair compensation and transparent processes are crucial for sustainable participation

3. Cultural Sensitivity in Tech

  • Building for linguistic diversity requires deep understanding of cultural contexts
  • Technology solutions must respect and preserve cultural authenticity

4. Team Resilience

  • Small, dedicated teams can achieve significant results with proper planning
  • Effective communication becomes even more critical with reduced team size

What's next for AnotaNusa

1. Community Building

  • Launch pilot programs with specific language communities
  • Partner with universities and research institutions
  • Create educational resources about language preservation

2. Data Quality Enhancement

  • Implement advanced validation and verification systems
  • Develop contributor training programs
  • Create quality metrics and feedback systems

3. Integration and Partnerships

  • Connect with existing Indonesian AI initiatives (Cendol, SahabatAI)
  • Partner with government agencies focused on cultural preservation
  • Collaborate with international organizations working on endangered languages

4. Sustainability Model

  • Develop long-term funding strategies
  • Create enterprise solutions for commercial NLP development
  • Establish partnerships with tech companies needing Indonesian language data

Our ultimate goal is to ensure that every Indonesian language has the digital representation it deserves, preserving our nation's linguistic heritage for future generations while enabling the development of truly inclusive AI systems.


#BahasaUntukBangsa - Preserving Indonesian Local Languages Through Data Annotation
