Tiny Audio

Train your own speech recognition model in 24 hours for $12

A minimal, hackable ASR codebase. Connect a frozen audio encoder to a frozen LLM via a trainable projector (~12M params). That's it.


Try It

Live Demo

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
print(pipe("audio.wav")["text"])

Quick Start

Installation

# Clone and install
git clone https://github.com/alexkroman/tiny-audio.git && cd tiny-audio
poetry install

# Or install from PyPI (inference only)
pip install tiny-audio

Basic Inference

from transformers import pipeline

# Load model
pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# Transcribe audio file
result = pipe("audio.wav")
print(result["text"])

# Transcribe from URL
result = pipe("https://example.com/audio.mp3")

# Transcribe numpy array (16kHz)
import numpy as np
audio = np.random.randn(16000)  # 1 second of audio
result = pipe(audio)

Streaming Inference

import numpy as np

from tiny_audio import ASRModel, ASRProcessor

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Turn a 16kHz waveform into encoder features
# (HF-style processor call; see ASRProcessor for the exact arguments)
audio = np.random.randn(16000)  # placeholder: 1 second of audio
audio_features = processor(audio, sampling_rate=16000)

# Stream tokens as they're generated
for token in model.generate_streaming(audio_features):
    print(token, end="", flush=True)

Word-Level Timestamps

# Enable timestamps in pipeline
result = pipe("audio.wav", return_timestamps="word")

# Returns: {"text": "hello world", "chunks": [{"text": "hello", "start": 0.0, "end": 0.5}, ...]}
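
For example, you can print per-word timing from the returned chunks (using the fields shown above):

result = pipe("audio.wav", return_timestamps="word")
for chunk in result["chunks"]:
    print(f"{chunk['start']:5.2f}s-{chunk['end']:5.2f}s  {chunk['text']}")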

Train Your Own

Quick Test

# Verify setup (~5 min)
poetry run python scripts/train.py +experiments=transcription data.max_train_samples=100 training.max_steps=10

Full Training

# Standard training (~24 hours on A40)
poetry run python scripts/train.py +experiments=transcription

# With custom learning rate
poetry run python scripts/train.py +experiments=transcription training.learning_rate=1e-4

# Resume from checkpoint
poetry run python scripts/train.py +experiments=transcription training.resume_from_checkpoint=/path/to/checkpoint-XXXX

Projector Types

poetry run python scripts/train.py +experiments=transcription  # Simple MLP (~12M params)
poetry run python scripts/train.py +experiments=mosa           # Dense mixture of experts
poetry run python scripts/train.py +experiments=moe            # Sparse routed experts
poetry run python scripts/train.py +experiments=qformer        # Transformer with learnable queries

Multi-Stage Training with LoRA

# Stage 1: Train projector only (default)
poetry run python scripts/train.py +experiments=transcription

# Stage 2: Freeze projector, train LoRA adapters on LLM
poetry run python scripts/train.py +experiments=mlp_lora

# Stage 3: Fine-tune both projector and LoRA
poetry run python scripts/train.py +experiments=mlp_fine_tune
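
Conceptually, stage 2 attaches small trainable adapter matrices to the frozen LLM. A minimal standalone sketch using the peft library (rank and target modules are illustrative, not the repo's actual config):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
llm = get_peft_model(llm, lora_cfg)   # base weights stay frozen
llm.print_trainable_parameters()      # only the adapter matrices train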

Architecture

Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text

Only the projector trains. The encoder and decoder stay frozen, leveraging their pretrained knowledge.

Component        Params  Status
GLM-ASR Encoder  ~600M   Frozen
MLP Projector    ~12M    Trained
Qwen3-0.6B       ~600M   Frozen
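
The freezing pattern above boils down to disabling gradients everywhere except the projector. A minimal sketch of the wiring (class and attribute names are illustrative, not the library's API):

import torch.nn as nn

class FrozenSandwich(nn.Module):
    """Illustrative wiring: frozen encoder -> trainable projector -> frozen decoder."""

    def __init__(self, encoder: nn.Module, projector: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.projector, self.decoder = encoder, projector, decoder
        for module in (self.encoder, self.decoder):
            for p in module.parameters():
                p.requires_grad = False  # never updated by the optimizer
        # only self.projector's ~12M parameters receive gradients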

How It Works

  1. Audio Encoder: GLM-ASR converts raw audio to frame-level embeddings
  2. Projector: Bridges the modality gap by projecting audio embeddings to text embedding space
  3. Language Model: Qwen3 generates text conditioned on the projected audio

The projector uses frame stacking to reduce sequence length: output_len = (input_len - k) // k + 1 where k is the pooling stride.
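
For intuition, here is a minimal sketch of frame stacking (shapes are illustrative): consecutive groups of k frames are concatenated along the feature dimension, shrinking the sequence by a factor of k before the MLP:

import torch

def stack_frames(x: torch.Tensor, k: int = 4) -> torch.Tensor:
    """(batch, seq_len, dim) -> (batch, seq_len // k, dim * k).

    Matches output_len = (input_len - k) // k + 1 when stride == window == k.
    """
    batch, seq_len, dim = x.shape
    usable = (seq_len // k) * k              # drop frames that don't fill a group
    return x[:, :usable, :].reshape(batch, usable // k, dim * k)

frames = torch.randn(1, 100, 1024)           # 100 encoder frames
print(stack_frames(frames).shape)            # torch.Size([1, 25, 4096])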

Evaluation

# Evaluate on default dataset
poetry run ta eval -m mazesmazes/tiny-audio -n 100

# Evaluate on specific dataset
poetry run ta eval -m mazesmazes/tiny-audio -d loquacious -n 1000

# Compare with other models
poetry run ta eval -m assemblyai --assemblyai-model universal -d loquacious -n 100

# WER analysis
poetry run ta analysis high-wer mazesmazes/tiny-audio --threshold 30
poetry run ta analysis compare model1 model2
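
The analysis commands are built around word error rate. As a standalone illustration (using the jiwer library, not the repo's evaluator):

from jiwer import wer

references = ["hello world", "the cat sat on the mat"]
hypotheses = ["hello word", "the cat sat on a mat"]
print(f"WER: {wer(references, hypotheses):.2%}")  # here 2 substitutions / 8 words = 25%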

CLI Reference

All commands are available via the tiny-audio entry point (or ta for short):

poetry run ta --help  # Show all commands
Command      Description
ta eval      Evaluate ASR models on datasets
ta analysis  WER analysis (high-wer, entity-errors, compare)
ta deploy    Deploy demo to HuggingFace Space
ta push      Push model to HuggingFace Hub
ta demo      Launch local Gradio demo
ta debug     Debug utilities (check-mosa, analyze-lora)
ta runpod    Remote training on RunPod
ta dev       Development tools (lint, format, test, etc.)

Common Options

Option         Short  Description
--model        -m     HuggingFace model ID
--datasets     -d     Datasets to evaluate
--max-samples  -n     Maximum samples
--output-dir   -o     Output directory
--num-workers  -w     Parallel workers

Configuration

Configuration uses Hydra. Override any value with key=value syntax:

# Override model settings
poetry run python scripts/train.py model.projector_type=moe

# Override training settings
poetry run python scripts/train.py training.learning_rate=1e-4 training.batch_size=8

# Override data settings
poetry run python scripts/train.py data.max_train_samples=10000
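
Under the hood, scripts/train.py receives these overrides through Hydra's standard entry-point pattern; a minimal sketch (the config path is illustrative):

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg already has CLI overrides like training.learning_rate=1e-4 applied
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()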

Config Files

configs/
├── config.yaml           # Main config (imports data + training)
├── experiments/          # Projector presets
│   ├── transcription.yaml # Simple MLP (Stage 1)
│   ├── mosa.yaml         # Dense MoE
│   ├── moe.yaml          # Sparse MoE
│   ├── qformer.yaml      # Transformer
│   ├── mlp_lora.yaml     # Stage 2: LoRA only
│   └── mlp_fine_tune.yaml # Stage 3: Projector + LoRA
├── data/
│   └── multiasr.yaml     # Multi-ASR dataset config
└── training/
    └── production.yaml   # Training hyperparameters

Key Config Parameters

model:
  audio_model_id: "zai-org/GLM-ASR-Nano-2512"  # Audio encoder
  text_model_id: "Qwen/Qwen3-0.6B"              # Language model
  projector_type: mlp                           # mlp, mosa, moe, qformer

training:
  learning_rate: 1e-4
  batch_size: 4
  max_steps: 50000
  warmup_steps: 1000

Project Structure

tiny-audio/
├── tiny_audio/              # Core library
│   ├── asr_modeling.py      # ASRModel: encoder + projector + decoder
│   ├── asr_config.py        # ASRConfig: all model settings
│   ├── asr_pipeline.py      # HuggingFace pipeline for inference
│   ├── asr_processing.py    # ASRProcessor: audio/text preprocessing
│   ├── projectors.py        # All projector architectures
│   ├── handler.py           # HF Inference Endpoints handler
│   └── integrations/        # Voice agent integrations
├── scripts/
│   ├── train.py             # Training script (Hydra)
│   ├── cli.py               # Unified CLI entry point
│   ├── dev.py               # Development utilities
│   ├── analysis.py          # WER analysis tools
│   ├── eval/                # Evaluation framework
│   │   ├── evaluators/      # ASR, diarization, alignment
│   │   └── datasets.py      # Dataset loading
│   ├── deploy/              # HF Space deployment
│   ├── hub/                 # HF Hub integration
│   └── debug/               # Debug utilities
├── configs/                 # Hydra configuration
├── tests/                   # Test suite
└── docs/                    # Documentation and course

Development

Setup

git clone https://github.com/alexkroman/tiny-audio.git
cd tiny-audio
poetry install

Running Tests

poetry run ta dev test                    # Run all tests
poetry run pytest tests/test_projectors.py -v  # Single file
poetry run pytest -k "test_forward" -v    # By name pattern

Code Quality

poetry run ta dev format      # Format code (black, isort)
poetry run ta dev lint        # Lint (ruff, black --check)
poetry run ta dev type-check  # Type check (mypy, pyright)
poetry run ta dev precommit   # Full quality gate

Adding a New Projector

  1. Add your projector class to tiny_audio/projectors.py:
import torch
import torch.nn as nn

class MyProjector(nn.Module):
    def __init__(self, input_dim: int, output_dim: int, config):
        super().__init__()
        self.proj = nn.Linear(input_dim, output_dim)  # your architecture here

    def forward(self, x: torch.Tensor, attention_mask: torch.Tensor = None) -> torch.Tensor:
        # Return features projected into the text embedding space
        return self.proj(x)

    def get_output_length(self, input_length: int) -> int:
        # Return output sequence length given input length
        return input_length
  2. Register it in the PROJECTOR_CLASSES dict in projectors.py:
PROJECTOR_CLASSES = {
    "mlp": MLPAudioProjector,
    "mosa": MOSAProjector,
    "moe": MoEAudioProjector,
    "qformer": QFormerAudioProjector,
    "my_projector": MyProjector,  # Add here
}
  3. Create an experiment config configs/experiments/my_projector.yaml:
model:
  projector_type: my_projector
  4. Train: poetry run python scripts/train.py +experiments=my_projector
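
A quick sanity check that the registration worked (assuming PROJECTOR_CLASSES is importable from tiny_audio.projectors; dims are illustrative):

from tiny_audio.projectors import PROJECTOR_CLASSES

proj = PROJECTOR_CLASSES["my_projector"](input_dim=1024, output_dim=1024, config=None)
print(type(proj).__name__)  # MyProjector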

Adding a New Dataset

  1. Add a config file configs/data/my_dataset.yaml:
dataset_name: "your-org/your-dataset"
dataset_split: "train"
audio_column: "audio"
text_column: "text"
  2. Train with your dataset: poetry run python scripts/train.py data=my_dataset
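
The YAML fields map directly onto a standard Hugging Face datasets load; roughly what the loader does (illustrative, not the repo's exact code):

from datasets import load_dataset

ds = load_dataset("your-org/your-dataset", split="train")  # dataset_name / dataset_split
example = ds[0]
waveform = example["audio"]  # audio_column
target = example["text"]     # text_column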

Key Files to Understand

File               Purpose                   When to Modify
asr_modeling.py    Core model class          Adding model features, changing forward pass
asr_config.py      Configuration             Adding new config parameters
projectors.py      Projector architectures   Adding new projector types
asr_processing.py  Audio/text preprocessing  Changing input processing
train.py           Training loop             Modifying training behavior

Environment Variables

Variable            Description
HF_TOKEN            HuggingFace API token (for private models/pushing)
WANDB_API_KEY       Weights & Biases API key
WANDB_RUN_ID        Resume a specific W&B run
ASSEMBLYAI_API_KEY  For AssemblyAI evaluation comparison

Learn More

Acknowledgments

License

MIT
