Tiny Audio

Train your own speech recognition model in 24 hours for $12

A minimal, hackable ASR codebase. Connect a frozen audio encoder to a frozen LLM via a trainable projector (~12M params). That's it.


Try It

Live Demo

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
print(pipe("audio.wav")["text"])

Quick Start

Installation

# Clone and install
git clone https://github.com/alexkroman/tiny-audio.git && cd tiny-audio
poetry install

# Or install from PyPI (inference only)
pip install tiny-audio

Basic Inference

from transformers import pipeline

# Load model
pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# Transcribe audio file
result = pipe("audio.wav")
print(result["text"])

# Transcribe from URL
result = pipe("https://example.com/audio.mp3")

# Transcribe numpy array (16kHz)
import numpy as np
audio = np.random.randn(16000)  # 1 second of audio
result = pipe(audio)

Streaming Inference

import numpy as np

from tiny_audio import ASRModel, ASRProcessor

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Turn a 16kHz waveform into encoder features
# (HF-style processor call; see ASRProcessor for the exact arguments)
audio = np.random.randn(16000)  # placeholder: 1 second of audio
audio_features = processor(audio, sampling_rate=16000)

# Stream tokens as they're generated
for token in model.generate_streaming(audio_features):
    print(token, end="", flush=True)

Word-Level Timestamps

# Enable timestamps in pipeline
result = pipe("audio.wav", return_timestamps="word")

# Returns: {"text": "hello world", "chunks": [{"text": "hello", "start": 0.0, "end": 0.5}, ...]}
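
For example, you can print per-word timing from the returned chunks (using the fields shown above):

result = pipe("audio.wav", return_timestamps="word")
for chunk in result["chunks"]:
    print(f"{chunk['start']:5.2f}s-{chunk['end']:5.2f}s  {chunk['text']}")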

Train Your Own

Quick Test

# Verify setup (~5 min)
poetry run python scripts/train.py +experiments=transcription data.max_train_samples=100 training.max_steps=10

Full Training

# Standard training (~24 hours on A40)
poetry run python scripts/train.py +experiments=transcription

# With custom learning rate
poetry run python scripts/train.py +experiments=transcription training.learning_rate=1e-4

# Resume from checkpoint
poetry run python scripts/train.py +experiments=transcription training.resume_from_checkpoint=/path/to/checkpoint-XXXX

Projector Types

poetry run python scripts/train.py +experiments=transcription  # Simple MLP (~12M params)
poetry run python scripts/train.py +experiments=mosa           # Dense mixture of experts
poetry run python scripts/train.py +experiments=moe            # Sparse routed experts
poetry run python scripts/train.py +experiments=qformer        # Transformer with learnable queries

Multi-Stage Training with LoRA

# Stage 1: Train projector only (default)
poetry run python scripts/train.py +experiments=transcription

# Stage 2: Freeze projector, train LoRA adapters on LLM
poetry run python scripts/train.py +experiments=mlp_lora

# Stage 3: Fine-tune both projector and LoRA
poetry run python scripts/train.py +experiments=mlp_fine_tune
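
Conceptually, stage 2 attaches small trainable adapter matrices to the frozen LLM. A minimal standalone sketch using the peft library (rank and target modules are illustrative, not the repo's actual config):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
llm = get_peft_model(llm, lora_cfg)   # base weights stay frozen
llm.print_trainable_parameters()      # only the adapter matrices train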

Architecture

Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text

Only the projector trains. The encoder and decoder stay frozen, leveraging their pretrained knowledge.

Component        Params  Status
GLM-ASR Encoder  ~600M   Frozen
MLP Projector    ~12M    Trained
Qwen3-0.6B       ~600M   Frozen
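
The freezing pattern above boils down to disabling gradients everywhere except the projector. A minimal sketch of the wiring (class and attribute names are illustrative, not the library's API):

import torch.nn as nn

class FrozenSandwich(nn.Module):
    """Illustrative wiring: frozen encoder -> trainable projector -> frozen decoder."""

    def __init__(self, encoder: nn.Module, projector: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.projector, self.decoder = encoder, projector, decoder
        for module in (self.encoder, self.decoder):
            for p in module.parameters():
                p.requires_grad = False  # never updated by the optimizer
        # only self.projector's ~12M parameters receive gradients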

How It Works

  1. Audio Encoder: GLM-ASR converts raw audio to frame-level embeddings
  2. Projector: Bridges the modality gap by projecting audio embeddings to text embedding space
  3. Language Model: Qwen3 generates text conditioned on the projected audio

The projector uses frame stacking to reduce sequence length: output_len = (input_len - k) // k + 1 where k is the pooling stride.
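
For intuition, here is a minimal sketch of frame stacking (shapes are illustrative): consecutive groups of k frames are concatenated along the feature dimension, shrinking the sequence by a factor of k before the MLP:

import torch

def stack_frames(x: torch.Tensor, k: int = 4) -> torch.Tensor:
    """(batch, seq_len, dim) -> (batch, seq_len // k, dim * k).

    Matches output_len = (input_len - k) // k + 1 when stride == window == k.
    """
    batch, seq_len, dim = x.shape
    usable = (seq_len // k) * k              # drop frames that don't fill a group
    return x[:, :usable, :].reshape(batch, usable // k, dim * k)

frames = torch.randn(1, 100, 1024)           # 100 encoder frames
print(stack_frames(frames).shape)            # torch.Size([1, 25, 4096])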

Evaluation

# Evaluate on default dataset
poetry run ta eval -m mazesmazes/tiny-audio -n 100

# Evaluate on specific dataset
poetry run ta eval -m mazesmazes/tiny-audio -d loquacious -n 1000

# Compare with other models
poetry run ta eval -m assemblyai --assemblyai-model universal -d loquacious -n 100

# WER analysis
poetry run ta analysis high-wer mazesmazes/tiny-audio --threshold 30
poetry run ta analysis compare model1 model2
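
The analysis commands are built around word error rate. As a standalone illustration (using the jiwer library, not the repo's evaluator):

from jiwer import wer

references = ["hello world", "the cat sat on the mat"]
hypotheses = ["hello word", "the cat sat on a mat"]
print(f"WER: {wer(references, hypotheses):.2%}")  # here 2 substitutions / 8 words = 25%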

CLI Reference

All commands are available via the tiny-audio entry point (or ta for short):

poetry run ta --help  # Show all commands
Command      Description
ta eval      Evaluate ASR models on datasets
ta analysis  WER analysis (high-wer, entity-errors, compare)
ta deploy    Deploy demo to HuggingFace Space
ta push      Push model to HuggingFace Hub
ta demo      Launch local Gradio demo
ta debug     Debug utilities (check-mosa, analyze-lora)
ta runpod    Remote training on RunPod
ta dev       Development tools (lint, format, test, etc.)

Common Options

Option         Short  Description
--model        -m     HuggingFace model ID
--datasets     -d     Datasets to evaluate
--max-samples  -n     Maximum samples
--output-dir   -o     Output directory
--num-workers  -w     Parallel workers

Configuration

Configuration uses Hydra. Override any value with key=value syntax:

# Override model settings
poetry run python scripts/train.py model.projector_type=moe

# Override training settings
poetry run python scripts/train.py training.learning_rate=1e-4 training.batch_size=8

# Override data settings
poetry run python scripts/train.py data.max_train_samples=10000
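
Under the hood, scripts/train.py receives these overrides through Hydra's standard entry-point pattern; a minimal sketch (the config path is illustrative):

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg already has CLI overrides like training.learning_rate=1e-4 applied
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()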

Config Files

configs/
├── config.yaml           # Main config (imports data + training)
├── experiments/          # Projector presets
│   ├── transcription.yaml # Simple MLP (Stage 1)
│   ├── mosa.yaml         # Dense MoE
│   ├── moe.yaml          # Sparse MoE
│   ├── qformer.yaml      # Transformer
│   ├── mlp_lora.yaml     # Stage 2: LoRA only
│   └── mlp_fine_tune.yaml # Stage 3: Projector + LoRA
├── data/
│   └── multiasr.yaml     # Multi-ASR dataset config
└── training/
    └── production.yaml   # Training hyperparameters

Key Config Parameters

model:
  audio_model_id: "zai-org/GLM-ASR-Nano-2512"  # Audio encoder
  text_model_id: "Qwen/Qwen3-0.6B"              # Language model
  projector_type: mlp                           # mlp, mosa, moe, qformer

training:
  learning_rate: 1e-4
  batch_size: 4
  max_steps: 50000
  warmup_steps: 1000

Project Structure

tiny-audio/
├── tiny_audio/              # Core library
│   ├── asr_modeling.py      # ASRModel: encoder + projector + decoder
│   ├── asr_config.py        # ASRConfig: all model settings
│   ├── asr_pipeline.py      # HuggingFace pipeline for inference
│   ├── asr_processing.py    # ASRProcessor: audio/text preprocessing
│   ├── projectors.py        # All projector architectures
│   ├── handler.py           # HF Inference Endpoints handler
│   └── integrations/        # Voice agent integrations
├── scripts/
│   ├── train.py             # Training script (Hydra)
│   ├── cli.py               # Unified CLI entry point
│   ├── dev.py               # Development utilities
│   ├── analysis.py          # WER analysis tools
│   ├── eval/                # Evaluation framework
│   │   ├── evaluators/      # ASR, diarization, alignment
│   │   └── datasets.py      # Dataset loading
│   ├── deploy/              # HF Space deployment
│   ├── hub/                 # HF Hub integration
│   └── debug/               # Debug utilities
├── configs/                 # Hydra configuration
├── tests/                   # Test suite
└── docs/                    # Documentation and course

Development

Setup

git clone https://github.com/alexkroman/tiny-audio.git
cd tiny-audio
poetry install

Running Tests

poetry run ta dev test                    # Run all tests
poetry run pytest tests/test_projectors.py -v  # Single file
poetry run pytest -k "test_forward" -v    # By name pattern

Code Quality

poetry run ta dev format      # Format code (black, isort)
poetry run ta dev lint        # Lint (ruff, black --check)
poetry run ta dev type-check  # Type check (mypy, pyright)
poetry run ta dev precommit   # Full quality gate

Adding a New Projector

  1. Add your projector class to tiny_audio/projectors.py:
import torch
import torch.nn as nn

class MyProjector(nn.Module):
    def __init__(self, input_dim: int, output_dim: int, config):
        super().__init__()
        self.proj = nn.Linear(input_dim, output_dim)  # your architecture here

    def forward(self, x: torch.Tensor, attention_mask: torch.Tensor = None) -> torch.Tensor:
        # Return features projected into the text embedding space
        return self.proj(x)

    def get_output_length(self, input_length: int) -> int:
        # Return output sequence length given input length
        return input_length
  2. Register it in the PROJECTOR_CLASSES dict in projectors.py:
PROJECTOR_CLASSES = {
    "mlp": MLPAudioProjector,
    "mosa": MOSAProjector,
    "moe": MoEAudioProjector,
    "qformer": QFormerAudioProjector,
    "my_projector": MyProjector,  # Add here
}
  3. Create an experiment config configs/experiments/my_projector.yaml:
model:
  projector_type: my_projector
  4. Train: poetry run python scripts/train.py +experiments=my_projector
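
A quick sanity check that the registration worked (assuming PROJECTOR_CLASSES is importable from tiny_audio.projectors; dims are illustrative):

from tiny_audio.projectors import PROJECTOR_CLASSES

proj = PROJECTOR_CLASSES["my_projector"](input_dim=1024, output_dim=1024, config=None)
print(type(proj).__name__)  # MyProjector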

Adding a New Dataset

  1. Add a config file configs/data/my_dataset.yaml:
dataset_name: "your-org/your-dataset"
dataset_split: "train"
audio_column: "audio"
text_column: "text"
  2. Train with your dataset: poetry run python scripts/train.py data=my_dataset
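
The YAML fields map directly onto a standard Hugging Face datasets load; roughly what the loader does (illustrative, not the repo's exact code):

from datasets import load_dataset

ds = load_dataset("your-org/your-dataset", split="train")  # dataset_name / dataset_split
example = ds[0]
waveform = example["audio"]  # audio_column
target = example["text"]     # text_column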

Key Files to Understand

File               Purpose                   When to Modify
asr_modeling.py    Core model class          Adding model features, changing forward pass
asr_config.py      Configuration             Adding new config parameters
projectors.py      Projector architectures   Adding new projector types
asr_processing.py  Audio/text preprocessing  Changing input processing
train.py           Training loop             Modifying training behavior

Environment Variables

Variable            Description
HF_TOKEN            HuggingFace API token (for private models/pushing)
WANDB_API_KEY       Weights & Biases API key
WANDB_RUN_ID        Resume a specific W&B run
ASSEMBLYAI_API_KEY  For AssemblyAI evaluation comparison

Learn More

Acknowledgments

License

MIT
