jmondaud/audio-transcriber

Audio Transcriber

A Python-based audio transcription tool using Faster Whisper, an optimized implementation of OpenAI's Whisper model. This tool processes audio files with GPU acceleration for fast, accurate transcriptions.

Features

  • GPU-accelerated transcription using NVIDIA CUDA
  • Supports multiple audio formats (MP3, WAV, M4A, FLAC, OGG, WMA, AAC)
  • Automatic organization of audio files and transcripts
  • Uses Whisper's most accurate model (large-v3) by default
  • Efficient processing with faster-whisper and CTranslate2

Requirements

  • Python 3.10-3.13
  • NVIDIA GPU with CUDA support
  • CUDA 12.x
  • cuDNN libraries

Installation

1. Install dependencies

This project uses uv for dependency management:

uv sync

2. Install CUDA libraries

For GPU acceleration, install cuDNN:

uv pip install nvidia-cudnn-cu12

Project Structure

audio-transcriber/
├── .venv/              # Virtual environment (created by uv)
├── audio/              # Input directory - place audio files here
├── transcript/         # Output directory - organized results
├── main.py             # Main transcription script
├── run.sh              # Wrapper script to run with CUDA libraries
├── pyproject.toml      # Project configuration and dependencies
└── .gitignore          # Git ignore rules

Usage

Basic Usage

  1. Place your audio files in the audio/ directory

  2. Run the transcription script:

./run.sh

Important: Always use ./run.sh to run the script. This wrapper sets up the required CUDA library paths before starting the transcription. Running directly with uv run python main.py or python main.py will fail with cuDNN errors.
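The wrapper's job can be sketched in Python: find the cuDNN libraries that pip installed inside the virtual environment and prepend their directory to LD_LIBRARY_PATH before the model loads. This is a minimal sketch of the idea, not the actual contents of run.sh; the function names and the nvidia/cudnn/lib layout are assumptions.

```python
# Sketch of the path setup run.sh performs (names and layout are assumptions).
import os


def cudnn_lib_dir(site_packages: str) -> str:
    """Return the lib/ directory of the pip-installed nvidia-cudnn-cu12 package."""
    return os.path.join(site_packages, "nvidia", "cudnn", "lib")


def prepend_ld_library_path(env: dict, lib_dir: str) -> dict:
    """Return a copy of env with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (":" + existing if existing else "")
    return env
```

run.sh would apply this environment before invoking python main.py, which is why skipping the wrapper leaves cuDNN unresolvable at load time.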

Output Structure

The script processes each audio file and creates an organized folder structure:

transcript/
├── audio_file_1/
│   ├── audio_file_1.mp3
│   └── audio_file_1_transcript.txt
└── audio_file_2/
    ├── audio_file_2.wav
    └── audio_file_2_transcript.txt

Each audio file gets its own folder named after the original file (without extension), containing:

  • The original audio file (moved from audio/)
  • The generated transcript as a text file
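The per-file organization step described above can be sketched as follows. This is an illustrative reimplementation, not the actual code in main.py; the function name and signature are assumptions.

```python
# Sketch of the organization step: move the audio file into its own folder
# under transcript/ and write the transcript next to it.
import shutil
from pathlib import Path


def organize(audio_path: Path, transcript_text: str, out_root: Path) -> Path:
    """Move audio_path into out_root/<stem>/ and write <stem>_transcript.txt beside it."""
    folder = out_root / audio_path.stem
    folder.mkdir(parents=True, exist_ok=True)
    shutil.move(str(audio_path), folder / audio_path.name)
    (folder / f"{audio_path.stem}_transcript.txt").write_text(
        transcript_text, encoding="utf-8"
    )
    return folder
```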

Supported Audio Formats

  • MP3 (.mp3)
  • WAV (.wav)
  • M4A (.m4a)
  • FLAC (.flac)
  • OGG (.ogg)
  • WMA (.wma)
  • AAC (.aac)
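Discovery of input files can be sketched from the format list above, together with the "not hidden files" rule from the troubleshooting section. The function name is an assumption; only the extension set mirrors the documented behavior.

```python
# Sketch of input discovery: non-hidden files in audio/ with a supported
# extension (case-insensitive), sorted for deterministic processing order.
from pathlib import Path

SUPPORTED = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".wma", ".aac"}


def find_audio_files(audio_dir: Path) -> list[Path]:
    """Return supported, non-hidden audio files in audio_dir."""
    return sorted(
        p
        for p in audio_dir.iterdir()
        if p.is_file()
        and not p.name.startswith(".")
        and p.suffix.lower() in SUPPORTED
    )
```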

Configuration

The script uses the following defaults (can be modified in main.py):

  • Model: large-v3 (Whisper's most accurate model)
  • Device: cuda (GPU acceleration)
  • Compute Type: float16 (balanced speed/accuracy)

Troubleshooting

GPU Not Working

If you see CUDA/cuDNN errors, ensure you have installed the required libraries:

uv pip install nvidia-cudnn-cu12

Verify your GPU is detected:

nvidia-smi

No Audio Files Found

Make sure your audio files are:

  • Placed in the audio/ directory
  • In a supported format (see list above)
  • Not hidden files

Memory Issues

The large-v3 model requires significant VRAM. If you encounter out-of-memory errors, modify main.py to use a smaller model:

model = WhisperModel("medium", device="cuda", compute_type="float16")

Available models, from smallest/fastest to largest/most accurate:

  • tiny, base, small, medium, large-v2, large-v3

License

This project uses the following open-source components:

  • faster-whisper (MIT License)
  • OpenAI Whisper model (MIT License)
