Discord Audio Transcript Deduplication Pipeline

This repository contains a multi-stage pipeline for processing, transcribing, and deduplicating Discord voice session recordings into clean text transcripts. It includes tools for audio capture, filtering, transcription, clustering-based deduplication, and final text output.

📚 Overview

The pipeline operates in the following phases:

Phase 0 – Discord Audio Capture
Captures user audio streams as individual .wav files and generates session logs.
Phase 1 – Audio Validation and Filtering
Filters audio by silence and duration constraints, and rescues bursty utterances with VAD.
Phase 2 – Whisper Transcription
Transcribes accepted audio files to text using a CTranslate2-based Whisper model.
Phase 3 – Deduplication by Clustering
Clusters transcriptions and deduplicates based on similarity, canonical form, and scoring.
Output – A cleaned .txt transcript preserving character, flow, and session integrity.
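
To make the Phase 3 idea concrete, here is a minimal sketch of clustering-based deduplication using only the standard library. It is not the code in `dedupe_transcript.py`: the function names, the 0.85 similarity threshold, the `difflib` similarity measure, and the "keep the longest line" scoring rule are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def canonical(text: str) -> str:
    """Canonical form: lowercase, no punctuation, single spaces."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def cluster(lines: list[str], threshold: float = 0.85) -> list[list[int]]:
    """Greedy single-pass clustering over line indices.

    Each line joins the first cluster whose representative (its first
    member) has a similarity ratio >= threshold (an assumed value).
    """
    canon = [canonical(line) for line in lines]
    clusters: list[list[int]] = []
    for i, text in enumerate(canon):
        for members in clusters:
            representative = canon[members[0]]
            if SequenceMatcher(None, representative, text).ratio() >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append([i])
    return clusters


def dedupe(lines: list[str], threshold: float = 0.85) -> list[str]:
    """Keep one line per cluster (the longest), in original order."""
    keep = sorted(
        max(members, key=lambda i: len(lines[i]))
        for members in cluster(lines, threshold)
    )
    return [lines[i] for i in keep]
```

Scoring by length is only a stand-in; a real scorer might also weigh transcription confidence or timestamps, which this sketch ignores.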

🛠 Scripts

| Script | Purpose |
| --- | --- |
| `index.ts` | Captures Discord voice as per-user `.wav` files |
| `dedupe_audit.py` | Filters raw audio: silence, noise, duplicates, duration |
| `burst_scope.py` | Rescues short sharp utterances from false VAD rejection |
| `transcribe_accepted.py` | Transcribes accepted `.wav` files into enriched JSONL |
| `dedupe_transcript.py` | Deduplicates transcribed JSONL using clustering |
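
As an illustration of the kind of filtering that `dedupe_audit.py` and `burst_scope.py` are described as doing, the sketch below classifies a clip by duration and loudness and "rescues" clips that fail those checks but contain a short loud burst. It uses a plain energy check as a stand-in for real VAD. The function names and all thresholds are hypothetical and are not taken from the scripts.

```python
import math
import sys
import wave
from array import array


def read_mono_16bit(path: str) -> tuple[list[int], int]:
    """Read a 16-bit mono PCM .wav file; returns (samples, sample_rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        samples = array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
        if sys.byteorder == "big":
            samples.byteswap()  # WAV data is little-endian
        return list(samples), wf.getframerate()


def rms(samples) -> float:
    """Root-mean-square amplitude; 0.0 for an empty clip."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def classify(samples, rate, min_sec=0.5, max_sec=30.0,
             silence_rms=200.0, burst_rms=1500.0, window_sec=0.05) -> str:
    """Return 'accept', 'rescue', or a 'reject:<reason>' label.

    All thresholds are assumed values for illustration only.
    """
    duration = len(samples) / rate
    if duration > max_sec:
        return "reject:too_long"
    if duration >= min_sec and rms(samples) >= silence_rms:
        return "accept"
    # Too short or too quiet overall: look for one loud short window.
    window = max(1, int(rate * window_sec))
    peak = max(
        (rms(samples[i:i + window]) for i in range(0, len(samples), window)),
        default=0.0,
    )
    if peak >= burst_rms:
        return "rescue"
    return "reject:too_short" if duration < min_sec else "reject:silent"
```

A clip would be classified with `classify(*read_mono_16bit("clip.wav"))`, where `clip.wav` is a placeholder file name.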

🚀 Quick Start

1. Clone the repo and install required Python and Node.js dependencies.
2. Configure `.env` with your Discord bot credentials.
3. Run each phase in sequence:
   - `index.ts` to capture audio.
   - `dedupe_audit.py` to filter audio.
   - `transcribe_accepted.py` to transcribe.
   - `dedupe_transcript.py` to deduplicate.
4. Review the final transcript output.
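
Once capture has finished, the three Python phases could be chained with a small driver like the one below. This is only a sketch: it assumes each script can be run from the repository root, and it passes no arguments because the scripts' actual command-line options are not documented here. The `extra_args` mapping is a hypothetical hook for supplying them.

```python
import subprocess
import sys

# Python phases in the order listed in the Quick Start.
PYTHON_PHASES = [
    "dedupe_audit.py",
    "transcribe_accepted.py",
    "dedupe_transcript.py",
]


def build_commands(python=sys.executable, extra_args=None):
    """Build one command per phase.

    extra_args maps a script name to a list of arguments; which
    arguments each script accepts is an open assumption.
    """
    extra_args = extra_args or {}
    return [
        [python, script, *extra_args.get(script, [])]
        for script in PYTHON_PHASES
    ]


def run_all(commands, runner=subprocess.run):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        # With check=True, subprocess.run raises CalledProcessError
        # on a non-zero exit code, so later phases are skipped.
        runner(cmd, check=True)
```

Calling `run_all(build_commands())` runs the phases back to back; stopping on the first error avoids transcribing audio that was never filtered.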

⚡ Key Notes

Built around faster-whisper with standard CTranslate2 binary releases from PyPI.
Supports both GPU (CUDA) and CPU transcription paths via runtime flags in `transcribe_accepted.py`.
Tested mostly on RTX-class GPUs; a custom local CTranslate2 build is no longer required.
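
For orientation, the sketch below shows how a GPU or CPU path can be selected with faster-whisper's `WhisperModel`. The `use_gpu` switch, the `"small"` model size, the compute types, and the JSONL field names are assumptions for illustration; they are not the actual flags or output schema of `transcribe_accepted.py`.

```python
import json


def device_config(use_gpu: bool) -> dict:
    """Keyword arguments for faster_whisper.WhisperModel.

    float16 on CUDA and int8 on CPU are common choices, assumed here.
    """
    if use_gpu:
        return {"device": "cuda", "compute_type": "float16"}
    return {"device": "cpu", "compute_type": "int8"}


def segments_to_jsonl(path: str, segments) -> str:
    """Serialize segments (objects with start, end, text) to JSONL."""
    lines = [
        json.dumps({
            "file": path,
            "start": round(seg.start, 2),
            "end": round(seg.end, 2),
            "text": seg.text.strip(),
        })
        for seg in segments
    ]
    return "\n".join(lines)


def transcribe(path: str, use_gpu: bool = False, model_size: str = "small") -> str:
    """Transcribe one .wav file and return JSONL text."""
    # Imported lazily so the helpers above work without the package.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, **device_config(use_gpu))
    segments, _info = model.transcribe(path)  # segments is a generator
    return segments_to_jsonl(path, segments)
```

Keeping the device choice in one small function makes it easy to fall back to CPU on machines without CUDA.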

📦 Installation

Python Dependencies

pip install -r requirements.txt

(further dependencies may be required)

CTranslate2 is available on PyPI. For example, `pip index versions ctranslate2` currently reports 4.7.1 along with historical releases.
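
After installing, a quick check like the following can confirm which versions ended up in the environment. It is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the package names checked are the PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("ctranslate2", "faster-whisper"):
    print(name, installed_version(name) or "not installed")
```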

Node.js Dependencies

npm install

(further dependencies may be required)

AI Transparency Statement: Built mostly with the aid of ChatGPT. The author is a sysadmin and project manager with some decades of experience and believes he can appropriately supervise the "dev team", but wishes to be honest and upfront for anyone who worries about such things.
