Srv99x/voice-detection-ai

---
title: VoiceGuard AI
emoji: πŸŽ™οΈ
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Real-time deepfake audio detection API (FastAPI + Wav2Vec2)
---


Can you trust the voice you just heard?
VoiceGuard AI answers that in under 300ms.


πŸš€ Try Live Demo Β Β·Β  πŸ“‘ API Playground Β Β·Β  πŸ› Report a Bug Β Β·Β  πŸ’‘ Request Feature


🧠 What Is This?

VoiceGuard AI is a production-grade deepfake audio detection system that determines β€” in real time β€” whether a voice recording is authentic human speech or AI-synthesized audio.

As voice cloning tools like ElevenLabs, Murf, and PlayHT become disturbingly accessible, the ability to verify audio authenticity is no longer optional β€” it's critical for fraud prevention, digital forensics, and AI transparency.

This project was built for a CyberSecurity Hackathon and is fully deployed and publicly accessible.

Why it's different:

  • Uses Meta's Wav2Vec2-Large-XLSR-53 β€” one of the most powerful multilingual speech transformers ever released β€” for raw vocal fingerprinting, not just spectral features
  • Real model, real data β€” trained on diverse human + AI audio with 4Γ— augmentation
  • Full-stack, production-deployed, with a polished cyberpunk frontend

✨ Key Features

| Feature | Details |
|---|---|
| πŸŽ™οΈ Multi-format support | MP3, WAV, M4A, OGG, FLAC β€” all accepted |
| ⚑ Sub-300ms inference | Async FastAPI + pre-loaded model |
| 🌐 Multilingual | Wav2Vec2-XLSR trained on 53 languages |
| πŸ”’ API key auth | Supports both `x-api-key` and `Authorization: Bearer` headers |
| πŸ“Š Confidence scoring | Probability score, not just binary output |
| 🐳 Containerized | Docker-ready for one-command deployment |
| πŸ” Data augmentation | 4Γ— augmentation per sample (noise + pitch shift Β±2 semitones) |
| πŸ–₯️ Stunning UI | Cyberpunk React frontend with matrix rain and animated confidence ring |

πŸ“Š Model Performance

Evaluated on a held-out test set of real and AI-generated audio samples

| Metric | Score |
|---|---|
| βœ… Accuracy | ~91% |
| 🎯 Precision | ~89% |
| πŸ” Recall | ~93% |
| βš–οΈ F1 Score | ~91% |
| ⚑ Inference Time | < 300ms |

Performance may vary based on audio quality, background noise, and the TTS engine used to generate synthetic samples. Clips < 2 seconds may yield lower confidence.
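
The reported F1 is consistent with the precision and recall figures above; a quick sanity check of the harmonic mean (values approximate, taken from the table):

```python
# F1 is the harmonic mean of precision and recall
precision = 0.89
recall = 0.93
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.91
```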


πŸ—οΈ System Architecture

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚                        CLIENT (Browser)                         β”‚
  β”‚            Drag & Drop / Record β†’ Base64 encode audio           β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚  POST /detect-audio/
                              β”‚  { audio_base64, audio_format }
                              β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚                      FastAPI Backend                            β”‚
  β”‚                                                                 β”‚
  β”‚  1. Verify API Key (x-api-key or Authorization header)          β”‚
  β”‚  2. Decode Base64 β†’ raw audio bytes                             β”‚
  β”‚  3. Librosa: load + resample to 16kHz                           β”‚
  β”‚  4. Wav2Vec2-Large-XLSR-53 β†’ 1024-dim vocal embedding           β”‚
  β”‚  5. StandardScaler: normalize embedding                         β”‚
  β”‚  6. MLP Classifier: predict [human_prob, ai_prob]               β”‚
  β”‚  7. Return verdict + confidence score                           β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚         JSON Response         β”‚
              β”‚  {                            β”‚
              β”‚    is_ai_generated: bool,     β”‚
              β”‚    confidence_score: float,   β”‚
              β”‚    message: string            β”‚
              β”‚  }                            β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Model Stack:

  • Feature Extractor β†’ facebook/wav2vec2-large-xlsr-53 (1024-dim embeddings)
  • Classifier β†’ scikit-learn MLPClassifier (hidden layers (128, 64), max_iter=500)
  • Scaler β†’ StandardScaler (fitted on training set)
  • Training augmentation β†’ Original + Gaussian noise + Pitch Β±2 semitones

Why MLP? Wav2Vec2 produces 1024-dimensional embeddings that capture highly non-linear speech phenomena; an MLP with hidden layers [128, 64] can learn these complex decision boundaries far more effectively than logistic regression, which is limited to linear separability in the original feature space.
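
A minimal sketch of that scaler + MLP head, on random stand-in vectors rather than real Wav2Vec2 embeddings (this illustrates the described architecture, not the project's actual training code):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in "embeddings": in the real pipeline these are 1024-dim
# Wav2Vec2 vectors extracted from audio.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1024))
y = rng.integers(0, 2, size=200)  # 0 = human, 1 = AI

scaler = StandardScaler().fit(X)  # fitted on the training set only
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)
clf.fit(scaler.transform(X), y)

# Inference: normalize, then get [human_prob, ai_prob]
probs = clf.predict_proba(scaler.transform(X[:1]))
print(probs.shape)  # (1, 2)
```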


πŸ”Œ API Reference

Base URL: https://srv99x-voice-detector-live.hf.space
Interactive Docs: https://srv99x-voice-detector-live.hf.space/docs

Endpoints

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | `/` | ❌ | Health check + API info |
| POST | `/detect-audio/` | βœ… | Main detection endpoint |

POST /detect-audio/

Request Headers:

```http
Content-Type: application/json
x-api-key: YOUR_API_KEY
```

or

```http
Authorization: Bearer YOUR_API_KEY
```
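
Both header styles resolve to the same key check on the server. A minimal sketch of that logic (the helper name is illustrative, not the backend's actual function):

```python
def extract_api_key(headers):
    """Accept either 'x-api-key: KEY' or 'Authorization: Bearer KEY'."""
    # HTTP header names are case-insensitive; normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    if "x-api-key" in h:
        return h["x-api-key"]
    auth = h.get("authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):]
    return None

print(extract_api_key({"X-API-Key": "abc"}))             # abc
print(extract_api_key({"Authorization": "Bearer abc"}))  # abc
```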

Request Body:

```json
{
  "audio_base64": "<BASE64_ENCODED_AUDIO>",
  "audio_format": "mp3",
  "language": "en"
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `audio_base64` | string | βœ… | Base64-encoded audio content |
| `audio_format` | string | βœ… | `"mp3"`, `"wav"`, `"m4a"`, `"ogg"`, `"flac"` |
| `language` | string | ❌ | Language hint (optional) |

Response:

```json
{
  "is_ai_generated": true,
  "confidence_score": 0.9842,
  "message": "Synthetic spectral patterns detected. AI-generated voice signature found."
}
```

| Field | Type | Description |
|---|---|---|
| `is_ai_generated` | boolean | `true` = AI-generated, `false` = human |
| `confidence_score` | float | Prediction probability between 0.0 and 1.0 |
| `message` | string | Human-readable result summary |

Error Codes:

| Code | Meaning |
|---|---|
| 401 | Missing or invalid API key |
| 400 | Invalid base64 or missing audio data |
| 500 | Internal server error |
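
These codes map to straightforward client-side handling. A hypothetical helper (not part of the API or its SDK) that turns them into actionable messages:

```python
def explain_status(code):
    """Map the API's documented status codes to actionable messages (sketch)."""
    return {
        200: "OK - parse is_ai_generated and confidence_score",
        400: "Invalid base64 or missing audio data - re-encode the file",
        401: "Missing or invalid API key - check the x-api-key header",
        500: "Internal server error - retry later or report a bug",
    }.get(code, f"Unexpected status {code}")

print(explain_status(401))
```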

πŸ§ͺ Quick Test (cURL)

```bash
# Step 1: Encode your audio to base64
# (GNU coreutils syntax; on macOS use: base64 -i your_audio.mp3)
BASE64=$(base64 -w 0 your_audio.mp3)

# Step 2: Send request
curl -X POST "https://srv99x-voice-detector-live.hf.space/detect-audio/" \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d "{\"audio_base64\": \"$BASE64\", \"audio_format\": \"mp3\", \"language\": \"en\"}"
```

🐍 Python Example

```python
import requests
import base64

# Encode audio
with open("your_audio.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

# Call API
response = requests.post(
    "https://srv99x-voice-detector-live.hf.space/detect-audio/",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "audio_base64": audio_b64,
        "audio_format": "mp3",
        "language": "en"
    }
)

result = response.json()
print(f"AI Generated: {result['is_ai_generated']}")
print(f"Confidence:   {result['confidence_score']:.1%}")
print(f"Message:      {result['message']}")
```

🌐 JavaScript / Fetch Example

```javascript
const audioFile = document.querySelector('input[type="file"]').files[0];
const reader = new FileReader();

reader.onload = async () => {
  const base64 = reader.result.split(',')[1];

  const response = await fetch('https://srv99x-voice-detector-live.hf.space/detect-audio/', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': 'YOUR_API_KEY'
    },
    body: JSON.stringify({ audio_base64: base64, audio_format: 'mp3', language: 'en' })
  });

  const result = await response.json();
  console.log(result);
};

reader.readAsDataURL(audioFile);
```

πŸ› οΈ Local Development Setup

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • ffmpeg installed β†’ Download here
  • Git

Backend Setup

```bash
# 1. Clone the repository
git clone https://github.com/Srv99x/voice-detection-ai.git
cd voice-detection-ai

# 2. Create a virtual environment
python -m venv venv

# Windows
venv\Scripts\activate

# macOS / Linux
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Create a .env file
echo "SECRET_API_KEY=your_secret_key_here" > .env

# 5. Start the server
uvicorn main:app --reload
```

API will be live at: http://127.0.0.1:8000
Interactive Swagger docs: http://127.0.0.1:8000/docs

Frontend Setup

```bash
# Navigate to frontend
cd frontend

# Copy env file and fill in values
cp .env.example .env
# Edit .env: set VITE_API_URL and VITE_API_KEY

# Install dependencies
npm install

# Start dev server
npm run dev
```

Frontend will be live at: http://localhost:5173

Environment Variables

Backend (`.env` in root):

```env
SECRET_API_KEY=your_super_secret_key_here
```

Frontend (`frontend/.env`):

```env
VITE_API_URL=http://localhost:8000
VITE_API_KEY=your_super_secret_key_here
```

🐳 Docker Deployment

```bash
# Build the image
docker build -t voiceguard-ai .

# Run the container
docker run -p 7860:7860 -e SECRET_API_KEY=your_key_here voiceguard-ai
```

API available at http://localhost:7860
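
For local convenience, the same run command can be expressed as a `docker-compose.yml`. This file is not part of the repo; it is a hypothetical equivalent of the `docker run` invocation above:

```yaml
services:
  voiceguard:
    build: .
    ports:
      - "7860:7860"
    environment:
      - SECRET_API_KEY=your_key_here
```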


πŸ‹οΈ Training Your Own Model

If you want to retrain the classifier on your own dataset:

1. Prepare your dataset

```text
voice-detection-ai/
└── dataset/
    β”œβ”€β”€ real/          ← Human voice recordings
    β”‚   β”œβ”€β”€ english/
    β”‚   β”œβ”€β”€ hindi/
    β”‚   └── ...
    └── ai/            ← AI/TTS-generated audio
        β”œβ”€β”€ elevenlabs/
        β”œβ”€β”€ murf/
        └── ...
```

Supports .mp3, .wav, .m4a β€” recursive subfolder scanning
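
A recursive scan over that layout can be sketched as follows (the helper name is illustrative; `train_model.py` may organize this differently):

```python
import tempfile
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a"}

def collect_samples(dataset_dir):
    """Recursively collect (filepath, label) pairs: 0 = real, 1 = ai."""
    samples = []
    for label_name, label in (("real", 0), ("ai", 1)):
        root = Path(dataset_dir) / label_name
        if root.is_dir():
            for path in sorted(root.rglob("*")):
                if path.suffix.lower() in AUDIO_EXTS:
                    samples.append((str(path), label))
    return samples

# Demo on a throwaway directory mirroring the layout above
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "real" / "english").mkdir(parents=True)
    (Path(tmp) / "ai" / "elevenlabs").mkdir(parents=True)
    (Path(tmp) / "real" / "english" / "a.wav").touch()
    (Path(tmp) / "ai" / "elevenlabs" / "b.mp3").touch()
    pairs = collect_samples(tmp)
    print(len(pairs))  # 2
```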

2. Run the training script

```bash
python train_model.py
```

The script will:

  1. Load and augment all audio files (4Γ— per file: original, noise, pitchΒ±2)
  2. Extract 1024-dim embeddings via wav2vec2-large-xlsr-53
  3. Train an MLPClassifier(hidden_layer_sizes=(128, 64))
  4. Save hackathon_model.pkl + model_scaler.pkl
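
The augmentation step can be sketched as below. Only the Gaussian-noise variant is shown to keep the sketch dependency-free; the Β±2 semitone variants would use `librosa.effects.pitch_shift` in practice, and the real script's function names may differ:

```python
import numpy as np

def augment(waveform, noise_std=0.005, seed=0):
    """Illustrative augmentation: original + Gaussian-noise copy.
    The full 4x set adds two pitch-shifted copies (+/-2 semitones)."""
    rng = np.random.default_rng(seed)
    noisy = waveform + rng.normal(0.0, noise_std, size=waveform.shape)
    return [waveform, noisy]

clip = np.zeros(16000, dtype=np.float32)  # 1 s of silence at 16 kHz
variants = augment(clip)
print(len(variants))  # 2
```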

πŸ“ Project Structure

```text
voice-detection-ai/
β”‚
β”œβ”€β”€ 🐍 Backend
β”‚   β”œβ”€β”€ main.py                  # FastAPI app + detection engine
β”‚   β”œβ”€β”€ train_model.py           # ML training pipeline with augmentation
β”‚   β”œβ”€β”€ test_api.py              # API test suite (4 test cases)
β”‚   β”œβ”€β”€ setup_ffmpeg.py          # FFmpeg installation helper
β”‚   β”œβ”€β”€ hackathon_model.pkl      # Trained MLP classifier (~4.3 MB)
β”‚   β”œβ”€β”€ model_scaler.pkl         # Fitted StandardScaler (~24 KB)
β”‚   β”œβ”€β”€ requirements.txt         # Python dependencies
β”‚   β”œβ”€β”€ Dockerfile               # Container config (port 7860 for HF Spaces)
β”‚   └── .github/workflows/ci.yml # GitHub Actions CI pipeline
β”‚
└── βš›οΈ Frontend (frontend/)
    β”œβ”€β”€ src/
    β”‚   β”œβ”€β”€ App.jsx              # Main React app (matrix rain, drag-drop, results)
    β”‚   β”œβ”€β”€ index.css            # Full cyberpunk design system (644 lines)
    β”‚   └── main.jsx             # React entry point
    β”œβ”€β”€ public/                  # Static assets
    β”œβ”€β”€ package.json             # Vite 7 + React 19
    β”œβ”€β”€ vite.config.js           # Vite configuration
    └── vercel.json              # SPA routing config for Vercel
```

πŸš€ Deployment

Backend β†’ Hugging Face Spaces

The backend is deployed as a Docker Space on Hugging Face. The Dockerfile is pre-configured for the HF Spaces environment (port 7860).

To deploy your own fork:

  1. Create a new Space on huggingface.co/spaces
  2. Set Space SDK to Docker
  3. Push this repo to your Space's git remote
  4. Add SECRET_API_KEY in Space Settings β†’ Repository secrets

Frontend β†’ Vercel

Deploy with Vercel

Or manually:

cd frontend
npx vercel --prod

Set environment variables in the Vercel dashboard:

  • VITE_API_URL β†’ Your Hugging Face Space URL
  • VITE_API_KEY β†’ Your secret API key

⚠️ Limitations

  • Short clips (< 2 seconds) may yield lower confidence scores
  • Professional-grade voice clones with high-fidelity TTS may occasionally bypass detection
  • Background noise can reduce classification reliability
  • Language bias β€” model is strongest on English; multilingual performance depends on training data diversity
  • Large audio files will increase base64 payload size and transfer time
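
The base64 caveat is easy to quantify: encoding inflates the payload by a factor of 4/3, so a 30 MB recording becomes a ~40 MB request body.

```python
import base64

raw = b"\x00" * 300_000         # ~300 KB of stand-in audio bytes
encoded = base64.b64encode(raw)
print(len(encoded) / len(raw))  # exactly 4/3 here
```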

πŸ›£οΈ Roadmap

  • Streaming audio input (WebSocket API)
  • Support more formats: OGG, FLAC, M4A via API
  • Waveform visualizer on frontend
  • History tab with past analysis results
  • Confidence threshold configuration
  • Batch file analysis endpoint
  • Public leaderboard of tested TTS engines

🀝 Contributing

Contributions are welcome and appreciated!

```bash
# Fork the repo, then:
git checkout -b feature/your-amazing-feature
git commit -m "feat: add your amazing feature"
git push origin feature/your-amazing-feature
# Open a Pull Request!
```

Please read CONTRIBUTING.md before submitting changes.

Good first issues to tackle:

  • Improve model accuracy with larger/more diverse datasets
  • Add real waveform visualization using Web Audio API
  • Add OGG/FLAC/M4A support to the API
  • Write more comprehensive unit tests
  • Add dark/light mode toggle to frontend

βš™οΈ Tech Stack

| Layer | Technology |
|---|---|
| Language model | facebook/wav2vec2-large-xlsr-53 |
| Classifier | scikit-learn MLPClassifier |
| Audio processing | librosa, soundfile, pydub |
| Backend framework | FastAPI + Uvicorn |
| Frontend | React 19 + Vite 7 |
| Styling | Vanilla CSS (glassmorphism + cyberpunk) |
| Containerization | Docker |
| Backend hosting | Hugging Face Spaces |
| Frontend hosting | Vercel |
| CI/CD | GitHub Actions |

πŸ“„ License

This project is licensed under the MIT License β€” see the LICENSE file for details.


πŸ‘€ Author

Sourav Chakraborty
B.Tech CSE Β· AI/ML Enthusiast Β· Full-Stack Developer

LinkedIn GitHub


Built with 🧠 neural networks, β˜• caffeine, and ❀️ for AI transparency.

If this project helped you, please consider giving it a ⭐ β€” it really means a lot!
