---
title: VoiceGuard AI
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Real-time deepfake audio detection API (FastAPI + Wav2Vec2)
---
Can you trust the voice you just heard?
VoiceGuard AI answers that in under 300ms.
Try Live Demo · API Playground · Report a Bug · Request Feature
VoiceGuard AI is a production-grade deepfake audio detection system that determines, in real time, whether a voice recording is authentic human speech or AI-synthesized audio.
As voice cloning tools like ElevenLabs, Murf, and PlayHT become disturbingly accessible, verifying audio authenticity is no longer optional: it is critical for fraud prevention, digital forensics, and AI transparency.
This project was built for a cybersecurity hackathon and is fully deployed and publicly accessible.
- Uses Meta's Wav2Vec2-Large-XLSR-53, one of the most powerful multilingual speech transformers released to date, for raw vocal fingerprinting rather than hand-crafted spectral features
- Real model, real data: trained on diverse human and AI audio with 4x augmentation
- Full-stack and production-deployed, with a polished cyberpunk frontend
| Feature | Details |
|---|---|
| Multi-format support | MP3, WAV, M4A, OGG, FLAC are all accepted |
| Sub-300 ms inference | Async FastAPI + a model pre-loaded at startup |
| Multilingual | Wav2Vec2-XLSR pretrained on 53 languages |
| API key auth | Supports both `x-api-key` and `Authorization: Bearer` headers |
| Confidence scoring | Probability score, not just a binary verdict |
| Containerized | Docker-ready for one-command deployment |
| Data augmentation | 4x augmentation per sample (Gaussian noise + pitch shift ±2 semitones) |
| Stunning UI | Cyberpunk React frontend with matrix rain and an animated confidence ring |
Evaluated on a held-out test set of real and AI-generated audio samples.

| Metric | Score |
|---|---|
| Accuracy | ~91% |
| Precision | ~89% |
| Recall | ~93% |
| F1 Score | ~91% |
| Inference Time | < 300 ms |

Performance may vary with audio quality, background noise, and the TTS engine used to generate synthetic samples. Clips under 2 seconds may yield lower confidence.
```text
+------------------------------------------------------------------+
|                         CLIENT (Browser)                         |
|         Drag & Drop / Record  ->  Base64-encode audio            |
+--------------------------------+---------------------------------+
                                 |  POST /detect-audio/
                                 |  { audio_base64, audio_format }
                                 v
+------------------------------------------------------------------+
|                          FastAPI Backend                         |
|                                                                  |
|  1. Verify API key (x-api-key or Authorization header)           |
|  2. Decode Base64 -> raw audio bytes                             |
|  3. Librosa: load + resample to 16 kHz                           |
|  4. Wav2Vec2-Large-XLSR-53 -> 1024-dim vocal embedding           |
|  5. StandardScaler: normalize embedding                          |
|  6. MLP classifier: predict [human_prob, ai_prob]                |
|  7. Return verdict + confidence score                            |
+--------------------------------+---------------------------------+
                                 |
                                 v
                 +-------------------------------+
                 |         JSON Response         |
                 |  {                            |
                 |    is_ai_generated: bool,     |
                 |    confidence_score: float,   |
                 |    message: string            |
                 |  }                            |
                 +-------------------------------+
```
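The first two backend steps need nothing beyond the standard library. A minimal sketch of the validate-and-decode stage, with illustrative names that are not taken from `main.py`:

```python
import base64
import binascii

ALLOWED_FORMATS = {"mp3", "wav", "m4a", "ogg", "flac"}

def decode_payload(payload: dict) -> bytes:
    """Validate audio_format and decode audio_base64 into raw bytes (pipeline steps 1-2)."""
    fmt = str(payload.get("audio_format", "")).lower()
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported audio_format: {fmt!r}")
    try:
        # validate=True rejects non-alphabet characters instead of silently skipping them
        return base64.b64decode(payload["audio_base64"], validate=True)
    except (KeyError, binascii.Error) as exc:
        raise ValueError("audio_base64 missing or not valid base64") from exc
```

The decoded bytes then flow into librosa for loading and resampling to 16 kHz (step 3).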
Model Stack:
- Feature extractor: `facebook/wav2vec2-large-xlsr-53` (1024-dim embeddings)
- Classifier: scikit-learn `MLPClassifier` (hidden layers 128, 64; `max_iter=500`)
- Scaler: `StandardScaler` (fitted on the training set)
- Training augmentation: original + Gaussian noise + pitch shift ±2 semitones
Why MLP? Wav2Vec2 produces 1024-dimensional embeddings that capture highly non-linear speech phenomena; an MLP with hidden layers [128, 64] can learn these complex decision boundaries far more effectively than logistic regression, which is limited to linear separability in the original feature space.
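That intuition is easy to verify on a toy problem: an XOR-style dataset has no linear decision boundary, so logistic regression fails while the same MLP architecture fits it easily. A quick sanity check (toy data, not the project's real embeddings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# XOR-style toy data: the label depends on the *product* of the two features,
# so no single linear boundary can separate the classes.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

linear = LogisticRegression().fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0).fit(X, y)

print(f"logistic regression accuracy: {linear.score(X, y):.2f}")  # stuck near chance
print(f"MLP (128, 64) accuracy:       {mlp.score(X, y):.2f}")     # close to perfect
```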
Base URL: https://srv99x-voice-detector-live.hf.space
Interactive Docs: https://srv99x-voice-detector-live.hf.space/docs
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | `/` | No | Health check + API info |
| POST | `/detect-audio/` | Yes | Main detection endpoint |
Request Headers:

```http
Content-Type: application/json
x-api-key: YOUR_API_KEY
```

or

```http
Authorization: Bearer YOUR_API_KEY
```

Request Body:

```json
{
  "audio_base64": "<BASE64_ENCODED_AUDIO>",
  "audio_format": "mp3",
  "language": "en"
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `audio_base64` | string | Yes | Base64-encoded audio content |
| `audio_format` | string | Yes | `"mp3"`, `"wav"`, `"m4a"`, `"ogg"`, `"flac"` |
| `language` | string | No | Language hint (optional) |
Response:

```json
{
  "is_ai_generated": true,
  "confidence_score": 0.9842,
  "message": "Synthetic spectral patterns detected. AI-generated voice signature found."
}
```

| Field | Type | Description |
|---|---|---|
| `is_ai_generated` | boolean | `true` = AI-generated, `false` = human |
| `confidence_score` | float | Prediction probability between 0.0 and 1.0 |
| `message` | string | Human-readable result summary |
Error Codes:

| Code | Meaning |
|---|---|
| 401 | Missing or invalid API key |
| 400 | Invalid base64 or missing audio data |
| 500 | Internal server error |
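A client can branch on these codes before touching the response body. A small illustrative helper (not part of the API or its client libraries) that maps a response to a one-line summary:

```python
def interpret_response(status_code: int, body: dict) -> str:
    """Map a VoiceGuard API response to a readable one-liner (client-side convenience)."""
    if status_code == 401:
        return "auth error: check your x-api-key or Bearer token"
    if status_code == 400:
        return "bad request: audio_base64 missing or not valid base64"
    if status_code >= 500:
        return "server error: retry later or file a bug report"
    verdict = "AI-generated" if body["is_ai_generated"] else "human"
    return f"{verdict} ({body['confidence_score']:.1%} confidence)"
```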
cURL:

```bash
# Step 1: Encode your audio to base64
BASE64=$(base64 -w 0 your_audio.mp3)

# Step 2: Send the request
curl -X POST "https://srv99x-voice-detector-live.hf.space/detect-audio/" \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d "{\"audio_base64\": \"$BASE64\", \"audio_format\": \"mp3\", \"language\": \"en\"}"
```

Python:

```python
import requests
import base64

# Encode audio
with open("your_audio.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

# Call the API
response = requests.post(
    "https://srv99x-voice-detector-live.hf.space/detect-audio/",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "audio_base64": audio_b64,
        "audio_format": "mp3",
        "language": "en"
    }
)

result = response.json()
print(f"AI Generated: {result['is_ai_generated']}")
print(f"Confidence: {result['confidence_score']:.1%}")
print(f"Message: {result['message']}")
```

JavaScript:

```javascript
const audioFile = document.querySelector('input[type="file"]').files[0];
const reader = new FileReader();
reader.onload = async () => {
  const base64 = reader.result.split(',')[1];
  const response = await fetch('https://srv99x-voice-detector-live.hf.space/detect-audio/', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': 'YOUR_API_KEY'
    },
    body: JSON.stringify({ audio_base64: base64, audio_format: 'mp3', language: 'en' })
  });
  const result = await response.json();
  console.log(result);
};
reader.readAsDataURL(audioFile);
```

Prerequisites:

- Python 3.10+
- Node.js 18+
- `ffmpeg` installed
- Git
```bash
# 1. Clone the repository
git clone https://github.com/Srv99x/voice-detection-ai.git
cd voice-detection-ai

# 2. Create a virtual environment
python -m venv venv

# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Create a .env file
echo "SECRET_API_KEY=your_secret_key_here" > .env

# 5. Start the server
uvicorn main:app --reload
```

The API will be live at `http://127.0.0.1:8000`, with interactive Swagger docs at `http://127.0.0.1:8000/docs`.
```bash
# Navigate to the frontend
cd frontend

# Copy the env file and fill in values
cp .env.example .env
# Edit .env: set VITE_API_URL and VITE_API_KEY

# Install dependencies
npm install

# Start the dev server
npm run dev
```

The frontend will be live at `http://localhost:5173`.
Backend (`.env` in root):

```env
SECRET_API_KEY=your_super_secret_key_here
```

Frontend (`frontend/.env`):

```env
VITE_API_URL=http://localhost:8000
VITE_API_KEY=your_super_secret_key_here
```

```bash
# Build the image
docker build -t voiceguard-ai .

# Run the container
docker run -p 7860:7860 -e SECRET_API_KEY=your_key_here voiceguard-ai
```

The API will be available at `http://localhost:7860`.
If you want to retrain the classifier on your own dataset:
```text
voice-detection-ai/
└── dataset/
    ├── real/              # Human voice recordings
    │   ├── english/
    │   ├── hindi/
    │   └── ...
    └── ai/                # AI/TTS-generated audio
        ├── elevenlabs/
        ├── murf/
        └── ...
```

Supported formats: `.mp3`, `.wav`, `.m4a`; subfolders are scanned recursively.
```bash
python train_model.py
```

The script will:
- Load and augment all audio files (4x per file: original, noise, pitch ±2 semitones)
- Extract 1024-dim embeddings via `wav2vec2-large-xlsr-53`
- Train an `MLPClassifier(hidden_layer_sizes=(128, 64))`
- Save `hackathon_model.pkl` + `model_scaler.pkl`
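Of the augmentation variants, the additive-noise one can be sketched with NumPy alone. The SNR target below is an assumption for illustration; `train_model.py` may pick noise levels differently, and the pitch-shift variants would go through `librosa.effects.pitch_shift`:

```python
import numpy as np

def add_gaussian_noise(y: np.ndarray, snr_db: float = 20.0, seed: int = 0) -> np.ndarray:
    """Return a noisy copy of waveform y at the target signal-to-noise ratio (dB)."""
    rng = np.random.default_rng(seed)
    signal_power = float(np.mean(y ** 2))
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return y + noise

# Example: a 1-second 440 Hz tone at 16 kHz, augmented once
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t).astype(np.float32)
noisy = add_gaussian_noise(clean, snr_db=20.0)
```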
```text
voice-detection-ai/
│
├── Backend
│   ├── main.py                   # FastAPI app + detection engine
│   ├── train_model.py            # ML training pipeline with augmentation
│   ├── test_api.py               # API test suite (4 test cases)
│   ├── setup_ffmpeg.py           # FFmpeg installation helper
│   ├── hackathon_model.pkl       # Trained MLP classifier (~4.3 MB)
│   ├── model_scaler.pkl          # Fitted StandardScaler (~24 KB)
│   ├── requirements.txt          # Python dependencies
│   ├── Dockerfile                # Container config (port 7860 for HF Spaces)
│   └── .github/workflows/ci.yml  # GitHub Actions CI pipeline
│
└── Frontend (frontend/)
    ├── src/
    │   ├── App.jsx               # Main React app (matrix rain, drag-drop, results)
    │   ├── index.css             # Full cyberpunk design system (644 lines)
    │   └── main.jsx              # React entry point
    ├── public/                   # Static assets
    ├── package.json              # Vite 7 + React 19
    ├── vite.config.js            # Vite configuration
    └── vercel.json               # SPA routing config for Vercel
```
The backend is deployed as a Docker Space on Hugging Face. The Dockerfile is pre-configured for the HF Spaces environment (port 7860).
To deploy your own fork:
- Create a new Space on huggingface.co/spaces
- Set the Space SDK to Docker
- Push this repo to your Space's git remote
- Add `SECRET_API_KEY` in Space Settings → Repository secrets
Or manually:

```bash
cd frontend
npx vercel --prod
```

Set environment variables in the Vercel dashboard:
- `VITE_API_URL`: your Hugging Face Space URL
- `VITE_API_KEY`: your secret API key
- Short clips (under 2 seconds) may yield lower confidence scores
- Professional-grade voice clones from high-fidelity TTS may occasionally bypass detection
- Background noise can reduce classification reliability
- Language bias: the model is strongest on English; multilingual performance depends on training-data diversity
- Large audio files increase base64 payload size and transfer time
- Streaming audio input (WebSocket API)
- Support more formats: OGG, FLAC, M4A via API
- Waveform visualizer on frontend
- History tab with past analysis results
- Confidence threshold configuration
- Batch file analysis endpoint
- Public leaderboard of tested TTS engines
Contributions are welcome and appreciated!
```bash
# Fork the repo, then:
git checkout -b feature/your-amazing-feature
git commit -m "feat: add your amazing feature"
git push origin feature/your-amazing-feature
# Open a Pull Request!
```

Please read CONTRIBUTING.md before submitting changes.
Good first issues to tackle:
- Improve model accuracy with larger/more diverse datasets
- Add real waveform visualization using Web Audio API
- Add OGG/FLAC/M4A support to the API
- Write more comprehensive unit tests
- Add dark/light mode toggle to frontend
| Layer | Technology |
|---|---|
| Speech Model | facebook/wav2vec2-large-xlsr-53 |
| Classifier | scikit-learn MLPClassifier |
| Audio Processing | librosa, soundfile, pydub |
| Backend Framework | FastAPI + Uvicorn |
| Frontend | React 19 + Vite 7 |
| Styling | Vanilla CSS (glassmorphism + cyberpunk) |
| Containerization | Docker |
| Backend Hosting | Hugging Face Spaces |
| Frontend Hosting | Vercel |
| CI/CD | GitHub Actions |
This project is licensed under the MIT License β see the LICENSE file for details.