AI Music Detector

A model, training and inference scripts, and a C# library for detecting music generated by Suno (versions up to 5) and Udio (versions up to 1.5).

Model Description

This model detects AI-generated music by exploiting spectral artifacts inherent to neural vocoders. Most audio generators rely on deconvolution layers to upsample latent representations back to audio sample rates. These layers leave predictable fingerprints in the frequency domain.

Deconvolution Artifacts

A deconvolution (transposed convolution) with stride $k$ is equivalent to two sequential operations:

1. Zero-upsampling: Insert $k-1$ zeros between consecutive samples
2. Convolution: Apply the learned kernel

The zero-upsampling is equivalent to oversampling a discrete signal. For a signal $s$ sampled at frequency $f_s$, the zero-upsampled version $v$ with stride $k$ satisfies:

$$v = s \cdot \amalg_{1/kf_s}$$

where $\amalg_T$ denotes a Dirac comb with period $T$. From the Fourier duality of Dirac combs:

$$\mathcal{F}[\amalg_T] = \frac{1}{T}\amalg_{1/T}$$

This means the spectrum of $v$ is read up to frequency $kf_s$ instead of $f_s$, causing periodic replication of the original spectrum. The high-energy DC component (bias from activations and layer outputs) gets cloned throughout the frequency range, creating characteristic peaks.
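
The replication can be checked numerically. The sketch below is not part of this repository: it uses a naive DFT built on the standard library (the helper names `dft` and `zero_upsample` are ours, and the toy signal is arbitrary) to show that inserting zeros tiles the original spectrum $k$ times, so a DC offset reappears with equal strength at every multiple of the original sample rate.

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform, fine for short demo signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def zero_upsample(x, k):
    """Insert k-1 zeros after every sample (first half of a stride-k deconvolution)."""
    out = [0.0] * (len(x) * k)
    out[::k] = x
    return out

# Toy signal with a clear DC offset (mean well above zero).
s = [1.0, 0.7, 1.6, 1.0, 1.3, 0.4, 1.0, 0.7]
k = 4

S = dft(s)
V = dft(zero_upsample(s, k))

# The spectrum of the upsampled signal is the original spectrum repeated k times.
tiled = S * k
max_err = max(abs(a - b) for a, b in zip(V, tiled))
print(f"max deviation from tiled spectrum: {max_err:.2e}")

# The DC bin (index 0) is cloned at bins 8, 16 and 24, i.e. at multiples of f_s.
dc_copies = [abs(V[n * len(s)]) for n in range(k)]
print([round(c, 6) for c in dc_copies])
```

The learned kernel applied afterwards only reshapes these copies; unless it happens to place a zero exactly on them, the peaks survive into the output.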

Peak Locations

For a single deconvolution with stride $k$, peaks appear at frequencies $n \cdot f_s$ for all integers $n \in [0, \lfloor k/2 \rfloor]$.

For $L$ stacked deconvolution layers with strides $\{k_1, k_2, \ldots, k_L\}$, artifacts compound recursively: each layer replicates not just the DC component but all previous peaks. The total number of peaks is:

$$P = \left\lfloor \frac{\prod_{i=1}^{L} k_i}{2} \right\rfloor + 1$$
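
As a worked example of the formula, the sketch below lists the predicted peak frequencies for a given stride configuration. It assumes, as our reading of the text, that $f_s$ is the sample rate at the input of the first deconvolution, so the output rate is $f_s \prod_i k_i$. The function name `expected_peaks` and the stride values are hypothetical and are not taken from any real generator.

```python
from math import prod

def expected_peaks(strides, output_rate_hz):
    """Return the artifact peak frequencies predicted by P = floor(prod(k)/2) + 1.

    Assumption: f_s is the rate at the input of the first deconvolution,
    so output_rate_hz equals f_s times the product of all strides.
    """
    total_stride = prod(strides)
    base_rate = output_rate_hz / total_stride   # f_s
    count = total_stride // 2 + 1               # P
    return [n * base_rate for n in range(count)]

# Hypothetical stride configuration, chosen only for illustration.
peaks = expected_peaks([2, 2, 2], output_rate_hz=16000)
print(len(peaks), peaks)   # 5 [0.0, 2000.0, 4000.0, 6000.0, 8000.0]
```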

Architecture Fingerprinting

These artifacts depend only on the stride configuration, not on training data or learned weights. This has two implications:

- High accuracy on known architectures: The spectral fingerprint is deterministic and consistent across all outputs from a given generator
- Requires retraining for new architectures: Different vocoder designs produce different peak patterns, so the model must be updated when generators change their architecture

We extract a fakeprint by computing the average spectrum, subtracting its lower envelope (to isolate peaks from melodic content), and analyzing the 1-8 kHz band where artifacts are most prominent.
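
The envelope-subtraction step can be sketched as follows. This is a simplified illustration, not the code in `extract_fakeprints.py`: it assumes the average spectrum is already available in dB, and it uses a sliding-window minimum as a stand-in for the lower envelope, since the exact estimator is not described here. The names `lower_envelope` and `fakeprint` and the `half_window` parameter are ours.

```python
def lower_envelope(values, half_window):
    """Sliding-window minimum: a simple stand-in for a lower-envelope estimate."""
    n = len(values)
    return [min(values[max(0, i - half_window):min(n, i + half_window + 1)])
            for i in range(n)]

def fakeprint(avg_spectrum_db, freqs_hz, f_lo=1000.0, f_hi=8000.0, half_window=5):
    """Subtract the lower envelope and keep only the f_lo..f_hi band."""
    env = lower_envelope(avg_spectrum_db, half_window)
    residual = [v - e for v, e in zip(avg_spectrum_db, env)]
    return [(f, r) for f, r in zip(freqs_hz, residual) if f_lo <= f <= f_hi]

# Synthetic average spectrum: a smooth downward tilt with two narrow peaks.
freqs = [i * 100.0 for i in range(101)]        # 0 Hz to 10 kHz in 100 Hz steps
spectrum = [-0.002 * f for f in freqs]         # smooth "melodic" baseline, in dB
for peak_hz in (2000.0, 4000.0):
    spectrum[freqs.index(peak_hz)] += 12.0     # injected artifact peaks

fp = fakeprint(spectrum, freqs)
strongest = sorted(fp, key=lambda fr: fr[1], reverse=True)[:2]
print(sorted(f for f, _ in strongest))   # [2000.0, 4000.0]
```

Because the envelope follows the smooth baseline from below, the residual stays small wherever the spectrum varies slowly and is large only at the narrow peaks.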

Quick Start

C# Usage

using AiMusicDetector;

// Load the detector with your trained model
using var detector = MusicDetector.Load("ai_music_detector.onnx");

// Analyze an audio file
var result = detector.Analyze("song.mp3");

Console.WriteLine($"AI Probability: {result.AiProbability:P1}");
Console.WriteLine($"Classification: {result.Classification}");
Console.WriteLine($"Confidence: {result.Confidence:P0}");

Installation

# NuGet (once published)
dotnet add package AiMusicDetector

# Or add project reference
dotnet add reference src/AiMusicDetector/AiMusicDetector.csproj

In-process FFmpeg decoding (FFmpeg.AutoGen)

For the closest match to Python/torchaudio/FFmpeg decoding, the C# pipeline can decode MP3 using in-process FFmpeg (no ffmpeg.exe) via FFmpeg.AutoGen.

Override library path:
- Set AIMUSICDETECTOR_FFMPEG_LIBS to the folder containing FFmpeg shared libraries (avcodec, avformat, swresample, etc.)

Bundled binaries layout (recommended):
- FFmpeg/bin/win-x64/*.dll
- FFmpeg/bin/win-arm64/*.dll
- FFmpeg/bin/linux-x64/*.so
- FFmpeg/bin/linux-arm64/*.so
- FFmpeg/bin/osx-x64/*.dylib
- FFmpeg/bin/osx-arm64/*.dylib

The loader also supports the legacy AutoGen example layout on Windows:

- FFmpeg/bin/x64/*.dll

If no in-process FFmpeg libraries are found, the code will fall back to the other decoding path(s).
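
To make the layout concrete, the sketch below lists the folders such a loader might probe. It is written in Python for illustration only and is not the C# loader itself. The search order (environment override first, then the bundled folder, then the legacy Windows folder) is an assumption, and `ffmpeg_candidate_dirs` is a hypothetical name.

```python
import os

def ffmpeg_candidate_dirs(base_dir, os_name, arch, env=None):
    """List folders to probe for FFmpeg shared libraries.

    Assumed order: explicit override, bundled layout, legacy Windows layout.
    os_name is one of "win", "linux", "osx"; arch is "x64" or "arm64".
    """
    env = os.environ if env is None else env
    candidates = []
    override = env.get("AIMUSICDETECTOR_FFMPEG_LIBS")
    if override:
        candidates.append(override)
    candidates.append(f"{base_dir}/FFmpeg/bin/{os_name}-{arch}")
    if os_name == "win" and arch == "x64":
        candidates.append(f"{base_dir}/FFmpeg/bin/x64")   # legacy AutoGen layout
    return candidates

print(ffmpeg_candidate_dirs("app", "linux", "arm64", env={}))
print(ffmpeg_candidate_dirs("app", "win", "x64",
                            env={"AIMUSICDETECTOR_FFMPEG_LIBS": "C:/ffmpeg"}))
```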

Training Your Own Model

Prerequisites

- Miniconda or Anaconda
- CUDA-capable GPU (optional, speeds up feature extraction)
- ~200 GB disk space for datasets

Step 1: Setup Environment

cd python

# Windows
setup_env.bat

# Linux/macOS
chmod +x setup_env.sh
./setup_env.sh

# Or manually:
conda env create -f environment.yml
conda activate ai-music-detector

Step 2: Download Datasets

# Activate environment
conda activate ai-music-detector

# Download FMA (real music) and SONICS (AI-generated)
python download_data.py --dataset all

Datasets:

- FMA Medium: 25,000 real music tracks (22 GB)
- SONICS: 49,000+ AI-generated songs from Suno/Udio (~150 GB)

Step 3: Extract Fakeprints

# Extract features from real music
python extract_fakeprints.py \
    --input ./data/fma/fma_medium \
    --output ./output/fma_fakeprints.npy \
    --label real

# Extract features from AI-generated music
python extract_fakeprints.py \
    --input ./data/sonics/fake_songs \
    --output ./output/sonics_fakeprints.npy \
    --label fake

Step 4: Train Model

python train_model.py \
    --real ./output/fma_fakeprints.npy \
    --fake ./output/sonics_fakeprints.npy

Step 5: Export to ONNX

python export_onnx.py --model ./models

The exported model is saved to ./models/ai_music_detector.onnx.

Alternative: CNN Model (Robust to Audio Modifications)

For better robustness against pitch shifts, EQ changes, mastering, and transcoding, train the CNN-based model:

# Train CNN with on-the-fly augmentations
python train_cnn.py \
    --real ./data/fma/fma_medium \
    --fake ./data/sonics/fake_songs

# Export to ONNX
python export_onnx_cnn.py

# Inference
python inference_cnn.py --model ./models/cnn_detector.onnx

The CNN model takes CQT spectrograms as input. On their log-frequency axis a pitch shift becomes a translation along the frequency axis, which convolutional layers tolerate well. Training also applies data augmentation: pitch shifting, EQ, compression, and noise injection.
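
One way to see why a log-frequency axis helps with pitch shifts: on such an axis, shifting by a fixed number of semitones moves every partial by the same number of bins, whatever its starting frequency. The sketch below checks this with plain arithmetic. The `fmin_hz` and `bins_per_octave` values are illustrative assumptions, not the settings used by `train_cnn.py`.

```python
import math

def cqt_bin(freq_hz, fmin_hz=32.70, bins_per_octave=12):
    """Fractional bin position of a frequency on a log-frequency (CQT) axis."""
    return bins_per_octave * math.log2(freq_hz / fmin_hz)

def pitch_shift(freq_hz, semitones):
    """Frequency after shifting by a number of equal-tempered semitones."""
    return freq_hz * 2.0 ** (semitones / 12.0)

# A +2 semitone shift moves low and high partials by the same bin offset.
offsets = []
for f in (110.0, 440.0, 3520.0):
    moved = cqt_bin(pitch_shift(f, 2)) - cqt_bin(f)
    offsets.append(moved)
    print(f"{f:7.1f} Hz -> shifted by {moved:.3f} bins")
```

On a linear-frequency axis the same shift would move the 3520 Hz partial 32 times further in Hz than the 110 Hz one, so the pattern would stretch rather than slide.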

API Reference

MusicDetector

// Load model from a file path
using var detector = MusicDetector.Load("model.onnx");

// Or load from model data already in memory (modelBytes is a placeholder)
using var detectorFromBytes = MusicDetector.Load(modelBytes);

// Analyze files
DetectionResult result = detector.Analyze("song.mp3");
float probability = detector.Predict("song.mp3");
bool isAi = detector.IsAiGenerated("song.mp3");

// Analyze samples directly (LoadYourAudio is a placeholder for your own decoder)
float[] samples = LoadYourAudio();
var sampleResult = detector.Analyze(samples, sampleRate: 44100);

// Batch processing
var results = detector.AnalyzeBatch(new[] { "song1.mp3", "song2.mp3" });

DetectionResult

public class DetectionResult
{
    float AiProbability;       // 0.0 (Real) to 1.0 (AI)
    bool IsAiGenerated;        // true if probability > threshold
    float Confidence;          // 0.0 (uncertain) to 1.0 (certain)
    string Classification;     // "Real Music" or "AI-Generated"
    double AudioDurationSeconds;
    long ProcessingTimeMs;
}

Configuration

var options = new MusicDetectorOptions
{
    SampleRate = 44100,
    MaxDurationSeconds = 180,
    Threshold = 0.5f,
    UseGpu = false
};

using var detector = MusicDetector.Load("model.onnx", options);

Command Line Tool

# Build
dotnet build src/AiMusicDetector.Console

# Run
dotnet run --project src/AiMusicDetector.Console -- model.onnx song1.mp3 song2.mp3

Performance

Evaluated on a held-out test set of 17,866 samples (5,741 real, 12,125 AI-generated).

Metric	Value
Accuracy	99.88%
Precision	0.9985
Recall	0.9998
F1 Score	0.9991
False Positive Rate	0.31%
False Negative Rate	0.02%
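
For readers who want to see how these figures relate, the sketch below recomputes them from confusion-matrix counts. The counts (18 false positives, 3 false negatives) are not reported by the source; we inferred them as the values consistent with the test-set sizes and the rounded rates above, so treat them as an assumption.

```python
# Inferred counts (assumption): not listed in the source, chosen to match
# the reported test-set sizes and rounded metrics.
real_total, ai_total = 5741, 12125
fp, fn = 18, 3                       # real flagged as AI, AI tracks missed
tp, tn = ai_total - fn, real_total - fp

accuracy = (tp + tn) / (real_total + ai_total)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)                 # false positive rate
fnr = fn / (fn + tp)                 # false negative rate

print(f"Accuracy            {accuracy:.2%}")
print(f"Precision           {precision:.4f}")
print(f"Recall              {recall:.4f}")
print(f"F1 Score            {f1:.4f}")
print(f"False Positive Rate {fpr:.2%}")
print(f"False Negative Rate {fnr:.2%}")
```

With these counts, each printed value rounds to the corresponding entry in the table.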

Limitations

- Sample Rate Dependent: Audio must be resampled to 16000 Hz
- Minimum Duration: Works best with 10+ seconds of audio
- Evolving Generators: Needs retraining on new generations of AI music generators

Acknowledgements

This implementation is based on the fakeprint detection method proposed by Afchar et al. [1], which demonstrates that neural vocoders in generative music models produce characteristic frequency-domain artifacts due to their deconvolution architecture.

References

[1] D. Afchar, G. Meseguer-Brocal, K. Akesbi, and R. Hennequin, "A Fourier Explanation of AI-music Artifacts," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2025. Available: https://arxiv.org/abs/2506.19108

License

MIT License
