
EigenBeats

Beyond the Playlist: Using Linear Algebra to Discover a Song's "Audio DNA"

EigenBeats is a content-based music similarity engine that uses machine learning to find perceptually similar songs. Instead of relying on subjective genre tags or collaborative filtering, it analyzes the audio content of music directly to compute a unique mathematical "fingerprint" or "Audio DNA" for each track.

This project is inspired by the idea that the core essence of a song—its timbre, harmony, and energy—can be represented mathematically.

Core Concepts

The project is divided into two main versions, as outlined in the research report:

  1. EigenBeats v1.0: A foundational model using Principal Component Analysis (PCA). It extracts low-level audio features (MFCCs and Chroma) and uses PCA to find the principal components of musical variation across a dataset. This creates a compact and efficient "Audio DNA" for each song.

  2. EigenBeats v2.0: An advanced model using a Deep Learning Autoencoder. This version learns a more nuanced and powerful representation of the audio by training a neural network to compress and reconstruct song spectrograms. The compressed representation from the autoencoder's "bottleneck" layer serves as the new "Audio DNA".

Similarity between songs is measured by calculating the Cosine Similarity between their "Audio DNA" vectors.
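For illustration, cosine similarity between two DNA vectors can be computed with NumPy in a few lines (a minimal sketch; the helper name is ours, not the repo's API):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a.b / (|a| |b|); ranges from -1 (opposite) to +1 (same direction)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dna_a = np.array([0.5, -1.2, 0.3])
dna_b = np.array([0.4, -1.0, 0.2])
print(cosine_similarity(dna_a, dna_b))
```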

Technology Stack

This project is developed in Python 3.9+ and primarily uses the following libraries:

  • Audio Processing: librosa
  • Numerical & ML: numpy, scikit-learn, pandas
  • Deep Learning (for v2.0): tensorflow or pytorch
  • Data Handling: tqdm for progress bars, pickle or joblib for saving models.

Project Structure

.
├── data/
│   ├── raw/          # Raw audio files (e.g., from FMA)
│   └── processed/    # Processed features and DNA vectors
├── notebooks/        # Jupyter notebooks for exploration and analysis
├── src/              # Source code for the project
│   ├── __init__.py
│   ├── config.py     # Configuration variables
│   ├── data_loader.py # Scripts for loading and processing data
│   ├── feature_extraction.py # Feature extraction logic
│   ├── process_features.py   # Feature extraction pipeline (Step 1 below)
│   ├── models/       # PCA and Autoencoder model code
│   │   └── train_pca.py
│   └── query.py      # Script to query for similar songs
├── tests/            # Test scripts
├── ARCHITECTURE.md
├── database.md
├── plan.md
├── README.md
├── test.md
├── UI.md
├── create_sample_audio.py  # Sample data generator
└── teach.md               # Teaching guide

Installation & Setup

Prerequisites

  • Python 3.9+
  • pip package manager
  • Audio files (MP3, WAV, FLAC, OGG, or M4A format)

Step 1: Clone and Setup Environment

# Clone the repository
git clone https://github.com/Chamath-Adithya/EigenBeats.git
cd EigenBeats

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install numpy librosa scikit-learn tqdm joblib matplotlib pydub

Step 2: Prepare Audio Data

Add your audio files to the data/raw/ directory:

# Supported formats: MP3, WAV, FLAC, OGG, M4A
mkdir -p data/raw
# Copy your audio files here, e.g.:
# cp ~/Music/*.mp3 data/raw/

For Testing: Use the included sample generator:

python create_sample_audio.py

Usage Guide

Complete Pipeline (3 Steps)

Step 1: Extract Audio Features

source venv/bin/activate  # Activate virtual environment
python -m src.process_features

What it does:

  • Scans data/raw/ for audio files
  • Extracts MFCCs (20 coefficients) and Chroma features (12 bins)
  • Aggregates features using mean and standard deviation
  • Saves 64-dimensional feature vectors to data/processed/features_v1.pkl

Output:

Found 8000 audio files in /home/chamath-adithya/Documents/EigenBeats/data/raw
Sample files: ['000002', '000005', '000010', '000140', '000141']...
Extracting features from audio files...
100%|████████████████████████████████████████| 8000/8000 [18:21<00:00, 7.27it/s]
Successfully processed 7997 out of 8000 files
Saved features to /home/chamath-adithya/Documents/EigenBeats/data/processed/features_v1.pkl
Feature matrix shape: 7997 x 64

Step 2: Train PCA Model

source venv/bin/activate  # Activate virtual environment (if not already)
python -m src.models.train_pca

What it does:

  • Loads extracted features
  • Standardizes data (mean=0, variance=1)
  • Determines optimal number of PCA components (95% variance)
  • Trains PCA model and creates "Audio DNA" vectors
  • Saves model, scaler, and DNA vectors
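That training step can be sketched with scikit-learn as follows (the function name is illustrative, assuming features arrive as a plain NumPy matrix):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def train_audio_dna(features, target_variance=0.95):
    scaler = StandardScaler()
    X = scaler.fit_transform(features)          # standardize: mean=0, variance=1
    pca = PCA(n_components=target_variance)     # keep enough components for 95% variance
    dna = pca.fit_transform(X)                  # the "Audio DNA" vectors
    return pca, scaler, dna
```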

Output:

Starting PCA training for EigenBeats v1.0...
Loaded feature matrix: (10, 64)
Explained variance with 6 components: 0.965
Optimal number of PCA components: 6
Trained PCA with 6 components
Created Audio DNA with shape: (10, 6)
PCA training completed successfully!

Step 3: Find Similar Songs

# Activate virtual environment (if not already)
source venv/bin/activate

# Basic query (replace 000002 with a valid track ID from your dataset)
python -m src.query --track_id 000002

# Get top 5 similar songs
python -m src.query --track_id 000002 --top_k 5

Sample Output:

🎵 EigenBeats: Songs similar to '000002'
 1. 110263          - Similarity:  67.5%
 2. 144941          - Similarity:  65.3%
 3. 110261          - Similarity:  64.4%
 4. 145703          - Similarity:  64.3%
 5. 078848          - Similarity:  61.3%
==================================================
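Under the hood, the query amounts to ranking all DNA vectors by cosine similarity to the query track. A minimal sketch (helper name and data layout are assumptions, not the repo's actual API):

```python
import numpy as np

def top_k_similar(dna, track_ids, query_id, k=5):
    # L2-normalize rows so dot products equal cosine similarities
    unit = dna / np.clip(np.linalg.norm(dna, axis=1, keepdims=True), 1e-12, None)
    q = unit[track_ids.index(query_id)]
    sims = unit @ q
    ranked = np.argsort(-sims)                  # most similar first
    return [(track_ids[i], float(sims[i]))
            for i in ranked if track_ids[i] != query_id][:k]
```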

Understanding the Output

Similarity Scores

  • Range: -100% to +100% (cosine similarity)
  • Interpretation:
    • High positive (>50%): Very similar songs
    • Moderate positive (10-50%): Somewhat similar
    • Near zero/low negative: Different songs
    • High negative (<-50%): Very different/opposite characteristics

Audio DNA Vectors

  • Dimensions: Typically 6-10 components (captures 95%+ variance)
  • Representation: Mathematical fingerprint of audio content
  • Storage: NumPy array saved as data/processed/audio_dna_pca.npy

Feature Analysis

  • MFCCs: 20 coefficients × 2 stats (mean, std) = 40 features
  • Chroma: 12 bins × 2 stats (mean, std) = 24 features
  • Total: 64-dimensional feature space per song

Quick Start with Sample Data

# 1. Setup environment
python3 -m venv venv
source venv/bin/activate
pip install numpy librosa scikit-learn tqdm joblib matplotlib pydub

# 2. Generate sample audio
python create_sample_audio.py

# 3. Process features
python -m src.process_features

# 4. Train model
python -m src.models.train_pca

# 5. Query similar songs (replace 000002 with a valid track ID)
python -m src.query --track_id 000002

Troubleshooting

Common Issues

"ModuleNotFoundError: No module named 'src'"

# Set Python path
export PYTHONPATH=/path/to/EigenBeats:$PYTHONPATH
# Or run with:
PYTHONPATH=/path/to/EigenBeats python src/process_features.py

"No audio files found"

  • Check that audio files are in data/raw/
  • Ensure files have supported extensions (.mp3, .wav, .flac, .ogg, .m4a)
  • Verify file permissions

"Track ID not found"

  • Use exact filename without extension as track ID
  • List available tracks: check data/processed/features_v1.pkl or run data_loader.py

File Structure After Processing

data/
├── raw/
│   ├── song1.mp3
│   └── song2.wav
└── processed/
    ├── features_v1.pkl          # Raw features + track IDs
    ├── audio_dna_pca.npy        # PCA-transformed DNA vectors
    ├── pca_model.pkl            # Trained PCA model
    ├── scaler.pkl               # Feature scaler
    └── pca_variance_analysis.png # Variance plot

Advanced Usage

Custom Feature Extraction

Modify src/feature_extraction.py to add new features:

# Add spectral centroid features
import numpy as np
import librosa

def extract_spectral_centroid(audio, sr):
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)
    return np.mean(centroid), np.std(centroid)

Different Similarity Metrics

The system uses cosine similarity, but you can modify src/query.py to use:

  • Euclidean distance
  • Manhattan distance
  • Pearson correlation
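For example, a Euclidean-based score could be swapped in like this (a sketch; mapping distance to a bounded similarity is one common choice, not the repo's code):

```python
import numpy as np

def euclidean_similarity(a, b):
    # map distance [0, inf) to similarity (0, 1]; identical vectors score 1.0
    return 1.0 / (1.0 + np.linalg.norm(np.asarray(a) - np.asarray(b)))
```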

Batch Processing

For large datasets, modify src/process_features.py to:

  • Process files in batches
  • Save intermediate results
  • Resume processing on interruption
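One way to sketch that resumable batching (the checkpoint path and helper names are illustrative):

```python
import os
import joblib

def process_in_batches(paths, extract, checkpoint="data/processed/partial.pkl",
                       batch_size=100):
    # resume from an existing checkpoint if one is present
    done = joblib.load(checkpoint) if os.path.exists(checkpoint) else {}
    todo = [p for p in paths if p not in done]
    for start in range(0, len(todo), batch_size):
        for path in todo[start:start + batch_size]:
            done[path] = extract(path)
        joblib.dump(done, checkpoint)   # save intermediate results after each batch
    return done
```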
