Beyond the Playlist: Using Linear Algebra to Discover a Song's "Audio DNA"
EigenBeats is a content-based music similarity engine that uses machine learning to find perceptually similar songs. Instead of relying on subjective genre tags or collaborative filtering, it analyzes the audio content of music directly to compute a unique mathematical "fingerprint" or "Audio DNA" for each track.
This project is inspired by the idea that the core essence of a song—its timbre, harmony, and energy—can be represented mathematically.
The project is divided into two main versions, as outlined in the research report:
- EigenBeats v1.0: A foundational model using Principal Component Analysis (PCA). It extracts low-level audio features (MFCCs and Chroma) and uses PCA to find the principal components of musical variation across a dataset. This creates a compact and efficient "Audio DNA" for each song.
- EigenBeats v2.0: An advanced model using a Deep Learning Autoencoder. This version learns a more nuanced and powerful representation of the audio by training a neural network to compress and reconstruct song spectrograms. The compressed representation from the autoencoder's "bottleneck" layer serves as the new "Audio DNA".
Similarity between songs is measured by calculating the Cosine Similarity between their "Audio DNA" vectors.
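As a quick sketch of that measure (using made-up three-dimensional DNA vectors; the real vectors have more dimensions), cosine similarity can be computed with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two "Audio DNA" vectors:
    # 1.0 = identical direction, 0.0 = unrelated, -1.0 = opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dna_a = np.array([0.8, -0.2, 0.5])  # hypothetical DNA for song A
dna_b = np.array([0.7, -0.1, 0.6])  # hypothetical DNA for song B

print(f"Similarity: {cosine_similarity(dna_a, dna_b) * 100:.1f}%")
```

Because cosine similarity compares direction rather than magnitude, two songs with proportionally similar feature profiles score highly even if one is "louder" overall in feature space.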
This project is developed in Python 3.9+ and primarily uses the following libraries:
- Audio Processing: `librosa`
- Numerical & ML: `numpy`, `scikit-learn`, `pandas`
- Deep Learning (for v2.0): `tensorflow` or `pytorch`
- Data Handling: `tqdm` for progress bars, `pickle` or `joblib` for saving models
Project structure:

```
.
├── data/
│   ├── raw/                   # Raw audio files (e.g., from FMA)
│   └── processed/             # Processed features and DNA vectors
├── notebooks/                 # Jupyter notebooks for exploration and analysis
├── src/                       # Source code for the project
│   ├── __init__.py
│   ├── config.py              # Configuration variables
│   ├── data_loader.py         # Scripts for loading and processing data
│   ├── feature_extraction.py  # Feature extraction logic
│   ├── models/                # PCA and Autoencoder model code
│   │   └── train_pca.py
│   └── query.py               # Script to query for similar songs
├── tests/                     # Test scripts
├── ARCHITECTURE.md
├── database.md
├── plan.md
├── README.md
├── test.md
├── UI.md
├── create_sample_audio.py     # Sample data generator
└── teach.md                   # Teaching guide
```
Prerequisites:
- Python 3.9+
- pip package manager
- Audio files (MP3, WAV, FLAC, OGG, or M4A format)
```bash
# Clone the repository
git clone https://github.com/Chamath-Adithya/EigenBeats.git
cd EigenBeats

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install numpy librosa scikit-learn tqdm joblib matplotlib pydub
```

Add your audio files to the data/raw/ directory:
```bash
# Supported formats: MP3, WAV, FLAC, OGG, M4A
mkdir -p data/raw

# Copy your audio files here, e.g.:
# cp ~/Music/*.mp3 data/raw/
```

For testing, use the included sample generator:

```bash
python create_sample_audio.py
```

Then extract features:

```bash
source venv/bin/activate  # Activate virtual environment
python -m src.process_features
```

What it does:
- Scans `data/raw/` for audio files
- Extracts MFCCs (20 coefficients) and Chroma features (12 bins)
- Aggregates features using mean and standard deviation
- Saves 64-dimensional feature vectors to `data/processed/features_v1.pkl`
Output:
```
Found 8000 audio files in /home/chamath-adithya/Documents/EigenBeats/data/raw
Sample files: ['000002', '000005', '000010', '000140', '000141']...
Extracting features from audio files...
100%|████████████████████████████████████████| 8000/8000 [18:21<00:00, 7.27it/s]
Successfully processed 7997 out of 8000 files
Saved features to /home/chamath-adithya/Documents/EigenBeats/data/processed/features_v1.pkl
Feature matrix shape: 7997 x 64
```
```bash
source venv/bin/activate  # Activate virtual environment (if not already)
python -m src.models.train_pca
```

What it does:
- Loads extracted features
- Standardizes data (mean=0, variance=1)
- Determines optimal number of PCA components (95% variance)
- Trains PCA model and creates "Audio DNA" vectors
- Saves model, scaler, and DNA vectors
Output:
```
Starting PCA training for EigenBeats v1.0...
Loaded feature matrix: (10, 64)
Explained variance with 6 components: 0.965
Optimal number of PCA components: 6
Trained PCA with 6 components
Created Audio DNA with shape: (10, 6)
PCA training completed successfully!
```
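The training steps can be sketched with scikit-learn. Here a random matrix stands in for the real 64-dimensional feature matrix, and the variable names are illustrative rather than taken from src/models/train_pca.py:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 64))  # stand-in for the 64-dim feature matrix

# Standardize each feature to mean=0, variance=1.
scaler = StandardScaler()
X = scaler.fit_transform(features)

# Fit a full PCA once, then pick the smallest component count whose
# cumulative explained variance reaches 95%.
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Refit with that count; each row of audio_dna is one song's DNA vector.
pca = PCA(n_components=n_components).fit(X)
audio_dna = pca.transform(X)
print(audio_dna.shape)
```

With uncorrelated random data, nearly all 64 components are needed; real audio features are highly correlated, which is why a handful of components can capture 95% of the variance.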
```bash
# Activate virtual environment (if not already)
source venv/bin/activate

# Basic query (replace 000002 with a valid track ID from your dataset)
python -m src.query --track_id 000002

# Get top 5 similar songs
python -m src.query --track_id 000002 --top_k 5
```

Sample output:
With the FMA dataset:

```
🎵 EigenBeats: Songs similar to '000002'
1. 110263 - Similarity: 67.5%
2. 144941 - Similarity: 65.3%
3. 110261 - Similarity: 64.4%
4. 145703 - Similarity: 64.3%
5. 078848 - Similarity: 61.3%
```

With the generated sample audio:

```
==================================================
1. sample_008 - Similarity: 10.6%
2. sample_003 - Similarity: 2.2%
3. sample_004 - Similarity: -4.1%
4. sample_005 - Similarity: -11.0%
5. sample_006 - Similarity: -16.9%
==================================================
```
- Range: -100% to +100% (cosine similarity expressed as a percentage)
- Interpretation:
  - High positive (>50%): very similar songs
  - Moderate positive (10-50%): somewhat similar
  - Near zero or slightly negative: different songs
  - High negative (<-50%): very different/opposite characteristics
- Dimensions: typically 6-10 components (capturing 95%+ of variance)
- Representation: mathematical fingerprint of audio content
- Storage: NumPy array saved as `data/processed/audio_dna_pca.npy`
- MFCCs: 20 coefficients × 2 stats (mean, std) = 40 features
- Chroma: 12 bins × 2 stats (mean, std) = 24 features
- Total: 64-dimensional feature space per song
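The aggregation above can be sketched with NumPy alone; random matrices stand in for the MFCC and Chroma matrices that librosa would return:

```python
import numpy as np

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(20, 500))    # 20 MFCC coefficients over 500 frames
chroma = rng.random(size=(12, 500))  # 12 chroma bins over 500 frames

# Mean and std over time for each coefficient/bin, concatenated:
# 20*2 + 12*2 = 64 features per song.
feature_vector = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    chroma.mean(axis=1), chroma.std(axis=1),
])
print(feature_vector.shape)
```

Averaging over time discards the song's temporal structure but yields a fixed-length vector regardless of track duration, which is what PCA and cosine similarity require.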
```bash
# 1. Setup environment
python3 -m venv venv
source venv/bin/activate
pip install numpy librosa scikit-learn tqdm joblib matplotlib pydub

# 2. Generate sample audio
python create_sample_audio.py

# 3. Process features
python -m src.process_features

# 4. Train model
python -m src.models.train_pca

# 5. Query similar songs (replace 000002 with a valid track ID)
python -m src.query --track_id 000002
```

"ModuleNotFoundError: No module named 'src'"
```bash
# Set Python path
export PYTHONPATH=/path/to/EigenBeats:$PYTHONPATH

# Or run with:
PYTHONPATH=/path/to/EigenBeats python src/process_features.py
```

"No audio files found"
- Check that audio files are in `data/raw/`
- Ensure files have supported extensions (.mp3, .wav, .flac, .ogg, .m4a)
- Verify file permissions
"Track ID not found"
- Use the exact filename without its extension as the track ID
- List available tracks: check `data/processed/features_v1.pkl` or run `data_loader.py`
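One way to list the available track IDs, assuming `features_v1.pkl` stores a dict keyed by track ID (the actual layout in this repo may differ, so adjust accordingly):

```python
import os
import pickle
import tempfile

def list_track_ids(path):
    # Assumed layout: {track_id: feature_vector, ...}; adjust if your
    # pickle stores a (feature_matrix, id_list) pair instead.
    with open(path, "rb") as f:
        return sorted(pickle.load(f))

# Demo with a throwaway pickle; in a real checkout, point `path` at
# data/processed/features_v1.pkl instead.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump({"000005": [0.2], "000002": [0.1]}, tmp)
ids = list_track_ids(tmp.name)
os.remove(tmp.name)
print(ids)
```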
```
data/
├── raw/
│   ├── song1.mp3
│   └── song2.wav
└── processed/
    ├── features_v1.pkl            # Raw features + track IDs
    ├── audio_dna_pca.npy          # PCA-transformed DNA vectors
    ├── pca_model.pkl              # Trained PCA model
    ├── scaler.pkl                 # Feature scaler
    └── pca_variance_analysis.png  # Variance plot
```
Modify src/feature_extraction.py to add new features, for example:

```python
import librosa
import numpy as np

# Add spectral centroid features (a measure of the sound's "brightness")
def extract_spectral_centroid(audio, sr):
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)
    return np.mean(centroid), np.std(centroid)
```

The system uses cosine similarity by default, but you can modify src/query.py to use:
- Euclidean distance
- Manhattan distance
- Pearson correlation
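A minimal NumPy sketch of those alternatives (the function names are illustrative; src/query.py's actual interface may differ):

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))    # distance: smaller = more similar

def manhattan(a, b):
    return float(np.abs(a - b).sum())      # distance: smaller = more similar

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])  # correlation: +1 = strongly similar

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.5, 2.5])
print(euclidean(a, b), manhattan(a, b), pearson(a, b))
```

Note that Euclidean and Manhattan are distances, so lower values mean more similar; if you swap one in, invert the sort order used for ranking results.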
For large datasets, modify src/process_features.py to:
- Process files in batches
- Save intermediate results
- Resume processing on interruption
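A minimal sketch of that pattern, with a dummy extract function standing in for the real librosa-based feature extraction and an assumed checkpoint path:

```python
import pickle
import tempfile
from pathlib import Path

def process_in_batches(files, out_path, extract, batch_size=100):
    """Extract features in batches, checkpointing after each batch so an
    interrupted run can resume where it left off."""
    out = Path(out_path)
    done = pickle.loads(out.read_bytes()) if out.exists() else {}
    todo = [f for f in files if f not in done]  # resume: skip finished files
    for i in range(0, len(todo), batch_size):
        for f in todo[i:i + batch_size]:
            done[f] = extract(f)
        out.write_bytes(pickle.dumps(done))     # save intermediate results
    return done

# Demo with a dummy extractor; the real pipeline would call the
# librosa-based feature code instead.
ckpt = Path(tempfile.mkdtemp()) / "features_checkpoint.pkl"
result = process_in_batches(["a.mp3", "b.mp3", "c.mp3"], ckpt,
                            extract=len, batch_size=2)
print(sorted(result))
```

Because finished files are skipped on restart, re-running the same command after a crash only processes the remainder.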