Beyond the Playlist: Using Linear Algebra to Discover a Song's "Audio DNA"
EigenBeats is a content-based music similarity engine that uses machine learning to find perceptually similar songs. Instead of relying on subjective genre tags or collaborative filtering, it analyzes the audio content of music directly to compute a unique mathematical "fingerprint" or "Audio DNA" for each track.
This project is inspired by the idea that the core essence of a song—its timbre, harmony, and energy—can be represented mathematically.
The project is divided into two main versions, as outlined in the research report:
- EigenBeats v1.0: A foundational model using Principal Component Analysis (PCA). It extracts low-level audio features (MFCCs and Chroma) and uses PCA to find the principal components of musical variation across a dataset. This creates a compact and efficient "Audio DNA" for each song.
- EigenBeats v2.0: An advanced model using a Deep Learning Autoencoder. This version learns a more nuanced and powerful representation of the audio by training a neural network to compress and reconstruct song spectrograms. The compressed representation from the autoencoder's "bottleneck" layer serves as the new "Audio DNA".
Similarity between songs is measured by calculating the Cosine Similarity between their "Audio DNA" vectors.
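As a quick sketch of that measure (using made-up three-dimensional DNA vectors; the real vectors have more dimensions), cosine similarity can be computed with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two "Audio DNA" vectors:
    # 1.0 = identical direction, 0.0 = unrelated, -1.0 = opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dna_a = np.array([0.8, -0.2, 0.5])  # hypothetical DNA for song A
dna_b = np.array([0.7, -0.1, 0.6])  # hypothetical DNA for song B

print(f"Similarity: {cosine_similarity(dna_a, dna_b) * 100:.1f}%")
```

Because cosine similarity compares direction rather than magnitude, two songs with proportionally similar feature profiles score highly even if one is "louder" overall in feature space.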
This project is developed in Python 3.9+ and primarily uses the following libraries:
- Audio Processing: `librosa`
- Numerical & ML: `numpy`, `scikit-learn`, `pandas`
- Deep Learning (for v2.0): `tensorflow` or `pytorch`
- Data Handling: `tqdm` for progress bars, `pickle` or `joblib` for saving models
Project structure:

```
.
├── data/
│   ├── raw/                   # Raw audio files (e.g., from FMA)
│   └── processed/             # Processed features and DNA vectors
├── notebooks/                 # Jupyter notebooks for exploration and analysis
├── src/                       # Source code for the project
│   ├── __init__.py
│   ├── config.py              # Configuration variables
│   ├── data_loader.py         # Scripts for loading and processing data
│   ├── feature_extraction.py  # Feature extraction logic
│   ├── models/                # PCA and Autoencoder model code
│   │   └── train_pca.py
│   └── query.py               # Script to query for similar songs
├── tests/                     # Test scripts
├── ARCHITECTURE.md
├── database.md
├── plan.md
├── README.md
├── test.md
├── UI.md
├── create_sample_audio.py     # Sample data generator
└── teach.md                   # Teaching guide
```
Prerequisites:
- Python 3.9+
- pip package manager
- Audio files (MP3, WAV, FLAC, OGG, or M4A format)
```bash
# Clone the repository
git clone https://github.com/Chamath-Adithya/EigenBeats.git
cd EigenBeats

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install numpy librosa scikit-learn tqdm joblib matplotlib pydub
```

Add your audio files to the data/raw/ directory:
```bash
# Supported formats: MP3, WAV, FLAC, OGG, M4A
mkdir -p data/raw

# Copy your audio files here, e.g.:
# cp ~/Music/*.mp3 data/raw/
```

For testing, use the included sample generator:

```bash
python create_sample_audio.py
```

Then extract features:

```bash
source venv/bin/activate  # Activate virtual environment
python -m src.process_features
```

What it does:
- Scans `data/raw/` for audio files
- Extracts MFCCs (20 coefficients) and Chroma features (12 bins)
- Aggregates features using mean and standard deviation
- Saves 64-dimensional feature vectors to `data/processed/features_v1.pkl`
Output:
```
Found 8000 audio files in /home/chamath-adithya/Documents/EigenBeats/data/raw
Sample files: ['000002', '000005', '000010', '000140', '000141']...
Extracting features from audio files...
100%|████████████████████████████████████████| 8000/8000 [18:21<00:00, 7.27it/s]
Successfully processed 7997 out of 8000 files
Saved features to /home/chamath-adithya/Documents/EigenBeats/data/processed/features_v1.pkl
Feature matrix shape: 7997 x 64
```
```bash
source venv/bin/activate  # Activate virtual environment (if not already)
python -m src.models.train_pca
```

What it does:
- Loads extracted features
- Standardizes data (mean=0, variance=1)
- Determines optimal number of PCA components (95% variance)
- Trains PCA model and creates "Audio DNA" vectors
- Saves model, scaler, and DNA vectors
Output:
```
Starting PCA training for EigenBeats v1.0...
Loaded feature matrix: (10, 64)
Explained variance with 6 components: 0.965
Optimal number of PCA components: 6
Trained PCA with 6 components
Created Audio DNA with shape: (10, 6)
PCA training completed successfully!
```
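The training steps can be sketched with scikit-learn. Here a random matrix stands in for the real 64-dimensional feature matrix, and the variable names are illustrative rather than taken from src/models/train_pca.py:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 64))  # stand-in for the 64-dim feature matrix

# Standardize each feature to mean=0, variance=1.
scaler = StandardScaler()
X = scaler.fit_transform(features)

# Fit a full PCA once, then pick the smallest component count whose
# cumulative explained variance reaches 95%.
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Refit with that count; each row of audio_dna is one song's DNA vector.
pca = PCA(n_components=n_components).fit(X)
audio_dna = pca.transform(X)
print(audio_dna.shape)
```

With uncorrelated random data, nearly all 64 components are needed; real audio features are highly correlated, which is why a handful of components can capture 95% of the variance.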
```bash
# Activate virtual environment (if not already)
source venv/bin/activate

# Basic query (replace 000002 with a valid track ID from your dataset)
python -m src.query --track_id 000002

# Get top 5 similar songs
python -m src.query --track_id 000002 --top_k 5
```

Sample output:
With the FMA dataset:

```
🎵 EigenBeats: Songs similar to '000002'
1. 110263 - Similarity: 67.5%
2. 144941 - Similarity: 65.3%
3. 110261 - Similarity: 64.4%
4. 145703 - Similarity: 64.3%
5. 078848 - Similarity: 61.3%
```

With the generated sample audio:

```
==================================================
1. sample_008 - Similarity: 10.6%
2. sample_003 - Similarity: 2.2%
3. sample_004 - Similarity: -4.1%
4. sample_005 - Similarity: -11.0%
5. sample_006 - Similarity: -16.9%
==================================================
```
- Range: -100% to +100% (cosine similarity expressed as a percentage)
- Interpretation:
  - High positive (>50%): very similar songs
  - Moderate positive (10-50%): somewhat similar
  - Near zero or slightly negative: different songs
  - High negative (<-50%): very different/opposite characteristics
- Dimensions: typically 6-10 components (capturing 95%+ of variance)
- Representation: mathematical fingerprint of audio content
- Storage: NumPy array saved as `data/processed/audio_dna_pca.npy`
- MFCCs: 20 coefficients × 2 stats (mean, std) = 40 features
- Chroma: 12 bins × 2 stats (mean, std) = 24 features
- Total: 64-dimensional feature space per song
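The aggregation above can be sketched with NumPy alone; random matrices stand in for the MFCC and Chroma matrices that librosa would return:

```python
import numpy as np

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(20, 500))    # 20 MFCC coefficients over 500 frames
chroma = rng.random(size=(12, 500))  # 12 chroma bins over 500 frames

# Mean and std over time for each coefficient/bin, concatenated:
# 20*2 + 12*2 = 64 features per song.
feature_vector = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    chroma.mean(axis=1), chroma.std(axis=1),
])
print(feature_vector.shape)
```

Averaging over time discards the song's temporal structure but yields a fixed-length vector regardless of track duration, which is what PCA and cosine similarity require.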
```bash
# 1. Setup environment
python3 -m venv venv
source venv/bin/activate
pip install numpy librosa scikit-learn tqdm joblib matplotlib pydub

# 2. Generate sample audio
python create_sample_audio.py

# 3. Process features
python -m src.process_features

# 4. Train model
python -m src.models.train_pca

# 5. Query similar songs (replace 000002 with a valid track ID)
python -m src.query --track_id 000002
```

"ModuleNotFoundError: No module named 'src'"
```bash
# Set Python path
export PYTHONPATH=/path/to/EigenBeats:$PYTHONPATH

# Or run with:
PYTHONPATH=/path/to/EigenBeats python src/process_features.py
```

"No audio files found"
- Check that audio files are in `data/raw/`
- Ensure files have supported extensions (.mp3, .wav, .flac, .ogg, .m4a)
- Verify file permissions
"Track ID not found"
- Use the exact filename without its extension as the track ID
- List available tracks: check `data/processed/features_v1.pkl` or run `data_loader.py`
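One way to list the available track IDs, assuming `features_v1.pkl` stores a dict keyed by track ID (the actual layout in this repo may differ, so adjust accordingly):

```python
import os
import pickle
import tempfile

def list_track_ids(path):
    # Assumed layout: {track_id: feature_vector, ...}; adjust if your
    # pickle stores a (feature_matrix, id_list) pair instead.
    with open(path, "rb") as f:
        return sorted(pickle.load(f))

# Demo with a throwaway pickle; in a real checkout, point `path` at
# data/processed/features_v1.pkl instead.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump({"000005": [0.2], "000002": [0.1]}, tmp)
ids = list_track_ids(tmp.name)
os.remove(tmp.name)
print(ids)
```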
```
data/
├── raw/
│   ├── song1.mp3
│   └── song2.wav
└── processed/
    ├── features_v1.pkl            # Raw features + track IDs
    ├── audio_dna_pca.npy          # PCA-transformed DNA vectors
    ├── pca_model.pkl              # Trained PCA model
    ├── scaler.pkl                 # Feature scaler
    └── pca_variance_analysis.png  # Variance plot
```
Modify src/feature_extraction.py to add new features, for example:

```python
import librosa
import numpy as np

# Add spectral centroid features (a measure of the sound's "brightness")
def extract_spectral_centroid(audio, sr):
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)
    return np.mean(centroid), np.std(centroid)
```

The system uses cosine similarity by default, but you can modify src/query.py to use:
- Euclidean distance
- Manhattan distance
- Pearson correlation
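A minimal NumPy sketch of those alternatives (the function names are illustrative; src/query.py's actual interface may differ):

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))    # distance: smaller = more similar

def manhattan(a, b):
    return float(np.abs(a - b).sum())      # distance: smaller = more similar

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])  # correlation: +1 = strongly similar

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.5, 2.5])
print(euclidean(a, b), manhattan(a, b), pearson(a, b))
```

Note that Euclidean and Manhattan are distances, so lower values mean more similar; if you swap one in, invert the sort order used for ranking results.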
For large datasets, modify src/process_features.py to:
- Process files in batches
- Save intermediate results
- Resume processing on interruption
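A minimal sketch of that pattern, with a dummy extract function standing in for the real librosa-based feature extraction and an assumed checkpoint path:

```python
import pickle
import tempfile
from pathlib import Path

def process_in_batches(files, out_path, extract, batch_size=100):
    """Extract features in batches, checkpointing after each batch so an
    interrupted run can resume where it left off."""
    out = Path(out_path)
    done = pickle.loads(out.read_bytes()) if out.exists() else {}
    todo = [f for f in files if f not in done]  # resume: skip finished files
    for i in range(0, len(todo), batch_size):
        for f in todo[i:i + batch_size]:
            done[f] = extract(f)
        out.write_bytes(pickle.dumps(done))     # save intermediate results
    return done

# Demo with a dummy extractor; the real pipeline would call the
# librosa-based feature code instead.
ckpt = Path(tempfile.mkdtemp()) / "features_checkpoint.pkl"
result = process_in_batches(["a.mp3", "b.mp3", "c.mp3"], ckpt,
                            extract=len, batch_size=2)
print(sorted(result))
```

Because finished files are skipped on restart, re-running the same command after a crash only processes the remainder.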