TangoPoemsAnalytics

Python AI pipeline decoding Tango history. Ingests lyrics into DuckDB, generates Nomic embeddings via llama.cpp, clusters by emotional profiles, analyzes Maestro DNA, and predicts missing song dates using Random Forest regression. Maps a century of evolution.

Overview

TangoPoemsAnalytics is an end-to-end Data Science and Natural Language Processing (NLP) pipeline designed to map the historical, thematic, and emotional evolution of Tango music.

By scraping historical lyrics, analyzing them with local Large Language Models (LLMs), generating semantic embeddings, and applying machine learning algorithms, this project reconstructs the timeline of Tango and identifies the psychological "DNA" of its greatest maestros.

Key Features

Automated Scraping: Crawls and downloads lyrics, dates, and metadata from TodoTango.
LLM-Powered Semantic Extraction: Uses a local LLM via llama.cpp to enforce a strict JSON schema, extracting specific emotions (e.g., sadness, longing) and themes from Spanish lyrics.
Analytical Database: Ingests structured data into DuckDB tables and views for fast analytical querying; the embeddings are stored alongside in the tango_analysis table.
Semantic Embeddings: Vectorizes lyrics into 768-dimensional space using the Nomic-Embed model.
Unsupervised Machine Learning: Applies K-Means clustering, t-SNE, and MDS dimensionality reduction to group songs into emotional and thematic archetypes.
Temporal Regression Modeling: Uses a RandomForestRegressor trained on known song years, author lifespans, and cluster distances to predict the release decades of undated tracks.
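As an illustration of the strict-schema idea in the semantic extraction feature, the sketch below validates one LLM response after generation. It is not the project's code: the key names (themes, feelings), the emotion labels, and the 0 to 1 score range are assumptions made for the example, and it checks output after the fact rather than constraining decoding inside llama.cpp.

```python
import json

# Hypothetical emotion labels; the project's real schema may differ.
ALLOWED_EMOTIONS = {"sadness", "longing", "anger", "hope"}


def parse_llm_output(raw: str) -> dict:
    """Parse one LLM response and reject anything outside the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict) or set(data) != {"themes", "feelings"}:
        raise ValueError("unexpected top-level keys")

    themes = data["themes"]
    if not isinstance(themes, list) or not all(isinstance(t, str) for t in themes):
        raise ValueError("themes must be a list of strings")

    feelings = data["feelings"]
    if not isinstance(feelings, dict):
        raise ValueError("feelings must be an object mapping emotion to score")
    for name, score in feelings.items():
        if name not in ALLOWED_EMOTIONS:
            raise ValueError(f"unknown emotion: {name}")
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name} must be a number in [0, 1]")
    return data
```

Rejecting a response outright, instead of repairing it, keeps malformed rows out of the database; a caller could retry the prompt on ValueError.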

Pipeline Architecture

The project is structured into sequential scripts, orchestrated by a Makefile, moving data from raw HTML to final historical reports:

1_download_lyrics.py: Scrapes song URLs, titles, authors, composers, years, and lyrics, utilizing a caching session to avoid redundant requests.
2_analyze_lyrics.py: Feeds the downloaded lyrics into a local LLM to extract thematic labels and feelings arrays.
3_installdb.py: Initializes the tango_archive.db DuckDB database, transforming raw JSON into relational tables (tango_song, tango_song_feeling, tango_song_author) and pivoted "DNA" views. Includes poet_dates.csv for author metadata.
4_results_stg1.py: Generates initial preprocessed reports, including top 10 author analyses and "Tango Titans" composer DNA.
5_vectorize.py: Reads lyrics from DuckDB, generates Nomic embeddings, and stores them in the tango_analysis table.
6_clustering.py: Scales the emotional metadata, concatenates it with the 768-dim vectors, evaluates K-Means (from k=2 to 10) using silhouette scores, and plots t-SNE/MDS visualizations.
7_results_stg2.py: Exports cluster summaries and time distributions across 5-year bins based on the best K-Means configuration.
8_titans_analysis.py: Processes targeted maestro datasets (e.g., Di Sarli, D'Arienzo) to update missing years and extract their average emotional signatures.
9_regression.py: Imputes missing values, calculates distances to cluster centroids, and trains a Random Forest model to predict years for undated songs, saving predictions to tango_year_predictions.
A_results_stg3.py: Produces the final historical timeline by coalescing known years with ML predictions, exporting total tangos, emotional averages, and thematic shifts per 5-year era.
B_process_final_charts.py: Generates stacked bar charts to visualize thematic evolution and the full emotional spectrum over time.
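The cluster-distance features mentioned for 9_regression.py can be sketched as follows. This is a minimal illustration, not the script itself: the function names, the feature order, and the toy values are assumptions, and the real vectors would combine 768 embedding dimensions with the scaled emotion scores.

```python
import math


def centroid_distances(vector, centroids):
    """Euclidean distance from one song vector to every cluster centroid."""
    return [math.dist(vector, centroid) for centroid in centroids]


def feature_row(vector, centroids, author_birth, author_death):
    """Hypothetical feature layout: author lifespan bounds, then one distance per cluster."""
    return [author_birth, author_death, *centroid_distances(vector, centroids)]


# Toy 2-D example with made-up numbers.
centroids = [[0.0, 0.0], [3.0, 4.0]]
row = feature_row([0.0, 0.0], centroids, author_birth=1890, author_death=1951)
print(row)
```

Rows built this way for songs with known years would form the training matrix for the regressor, and rows for undated songs would be passed to its predict step.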

Prerequisites & Installation

To run this pipeline, you need Python 3.11+ and Poetry for dependency management.

Models Used

This project relies on local, 4-bit quantized (Q4_K_M) GGUF models to run inference and embedding directly on Metal/CPU, without paid APIs:

Semantic Extraction (LLM): Qwen2.5-32B-Instruct-GGUF (qwen2.5-32b-instruct-q4_k_m)
Text Embedding: nomic-embed-text-v1.5.Q4_K_M.gguf

Setup Instructions

Clone the repository:

git clone https://github.com/yourusername/TangoPoemsAnalytics.git
cd TangoPoemsAnalytics

Install dependencies:
```
poetry install
```
Environment Variables: Create a .env file in the root directory and configure your paths:
- duckdBPath: Path to save the database (e.g., tango_archive.db)
- ModelAnalyzerPath: Path to your downloaded Qwen 2.5 32B model.
- TokenizerAnalyzePath: Path to your downloaded Nomic-Embed model.
Data Requirements: Ensure the poet_dates.csv file, which maps authors to their biographical dates, is present for database initialization, and that the titans/*.csv files exist for the titan analysis.
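Before running the pipeline, it can help to confirm that the .env file defines all three paths. The sketch below is a standard-library stand-in, assuming simple KEY=value lines; the scripts themselves may load the file differently (for example with a dotenv package), and the helper names are hypothetical.

```python
from pathlib import Path

# Variable names as listed in the setup instructions above.
REQUIRED_KEYS = ("duckdBPath", "ModelAnalyzerPath", "TokenizerAnalyzePath")


def parse_env(text: str) -> dict:
    """Parse KEY=value lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def check_env_file(path: str = ".env") -> list:
    """Read the file at `path` and report which required keys are missing."""
    return missing_keys(parse_env(Path(path).read_text(encoding="utf-8")))
```

Calling check_env_file() from the repository root returns an empty list when the configuration is complete.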

How to Use

TangoPoemsAnalytics is designed to run as a sequential pipeline using make.

To run the entire pipeline:

make all

To run individual steps:

make download          # Data Acquisition
make analyze           # AI Sentiment/DNA Analysis
make installdb         # Database Setup
make results_stg1      # Initial Reports
make vectorize         # Semantic Vectorization
make clustering        # K-Means Clustering
make results_stg2      # Cluster Summaries
make titans_analysis   # Maestro Profiles
make regression        # Date Prediction
make results_stg3      # Final Unified Results
make final_charts      # Matplotlib Visualization

Analytical Findings

1. ML Chronological Prediction Performance

Our Random Forest model is most accurate for the stylistically cohesive 1930s and 1940s (the "Golden Age" and the decade before it), where the mean absolute error is 3.0 to 3.4 years.

The higher error in the 1960s and 1970s (9.3 and 6.2 years) is consistent with the Tango Nuevo "Vanguard Fracture" (traditionalists vs. modernists), when the genre's cohesive stylistic DNA broke down.

Mean Absolute Error (MAE) per Decade:

1900s: +/- 32.1 years (Low sample size)
1910s: +/- 12.3 years
1920s: +/- 5.1 years
1930s: +/- 3.0 years (High Accuracy - The Infamous Decade)
1940s: +/- 3.4 years (High Accuracy - The Golden Age)
1950s: +/- 4.9 years
1960s: +/- 9.3 years (The Vanguard Fracture)
1970s: +/- 6.2 years
1980s: +/- 4.9 years
1990s: +/- 4.5 years
2000s: +/- 6.0 years
2010s: +/- 1.9 years (High Accuracy - Modern Era)
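For reference, a per-decade MAE like the one listed above can be computed as in the sketch below. It assumes errors are grouped by the decade of the true year; whether the project groups by true or predicted year, and on which evaluation split, is not stated here. The sample pairs are made up.

```python
from collections import defaultdict


def mae_per_decade(pairs):
    """pairs: iterable of (true_year, predicted_year). Returns {decade: MAE in years}."""
    errors = defaultdict(list)
    for true_year, predicted_year in pairs:
        decade = (true_year // 10) * 10  # e.g. 1937 -> 1930
        errors[decade].append(abs(predicted_year - true_year))
    return {decade: sum(errs) / len(errs) for decade, errs in sorted(errors.items())}


# Made-up (true, predicted) pairs for illustration only.
sample = [(1935, 1938), (1939, 1936), (1942, 1941)]
print(mae_per_decade(sample))
```

A decade with very few songs produces an unstable average, which is why the 1900s figure carries a low-sample-size caveat.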

2. The Historical Heartbeat

Our final analysis coalesces known years with model-predicted years for undated songs, then reports total tangos, emotional averages, and thematic shifts per 5-year era to trace Tango's commercial trajectory across the whole corpus.
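The merge of known and predicted years, followed by 5-year binning, can be sketched in plain Python as below. This is an assumption-laden illustration of the idea, not the SQL that A_results_stg3.py runs; the function names and sample rows are hypothetical.

```python
from collections import Counter


def effective_year(known_year, predicted_year):
    """Prefer the known year; fall back to the model's prediction (like SQL COALESCE)."""
    return known_year if known_year is not None else predicted_year


def era_start(year, width=5):
    """First year of the fixed-width era containing `year`, e.g. 1943 -> 1940."""
    return int(year) // width * width


def tangos_per_era(songs):
    """songs: iterable of (known_year, predicted_year); either value may be None."""
    counts = Counter()
    for known_year, predicted_year in songs:
        year = effective_year(known_year, predicted_year)
        if year is None:
            continue  # neither a known nor a predicted year is available
        counts[era_start(year)] += 1
    return dict(sorted(counts.items()))


# Made-up rows: one dated song, two predicted, one with no year at all.
print(tangos_per_era([(1941, None), (None, 1943.6), (None, 1947.2), (None, None)]))
```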

3. The Emotional "DNA" of the Tango Titans

By averaging emotion scores across each maestro's lyrical repertoire, we derived a fingerprint that quantifies their emotional signature:

Aníbal Troilo, "The Melancholic Noble": Highest score for Sadness (0.68) and Loneliness, zero humor. Pathos defined.
Juan D'Arienzo, "The Aggressive Cynic": Highest Anger (0.10) and lowest hope.
Osvaldo Pugliese, "The Complex Intellectual": Balanced sadness, regret, and high Social Critique (0.18).
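A fingerprint of this kind can be sketched as a per-maestro mean of emotion scores. The code below is illustrative only: it assumes each song carries a mapping of emotion to score, treats an emotion absent from a song as 0, and uses placeholder names and numbers rather than project data.

```python
from collections import defaultdict


def emotional_fingerprint(songs):
    """songs: iterable of (maestro, {emotion: score}).

    Returns {maestro: {emotion: mean score over all of that maestro's songs}}.
    An emotion missing from a song contributes 0 to the mean.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for maestro, feelings in songs:
        counts[maestro] += 1
        for emotion, score in feelings.items():
            totals[maestro][emotion] += score
    return {
        maestro: {emotion: total / counts[maestro] for emotion, total in emotions.items()}
        for maestro, emotions in totals.items()
    }


# Placeholder data, not taken from the project's results.
songs = [
    ("Maestro A", {"sadness": 0.8, "hope": 0.2}),
    ("Maestro A", {"sadness": 0.4}),
    ("Maestro B", {"anger": 0.1}),
]
fingerprints = emotional_fingerprint(songs)
```

Comparing these means across maestros is what yields labels such as "highest Sadness" or "lowest hope".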

Disclaimer: This is a research project utilizing AI for semantic analysis of cultural artifacts. Interpretations of findings should be made in context with established musicological history.
