Python AI pipeline decoding Tango history. Ingests lyrics into DuckDB, generates Nomic embeddings via llama.cpp, clusters by emotional profiles, analyzes Maestro DNA, and predicts missing song dates using Random Forest regression. Maps a century of evolution.
TangoPoemsAnalytics is an end-to-end Data Science and Natural Language Processing (NLP) pipeline designed to map the historical, thematic, and emotional evolution of Tango music.
By scraping historical lyrics, analyzing them with local Large Language Models (LLMs), generating semantic embeddings, and applying machine learning algorithms, this project reconstructs the timeline of Tango and identifies the psychological "DNA" of its greatest maestros.
- Automated Scraping: Crawls and downloads lyrics, dates, and metadata from TodoTango.
- LLM-Powered Semantic Extraction: Uses a local LLM via `llama.cpp` to enforce a strict JSON schema, extracting specific emotions (e.g., sadness, longing) and themes from Spanish lyrics.
- High-Performance Vector Database: Ingests structured data into DuckDB views and tables for rapid analytical querying.
- Semantic Embeddings: Vectorizes lyrics into 768-dimensional space using the Nomic-Embed model.
- Unsupervised Machine Learning: Applies K-Means clustering, t-SNE, and MDS dimensionality reduction to group songs into emotional and thematic archetypes.
- Temporal Regression Modeling: Uses a `RandomForestRegressor` trained on known song years, author lifespans, and cluster distances to predict the release decades of undated tracks.
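The strict-schema step in the feature list can be pictured as a validation pass over whatever the model returns. The sketch below is not the project's code: the field names (`themes`, `feelings`) and the 0.0 to 1.0 score range are assumptions made for illustration, and it checks the reply with the standard library after the fact rather than using llama.cpp's own schema enforcement.

```python
import json

# Hypothetical schema: these keys and the 0.0-1.0 score range are assumptions,
# not the project's actual output format.
REQUIRED_KEYS = {"themes": list, "feelings": dict}


def parse_analysis(raw: str) -> dict:
    """Parse one LLM reply and reject anything that does not match the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or mistyped field: {key}")
    for name, score in data["feelings"].items():
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"score for {name!r} is not a number")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} is outside 0.0-1.0")
    return data


reply = '{"themes": ["lost love"], "feelings": {"sadness": 0.8, "longing": 0.6}}'
print(parse_analysis(reply)["feelings"]["sadness"])
```

Rejecting a malformed reply early keeps bad rows out of the database, at the cost of having to retry or skip the affected song.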
The project is structured into sequential scripts, orchestrated by a Makefile, moving data from raw HTML to final historical reports:
- `1_download_lyrics.py`: Scrapes song URLs, titles, authors, composers, years, and lyrics, utilizing a caching session to avoid redundant requests.
- `2_analyze_lyrics.py`: Feeds the downloaded lyrics into a local LLM to extract thematic labels and feelings arrays.
- `3_installdb.py`: Initializes the `tango_archive.db` DuckDB database, transforming raw JSON into relational tables (`tango_song`, `tango_song_feeling`, `tango_song_author`) and pivoted "DNA" views. Includes `poet_dates.csv` for author metadata.
- `4_results_stg1.py`: Generates initial preprocessed reports, including top 10 author analyses and "Tango Titans" composer DNA.
- `5_vectorize.py`: Reads lyrics from DuckDB, generates Nomic embeddings, and stores them in the `tango_analysis` table.
- `6_clustering.py`: Scales the emotional metadata, concatenates it with the 768-dim vectors, evaluates K-Means (from k=2 to 10) using silhouette scores, and plots t-SNE/MDS visualizations.
- `7_results_stg2.py`: Exports cluster summaries and time distributions across 5-year bins based on the best K-Means configuration.
- `8_titans_analysis.py`: Processes targeted maestro datasets (e.g., Di Sarli, D'Arienzo) to update missing years and extract their average emotional signatures.
- `9_regression.py`: Imputes missing values, calculates distances to cluster centroids, and trains a Random Forest model to predict years for undated songs, saving predictions to `tango_year_predictions`.
- `A_results_stg3.py`: Produces the final historical timeline by coalescing known years with ML predictions, exporting total tangos, emotional averages, and thematic shifts per 5-year era.
- `B_process_final_charts.py`: Generates stacked bar charts to visualize thematic evolution and the full emotional spectrum over time.
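The regression step (9_regression.py) uses each song's distance to the cluster centroids as an input feature. A minimal sketch of that idea follows, assuming Euclidean distance and centroids computed as plain means; the README does not say which metric the project uses, and the 2-dimensional toy vectors stand in for the real feature vectors.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def centroid_distances(vector, centroids):
    """Euclidean distance from one song vector to every cluster centroid."""
    return [math.dist(vector, c) for c in centroids]


# Toy 2-dimensional stand-ins for the real vectors (hypothetical values).
cluster_a = [[0.0, 0.0], [2.0, 0.0]]
cluster_b = [[4.0, 3.0], [6.0, 5.0]]
centroids = [centroid(cluster_a), centroid(cluster_b)]

# One distance per cluster becomes one regression feature per cluster.
undated_song = [1.0, 0.0]
features = centroid_distances(undated_song, centroids)
print(features)
```

Each undated song ends up with as many distance features as there are clusters, which lets the regressor use "how close is this song to each archetype" as a signal for its era.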
To run this pipeline, you will need Python 3.11+ and Poetry for dependency management.
This project relies on 4-bit quantized (Q4_K_M) local GGUF models to run inference and embedding directly on Metal/CPU without relying on paid APIs:
- Semantic Extraction (LLM): `Qwen2.5-32B-Instruct-GGUF` (`qwen2.5-32b-instruct-q4_k_m`)
- Text Embedding: `nomic-embed-text-v1.5.Q4_K_M.gguf`
- Clone the repository: `git clone https://github.com/yourusername/TangoPoemsAnalytics.git`, then `cd TangoPoemsAnalytics`.
- Install dependencies: `poetry install`
- Environment Variables: Create a `.env` file in the root directory and configure your paths:
  - `duckdBPath`: Path to save the database (e.g., `tango_archive.db`).
  - `ModelAnalyzerPath`: Path to your downloaded Qwen 2.5 32B model.
  - `TokenizerAnalyzePath`: Path to your downloaded Nomic-Embed model.
- Data Requirements: Ensure you have the `poet_dates.csv` file mapping authors to their biographical dates for database initialization. Ensure `titans/*.csv` exists for the titan analysis.
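A script can fail early when one of these paths is missing rather than partway through a long run. The sketch below assumes the `.env` values have already been loaded into the process environment (for instance with a package such as python-dotenv; the README does not state which loader the project uses). The variable names come from the setup notes above, while `load_settings` and the example paths are hypothetical.

```python
import os

# Variable names are taken from the setup notes; the helper itself is hypothetical.
REQUIRED_VARS = ("duckdBPath", "ModelAnalyzerPath", "TokenizerAnalyzePath")


def load_settings(env=os.environ):
    """Return the three configured paths, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}


# Example with an explicit mapping instead of the real process environment.
example_env = {
    "duckdBPath": "tango_archive.db",
    "ModelAnalyzerPath": "/models/qwen.gguf",      # placeholder path
    "TokenizerAnalyzePath": "/models/nomic.gguf",  # placeholder path
}
settings = load_settings(example_env)
print(settings["duckdBPath"])
```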
TangoPoemsAnalytics is designed to run as a sequential pipeline using make.
To run the entire pipeline:
make all
To run individual steps:
make download # Data Acquisition
make analyze # AI Sentiment/DNA Analysis
make installdb # Database Setup
make results_stg1 # Initial Reports
make vectorize # Semantic Vectorization
make clustering # K-Means Clustering
make results_stg2 # Cluster Summaries
make titans_analysis # Maestro Profiles
make regression # Date Prediction
make results_stg3 # Final Unified Results
make final_charts # Matplotlib Visualization
Our Random Forest model predicted years to within a few years for most decades, and was especially accurate for the stylistically well-defined 1930s and 1940s (the "Golden Age"), where the mean absolute error was 3.0 to 3.4 years.
The higher error in the 1960s and 1970s (9.3 and 6.2 years) is consistent with the historical record: it coincides with the Tango Nuevo Vanguard Fracture (traditionalists vs. modernists), when the genre's cohesive stylistic DNA broke down.
Mean Absolute Error (MAE) per Decade:
- 1900s: +/- 32.1 years (Low sample size)
- 1910s: +/- 12.3 years
- 1920s: +/- 5.1 years
- 1930s: +/- 3.0 years (High Accuracy - The Infamous Decade)
- 1940s: +/- 3.4 years (High Accuracy - The Golden Age)
- 1950s: +/- 4.9 years
- 1960s: +/- 9.3 years (The Vanguard Fracture)
- 1970s: +/- 6.2 years
- 1980s: +/- 4.9 years
- 1990s: +/- 4.5 years
- 2000s: +/- 6.0 years
- 2010s: +/- 1.9 years (High Accuracy - Modern Era)
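For readers who want to reproduce a per-decade breakdown like the one above, the calculation can be sketched as follows. It assumes songs are grouped by the decade of their true year, which the README does not state explicitly, and the sample pairs are invented for illustration rather than taken from the project's data.

```python
from collections import defaultdict


def mae_by_decade(pairs):
    """Group (true_year, predicted_year) pairs by true decade and average |error|."""
    errors = defaultdict(list)
    for true_year, predicted_year in pairs:
        decade = true_year // 10 * 10
        errors[decade].append(abs(true_year - predicted_year))
    return {decade: sum(errs) / len(errs) for decade, errs in sorted(errors.items())}


# Made-up pairs purely for illustration; these are not the project's results.
sample = [(1935, 1938), (1939, 1936), (1942, 1946), (1948, 1946)]
print(mae_by_decade(sample))
```

Because each decade's figure is an average over however many dated songs fall in it, decades with few songs (such as the 1900s above) give much less reliable estimates.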
Our final analysis successfully merged known dates with AI-predicted dates for unknowns, revealing the complete commercial journey of Tango.
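The merge amounts to preferring a documented year and falling back to the prediction, then counting songs per 5-year era. The sketch below illustrates this in plain Python; the project performs this step against DuckDB, and the rounding of predicted years as well as the sample rows are assumptions made here.

```python
from collections import Counter


def resolve_year(known_year, predicted_year):
    """Prefer the documented year; fall back to the model's prediction."""
    return known_year if known_year is not None else predicted_year


def era_start(year, width=5):
    """Floor a year to the first year of its bin, e.g. 1943 -> 1940 for width 5."""
    return year // width * width


# (title, known_year, predicted_year) rows; all values are invented for illustration.
songs = [
    ("song A", 1943, 1941),
    ("song B", None, 1947.6),
    ("song C", None, 1926.2),
]

per_era = Counter()
for title, known, predicted in songs:
    year = resolve_year(known, predicted)
    if year is not None:
        # Rounding a fractional prediction to the nearest year is an assumption.
        per_era[era_start(int(round(year)))] += 1

print(sorted(per_era.items()))
```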
By calculating maestro-specific fingerprints based on their lyrical repertoires, we mathematically quantified their emotional legacies:
- Aníbal Troilo, "The Melancholic Noble": Highest score for Sadness (0.68) and Loneliness, zero humor. Pathos defined.
- Juan D'Arienzo, "The Aggressive Cynic": Highest Anger (0.10) and lowest hope.
- Osvaldo Pugliese, "The Complex Intellectual": Balanced sadness, regret, and high Social Critique (0.18).
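A fingerprint of this kind can be read as a per-emotion average over a maestro's repertoire. The following sketch shows one way to compute it, assuming each song carries a dictionary of emotion scores and that an emotion absent from a song counts as 0; both are assumptions, and the scores are invented.

```python
from collections import defaultdict


def emotional_signature(song_scores):
    """Average each emotion over a repertoire; emotions absent from a song count as 0."""
    if not song_scores:
        return {}
    totals = defaultdict(float)
    for scores in song_scores:
        for emotion, value in scores.items():
            totals[emotion] += value
    n = len(song_scores)
    return {emotion: total / n for emotion, total in totals.items()}


# Invented scores for a hypothetical three-song repertoire.
repertoire = [
    {"sadness": 0.9, "anger": 0.0},
    {"sadness": 0.6, "anger": 0.3},
    {"sadness": 0.6},
]
signature = emotional_signature(repertoire)
print(max(signature, key=signature.get))
```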
Disclaimer: This is a research project utilizing AI for semantic analysis of cultural artifacts. Interpretations of findings should be made in context with established musicological history.