This project explores a dataset of Spotify songs to demonstrate SQL skills. The dataset, sourced from Kaggle, is cleaned and prepared using Python and uploaded to a SQL database. A variety of queries showcase analytical capabilities and data relationships.
- Data Preparation:
- Clean and organize raw data using Python.
- Split the data into normalized tables (
songs,artists,genres) for relational database design.
- SQL Mastery:
- Perform complex queries to extract insights, showcase database proficiency, and analyze song trends.
- Highlight Relationships:
- Use JOINs and relationships to derive meaningful information from the data.
- Source: Dataset from Kaggle.
- Steps:
- Cleaned and normalized data using a Jupyter notebook.
- Split the data into three tables:
- Artists:
artist_id,artist_name - Genres:
genre_id,genre_name - Songs:
song_id,title,artist_id,genre_id, and other song features.
- Artists:
- Added unique IDs to create relationships between the tables.
The data is organized into a relational schema:
- Artists Table:
- Contains unique artist names and IDs.
- Genres Table:
- Stores unique genres with corresponding IDs.
- Songs Table:
- Links songs to artists and genres using foreign keys and includes song attributes like BPM, energy, and popularity.
- Top Artists by Song Count: Identified the most prolific artists in the dataset.
- Top Genres by Popularity: Determined the most popular genres based on average song popularity.
- Longest and Shortest Songs: Retrieved the lengthiest and briefest tracks.
- Popularity Trends: Analyzed how song popularity evolved over the years.
- One-Hit Wonders: Highlighted artists with only one song in the dataset.
- Genre Diversity per Artist: Showcased artists with songs spanning multiple genres.
- Energy and Danceability: Found highly energetic and danceable tracks.
- BPM Ranges: Grouped songs into tempo categories (e.g., Slow, Moderate, Fast).
- Database Indexing: Optimized performance by creating indexes.
- Data Integrity Checks: Verified data consistency and removed duplicates.
- Visualization-Ready Insights:
- Suggested graphs and visualizations for query results, such as genre distributions and popularity trends.
- Python: For data preparation, cleaning, and exporting tables.
- Jupyter Notebook: For exploratory data analysis and processing.
- SQL: For relational database design and advanced queries.
- Spotify_SQL.ipynb: Jupyter notebook for cleaning and preparing the data, creating tables, and adding relationships.
- queries.sql: Collection of SQL queries used to analyze the data and demonstrate SQL skills.
- erd_screenshot.png: Visual representation of the database schema (ERD) showing relationships between tables.
- Clone this repository.
- Use the included Jupyter notebook to prepare the dataset (optional if starting from raw data).
- Load the prepared CSV files into your preferred SQL database.
- Execute the SQL queries in the provided script or your own database environment.
- Add more advanced SQL queries to showcase skills in:
- Window functions (e.g., running totals, rankings).
- Recursive queries (e.g., finding related artists by genre).
- Correlation analysis (e.g., linking BPM with popularity).
- Create visualizations of query results in Python or Tableau for deeper insights.
- Dataset: The dataset is sourced from Kaggle. Special thanks to the contributors of this dataset.
- Inspiration: Passion for music and data analysis.