A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis. More specifically, we need to create a database schema and ETL pipeline for this analysis.
The data in this project reside in two datasets:
A directory with JSON metadata on the songs in their app. This is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
A directory of JSON logs on user activity on the app. It is generated by this event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
The log files in the dataset are partitioned by year and month. For example, here are filepaths to two files in this dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
And below is an example of what the data in a log file, 2018-11-12-events.json, looks like:

A star schema is defined to optimize data write times by reducing duplicate data entris. It contains 1 fact table (songplays) and 4 dimension tables (songs, artists, time, and users):

(The schema diagram is created using dbdiagram.io)
- Ensure the data folder and all project files are downloaded and that all dependencies are met. Replace the given connection strings in
create_tables.py,etl.ipynbandetl.pywith your own, pointing to a postgres database server you have set up. - Run
create_tables.pyto create the database and tables. - Start and run
test.ipynbto confirm the creation of all tables with correct columns. Shut down the kernel and close the notebook. - Start and run
etl.ipynbto ensure the transformation and loading scripts work for a single song and log file. Shut down the kernel and close the notebook. - Run
create_tables.pyto reset the database. - Run
etl.pyto start the ETL data pipeline and process the entire JSON databases to the Sparkify database.
If connection error is raised, try the following in terminal:
psql postgres
Then CREATE ROLE student WITH LOGIN PASSWORD 'student';
ALTER ROLE student CREATEDB;
\du (to check if the role is created and the attributes)
\q
Then in create_tables.py change the connect part into conn = psycopg2.connect("host='localhost' dbname='postgres' user=student password=student") and run.
Ref: https://www.sqlshack.com/setting-up-a-postgresql-database-on-mac/