Data Modeling with Postgres

Introduction

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis. More specifically, we need to create a database schema and ETL pipeline for this analysis.

Datasets

The data in this project reside in two datasets:

Song dataset:

A directory with JSON metadata on the songs in their app. This is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log dataset:

A directory of JSON logs on user activity on the app. It is generated by this event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
The log files in the dataset are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

And below is an example of what the data in a log file, 2018-11-12-events.json, looks like:

Schema

A star schema is defined to optimize data write times by reducing duplicate data entris. It contains 1 fact table (songplays) and 4 dimension tables (songs, artists, time, and users):

(The schema diagram is created using dbdiagram.io)

Project Steps

Ensure the data folder and all project files are downloaded and that all dependencies are met. Replace the given connection strings in create_tables.py, etl.ipynb and etl.py with your own, pointing to a postgres database server you have set up.
Run create_tables.py to create the database and tables.
Start and run test.ipynb to confirm the creation of all tables with correct columns. Shut down the kernel and close the notebook.
Start and run etl.ipynb to ensure the transformation and loading scripts work for a single song and log file. Shut down the kernel and close the notebook.
Run create_tables.py to reset the database.
Run etl.py to start the ETL data pipeline and process the entire JSON databases to the Sparkify database.

Note

If connection error is raised, try the following in terminal:
psql postgres
Then CREATE ROLE student WITH LOGIN PASSWORD 'student';
ALTER ROLE student CREATEDB;
\du (to check if the role is created and the attributes)
\q
Then in create_tables.py change the connect part into conn = psycopg2.connect("host='localhost' dbname='postgres' user=student password=student") and run.
Ref: https://www.sqlshack.com/setting-up-a-postgresql-database-on-mac/

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
README.md		README.md
create_tables.py		create_tables.py
etl.ipynb		etl.ipynb
etl.py		etl.py
sql_queries.py		sql_queries.py
test.ipynb		test.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Modeling with Postgres

Introduction

Datasets

Song dataset:

Log dataset:

Schema

Project Steps

Note

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Data Modeling with Postgres

Introduction

Datasets

Song dataset:

Log dataset:

Schema

Project Steps

Note

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages