Skip to content

zhao-li-github/DataModeling_postgres

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Modeling with Postgres

Introduction

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis. More specifically, we need to create a database schema and ETL pipeline for this analysis.

Datasets

The data in this project reside in two datasets:

Song dataset:

A directory with JSON metadata on the songs in their app. This is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log dataset:

A directory of JSON logs on user activity on the app. It is generated by this event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
The log files in the dataset are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

And below is an example of what the data in a log file, 2018-11-12-events.json, looks like:
image

Schema

A star schema is defined to optimize data write times by reducing duplicate data entris. It contains 1 fact table (songplays) and 4 dimension tables (songs, artists, time, and users):
image
(The schema diagram is created using dbdiagram.io)

Project Steps

  1. Ensure the data folder and all project files are downloaded and that all dependencies are met. Replace the given connection strings in create_tables.py, etl.ipynb and etl.py with your own, pointing to a postgres database server you have set up.
  2. Run create_tables.py to create the database and tables.
  3. Start and run test.ipynb to confirm the creation of all tables with correct columns. Shut down the kernel and close the notebook.
  4. Start and run etl.ipynb to ensure the transformation and loading scripts work for a single song and log file. Shut down the kernel and close the notebook.
  5. Run create_tables.py to reset the database.
  6. Run etl.py to start the ETL data pipeline and process the entire JSON databases to the Sparkify database.

Note

If connection error is raised, try the following in terminal:
psql postgres
Then CREATE ROLE student WITH LOGIN PASSWORD 'student';
ALTER ROLE student CREATEDB;
\du (to check if the role is created and the attributes)
\q
Then in create_tables.py change the connect part into conn = psycopg2.connect("host='localhost' dbname='postgres' user=student password=student") and run.
Ref: https://www.sqlshack.com/setting-up-a-postgresql-database-on-mac/

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors