The cornerstone of the project is Sparkify, a startup that streams music to its customers. The company wants to move its processes and the respective data from its song database onto a professional cloud solution. Key facts are the use of an AWS S3 bucket holding JSON-based log files: the JSON files capture the users' activity on the app as well as the metadata of the songs.
As a data engineer, the main tasks are:
- building an ETL pipeline that extracts their data from S3
- staging the data in AWS Redshift
- transforming the data into a set of dimensional tables for analytical reasons
- finding insights into what kind of songs the customers are listening to
The ETL pipeline consists of three Python files, which are described below.
The primary file is `etl.py`, the ETL pipeline builder. Its two main functions are:
- `load_staging_tables`: loads the raw data from the AWS S3 bucket into the staging tables hosted in Redshift
- `insert_tables`: runs the transformation process from the staging tables to the dimensional tables needed for the analytics
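A minimal sketch of how these two `etl.py` functions might look. The query list names `copy_table_queries` and `insert_table_queries` are assumptions (in the real project they would come from `sql_queries.py`); here they are placeholders so the loop structure is visible:

```python
# Hypothetical sketch of etl.py's core loop. The query lists are
# placeholders; in the project they would be imported from sql_queries.py.
copy_table_queries = [
    "COPY staging_events FROM 's3://...'",  # placeholder COPY statements
    "COPY staging_songs FROM 's3://...'",
]
insert_table_queries = [
    "INSERT INTO songplays ...",  # placeholder INSERT ... SELECT statements
    "INSERT INTO users ...",
]

def load_staging_tables(cur, conn):
    # Run each COPY statement to pull the raw S3 data into the staging tables
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()

def insert_tables(cur, conn):
    # Transform the staging tables into the fact and dimension tables
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()
```

In the real script, `cur` and `conn` would be a `psycopg2` cursor and connection opened against the Redshift cluster.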
`create_tables.py` focuses on the schema, creating the staging, fact, and dimension tables. Its two main functions are:
- `drop_tables`: drops any existing tables
- `create_tables`: creates the staging, fact, and dimension tables
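These two functions follow the same execute-and-commit pattern. A sketch, with the list names `drop_table_queries` and `create_table_queries` assumed by convention rather than taken from the source:

```python
# Hypothetical sketch of create_tables.py. The list names are assumptions;
# they would be assembled in sql_queries.py from the *_table_drop and
# *_table_create statements.
drop_table_queries = [
    "DROP TABLE IF EXISTS songplays",  # placeholder: one per table
    "DROP TABLE IF EXISTS users",
]
create_table_queries = [
    "CREATE TABLE songplays (songplay_id INT)",  # placeholder DDL
    "CREATE TABLE users (user_id INT)",
]

def drop_tables(cur, conn):
    # Remove existing tables so the schema can be rebuilt from scratch
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()

def create_tables(cur, conn):
    # Create the staging, fact, and dimension tables
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()
```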
`sql_queries.py` is a collection of the SQL query statements used by the aforementioned files `create_tables.py` and `etl.py`. The queries follow four naming patterns:
- `*_table_drop`
- `*_table_create`
- `staging_*_copy`
- `*_table_insert`
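As an illustration of the four patterns, the entries for the `users` table and the event staging copy might look like this. The S3 paths, IAM role placeholder, and region are hypothetical, not the project's actual values:

```python
# Hypothetical excerpts from sql_queries.py showing one query per
# naming pattern. Bucket paths, the IAM role ARN, and the region
# are placeholders.
user_table_drop = "DROP TABLE IF EXISTS users"

user_table_create = """
CREATE TABLE users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
)
"""

staging_events_copy = """
COPY staging_events FROM 's3://<bucket>/log_data'
CREDENTIALS 'aws_iam_role=<ARN>'
REGION '<region>'
FORMAT AS JSON 's3://<bucket>/log_json_path.json'
"""

user_table_insert = """
INSERT INTO users (user_id, first_name, last_name, gender, level)
SELECT DISTINCT userId, firstName, lastName, gender, level
FROM staging_events
WHERE userId IS NOT NULL
"""
```

The `SELECT DISTINCT` in the insert reflects that staging rows repeat users across sessions, while the dimension table should hold each user once.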
The database schema consists of staging tables, a fact table, and multiple dimension tables. In detail, the structure is as follows:
- `staging_events`: artist VARCHAR, auth VARCHAR, firstName VARCHAR, gender CHAR(1), itemInSession INT, lastName VARCHAR, length FLOAT, level VARCHAR, location TEXT, method VARCHAR, page VARCHAR, registration VARCHAR, sessionId INT, song VARCHAR, status INT, ts BIGINT, userAgent TEXT, userId INT
- `staging_songs`: artist_id VARCHAR, artist_latitude FLOAT, artist_location TEXT, artist_longitude FLOAT, artist_name VARCHAR, duration FLOAT, num_songs INT, song_id VARCHAR, title VARCHAR, year INT
- `songplays`: songplay_id INT IDENTITY(0,1), start_time TIMESTAMP, user_id INT, level VARCHAR, song_id VARCHAR, artist_id VARCHAR, session_id INT, location TEXT, user_agent TEXT
- `users`: user_id INT, first_name VARCHAR, last_name VARCHAR, gender CHAR(1), level VARCHAR
- `songs`: song_id VARCHAR, title VARCHAR, artist_id VARCHAR, year INT, duration FLOAT
- `artists`: artist_id VARCHAR, name VARCHAR, location TEXT, latitude FLOAT, longitude FLOAT
- `time`: start_time TIMESTAMP, hour INT, day INT, week INT, month INT, year INT, weekday VARCHAR
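As an example of how this schema maps to DDL, the fact table's create statement could look like the following sketch; the `NOT NULL` constraints are assumptions, while the columns and the Redshift `IDENTITY(0,1)` surrogate key come from the schema above:

```python
# Hypothetical *_table_create entry for the fact table. Redshift's
# IDENTITY(0,1) auto-generates the songplay_id surrogate key.
songplay_table_create = """
CREATE TABLE songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    TEXT,
    user_agent  TEXT
)
"""
```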
The following ER diagram gives an impression of how the tables are connected: https://github.com/KCvW/DataEng/blob/main/ER%20Staging%20Tables.jpeg