Sparkify Song Data Analysis

Problem Statement

Sparkify is an online new music streaming app. They have collected user activity info and want to do analysis on what songs the users are listening to by doing analytical search on the data.As of now the information is store in JSON files and other metadata files and there is no easy way to search.

The query that the user wants to run

Solution

The solution that is designed to solve the above problem is to build a ETL pipleline by extracting the data from JSON files and then transforming it and storing it into a Postgres database.

Software

Postgres database
Python 3.6 with panda library
Jupyter Notebooks - optional for practive

Programming languages

SQL
Python
JSON

DataSet

Song Dataset The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
Log Dataset The dataset consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These simulate app activity logs from a music streaming app based on specified configurations.

How to run from Udacity Workspace 👇

Run database_setup.ipynb Jupyter notebook
Run etl_process.ipynb Jupyter notebook

How to run from local 👇

Install Python 3.6 with pandas library.
Install Postgres 10 database
Create student user in Postgres database from postgres
Create studentdb database Postgres database from postgres
Download the sql_queries.py, create_tables.py, etl.py files in a single directory along with data folder.
Run create_tables.py
Run etl.py
Validate the tables songplays, users, songs, artists, time for data inserted from JSON files present

A Quick 🏃 of the flow 👇

Database is designed in Star Schema with Fact and Dimensions table.
psycopg2 is utilized to create the tables and insert data into postgres database.
Program is designed in a way to first import the dimension data into songs, artist, user, time and then from these table do the search query to fill the fact table songplays.
For identifying multiple json files in a directory os, glob python libraries are used To read more on glob https://swcarpentry.github.io/python-novice-inflammation/04-files/

    for root, dirs, files in os.walk(filepath):
        files = glob.glob(os.path.join(root, '*.json'))
        for f in files:
            all_files.append(os.path.abspath(f))

For importing and reading json for song records df = pd.read_json(filepath, typ='series')
For importing and reading json object per line for log jsons df = pd.read_json(filepath, lines=True)
After massaging the data into dataframes from song and log json files, data is inserted into database tables by calling sql queries designed in sql_queries.py from execute function eg. `cur.execute(song_table_insert, song_data)'

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.ipynb_checkpoints		.ipynb_checkpoints
__pycache__		__pycache__
.gitattributes		.gitattributes
README.md		README.md
create_tables.py		create_tables.py
database_setup.ipynb		database_setup.ipynb
etl.ipynb		etl.ipynb
etl.py		etl.py
etl_process.ipynb		etl_process.ipynb
sql_queries.py		sql_queries.py
test.ipynb		test.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sparkify Song Data Analysis

Problem Statement

Solution

Software

Programming languages

DataSet

How to run from Udacity Workspace 👇

How to run from local 👇

A Quick 🏃 of the flow 👇

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Sparkify Song Data Analysis

Problem Statement

Solution

Software

Programming languages

DataSet

How to run from Udacity Workspace 👇

How to run from local 👇

A Quick 🏃 of the flow 👇

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages