Sparkify is a new online music streaming app. The company has collected user activity logs and song metadata and wants to analyze which songs users are listening to. Currently this information is stored in JSON files, with no easy way to run the analytical queries the team needs.

The solution to this problem is to build an ETL pipeline that extracts the data from the JSON files, transforms it, and loads it into a Postgres database.
- Postgres database
- Python 3.6 with the pandas library
- Jupyter Notebooks (optional, for practice)
- SQL
- Python
- JSON
- Song Dataset: a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
- Log Dataset: log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
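The difference between the two file layouts matters for how they are read: a song file holds a single JSON object, while a log file holds one JSON object per line. The sketch below illustrates this with pandas on throwaway files; the field names and values are hypothetical stand-ins for the real dataset's schema.

```python
import json
import tempfile

import pandas as pd

# Hypothetical song record; the real Million Song subset has more fields.
song = {"song_id": "SO1", "title": "Der Kleine Dompfaff",
        "artist_name": "Line Renaud", "year": 0, "duration": 152.92}

# Hypothetical log events: one JSON object per line, as in the log dataset.
events = [{"userId": "39", "song": "Der Kleine Dompfaff", "page": "NextSong"},
          {"userId": "39", "song": None, "page": "Home"}]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(song, f)
    song_path = f.name

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write("\n".join(json.dumps(e) for e in events))
    log_path = f.name

# Single JSON object -> read as a Series.
song_df = pd.read_json(song_path, typ="series")

# One JSON object per line -> lines=True yields a DataFrame.
log_df = pd.read_json(log_path, lines=True)

print(song_df["title"])
print(len(log_df))
```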
- Run the `database_setup.ipynb` Jupyter notebook
- Run the `etl_process.ipynb` Jupyter notebook
- Install Python 3.6 with the pandas library.
- Install the Postgres 10 database.
- Create a `student` user in the Postgres database (as the `postgres` superuser).
- Create a `studentdb` database in Postgres (as the `postgres` superuser).
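The two creation steps above can be done from the shell. This is a sketch assuming a local install with the default `postgres` superuser; the password `student` is an assumption of the usual project setup, so adjust names to your environment.

```shell
# Create the project user and database as the postgres superuser (assumed names).
sudo -u postgres psql -c "CREATE USER student WITH PASSWORD 'student';"
sudo -u postgres psql -c "CREATE DATABASE studentdb OWNER student;"
```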
- Download `sql_queries.py`, `create_tables.py`, and `etl.py` into a single directory along with the `data` folder.
- Run `create_tables.py`
- Run `etl.py`
- Validate that the tables `songplays`, `users`, `songs`, `artists`, and `time` contain the data inserted from the JSON files.
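A minimal way to validate the load is to count rows in each table. The helper below builds the count queries; the commented connection string matches the `student`/`studentdb` setup described above and is an assumption, so adjust it for your environment.

```python
# Tables created by the ETL pipeline.
TABLES = ["songplays", "users", "songs", "artists", "time"]

def count_query(table):
    """Build a simple row-count query for one table."""
    return f"SELECT COUNT(*) FROM {table};"

def validate(cur, tables=TABLES):
    """Return a dict of table -> row count; every table should be non-empty."""
    counts = {}
    for t in tables:
        cur.execute(count_query(t))
        counts[t] = cur.fetchone()[0]
    return counts

# Usage (requires a running Postgres with the ETL output; assumed credentials):
# import psycopg2
# conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
# print(validate(conn.cursor()))
```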
- The database is designed as a star schema with fact and dimension tables.
- `psycopg2` is used to create the tables and insert data into the Postgres database.
- The program first imports the dimension data into `songs`, `artists`, `users`, and `time`, and then queries these tables to fill the fact table `songplays`.
- To identify multiple JSON files in a directory, the `os` and `glob` Python libraries are used. To read more on glob: https://swcarpentry.github.io/python-novice-inflammation/04-files/
```python
import glob
import os

all_files = []
for root, dirs, files in os.walk(filepath):
    files = glob.glob(os.path.join(root, '*.json'))
    for f in files:
        all_files.append(os.path.abspath(f))
```
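Wrapped in a helper, the walk-and-glob pattern above can be exercised on a temporary directory tree. The name `get_files` follows the project's `etl.py` convention and is an assumption here.

```python
import glob
import os
import tempfile

def get_files(filepath):
    """Collect absolute paths of all .json files under filepath, recursively."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))
    return all_files

# Exercise it on a throwaway tree: two JSON files and one non-JSON file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for name in ["a.json", os.path.join("sub", "b.json"), "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = get_files(tmp)
    print(len(found))  # 2
```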
- For importing and reading JSON for song records: `df = pd.read_json(filepath, typ='series')`
- For reading one JSON object per line in the log files: `df = pd.read_json(filepath, lines=True)`
- After massaging the song and log JSON data into dataframes, the data is inserted into the database tables by executing the SQL queries defined in `sql_queries.py`, e.g. `cur.execute(song_table_insert, song_data)`
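The dimension-first, fact-second flow can be sketched with the standard library's `sqlite3` standing in for Postgres. The table and column names mirror the project's star schema, and the lookup query mimics the role of `song_select` in `sql_queries.py` (the exact Postgres DDL and query text there may differ).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables are created and loaded first (minimal columns for the sketch).
cur.execute("CREATE TABLE songs (song_id TEXT, title TEXT, artist_id TEXT, duration REAL)")
cur.execute("CREATE TABLE artists (artist_id TEXT, name TEXT)")
cur.execute("CREATE TABLE songplays (user_id TEXT, song_id TEXT, artist_id TEXT)")

cur.execute("INSERT INTO artists VALUES ('AR1', 'Line Renaud')")
cur.execute("INSERT INTO songs VALUES ('SO1', 'Der Kleine Dompfaff', 'AR1', 152.92)")

# The fact load looks up song_id/artist_id by title, artist name, and duration.
song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = ? AND a.name = ? AND s.duration = ?
"""
cur.execute(song_select, ("Der Kleine Dompfaff", "Line Renaud", 152.92))
song_id, artist_id = cur.fetchone()

# Only then is the fact row inserted, referencing the dimension keys.
cur.execute("INSERT INTO songplays VALUES (?, ?, ?)", ("39", song_id, artist_id))
conn.commit()
print(song_id, artist_id)  # SO1 AR1
```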