Data Science

This repository features a number of examples of work I did while pursuing a Master of Information and Data Science at UC Berkeley. I completed the MIDS degree in August 2016.

Information regarding the final Capstone Project for the degree can be found here: https://www.ischool.berkeley.edu/projects/2016/brand-buzz-reddit. This was a group project completed with my colleagues Vincent Chio, Filip Krunic and Ritesh Soni. Our final presentable, which can be accessed through the previous link, can also be viewed directly here: http://team-gilded-dashboard.herokuapp.com/

The contents of this repo are divided as follows:

Audio Quality Experiment: My colleague Jasen Jones and I designed an experiment with the aim to answer the question of whether audio quality affects a user’s enjoyment of music. Included is a link to our final report findings, the data we collected as well as some of the code for the analysis I did using R.
Machine Learning Examples: These examples are presented using Python in Jupyter notebooks and they make use of re (regex module), numpy, matplotlib and scikit-learn
- Digit Classification of MNIST data: Using the popular MNIST database of handwritten digits, I implement a image recognition system using the k-nearest neighbors algorithm (KNN) and the naive Bayes classifier (NB).
- Topic Classication: Using text data from newsgroup postings, I implement classifiers to distinguish between the topics based on the text using KNN, NB and logistic regression (LR).
- Exploring Properties of Mushrooms: Using a classic dataset featuring over 8000 observations of mushrooms described using a variety of features and whether the mushrooms are poisonous or not, I use principal component analysis (PCA) to reduce the dimensionality to visualize the data. I then experiment with clustering using KMeans and density estimation using Gaussian Mixture Models (GMM).
- Influencers in Social Networks: This was a group project done based on the a Kaggle competition for predicting which people are influential in a social network. Our submission makes use of PCA, LR and ensemble methods such as bagging, boosting and random forests.
Streaming Tweet Processing: Having set up a Spark cluster on SoftLayer, I use Scala and Spark streaming to build a Twitter popular topic and user reporting system.
Click-Through Rate (CTR) Prediction: Using a Jupyter notebook template with Python and Spark, I use the Criteo Labs dataset from a Kaggle competion for featurizing categorical data using one-hot encoding (OHE) and predicting CTR.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
Audio Quality Experiment		Audio Quality Experiment
CTR Prediction		CTR Prediction
Machine Learning		Machine Learning
Streaming Tweet Processing		Streaming Tweet Processing
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Science

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Data Science

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages