The task is to predict the rating a user will give to a song (https://www.kaggle.com/c/MusicHackathon).
The interesting part is that this problem comes with a large amount of data, including users' ratings, profiles, preferences, and so on, in various formats: ratings, words, binary flags. So the big challenge here is how to select features, which turns out to be the key to this problem.
The basic idea of my approach is to build one model per artist (rather than one per artist-track pair). For a particular artist, I extract all of its ratings from train.csv and build features for each user from both users.csv and words.csv. From users.csv (which contains the users' profiles) I take each user's age, sex, and answers to the habit questions. From words.csv (the user survey) I use the score the user gave to this song as additional features. I then combine the two feature sets and fit a Lasso regression (linear regression with an L1-norm penalty) to build the model.
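The per-artist setup described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the submission: the names `fit_lasso`, `train_per_artist`, and `predict` are hypothetical, the toy features are assumed to be already numeric and standardised, the survey scores are assumed to be keyed by (artist, user), and the Lasso is a plain coordinate-descent loop written out by hand so the snippet has no dependencies.

```python
from collections import defaultdict


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 shrinkage step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, alpha, n_iter=100):
    """Coordinate descent for (1/2n)*||y - b - Xw||^2 + alpha*||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = sum(y) / n  # unpenalised intercept
    for _ in range(n_iter):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_wo_j = b + sum(w[k] * X[i][k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
        b = sum(y[i] - sum(w[k] * X[i][k] for k in range(d)) for i in range(n)) / n
    return w, b


def train_per_artist(ratings, user_features, word_scores, alpha):
    """Fit one Lasso model per artist from (artist, user, rating) rows."""
    grouped = defaultdict(lambda: ([], []))
    for artist, user, rating in ratings:
        X, y = grouped[artist]
        # Profile features (users.csv) followed by survey features (words.csv).
        X.append(user_features[user] + word_scores[(artist, user)])
        y.append(float(rating))
    return {artist: fit_lasso(X, y, alpha) for artist, (X, y) in grouped.items()}


def predict(models, artist, user, user_features, word_scores):
    """Predict a rating with the model of the given artist."""
    w, b = models[artist]
    x = user_features[user] + word_scores[(artist, user)]
    return b + sum(wj * xj for wj, xj in zip(w, x))


# Toy data (made up for illustration): one profile feature per user,
# one survey score per (artist, user) pair.
user_features = {"u1": [-1.0], "u2": [1.0], "u3": [-1.0], "u4": [1.0]}
word_scores = {
    ("artistA", "u1"): [-1.0], ("artistA", "u2"): [-1.0],
    ("artistA", "u3"): [1.0], ("artistA", "u4"): [1.0],
    ("artistB", "u1"): [1.0], ("artistB", "u2"): [-1.0],
}
ratings = [
    ("artistA", "u1", 35), ("artistA", "u2", 55),
    ("artistA", "u3", 45), ("artistA", "u4", 65),
    ("artistB", "u1", 20), ("artistB", "u2", 40),
]
models = train_per_artist(ratings, user_features, word_scores, alpha=1.0)
```

In practice a library implementation of Lasso would be the natural choice; the hand-written loop is only there to show what the L1 penalty does: a larger `alpha` shrinks more weights to exactly zero, which is what makes it usable as a feature selector.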
Due to time constraints, I did not fully optimize the algorithm, and a lot of work remains to be done. My final RMSE was 16.68; the leader's was 13.24.
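For reference, RMSE (root mean squared error) between predicted and actual ratings can be computed as in the small sketch below. The function name `rmse` and the sample numbers are only for illustration and are not taken from the competition data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings: two predictions that are off by 3 and 4 points.
example_score = rmse([50.0, 60.0], [53.0, 56.0])
```

Lower is better, and because the errors are squared before averaging, a few badly wrong predictions weigh more than many slightly wrong ones.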