The notebooks used to compile the report can be found under the notebooks folder.
The report can be found under notebooks/Reports/skills_test.md.
The attached data file (skills_test.csv) contains text data pertaining to Wells Fargo, a US bank. This data includes documents sourced from a variety of platforms including Twitter, Facebook, blogs and forums.
There are two columns in the data file: Text contains the document text, and label contains a binary label indicating whether the document is about pricing (1) or not (0).
You have two tasks:
- classify documents based on whether or not they discuss pricing
- summarise the dataset
These tasks give us insight into your technical and methodological knowledge as well as your curiosity, creativity and ability to pick up new methods quickly. Good luck!
Your task, should you choose to accept it, is to correctly classify the documents into classes 0 and 1.
It is worth trying not only a variety of classification algorithms (SVM, RandomForest, NaiveBayes, Neural Net etc.) but also a variety of tokenization approaches!
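To make "a variety of tokenization approaches" concrete, here is a minimal sketch of three common options: word tokens, word n-grams and character n-grams. The function names and the sample sentence are hypothetical and chosen only for illustration. In practice, a library tokenizer or vectorizer would likely replace these helpers.

```python
import re

def word_tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_ngrams(text, n=2):
    """Return overlapping sequences of n consecutive word tokens."""
    tokens = word_tokens(text)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n=3):
    """Return overlapping character n-grams after collapsing whitespace."""
    cleaned = re.sub(r"\s+", " ", text.lower()).strip()
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]

# Hypothetical sample document, not taken from skills_test.csv.
sample = "Monthly fees are too high"
print(word_tokens(sample))
print(word_ngrams(sample, n=2))
print(char_ngrams(sample, n=3)[:5])
```

Character n-grams tend to be more forgiving of the misspellings and hashtags typical of social media text, while word n-grams capture short phrases. Which works best here is an empirical question.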
We want to see a summary of the techniques you tried and the results you obtained. We are also interested in how you quantify the results of each technique, so don't forget to report that. The F1 score (the harmonic mean of precision and recall) is a good option here.
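As a reminder of what the F1 score measures, the sketch below computes it for the pricing class (label 1) from true and predicted labels. The function name and the toy label lists are hypothetical. A library implementation would normally be used instead, and this is only meant to show the arithmetic.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy labels for illustration only, not results from the dataset.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(round(f1_score_binary(y_true, y_pred), 3))
```

Reporting precision and recall alongside F1 makes it easier to see whether a model misses pricing documents or over-predicts them.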
We are also interested in your ability to make sense of the data. Please tell us what's in this dataset. Here are some areas you might look into to get started: frequent terms, topics, users, sentiment, etc.
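For the "frequent terms" angle, one simple starting point is a stopword-filtered term count. The sketch below uses only the standard library. The stopword set, the function name and the example documents are all hypothetical placeholders, not content from the dataset.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "of", "my", "on", "for", "in"}

def top_terms(documents, k=3):
    """Count non-stopword tokens across all documents and return the k most common."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z0-9']+", doc.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(k)

# Made-up documents for illustration only.
docs = [
    "The overdraft fees are too high",
    "Fees on my account went up again",
    "Great service at the branch today",
]
print(top_terms(docs, k=1))
```

Running the same count separately for label 0 and label 1 documents is a quick way to see which terms are associated with pricing.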
Be creative and have fun!
Live long and prosper 🖖