GitHub - neeleshn/Data-Mining-Project: Predict Restaurant hygiene through public reviews from social media

Goal: Predict the hygiene of a restaurant from its reviews on social media.

Dataset:

- Violations recorded for all restaurants by the Boston Public Health Department.
- Yelp reviews.

Result: 78% accuracy in predicting whether a restaurant is hygienic or unhygienic.

Data: The data is available at the following link: https://drive.google.com/open?id=0B_LGUXrleYirYkJRc29pLTg0OTQ
Create a data folder in the project, outside the src folder, then unzip the download and place its contents in the data folder.

- allReview.out: The output of the Hadoop program, a map that relates each Yelp review ID to a sentiment score.
- AllViolations.csv: Past restaurant violations, provided by the city of Boston.
- restaurent_ids_to_yelp_ids.csv: Maps each Yelp business ID to the restaurant ID used by the city of Boston.
- train_review.json: Manually labelled data from yelp_academic_dataset_review.json, used to train the Naive Bayes text classifier.
- test_review.json: The remaining data from yelp_academic_dataset_review.json, used as input to the Naive Bayes text classifier to predict whether a review is hygiene-related.
- yelp_academic_dataset_review.json: All reviews, as provided by Yelp.
- yelp_academic_dataset_business.json: Business data, as provided by Yelp.
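
The review-to-sentiment map in allReview.out can be loaded into memory before it is joined with other features. The sketch below assumes each line holds a review ID and a numeric score separated by a tab, which is the default layout of Hadoop text output. The actual layout of the file is not documented here, so the delimiter, the class name `SentimentMapParser`, and the sample IDs are all assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the repository.
class SentimentMapParser {

    // Parses "reviewId<TAB>score" lines into a map.
    // Lines without exactly two columns or with a non-numeric score are skipped.
    static Map<String, Double> parse(Iterable<String> lines) {
        Map<String, Double> scores = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // not a key/value pair
            }
            try {
                scores.put(parts[0].trim(), Double.parseDouble(parts[1].trim()));
            } catch (NumberFormatException e) {
                // score column is not a number; ignore this line
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the real file contents.
        List<String> sample = Arrays.asList("reviewA\t0.75", "reviewB\t-0.5", "broken line");
        Map<String, Double> scores = parse(sample);
        System.out.println(scores.size() + " sentiment scores loaded");
    }
}
```

For the real file, the lines could be read with `java.nio.file.Files.readAllLines` and passed to `parse`, provided the tab-separated assumption holds.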

Hadoop Program:

Software required: Gradle
Build the jar file by running "gradle build" in a terminal from the "hadoop" directory.
Upload yelp_academic_dataset_review.json to S3 and run the jar on AWS EMR.
Download the output of this program into hadoop/postprocessing/output.
Run Main.java in hadoop/postprocessing to produce allReviews.out, which must be placed in the data directory before the rest of the program can run.

Project Build:

Open the source code in an IDE.
Configure the project with a Maven configuration.
The pom.xml provided with the source code should automatically pull the dependencies into the project.
Run "mvn clean install" in a terminal to build the project.

Running the code:

--correlation package

WekaPearsonCorealtion.java: Run the program directly to compute the Pearson correlation for the features.
SpearmanCorelation.java: Run the program directly to compute the Spearman correlation for the features.
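
Both coefficients measure how strongly a feature moves with the target: Pearson works on the raw values, Spearman on their ranks. The sketch below computes both in plain Java to make the difference concrete. It is not the repository's code; the class name `CorrelationSketch` and the sample numbers are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration, not the repository's implementation.
class CorrelationSketch {

    // Pearson correlation: covariance divided by the product of the standard deviations.
    // Returns NaN if either array is constant (zero variance).
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double meanX = Arrays.stream(x).average().getAsDouble();
        double meanY = Arrays.stream(y).average().getAsDouble();
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    // 1-based ranks; tied values share the average of the positions they occupy.
    static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));
        double[] ranks = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && values[order[j + 1]] == values[order[i]]) {
                j++; // extend the run of tied values
            }
            double averageRank = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) {
                ranks[order[k]] = averageRank;
            }
            i = j + 1;
        }
        return ranks;
    }

    // Spearman correlation: Pearson correlation of the ranks.
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        // Made-up values: star rating vs. number of violations per restaurant.
        double[] stars = {4.5, 4.0, 3.5, 3.0, 2.0};
        double[] violations = {0, 1, 1, 3, 6};
        System.out.printf("Pearson:  %.3f%n", pearson(stars, violations));
        System.out.printf("Spearman: %.3f%n", spearman(stars, violations));
    }
}
```

Spearman is the safer choice when a feature relates to the target monotonically but not linearly, since it only looks at ordering.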

--linearregression package

Programs that use only the review text as a feature (run each of them directly):
- J48ClassifierHygieneRelated.java
- NaiveBayesHygieneRelatedClassfier.java
- RandomForestClassifierForHygieneRelated.java
Programs that combine the review text with features from business.json:
- J48FeatureDemo.java
- NaiveBayesClassifier.java
- RandonForestDemo.java
Run RegressionDemo.java to see the results of Ridge Regression.
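
The text-only classifiers listed above decide from the words of a review whether it is hygiene-related. As a rough illustration of that idea, here is a minimal multinomial Naive Bayes sketch with add-one smoothing in plain Java. It is not the repository's implementation: the class name `TinyNaiveBayes`, the labels, the tokenizer, and the example sentences are all made up for this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not the repository's classifier.
class TinyNaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercases the text and splits on anything that is not a letter a-z.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z]+");
    }

    // Adds one labelled document to the word and document counts.
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Returns the label with the highest log prior plus summed log likelihoods.
    String predict(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) {
                    continue;
                }
                int count = counts.getOrDefault(token, 0);
                // Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        // Made-up training sentences; real input would come from train_review.json.
        nb.train("hygiene", "dirty kitchen and roaches near the tables");
        nb.train("hygiene", "the bathroom was filthy and smelled");
        nb.train("other", "great pasta and friendly service");
        nb.train("other", "loved the dessert and the wine");
        System.out.println(nb.predict("filthy tables and roaches"));
    }
}
```

With real data, each review text from train_review.json would be passed to `train` together with its manual label; JSON parsing is left out to keep the sketch free of dependencies.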

