Problem Statement:

The track chosen for this project is Track 1: Filtering the Noise: ML for Trustworthy Location Reviews

Process:

To begin, we scraped our dataset from Google Maps, collecting at most 20 reviews per location from 181 locations spanning restaurants and shops across Singapore.

We cleaned the data and categorised each column's datatype as numeric or categorical. The dataset has many columns, which we grouped into 'Similar data', 'Miscellaneous', or 'Tentative', each handled differently. 'Similar data' columns convey overlapping information, so some were combined into a single more comprehensive column while the rest were dropped. 'Miscellaneous' columns provide little insight for analysis, so we dropped them. We then handled missing values and converted booleans to binary (1/0). Finally, we conducted exploratory data analysis to find correlations between the numeric columns, and ran association analysis between the 'label' column and the other variables, before settling on the columns to use for our training data.
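A minimal pandas sketch of these cleaning steps; the column names and values below are illustrative stand-ins for our real schema:

```python
import pandas as pd

# Toy frame standing in for the scraped dataset; 'scrape_batch_id' is a
# made-up example of a 'Miscellaneous' column that gets dropped.
df = pd.DataFrame({
    "review_text": ["Great food", None, "Visit now!!!"],
    "is_local_guide": [True, False, True],
    "scrape_batch_id": [1, 1, 2],
})

df = df.drop(columns=["scrape_batch_id"])                # drop low-insight columns
df["review_text"] = df["review_text"].fillna("")         # handle missing values
df["is_local_guide"] = df["is_local_guide"].astype(int)  # booleans -> 1/0
```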

We also conducted feature engineering, computing the length of each review, its sentiment polarity, and its sentiment subjectivity. In addition, we detected the presence of links, phone numbers, and emails, counted fully capitalised words, and so on. Our reasoning is as follows:

  • Length of reviews: Advertisements may be much shorter or longer than normal reviews.
  • Sentiment polarity and subjectivity: Overly negative sentiment likely indicates a rant.
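These hand-crafted signals can be sketched as plain-Python feature extractors. The regex patterns and function name below are illustrative, not our exact implementation; sentiment polarity and subjectivity came from TextBlob and are omitted here to keep the sketch dependency-free:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

def engineer_features(text: str) -> dict:
    """Spam/advertisement signals computed per review."""
    return {
        "review_length": len(text),
        "has_link": int(bool(URL_RE.search(text))),
        "has_phone": int(bool(PHONE_RE.search(text))),
        "has_email": int(bool(EMAIL_RE.search(text))),
        # words written entirely in capitals, e.g. "SALE", "NOW"
        "n_all_caps": sum(1 for w in text.split() if w.isupper() and len(w) > 1),
    }
```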

Afterwards, we labelled our data into the categories 'Relevant and quality', 'Relevant', 'no review', 'Rants without visit', 'Irrelevant content', 'Vague', and 'Advertisement'. Data was labelled using a combination of ChatGPT and manual checking to relabel any inconsistencies.

To decide between an LLM and traditional ML, we experimented with both. We settled on LLMs because pre-trained models have been trained on far more data than we could gather from scratch, so we believed they would be more accurate at identifying and classifying reviews. We also decided to experiment with few-shot prompting, as fine-tuning a pre-trained model was very time-consuming.

How our chosen model addresses quality and relevancy:

The solution applies a fine-tuned pre-trained large language model (LLM) and a few-shot LLM, implemented with Hugging Face’s AutoTokenizer and model utilities, to classify location-based reviews into four categories: advertisements, irrelevant content, rants without visits, and relevant content.

A fine-tuned pre-trained LLM directly improves the quality and relevancy of reviews through:

  • Advanced Pattern Recognition: They can distinguish subtle linguistic patterns that separate genuine experiences ("waited 20 minutes, coffee was lukewarm") from promotional content ("Amazing deals! Visit now!") or fake reviews, using sophisticated language understanding developed during pretraining.
  • Contextual Understanding: Unlike simple keyword filters, pretrained LLMs grasp context and intent. They recognize when someone describes an actual visit versus speculation, and can identify relevant location-specific details versus generic complaints.
  • Consistent Classification: They apply the same quality standards across thousands of reviews without human fatigue or bias, ensuring systematic removal of advertisements, spam, and irrelevant content while preserving authentic user experiences.
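The few-shot variant amounts to assembling a handful of labelled examples into the prompt before it is passed to the model. A sketch of that prompt construction, with made-up example reviews (the AutoTokenizer and model call are omitted):

```python
# Labelled examples shown to the model in-context; the reviews here are
# invented for illustration, while the category names are the ones we used.
FEW_SHOT_EXAMPLES = [
    ("Best laksa in Katong, queue moved fast.", "Relevant"),
    ("50% OFF! Visit www.deals.sg today!!!", "Advertisement"),
    ("Never been but I heard it's terrible.", "Rants without visit"),
]

def build_prompt(review: str) -> str:
    lines = ["Classify the review into one of: Relevant and quality, Relevant, "
             "no review, Rants without visit, Irrelevant content, Vague, "
             "Advertisement."]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {text}\nLabel: {label}")
    lines.append(f"Review: {review}\nLabel:")   # model completes the label
    return "\n\n".join(lines)
```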

For our model, the inputs are the columns 'review_text', 'rating_person', 'main_category', 'can_claim', 'is_local_guide', 'sentiment_polarity', and 'sentiment_subjectivity'.
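One way these tabular columns can be flattened into the text the tokenizer sees; the template and function name below are an illustration, not our exact production format:

```python
INPUT_COLUMNS = ["review_text", "rating_person", "main_category", "can_claim",
                 "is_local_guide", "sentiment_polarity", "sentiment_subjectivity"]

def serialise_row(row: dict) -> str:
    # Join each column as "name: value"; missing columns become empty strings,
    # mirroring the cleaning step that fills missing values.
    return " | ".join(f"{col}: {row.get(col, '')}" for col in INPUT_COLUMNS)
```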

Additionally, we ran the model a few times and chose the run with the highest precision.
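The selection criterion is ordinary precision; a minimal reference implementation (scikit-learn's `precision_score` computes the same quantity):

```python
def precision(y_true, y_pred, positive):
    # precision = true positives / (true positives + false positives)
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0
```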

Web Application Development:

We also developed an interactive platform that connects to our model and displays the likely category of a submitted review. It takes in a JSON file of a review and flags reviews that do not follow the guidelines (advertisement, irrelevant content, or rants without visiting). The dashboard tab displays an analysis of the model, as well as the dataset used to train it.
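The flagging logic behind the platform can be sketched as follows; `flag_review` and the `predict` callable are hypothetical names for illustration, standing in for the call into our model:

```python
# Categories that violate the guidelines and get flagged by the platform.
FLAGGED = {"Advertisement", "Irrelevant content", "Rants without visit"}

def flag_review(payload: dict, predict) -> dict:
    # payload: the review JSON the platform accepts, e.g. {"review_text": "..."}
    # predict: any callable that returns one of our category labels
    label = predict(payload)
    return {"category": label, "flagged": label in FLAGGED}
```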

Development Tools Used:

Coding Editor: VSCode, Jupyter Notebook

Version Control: Git, GitHub

Libraries and Frameworks used:

  • LLM: Hugging Face Transformers, PyTorch, pandas, NumPy, SciPy, TextBlob, scikit-learn, accelerate, safetensors, huggingface-hub, tokenizers
  • Web Application: React, Vite, Recharts (for visualisation)
  • Data Analysis: pandas, NumPy, SciPy

Assets and Datasets used: Scraped Google Maps location reviews
