Problem Statement

Location-based platforms such as Google Maps, Yelp, and TikTok Places face a growing problem: an overwhelming number of spammy, irrelevant, and low-quality reviews. Users depend on these reviews for decisions about dining, shopping, and travel, but misleading or off-topic content erodes trust. Our challenge: automatically filter unreliable reviews while surfacing those that are authentic, concise, and useful.

Solution Overview

We built an unsupervised anomaly-detection pipeline that:

  • Processes raw location reviews with text cleaning and feature preparation (numeric and categorical metadata).
  • Generates semantic embeddings from review text using a frozen DistilBERT encoder.
  • Combines embeddings with metadata (scaled numeric features and one-hot categorical features) into a hybrid feature space.
  • Trains an IsolationForest model to detect anomalous reviews (advertisements, irrelevant content, rants without visits).
  • Evaluates against LLM-derived policy labels (when available), tuning the anomaly threshold to maximize F1-score and reporting precision and recall for reliable anomaly detection.

Solution Pipeline

1) Input Data

  • Raw Google location reviews from McAuley Labs

2) Text Preprocessing

  • Cleaning, token normalization, lemmatization (spaCy, regex, langid for language detection)
  • Embedding with DistilBERT (CLS token, no fine-tuning)
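
The cleaning step can be sketched with stdlib regex alone; the patterns below are illustrative stand-ins, and the full pipeline additionally runs langid language filtering and spaCy lemmatization on the result:

```python
import re

def clean_review(text: str) -> str:
    """Minimal cleaning sketch: lowercase, strip URLs and stray HTML,
    drop unusual symbols, collapse whitespace. (langid filtering and
    spaCy lemmatization, used in the real pipeline, are omitted here
    to keep the sketch dependency-free.)"""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)       # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)            # drop stray HTML tags
    text = re.sub(r"[^a-z0-9\s'!?.,]", " ", text)   # keep basic punctuation
    return re.sub(r"\s+", " ", text).strip()        # normalize whitespace

print(clean_review("BEST pizza!! visit https://spam.example <b>NOW</b>"))
# best pizza!! visit now
```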

3) Metadata Features

  • Numeric: n_chars, n_words, hour, dayofweek, month, user_review_count, user_avg_rating, sent_vader
  • Categorical: geo_cluster
  • Transformed via StandardScaler and OneHotEncoder

4) Feature Fusion

  • Concatenate [DistilBERT embeddings] + [numeric features] + [categorical features] → hybrid feature matrix

5) Model Training

  • Train an IsolationForest anomaly detector (unsupervised)
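
Training reduces to a single unsupervised fit; the sketch below plants a few synthetic outliers in random data (a stand-in for the hybrid matrix) and flips the sign of `score_samples` so that larger means more anomalous:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))   # stand-in for the hybrid feature matrix
X[:10] += 6.0                   # plant a few obvious outliers

# Hyperparameters here are illustrative, not tuned values.
iso = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
iso.fit(X)
scores = -iso.score_samples(X)  # flip sign so higher = more anomalous
print(scores[:10].mean() > scores[10:].mean())  # True: outliers score higher
```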

6) Scoring & Thresholding

  • If labels exist: tune the threshold for maximum F1-score
  • If not: flag the top-k% most anomalous reviews
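
Both branches fit in one hypothetical helper (`pick_threshold` is an illustrative name, not project code): with labels, sweep the precision–recall curve and return the cutoff maximizing F1; without labels, fall back to a percentile cutoff:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(scores, labels=None, top_pct=5.0):
    """With labels: threshold that maximizes F1 over the PR curve.
    Without labels: flag the top-`top_pct`% most anomalous scores."""
    if labels is not None:
        prec, rec, thr = precision_recall_curve(labels, scores)
        f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
        return thr[np.argmax(f1[:-1])]  # final PR point has no threshold
    return np.percentile(scores, 100 - top_pct)

scores = np.array([0.1, 0.2, 0.9, 0.95])
labels = np.array([0, 0, 1, 1])
t = pick_threshold(scores, labels)
print((scores >= t).astype(int))  # [0 0 1 1]
```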

7) Evaluation & Output

  • Metrics: Accuracy, Precision, Recall, F1-score, Confusion Matrix
  • Output flagged reviews for moderation and downstream filtering
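
Once flags and labels line up, the evaluation step is standard scikit-learn; the labels below are toy values for illustration:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = np.array([0, 0, 1, 1, 0, 1])  # toy policy labels (1 = violation)
y_pred = np.array([0, 1, 1, 1, 0, 0])  # toy flags from the detector

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
```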

Features & Functionality

  • Language Detection: filter out non-target languages (langid)
  • Text Preprocessing: remove noise, normalize tokens, lemmatize with spaCy and regex
  • Semantic Embeddings: DistilBERT generates meaningful vector representations
  • Unsupervised Anomaly Detection: IsolationForest flags anomalous reviews at scale
  • Evaluation Metrics: scikit-learn classification metrics for reliability
  • Visual Insights: seaborn and matplotlib plots of data distribution, anomaly scores, and performance
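
The semantic-embedding step above can be sketched as follows (downloads the standard `distilbert-base-uncased` checkpoint on first use; batch size and max length are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()  # frozen encoder: no fine-tuning

@torch.no_grad()
def embed(texts, max_length=128):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    out = model(**batch)
    # Position 0 of the last hidden state is the [CLS] token's vector.
    return out.last_hidden_state[:, 0, :].numpy()

vecs = embed(["Great food, friendly staff!", "BUY FOLLOWERS NOW!!!"])
print(vecs.shape)  # (2, 768)
```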

Development Tools

  • Google Colab for prototyping and model training
  • Jupyter Notebook for exploratory analysis
  • VSCode for code cleanup and integration testing

APIs and Extensions

  • Current version: no external APIs required

Future integrations:
  • Google Maps / Places API for live review ingestion
  • OpenAI GPT-4o for semantic enrichment and contextual evaluation

Libraries & Frameworks

  • Machine Learning & NLP: scikit-learn, spaCy, regex, langid
  • Deep Learning: Hugging Face Transformers, PyTorch
  • Data Handling: pandas, numpy
  • Visualization: seaborn, matplotlib

Assets & Datasets

  • Google Local Reviews Dataset (JSON/Gzip) processed in Colab
  • Manually labeled data samples for ground-truth validation
  • Colab Drive storage for dataset management

Impact

By combining linguistic analysis and machine learning, our system provides platforms with a scalable and reliable way to filter review quality. This strengthens user trust and ensures that only the most relevant and authentic reviews surface for decision-making.
