Problem Statement

Location-based platforms such as Google Maps, Yelp, and TikTok Places face a growing problem: an overwhelming number of spammy, irrelevant, and low-quality reviews. Users depend on these reviews for decisions about dining, shopping, and travel, but misleading or off-topic content erodes trust. Our challenge: automatically filter unreliable reviews while surfacing those that are authentic, concise, and useful.

Solution Overview

We built an unsupervised anomaly-detection pipeline that:

  • Processes raw location reviews with text cleaning and feature preparation (numeric and categorical metadata).
  • Generates semantic embeddings from review text using a frozen DistilBERT encoder.
  • Combines embeddings with metadata (scaled numeric features and one-hot categorical features) into a hybrid feature space.
  • Trains an IsolationForest model to detect anomalous reviews (advertisements, irrelevant content, rants without visits).
  • Evaluates against LLM-derived policy labels (when available), tuning the anomaly threshold to maximize F1-score and reporting precision and recall for reliable anomaly detection.

Solution Pipeline

1) Input Data

  • Raw Google location reviews from McAuley Labs

2) Text Preprocessing

  • Cleaning, token normalization, lemmatization (spaCy, regex, langid for language detection)
  • Embedding with DistilBERT (CLS token, no fine-tuning)
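
The cleaning step can be sketched with stdlib regex alone; the patterns below are illustrative stand-ins, and the full pipeline additionally runs langid language filtering and spaCy lemmatization on the result:

```python
import re

def clean_review(text: str) -> str:
    """Minimal cleaning sketch: lowercase, strip URLs and stray HTML,
    drop unusual symbols, collapse whitespace. (langid filtering and
    spaCy lemmatization, used in the real pipeline, are omitted here
    to keep the sketch dependency-free.)"""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)       # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)            # drop stray HTML tags
    text = re.sub(r"[^a-z0-9\s'!?.,]", " ", text)   # keep basic punctuation
    return re.sub(r"\s+", " ", text).strip()        # normalize whitespace

print(clean_review("BEST pizza!! visit https://spam.example <b>NOW</b>"))
# best pizza!! visit now
```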

3) Metadata Features

  • Numeric: n_chars, n_words, hour, dayofweek, month, user_review_count, user_avg_rating, sent_vader
  • Categorical: geo_cluster
  • Transformed via StandardScaler and OneHotEncoder

4) Feature Fusion

  • Concatenate [DistilBERT embeddings] + [numeric features] + [categorical features] → hybrid feature matrix

5) Model Training

  • Train an IsolationForest anomaly detector (unsupervised)
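
Training reduces to a single unsupervised fit; the sketch below plants a few synthetic outliers in random data (a stand-in for the hybrid matrix) and flips the sign of `score_samples` so that larger means more anomalous:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))   # stand-in for the hybrid feature matrix
X[:10] += 6.0                   # plant a few obvious outliers

# Hyperparameters here are illustrative, not tuned values.
iso = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
iso.fit(X)
scores = -iso.score_samples(X)  # flip sign so higher = more anomalous
print(scores[:10].mean() > scores[10:].mean())  # True: outliers score higher
```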

6) Scoring & Thresholding

  • If labels exist: tune the threshold for maximum F1-score
  • If not: flag the top-k% most anomalous reviews
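
Both branches fit in one hypothetical helper (`pick_threshold` is an illustrative name, not project code): with labels, sweep the precision–recall curve and return the cutoff maximizing F1; without labels, fall back to a percentile cutoff:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(scores, labels=None, top_pct=5.0):
    """With labels: threshold that maximizes F1 over the PR curve.
    Without labels: flag the top-`top_pct`% most anomalous scores."""
    if labels is not None:
        prec, rec, thr = precision_recall_curve(labels, scores)
        f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
        return thr[np.argmax(f1[:-1])]  # final PR point has no threshold
    return np.percentile(scores, 100 - top_pct)

scores = np.array([0.1, 0.2, 0.9, 0.95])
labels = np.array([0, 0, 1, 1])
t = pick_threshold(scores, labels)
print((scores >= t).astype(int))  # [0 0 1 1]
```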

7) Evaluation & Output

  • Metrics: Accuracy, Precision, Recall, F1-score, Confusion Matrix
  • Output flagged reviews for moderation and downstream filtering
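
Once flags and labels line up, the evaluation step is standard scikit-learn; the labels below are toy values for illustration:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = np.array([0, 0, 1, 1, 0, 1])  # toy policy labels (1 = violation)
y_pred = np.array([0, 1, 1, 1, 0, 0])  # toy flags from the detector

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
```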

Features & Functionality

  • Language Detection: filter out non-target languages (langid)
  • Text Preprocessing: remove noise, normalize tokens, lemmatize with spaCy and regex
  • Semantic Embeddings: DistilBERT generates meaningful vector representations
  • Unsupervised Anomaly Detection: IsolationForest flags anomalous reviews at scale
  • Evaluation Metrics: scikit-learn classification metrics for reliability
  • Visual Insights: seaborn and matplotlib plots of data distribution, anomaly scores, and performance
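
The semantic-embedding step above can be sketched as follows (downloads the standard `distilbert-base-uncased` checkpoint on first use; batch size and max length are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()  # frozen encoder: no fine-tuning

@torch.no_grad()
def embed(texts, max_length=128):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    out = model(**batch)
    # Position 0 of the last hidden state is the [CLS] token's vector.
    return out.last_hidden_state[:, 0, :].numpy()

vecs = embed(["Great food, friendly staff!", "BUY FOLLOWERS NOW!!!"])
print(vecs.shape)  # (2, 768)
```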

Development Tools

  • Google Colab for prototyping and model training
  • Jupyter Notebook for exploratory analysis
  • VSCode for code cleanup and integration testing

APIs and Extensions

  • Current version: no external APIs required

Future integrations:
  • Google Maps / Places API for live review ingestion
  • OpenAI GPT-4o for semantic enrichment and contextual evaluation

Libraries & Frameworks

  • Machine Learning & NLP: scikit-learn, spaCy, regex, langid
  • Deep Learning: Hugging Face Transformers, PyTorch
  • Data Handling: pandas, numpy
  • Visualization: seaborn, matplotlib

Assets & Datasets

  • Google Local Reviews Dataset (JSON/Gzip) processed in Colab
  • Manually labeled data samples for ground-truth validation
  • Colab Drive storage for dataset management

Impact

By combining linguistic analysis and machine learning, our system provides platforms with a scalable and reliable way to filter review quality. This strengthens user trust and ensures that only the most relevant and authentic reviews surface for decision-making.
