Problem Statement
Location-based platforms such as Google Maps, Yelp, and TikTok Places face a growing problem: an overwhelming number of spammy, irrelevant, and low-quality reviews. Users depend on these reviews for decisions about dining, shopping, and travel, but misleading or off-topic content erodes trust. Our challenge: automatically filter unreliable reviews while surfacing those that are authentic, concise, and useful.
Solution Overview
We built an unsupervised anomaly-detection pipeline that:
- Processes raw location reviews with text cleaning and feature preparation (numeric and categorical metadata).
- Generates semantic embeddings from review text using a frozen DistilBERT encoder.
- Combines embeddings with metadata (scaled numeric features and one-hot categorical features) into a hybrid feature space.
- Trains an IsolationForest model to detect anomalous reviews (advertisements, irrelevant content, rants without visits).
- Evaluates using LLM-derived policy labels (if available), tuning thresholds to maximize F1-score, precision, and recall for reliable anomaly detection.
Solution Pipeline
1) Input Data
- Raw Google location reviews from McAuley Labs
2) Text Preprocessing
- Cleaning, token normalization, lemmatization (spaCy, regex, langid for language detection)
- Embedding with DistilBERT (CLS token, no fine-tuning)
3) Metadata Features
- Numeric: n_chars, n_words, hour, dayofweek, month, user_review_count, user_avg_rating, sent_vader
- Categorical: geo_cluster
- Transformed via StandardScaler and OneHotEncoder
4) Feature Fusion
- Concatenate [BERT embeddings] + [numeric features] + [categorical features] → hybrid feature matrix
5) Model Training
- Train an IsolationForest anomaly detector (unsupervised)
6) Scoring & Thresholding
- If labels exist: tune threshold for maximum F1-score
- If not: flag top-k% most anomalous reviews
7) Evaluation & Output
- Metrics: Accuracy, Precision, Recall, F1-score, Confusion Matrix
- Output flagged reviews for moderation and downstream filtering
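The scoring-and-thresholding step can be sketched as a simple sweep: with labels, pick the anomaly-score cut that maximizes F1; without labels, fall back to a top-k% rule. The function names here are illustrative, not from the project code.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(scores, labels, grid=200):
    # Sweep candidate cuts over the score range; keep the best-F1 threshold.
    best_t, best_f1 = None, -1.0
    for t in np.linspace(scores.min(), scores.max(), grid):
        f1 = f1_score(labels, (scores >= t).astype(int), zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

def top_k_flags(scores, k=0.05):
    # No labels available: flag the top-k% most anomalous scores.
    cut = np.quantile(scores, 1 - k)
    return scores >= cut
```

A coarse grid is usually enough here, since F1 only changes at the distinct score values between the two classes.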
Features & Functionality
- Language Detection: filter out non-target languages (langid)
- Text Preprocessing: remove noise, normalize tokens, lemmatize with spaCy and regex
- Semantic Embeddings: DistilBERT generates meaningful vector representations
- Unsupervised Anomaly Detection: IsolationForest flags anomalous reviews at scale
- Evaluation Metrics: scikit-learn classification metrics for reliability
- Visual Insights: seaborn and matplotlib plots of data distribution, anomaly scores, and performance
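The cleaning step can be sketched with plain regex. This dependency-free version omits the langid language filter and the spaCy lemmatization the full pipeline applies; the rules shown are illustrative.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def clean_review(text: str) -> str:
    # Lowercase, strip URLs and non-alphanumeric noise, collapse whitespace.
    text = URL_RE.sub(" ", text.lower())
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_review("Great FOOD!! visit https://spam.example NOW")` yields `"great food visit now"`.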
Development Tools
- Google Colab for prototyping and model training
- Jupyter Notebook for exploratory analysis
- VSCode for code cleanup and integration testing
APIs and Extensions
- Current version: no external APIs required
Future integrations:
- Google Maps / Places API for live review ingestion
- OpenAI GPT-4o for semantic enrichment and contextual evaluation
Libraries & Frameworks
- Machine Learning & NLP: scikit-learn, spaCy, regex, langid
- Deep Learning: Hugging Face Transformers, PyTorch
- Data Handling: pandas, numpy
- Visualization: seaborn, matplotlib
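With Transformers and PyTorch, the frozen-encoder CLS embeddings can be produced as below. The checkpoint name `distilbert-base-uncased` is our assumption (the write-up only says DistilBERT), and running this downloads the pretrained weights.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the encoder stays frozen (eval mode, no gradients).
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased").eval()

@torch.no_grad()
def embed(texts):
    batch = tok(texts, padding=True, truncation=True, max_length=128,
                return_tensors="pt")
    # First-token ([CLS]) hidden state as the review embedding.
    return model(**batch).last_hidden_state[:, 0, :]

vecs = embed(["great tacos, friendly staff", "CLICK HERE to win $$$"])
```

Each review maps to a 768-dimensional vector, which is then concatenated with the scaled metadata features.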
Assets & Datasets
- Google Local Reviews Dataset (JSON/Gzip) processed in Colab
- Manually labeled data samples for ground-truth validation
- Colab Drive storage for dataset management
Impact
By combining linguistic analysis and machine learning, our system gives platforms a scalable, reliable way to filter reviews for quality. This strengthens user trust and ensures that only the most relevant and authentic reviews surface for decision-making.