Inspiration
As university students, we often rely on location-based reviews for everyday decisions: where to eat, which services are reliable, and which places to explore when travelling. However, we quickly noticed that a sizable portion of reviews were spam, inauthentic, irrelevant, or written by people who had never visited or received service from the place.
This inspired us to build Verefy, a machine-learning-based system that improves the reliability of online reviews by automatically detecting and filtering out low-quality or irrelevant content. Our goal is to help users like ourselves make better decisions from verified reviews and to give businesses an honest representation. Ultimately, we aim to reduce manual review effort and enhance credibility.
What it does
Verefy is an ML-powered pipeline designed to evaluate the quality and relevancy of location-based reviews, primarily Google Maps reviews. Our system performs three main tasks:
- Assess Relevancy: Check if the review is genuinely about the location.
- Gauge Review Quality: Detect spam, advertisement, irrelevant content, and rants from non-visitors.
- Enforce Policies: Apply decision rules to automatically flag or down-rank low-quality reviews, while surfacing credible ones.
How we built it
Verefy is designed as a modular pipeline:
- Data Collection: Sampled ~100k datapoints from UCSD’s Alaska Google Reviews dataset (metadata + text).
- Feature Engineering: Derived explainable features — text length, emoji/URL presence, sentiment cues, timing stats, and place/user aggregates.
- Rules Baseline (Silver Labels): Applied heuristics (URLs, phone numbers, repetition, off-topic patterns) to auto-label “silver” data.
- Gold Labeling (Manual): Built a sampled pack for human annotation, using stratified and deduplicated subsets.
- Silver Labelling via LLMs: Used lightweight LLMs (Gemma) to refine silver scores on sampled subsets.
- Silver-Label Evaluation: Generated classification reports, confusion matrices, and metrics such as precision, recall, and F1-score to assess label quality.
- Dataset Split: Performed group-aware stratified splits (train/val/test) to ensure fair evaluation across places/users.
- Model Training: Trained multiple LightGBM classifiers/regressors on silver labels and engineered features to predict:
  - Ads/Promo
  - Spam/Low-quality
  - Irrelevant
  - Rant/No-visit
  - Relevancy Score
  - Visit Likelihood
- Model Evaluation: Generated precision/recall/F1, ROC-AUC, PR-AUC, calibration plots, and markdown reports.
- Policy Enforcement: Combined predictions + thresholds into final decisions: hide spam, de-rank ads, flag rants, and mark high-likelihood relevant reviews as “genuine”.
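To make the rules-baseline step concrete, here is a minimal sketch of regex-based heuristics that assign "silver" flags to a single review. The patterns, thresholds, and the `silver_label` function are illustrative assumptions, not our exact production rules:

```python
import re

# Hypothetical patterns for URL and phone-number detection (illustrative only).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s\-()]{7,}\d")

def silver_label(text: str) -> dict:
    """Return heuristic 'silver' flags for one review."""
    words = text.lower().split()
    # A review dominated by a few repeated tokens is likely spam.
    repetitive = len(words) >= 5 and len(set(words)) / len(words) < 0.3
    has_url = bool(URL_RE.search(text))
    return {
        "has_url": has_url,
        "has_phone": bool(PHONE_RE.search(text)),
        "repetitive": repetitive,
        # Promotional language plus a link suggests an advertisement.
        "ad_like": has_url and "discount" in text.lower(),
    }
```

In the full pipeline, flags like these would be aggregated into per-task silver labels before any model training.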
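The group-aware split can be sketched with a few lines of standard-library Python: all reviews sharing one group key (e.g. a place ID) land on the same side of the split, so no place leaks between train and test. The function name and record layout are assumptions for illustration:

```python
import random

def group_split(records, group_key, test_frac=0.2, seed=42):
    """Split records so every row of one group lands on the same side.

    This prevents leakage: the model never sees test-set places during training.
    """
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test
```

In practice a library utility such as scikit-learn's `GroupShuffleSplit` does the same job with extra options.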
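The core evaluation metrics we report (precision, recall, F1) reduce to counts of true/false positives and negatives. A small self-contained helper, written here from the standard definitions rather than taken from our codebase:

```python
def prf1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For rare labels like "irrelevant" or "ads", precision/recall per class is far more informative than overall accuracy, which is why we leaned on these metrics.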
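The policy-enforcement step above maps model scores to a final action. A minimal sketch, assuming a score dict per review and illustrative threshold values (our real thresholds live in YAML config):

```python
def enforce_policy(scores, spam_hide=0.8, ad_derank=0.6, relevant_ok=0.7):
    """Map per-review model scores to one decision.

    Precedence: hide spam first, then de-rank ads, then flag rants,
    then promote clearly relevant reviews from likely visitors.
    Thresholds here are illustrative placeholders.
    """
    if scores["spam"] >= spam_hide:
        return "hide"
    if scores["ad"] >= ad_derank:
        return "derank"
    if scores["rant_no_visit"] >= 0.5:
        return "flag"
    if (scores["relevancy"] >= relevant_ok
            and scores["visit_likelihood"] >= relevant_ok):
        return "genuine"
    return "neutral"
```

Keeping the decision logic as an explicit, ordered rule list makes the final behaviour auditable, which matters more here than squeezing out extra model accuracy.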
Challenges we ran into
- No true ground truth: due to time constraints, we had to rely on feature-engineered heuristics and lightweight LLM outputs as “silver” truth.
- Data imbalance: certain labels like “irrelevant” or “ads” were very rare, which made training difficult.
- Limited gold labels: manual annotation was too time-consuming within the hackathon, so we only built a pipeline to sample stratified packs for future gold annotation.
Accomplishments that we're proud of
- Teamwork and Collaboration: Despite working with a new combination of people, we were able to exchange our opinions and work together effectively.
- Reproducible pipelines: every stage (data → silver labels → splits → training → evaluation → enforcement) is modular, config-driven (YAML), and ready for extension.
- Classification without ground truth: we managed to bootstrap training entirely from silver labels, combining LGBM classifiers with LLM-based labelling, which required creative problem-solving.
- Pipeline thinking: our approach follows good coding practices, ensuring the system can be developed further by others.
What we learned
- Data quality matters most: having clean, engineered features plus reliable labelling is more impactful than complex models.
- Bootstrapping ground truth is possible: even without gold labels, silver labels from heuristics + LLMs can power a functioning system.
- Modularity is key: building modular, reproducible pipelines let us test, debug, and showcase progress quickly.
What's next for Verefy
- Better ground truth generation: experiment with more powerful LLMs and systematically evaluate the silver labels against curated gold annotations.
- Model diversity & tuning: expand beyond LightGBM (e.g., XGBoost, Random Forests, transformer-based classifiers) and perform proper hyperparameter tuning.
- Stronger evaluation: improve evaluation metrics, especially around calibration, imbalanced classes, and policy alignment.
- Refined policy enforcement: move from vague thresholds to clear, interpretable business rules, leveraging our existing enforcement pipeline.