PropComp AI

A machine-learning property-recommendation system that identifies and explains the best comparable properties for real estate appraisals. Built during the first sprint of the Headstarter SWE Residency.

Project Overview

This system processes property appraisal data through several stages:

  1. Data cleaning and standardization
  2. Address geocoding and location-based features
  3. Feature engineering
  4. Training a ranking model
  5. Generating human-readable explanations for recommendations
  6. Interactive feedback collection and model retraining

📺 Watch Demo


📊 Model Performance

The system uses an XGBoost ranking model with a pairwise ranking objective to identify comparable properties. Here's the performance on held-out test data:

Evaluation Metrics

  • Metric: Top-3 Precision (per appraisal)
  • Total Top-3 Predictions: 264
  • Correctly Identified Comps: 254
  • False Positives: 10
  • Top-3 Precision: 96.21%
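As a quick sanity check, the reported precision follows directly from the counts above:

```python
# Top-3 precision = correctly identified comps / total top-3 predictions
correct, total = 254, 264
precision = correct / total
print(f"Top-3 Precision: {precision:.2%}")  # → Top-3 Precision: 96.21%
```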

False Positive Analysis

The model incorrectly identified these properties as top-3 comparables:

  • 177 Bramco Lane
  • 242 Coville Circle NE
  • 240 Nelson Street
  • 871 Crestwood Ave
  • 239 Kinniburgh Loop
  • 16 James St
  • 36 Hidden Spring Place NW
  • 6555 Third Line
  • 161 Everoak Circle SW

This high precision indicates that when the model suggests a property as comparable, it is very likely to be a good match for the subject property.

🎯 Project Milestones

This project successfully fulfills all three core requirements set by Automax.ai:

✅ Statistical Modeling

  • XGBoost ranking model with pairwise learning
  • Engineered statistical features (GLA, lot size, bath score differences)
  • Distance-based features and supervised learning

✅ Explainability

  • SHAP values for feature importance
  • GPT-3.5 integration for natural language explanations
  • Human-readable justifications for each recommendation

✅ Self-Improving System

  • Interactive Streamlit UI for user feedback
  • Feedback storage and processing pipeline
  • Automatic model retraining with updated data
  • Real-time explanation regeneration
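The feedback loop above boils down to appending feedback rows to the training table before retraining. A minimal sketch of that step (file names follow the project layout; the actual logic lives in frontend/app.py):

```python
import pandas as pd

def incorporate_feedback(training_csv, feedback_rows, out_csv):
    """Append user-feedback rows (same columns as training_data.csv)
    and write the dataset used for retraining."""
    df = pd.read_csv(training_csv)
    updated = pd.concat([df, pd.DataFrame(feedback_rows)], ignore_index=True)
    updated.to_csv(out_csv, index=False)
    return updated
```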

📁 Project Structure

Directory Layout

📁 property-recommendation-system/
│
├── 📁 data/
│   ├── raw/              # Original appraisal data
│   ├── cleaned/          # Standardized data
│   ├── engineered/       # Feature-enhanced data
│   ├── training/         # ML-ready data
│   ├── geocoded-data/    # Location coordinates
│   └── README.md         # Data documentation
│
├── 📁 frontend/          # Streamlit web interface
│   ├── app.py           # Main application
│   └── feedback/        # User feedback storage
│
├── 📁 models/            # Trained ML models
├── 📁 scripts/           # Python processing scripts
├── 📁 outputs/           # Generated explanations
│
├── .gitignore
├── requirements.txt
└── README.md

Generated Files & Their Purpose

  • data/raw/appraisals_dataset.json

    • Raw input JSON of appraisals with subject, comps, and candidate properties.
  • data/cleaned/cleaned_appraisals_dataset.json

    • Output of clean_initial_data.py: all fields (age, GLA, lot, rooms, baths) standardized to numeric values.
  • data/geocoded-data/geocoded_addresses.json

    • Maps property addresses to latitude/longitude coordinates
    • Used for distance-based feature engineering
    • Generated by geocode_all_addresses.py
  • data/geocoded-data/missing_addresses.txt

    • List of addresses that failed to geocode
    • Used for tracking and manual review
  • data/engineered/feature_engineered_appraisals_dataset.json

    • Output of features.py: adds engineered fields (diffs, binary flags, location features) to each record.
  • data/training/training_data.csv

    • Output of training_data.py: a flat table where each row is one candidate property, labeled is_comp (1 = true comp, 0 = not), with all signed and absolute feature diffs plus flags.
  • data/training/training_data_with_feedback.csv

    • Enhanced training data incorporating user feedback
    • Used for model retraining after user feedback
    • Maintains same structure as training_data.csv
  • models/xgb_rank_model.json

    • Trained XGBoost pairwise ranking model saved by train_model.py.
  • outputs/top3.csv

    • Raw top 3 comparable properties for each appraisal
    • Contains the model's raw predictions and scores
    • Used as input for generating explanations
  • outputs/top3_gpt_explanations.csv

    • Explains top 3 comps per appraisal using SHAP + GPT
    • Generated by top3_explanations.py
    • Requires OpenAI API key

🛠️ Implementation Details

Features Used

The model uses the following features, most of them computed as differences between the subject and each candidate property:

Physical Characteristics

  • Gross Living Area (GLA)

    • gla_diff: Difference in square footage
    • abs_gla_diff: Absolute difference in square footage
  • Lot Size

    • lot_size_sf_diff: Difference in square feet
    • abs_lot_size_sf_diff: Absolute difference in square feet
  • Age

    • effective_age_diff: Difference in effective age
    • subject_age_diff: Difference in subject age
    • abs_effective_age_diff: Absolute difference in effective age
    • abs_subject_age_diff: Absolute difference in subject age
  • Rooms

    • room_count_diff: Difference in total rooms
    • bedrooms_diff: Difference in number of bedrooms
    • abs_room_count_diff: Absolute difference in total rooms
    • abs_bedrooms_diff: Absolute difference in bedrooms
  • Bathrooms

    • bath_score_diff: Difference in bathroom score
    • full_baths_diff: Difference in full bathrooms
    • half_baths_diff: Difference in half bathrooms
    • abs_bath_score_diff: Absolute difference in bathroom score
    • abs_full_bath_diff: Absolute difference in full bathrooms
    • abs_half_bath_diff: Absolute difference in half bathrooms
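A minimal sketch of how signed and absolute diff pairs like these could be generated (the field names and values here are illustrative; the real logic lives in features.py):

```python
def diff_features(subject, candidate, fields):
    """Signed and absolute subject-minus-candidate differences per field."""
    feats = {}
    for f in fields:
        d = subject[f] - candidate[f]
        feats[f"{f}_diff"] = d
        feats[f"abs_{f}_diff"] = abs(d)
    return feats

subject   = {"gla": 1850, "lot_size_sf": 4200, "bedrooms": 3}
candidate = {"gla": 1720, "lot_size_sf": 4600, "bedrooms": 3}
feats = diff_features(subject, candidate, ["gla", "lot_size_sf", "bedrooms"])
# feats["gla_diff"] == 130, feats["lot_size_sf_diff"] == -400
```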

Location Features

  • distance_km: Straight-line distance between properties in kilometers
  • same_neighborhood: Binary flag (1/0) if properties are within 1km
  • same_city: Binary flag (1/0) if properties are in the same city
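distance_km is a straight-line (great-circle) distance; one common way to compute it from the geocoded coordinates is the haversine formula. A sketch (the coordinates below are made up, and the project's actual implementation may differ):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ≈ 6371 km

distance_km = haversine_km(51.05, -114.07, 51.06, -114.05)
same_neighborhood = 1 if distance_km < 1.0 else 0  # ≈ 1.8 km apart → flag is 0
```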

Property Type & Timing

  • same_property_type: Binary flag (1/0) if property types match
  • sold_recently: Binary flag (1/0) if sold within 90 days of subject's effective date
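These flags can be computed directly once dates are parsed (a sketch; the property types and dates are illustrative):

```python
from datetime import date

def match_flags(subject_type, cand_type, effective_date, sale_date):
    """same_property_type and sold_recently flags (dates as datetime.date)."""
    same_property_type = int(subject_type == cand_type)
    sold_recently = int(abs((effective_date - sale_date).days) <= 90)
    return same_property_type, sold_recently

flags = match_flags("Detached", "Detached", date(2024, 6, 1), date(2024, 4, 15))
# → (1, 1): the types match and the sale is 47 days before the effective date
```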

Data Processing Pipeline

  1. data_pipeline.py (89 lines)

    • Coordinates the entire pipeline
    • Conditionally runs geocoding based on cache status
    • Executes all processing steps in sequence
  2. clean_initial_data.py (364 lines)

    • Standardizes property data
    • Handles missing values
    • Normalizes property features (ages, sizes, rooms)
    • Outputs cleaned data to data/cleaned/
  3. geocode_all_addresses.py (146 lines)

    • Geocodes property addresses using Nominatim
    • Uses GPT for address cleaning when needed
    • Caches results to avoid repeated lookups
    • Outputs to data/geocoded-data/
  4. features.py (519 lines)

    • Adds engineered features
    • Implements property type matching
    • Creates time-based flags
    • Adds location-based features
    • Outputs enhanced data to data/engineered/
    • Modify this file to change the features used in the model
  5. training_data.py (194 lines)

    • Converts data to ML format
    • Creates positive/negative examples
    • Prepares data for model training
    • Outputs to data/training/
  6. train_model.py (142 lines)

    • Trains XGBoost ranking model
    • Implements cross-validation
    • Saves model to models/
    • Modify this file to change model parameters (learning rate, depth, etc.)
  7. top3_explanations.py (267 lines)

    • Generates human-readable explanations
    • Uses SHAP for feature importance
    • Integrates with GPT for natural language explanations
    • Outputs to outputs/
    • Modify this file to change the explanation format and GPT prompts
  8. frontend/app.py (212 lines)

    • Streamlit web interface for property comparisons
    • Displays model predictions and explanations
    • Collects user feedback
    • Triggers model retraining with feedback
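Step 3's caching behavior could be sketched as below; geocode_fn stands in for the real Nominatim/GPT lookup, which this example does not reproduce:

```python
import json
import os

def geocode_with_cache(address, geocode_fn,
                       cache_path="data/geocoded-data/geocoded_addresses.json"):
    """Return cached coordinates, calling geocode_fn only on a cache miss."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if address not in cache:
        cache[address] = geocode_fn(address)  # hypothetical geocoder call
        with open(cache_path, "w") as f:
            json.dump(cache, f, indent=2)
    return cache[address]
```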
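Step 5's labeling scheme, one row per candidate with is_comp set from the appraiser-chosen comps, can be sketched like this (record fields and addresses are illustrative, not the actual schema):

```python
def flatten_appraisal(order_id, comps, candidates):
    """One labeled row per candidate: is_comp=1 if it was a chosen comp."""
    comp_addresses = {c["address"] for c in comps}
    rows = []
    for cand in candidates:
        row = {"orderID": order_id,
               "is_comp": int(cand["address"] in comp_addresses)}
        row.update(cand.get("features", {}))  # engineered diffs and flags
        rows.append(row)
    return rows

rows = flatten_appraisal(
    "4762739",
    comps=[{"address": "123 Example Ave"}],
    candidates=[{"address": "123 Example Ave", "features": {"gla_diff": 130}},
                {"address": "999 Other Rd", "features": {"gla_diff": -410}}],
)
# rows[0]["is_comp"] == 1, rows[1]["is_comp"] == 0
```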

🔑 Key Concepts Learned

1. Text Processing

a) Regular Expressions

Extract structured data from messy strings.

import re
# Extract a 4-digit year from "built in 2023"
match = re.search(r"(\d{4})", "built in 2023")
year = int(match.group(1))  # → 2023

b) Tokenization

Split a cleaned string into meaningful parts.

# Separate number + unit
val    = "1200 sqft"
tokens = val.split()        # → ["1200", "sqft"]

c) Fuzzy String Matching

Map free-text to a small set of known categories.

from fuzzywuzzy import process
CANONICAL_TYPES = ["Townhouse", "Detached", ...]
match, score = process.extractOne("semi detached", CANONICAL_TYPES)
prop_type = match if score >= 80 else None  # accept only confident matches

2. Data Splitting & Grouping

a) Train/Test Split

Hold out data for honest evaluation.

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

b) Group Creation

Tell the ranker which rows belong to each query (appraisal).

# Count candidates per appraisal (orderID)
groups = df_train.groupby("orderID").size().to_list()

3. XGBoost Ranking

a) DMatrix

XGBoost's optimized input format (features + labels + groups).

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtrain.set_group(groups)

b) Pairwise Ranking Objective

Learn to order items by comparing pairs within each group.

params = {
    "objective":  "rank:pairwise",
    "eval_metric":"ndcg",        # Normalized Discounted Cumulative Gain
    "eta":         0.1,          # learning rate
    "max_depth":   6
}
model = xgb.train(params, dtrain, num_boost_round=100)

4. Evaluation Metric: Top-K Precision

"Of my top K suggestions, how many are correct?"

def precision_at_k(group, k):
    X      = xgb.DMatrix(group[feature_cols])
    group["score"] = model.predict(X)
    topk   = group.nlargest(k, "score")
    return topk["label"].sum() / k

# Example: average Precision@3 across all test appraisals
overall = df_test.groupby("orderID").apply(lambda g: precision_at_k(g, 3))
print("Avg Precision@3:", overall.mean())

5. SHAP Explanations

Break down each model prediction into feature-level contributions.

import shap

# Wrap the trained model
def model_predict(X_df):
    return model.predict(xgb.DMatrix(X_df))

# Use background data for reference
explainer = shap.Explainer(model_predict, df[feature_cols])

# For a single candidate row:
row_df   = candidate_row[feature_cols].to_frame().T
shap_vals = explainer(row_df).values[0]          # one SHAP value per feature
shap_items = list(zip(feature_cols, shap_vals))
# Positive contributors
pos = [(f, v) for f, v in shap_items if v > 0]
# Negative contributors
neg = [(f, v) for f, v in shap_items if v < 0]

🚀 Getting Started

Prerequisites

  1. Python 3.8 or higher
  2. OpenAI API Key
    1. Sign up or log in at OpenAI Platform.
    2. In your account settings, create a new API key.
    3. In your project root, create a file named .env with the following content:
      OPENAI_API_KEY=your_api_key_here

Installation

  1. Clone the repository:

    git clone https://github.com/MisbahAN/Property-Recommendation-System.git
    cd Property-Recommendation-System
  2. Create and activate a virtual environment:

    # On macOS/Linux
    python -m venv venv
    source venv/bin/activate
    
    # On Windows
    python -m venv venv
    venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

Running the Pipeline

  1. Ensure you have the raw data file in data/raw/appraisals_dataset.json

  2. Run the complete pipeline:

    cd scripts
    python data_pipeline.py
  3. Start the frontend interface:

    cd ../frontend
    streamlit run app.py

The pipeline will:

  1. Clean and standardize the raw data
  2. Geocode addresses (if needed)
  3. Generate features
  4. Train the model
  5. Create explanations

The frontend will be available at http://localhost:8501

Requirements

  • Python 3.8+
  • OpenAI API key (for explanations)
  • Dependencies (see requirements.txt)

Author

Misbah Ahmed Nauman

  • Portfolio: misbahan.com
  • Built during Headstarter SWE Residency Sprint 1
