A machine learning "Property Recommendation" system that identifies and explains the best comparable properties for real estate appraisals. Built during the first sprint of Headstarter SWE Residency.
This system processes property appraisal data through several stages:
- Data cleaning and standardization
- Address geocoding and location-based features
- Feature engineering
- Training a ranking model
- Generating human-readable explanations for recommendations
- Interactive feedback collection and model retraining
The system uses an XGBoost ranking model with pairwise ranking objective to identify comparable properties. Here's the performance on held-out test data:
- Metric: Top-3 Precision (per appraisal)
- Total Top-3 Predictions: 264
- Correctly Identified Comps: 254
- False Positives: 10
- Top-3 Precision: 96.21%
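The headline number is simply correct top-3 picks over total top-3 predictions:

```python
# Top-3 precision = correctly identified comps / total top-3 predictions
correct, total = 254, 264
print(f"Top-3 Precision: {correct / total:.2%}")  # → Top-3 Precision: 96.21%
```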
The model incorrectly identified these properties as top-3 comparables:
- 177 Bramco Lane
- 242 Coville Circle NE
- 240 Nelson Street
- 871 Crestwood Ave
- 239 Kinniburgh Loop
- 16 James St
- 36 Hidden Spring Place NW
- 6555 Third Line
- 161 Everoak Circle SW
This high precision indicates that when the model suggests a property as comparable, it is very likely to be a good match for the subject property.
This project fulfills all three core requirements set by Automax.ai:

1. **Ranking model**
   - XGBoost ranking model with pairwise learning
   - Engineered statistical features (GLA, lot size, bath score differences)
   - Distance-based features and supervised learning
2. **Explainability**
   - SHAP values for feature importance
   - GPT-3.5 integration for natural language explanations
   - Human-readable justifications for each recommendation
3. **Feedback loop**
   - Interactive Streamlit UI for user feedback
   - Feedback storage and processing pipeline
   - Automatic model retraining with updated data
   - Real-time explanation regeneration
```
📁 property-recommendation-system/
│
├── 📁 data/
│   ├── raw/              # Original appraisal data
│   ├── cleaned/          # Standardized data
│   ├── engineered/       # Feature-enhanced data
│   ├── training/         # ML-ready data
│   ├── geocoded-data/    # Location coordinates
│   └── README.md         # Data documentation
│
├── 📁 frontend/          # Streamlit web interface
│   ├── app.py            # Main application
│   └── feedback/         # User feedback storage
│
├── 📁 models/            # Trained ML models
├── 📁 scripts/           # Python processing scripts
├── 📁 outputs/           # Generated explanations
│
├── .gitignore
├── requirements.txt
└── README.md
```
- `data/raw/appraisals_dataset.json`: Raw input JSON of appraisals with subject, comps, and candidate properties.
- `data/cleaned/cleaned_appraisals_dataset.json`: Output of `clean_initial_data.py`: all fields (age, GLA, lot, rooms, baths) standardized to numeric values.
- `data/geocoded-data/geocoded_addresses.json`: Maps property addresses to latitude/longitude coordinates; generated by `geocode_all_addresses.py` and used for distance-based feature engineering.
- `data/geocoded-data/missing_addresses.txt`: Addresses that failed to geocode, kept for tracking and manual review.
- `data/engineered/feature_engineered_appraisals_dataset.json`: Output of `features.py`: adds engineered fields (diffs, binary flags, location features) to each record.
- `data/training/training_data.csv`: Output of `training_data.py`: a flat table where each row is one candidate property, labeled `is_comp` (1 = true comp, 0 = not), with all signed and absolute feature diffs plus flags.
- `data/training/training_data_with_feedback.csv`: Training data enhanced with user feedback; used for model retraining and maintains the same structure as `training_data.csv`.
- `models/xgb_rank_model.json`: Trained XGBoost pairwise ranking model saved by `train_model.py`.
- `outputs/top3.csv`: The model's raw top-3 comparable properties and scores for each appraisal; used as input for generating explanations.
- `outputs/top3_gpt_explanations.csv`: SHAP + GPT explanations of the top 3 comps per appraisal; generated by `top3_explanations.py` (requires an OpenAI API key).
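`training_data_with_feedback.csv` is essentially the base training table with user-corrected labels folded in. A minimal sketch of that merge, assuming hypothetical column names (`orderID`, `address`, `is_comp`) and a latest-label-wins policy (the actual schema lives in the feedback pipeline):

```python
import pandas as pd

# Base training rows (same shape as training_data.csv, simplified)
base = pd.DataFrame({
    "orderID": [1, 1],
    "address": ["16 James St", "240 Nelson Street"],
    "is_comp": [1, 0],
})

# User feedback: the reviewer marked "240 Nelson Street" as a true comp
feedback = pd.DataFrame({
    "orderID": [1],
    "address": ["240 Nelson Street"],
    "is_comp": [1],
})

# Concatenate, then keep only the latest label per (orderID, address) pair
merged = (
    pd.concat([base, feedback], ignore_index=True)
    .drop_duplicates(subset=["orderID", "address"], keep="last")
)
# merged keeps two rows, with the relabeled candidate
```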
The model uses the following features, computed as differences and match flags between the subject and each candidate property:

- **Gross Living Area (GLA)**
  - `gla_diff`: difference in square footage
  - `abs_gla_diff`: absolute difference in square footage
- **Lot Size**
  - `lot_size_sf_diff`: difference in square feet
  - `abs_lot_size_sf_diff`: absolute difference in square feet
- **Age**
  - `effective_age_diff`: difference in effective age
  - `subject_age_diff`: difference in subject age
  - `abs_effective_age_diff`: absolute difference in effective age
  - `abs_subject_age_diff`: absolute difference in subject age
- **Rooms**
  - `room_count_diff`: difference in total rooms
  - `bedrooms_diff`: difference in number of bedrooms
  - `abs_room_count_diff`: absolute difference in total rooms
  - `abs_bedrooms_diff`: absolute difference in bedrooms
- **Bathrooms**
  - `bath_score_diff`: difference in bathroom score
  - `full_baths_diff`: difference in full bathrooms
  - `half_baths_diff`: difference in half bathrooms
  - `abs_bath_score_diff`: absolute difference in bathroom score
  - `abs_full_bath_diff`: absolute difference in full bathrooms
  - `abs_half_bath_diff`: absolute difference in half bathrooms
- **Location and match flags**
  - `distance_km`: straight-line distance between properties in kilometers
  - `same_neighborhood`: binary flag (1/0) if properties are within 1 km
  - `same_city`: binary flag (1/0) if properties are in the same city
  - `same_property_type`: binary flag (1/0) if property types match
  - `sold_recently`: binary flag (1/0) if sold within 90 days of the subject's effective date
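As a rough illustration of how a few of these could be computed for one candidate (field names like `gla` and `effective_date` are assumptions, not the actual schema in `features.py`):

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Straight-line (great-circle) distance in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def engineer(subject, candidate):
    """Compute a sample of the diffs and flags listed above (illustrative fields)."""
    gla_diff = subject["gla"] - candidate["gla"]
    dist = haversine_km(subject["lat"], subject["lon"],
                        candidate["lat"], candidate["lon"])
    days_from_effective = abs((subject["effective_date"] - candidate["sale_date"]).days)
    return {
        "gla_diff": gla_diff,
        "abs_gla_diff": abs(gla_diff),
        "distance_km": dist,
        "same_neighborhood": int(dist <= 1.0),
        "same_property_type": int(subject["type"] == candidate["type"]),
        "sold_recently": int(days_from_effective <= 90),
    }
```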
- `data_pipeline.py` (89 lines): coordinates the entire pipeline
  - Conditionally runs geocoding based on cache status
  - Executes all processing steps in sequence
- `clean_initial_data.py` (364 lines): standardizes property data
  - Handles missing values
  - Normalizes property features (ages, sizes, rooms)
  - Outputs cleaned data to `data/cleaned/`
- `geocode_all_addresses.py` (146 lines): geocodes property addresses using Nominatim
  - Uses GPT for address cleaning when needed
  - Caches results to avoid repeated lookups
  - Outputs to `data/geocoded-data/`
- `features.py` (519 lines): adds engineered features
  - Implements property type matching
  - Creates time-based flags
  - Adds location-based features
  - Outputs enhanced data to `data/engineered/`
  - Modify this file to change the features used in the model
- `training_data.py` (194 lines): converts data to ML format
  - Creates positive/negative examples
  - Prepares data for model training
  - Outputs to `data/training/`
- `train_model.py` (142 lines): trains the XGBoost ranking model
  - Implements cross-validation
  - Saves the model to `models/`
  - Modify this file to change model parameters (learning rate, depth, etc.)
- `top3_explanations.py` (267 lines): generates human-readable explanations
  - Uses SHAP for feature importance
  - Integrates with GPT for natural language explanations
  - Outputs to `outputs/`
  - Modify this file to change the explanation format and GPT prompts
- `frontend/app.py` (212 lines): Streamlit web interface for property comparisons
  - Displays model predictions and explanations
  - Collects user feedback
  - Triggers model retraining with feedback
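The coordination in `data_pipeline.py` can be sketched roughly like this (an illustrative sketch, not the actual implementation; the step order follows the scripts above, and the cache path follows the data layout):

```python
import subprocess
from pathlib import Path

# Processing scripts, in pipeline order
STEPS = [
    "clean_initial_data.py",
    "geocode_all_addresses.py",
    "features.py",
    "training_data.py",
    "train_model.py",
    "top3_explanations.py",
]

def steps_to_run(geocode_cache):
    """Skip geocoding when its cached output already exists on disk."""
    return [
        s for s in STEPS
        if not (s == "geocode_all_addresses.py" and geocode_cache.exists())
    ]

def run_pipeline(geocode_cache=Path("../data/geocoded-data/geocoded_addresses.json")):
    for script in steps_to_run(geocode_cache):
        subprocess.run(["python", script], check=True)  # stop on the first failure
```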
Extract structured data from messy strings.

```python
import re

# Pull a four-digit year out of free text like "built in 2023"
match = re.search(r"(\d{4})", "built in 2023")
year = int(match.group(1))  # → 2023
```

Split a cleaned string into meaningful parts.
```python
# Separate number + unit
val = "1200 sqft"
tokens = val.split()  # → ["1200", "sqft"]
```

Map free-text to a small set of known categories.
```python
from fuzzywuzzy import process

CANONICAL_TYPES = ["Townhouse", "Detached", ...]
match, score = process.extractOne("semi detached", CANONICAL_TYPES)
# If score ≥ 80, accept `match`; otherwise None
```

Hold out data for honest evaluation.
```python
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
```

Tell the ranker which rows belong to each query (appraisal).
```python
# Count candidates per appraisal (orderID)
groups = df_train.groupby("orderID").size().to_list()
```

XGBoost's optimized input format (features + labels + groups).
```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtrain.set_group(groups)
```

Learn to order items by comparing pairs within each group.
```python
params = {
    "objective": "rank:pairwise",
    "eval_metric": "ndcg",  # Normalized Discounted Cumulative Gain
    "eta": 0.1,             # learning rate
    "max_depth": 6,
}
model = xgb.train(params, dtrain, num_boost_round=100)
```

"Of my top K suggestions, how many are correct?"
```python
def precision_at_k(group, k):
    X = xgb.DMatrix(group[feature_cols])
    group["score"] = model.predict(X)
    topk = group.nlargest(k, "score")
    return topk["label"].sum() / k

# Example: average Precision@3 across all test appraisals
overall = df_test.groupby("orderID").apply(lambda g: precision_at_k(g, 3))
print("Avg Precision@3:", overall.mean())
```

Break down each model prediction into feature-level contributions.
```python
import shap

# Wrap the trained model so SHAP can call it on DataFrames
def model_predict(X_df):
    return model.predict(xgb.DMatrix(X_df))

# Use background data for reference
explainer = shap.Explainer(model_predict, df[feature_cols])

# For a single candidate row:
row_df = candidate_row[feature_cols].to_frame().T
shap_vals = explainer(row_df).values[0]  # one SHAP value per feature
shap_items = list(zip(feature_cols, shap_vals))

# Positive contributors
pos = [(f, v) for f, v in shap_items if v > 0]
# Negative contributors
neg = [(f, v) for f, v in shap_items if v < 0]
```

- Python 3.8 or higher
- OpenAI API Key
  - Sign up or log in at the OpenAI Platform.
  - In your account settings, create a new API key.
  - In your project root, create a file named `.env` with the following content:

    ```
    OPENAI_API_KEY=your_api_key_here
    ```
- Clone the repository:

  ```bash
  git clone https://github.com/MisbahAN/Property-Recommendation-System.git
  cd Property-Recommendation-System
  ```

- Create and activate a virtual environment:

  ```bash
  # On macOS/Linux
  python -m venv venv
  source venv/bin/activate

  # On Windows
  python -m venv venv
  venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Ensure the raw data file is in place at `data/raw/appraisals_dataset.json`.

- Run the complete pipeline:

  ```bash
  cd scripts
  python data_pipeline.py
  ```

- Start the frontend interface:

  ```bash
  cd ../frontend
  streamlit run app.py
  ```
The pipeline will:
- Clean and standardize the raw data
- Geocode addresses (if needed)
- Generate features
- Train the model
- Create explanations
The frontend will be available at http://localhost:8501
- Python 3.8+
- OpenAI API key (for explanations)
- Dependencies (see requirements.txt)
Misbah Ahmed Nauman
- Portfolio: misbahan.com
- Built during Headstarter SWE Residency Sprint 1