A machine learning "Property Recommendation" system that identifies and explains the best comparable properties for real estate appraisals. Built during the first sprint of Headstarter SWE Residency.
This system processes property appraisal data through several stages:
- Data cleaning and standardization
- Address geocoding and location-based features
- Feature engineering
- Training a ranking model
- Generating human-readable explanations for recommendations
- Interactive feedback collection and model retraining
The system uses an XGBoost ranking model with pairwise ranking objective to identify comparable properties. Here's the performance on held-out test data:
- Metric: Top-3 Precision (per appraisal)
- Total Top-3 Predictions: 264
- Correctly Identified Comps: 254
- False Positives: 10
- Top-3 Precision: 96.21%
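The headline number is simply correct top-3 picks over total top-3 predictions:

```python
# Top-3 precision = correctly identified comps / total top-3 predictions
correct, total = 254, 264
print(f"Top-3 Precision: {correct / total:.2%}")  # → Top-3 Precision: 96.21%
```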
The model incorrectly identified these properties as top-3 comparables:
- 177 Bramco Lane
- 242 Coville Circle NE
- 240 Nelson Street
- 871 Crestwood Ave
- 239 Kinniburgh Loop
- 16 James St
- 36 Hidden Spring Place NW
- 6555 Third Line
- 161 Everoak Circle SW
This high precision indicates that when the model suggests a property as comparable, it is very likely to be a good match for the subject property.
This project fulfills all three core requirements set by Automax.ai:

1. **Ranking model**
   - XGBoost ranking model with pairwise learning
   - Engineered statistical features (GLA, lot size, bath score differences)
   - Distance-based features and supervised learning
2. **Explainability**
   - SHAP values for feature importance
   - GPT-3.5 integration for natural language explanations
   - Human-readable justifications for each recommendation
3. **Feedback loop**
   - Interactive Streamlit UI for user feedback
   - Feedback storage and processing pipeline
   - Automatic model retraining with updated data
   - Real-time explanation regeneration
```
📁 property-recommendation-system/
│
├── 📁 data/
│   ├── raw/              # Original appraisal data
│   ├── cleaned/          # Standardized data
│   ├── engineered/       # Feature-enhanced data
│   ├── training/         # ML-ready data
│   ├── geocoded-data/    # Location coordinates
│   └── README.md         # Data documentation
│
├── 📁 frontend/          # Streamlit web interface
│   ├── app.py            # Main application
│   └── feedback/         # User feedback storage
│
├── 📁 models/            # Trained ML models
├── 📁 scripts/           # Python processing scripts
├── 📁 outputs/           # Generated explanations
│
├── .gitignore
├── requirements.txt
└── README.md
```
- `data/raw/appraisals_dataset.json`: Raw input JSON of appraisals with subject, comps, and candidate properties.
- `data/cleaned/cleaned_appraisals_dataset.json`: Output of `clean_initial_data.py`: all fields (age, GLA, lot, rooms, baths) standardized to numeric values.
- `data/geocoded-data/geocoded_addresses.json`: Maps property addresses to latitude/longitude coordinates; generated by `geocode_all_addresses.py` and used for distance-based feature engineering.
- `data/geocoded-data/missing_addresses.txt`: Addresses that failed to geocode, kept for tracking and manual review.
- `data/engineered/feature_engineered_appraisals_dataset.json`: Output of `features.py`: adds engineered fields (diffs, binary flags, location features) to each record.
- `data/training/training_data.csv`: Output of `training_data.py`: a flat table where each row is one candidate property, labeled `is_comp` (1 = true comp, 0 = not), with all signed and absolute feature diffs plus flags.
- `data/training/training_data_with_feedback.csv`: Training data enhanced with user feedback; used for model retraining and maintains the same structure as `training_data.csv`.
- `models/xgb_rank_model.json`: Trained XGBoost pairwise ranking model saved by `train_model.py`.
- `outputs/top3.csv`: The model's raw top-3 comparable properties and scores for each appraisal; used as input for generating explanations.
- `outputs/top3_gpt_explanations.csv`: SHAP + GPT explanations of the top 3 comps per appraisal; generated by `top3_explanations.py` (requires an OpenAI API key).
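`training_data_with_feedback.csv` is essentially the base training table with user-corrected labels folded in. A minimal sketch of that merge, assuming hypothetical column names (`orderID`, `address`, `is_comp`) and a latest-label-wins policy (the actual schema lives in the feedback pipeline):

```python
import pandas as pd

# Base training rows (same shape as training_data.csv, simplified)
base = pd.DataFrame({
    "orderID": [1, 1],
    "address": ["16 James St", "240 Nelson Street"],
    "is_comp": [1, 0],
})

# User feedback: the reviewer marked "240 Nelson Street" as a true comp
feedback = pd.DataFrame({
    "orderID": [1],
    "address": ["240 Nelson Street"],
    "is_comp": [1],
})

# Concatenate, then keep only the latest label per (orderID, address) pair
merged = (
    pd.concat([base, feedback], ignore_index=True)
    .drop_duplicates(subset=["orderID", "address"], keep="last")
)
# merged keeps two rows, with the relabeled candidate
```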
The model uses the following features, computed as differences and match flags between the subject and each candidate property:

- **Gross Living Area (GLA)**
  - `gla_diff`: difference in square footage
  - `abs_gla_diff`: absolute difference in square footage
- **Lot Size**
  - `lot_size_sf_diff`: difference in square feet
  - `abs_lot_size_sf_diff`: absolute difference in square feet
- **Age**
  - `effective_age_diff`: difference in effective age
  - `subject_age_diff`: difference in subject age
  - `abs_effective_age_diff`: absolute difference in effective age
  - `abs_subject_age_diff`: absolute difference in subject age
- **Rooms**
  - `room_count_diff`: difference in total rooms
  - `bedrooms_diff`: difference in number of bedrooms
  - `abs_room_count_diff`: absolute difference in total rooms
  - `abs_bedrooms_diff`: absolute difference in bedrooms
- **Bathrooms**
  - `bath_score_diff`: difference in bathroom score
  - `full_baths_diff`: difference in full bathrooms
  - `half_baths_diff`: difference in half bathrooms
  - `abs_bath_score_diff`: absolute difference in bathroom score
  - `abs_full_bath_diff`: absolute difference in full bathrooms
  - `abs_half_bath_diff`: absolute difference in half bathrooms
- **Location and match flags**
  - `distance_km`: straight-line distance between properties in kilometers
  - `same_neighborhood`: binary flag (1/0) if properties are within 1 km
  - `same_city`: binary flag (1/0) if properties are in the same city
  - `same_property_type`: binary flag (1/0) if property types match
  - `sold_recently`: binary flag (1/0) if sold within 90 days of the subject's effective date
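As a rough illustration of how a few of these could be computed for one candidate (field names like `gla` and `effective_date` are assumptions, not the actual schema in `features.py`):

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Straight-line (great-circle) distance in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def engineer(subject, candidate):
    """Compute a sample of the diffs and flags listed above (illustrative fields)."""
    gla_diff = subject["gla"] - candidate["gla"]
    dist = haversine_km(subject["lat"], subject["lon"],
                        candidate["lat"], candidate["lon"])
    days_from_effective = abs((subject["effective_date"] - candidate["sale_date"]).days)
    return {
        "gla_diff": gla_diff,
        "abs_gla_diff": abs(gla_diff),
        "distance_km": dist,
        "same_neighborhood": int(dist <= 1.0),
        "same_property_type": int(subject["type"] == candidate["type"]),
        "sold_recently": int(days_from_effective <= 90),
    }
```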
- `data_pipeline.py` (89 lines): coordinates the entire pipeline
  - Conditionally runs geocoding based on cache status
  - Executes all processing steps in sequence
- `clean_initial_data.py` (364 lines): standardizes property data
  - Handles missing values
  - Normalizes property features (ages, sizes, rooms)
  - Outputs cleaned data to `data/cleaned/`
- `geocode_all_addresses.py` (146 lines): geocodes property addresses using Nominatim
  - Uses GPT for address cleaning when needed
  - Caches results to avoid repeated lookups
  - Outputs to `data/geocoded-data/`
- `features.py` (519 lines): adds engineered features
  - Implements property type matching
  - Creates time-based flags
  - Adds location-based features
  - Outputs enhanced data to `data/engineered/`
  - Modify this file to change the features used in the model
- `training_data.py` (194 lines): converts data to ML format
  - Creates positive/negative examples
  - Prepares data for model training
  - Outputs to `data/training/`
- `train_model.py` (142 lines): trains the XGBoost ranking model
  - Implements cross-validation
  - Saves the model to `models/`
  - Modify this file to change model parameters (learning rate, depth, etc.)
- `top3_explanations.py` (267 lines): generates human-readable explanations
  - Uses SHAP for feature importance
  - Integrates with GPT for natural language explanations
  - Outputs to `outputs/`
  - Modify this file to change the explanation format and GPT prompts
- `frontend/app.py` (212 lines): Streamlit web interface for property comparisons
  - Displays model predictions and explanations
  - Collects user feedback
  - Triggers model retraining with feedback
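The coordination in `data_pipeline.py` can be sketched roughly like this (an illustrative sketch, not the actual implementation; the step order follows the scripts above, and the cache path follows the data layout):

```python
import subprocess
from pathlib import Path

# Processing scripts, in pipeline order
STEPS = [
    "clean_initial_data.py",
    "geocode_all_addresses.py",
    "features.py",
    "training_data.py",
    "train_model.py",
    "top3_explanations.py",
]

def steps_to_run(geocode_cache):
    """Skip geocoding when its cached output already exists on disk."""
    return [
        s for s in STEPS
        if not (s == "geocode_all_addresses.py" and geocode_cache.exists())
    ]

def run_pipeline(geocode_cache=Path("../data/geocoded-data/geocoded_addresses.json")):
    for script in steps_to_run(geocode_cache):
        subprocess.run(["python", script], check=True)  # stop on the first failure
```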
Extract structured data from messy strings.

```python
import re

# Pull a four-digit year out of free text like "built in 2023"
match = re.search(r"(\d{4})", "built in 2023")
year = int(match.group(1))  # → 2023
```

Split a cleaned string into meaningful parts.
```python
# Separate number + unit
val = "1200 sqft"
tokens = val.split()  # → ["1200", "sqft"]
```

Map free-text to a small set of known categories.
```python
from fuzzywuzzy import process

CANONICAL_TYPES = ["Townhouse", "Detached", ...]
match, score = process.extractOne("semi detached", CANONICAL_TYPES)
# If score ≥ 80, accept `match`; otherwise None
```

Hold out data for honest evaluation.
```python
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
```

Tell the ranker which rows belong to each query (appraisal).
```python
# Count candidates per appraisal (orderID)
groups = df_train.groupby("orderID").size().to_list()
```

XGBoost's optimized input format (features + labels + groups).
```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtrain.set_group(groups)
```

Learn to order items by comparing pairs within each group.
```python
params = {
    "objective": "rank:pairwise",
    "eval_metric": "ndcg",  # Normalized Discounted Cumulative Gain
    "eta": 0.1,             # learning rate
    "max_depth": 6,
}
model = xgb.train(params, dtrain, num_boost_round=100)
```

"Of my top K suggestions, how many are correct?"
```python
def precision_at_k(group, k):
    X = xgb.DMatrix(group[feature_cols])
    group["score"] = model.predict(X)
    topk = group.nlargest(k, "score")
    return topk["label"].sum() / k

# Example: average Precision@3 across all test appraisals
overall = df_test.groupby("orderID").apply(lambda g: precision_at_k(g, 3))
print("Avg Precision@3:", overall.mean())
```

Break down each model prediction into feature-level contributions.
```python
import shap

# Wrap the trained model so SHAP can call it on DataFrames
def model_predict(X_df):
    return model.predict(xgb.DMatrix(X_df))

# Use background data for reference
explainer = shap.Explainer(model_predict, df[feature_cols])

# For a single candidate row:
row_df = candidate_row[feature_cols].to_frame().T
shap_vals = explainer(row_df).values[0]  # one SHAP value per feature
shap_items = list(zip(feature_cols, shap_vals))

# Positive contributors
pos = [(f, v) for f, v in shap_items if v > 0]
# Negative contributors
neg = [(f, v) for f, v in shap_items if v < 0]
```

- Python 3.8 or higher
- OpenAI API Key
  - Sign up or log in at the OpenAI Platform.
  - In your account settings, create a new API key.
  - In your project root, create a file named `.env` with the following content:

    ```
    OPENAI_API_KEY=your_api_key_here
    ```
- Clone the repository:

  ```bash
  git clone https://github.com/MisbahAN/Property-Recommendation-System.git
  cd Property-Recommendation-System
  ```

- Create and activate a virtual environment:

  ```bash
  # On macOS/Linux
  python -m venv venv
  source venv/bin/activate

  # On Windows
  python -m venv venv
  venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Ensure the raw data file is in place at `data/raw/appraisals_dataset.json`.

- Run the complete pipeline:

  ```bash
  cd scripts
  python data_pipeline.py
  ```

- Start the frontend interface:

  ```bash
  cd ../frontend
  streamlit run app.py
  ```
The pipeline will:
- Clean and standardize the raw data
- Geocode addresses (if needed)
- Generate features
- Train the model
- Create explanations
The frontend will be available at http://localhost:8501
- Python 3.8+
- OpenAI API key (for explanations)
- Dependencies (see requirements.txt)
Misbah Ahmed Nauman
- Portfolio: misbahan.com
- Built during Headstarter SWE Residency Sprint 1