Classification into Categories - by bedrooms

Impact

Real estate platforms can enhance user experience by integrating this model to better understand their users preferences and constraints. Furthermore, governments can increase access to affordable houses by studying the demand and supply of specific areas. This way helping people have homes which in turn helps the government in many ways.

I've built a classification model to classify data into one of the set categories, which is the number of bedrooms. Given other features like price, bathrooms, acre lot size, house size, city, and state. Essentially, the model can predict what number of bedrooms will you be able to get given these criteria aka features.

Visualization and Pre-processing

Firstly, I removed all the rows with null values from our dataset. Then I encoded the city and state columns as they were strings. Now I checked the correlation between average price and number of bedrooms. However, we found a lot of outliers as can be seen below.

After further zooming into our data by only choosing to see data from 0 to 15 bedrooms, we get the following. From here it is clear that the data starts to go off trend after 7 bedrooms thus we remove all data after that.

Next, we check the correlation between bedrooms and bathrooms without filtering, we get a scatter plot as seen below. However, we have too much noisy data.

After zooming in to only 0 to 7 bedrooms, we get the average number of bathrooms for each "number of bedrooms". And we get 5, I thought 5 would be a good checkpoint and to remove everything after, but the line plot we will see next to say otherwise.

Here we make a line plot of bathrooms and their average prices, this way we can truly see whether the noisy data is after 5 or some other checkpoint. And as expected there will be a lot of outliers and we must zoom in further to see in detail.

After zooming in to 0 to 15 bathrooms, we get the following. After noticing the trend here, we can easily say that the real outliers are the one that break the trend aka above 11. These rows are removed from our dataset.

Lastly, I removed all the duplicate rows that would play a hand in giving wrong predictions. Moreover, I shuffled the dataset to eliminate bias.

Training and Testing Classification Models

Starting with the Gaussian Naive Bayes, after the pre-processing steps I trained and tested this model and it got an accuracy score of 47.8%. This model is known for its categorization abilities. It is always better to check other models to ensure we get the best result.

So I checked the Random Forest Classifier, and after training and testing, I got an accuracy of 64.3%. Much better than the GaussianNB.

Just to make sure I'm not missing out on a better opportunity I tested the Decision Tree Classifier too. But, it got an accuracy score of 53.6% which is, although higher than the GaussianNB, still not better than the Random Forest.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
vis_class		vis_class
README.md		README.md
class_predict.ipynb		class_predict.ipynb
real estate dataset		real estate dataset

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Classification into Categories - by bedrooms

Impact

Visualization and Pre-processing

Training and Testing Classification Models

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Classification into Categories - by bedrooms

Impact

Visualization and Pre-processing

Training and Testing Classification Models

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages