🏠 Melbourne Housing Price Predictor
1. Project Overview
Goal: Gather data on the Melbourne housing market, select a model that represents it well, and tune it to predict prices as accurately as possible.
Data Source
https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market
2. Data Engineering & Cleaning
To stabilize the model, several critical data fixes were performed:
Handling Missing Values
- **Dropped rows** where `Price` or `Regionname` were null; filling them with median values was not feasible, and the `Price` column was missing about 22% of its data.
- **Imputation:** Used `SimpleImputer(strategy="median")` for numerical features like `BuildingArea` and `Car` to ensure all data is present for the model.
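The two steps above can be sketched as follows. This is a minimal illustration on a hypothetical mini-frame standing in for the Kaggle dataset; the column names match the write-up, but the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical mini-frame standing in for the Kaggle dataset.
df = pd.DataFrame({
    "Price": [850_000, np.nan, 1_200_000, 640_000],
    "Regionname": ["Northern Metropolitan", "Western Metropolitan",
                   None, "Southern Metropolitan"],
    "BuildingArea": [120.0, np.nan, 210.0, np.nan],
    "Car": [1.0, 2.0, np.nan, 1.0],
})

# Drop rows where the target or the region label is missing.
df = df.dropna(subset=["Price", "Regionname"])

# Median-impute the remaining numeric gaps.
imputer = SimpleImputer(strategy="median")
df[["BuildingArea", "Car"]] = imputer.fit_transform(df[["BuildingArea", "Car"]])
```

Dropping on `Price` first matters: imputing the target itself would teach the model invented labels, whereas imputing secondary features like `Car` only smooths over gaps.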
Outlier Management
The raw data contained luxury "mansions" priced up to $11M and very cheap listings that skewed the model's logic. To reduce this, I removed everything above the 99th price percentile and also set a $250K floor to drop the very cheap options.
- Result: Reduced the "Honest Error" (cross-validation RMSE) significantly by narrowing the scope to the general market. It also greatly improved the data visualization: the $11M mansions compressed everything else into the same end of the colourmap, making it very difficult to read anything but the most expensive housing patterns from the plot.
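The outlier filter can be sketched like this; the price values are made up for illustration, while the 99th-percentile cap and the $250K floor follow the write-up.

```python
import pandas as pd

# Illustrative price series; the real frame comes from the Kaggle CSV.
df = pd.DataFrame({"Price": [260_000, 480_000, 750_000,
                             1_100_000, 11_000_000, 180_000]})

upper = df["Price"].quantile(0.99)   # 99th-percentile cap
lower = 250_000                      # fixed floor for very cheap listings
df = df[(df["Price"] >= lower) & (df["Price"] <= upper)]
```

Note that `quantile(0.99)` is computed on the data itself, so the cap adapts if the dataset is refreshed, while the floor stays a hard-coded business rule.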
Feature Engineering
- House Age Calculation: Derived `Age` from `YearBuilt` (2026 - YearBuilt) in hopes it would improve prediction accuracy, which it did, marginally.
- Location Density: Replaced the broad `Regionname` with the more specific `Suburb` data to capture more localized prices. The `Suburb` category is more fine-grained than `Regionname`, so my reasoning was that it would give more accurate price predictions.
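The age derivation is a one-liner; a minimal sketch with invented `YearBuilt` values:

```python
import pandas as pd

# Hypothetical YearBuilt values (floats, as in the raw Kaggle data).
df = pd.DataFrame({"YearBuilt": [1970.0, 2005.0, 1995.0]})

# Derive Age relative to 2026, as described above.
df["Age"] = 2026 - df["YearBuilt"]
```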
3. The ML Pipeline
The preprocessing was automated using a Scikit-Learn ColumnTransformer to ensure consistency between training and testing.
| Transformer | Columns | Description |
|---|---|---|
| Numerical | Rooms, Distance, Landsize, Age... | Median Imputation + Standard Scaling |
| Categorical | Suburb, Type | One-Hot Encoding |
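The table above translates into roughly the following `ColumnTransformer`. This is a sketch under the assumption that the numerical list ends with `Age` (the "..." in the table elides further columns), and `handle_unknown="ignore"` is my addition so unseen suburbs at predict time don't crash the encoder.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_cols = ["Rooms", "Distance", "Landsize", "Age"]
cat_cols = ["Suburb", "Type"]

# Numerical branch: median imputation followed by standard scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: one-hot encoding, tolerant of unseen categories.
preprocessor = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
```

Fitting this once inside a single pipeline guarantees the exact same medians, scales, and category sets are applied to the test split as to the training split.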
4. Model Configuration
Algorithm: RandomForestRegressor
To prevent the model from "memorizing" specific houses, the following constraints were applied:
* n_estimators=100
* max_depth=15 (Preventing overly complex trees)
* min_samples_leaf=5 (Ensuring rules are based on groups, not single houses)
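The configuration above amounts to the following; `random_state` is my addition for reproducibility and was not stated in the original.

```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=100,     # 100 trees in the forest
    max_depth=15,         # cap tree complexity
    min_samples_leaf=5,   # each leaf must cover at least 5 houses
    random_state=42,      # assumption: fixed seed for reproducibility
)
```

`max_depth` and `min_samples_leaf` pull in the same direction: both stop a tree from carving out a rule that matches exactly one house, which is what "memorizing" means here.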
5. Results & Evaluation
Error Metrics
- Training RMSE: ~$166,719.77
- Cross-Validation Mean: ~$208,265 (The "Honest" Error)
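The "Honest" cross-validation error can be computed with `cross_val_score` along these lines. The data here is synthetic stand-in data (the real run would feed the preprocessed housing frame), and the fold count of 5 is an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed housing data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = RandomForestRegressor(
    n_estimators=50, max_depth=15, min_samples_leaf=5, random_state=0
)

scores = cross_val_score(model, X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
cv_rmse = -scores.mean()  # flip sign: scikit-learn maximizes scores
```

The gap between training RMSE and the CV mean (~$167k vs ~$208k) is the usual optimism of in-sample error; the CV figure is the one to quote.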
Performance Insight: By cleaning the data and constraining the model, the cross-validation error dropped from $300k+ to roughly $208k, creating a more stable and reliable predictor.
Top Price Drivers
- Distance: (31%) - Proximity to the Melbourne CBD is the #1 predictor.
- Type: (18%) - Distinguishes between Houses, Units, and Townhouses.
- Postcode: (17%) - Effectively acts as a proxy for suburb.
6. Future Improvements
- School Zones: Incorporate school catchment data.
- Renovation Status: Add a feature for "Newly Renovated" vs "Original Condition."
- Hyperparameter Tuning: Use `GridSearchCV` to find the optimal balance of depth and leaf size.
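A sketch of what that tuning step might look like; the grid values are illustrative guesses bracketing the current settings (`max_depth=15`, `min_samples_leaf=5`), not values the project has tested.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Candidate values bracketing the current configuration (assumed grid).
param_grid = {
    "max_depth": [10, 15, 20],
    "min_samples_leaf": [2, 5, 10],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
# After search.fit(X_train, y_train), search.best_params_ holds the winner.
```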
7. Conclusion (Until improvements are made)
While the model is currently fairly stable, it still has a large margin of error by real-world standards. Nobody could rely on this model to buy a property when the estimate might be off by roughly $200k AUD.