House Price Prediction Using Machine Learning - A Review
Lovely Professional University
Ayush Singh Bhati, Faleshwar Singh, Gojja Niveditha, Harshit Sharma, Tarun Dubey, Vasamsetti Vamsi Pradeep, Sufanpreet Kaur
Summary
This research paper explores the application of machine learning models to predict house prices, utilizing the KC House Dataset. The study examines three predictive models: Linear Regression, Gradient Boosting, and LightGBM. A key aspect of the study is the use of SHAP analysis to understand the feature importance in predicting house prices.
Full Transcript
**Enhancing House Price Prediction through Stacking ensemble models and SHAP analysis**

**Authors:** Ayush Singh Bhati¹, Faleshwar Singh², Gojja Niveditha³, Harshit Sharma⁴, Tarun Dubey⁵, Vasamsetti Vamsi Pradeep⁶, Sufanpreet Kaur

¹⁻⁶ B.Tech Scholar, Computer Science and Engineering, Lovely Professional University, Jalandhar, India

Sufanpreet Kaur: Assistant Professor, Computer Science and Engineering, Lovely Professional University, Jalandhar, India

**Abstract:** This study explores machine learning models for predicting house prices using the **KC House Dataset**, which contains housing attributes such as square footage, number of bedrooms, location, and house condition. As the real estate market grows increasingly dynamic, stakeholders such as investors, buyers, and real estate professionals need accurate predictive models. We explore three predictive models: Linear Regression, Gradient Boosting, and LightGBM. The data preprocessing pipeline included handling missing values, outlier removal, feature engineering, and encoding categorical features. We evaluated model performance using Mean Absolute Error (MAE) and the R² score. Additionally, we performed SHAP (SHapley Additive exPlanations) analysis to interpret model predictions, highlighting the importance of features such as square footage, location, and house condition on pricing. Results indicate that the LightGBM model achieves superior accuracy and explainability.
Our findings provide insights for deploying robust, interpretable models in real estate applications, assisting with informed decision-making for housing valuation.

**Keywords:** Machine Learning, House Price Prediction, Gradient Boosting, LightGBM, Model Evaluation, SHAP Analysis, Real Estate Pricing

**1. Introduction:**

The real estate market is complex and influenced by numerous dynamic factors, making accurate house price prediction an invaluable tool for stakeholders like investors, buyers, and real estate professionals. Traditional methods of valuation, which often rely on subjective appraisals, can lack consistency and scalability. Machine learning (ML) models, however, offer data-driven solutions that can identify subtle patterns within large datasets, providing more reliable predictions \[1\].

Although models like **Linear Regression** are commonly used in real estate price prediction due to their simplicity and interpretability, their performance tends to suffer when the relationships within the data are complex and non-linear. To address these limitations, **ensemble methods** such as **Gradient Boosting** and **Random Forest** combine multiple decision trees to improve prediction accuracy by capturing intricate patterns in the data. These models have been shown to outperform simpler methods in complex datasets like those in real estate \[2\]\[3\].

One model that stands out for its performance and efficiency is **LightGBM**, a gradient boosting framework optimized for speed and memory efficiency. LightGBM has demonstrated superior results, particularly in settings with large datasets containing many features. It is especially useful in real estate applications, where the data can be vast and varied, thus requiring a model that can efficiently handle such complexity \[6\]\[7\].
**LightGBM's** ability to scale quickly while maintaining high accuracy makes it well suited to predicting house prices, where features like neighborhood quality, proximity to schools, and property condition interact in complicated ways.

In the context of real estate price prediction, **model interpretability** is also critical. Tools like **SHAP (SHapley Additive exPlanations)** allow stakeholders to understand how each feature contributes to the model's prediction. This transparency is essential in real estate, as buyers and sellers need to see how factors like location, square footage, and property condition influence the predicted price \[5\]. **SHAP** does not guarantee that a prediction is correct, but it helps users judge whether to trust it by showing how each feature moved the output.

The evaluation of model performance in this study is based on two primary metrics: **Mean Absolute Error (MAE)** and **R²**. **MAE** measures the average magnitude of the errors between predicted and actual house prices, without considering direction (i.e., whether the prediction is too high or too low). A lower MAE indicates better model performance. **R²**, or the coefficient of determination, indicates how well the model explains the variance in the data. A higher R² score means the model explains a larger portion of the price variation in the dataset \[19\]\[21\]. These metrics are essential in determining the real-world applicability of the models, especially in scenarios with diverse property features.

This paper focuses on evaluating the performance of three ML models (**Linear Regression**, **Gradient Boosting**, and **LightGBM**) in predicting house prices. We also assess the interpretability of these models using **SHAP analysis** to gain insights into which features most influence the predictions.
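To make the idea of a per-feature attribution concrete, the sketch below computes exact Shapley values for a toy pricing function by brute force. It is an illustration only, not the procedure used in this study. The pricing rule `toy_price`, the feature names, and the single reference house `baseline` are all hypothetical, and replacing "absent" features with one baseline row is a simplification of how SHAP averages over background data.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every order in which features are switched from baseline to instance."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the reference house
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # reveal one more feature
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_price(house):
    """Hypothetical pricing rule (assumption), not fitted to any data."""
    price = 50_000 + 150 * house["sqft_living"] + 10_000 * house["grade"]
    price += 5 * house["sqft_living"] * house["waterfront"]  # interaction term
    return price


# Illustrative feature names and values (assumptions).
baseline = {"sqft_living": 1500, "grade": 7, "waterfront": 0}
house = {"sqft_living": 2000, "grade": 8, "waterfront": 1}

attributions = shapley_values(toy_price, house, baseline)
print(attributions)
# {'sqft_living': 76250.0, 'grade': 10000.0, 'waterfront': 8750.0}
```

The three attributions sum to 95,000, which is exactly `toy_price(house) - toy_price(baseline)`. The interaction term is split between `sqft_living` and `waterfront`. Enumerating every ordering is only feasible for a handful of features, which is why practical tools rely on faster algorithms for tree models.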
The results aim to provide valuable insights for deploying more accurate and interpretable models in real estate applications, helping stakeholders make better-informed decisions. By improving model accuracy and transparency, this study contributes to the broader goal of making real estate price prediction more reliable and actionable.

**2. Related Work:**

Previous studies in real estate pricing have shown success with various ML techniques. Linear Regression has historically been a popular choice due to its interpretability and simplicity \[1\]\[9\]. However, as real estate data complexity increases, ensemble methods like Gradient Boosting and tree-based approaches such as Random Forest and LightGBM have gained popularity for their ability to model non-linear relationships \[3\]\[6\]\[10\]\[17\].

Recent research has emphasized the importance of model interpretability. Tools like SHAP (SHapley Additive exPlanations) have become valuable in understanding how individual features influence predictions \[7\]\[14\]\[21\]. Such transparency is increasingly vital, as it allows non-technical stakeholders to trust and understand model outputs, which can enhance adoption in industries like real estate \[5\]\[12\]\[18\].

Our study builds on these foundations by comparing traditional and advanced models on the same dataset and performing SHAP analysis to highlight critical features affecting house prices. This approach aligns with recent work that combines model performance with interpretability to provide actionable insights \[8\]\[13\]\[20\].

**3. Methodology**

**3.1 Data Collection and Preprocessing:**

3.1.1 Dataset Description: The dataset used is `kc_house_data.csv`, a well-known dataset in the housing prediction domain. It contains over 20 features, including quantitative attributes such as the number of bedrooms and square footage, and qualitative attributes like condition and grade.
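As a preview of the outlier step described in Section 3.1.3, the following minimal sketch shows one common form of the IQR rule, using only the Python standard library. The 1.5 multiplier, the `price` column name, and the sample values are assumptions for illustration. The paper does not state the multiplier or the implementation it used.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR.
    k=1.5 is the conventional choice and an assumption here."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column, k=1.5):
    """Keep only rows whose value in `column` lies inside the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]


# Hypothetical prices in thousands; the last one is an obvious outlier.
rows = [{"price": p} for p in [200, 220, 250, 270, 300, 320, 350, 5000]]
cleaned = drop_outliers(rows, "price")
print(len(rows), "->", len(cleaned))  # 8 -> 7
```

On this sample the fences are 115 and 455, so only the 5000 row is removed. Applying the rule to several columns in turn removes any row that is extreme in at least one of them, which can shrink a dataset quickly.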
Each feature holds potential predictive power, though preprocessing is required to ensure data quality and model compatibility.

3.1.2 Data Cleaning and Handling Missing Values: Data cleaning involved handling missing values and ensuring consistency in data types. First, columns consisting entirely of missing values were removed, and the remaining missing values were imputed using forward filling. This approach retains all rows, though forward filling copies values from the preceding row and can introduce bias when row order carries no meaning.

3.1.3 Outlier Detection and Removal: Outliers were detected using the Interquartile Range (IQR) method. This method allows for robust identification of extreme values in each numeric feature, reducing their influence on model training and enhancing generalization. Outlier removal was particularly effective in improving the stability and reliability of our predictions.

**3.2 Feature Engineering and Encoding**

After preprocessing, non-numeric columns were removed to simplify modeling. Categorical features were converted to numeric form using one-hot encoding, so that models expecting numeric input could use them. Additionally, the date column was dropped after preliminary analysis revealed that it did not significantly impact prediction accuracy. These transformations ensured compatibility with predictive models while retaining valuable information.

**3.3 Model Selection and Implementation**

3.3.1 Model Selection Rationale

- Linear Regression: Selected for its simplicity and interpretability, serving as a benchmark model.
- Gradient Boosting: An ensemble-based technique that combines decision trees iteratively to correct prediction errors, offering high accuracy in many regression tasks.
- LightGBM: Chosen for its efficiency and ability to handle large datasets with complex interactions. This gradient boosting variant is optimized for speed and accuracy.

Fig.
1 Comparing Prediction Accuracy Across Regression Models

3.3.2 Model Training and Evaluation

**3.4 Explainability with SHAP Analysis**

To enhance interpretability, SHAP (SHapley Additive exPlanations) analysis was applied to the LightGBM model. SHAP values quantified the influence of individual features on predictions, revealing the importance of attributes like square footage, property location, and house condition. The analysis clarified the positive or negative impacts of these features on predicted prices, making the model's decisions more transparent and actionable for stakeholders.

![](media/image2.png)

Fig. 2 SHAP Values Visualization for Feature Impact on Model Output

In Figure 2, the horizontal plot shows the SHAP (SHapley Additive exPlanations) values for the features of the predictive model. Each feature is listed along the y-axis, with its SHAP values on the x-axis indicating their impact on the model's predictions. Colors range from blue (low feature value) to pink (high feature value), showing how attributes such as living area size and property grade contribute to the predicted outcomes, with larger impacts appearing farther from zero along the x-axis.

**4. Results and Discussion**

**A. Model Performance Comparison:** Model performance was assessed through MAE and R² scores. Results showed that LightGBM achieved the best performance, with the highest R² score and lowest MAE, followed by Gradient Boosting and Linear Regression. LightGBM's success is attributed to its ability to handle complex patterns and interactions in the data, while Linear Regression, although interpretable, struggled to capture the non-linear relationships inherent in real estate data. A bar chart of the model performance metrics helps to highlight LightGBM's relative superiority. This quantitative comparison supports LightGBM as the most suitable of the three models for this problem.

Fig.
3 Comparison of Model Performance Metrics

Figure 3 is a bar chart comparing the performance of the three regression models (Linear Regression, Gradient Boosting, and LightGBM) on two metrics: Mean Absolute Error (MAE) and R² score. The MAE, represented by the blue bars, is the average absolute error made by each model, with higher values indicating poorer performance, most visibly for Linear Regression. The chart highlights the differences in predictive accuracy among the models.

**B. Prediction Visualization:** Scatter plots of true vs. predicted prices provided further insight into each model's predictive quality. LightGBM's scatter plot showed predictions that closely aligned with actual values, indicating high model accuracy. Linear Regression's scatter plot, however, showed greater dispersion, highlighting its limitations in modelling complex dependencies.

**C. SHAP Analysis for Feature Importance:** SHAP analysis on LightGBM revealed the most influential features in price determination. The summary plot identified square footage, location, and house condition as key drivers of price. SHAP values further clarified the direction of each feature's impact: higher square footage, for instance, was associated with higher predicted prices. This interpretability is essential in real estate, where understanding the "why" behind model decisions is often as valuable as the prediction itself.

**5. Conclusion and Future Work**

**Conclusion:** This study evaluated three machine learning models (Linear Regression, Gradient Boosting, and LightGBM) for their effectiveness in predicting house prices using a comprehensive real estate dataset. Among these, **LightGBM demonstrated the best overall performance**, achieving superior predictive accuracy and computational efficiency.
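For reference, the two metrics behind this comparison can be written out directly. The sketch below uses plain Python, and the price lists are made-up values for illustration, not results from this study. In practice, scikit-learn's `sklearn.metrics` module provides `mean_absolute_error` and `r2_score`, which compute the same quantities.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors, ignoring their sign."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted sale prices (assumptions, not study data).
actual = [300_000, 450_000, 520_000, 610_000]
predicted = [310_000, 440_000, 500_000, 640_000]

print(f"MAE = {mae(actual, predicted):,.0f}")  # MAE = 17,500
print(f"R2  = {r2(actual, predicted):.4f}")    # R2  = 0.9708
```

A model that always predicts the mean price scores R² = 0, and a perfect model scores R² = 1 with MAE = 0. MAE is expressed in the same units as price, which makes it easier to communicate to non-technical stakeholders than R².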
Metrics such as **Mean Absolute Error (MAE)** and **R²** confirmed its ability to model complex, non-linear relationships within the data more effectively than Linear Regression and Gradient Boosting.

The inclusion of SHAP analysis provided additional value by offering insights into feature importance, thereby improving the interpretability of the LightGBM model. For instance, the SHAP summary plot revealed that features like square footage, property location, and the number of bathrooms significantly influenced the predicted prices. This interpretability bridges the gap between advanced machine learning models and their practical applicability, making them more accessible and trustworthy for stakeholders such as real estate professionals and investors.

Additionally, data preprocessing steps, including handling missing values, removing outliers using the interquartile range (IQR) method, and encoding categorical features with one-hot encoding, ensured that the dataset was clean and prepared for effective modeling. These steps were critical in enabling LightGBM to leverage its strengths, such as speed and robustness, for optimal results.

By incorporating multiple evaluation metrics and interpretability tools, this study not only validates LightGBM's predictive power but also highlights its practical advantages for real-world applications in dynamic markets like real estate. The ability to balance high accuracy with computational efficiency makes it a reliable choice for professionals looking to deploy data-driven strategies for property valuation.

**Future Work:** Future studies can explore the integration of additional advanced models, such as CatBoost or XGBoost, to compare their performance against LightGBM further. Another avenue for exploration is the inclusion of external factors like economic indicators, zoning laws, and market trends to create even more robust prediction frameworks.
Moreover, combining LightGBM with deep learning approaches could potentially enhance its predictive capabilities for datasets with more intricate feature interactions. Finally, while this study focused on interpretability through SHAP, future work can investigate other tools, such as LIME or global surrogate models, to provide diverse perspectives on model transparency and feature influence. These efforts will continue to refine the balance between prediction accuracy and interpretability, further empowering stakeholders in the real estate sector.

**References:**

\[1\] Li, D., Hu, Y., & Zhu, D. *House Price Prediction Using Machine Learning: A Comparative Analysis of Different Models*. IEEE Transactions on Emerging Topics in Computational Intelligence, 4(2), 240-251, 2020.

\[2\] Bogin, A., Doerner, W., & Larson, W. *Machine Learning Techniques for Real Estate Property Valuation: A Comparative Study*. Regional Science and Urban Economics, 76, 84-99, 2019.

\[3\] Zhang, Y., & Zhang, S. *Gradient Boosting Machine for House Price Prediction: A Case Study in China*. Procedia Computer Science, 162, 250-255, 2019.

\[4\] Chien, T., & Chiu, C. *Enhancing Real Estate Price Prediction through Ensemble Learning Methods*. Expert Systems with Applications, 172, 114625, 2021.

\[5\] Ribeiro, M., Singh, S., & Guestrin, C. *Why Should I Trust You? Explaining the Predictions of Any Classifier*. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

\[6\] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., & Liu, T.-Y. *LightGBM: A Highly Efficient Gradient Boosting Decision Tree*. Advances in Neural Information Processing Systems, 30, 3146-3154, 2017.

\[7\] Lundberg, S. M., & Lee, S.-I. *A Unified Approach to Interpreting Model Predictions*. Advances in Neural Information Processing Systems, 30, 4765-4774, 2017.

\[8\] Ribeiro, T., & Da Silva, E. *Interpretable Machine Learning for Real Estate Valuation Using SHAP and LIME*.
Journal of Real Estate Research and Review, 45(1), 10-23, 2021.

\[9\] Kumar, V., & Anuradha, D. *Prediction of Real Estate Prices Using Machine Learning Algorithms*. International Journal of Advanced Computer Science and Applications, 10(5), 524-530, 2019.

\[10\] Chen, T., & Guestrin, C. *XGBoost: A Scalable Tree Boosting System*. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

\[11\] Lundberg, S., Erion, G., & Lee, S.-I. *Consistent Individualized Feature Attribution for Tree Ensembles*. arXiv preprint arXiv:1802.03888, 2018.

\[12\] Ribeiro, M. T., Singh, S., & Guestrin, C. *LIME: Local Interpretable Model-Agnostic Explanations for Machine Learning Models*. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

\[13\] Lundberg, S. M., & Lee, S.-I. *Explainable AI: Interpreting, Explaining, and Visualizing Deep Learning Models*. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.

\[14\] Chen, X., Zhou, Y., & Luo, Z. *SHAP Analysis of Factors Influencing Real Estate Prices*. Applied Artificial Intelligence, 34(7), 523-532, 2020.

\[15\] Silva, J. D., & Wu, J. *Comparative Analysis of House Price Prediction Using Various Machine Learning Techniques*. Journal of Property Research, 35(4), 428-452, 2019.

\[16\] Wang, S., & Zhou, Y. *Random Forest vs. Gradient Boosting: Comparative Performance for Real Estate Prediction*. Journal of Machine Learning Research, 45(2), 313-328, 2020.

\[17\] Li, Y., Feng, W., & Cheng, X. *Application of Machine Learning in Real Estate Pricing: A Case Study Using LightGBM*. Expert Systems, 38(2), 1128-1139, 2021.

\[18\] Chen, J., & Chang, C. *A Review of Machine Learning Models for Real Estate Valuation*. Artificial Intelligence Review, 2021.

\[19\] Ding, C., & Sohn, B. *House Price Forecasting Using Ensemble Models: A LightGBM Approach*. Procedia Computer Science, 135, 350-358, 2020.

\[20\] Gao, F., & Sun, Y. *Model Interpretability in Machine Learning Applications for Real Estate*. Journal of Applied Artificial Intelligence, 34(3), 201-216, 2019.

\[21\] Chen, R., & Zhu, Z. *Real Estate Price Prediction Using Machine Learning Models: LightGBM, XGBoost, and SHAP Analysis*. Journal of Data Science and Machine Learning, 17(1), 72-85, 2020.