Predicting housing sale prices in Germany by application of machine learning models and methods of data exploration

Chong Dae Kim
Nils Bedorf

Abstract

The prediction of real estate prices is a popular problem in machine learning and is often demonstrated in the literature. In contrast to other approaches, which regularly focus on the US market, this paper investigates the largest German real estate dataset, with more than 1.5 million unique samples and more than 20 features. We implement and compare different machine learning models with respect to performance and interpretability to give insight into the most important properties that contribute to the sale price. Our experiments suggest that predicting sale prices in a real-world scenario is achievable, yet limited by the quality of the data rather than its quantity. The results show promising prediction scores but also depend heavily on location, which leaves room for further evaluation.

Article Details

How to Cite
Kim, C. D., & Bedorf, N. (2024). Predicting housing sale prices in Germany by application of machine learning models and methods of data exploration. Kwartalnik Nauk O Przedsiębiorstwie, 71(1), 107–122. https://doi.org/10.33119/KNoP.2024.71.1.7
