Predicting housing sale prices in Germany by application of machine learning models and methods of data exploration
Abstract
The prediction of real estate prices is a popular problem in machine learning and is frequently addressed in the literature. In contrast to other approaches, which typically focus on the US market, this paper investigates the largest German real estate dataset, comprising more than 1.5 million unique samples with more than 20 features each. We implement and compare different machine learning models with respect to performance and interpretability in order to give insight into the properties that contribute most to the sale price. Our experiments suggest that predicting sale prices in a real-world scenario is achievable, but is limited by the quality of the data rather than its quantity. The results show promising prediction scores, yet they depend heavily on location, which leaves room for further evaluation.
Article Details
This work is licensed under a Creative Commons Attribution 4.0 International License.
The author of the article declares that the submitted article does not infringe the copyrights of third parties. The author agrees to subject the article to the review procedure and to editorial changes. The author transfers, free of charge, to SGH Publishing House the author's economic rights to the work in the fields of exploitation listed in Article 50 of the Act of 4 February 1994 on Copyright and Related Rights, provided that the work has been accepted for publication and published.
SGH Publishing House holds the economic copyrights to all content of the journal. Placing the text of the article in a repository, on the author's home page, or on any other page is permitted as long as it does not involve obtaining economic benefits and the text is provided with source information (including the title, year, number, and internet address of the journal).