Challenges and solutions in data collection and model evaluation in supervised machine learning: a review article

Authors:

Saeedeh Aliakbari *, Payman Hejazi, Zeinab Hormozi-Moghaddam


How to cite: Aliakbari S, Hejazi P, Hormozi-Moghaddam Z. Challenges and solutions in data collection and model evaluation in supervised machine learning: a review article. Koomesh. 2023;25(6):e152854.

Abstract

Introduction: Machine learning is a complex process carried out by specifying a model and training it on a large volume of data. In the past, the main focus in this field was on improving model structures and algorithms, but recently more emphasis has been placed on the quality and quantity of data. This article aims to provide an overview of the problems encountered in data collection and to offer solutions for them.

Materials and Methods: In this review, the challenges researchers face in collecting data and in evaluating supervised machine-learning models were examined. Documents were retrieved from the PubMed, Scopus, and ScienceDirect databases and the Google Scholar search engine for the period 2001 to 2023. After screening, a total of 17 full articles were reviewed and included in the study.

Results: The findings indicate that researchers in supervised machine-learning studies face four challenges in data collection: an insufficient number of samples, unrepresentative training data, poor data quality, and irrelevant features. In model evaluation, they face four further challenges: overfitting, lack of generalizability, insufficient data for validation, and mismatched data.

Conclusion: Increasing the sample size, using a random selection algorithm, cleaning the data, applying the correct statistical test, feature selection, feature extraction, using a simpler model, the K-fold technique, and data processing are among the measures that contribute to achieving a model with better performance.
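To make a few of the remedies named in the Conclusion more concrete, the sketch below shows how feature selection, a deliberately simple model, and K-fold cross-validation might be combined in practice. It is a minimal illustration only, assuming scikit-learn and a synthetic dataset; the library, the parameter values (5 selected features, 5 folds), and the model choice are assumptions made for this example and are not drawn from the reviewed studies.

    # Minimal sketch (illustrative only): feature selection + simple model
    # + K-fold cross-validation with scikit-learn on synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic data: 200 samples, 20 features of which only 5 are informative,
    # mimicking the "irrelevant features" challenge discussed in the review.
    X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                               random_state=0)

    # Pipeline: scale, keep the 5 most relevant features (univariate F-test),
    # then fit a deliberately simple model to reduce the risk of overfitting.
    model = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=5)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # K-fold cross-validation (k = 5), so every sample is used for both training
    # and validation, easing the "insufficient data for validation" problem.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"Accuracy per fold: {scores.round(3)}; mean = {scores.mean():.3f}")

In this sketch, cross_val_score reports one accuracy value per fold, so a small spread across folds serves as a quick check on generalizability before any separate held-out test set is consulted.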
