Challenges and solutions in data collection and model evaluation in supervised machine learning: a review article

Authors:

Saeedeh Aliakbari *, Payman Hejazi, Zeinab Hormozi-Moghaddam


How to cite: Aliakbari S, Hejazi P, Hormozi-Moghaddam Z. Challenges and solutions in data collection and model evaluation in supervised machine learning: a review article. Koomesh. 2023;25(6):e152854.

Abstract

Introduction: Machine learning is a complex process carried out by specifying a model and training it on a large volume of data. In the past, the main focus in this field was on improving model structures and algorithms, but recently more emphasis has been placed on the quality and quantity of data. This article aims to provide an overview of the problems encountered in data collection and to offer solutions for them.

Materials and Methods: In this review, the challenges researchers face in collecting data and in evaluating supervised machine-learning models were examined. Documents were retrieved from the PubMed, Scopus, and ScienceDirect databases and the Google Scholar search engine for the period 2001 to 2023. After screening, a total of 17 full articles were reviewed and included in the study.

Results: The findings indicate that researchers in supervised machine-learning studies face four challenges in data collection: an insufficient number of samples, unrepresentative training data, poor data quality, and irrelevant features. In model evaluation, they face four further challenges: overfitting, lack of generalizability, insufficient data for validation, and mismatched data.

Conclusion: Increasing the sample size, using a random selection algorithm, cleaning the data, applying the correct statistical test, feature selection, feature extraction, using a simpler model, the K-fold technique, and data processing are among the measures that contribute to achieving a model with better performance.
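To make a few of the remedies named in the Conclusion more concrete, the sketch below shows how feature selection, a deliberately simple model, and K-fold cross-validation might be combined in practice. It is a minimal illustration only, assuming scikit-learn and a synthetic dataset; the library, the parameter values (5 selected features, 5 folds), and the model choice are assumptions made for this example and are not drawn from the reviewed studies.

    # Minimal sketch (illustrative only): feature selection + simple model
    # + K-fold cross-validation with scikit-learn on synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic data: 200 samples, 20 features of which only 5 are informative,
    # mimicking the "irrelevant features" challenge discussed in the review.
    X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                               random_state=0)

    # Pipeline: scale, keep the 5 most relevant features (univariate F-test),
    # then fit a deliberately simple model to reduce the risk of overfitting.
    model = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=5)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # K-fold cross-validation (k = 5), so every sample is used for both training
    # and validation, easing the "insufficient data for validation" problem.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"Accuracy per fold: {scores.round(3)}; mean = {scores.mean():.3f}")

In this sketch, cross_val_score reports one accuracy value per fold, so a small spread across folds serves as a quick check on generalizability before any separate held-out test set is consulted.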
