The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a process model for data mining that can be applied across industries. It comprises six phases (business understanding, data understanding, data preparation, modeling, evaluation, and deployment) that are carried out in sequence but iteratively, from understanding the business requirements to deploying the data mining solution (12).
To conduct our study, we gathered the electronic health records, medical data, and demographic information of 1,000 patients who were admitted to the ED of a hospital in Tehran. The data were retrospectively collected using the Hospital Information System unit during a one-month timeframe.
During the data preparation phase, we first removed patients with missing data. We then used the interquartile range (IQR) to detect and remove outliers. As a result, 200 patients were excluded from the complete dataset. The IQR is a measure of statistical dispersion, defined as the difference between the third quartile (Q3) and the first quartile (Q1) of a dataset (13). We label-encoded the binary target column, with class 0 denoting discharged and class 1 denoting expired. After considering the research by Newaz et al. (14), which compared model accuracy under over-sampling and under-sampling, we concluded that over-sampling was the more suitable approach for balancing the classes in the target column.
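The outlier and class-balancing steps described above can be sketched as follows. This is a minimal illustration, not the study's code: the 1.5 × IQR fence multiplier, the quartile interpolation method, the use of plain random duplication for over-sampling, and all function and column names are assumptions, since the text does not specify them.

```python
import random
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier fences computed from Q1 and Q3.

    k=1.5 is the conventional Tukey multiplier and is an assumption here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column, k=1.5):
    """Keep only the rows whose `column` value lies inside the IQR fences."""
    lower, upper = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]


def random_oversample(rows, target, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all
    classes have as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target], []).append(row)
    largest = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the largest class.
        balanced.extend(rng.choices(group, k=largest - len(group)))
    return balanced
```

Over-sampling is usually applied to the training portion only, so that duplicated rows do not appear in both training and test data; the text does not state how this was handled in the study.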
Feature selection is a crucial step in data analysis because it reduces the input to a concise set of pertinent features. The RF classifier is a common basis for wrapper algorithms because it provides a measure of variable importance (15). To reduce the risk of overfitting, we applied RF-based feature selection. Based on expert judgment, features with an importance score below 0.0095 were eliminated, and the models were built using the remaining features. In the modeling phase, we chose ensemble models because of their relatively good accuracy.
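The thresholding step can be sketched as below, assuming the importance scores are already available as a name-to-score mapping. Only the 0.0095 cutoff comes from the text; the function name, the feature names, and the scores are hypothetical and serve as illustration only.

```python
IMPORTANCE_THRESHOLD = 0.0095  # cutoff reported in the text


def select_features(importances, threshold=IMPORTANCE_THRESHOLD):
    """Return the names of features whose importance is at least
    `threshold`, ordered from most to least important."""
    kept = [(name, score) for name, score in importances.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


# Hypothetical scores for illustration only; these are not values from the study.
example_importances = {
    "age": 0.21,
    "heart_rate": 0.12,
    "triage_level": 0.05,
    "weekday": 0.004,
}
selected = select_features(example_importances)
```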
Ensemble models combine multiple models that work together to make predictions. The base models can be of the same type or of different types, and by leveraging the strengths of each one, an ensemble can often outperform any single model. Ensemble models have become popular in various domains, including machine learning and data science, because they can improve the overall performance and robustness of a prediction system. They can reduce bias and variance, improve generalization, and mitigate the risk of overfitting. By aggregating the predictions of multiple base models, ensemble models can capture a wider range of patterns and improve prediction accuracy (16). The commonly used ensemble techniques are bagging, boosting, and stacking (17):
- Bagging involves training multiple models, typically decision trees, on different bootstrap samples of the same dataset and then averaging their predictions (or taking a majority vote for classification).
- Boosting works by sequentially adding ensemble members that improve upon the predictions of prior models, ultimately resulting in a weighted combination of all predictions.
- Stacking involves training multiple models of different types on the same data and utilizing another model to learn the most effective way to combine these predictions.
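Of the three techniques listed above, bagging is the simplest to illustrate. The sketch below is a toy version under stated assumptions: the base learner is a one-feature threshold rule (a decision stump) rather than a full decision tree, predictions are combined by majority vote, and all names are hypothetical. Boosting and stacking are not shown.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-feature threshold rule that predicts 1 when x > threshold.

    Every observed x is tried as the threshold; the first one with the
    fewest training errors is kept.
    """
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in points}):
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def fit_bagging(points, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(points, k=len(points))) for _ in range(n_models)]


def predict_bagging(thresholds, x):
    """Combine the stumps' predictions by majority vote."""
    votes = Counter(int(x > threshold) for threshold in thresholds)
    return votes.most_common(1)[0][0]
```

Each stump sees a slightly different resampling of the data, so individual thresholds vary, while the vote across stumps is more stable than any single one.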
The RF algorithm is a widely used supervised ML technique for both classification and regression problems. It builds a collection of decision trees, each trained on a different subset of the dataset, and combines their predictions through averaging to improve overall predictive accuracy. This approach, known as bagging, has contributed to the algorithm's popularity. Empirical studies have reported that the RF classifier achieves higher classification rates than individual classifiers and requires less training time than Decision Tree and SVM algorithms (18).
CatBoost (CB) is a gradient boosting (GB) framework developed by Yandex, a Russian technology company best known for its search engine. It is specifically designed to work with categorical features and has been reported to perform competitively with, and often better than, other gradient-boosting implementations. CatBoost handles categorical features automatically, without requiring explicit feature engineering or encoding, which makes it a convenient choice for datasets containing categorical variables. It uses a permutation-based scheme called "Ordered Boosting", which is intended to reduce the prediction shift caused by target leakage during training (19).
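To give an intuition for how categorical values can be turned into numbers without leaking the target, the sketch below shows a simplified form of ordered target statistics, the permutation-based encoding idea associated with CatBoost. It is not the library's implementation: the single random permutation, the `prior` and `prior_weight` defaults, and the function name are assumptions made for illustration.

```python
import random


def ordered_target_statistics(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each row's category using only the targets of rows that come
    earlier in a random permutation, smoothed toward `prior`.

    encoded = (sum of earlier targets + prior_weight * prior)
              / (count of earlier rows + prior_weight)
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for index in order:
        category = categories[index]
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        encoded[index] = (seen_sum + prior_weight * prior) / (seen_count + prior_weight)
        # Update the running statistics only after encoding this row,
        # so a row's own target never influences its encoding.
        sums[category] = seen_sum + targets[index]
        counts[category] = seen_count + 1
    return encoded
```

Because each row is encoded before its own target is added to the running totals, the first occurrence of any category always receives the prior value.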
Some key features of CB include (20):
- Handling of categorical features
- Improved accuracy
- Fast training time
- Robustness to outliers
Hence, we employed RF and CB models in this study to predict mortality and assess their relative efficacy.
In the evaluation phase, accuracy, precision, recall, and the F1-score are standard criteria for evaluating classification models. These metrics are calculated as follows (21):
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
A true positive (TP) occurs when both the actual and predicted classes of a data point are 1. Conversely, a true negative (TN) occurs when both the actual and predicted classes are 0. A false positive (FP) occurs when the actual class is 0 but the predicted class is 1. Finally, a false negative (FN) occurs when the actual class is 1 but the predicted class is 0.
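As a worked illustration, the four counts and the standard definitions of accuracy, precision, recall, and the F1-score can be computed from lists of actual and predicted labels as sketched below. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the text.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return tp, tn, fp, fn


def classification_metrics(actual, predicted):
    """Return accuracy, precision, recall, and F1-score as a dict."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Fall back to 0.0 when a ratio is undefined (assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```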
K-fold cross-validation is a popular technique in ML for evaluating a model on a limited dataset. It helps estimate how well the trained model performs on unseen data. In k-fold cross-validation, the dataset is divided into k approximately equal-sized subsets, or folds. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, each time using a different fold as the test set and the remaining folds as the training set. The model's performance is then averaged over all k iterations to obtain a more reliable estimate (22). To obtain a more reliable assessment, we employed 5-fold cross-validation.
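The splitting and averaging procedure can be sketched as follows. This is a minimal, non-stratified version; the shuffling, the seed, and the `fit` and `score` callables are assumptions for illustration, and the text does not say whether the study's folds were stratified.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, fit, score, k=5, seed=0):
    """Average `score(model, test_indices)` over k folds.

    `fit(train_indices)` must return a fitted model; both callables are
    supplied by the caller.
    """
    scores = [score(fit(train), test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Every sample appears in exactly one test fold, so each observation contributes once to the averaged score.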
The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate of a binary classifier as its decision threshold is varied. It is commonly used in data mining and ML to assess classifier performance. The area under this curve (AUC) serves as a summary measure, and a higher area indicates a better-performing model (23).
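A minimal sketch of how the curve and the area under it can be computed from predicted scores is given below. It assumes binary labels (0 or 1) with at least one example of each class, and uses the trapezoidal rule for the area; the function names are hypothetical.

```python
def roc_curve_points(actual, scores):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the decision threshold through the observed scores.

    Assumes `actual` contains at least one 0 and one 1.
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Handle tied scores together so a tie yields a single segment.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under (x, y) points ordered by increasing x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An area of 1.0 corresponds to scores that rank every positive example above every negative one, while 0.5 corresponds to a ranking no better than chance.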