1. Background
2. Objectives
3. Methods
3.1. Data Description
3.2. Machine Learning Methods
| ID | Method | Description | Parameters | Advantages | Disadvantages |
|---|---|---|---|---|---|
| 1. | K-NN | K-NN is used to classify or predict an instance by using the class labels of its nearest neighbors. | The number of neighbors (K) is an important parameter. | Simple and effective. | Computationally expensive for large datasets. |
| 2. | LR | Logistic regression is a linear model used for classification problems. | Controlled mainly by the regularization term (penalty type and strength). | Simple, interpretable, and fast. | Limited in modeling complex, non-linear relationships. |
| 3. | FOREST | Random Forest classifies by combining many decision trees. | Important parameters include the number of trees, feature selection, etc. | Strong, high generalization, and resistant to overfitting. | Complex internal structure. |
| 4. | SVM | SVM tries to find the maximum-margin separating hyperplane between two classes. | Important parameters include kernel type, C (penalty for misclassified training points), etc. | Effective on high-dimensional data and often successful with limited datasets. | Long training time for large datasets. |
| 5. | TREE | Decision tree is used for classification or regression tasks using a tree structure. | Important parameters include tree depth, minimum sample split, etc. | Easy to understand structures, low data preprocessing requirement. | Prone to overfitting. |
| 6. | LDA | LDA finds axes that best express the difference between classes. | Few parameters to tune. | Emphasizes differences between classes, provides dimensionality reduction. | Assumes equal covariances between classes by default. |
| 7. | GNB | A probabilistic classifier based on Bayes' theorem that assumes independence between features. | Few parameters to tune. | Simple, fast, often successful in tasks like text classification. | Independence assumption may not hold in the real world. |
| 8. | EXTRA | Similar to Random Forest, but selects split points in trees more randomly. | Important parameters include the number of trees, feature selection, etc. | Resistant to overfitting, low variance due to random feature selection. | Complex internal structure. |
| 9. | GRADIENT | An ensemble learning algorithm that combines weak learners (often decision trees) sequentially to create a strong model. | Important parameters include learning rate, number of trees, etc. | High generalization ability, successful on many datasets. | May require more training time and tuning. |
| 10. | ADA | Combines weak classifiers to create a strong classifier by focusing on misclassified examples. | Important parameters include the type of weak learner, learning rate, etc. | Resistant to overfitting, high generalization ability. | Sensitive to tuning. |
| 11. | XGB | Tree-based learning algorithm using gradient boosting technique, known for its speed and performance. | Important parameters include learning rate, number of trees, etc. | Fast, high-performance, successful in many data science competitions. | May require more tuning and hyperparameter selection. |
| 12. | BGC | Bagging improves a model's performance by training base learners on different subsamples of the data and aggregating their predictions. | Important parameters include the type of base learner, sampling strategy, etc. | Resistant to overfitting, low variance. | Performance depends on the type of base learner. |
| 13. | MLP | A type of artificial neural network with multiple layers that updates weights during the learning process. | Important parameters include the number of layers, number of hidden neurons, etc. | Ability to learn complex relationships, suitable for large datasets. | Long training time for large datasets, tendency for overfitting. |
Abbreviations: K-NN, K-nearest neighbors; LR, logistic regression; FOREST, random forest; SVM, support vector machine; TREE, decision tree; LDA, linear discriminant analysis; GNB, Gaussian naive Bayes; EXTRA, extra trees classifier; GRADIENT, gradient boosting classifier; ADA, AdaBoost classifier; XGB, XGBoost classifier; BGC, bagging classifier; MLP, multilayer perceptron.
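As an illustration of how the methods and parameters listed in the table could be instantiated, the sketch below maps each abbreviation to a classifier. It assumes scikit-learn (and, optionally, the `xgboost` package), which the text does not name; the helper `build_models`, the seed, and every parameter value shown are hypothetical placeholders, not the settings used in the study.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Return a dict of abbreviation -> unfitted classifier.

    All parameter values are illustrative assumptions (mostly library
    defaults), not values reported by the authors.
    """
    models = {
        "K-NN": KNeighborsClassifier(n_neighbors=5),            # K
        "LR": LogisticRegression(C=1.0, max_iter=1000),         # regularization
        "FOREST": RandomForestClassifier(n_estimators=100, random_state=seed),
        "SVM": SVC(kernel="rbf", C=1.0),                        # kernel type, C
        "TREE": DecisionTreeClassifier(
            max_depth=None, min_samples_split=2, random_state=seed
        ),
        "LDA": LinearDiscriminantAnalysis(),
        "GNB": GaussianNB(),
        "EXTRA": ExtraTreesClassifier(n_estimators=100, random_state=seed),
        "GRADIENT": GradientBoostingClassifier(
            learning_rate=0.1, n_estimators=100, random_state=seed
        ),
        "ADA": AdaBoostClassifier(
            learning_rate=1.0, n_estimators=50, random_state=seed
        ),
        "BGC": BaggingClassifier(n_estimators=10, random_state=seed),
        "MLP": MLPClassifier(
            hidden_layer_sizes=(100,), max_iter=1000, random_state=seed
        ),
    }
    try:
        # XGBoost lives in a separate package; skip it if not installed.
        from xgboost import XGBClassifier

        models["XGB"] = XGBClassifier(
            learning_rate=0.1, n_estimators=100, random_state=seed
        )
    except ImportError:
        pass
    return models
```

Each entry exposes the usual `fit` and `predict` methods, so the same training and evaluation loop can be applied to every method in the table.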
3.3. Performance Metrics
3.3.1. Confusion Matrix
3.3.2. Accuracy
3.3.3. Precision
3.3.4. Recall (Sensitivity)
3.3.5. F1 Score
3.3.6. Confidence Interval
4. Results
| Metric | K-NN | LR | RF | SVM | DTREE | LDA | GNB | EXTRA | GRA | ADA | XGB | BGC | MLP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.703 | 0.730 | 0.730 | 0.946 | 0.694 | 0.721 | 0.667 | 0.892 | 0.730 | 0.658 | 0.802 | 0.685 | 0.577 |
| Confusion matrix (row 1) | [27, 11] | [18, 20] | [13, 25] | [32, 6] | [33, 5] | [17, 21] | [7, 31] | [34, 4] | [20, 18] | [5, 33] | [34, 4] | [10, 28] | [32, 6] |
| Confusion matrix (row 2) | [22, 51] | [10, 63] | [5, 68] | [0, 73] | [29, 44] | [10, 63] | [6, 67] | [8, 65] | [12, 61] | [5, 68] | [18, 55] | [7, 66] | [41, 32] |
| Precision | 0.730 | 0.719 | 0.728 | 0.950 | 0.773 | 0.709 | 0.634 | 0.897 | 0.722 | 0.614 | 0.837 | 0.663 | 0.704 |
| Recall | 0.703 | 0.730 | 0.730 | 0.946 | 0.694 | 0.721 | 0.667 | 0.892 | 0.730 | 0.658 | 0.802 | 0.685 | 0.577 |
| F1 score | 0.709 | 0.718 | 0.698 | 0.945 | 0.700 | 0.707 | 0.609 | 0.893 | 0.723 | 0.585 | 0.807 | 0.644 | 0.577 |
| Confidence interval (upper bound) | 0.753 | 0.797 | 0.805 | 1.000 | 0.733 | 0.789 | 0.746 | 0.948 | 0.794 | 0.740 | 0.850 | 0.761 | 0.607 |
| Confidence interval (lower bound) | 0.652 | 0.662 | 0.654 | 0.882 | 0.654 | 0.652 | 0.587 | 0.836 | 0.666 | 0.576 | 0.754 | 0.608 | 0.546 |
Abbreviations: K-NN, K-nearest neighbors; LR, logistic regression; RF, random forest; SVM, support vector machine; DTREE, decision tree; LDA, linear discriminant analysis; GNB, Gaussian naive Bayes; EXTRA, extra trees classifier; GRA, gradient boosting classifier; ADA, AdaBoost classifier; XGB, XGBoost classifier; BGC, bagging classifier; MLP, multilayer perceptron.
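The accuracy, precision, recall, and F1 values in the table can be recomputed from the confusion matrices. The sketch below does this in plain Python under two assumptions that the text does not state: each bracketed pair is one row of a 2x2 matrix with true classes in rows and predicted classes in columns, and the per-class scores are averaged with weights proportional to class support. The function name `weighted_metrics` is hypothetical. Under these assumptions the K-NN and SVM columns are reproduced to three decimals, which also explains why the recall row equals the accuracy row.

```python
def weighted_metrics(cm):
    """Compute accuracy and support-weighted precision, recall, and F1.

    cm[i][j] is assumed to be the number of samples whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        support = sum(cm[k])                               # true members of class k
        predicted = sum(cm[i][k] for i in range(n_classes))  # predicted as class k
        p_k = tp / predicted if predicted else 0.0
        r_k = tp / support if support else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = support / total                           # class share of the test set
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Confusion matrices taken from the K-NN and SVM columns of the table.
knn = weighted_metrics([[27, 11], [22, 51]])
svm = weighted_metrics([[32, 6], [0, 73]])
print({name: round(value, 3) for name, value in knn.items()})
print({name: round(value, 3) for name, value in svm.items()})
```

The confidence bounds are not recomputed here because the text does not say how they were obtained.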






