Predicting Factors Affecting Lymph Node Involvement in Breast Cancer Using Random Forest Approaches

Objectives: This study used the random forest method to develop a practical diagnostic function for predicting lymph node metastasis in patients diagnosed with breast cancer. Methods: The data for this retrospective cohort study were obtained from telephone interviews and the medical records of 241 patients with breast cancer referred to hospitals affiliated with Mazandaran University of Medical Sciences between 2016 and 2022. Random forest analysis, carried out in R, was used to identify the factors associated with lymph node metastasis. Results: The mean age at diagnosis was 52.03 ± 10.932 years. The random forest model attained an accuracy of 72.2%. Based on the mean decrease accuracy (MDA) index, the influential factors were grade, tubule formation, skin involvement, p53 marker, margin involvement, nuclear pleomorphism, Ki67, tumor location, and the estrogen receptor (ER) and progesterone receptor (PR) markers. Based on the mean decrease Gini (MDG) index, the influential variables were age, grade, tubule formation, tumor size, nuclear pleomorphism, disease level, mitosis, skin involvement, tumor location, and margin involvement. Conclusions: The random forest algorithm, which showed a favorable level of discriminative capability, may be a suitable approach for predicting lymph node metastasis in patients with breast cancer. Identifying these factors can also help experts choose effective strategies for managing the condition.

Int J Cancer Manag. 2024; 17(1): e140283.
Cancer is the medical term for the abnormal proliferation of cells within the human body. Every year, a significant number of women die of breast cancer. Cancer cells undergo unregulated division and proliferation, forming an abnormal mass called a tumor. Tumors can be classified as either malignant or benign (1). Metastasis, the process by which primary tumor cells disseminate and form secondary tumors in distant tissues, is a major concern across many types of cancer (2). Cancer cells in the breast can invade the lymphatic vessels and begin to grow within the lymph nodes (3). Lymph node evaluation is therefore essential, as the condition of the axillary lymph nodes has the greatest influence on cancer recurrence and survival (4). Furthermore, axillary lymph node metastasis is a critical determinant of treatment decisions and prognosis (5). The literature indicates that patients with breast cancer who have lymph node metastasis have a 40% lower 5-year overall survival rate than those without lymph node metastasis. Consequently, precise evaluation of lymph node status is critical for the prognosis and treatment of patients with breast cancer (6,7).
This article focuses solely on breast cancer, which is widely recognized as a leading health concern for women globally. Breast cancer is a leading cause of mortality among women in both developed and less developed nations, so the timely identification and diagnosis of this disease should be prioritized (8). According to estimates from the World Health Organization (WHO), cancer accounts for approximately one in six deaths globally. In 2018, there were approximately two million new breast cancer cases, making it the most prevalent cancer among women and the second most prevalent cancer globally, following lung cancer (9).
Breast cancer constitutes over 24% of all cancer cases in Iran, with a prevalence ranging from 24.8 to 34 per 100,000 women. In 2018, the mortality rate for breast cancer in Iran was less than 10.2 per 100,000 women. Invasive ductal carcinoma is the most common form of breast cancer in Iran (10). The standard treatment for breast cancer typically involves a multimodal approach consisting of surgical intervention, radiation therapy, and pharmacological interventions such as hormonal therapy, chemotherapy, and targeted biological therapy (11). In recent years, considerable attention has been paid to statistical models that classify medical data by disease and associated outcome (12). Combined with data mining techniques, machine learning algorithms have yielded significant advances in medical research, particularly in predicting and diagnosing breast cancer at an early stage (13,14). Traditional regression techniques often require specific assumptions to be met before the analysis can be carried out. In recent decades, alternative methods such as decision trees and random forests have been adopted increasingly in medical research, because they can overcome the limitations of classical statistical models and the difficulty of interpreting their results (15).
The random forest algorithm is a widely used machine learning technique built from a large number of decision trees; the collection of trees constitutes the forest (16). It offers a potential solution to the overfitting commonly encountered with single decision trees, resulting in improved accuracy (17). For each tree, the algorithm samples the dataset with replacement, drawing a sample equal in size to the original data. Approximately one-third of the observations are left out of any given sample, and these out-of-bag observations are used to evaluate the algorithm's performance. Randomization is also applied to the variables. Each repetition of this process generates one decision tree, and repeating it many times, for example 400 times, creates a random forest (18).
The random forest algorithm improves on a single decision tree by drawing N bootstrap samples from the dataset. Each sample is drawn with replacement and is the same size as the original dataset, so approximately one-third of the observations are left out of it and can be used to evaluate and validate the algorithm. A decision tree is grown on each sample, and repeating the process many times, for instance 400 times, creates a random forest. While a tree is being constructed, a random sample of m variables is chosen at every split, and the split uses one of these m variables rather than being selected from all variables. This prevents the upper levels of every tree from consistently using the same dominant variable, which leads to improved outcomes (18,19).
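The two sources of randomness described above can be illustrated with a short sketch. It is written in Python using only the standard library, purely for illustration; the study itself used the randomForest package in R. The function names, the seed, and the short variable list are assumptions made for this example, and 223 is the number of analysed cases reported in the Results.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement: one bootstrap sample."""
    return [rng.randrange(n) for _ in range(n)]


def out_of_bag(n, in_bag):
    """Indices of rows that never entered the bootstrap sample."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]


def candidate_variables(variables, mtry, rng):
    """Random subset of mtry variables offered to a single split."""
    return rng.sample(variables, mtry)


rng = random.Random(0)        # fixed seed so the sketch is repeatable
n_rows, n_trees = 223, 400    # analysed cases and example tree count

oob_fractions = []
for _ in range(n_trees):
    in_bag = bootstrap_indices(n_rows, rng)
    oob_fractions.append(len(out_of_bag(n_rows, in_bag)) / n_rows)

mean_oob = sum(oob_fractions) / n_trees
# The expected out-of-bag share is (1 - 1/n)^n, which approaches 1/e (about 0.37).
print(f"mean out-of-bag fraction: {mean_oob:.3f}")

# Illustrative subset of the study's variables; a fresh subset is drawn per split.
variables = ["age", "grade", "tumor size", "tubule formation", "mitosis", "Ki67"]
print(candidate_variables(variables, 3, rng))
```

The out-of-bag share printed by the sketch is what the text rounds to "approximately one-third".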
In the random forest algorithm, randomization is applied to variables and observations, enhancing its robustness against noise and overfitting. The random forest algorithm is employed to develop a reduced-complexity model while maintaining effective diagnostic performance for detecting lymph node metastasis in patients with breast cancer (20).

Methods
The present retrospective cohort study (21) used data obtained from telephone interviews and the medical records of 241 patients with breast cancer referred to hospitals affiliated with Mazandaran University of Medical Sciences between 2016 and 2022. A checklist validated by thoracic surgeons and breast cancer oncologists was used to collect the data. The relevant specialists reviewed the checklist, and their suggestions were incorporated to correct flaws and improve its quality, so that it would serve the objectives of the study. The variables in the checklist were age, marital status, estrogen receptor (ER), progesterone receptor (PR), tumor size, tumor location, stage, P53, Ki67, HER2, skin involvement, margin, DCIS (ductal carcinoma in situ), grade, tubule formation, nuclear pleomorphism, mitosis, and lymph node metastasis (the response variable). The minimum sample size for this investigation was 200 (five to ten times the number of variables) (22). Because bootstrap resampling is intrinsic to random forest, the final sample size was considered adequate for the technique (10). Random forest is a nonparametric ensemble learning method built on classification and regression trees; it was introduced by Breiman (2001) (23) in the context of machine learning.
In the random forest algorithm, randomization is applied to variables and observations, enhancing its resistance to noise and overfitting (24). Random forest also performs well in selecting important variables, for which two indicators are used: mean decrease accuracy (MDA) and mean decrease Gini (MDG) (25,26).
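As a minimal sketch of the quantity behind MDG, the following Python snippet computes the Gini impurity of a node and the impurity decrease produced by one split. MDG accumulates such decreases over every split that uses a given variable and averages them over the trees. The function names and the toy labels (1 for metastasis, 0 for no metastasis) are assumptions for illustration and are not taken from the study data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Toy node with four cases of each class.
parent = [1, 1, 1, 1, 0, 0, 0, 0]

# An imperfect split leaves one misplaced case on each side.
print(gini_decrease(parent, [1, 1, 1, 0], [1, 0, 0, 0]))  # 0.125

# A perfect split removes all impurity from the node.
print(gini_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0]))  # 0.5
```

A variable that repeatedly produces large decreases of this kind ends up with a high MDG.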
The analysis was carried out in R. The random forest model was fitted with the randomForest function from the randomForest package, and missing data were imputed with the rfImpute function from the same package. The rfImpute function first fills each missing value with a rough initial value (the median for numeric variables or the most frequent category for categorical variables) and then repeatedly fits a random forest to the completed data, updating the imputed values after each fit. Typically, 5-6 iterations of random forest fitting are enough for the imputed values to stabilize. An essential requirement of this function is that the response variable contains no missing data (27). This study used random forest techniques to identify the factors influencing metastasis to lymph nodes, eliminating irrelevant variables on the basis of the MDA and MDG indicators.
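The starting point of the imputation can be sketched in a few lines. This Python snippet covers only the initial rough fill (median for numeric columns, most frequent value for categorical ones); the later iterative refinement with fitted forests is not reproduced here. The function name, the use of None for a missing value, and the toy columns are assumptions for illustration.

```python
from statistics import median, mode


def rough_fill(column):
    """Replace None with the median (numeric) or most frequent value (categorical)."""
    observed = [value for value in column if value is not None]
    if all(isinstance(value, (int, float)) for value in observed):
        fill = median(observed)
    else:
        fill = mode(observed)
    return [fill if value is None else value for value in column]


# Hypothetical columns, not taken from the study data.
ages = [50, None, 60, 40]
er_status = ["positive", None, "negative", "positive"]

print(rough_fill(ages))
print(rough_fill(er_status))
```

In the R workflow described above, these initial values would then be revised over several rounds of random forest fitting.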

Results
The data for this study were obtained through a comprehensive review of the medical records of 241 female patients diagnosed with breast cancer. The patients were selected from hospitals in Sari that are affiliated with Mazandaran University of Medical Sciences. As depicted in Figure 1, of the 241 cases examined, 18 were deemed unsuitable for the study because of missing essential information, significant data gaps, and similar problems. The remaining 223 cases, approximately 93% of those reviewed, were assessed. The mean age at diagnosis was 52.03 ± 10.932 years. Of the 223 patients, 111 had lymph node metastasis, while the remaining 112 had no evidence of metastasis in their lymph nodes. The comparison of the mean age at diagnosis between the groups with and without lymph node metastasis was not statistically significant at the 95% confidence level (P = 0.195), suggesting no discernible difference in mean age between the two groups. The descriptive information of the variables is shown in Table 1. The relationship between each variable and the response variable was assessed with the chi-square test for qualitative variables and the t-test for quantitative variables. The P-value for each test is provided in Table 1.
To fit the random forest model well, suitable values must be chosen for its parameters, in particular mtry and ntree. The mtry argument of the randomForest function sets the number of variables m that are randomly selected, from the pool of M variables (M > m), as candidates at each split. The ntree argument specifies the number of trees that make up the random forest. To determine the best combination of mtry and ntree, the algorithm was run with varying numbers of variables (features) and trees, and the combination that minimized the out-of-bag (OOB) error was chosen. Breiman demonstrated, using the law of large numbers, that the error converges as the number of decision trees increases (28). Figure 2 shows that the OOB error decreases as the number of trees grows and levels off after approximately 400 trees; the plateau is clearly established by 500 trees. Running the random forest with 500 trees is therefore adequate for these data. To determine the optimal number of variables, the random forest was then run with 500 trees for different values of mtry. Table 2 shows that using four variables yields the lowest OOB error.
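The selection rule can be sketched as follows, in Python for illustration only. The first function reproduces the default that R's randomForest uses for classification when mtry is not supplied, the floor of the square root of the number of predictors; with the 17 predictors listed in the Methods this also gives 4. The second function picks the setting with the lowest OOB error. The function names are assumptions, and the error values are made-up placeholders, not the values in Table 2.

```python
import math


def default_mtry_classification(n_predictors):
    """Default mtry for classification forests: floor(sqrt(number of predictors))."""
    return math.isqrt(n_predictors)


def pick_lowest_oob(oob_error_by_setting):
    """Return the setting (for example an mtry value) with the smallest OOB error."""
    return min(oob_error_by_setting, key=oob_error_by_setting.get)


print(default_mtry_classification(17))  # 4

# Placeholder OOB errors keyed by mtry; NOT results from the study.
placeholder_errors = {2: 0.30, 4: 0.28, 6: 0.29}
print(pick_lowest_oob(placeholder_errors))  # 4
```

The same rule applies to ntree, although in practice the number of trees is usually fixed once the OOB error curve has flattened.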
Finally, the values of 500 for the number of trees and 4 for the number of variables were selected, and the random forest was fitted to the data. Table 3 indicates that this random forest achieves an accuracy of 70%. One notable strength of the random forest algorithm is its ability to identify important variables (24). The variables are compared using two criteria, MDG and MDA: a lower value of MDG or MDA corresponds to lower importance, and a higher value indicates higher importance for that variable (29). Figure 3 shows that the variables of marital status and DCIS exhibit the lowest MDG and MDA values, respectively.
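The idea behind MDA can be sketched with a permutation test: shuffle one variable, re-score the model, and record how much accuracy is lost. In a random forest this is done per tree on its out-of-bag observations and then averaged; the Python sketch below shows only the core step on a single fixed classifier. The toy rule, the rows (grade, marital status code), and the labels are hypothetical and are not the study's model or data.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the prediction matches the label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)


def permutation_accuracy_drop(predict, rows, labels, column, rng):
    """Accuracy lost when the values of one column are shuffled across rows."""
    baseline = accuracy(predict, rows, labels)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted = [row[:column] + (value,) + row[column + 1:]
                for row, value in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)


def toy_rule(row):
    """Hypothetical classifier: predicts metastasis only from grade (column 0)."""
    return 1 if row[0] >= 3 else 0


rows = [(1, 0), (2, 1), (3, 0), (3, 1), (1, 1), (2, 0), (3, 1), (3, 0)]
labels = [0, 0, 1, 1, 0, 0, 1, 1]
rng = random.Random(0)

print(permutation_accuracy_drop(toy_rule, rows, labels, 0, rng))  # grade
print(permutation_accuracy_drop(toy_rule, rows, labels, 1, rng))  # marital status
```

Because the toy rule never looks at the second column, shuffling it cannot change the accuracy, which is the behavior of a variable with an MDA near zero.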
It was anticipated that accuracy could be improved by removing the two variables with the lowest importance, marital status and DCIS, and re-fitting the random forest on the reduced data set. To determine whether the model is predictive, 70% of the data were designated as the training set and the remaining 30% as the test set, and the random forest was fitted without these two variables. The confusion matrix for the training and test sets is presented in Table 4, and the accuracy, specificity, and sensitivity of each random forest model are detailed in Table 5.
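The split and the three reported measures can be sketched as follows, in Python for illustration. The function names and the seed are assumptions, and the confusion matrix counts are made-up placeholders rather than the counts in Table 4; only the 223 cases and the 70/30 proportion come from the text.

```python
import random


def split_indices(n, train_fraction, rng):
    """Shuffle row indices and cut them into training and test parts."""
    indices = list(range(n))
    rng.shuffle(indices)
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]


def evaluate(tp, fn, fp, tn):
    """Accuracy, sensitivity, and specificity from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # metastatic cases correctly identified
        "specificity": tn / (tn + fp),   # non-metastatic cases correctly identified
    }


train, test = split_indices(223, 0.7, random.Random(0))
print(len(train), len(test))  # 156 67

# Placeholder counts; NOT taken from the study's tables.
print(evaluate(tp=40, fn=10, fp=15, tn=35))
```

Reporting sensitivity and specificity alongside accuracy shows whether errors fall mainly on patients with or without metastasis.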
Table 5 demonstrates that the model has adequate predictive potential. The random forest was therefore re-fitted on the entire data set after removing the two variables of marital status and ductal carcinoma in situ (DCIS). To compare the random forest fitted with the reduced set of variables against the one fitted with all variables, the accuracy index was computed from the confusion matrices of the two models (Table 6). The higher accuracy of the random forest with reduced variables (72.2%), shown in Table 7 and Figure 4, indicates that this version of the random forest is more accurate than the one that includes all variables. For ranking the influential factors, MDA is considered superior to and more consistent than MDG (30).
As depicted in Figure 6, the RF method reveals that the grade variable holds the highest MDA value (7.06), indicating that it is the most important variable, followed by tubule formation (5.45), skin involvement (2.60), p53 (1.95), and other influencing variables. According to the MDG, the age variable exhibited the highest mean decrease (4.5) and is therefore considered the most important. It is followed by grade (2.67), tubule formation (2.38), tumor size (2.19), nuclear pleomorphism (1.31), and other influential variables.

Discussion
Factors that accurately predict a patient's treatment response or disease progression are of great importance in disease treatment studies. With them, doctors can prescribe treatments with more favorable effects and greater flexibility across various disorders, and disease progression can be halted by managing modifiable risk factors. Classical statistical analysis is frequently used to identify such risk factors, but its use may be restricted by incomplete data or an insufficient sample size. Machine learning-based techniques are one newer approach to these issues. This study identified the factors influencing breast cancer metastasis to lymph nodes by fitting the best random forest model to the data. The accuracy of the fitted forest after variable reduction was 72.2%.
The factors influencing lymph node metastasis in breast cancer were identified from two indexes, MDA and MDG.
According to MDA, the ten most influential factors are grade, tubule formation, skin involvement, p53, margin involvement, nuclear pleomorphism, Ki67, tumor location, ER, and PR. According to MDG, the ten most influential factors are age, grade, tubule formation, tumor size, nuclear pleomorphism, level of disease, mitosis, skin involvement, tumor location, and margin involvement. The factors identified in this article are also supported by the results of other studies, which are summarized here. Kang Jiang et al. applied machine learning and Shapley algorithms to a cohort of 1,405 breast cancer patients. Tumor size, age, HER2 marker, ER marker, and PR marker were identified as significant factors influencing breast cancer metastasis to lymph nodes. According to the findings of the present study, as indicated by the MDG index, five specific factors have been identified as influential in lymph node metastasis in breast cancer (31). Purushotham et al. examined 100 breast cancer patients in a study conducted in 2021. They found a significant correlation between tumor size, grade, and stage and the occurrence of metastasis to lymph nodes; specifically, an increase in these three factors was associated with an elevated likelihood of lymph node metastasis (32). In 2021, Hermansyah et al. conducted a cross-sectional study analyzing data from 51 medical records of breast cancer patients. The chi-square test showed a significant relationship between grade and the occurrence of metastasis to lymph nodes. In the present study, grade is likewise identified as one of the ten factors influencing breast cancer metastasis to lymph nodes (33). In their study, Li et al. [...]
Consequently, it is plausible that the factors that affect the survival of breast cancer patients also influence the occurrence of lymph node metastasis in these individuals (37). According to Shahrbanu Keyhanian et al., breast cancer is the predominant cancer among women and a significant cause of cancer-related mortality globally. Their study identified tumor size and type, histological grade, and the status of estrogen and progesterone receptors as significant determinants of lymph node involvement. It also found no significant correlation between age, or the combined status of estrogen and progesterone receptors, and lymph node involvement. The present study has identified these factors as influential factors in breast cancer (38).
In their study, Dolatkhahi et al. examined the medical records of 5,208 patients at the Cancer Research Center of Shahid Beheshti University of Medical Sciences and Health Services. The researchers used decision trees, random forests, and support vector machines as machine learning techniques. The random forest method achieved the best performance, with an accuracy of 94.75% and a reliability of 97.26%, surpassing the other two methods (39). Kabir Ahmad et al. analyzed a dataset of 700 samples, which included 458 cases classified as benign and 241 as malignant. Their objective was to use random forest to classify breast cancer lesions accurately from fine needle aspiration (FNA). They found that the random forest method, with a precision of 72%, could effectively classify different types of breast cancer. This approach shows significant potential as a tool for early cancer detection, helping to differentiate between malignant and benign tumors (40). Olivotto et al. determined that several factors, including tumor size, margin involvement, tumor grade, and patient age, affect breast cancer metastasis to the lymph nodes. The current study identified these four factors as part of its list of ten factors affecting lymph node metastasis (41).

Conclusions
The random forest algorithm showed satisfactory accuracy in discriminating between the outcome categories. Given the missing data in this study, the algorithm also offered a viable way of handling missing values. Because it repeatedly resamples observations and variables when building its constituent trees, the random forest algorithm copes well with a small volume of data. As a result, it yields results that are accurate and acceptable from a clinical perspective and in line with similar studies. It is recommended that medical professionals use the random forest model developed in the present study.

Footnotes
Authors' Contribution: Study concept and design: F. Z. and J. Y.; analysis and interpretation of data: F. Z., Z. Z., J. Y., and Gh. G.; drafting of the manuscript: F. Z. and Z. Z.; critical revision of the manuscript for important intellectual content: A. F., J. Y., and Gh. G.; statistical analysis: F. Z.
Conflict of Interests: There was no conflict of interest.
Data Availability: The dataset presented in the study is available on request from the corresponding author during submission or after publication. The data are not publicly available due to ethical reasons.
Ethical Approval: This study was approved under the ethical approval code of "IR.MAZUMS.REC.1401.14995".
Funding/Support: The authors declare that they did not receive any funding.

Figure 1. Flowchart of hospital records review.

Figure 2. Out-of-bag error based on the number of trees in the random forest.

Figure 4. The accuracy of RF (full) and RF (reduced) in patients with breast cancer.

Figure 5 displays the receiver operating characteristic (ROC) curves for the full and reduced variable random forests. The areas under the curve for the two models are 0.76 and 0.75, respectively. This indicates that the reduced random forest model performs almost as well as the full-variable random forest model.

Figure 5. Receiver operating characteristic for full random forest and reduced random forest.

Figure 6. Determining the importance of variables based on mean decrease Gini and mean decrease accuracy.

Table 1. Descriptive Information of the Variables and Chi-Square Test Between the Response Variable and Each Other Variable in Patients with Breast Cancer
b t-test. c Values are presented as No. (%).

Table 2. The Results of Optimal Selection of the Number of Variables Argument (mtry) of the Random Forest Model Based on the Out-of-Bag Error Value in Patients with Breast Cancer

Table 3. The Results of Evaluating the Random Forest Classifier Model in Patients with Breast Cancer

Table 4. Joint Confusion Matrix for Random Forest Model on Reduced Training and Testing Datasets in Patients with Breast Cancer

Table 5. Random Forest Model Evaluation Results on Two Reduced Training and Testing Data Sets of Breast Cancer Patients

Table 6. Combined Confusion Matrix for RF (Full) and RF (Reduced) in Patients with Breast Cancer

Table 7. The Results of the Evaluation of Two Classification Models, RF (Full) and RF (Reduced), in Patients with Breast Cancer

Figure 3. The significance of the factors influencing lymph node metastasis based on mean decrease accuracy and mean decrease Gini in patients with breast cancer.
When it is not possible to assess the status of PR, ER, and HER2 in the primary tumor, evaluating the status of lymph node metastasis can be an alternative method. In the present article, the two markers ER and PR have been identified as two of the ten factors influencing breast cancer metastasis to lymph nodes (35). Chand et al. examined 50 cases of patients diagnosed with breast cancer. Their findings indicate a significant association between tumor location and the occurrence of metastasis to lymph nodes. In the current study, tumor location has likewise been identified as one of the ten influential factors in the metastasis of breast cancer to lymph nodes (36). In a cohort study conducted in 2023, Zahra Zarean Shahraki et al. used the random forest algorithm to analyze a sample of 3,580 female patients diagnosed with breast cancer. The study identified tumor status, age at diagnosis, lymph node status, type of surgery, tumor stage, and duration of breastfeeding as the most influential variables for predicting the probability of breast cancer survival. Based on the present study's findings, age and tumor stage have been identified as factors that affect the metastasis of breast cancer to lymph nodes.