1. Background
The management of the perioperative period represents a critical component of healthcare, involving the coordinated care of patients across the preoperative, intraoperative, and postoperative phases. Well-structured management in this setting can reduce complications, decrease length of stay, and lower the risks of readmission and mortality (1, 2). Anesthesia technique plays a pivotal role in postoperative outcomes, including pain, nausea and vomiting, cognitive recovery, respiratory function, and length of stay. These factors guide clinicians in selecting protocols that promote optimal recovery (3). Recent advancements, including target-controlled infusion, ultrasound-guided regional anesthesia, and closed-loop anesthesia systems, have improved anesthetic safety and efficacy. However, many of these innovations are still under development and not widely adopted in routine practice (4, 5). Incorporating these innovations into practice, supported by sustained research and technological adoption, is essential to further improving patient outcomes (6).
As healthcare systems become increasingly digitized, large clinical datasets have emerged, enabling new approaches to perioperative research. Over the past decade, artificial intelligence (AI) and machine learning (ML) have garnered increasing interest in the medical field. In perioperative care, ML can generate predictive models and early warning systems, support the identification of critical illness, and improve the management of high-risk patients (7-10). Unlike traditional statistical methods, AI can learn from data, refine its performance over time, and develop models tailored to individual patients (11). The intensive care unit (ICU), due to continuous patient monitoring, generates large amounts of clinical data. Publicly available datasets, particularly the MIMIC series, have enabled extensive research in critical care. MIMIC-I included over 90 patients (12), followed by MIMIC-II, which featured larger cohorts and digital integration (13), and MIMIC-III, which expanded to more than 40,000 patients (14). The MIMIC-IV dataset, which covers ICU admissions from 2008 to 2019, provides precise digital information, including electronic medication records, and features a modular structure that facilitates efficient use in clinical research (15).
Despite extensive interest in anesthesia-outcome relationships, most prior studies have been limited to specific procedures or small cohorts, and few have simultaneously assessed both predictive feasibility and real-world outcome associations on a large scale.
2. Objectives
This study aims to evaluate whether routinely collected perioperative clinical features can accurately predict anesthesia type and to investigate the association between anesthesia modality and short-term clinical outcomes using the MIMIC-IV dataset and modern ML techniques.
3. Methods
3.1. MIMIC-IV Database
This study utilized version 3.1 of the MIMIC-IV database, a comprehensive, structured, de-identified dataset created by the MIT Laboratory for Computational Physiology and made available to credentialed researchers via the PhysioNet platform. The database contains extensive information about patients treated in the emergency department and ICU at Beth Israel Deaconess Medical Center in Boston, Massachusetts. Version 3.1 includes 364,627 unique patients, 546,028 hospital admissions, and 94,458 ICU stays. It is divided into two main modules: HOSP, which contains hospital-level data such as admissions, diagnoses, laboratory tests, medications, and administrative or billing information; and ICU, which contains continuous ICU data such as vital signs, clinical events, administered medications, and caregiver interventions. Access is restricted to researchers who have completed CITI training on "Data or Specimens Only Research" and signed the data use agreement. Because the data are de-identified, this study did not require separate approval from an Institutional Review Board. MIMIC-IV is now one of the most important resources for research in ML, clinical risk analysis, outcome prediction, and improving medical processes (15).
3.2. Collecting Data and Choosing Features
After registration and credentialing on the PhysioNet platform, data analysis was performed in the Jupyter Notebook environment using Python version 3.11 with the Pandas, NumPy, and scikit-learn libraries. The unique patient identifier (subject_id) was used to assemble patient-level information. Features were selected through a combination of domain expertise and statistical filtering. Initially, 32 candidate features were chosen based on their clinical relevance and recommendations from anesthesiology experts. These included demographic variables (such as sex and anchor age), admission information (such as admission type, insurance, in-hospital death, and 30-day readmission), comorbidities (such as diabetes, hypertension, CAD, cancer, CKD, COPD, sepsis, and infections, coded as binary variables), and procedural information (such as the type of surgery and anesthesia). Laboratory and physiological measures, including hemoglobin, platelet count, creatinine, blood urea nitrogen (BUN), sodium, potassium, INR, diastolic blood pressure, GCS score, and weight, were averaged to create representative patient-level summaries. Records containing incomplete or implausible values were omitted. A univariate analysis using ANOVA was then performed, and variables with P < 0.05 were considered statistically significant. A total of 28 final features were retained for modeling.
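The univariate ANOVA filter described above can be sketched as follows. This is a minimal illustration, not the study's code: it assumes scikit-learn's `f_classif` as the ANOVA implementation, and the helper name `anova_filter`, the toy column names, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif


def anova_filter(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05) -> list:
    """Return the columns of X whose one-way ANOVA p-value against y is below alpha."""
    _, p_values = f_classif(X, y)  # one F-test per feature, grouped by class label
    return [col for col, p in zip(X.columns, p_values) if p < alpha]


# Synthetic toy data (hypothetical): hemoglobin differs by group, noise does not.
rng = np.random.default_rng(0)
y = pd.Series(np.repeat(["general", "sedation"], 100))
X = pd.DataFrame({
    "hemoglobin": np.where(y == "general", 13.0, 11.0) + rng.normal(0, 0.5, 200),
    "noise": rng.normal(0, 1, 200),
})

selected = anova_filter(X, y)
print(selected)
```

A filter of this kind evaluates each feature in isolation, so correlated or interacting features are not accounted for; that is a property of univariate screening in general.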
3.3. Target Variable and Outcomes
The primary target variable of this study was the type of anesthesia, a multiclass categorical variable with four groups: general, regional, local, and sedation. In addition, a series of secondary outcomes was included to allow a more thorough assessment of anesthesia-related effects. These outcomes consisted of duration of hospital stay, duration of ICU stay, in-hospital mortality, 30-day readmission, and incidence of infection. The lengths of hospital and ICU stays were divided into three groups: short (≤ 24 hours), intermediate (> 24 to 72 hours), and long (> 72 hours). In-hospital mortality, 30-day readmission, and postoperative or in-hospital infection were each treated as binary outcomes. This multidimensional outcome design allowed a more comprehensive examination of the association between anesthesia type and clinical outcomes.
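The three length-of-stay categories can be derived from a duration in hours roughly as shown below. This is a sketch under assumptions: the function name `categorize_stay` is hypothetical, and the boundary handling (exactly 24 hours counted as short, exactly 72 hours counted as intermediate) is one reading of the thresholds above.

```python
import pandas as pd


def categorize_stay(hours: pd.Series) -> pd.Series:
    """Bin a stay duration (hours) into short (<= 24), intermediate (> 24 to 72), long (> 72)."""
    return pd.cut(
        hours,
        bins=[0, 24, 72, float("inf")],
        labels=["short", "intermediate", "long"],
        right=True,           # intervals are closed on the right: (24, 72]
        include_lowest=True,  # a stay of 0 hours falls in the first bin
    )


stays = pd.Series([0, 24, 48, 72, 100])
print(list(categorize_stay(stays)))
# ['short', 'short', 'intermediate', 'intermediate', 'long']
```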
3.4. Data Preprocessing
LabelEncoder converted categorical variables, such as sex, admission type, and surgery type, into numerical values for modeling. Numerical variables were normalized with MinMaxScaler to reduce the effect of differences in scale. For tree-based models such as extreme gradient boosting (XGBoost) and random forest, this scaling does not affect performance; it was applied for consistency and compatibility with the other models. Temporal data, such as laboratory results and vital signs, were averaged per patient for simplicity and consistency. SimpleImputer filled in missing values using column means, and the Z-score method was used to identify and remove outliers. The final dataset was divided into a training set and a testing set with an 80/20 split, using a fixed random state for reproducibility. To address class imbalance in anesthesia type, two methods were used: the synthetic minority oversampling technique (SMOTE), which adds synthetic examples to minority classes, and class weighting (class_weight = 'balanced') for models that support it, which makes them more sensitive to underrepresented categories.
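The preprocessing steps above can be sketched with scikit-learn alone. Several details here are assumptions rather than the study's settings: the Z-score cutoff of 3, the stratified split, fitting the imputer and scaler on the training portion only, and the synthetic data. SMOTE is omitted because it lives in a separate package; class weighting stands in for the imbalance handling.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def drop_outliers(X, y, z_max=3.0):
    """Keep rows whose observed features all lie within z_max SDs of the column mean."""
    z = np.abs((X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0))
    keep = ~(z > z_max).any(axis=1)  # NaN entries compare False, so they never trigger removal
    return X[keep], y[keep]


# Synthetic stand-in for the feature matrix (hypothetical), with 5% missing values.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.05] = np.nan
y = rng.choice(["general", "sedation", "local"], size=500, p=[0.8, 0.12, 0.08])

X_clean, y_clean = drop_outliers(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_clean, y_clean, test_size=0.2, stratify=y_clean, random_state=42
)

model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # column means learned from training rows
    ("scale", MinMaxScaler()),                   # rescale each feature to [0, 1]
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Wrapping the imputer and scaler in a `Pipeline` means their statistics are learned from the training rows and then reused on the test rows, which keeps test-set information out of the fitted transformations.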
3.5. Model Implementation
All modeling procedures were executed in scikit-learn using Jupyter Notebook. The primary goal was to determine the type of anesthesia to use based on demographic, clinical, and surgical factors, and to examine the relationship between these factors and secondary outcomes. The models used were random forest, logistic regression, support vector machine (SVM), k-nearest neighbors (KNN), decision tree, gradient boosting, and XGBoost, among others. Multi-output models, such as MultiOutputClassifier and MultiOutputRegressor, were used when multiple outcomes needed to be predicted simultaneously.
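As an illustration of the multi-output setup mentioned above, the sketch below wraps a single classifier so that it predicts two binary outcomes at once. The base estimator, the two outcome columns, and the synthetic data are assumptions chosen for illustration; the text does not specify which estimator was wrapped.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic features and two binary outcomes (hypothetical stand-ins for,
# e.g., in-hospital mortality and 30-day readmission).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
Y = np.column_stack([
    (X[:, 0] + rng.normal(0, 0.5, 300) > 0).astype(int),
    (X[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int),
])

# MultiOutputClassifier fits one independent copy of the base model per outcome column.
multi = MultiOutputClassifier(RandomForestClassifier(n_estimators=50, random_state=0))
multi.fit(X, Y)

pred = multi.predict(X[:5])  # one row per patient, one column per outcome
print(pred.shape)
```

Because each outcome gets its own fitted estimator, this wrapper does not model dependence between outcomes; it is a convenience for training and predicting several targets through one interface.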
3.6. Model Evaluation
We evaluated the models on the held-out test set. Standard multiclass evaluation metrics (overall accuracy, precision, recall, and F1 score) were used to assess prediction of anesthesia type as a four-class categorical variable. All metrics were computed with functions from the scikit-learn library. The models with the highest test-set performance were reported as the best-performing models.
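The metrics named above, together with Cohen's κ, can be computed with scikit-learn as in the sketch below. The helper name `summarize` and the ten toy labels are hypothetical. The toy example is deliberately imbalanced to show why macro-averaged scores are worth reporting alongside accuracy: a classifier that always predicts the majority class reaches 80% accuracy here while its macro-F1 and κ stay low.

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    precision_recall_fscore_support,
)


def summarize(y_true, y_pred):
    """Return accuracy, macro and weighted precision/recall/F1, and Cohen's kappa."""
    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    _, _, weighted_f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": macro_p,
        "macro_recall": macro_r,
        "macro_f1": macro_f1,
        "weighted_f1": weighted_f1,
        "kappa": cohen_kappa_score(y_true, y_pred),
    }


# Toy labels (hypothetical): 8 majority-class cases, 2 minority-class cases,
# and a classifier that predicts the majority class every time.
y_true = ["general"] * 8 + ["regional"] * 2
y_pred = ["general"] * 10
report = summarize(y_true, y_pred)
print(report)
```

In this toy case accuracy is 0.80, macro-F1 is about 0.44, and κ is 0, since the predictions carry no information beyond the class frequencies.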
4. Results
4.1. Baseline Characteristics
After data preprocessing and outlier removal, a total of 31,821 patients were included in the study. These patients were divided into four groups based on the type of anesthesia. The general anesthesia group was the largest (24,545 patients), followed by the sedation (4,159), local (2,990), and regional (127) anesthesia groups. The mean age of the patients was approximately 62.4 years. In terms of gender, 52.1% of the subjects were male and 47.9% were female.
Baseline characteristics differed significantly between the anesthesia groups. Patients in the regional anesthesia group were older on average than those in the general anesthesia group (mean age 65.7 vs. 62.4 years; P < 0.001). The prevalence of hypertension was also higher in the regional anesthesia group (47.3% vs. 39.8%; P < 0.001), while no significant difference was observed for diabetes (P = 0.21). The prevalence of CAD differed modestly but significantly across groups (P = 0.008). Most laboratory parameters showed statistically significant between-group differences, although absolute differences were small.
Table 1 shows the baseline characteristics (demographic, clinical, and laboratory) by type of anesthesia performed. It reports the mean age, gender, comorbidities (including diabetes, hypertension, coronary artery disease, kidney failure, and cancer), laboratory parameters (hemoglobin, platelets, and creatinine), infection history, and lengths of stay for each anesthesia group. Significant differences in demographic and clinical characteristics can affect clinical outcomes.
| Features | General Anesthesia (N = 24,545) | Regional Anesthesia (N = 127) | Local Anesthesia (N = 2,990) | Sedation (N = 4,159) |
|---|---|---|---|---|
| Age | 62.4 ± 14.8 | 65.7 ± 13.1 | 62.1 ± 15.0 | 61.7 ± 13.5 |
| Gender (male) | 52.1 | 53.0 | 54.5 | 50.5 |
| Diabetes | 23.4 | 25.6 | 20.2 | 21.8 |
| Hypertension | 39.8 | 47.3 | 42.2 | 39.3 |
| Coronary artery disease | 15.2 | 16.1 | 14.3 | 12.8 |
| Kidney failure | 12.5 | 13.3 | 11.1 | 10.0 |
| Asthma and COPD | 10.8 | 13.2 | 9.5 | 8.7 |
| Cancer | 6.2 | 7.8 | 5.9 | 6.0 |
| Sepsis history | 7.1 | 8.3 | 6.7 | 6.1 |
| Hemoglobin (g/dL) | 13.2 ± 1.8 | 13.5 ± 1.9 | 13.3 ± 1.7 | 13.0 ± 1.8 |
| Platelets (× 10³/µL) | 250 ± 35 | 270 ± 40 | 245 ± 30 | 235 ± 38 |
| Creatinine (mg/dL) | 1.1 ± 0.3 | 1.0 ± 0.2 | 1.1 ± 0.3 | 1.2 ± 0.3 |
| Infection history | 9.1 | 4.6 | 6.1 | 5.5 |
| Hospital length of stay (d) | 2.6 ± 4.9 | 1.8 ± 3.7 | 2.4 ± 4.2 | 2.3 ± 3.5 |
| ICU length of stay (d) | 2.1 ± 4.2 | 1.4 ± 3.5 | 2.0 ± 3.7 | 2.2 ± 3.1 |
Abbreviation: ICU, intensive care unit.
Values are expressed as mean ± SD or percentage.
4.2. Performance of Machine Learning Models (Synthetic Minority Oversampling Technique-Balanced Data)
Using the 28 selected features, seven ML models were compared on the held-out test set after applying SMOTE to address class imbalance across anesthesia types. Gradient boosting and XGBoost achieved the highest overall accuracy (83.1% each), with random forest close behind (82.6%). At the class level, gradient boosting yielded the best F1 score for general anesthesia, whereas XGBoost outperformed the others for sedation and local anesthesia; all models struggled with regional anesthesia because of the extremely small sample size. Overall, KNN and decision tree underperformed relative to boosting-based models.
4.3. Data Balance (Synthetic Minority Oversampling Technique) and Class Weights
SMOTE increased overall accuracy (from approximately 77.5% without resampling to approximately 83% with resampling for boosting-based models), but did not significantly improve macro-F1, and performance for the regional class remained low. We therefore report the SMOTE results as the main analysis and provide non-resampled results as a sensitivity analysis. Where supported (e.g., LR, SVM, and RF), class_weight = 'balanced' showed patterns broadly consistent with the SMOTE analysis.
4.4. Main Result
XGBoost and gradient boosting achieved the highest performance in predicting the type of anesthesia. The approximate binomial 95% CI for accuracy was 82.2 - 84.0% for XGBoost and 82.2 - 84.1% for gradient boosting. The macro-averaged F1 score for XGBoost was 0.45, reflecting the impact of severe class imbalance. Given the near-tie in accuracies, we report both boosting models as top performers and provide the macro- and weighted-average metrics in Table 2. Agreement between predictions and true labels, measured by Cohen’s κ, was 0.48 (95% CI 0.46 - 0.51) for XGBoost and 0.47 (95% CI 0.45 - 0.49) for gradient boosting on the four-class task.
| Models | Accuracy | Macro-Precision | Macro-Recall | Macro-F1 | Weighted-F1 |
|---|---|---|---|---|---|
| Gradient boosting | 0.8314 | 0.52 | 0.42 | 0.43 | 0.80 |
| XGBoost | 0.8308 | 0.60 | 0.43 | 0.45 | 0.80 |
| Random forest | 0.8259 | 0.52 | 0.38 | 0.38 | 0.78 |
| KNN | 0.7832 | 0.40 | 0.33 | 0.34 | 0.74 |
| Decision tree | 0.7361 | 0.38 | 0.39 | 0.38 | 0.74 |
| SVM | 0.6573 | 0.40 | 0.53 | 0.41 | 0.71 |
| Logistic regression | 0.5527 | 0.37 | 0.52 | 0.35 | 0.63 |
Abbreviations: XGBoost, extreme gradient boosting; KNN, k-nearest neighbors; SVM, support vector machine.
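The approximate binomial confidence intervals for accuracy reported above can be reproduced with the normal (Wald) approximation. The test-set size used here is an assumption: about 20% of the 31,821 included patients, or roughly 6,365, since the exact count is not stated. The function name `wald_ci` is hypothetical.

```python
import math


def wald_ci(accuracy: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation 95% CI for a proportion observed on n test cases."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


n_test = 6365  # ASSUMPTION: roughly 20% of 31,821 patients

xgb_low, xgb_high = wald_ci(0.8308, n_test)
gb_low, gb_high = wald_ci(0.8314, n_test)

print(f"XGBoost: {xgb_low * 100:.1f} - {xgb_high * 100:.1f}%")
print(f"Gradient boosting: {gb_low * 100:.1f} - {gb_high * 100:.1f}%")
# XGBoost: 82.2 - 84.0%
# Gradient boosting: 82.2 - 84.1%
```

Under this assumed sample size the intervals agree with the reported values and overlap almost entirely, which is consistent with treating the two boosting models as tied on accuracy.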
4.5. Clinical Interpretation
This study indicates that the choice of anesthesia is associated with differences in clinical outcomes, such as in-hospital mortality, 30-day readmission, infection, and length of hospital and ICU stay. Regional anesthesia was associated with more favorable outcomes compared with general anesthesia. In our analyses, regional anesthesia was associated with approximately 37% lower odds of in-hospital mortality (OR ≈ 0.63), shorter ICU stays, fewer infections, and lower 30-day readmission and hospital length of stay. Possible mechanisms — such as attenuation of systemic inflammatory and stress responses — are biologically plausible but remain hypotheses. Because general anesthesia is often selected for more complex or urgent procedures, residual confounding (e.g., confounding by indication) may persist, and estimates for the regional group are limited by its small sample size. Therefore, these findings should be interpreted as associations rather than causal effects, while still providing information that may help clinicians select the most appropriate anesthetic technique for patients with specific clinical conditions.
5. Discussion
In this large retrospective cohort study, which included more than 30,000 patients from the MIMIC-IV database, we observed that the type of anesthesia was associated with short-term outcomes. Compared with general anesthesia, regional anesthesia was associated with lower in-hospital mortality, decreased infection rates, shorter hospital and ICU stays, and reduced 30-day readmissions. Furthermore, ML analyses demonstrated that demographic and clinical characteristics could predict anesthesia type with high overall accuracy. XGBoost and gradient boosting achieved the highest accuracy (approximately 83%) on the held-out test set, although the macro-F1 for XGBoost was only about 0.45. Collectively, these findings highlight both the clinical implications and the predictive utility of anesthesia selection in patient stratification.
It is important to emphasize that this study is observational in design, and our primary goal was to identify patterns and associations rather than establish causality. Our findings are consistent with earlier evidence that regional anesthesia is associated with more favorable perioperative outcomes than general anesthesia, although effect sizes may vary across subgroups and outcomes. Earlier studies, including systematic reviews and meta-analyses, have compared the efficacy and patient outcomes associated with general, regional, and local anesthesia in both surgical and critical care settings. For example, recent meta-analyses found no significant difference in overall sedation or anesthesia success rates between remimazolam and propofol (16). However, remimazolam may reduce the risk of hypoxemia and injection pain, at the expense of longer awakening times (17).
Regional anesthesia techniques, such as peripheral nerve blocks, have been associated with lower early postoperative pain scores compared with general anesthesia, though differences diminish after the first 12 hours post-surgery. Moreover, regional approaches are associated with lower opioid consumption and fewer instances of nausea and vomiting immediately postoperatively (18). When comparing regional and general anesthesia for major procedures, Bayesian meta-analyses suggest that the use of dexmedetomidine as an adjunct can further improve quality of recovery (QoR) and patient-centered outcomes, including reduced agitation and faster return to baseline function (19).
More recently, analyses using large critical care databases, such as MIMIC-IV, have produced complementary associations regarding anesthetic techniques and sedative combinations. These studies demonstrate that sedatives, such as dexmedetomidine, have been associated with improved survival and fewer complications compared to midazolam or propofol in mechanically ventilated ICU patients (16, 20, 21). Other studies using the same dataset have also suggested that ketamine may offer short-term mortality benefits in critically ill patients on vasopressors, though its advantages may not persist at 90 days (22).
Combination anesthesia approaches, such as general anesthesia combined with regional blocks, have been associated with enhanced recovery markers compared with general anesthesia alone, including better postoperative pulmonary function and oxygenation, more stable hemodynamics, lower complication rates, and faster recovery of cognitive function and sleep quality. Additionally, combined general and epidural anesthesia improved pain control and psychomotor recovery after major surgery (23-25).
Additionally, new agents, such as remimazolam, are being investigated for their role in reducing postoperative nausea and vomiting, which remain important recovery outcomes in anesthesia practice (26). Because most of these analyses are observational and often conducted in single centers, residual confounding and selection bias remain possible; therefore, effect sizes should be interpreted cautiously.
What distinguishes the present study from earlier literature is its scale, breadth, and methodological approach. Unlike prior single-procedure or limited-population analyses, our work leveraged data from more than 30,000 patients undergoing diverse surgical interventions. By integrating modern ML approaches, we not only documented between-group outcome differences across anesthesia modalities but also demonstrated that patient- and procedure-level factors can predict anesthesia type with high accuracy (XGBoost and gradient boosting achieved approximately 83% accuracy on the held-out test set). These findings contribute to the growing body of evidence suggesting that anesthesia choice is not merely a procedural decision but is also associated with the patient's trajectory, and they highlight the potential role of predictive modeling in guiding perioperative care.
In contrast to randomized controlled trials that focus on narrow patient groups, our database-driven approach provides insights into real-world practice patterns and outcomes across a wide range of clinical contexts, thereby offering complementary evidence to trial-based literature. Despite its strengths, this study has notable limitations. The retrospective design inherently risks selection bias, as patients undergoing regional anesthesia may represent less complex surgical cases or distinct comorbidity profiles compared with those receiving general anesthesia. Although covariate adjustments were applied, unmeasured confounding, such as intraoperative management practices or anesthesiologist preference, remains possible. Group size imbalance further constrained the analysis, with 24,545 patients in the general anesthesia group but only 127 in the regional anesthesia group; estimates for the regional arm are therefore less precise.
Moreover, reliance on the MIMIC-IV dataset, drawn from a single U.S. tertiary hospital, restricts the generalizability of these findings to broader healthcare contexts. The limitation that databases like MIMIC-IV do not capture postoperative functional outcomes such as long-term pain, cognitive recovery, or quality of life is increasingly acknowledged in anesthesia research. Functional outcomes and QoR after anesthesia and surgery are complex, multidimensional processes that encompass physical, psychological, and social domains, which traditional datasets often fail to capture (27, 28). The QoR scales, such as the QoR-15 and QoR-40, provide patient-centered metrics but are not routinely included in large clinical databases, thereby limiting comprehensive outcome assessment (29).
Several mechanisms may explain why regional anesthesia was associated with improved outcomes. First, regional anesthesia may reduce sympathetic activation and blunt the surgical stress response, thereby contributing to more stable cardiovascular physiology (30). Second, by reducing the need for tracheal intubation and mechanical ventilation in selected procedures, regional anesthesia may lower the risk of pulmonary complications such as pneumonia and ventilator-associated lung injury (31). Third, regional approaches may attenuate systemic inflammation (e.g., lower perioperative cytokine release), potentially reducing the risk of infection and sepsis (32). In older adults, regional anesthesia has been associated with a lower incidence of postoperative delirium and cognitive dysfunction compared with general anesthesia in some studies, which may translate into more favorable recovery trajectories (33-35).
From a clinical perspective, these findings suggest that anesthetic technique should be considered not only from a procedural standpoint but also as part of individualized perioperative risk management. In frail or multimorbid patients, regional anesthesia may offer meaningful reductions in morbidity and resource utilization when feasible; conversely, general anesthesia remains essential for complex or long-duration operations where regional techniques are impractical or contraindicated. Practical considerations, including surgical site, expected duration, anticoagulation status, patient preference, and operator expertise, as well as the possibility of block failure and conversion to general anesthesia, should be incorporated into shared decision-making. Machine-learning tools, such as those evaluated here, could eventually support clinicians by identifying patients most likely to benefit from regional approaches, pending external validation, calibration, and decision-curve analysis. They should complement, not replace, clinical judgment.
Further research is needed through multicenter prospective studies to confirm these associations in broader populations. Randomized controlled trials remain crucial for establishing causality, particularly for multimodal anesthesia strategies and novel agents such as remimazolam and ciprofol. Long-term endpoints, including persistent pain, cognitive recovery, and quality of life, should be incorporated. Prospective registries should also include patient-reported outcomes to capture QoR more comprehensively. Ultimately, evaluating AI-enabled perioperative decision support in pragmatic trials could clarify its impact on patient safety, efficiency, and outcomes.
5.1. Conclusions
Our findings from a retrospective analysis of more than 30,000 patients in the MIMIC-IV database indicate that the choice of anesthesia is measurably associated with perioperative outcomes, with regional anesthesia associated with lower in-hospital mortality, reduced infection rates, and shorter hospital and ICU stays compared with general anesthesia. By incorporating ML methods, particularly the XGBoost algorithm, we also demonstrated that anesthesia type can be predicted with high overall accuracy using routine demographic and clinical features, highlighting the potential of predictive analytics to support personalized anesthesia planning. While the retrospective design, imbalance in group sizes, and lack of long-term functional outcomes limit causal inference, the consistency of associations across multiple endpoints underscores the clinical importance of anesthesia modality as more than a technical consideration and as a potential determinant of patient recovery and healthcare resource use. Future prospective multicenter studies and randomized controlled trials are needed to confirm these observations, integrate long-term and patient-reported outcomes, and evaluate emerging agents and multimodal strategies. The use of AI in perioperative decision-making may further enhance individualized and patient-centered care in anesthesia.
5.2. Limitations
This retrospective, observational study precludes causal inference; findings should be interpreted as associations. Although we adjusted for measured covariates, residual and unmeasured confounding (e.g., intraoperative management, provider preference) may remain. Procedure information was available only at a coarse level; detailed factors such as urgency, complexity, surgical approach, and ASA class were not comprehensively captured, which may influence both anesthetic selection and outcomes (confounding by indication). We used secondary data from MIMIC-IV (single U.S. tertiary center), limiting generalizability; registry data can include missingness, miscoding, and exposure misclassification (e.g., combined general-plus-regional techniques labeled as a single category). Preprocessing choices (e.g., imputation and outlier handling) may affect estimates and should be viewed as analytic assumptions. There was marked class imbalance (general ≫ regional); we addressed this only in prediction models using SMOTE within training folds to avoid leakage, but the small regional cohort still reduces the precision/stability of estimates for that group. For ICU time-to-discharge, time-to-event analyses may be affected by competing risks and time-dependent confounding. Finally, while machine-learning models performed well internally, external validation, calibration, and decision-curve analysis are needed before clinical deployment.