Employing Large Language Models for Surgical Education: An In-depth Analysis of ChatGPT-4

Background: The growing interest in artificial intelligence (AI) has spurred an increase in the availability of Large Language Models (LLMs) in surgical education. These LLMs hold the potential to augment medical curricula for future healthcare professionals, facilitating engagement in remote learning experiences, and assisting in personalised student feedback.
Objectives: To evaluate the ability of LLMs to assist junior doctors in providing advice for common ward-based surgical scenarios with increasing complexity.
Methods: Utilising an instrumental case study approach, this study explored the potential of LLMs by comparing the responses of ChatGPT-4, BingAI and BARD. LLMs were prompted by 3 common ward-based surgical scenarios and tasked with assisting junior doctors in clinical decision-making. The outputs were assessed by a panel of two senior surgeons with extensive experience in AI and education, qualitatively utilising a Likert scale on their accuracy, safety, and effectiveness to determine their viability as a synergistic tool in surgical education. A quantitative assessment of their reliability and readability was conducted using the DISCERN score and a set of reading scores, including the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index.
Results: BARD proved superior in readability, with Flesch Reading Ease Score 50.13 (± 5.00), Flesch-Kincaid Grade Level 9.33 (± 0.76), and Coleman-Liau Index 11.67 (± 0.58). ChatGPT-4 outperformed BARD and BingAI, with the highest DISCERN score of 71.7 (± 2.52). Using a Likert scale-based framework, the surgical expert panel further affirmed that the advice provided by ChatGPT-4 was suitable and safe for first-year interns and residents. A t-test showed statistical significance in reliability among all three AIs (P < 0.05) and in readability only between ChatGPT-4 and BARD. This study underscores the potential for LLM integration in surgical education, particularly ChatGPT, in the provision of reliable and accurate information.
Conclusions: This study highlighted the potential of LLMs, specifically ChatGPT-4, as a valuable educational resource for junior doctors. The findings are limited by the potential non-generalisability of the simulated junior doctor scenarios. Future work should aim to optimise learning experiences and better support surgical trainees. Particular attention should be paid to addressing the longitudinal impact of LLMs, refining AI models, validating AI content, and exploring technological amalgamations for improved outcomes.


Background
Chat Generative Pre-Trained Transformer 4 (ChatGPT-4) (OpenAI), BARD (Google), and BingAI (Microsoft) are state-of-the-art large language models (LLMs) that generate human-like language to answer questions and complete text (1). Currently, there is a growing prevalence of chatbots and artificial intelligence (AI) in daily life, from assisting with school homework to fooling researchers with phony abstracts (2). While ownership, authorship, and potential misuse continue to be debated (3,4), there has been a growing trend of artificial intelligence in medical education as a disruptive technology (5-7).
Trained on a vast corpus of medical literature, these LLMs can provide students with detailed and relevant information on any chosen subject matter (8). ChatGPT-3.5 has demonstrated performance near the passing threshold of 60% accuracy in the USMLE Steps 1, 2 CK, and 3 exams, comparable to a first-year postgraduate doctor seeking licensure as an unsupervised physician in the United States of America (USA) (5). Following the March 2023 release of ChatGPT-4, it has been claimed that the updated version of this LLM has enhanced clinical reasoning and test-answering capabilities compared to previous iterations (9,10). This has been further demonstrated with ChatGPT-4 significantly outperforming ChatGPT-3.5 on American neurosurgical written board examinations (8).
Concurrently, the COVID-19 pandemic has fundamentally reshaped medical education owing to public health concerns and stay-at-home mandates in Australia, leading to reduced face-to-face teaching and clinical exposure for medical students and junior doctors. This has triggered concerns regarding the potential long-term effects on medical training. However, the pandemic has also spurred innovations in medical education, particularly in virtual simulation and telehealth (5,10,11). While most universities have resumed face-to-face training, the growing prevalence of AI and LLMs cannot be ignored, as they have emerged as possible tools to aid informed diagnosis and make safer treatment plans. The unique ability of LLMs to process vast amounts of clinical data and current information positions them as theoretical adjuncts for medical students and junior doctors. The challenge, therefore, is ensuring that their everyday use in medical education is not at the cost of critical thinking and clinical acumen. Additionally, given the prevalence of LLMs in the post-pandemic context, it must also be considered whether they have a role in virtual simulation and remote learning.
This study aimed to assess the potential of LLMs, with a focus on ChatGPT-4, BingAI, and BARD, to aid surgical education and offer reliable advice to junior doctors (post-graduate years one or two). These LLMs are compared through comprehensive quantitative and qualitative assessment, which provides insights into the potential use of LLMs in surgical education. The limitations of AI and LLMs are also briefly explored to understand the boundaries of this technology. Ultimately, by exploring these issues in depth, this study aims to help reshape the traditional medical education curriculum.
Using a case study approach, each LLM was prompted with three routine ward-based surgical situations to aid clinical decision-making. These scenarios were formulated based on real-life clinical scenarios and textbook case studies (11,12), with a final review by expert general surgeons. Responses were qualitatively evaluated for accuracy, safety, and effectiveness using a Likert scale (13). Their validity, reliability, and readability were quantitatively assessed using the DISCERN score (14) and various readability metrics, including the Flesch Reading Ease (FRE) Score (15), Flesch-Kincaid Grade Level (FKGL) (16), and Coleman-Liau Index (CLI) (17). It is hypothesised that their outputs may exhibit disparities in reliability and readability owing to differing training data.

Study Design
To address the primary research aim, this study adopted an instrumental case study approach (18). This research method is often used to understand and gain insight into a phenomenon in a context, which, in our case, is the use of AI LLMs (ChatGPT-4, BARD, and BingAI) in medical education.

Methodology
A series of three increasingly complex clinical scenarios were posed to AI LLMs (ChatGPT-4, BARD, and BingAI). These scenarios were common ward-based surgical reviews performed by a junior doctor. They were formulated and designed from real-life clinical scenarios that the authors had encountered as ward-based junior doctors. The scenarios were then validated for accuracy and relevance with textbook analysis of case studies (11,12). These case studies were independently evaluated by a panel of two board-certified surgeons (AL and DD) with over 20 years of experience.
Responses were qualitatively evaluated for accuracy, safety, and effectiveness using a Likert scale (13). If any differences in the Likert scale or reliability tools arose, these were discussed until consensus was achieved.
The responses from each scenario then underwent qualitative analysis for accuracy, appropriateness, and patient safety. The quantitative assessment standards comprised two aspects: reliability, which was determined using the DISCERN score (14), and readability, which was evaluated through three widely recognised scoring systems: FRE score (15), FKGL (16), and CLI (17). Due to differing training data, it is hypothesised that their outputs may exhibit disparities in reliability and readability.

Data Collection Tools
The Likert scale (13) used in this study is a 5-point global scale to qualitatively evaluate the accuracy, safety and effectiveness of the three LLMs. The 5-point scale consisted of two extreme poles ('Strongly agree' and 'Strongly disagree') and a neutral option ('Neither agree nor disagree'), linked with intermediate answer options ('Agree' and 'Disagree').
The DISCERN questionnaire (14) was used to quantitatively assess the reliability of written information from the LLM outputs, and is considered a valid and reliable score for evaluating consumer health information (19). According to the literature (20), DISCERN scores may be categorised as follows: 'excellent' for scores of 63 to 75 points, 'good' for scores of 51 to 62 points, 'fair' for scores of 39 to 50 points, 'poor' for scores of 27 to 38 points, and 'very poor' for scores of 16 to 26 points.
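The banding above can be sketched as a small lookup. This is only an illustration: the function name `discern_category` is hypothetical, and because the published bands are stated for whole-number totals, the sketch assumes that a fractional mean is assigned to the band whose lower bound it has reached.

```python
# Hypothetical helper: maps a total DISCERN score to the bands quoted in the text.
# Assumption: fractional means (e.g. 62.5) fall into the band whose lower bound
# they meet or exceed, since the published bands only list whole numbers.
DISCERN_BANDS = [
    (63, "excellent"),
    (51, "good"),
    (39, "fair"),
    (27, "poor"),
    (16, "very poor"),
]


def discern_category(score: float) -> str:
    """Return the quality band for a total DISCERN score between 16 and 75."""
    if not 16 <= score <= 75:
        raise ValueError("total DISCERN scores range from 16 to 75")
    for lower_bound, label in DISCERN_BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: score already validated")


if __name__ == "__main__":
    # Mean DISCERN scores reported in this study.
    for model, mean_score in [("ChatGPT-4", 71.7), ("BingAI", 64.3), ("BARD", 56.7)]:
        print(model, discern_category(mean_score))
```

Under this assumption, the mean scores reported for ChatGPT-4 and BingAI fall in the 'excellent' band and that of BARD in the 'good' band.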
In this study, readability was assessed using three recognised scoring systems: Flesch Reading Ease Score (15), Flesch-Kincaid Grade Level (16), and Coleman-Liau Index (17). The FRE score and FKGL are both calculated using the average sentence length (i.e., number of words divided by the number of sentences) and the average syllables per word (i.e., number of syllables divided by the number of words) (21). The CLI (17) is calculated using the average number of letters per 100 words and the average number of sentences per 100 words. The FRE is scored on a 100-point scale, with higher scores indicating text that is more readable and easier to understand. In contrast, FKGL and CLI indicate the USA academic grade level (number of years of education) necessary to comprehend the written material. These readability tests were selected due to their wide and validated use in previous studies (21,22).
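For reference, the three readability scores can be computed from simple text counts as sketched below. The coefficients are the standard published ones for each formula; the function names are illustrative, and the word, sentence, syllable, and letter counts are assumed to be supplied by the caller, since the counting tool used in this study is not specified.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def coleman_liau_index(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau Index: approximate US school grade required."""
    letters_per_100_words = letters / words * 100
    sentences_per_100_words = sentences / words * 100
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


if __name__ == "__main__":
    # Made-up counts for illustration only (not data from this study).
    words, sentences, syllables, letters = 100, 5, 150, 450
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
    print(round(coleman_liau_index(letters, words, sentences), 2))
```

Taking raw counts as input keeps the sketch independent of any particular syllable-counting heuristic, which is the main source of variation between readability calculators.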

Expert Review Framework
To evaluate the AI outputs, a Likert scale-based framework was employed (Table 1). Each LLM output was assessed by a panel of expert General Surgeons using this framework.
The panel of board-certified surgeons conceptualised the research idea and both expert surgeons were recruited due to their experience in AI, research, and education. The surgeons' credentials extended beyond medical degrees to include specialised surgical training, affiliations with professional medical bodies, and leading roles at esteemed medical institutions. Their proficiency in evaluating AI outputs, understanding of its implications in medical education, and previous experience with AI research projects substantiated their selection for the panel. Experts were asked to rate the accuracy, reliability, proficiency, comprehensiveness, relevancy, general knowledge, errors, citations, and references of AI-generated responses.

Statistical Analysis
A Student's t-test (23) was conducted between the three LLMs to determine the statistical significance of differences in the reliability and readability scores. Further commentary and critique of the answers were provided by two specialist general surgeons with extensive clinical experience, who provided an expert review framework on the subject matter. Statistical analyses were then performed on the collected data to determine the AI's performance across different dimensions, with a focus on identifying areas of strength and potential improvement.
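As a rough illustration of the pairwise comparison, the sketch below computes a pooled-variance (Student's) t statistic from summary statistics, using the DISCERN means and standard deviations reported for ChatGPT-4 and BARD. It assumes three observations per model (one per scenario), which is an assumption of this sketch rather than a stated detail of the analysis, and it compares the statistic against the standard two-tailed critical value for 4 degrees of freedom instead of computing an exact P-value.

```python
import math


def pooled_t_statistic(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's two-sample t statistic (equal variances) from summary data."""
    df = n_a + n_b - 2
    pooled_variance = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Two-tailed critical value of the t distribution at alpha = 0.05 for df = 4.
T_CRITICAL_DF4 = 2.776

# Assumption: n = 3 responses per model (one per scenario).
t_stat, df = pooled_t_statistic(71.7, 2.52, 3, 56.7, 2.52, 3)
is_significant = abs(t_stat) > T_CRITICAL_DF4

if __name__ == "__main__":
    print(f"t = {t_stat:.2f}, df = {df}, significant at 0.05: {is_significant}")
```

With samples this small the test has little power and is sensitive to the equal-variance assumption, so the sketch is best read as a check on the arithmetic rather than a reproduction of the reported analysis.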

Due to the probabilistic algorithm and random-sampling method of LLMs, answers may vary slightly even if the same question is asked. For this study, three scenarios were placed into ChatGPT-4, BARD, and BingAI, and the first responses from each prompt were recorded. These three LLMs were selected as the three most readily available and widely used LLMs in medical research (24). Extreme care was taken to craft each prompt, to ensure that there were no grammatical errors or points of contention. To preserve the integrity of the original response and mirror the conditions of a junior doctor, no subsequent clarification or correction of answers was requested, and the function to regenerate answers or to alter previous responses was not utilised.
All prompts were inputted on the same day on a single ChatGPT Plus account (owned by one of the authors, IS), which provided access to ChatGPT-4. BARD and BingAI were accessible without a paid subscription. No inclusion or exclusion criteria were placed on the answers from ChatGPT-4, BARD and BingAI. Institutional ethics approval was not required as no human participants were involved in the study.

Quantitative Analysis
Through a comprehensive quantitative analysis of ChatGPT, BARD, and BingAI as shown in Table 2, significant variability was observed in the mean readability, as assessed by three standard grading scales.
Regarding the accuracy of the information, ChatGPT surpassed the others by providing medical advice closely aligned with current clinical guidelines and up-to-date references. This is evidenced by its DISCERN score of 71.7 (± 2.52), the highest of the three, compared to BingAI's 64.3 (± 2.08) and BARD's 56.7 (± 2.52). A t-test comparing all three AIs demonstrated statistical significance in the reliability tests among them, with a P-value < 0.05. Among the readability tests, only the comparison between ChatGPT and BARD was statistically significant (P < 0.05).

Qualitative Analysis
Scenario A illustrates a patient, two days post-haemorrhoidectomy, who begins to deteriorate on the ward (Appendix 1 in the Supplementary File). The guidance offered by ChatGPT-4 is to start with a focused history and physical examination, then to review investigations. This dynamic and formulaic process of history, examination, and investigation forms the crux of patient assessment and is a distinguishing characteristic of a master clinician (25). From the onset, there is also general advice to escalate to either a supervising physician or surgeon, which demonstrates an awareness of limitations and a high level of patient safety. While the scenario is deliberately broad, the complaints of suprapubic pain could have been triggered by a urinary tract infection. The answer may have been improved by further investigation with urine microscopy, culture, and sensitivity.
In our scenario, the patient becomes acutely unwell after spiking a temperature and experiencing mild tachycardia. The recommendation from ChatGPT-4 is to involve a supervising physician or specialist, which is considered safe and appropriate. While the patient has a fever and warrants further investigations, the answer could have been improved by mentioning the importance of conducting a basic septic screen (chest X-ray, urine culture, and wound/blood cultures). As the patient begins to deteriorate and becomes haemodynamically unstable, the situation becomes highly concerning for necrotising fasciitis. While a formal diagnosis is not mentioned, the recommendations for investigations, further imaging, empiric antibiotic therapy, and fluid resuscitation are all appropriate. Additionally, due to the rapidly progressive nature of necrotising fasciitis (26), there is also a recommendation for close monitoring and a low threshold for escalation to intensive care. Once a clinical diagnosis is established, urgent surgical debridement, antibiotic therapy, and fluid resuscitation become the cornerstone of management (27), which are all recommended by ChatGPT-4. The recommendations from this scenario showcase a safe and practical approach to managing a deteriorating surgical patient on the ward.
Scenario B describes a post-operative patient who is tachycardic and hypotensive on the ward (Appendix 2 in the Supplementary File). Again, further history, physical examination, and investigations are recommended to assess this patient. Early consultation with a supervising physician or surgeon and appropriate treatment with intravenous fluids are also recommended, demonstrating a high level of patient safety. As the patient continues to deteriorate, important differentials of bowel obstruction or post-operative ileus are proposed for consideration, which normally occur 24-48 hours post-operatively (28). As the scenario progresses with no recordable drain outputs and persistent pain, ChatGPT-4's response is to reassess the patient, re-evaluate analgesia, and consult with a senior. While the response is adequate and safe, the answer may have been improved by further elucidating options for post-operative analgesia, given the persistent nature of the pain (29). Additionally, while there is low evidence for the use of abdominal drainage after open appendectomy (30), the lack of drain outputs may also be misleading, and focus should have been directed toward ascertaining the accuracy of drain measurements.
The progression of the scenario finds a twisted knot in the drain, which likely would have prevented any knowledge of a leak, bleed, or collection. Once a cause is established, the advice to involve the surgical team and to provide fluid resuscitation and blood transfusion reflects safe principles of management. In ChatGPT-4's response, there is also the suggestion that this patient may require further surgical re-exploration or haemostasis if bleeding is ongoing. This answer may therefore have been improved by ensuring the patient had a valid group and save and was alerted about the possibility of further surgery. Nevertheless, in all answers from ChatGPT-4, safe management principles were complemented with a suggestion to involve a senior medical colleague, highlighting its safeguarding and practical approach to managing a post-operative patient.
Scenario C describes a challenging patient who has recurrent small bowel obstructions in the emergency department (Appendix 3 in the Supplementary File). The initial approach to diagnosis and management of the patient's condition is congruent with current guidelines (31), including bowel decompression with nasogastric tube insertion, pain management, and keeping the patient fasted. It is noted that ChatGPT-4 also advises involving the surgical team early in this scenario, demonstrating a strong predisposition towards patient safety. As the patient's electrolytes become deranged, further management with re-checking of laboratory values and electrolyte replacement is correctly suggested (32).
The scenario progresses with the patient becoming angry, removing his nasogastric tube, and threatening to discharge himself against medical advice. During this ethical dilemma, there is a careful balance between respecting patient wishes and ensuring the best medical practice. The response by ChatGPT-4 highlights the importance of staying calm, educating the patient, offering alternatives, careful documentation, and safety-netting with a discharge plan, reflecting its awareness of patient safety and medico-legal risk. The concept of capacity is also proposed, where junior doctors must ascertain whether a patient fully comprehends the implications of their actions. While the response from ChatGPT-4 was safe and explored issues around capacity, it may have been improved by listing complications of small bowel obstruction: pain, bowel necrosis and perforation, intra-abdominal abscess, and aspiration (33,34).
The expert review framework revealed distinct differences in the performance of the three LLMs. ChatGPT-4 outperformed Bing's AI and Google's BARD in providing accurate answers, generating factual information, understanding complex questions, and offering comprehensive, relevant, and in-depth information across various topics. ChatGPT-4 also excelled in general knowledge and providing useful insights on different subjects. Divergences in the readability and comprehensibility of LLMs are also noted in previous literature (24). These are potentially attributable to varying training data, data pre-processing strategies, and inherent data structures. Such variations could impact each model's proficiency in managing unique terminologies and abbreviations. However, all three models demonstrated room for improvement in referencing sources and providing accurate citations, with none of them scoring above 'Neither agree nor disagree' in those categories.

Discussion
This study evaluated and compared the performance of the three most popular LLMs (ChatGPT-4, BingAI, and BARD) in providing precise and reliable advice for junior doctors in different post-operative scenarios. Overall, ChatGPT-4 demonstrated a strong foundation in clinical assessment by recommending a structured approach consisting of a focused history and physical examination, followed by investigations. It consistently recommended that junior doctors escalate to or involve senior surgeons at an early stage, showcasing an appropriate level of patient safety and awareness of junior doctors' limitations in handling surgical emergencies.
ChatGPT-4 was able to generate appropriate differential diagnoses in all 3 scenarios, indicating a comprehensive understanding of the clinical context. It also provided safe and practical guidance on patient management in line with established clinical guidelines while considering the ethical and medico-legal aspects of patient care, including respecting patient autonomy and addressing capacity. However, some ChatGPT-4 responses lacked specificity and comprehensiveness, as shown in scenario B, where its answer could have been improved by ensuring the patient had a valid group and save for potential re-exploration, reflecting its weakness in anticipating complications.
In contrast, the responses generated by BingAI and BARD were notably less specific in offering distinct recommendations for pathology and imaging tests. Nevertheless, it is important to acknowledge that BARD was the only model to make a preliminary diagnosis of necrotising fasciitis based on haemodynamic instability and to provide relevant risk factors to aid junior doctors in formulating differential diagnoses. This capability may indicate BARD's potential as a valuable complementary diagnostic aid (10). Despite these strengths, the performance of BingAI and BARD in providing management advice for handling postoperative emergencies was markedly inferior to that of ChatGPT-4. This deficiency is attributable not only to their responses' lack of structure and comprehensiveness but also to the occasional dissemination of misleading information. For instance, BARD entirely overlooked nasogastric tube (NGT) decompression in the context of bowel obstruction, while BingAI neglected to mention fluid resuscitation in instances of haemodynamic instability. Both oversights could lead to delayed treatment and potentially life-threatening consequences for patients. Nonetheless, both BingAI and BARD consistently emphasised the importance of escalating care throughout their responses. This emphasis is integral to ensuring safe practice for junior doctors in the clinical setting.
The COVID-19 pandemic has substantially disrupted surgical education, posing unique challenges for junior doctors and medical students. The suspension of clinical rotations and postponement of elective surgeries have restricted trainees' clinical exposure.
While online learning offers flexibility and convenience, this shift has limited hands-on learning opportunities, especially concerning procedural and physical examination skills. While most universities have resumed face-to-face training, it is also imperative for surgical training programs to consider adapting their curricula for innovative educational strategies, including hybrid learning models and AI-assisted technologies (35). However, the real question posed to medical educators is how to incorporate this new educational tool without compromising clinical acumen and patient safety.
LLMs in particular have considerable potential to help address the challenges in surgical education due to their interactive nature and rapid information retrieval capacity (36,37). While they cannot entirely replace in-person training or clinical exposure, they can be instrumental in supplementing and enhancing the educational experience for junior surgeons. Integrating ChatGPT-4 into clinical settings could offer several benefits. Firstly, it can provide real-time assistance to trainees in understanding complex clinical scenarios, interpreting medical data, and making informed decisions. This on-demand support may reduce the cognitive burden on junior doctors and help them refine their clinical reasoning skills. ChatGPT-4 can potentially fill the gaps left by traditional teaching methods, which might be constrained by time, resources, and the availability of experienced faculty. If used as an independent bedside tutor, it can deliver personalised, tailored learning experiences that address the specific needs and knowledge gaps of each trainee, for example by advising on relevant and important questions during history taking and on the critical components of examination for each presenting complaint. This can facilitate self-directed clinical training and accelerate learning, improving the overall quality of medical education. AI-generated clinical simulations, although not yet developed, hold great potential as a research area. They could be tailored to simulate various patient scenarios and conditions, offering a safe and controlled learning environment for students to hone their skills.
Although utilising LLMs (especially ChatGPT-4) in assisting surgical education is promising, several ethical concerns and potential challenges must be addressed to ensure the responsible and effective integration of such models into medical training. LLMs generate information based on patterns in the data they have been trained on, which might not be accurate or up to date, as demonstrated in some responses from BARD and BingAI. Relying on potentially incorrect information in medical education can have serious consequences for patient care. Responsibility for the consequences of AI-generated advice is also unclear. When a trainee follows an LLM's guidance and a negative patient outcome results, determining accountability and liability may prove challenging. Therefore, effective and ethical use of LLMs in medical education requires adherence to key guidelines. Trainees should view LLMs as advisory tools, verifying their output against other trusted resources since these AI models lack real-world clinical judgement. Data privacy and confidentiality must be prioritised, given that interactions with public LLMs may be stored for future model training, making it essential to avoid sharing personally identifiable or confidential patient information. In addition, there is a risk that surgical trainees may become overly reliant on AI-generated advice, potentially undermining their critical thinking and decision-making abilities (38). Striking a balance between using LLMs as a supplementary tool and developing independent clinical judgement is essential. While AI can provide valuable input, it cannot clinically assess the patient, take a detailed history, or complete an effective examination, all fundamental skills integral to medical training. Thus, supervision by medical professionals is essential, particularly in the early stages, to ensure the accurate interpretation and application of LLM-generated advice, and that surgical trainees continue to develop clinical and communication skills, as well as empathy for their patients.

Limitations and Future Directions
The primary limitation of this study lies in the fact that the inquiries posed to LLMs are derived from simulated scenarios constructed by a limited group of junior surgical doctors. Consequently, the findings may be less generalisable and applicable to a broader context. Nonetheless, the study offers insights into the potential integration of LLMs within surgical education, thereby contributing to the ongoing discourse on artificial intelligence in medical training. Large-scale longitudinal studies should be conducted to continuously assess the impact of this innovative teaching approach on surgical trainees' knowledge, skill development, and overall educational outcomes, with a focus on comparisons with traditional educational methods to identify areas of improvement or potential drawbacks.
Future research on expanding the AI models' training data is also worthwhile to refine the accuracy of their responses. This may be achieved by including more high-quality and up-to-date resources, such as surgical textbooks, guidelines, and research articles specifically covering a wide variety of clinical scenarios, thus providing more accurate and contextually relevant recommendations (39).
Collaboration with medical professionals and educators should be encouraged to validate, review, and curate AI-generated content so that it aligns with expert consensus and best medical practices. Other strategies, such as the future integration of LLMs with existing technologies including virtual reality and surgical simulators, may further enhance the learning experience. Additionally, amalgamating LLMs into virtual and robotic trainers may also provide more comprehensive and context-aware guidance to the surgical trainee, leading to improved delivery of feedback, and ultimately improving patient outcomes (40).

Conclusions
This study illustrates the potential of using AI technologies to aid junior doctors by providing accurate and pertinent guidance in common ward-based surgical scenarios.
The findings suggest LLMs, particularly ChatGPT-4, hold promise as valuable educational resources in medical training in certain scenarios. However, while these results are promising, ethical concerns and challenges limit the routine use of LLMs in medical education.
Further investigations are warranted to examine the applicability of LLMs in diverse medical specialties, as well as their impact on patient outcomes and on building clinical acumen in junior doctors. By comprehending the advantages and constraints of AI language models in medical education, we may devise innovative approaches to instructing future generations of healthcare professionals.

Table 1. Evaluation of Large Language Model Platforms' Responses