Description

The University of Glasgow Radiology Society is proud to present our annual journal club event, providing an introduction to and revision of critical appraisal.

This session is ideal for medical students completing coursework, interested in intercalating, or looking to revise for SFP applications.

The event will focus on analysing the paper 'Artificial Intelligence Algorithm Improves Radiologist Performance in Skeletal Age Assessment: A Prospective Multicenter Randomized Controlled Trial' by Eng et al. (2021), published in Radiology. Link: https://pubs.rsna.org/doi/10.1148/radiol.2021204021...

It is not necessary to have read the paper in full in advance, but it would be ideal for the audience to be familiar with its content.

The event will be hosted by Lucy McGuire, a final-year medical student, current Vice President of the UofG Radiology Society and immediate past president. She completed an intercalated BSc MedSci (Hons) in Clinical Medicine with Clinical Neuroscience in 2021.

Computer generated transcript

Warning!
The following transcript was generated automatically from the content and has not been checked or corrected manually.

Journal Club: Introduction to Critical Appraisal
Eng et al. Artificial Intelligence Algorithm Improves Radiologist Performance in Skeletal Age Assessment: A Prospective Multicenter Randomized Controlled Trial. Radiology. 2021.
Lucy McGuire, final year medical student, University of Glasgow. VP, Glasgow Radiology Society 2022-23. Past President, Glasgow Radiology Society 2021-22. BSc MedSci (Hons) Clinical Medicine specialising in Clinical Neuroscience, 2021.

What is Critical Appraisal?
• Analysing scientific literature to determine its validity and relevance

Terms
• Validity: is the conclusion true? Internal and external validity.
• Relevance: is the conclusion useful?
• Bias: systematic error influencing the results in one direction
• Generalisability: how applicable the results are to other populations
• Pragmatic: whether the results are applicable to the complexity of real life and not only the perfect scientific setting

Evidence Based Medicine
• Using the best evidence to make decisions in the care of each patient
• See also: guidelines, the GRADE framework

PICO Summary
This is a prospective, multicentre, randomised controlled trial.
• Population: in participating radiologists at 6 USA hospitals
• Intervention: does the use of an AI interpretation as a diagnostic aid
• Control: vs usual practice
• Outcome measure: affect the accuracy and interpretation time of skeletal age assessments from hand X-rays of children?

Background: What is skeletal age assessment?
• When a baby is born it has many cartilaginous “bones” with few ossified areas
• As a child grows, bones become increasingly ossified starting from specific ossification centres, and the growth plates of maturing bones fuse
• Each ossification centre appears at around the same age in healthy children
• Comparing an X-ray to that of a healthy reference (a reference atlas) can identify when the bones are maturing too fast or too slow
• Identifies genetic conditions, endocrine dysfunction, feasibility of treatment (e.g. scoliosis), assess…
• Important to get right!

Background: What is Artificial Intelligence?
• We must be aware that AI is not a “perfect machine”. It is built by humans and can reflect their biases.
• An algorithm is only as good as the data it learns from. Rubbish in = rubbish out.

Background: What has already been done?
• AI for detecting nodules on CT chest gave different output when tested many times over 1 year
• AI-aided systems for mammography found no benefit to patients' health in a prospective multicentre RCT
• There are few RCTs for AI in medicine
• Previous articles have been published suggesting that AI could be useful in skeletal age assessment
• This is the first prospective, multicentre, randomised controlled trial

Methods
• September 2018 through August 2019
• 6 centres:
  Reference: Stanford University School of Medicine
  1. Harvard Medical School and Boston Children’s Hospital
  2. Cincinnati Children’s Hospital Medical Center
  3. Children’s Hospital of Philadelphia
  4. New York University School of Medicine
  5. Yale University School of Medicine
• Ethics: approved by centres; verbal informed consent from radiologists; consent waived for paediatric patients through review board decision

Methods: AI Algorithm
• Trained using deep learning methods on the open-source training data set released for the Radiological Society of North America (RSNA) Pediatric Bone Age Machine Learning Challenge
• Training dataset: 12,611 hand radiographs: 5,778 girls (46%) and 6,833 boys (54%)
• Mean chronological age 10 years, 7 months
• Before this study, the algorithm was tested on a dataset of 200 previously unseen X-rays

Methods
• Scans which had a procedure code for skeletal age assessment were sent automatically by PACS to an on-site machine
• The machine then assigned each X-ray on a 1:1 ratio to either receive an AI skeletal age prediction or not (control)
• Radiologists were not blinded to the patient’s true chronological age
• Radiologists were not blinded to the intervention (impossible: they know if they have seen an AI estimate or not)
• When the radiologist received an AI estimate they also received a message explaining what it was

Methods
• End-point: the trial would end when each site enrolled 300 examinations or the necessary sample size was reached
Sample size calculation
• Good practice to include this in the methods
• Works out how many participants are needed to get a statistically significant answer to the question
• Ethical: prevents over-enrolment (more people exposed to potential harms) and prevents under-enrolment (people exposed to harm for no statistical significance)
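As a rough illustration of the idea only (this is not the calculation used in the trial), a per-group sample size for detecting a difference in means can be sketched in Python with scipy; the effect size, standard deviation, alpha and power below are invented for illustration.

# Hypothetical sketch of a two-group sample size calculation using the
# normal approximation; delta and sd are made-up values, not figures
# taken from Eng et al.
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    # Examinations needed per arm to detect a mean difference `delta`
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = norm.ppf(power)           # guards against a Type 2 error
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

# e.g. detecting a 1-month difference in MAD, assuming an SD of 4 months:
print(round(n_per_group(delta=1.0, sd=4.0)))  # about 251 examinations per arm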
Methods: Outcomes
• Primary efficacy outcome: mean absolute difference (MAD) between the radiologist's reported skeletal age (either with or without AI help) and a ‘ground truth’ skeletal age
• Ground truth was the average interpretation of a panel of four radiologists not using a diagnostic aid
• The panel were blinded to AI allocation and to each other's interpretation
• The panel used the same digital atlas and watched an instructional video beforehand

Methods: Outcomes
• Secondary outcome: median interpretation time
• Time stamp of opening and closing the radiology report
• Compared between AI and non-AI interpretations
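To make the primary outcome concrete, here is a minimal sketch of how a mean absolute difference could be computed per study arm; the ages are invented illustration values, not data from the trial.

# Hypothetical sketch of the primary outcome: MAD (in months) between each
# reported skeletal age and the panel 'ground truth', for each arm.
import numpy as np

def mad(reported, ground_truth):
    # Mean absolute difference between reported ages and ground truth ages
    return np.mean(np.abs(np.asarray(reported) - np.asarray(ground_truth)))

ground_truth = [96, 120, 84, 132]   # panel consensus, months (made up)
with_ai      = [100, 118, 86, 130]  # radiologist aided by AI (made up)
control      = [104, 126, 78, 140]  # radiologist alone (made up)

print(mad(with_ai, ground_truth))   # 2.5 months
print(mad(control, ground_truth))   # 7.0 months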
Methods: Exclusion Criteria
Examinations that were kept out of the study because they might interfere with the results:
• Scans with more than 1 image (the AI doesn't know which to choose)
• Reports where a trainee gave a preliminary interpretation (could influence the consultant)
• Hands with deformities
• Scans that were not hands
• Where 3/4 of the ground truth panel decided there was insufficient information
• When the interpreting radiologist hadn't consented to participate
• Not excluded: any age, any manufacturer of X-ray machine, any X-ray quality

Methods: Statistics — p values
What is a p value?
• This tells you if a difference between 2 measurements is statistically significant
• If a difference is statistically significant then you can reject the null hypothesis (that there is no difference)
• Generally the limit is set at <0.05 = significant (sometimes 0.01)
• The p value is the probability of the observed result happening by random chance
• A Type 1 Error is rejecting the null hypothesis (that there is no difference) when it is actually true (a false positive)
• With the limit set at <0.05, the risk of a false positive (a Type 1 Error) is <5%
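A small, hedged example of what a p value looks like in practice: comparing two invented samples with an independent-samples t test in scipy. This is illustrative only and is not an analysis from the paper.

# Two made-up groups of measurements; the t test asks how likely a
# difference this large would be if the null hypothesis were true.
from scipy import stats

group_a = [5.1, 5.9, 6.3, 5.6, 6.0, 5.4, 6.2, 5.8]
group_b = [6.4, 6.9, 7.1, 6.6, 7.3, 6.8, 7.0, 6.5]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value)  # well below 0.05 for these invented numbers
if p_value < 0.05:
    print("Reject the null hypothesis: the difference is statistically significant")
else:
    print("Fail to reject the null hypothesis")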
Confidence Intervals
What is a Confidence Interval (CI)?
• Usually set to 95% (sometimes 99%), approximately 2 SD from the mean
Example
• In an experiment you are trying to find out a true value, e.g. the mean height of humans
• The only real way to know the truth is to measure everyone on Earth, which is not possible
• So we have to take a sample of people; the more the better
• The CI gives a range of values between which the true answer lies, with 95% confidence
• The CI is based on the number of people measured and the variability in their heights
Example (made up numbers!)
If you wanted to then compare the heights of 2 groups, e.g. English and French people, the null hypothesis = “there is no difference in height between English and French people”.
• French mean height 5 ft 8 in (95% CI: 5 ft 3 in, 6 ft 2 in). “With 95% confidence, the mean height of French people falls between 5 ft 3 in and 6 ft 2 in, with the best estimate being 5 ft 8 in.”
• English mean height 5 ft 6 in (95% CI: 5 ft 1 in, 6 ft 0 in). “With 95% confidence, the mean height of English people falls between 5 ft 1 in and 6 ft 0 in, with the best estimate being 5 ft 6 in.”
• Notice that the above CI ranges overlap a lot, therefore the difference is not significant, but you compare them anyway…
• Difference in mean height, French minus English = 2 in (95% CI: -1 in, 3 in). Here the CI crosses zero, therefore there is no significant difference and p > 0.05. “With 95% confidence, the mean height difference between French and English people falls between -1 in and 3 in, with the best estimate being that French people are 2 inches taller. As the 95% CI crosses zero, we currently fail to reject the null hypothesis that there is no height difference between French and English people.”
• The more people in the sample, the narrower the CI and the more precise the estimate
• A Type 2 error is a false negative: accepting the null hypothesis (that there is no difference between values) when it is actually false. This can occur when there is not enough data, e.g. an underpowered sample size.
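As a companion to the height example above, here is a rough sketch of computing a 95% confidence interval for a sample mean in Python. The heights are invented, and the method shown (a t interval built from the standard error) is a standard textbook approach rather than anything specified in the paper.

# Made-up sample of heights in inches; the interval narrows as the sample
# grows or the variability shrinks, i.e. the estimate becomes more precise.
import numpy as np
from scipy import stats

heights = np.array([66, 70, 68, 64, 72, 67, 69, 65, 71, 68])

mean = heights.mean()
sem = stats.sem(heights)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(heights) - 1, loc=mean, scale=sem)

print(f"mean {mean:.1f} in, 95% CI ({ci_low:.1f}, {ci_high:.1f})")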
Results
• In the pre-trial test analysis of 200 scans, the MAD between the AI interpretation and the panel interpretation = 4.9 months
• At the reference centre (Stanford), the MAD between the AI interpretation and the panel interpretation = 5.3 months
• This is not statistically different from the pre-trial test (5.3 vs 4.9, p = 0.37)
• MAD at the other centres, compared with 5.3 at the reference centre:
  1. 5.5, p = 0.27
  2. 6.4, p = 0.04 (deterioration from the reference centre)
  3. 5.9, p = 0.09
  4. 6.0, p = 0.10
  5. 5.1, p = 0.82
• AI performance did not differ between boys and girls
• AI performance was affected by rotation and anatomical disorders

Results: Primary Outcome
• With-AI group compared to ground truth: MAD = 5.36 months
• Control group compared to ground truth: MAD = 5.95 months (5.36 vs 5.95 months, p = 0.04)
• With-AI group disagreed with the ground truth by >12 months in 9.3%; control group in 13.0% (9.3 vs 13.0, p = 0.02)
• With-AI group disagreed with the ground truth by >24 months in 0.5%; control group in 1.8% (0.5 vs 1.8, p = 0.02)
• In an adjusted mixed-effects model, the primary outcome showed lower diagnostic error in the with-AI group than in the control group

Results: Radiologist Behaviour
• Radiologist on their own vs AI on its own: MAD = 6.0 months vs 6.2 months, p = 0.51
• Use of the AI improved radiologist predictions
• Radiologists deferred to accurate AI predictions more often than inaccurate ones (70.0% vs 42.9%, p < 0.001)
• When presented with an inaccurate AI prediction, radiologists improved on that prediction more often than they worsened an accurate one (51.7% vs 17.2%, p < 0.001)
• Although the AI produced accurate predictions more often than inaccurate ones (66.0% vs 11.5%, p < 0.001), radiologist supervision improved the AI predictions (MAD with radiologist [5.4 months] vs control AI [6.2 months], p = 0.01)
• When radiologists were presented with an inaccurate AI prediction, they performed less accurately than the control group (MAD 10.9 months [with-AI] vs 9.4 months [control], p = 0.06); NOT significant
• The AI group gave faster interpretations than the control group (median 102 seconds [with-AI] vs 142 seconds [control], p = 0.001) = secondary outcome

Results: Something strange happened… at Centre 5
• The AI predictions were just as accurate as at other centres (Centre 5 [47.1%] vs other centres [39.8%], p = 0.24)
• Radiologists worsened accurate AI predictions more than at other centres (Centre 5 [40.6%] vs other centres [21.2%], p = 0.01)
• Interestingly, radiologists without AI at Centre 5 performed better than radiologists at other centres (MAD, Centre 5 [4.8 months] vs other centres [6.1 months], p = 0.04)

Discussion
• AI performance in a real-life prospective setting was equal to that in the pre-trial experiments
• The AI worked across almost all the centres equally to the reference centre, showing some generalisability
• A prospective trial was important to simulate real clinical conditions
• The fact that the AI worsened predictions at Centre 5 shows that different external factors can affect the AI's efficacy
• There were different levels of automation bias between different radiologists; some did worse whilst the overall group did better. Is this ethical?

Discussion
• Automation bias: when people are more likely to accept computer suggestions despite contrary information
• May improve interpretation time but increase error

Discussion
• We don't know the participant outcomes because they weren't measured. Did the AI improve or harm anyone's health?
• Other potential future trials could be blinded: one group gets an AI prediction and the other gets a random number. Would this make automation bias dangerous to these patients?
• Maybe the radiologists reporting the non-AI scans felt pressure to compete with the AI and were extra accurate? Not representative of real practice?

Conclusion
• “In this prospective multicentre randomized controlled trial comparing use of an artificial intelligence (AI) algorithm as a diagnostic aid with the current standard of care, overall diagnostic error was significantly decreased when the AI algorithm was used compared with when it was not. Diagnostic error was decreased with use of the AI algorithm at some but not all centres, including an outlier centre where diagnostic error was increased. Taken together, […] diagnostic aid for radiologists and reinforce the importance of interactive effects between human radiologists and AI algorithms in determining potential benefits and harms of assistive technologies in clinical medicine.”

Critical Appraisal
• What do we think about the results and conclusions of this paper?
• Are they valid (true)?
• Are they relevant (useful)?

Validity
1. Strength: Radiology is a reliable source of peer-reviewed evidence. Weakness: non-blinding of radiologists introduced bias in terms of effort; were the statisticians blinded? But: blinding with random numbers has ethical issues due to automation bias.
2. Strength: clear rationale for research (equipoise). Weakness: the outcomes aren't patient centred; is this relevant for patient outcomes? But: due to the wide, pragmatic inclusion criteria there would be very complicated, varied outcome measures.
3. Strength: multicentre (6 centres), proving some generalisability. Weakness: unexplained variation at Centre 5; unmeasured confounders? But: this shows that there are issues in generalisability, and they reported this.
4. Strength: inclusion/exclusion criteria wide enough to be generalisable. Weakness: conflicts of interest? Some authors own stocks in AI and tech companies, e.g. Microsoft, Canon. Who funded this study? But: the conflicts were reported, and the results weren't all positive in AI's favour.
5. Strength: randomised controlled design reduces many of the effects of bias; well balanced baseline characteristics. Weakness: radiologists only participated if they consented; what was their motivation? Bias for or against AI? But: it is unethical to force participation without consent.
6. Strength: prospective design simulated a real-life clinical environment. Weakness: scans were excluded if interpreted by a trainee first, which is very common. But: inclusion in the analysis would have made comparison between groups harder due to confounding.
7. Strength: sample size calculated, and quite big (1,903 patients). Weakness: scans were excluded if the anatomy was deformed; these may be the most difficult to assess and therefore need AI the most. But: limitations of the current technology.
8. Strength: appropriate statistical analysis for different data distributions (normal vs non-normal). Weakness: patients weren't consented explicitly; consent was waived by the board. But: the review board decided that it was ethical to do so.
9. Strength: blinding to AI allocation in the ground truth panel.

Relevance
Do we, in the real world, care about these results?
• Not a currently accessible technology
• Seems to generally improve accuracy and speed
• Important to determine the confounding factors that caused the results at Centre 5
• Important to determine patient-centred effects: does this actually improve health?
• Up for debate?

Questions? Comments? Further Analysis?
Please fill in this feedback form; it is very useful for the society and for my personal portfolio! 🙏 You will get a certificate of attendance in return.