Computer generated transcript
Warning!
The following transcript was generated automatically from the content and has not been checked or corrected manually.
Journal Club:
Introduction to Critical Appraisal
Eng et al. Artificial Intelligence Algorithm Improves Radiologist
Performance in Skeletal Age Assessment: A Prospective Multicenter
Randomized Controlled Trial. Radiology. 2021.
Lucy McGuire
VP Glasgow Radiology Society 2022-23, Glasgow
Past President Glasgow Radiology Society 2021-22
BSc MedSci (Hons) Clinical Medicine, specialising in Clinical Neuroscience, 2021
What is Critical Appraisal?
• Analyzing scientific literature to determine its validity and relevance
Terms
• Validity- Is the conclusion true? Internal and External Validity.
• Relevance- Is the conclusion useful?
• Bias- systematic error influencing the results in one direction
• Generalisability- how applicable the results are to other populations
• Pragmatic- whether the results are applicable to the complexity of real life and not only the perfect scientific setting
Evidence Based Medicine
• Using the best evidence to make decisions in the care of each patient
See also:
Guidelines, GRADE Framework
PICO Summary
This is a prospective, multicentre, randomised controlled trial
• Population: In participating radiologists at 6 USA hospitals
• Intervention: Does the use of an AI interpretation as a diagnostic aid
• Control: vs usual practice
• Outcome measure: affect the accuracy and interpretation time of
skeletal age assessments from hand X-rays of children?
Background
What is skeletal age assessment?
• When a baby is born it has many cartilaginous “bones”
with few ossified areas
• As a child grows
• Bones become increasingly ossified starting from specific
ossification centres
• Growth plates of maturing bones fuse
• Each ossification centre appears at around the same age in
healthy children
• Comparing an X-ray to that of a healthy reference can
identify when the bones are maturing too fast or slow:
Reference Atlas
• Identifies genetic conditions and endocrine dysfunction, and assesses feasibility of treatment (e.g. scoliosis)
• Important to get right!
Background
What is Artificial Intelligence?
We must be aware that
AI is not a “perfect
machine”.
It is built by humans and
can reflect their biases.
An algorithm is only as
good as the data it learns
from.
Rubbish In = Rubbish Out
Background
• What’s already been done?
• AI for detecting nodules on CT chest gave different output when
tested many times over 1 year
• AI-aided systems for mammography found no benefit to patients’ health in a prospective multicentre RCT
• There are few RCTs for AI in medicine
• Previous articles have been published suggesting that AI could be useful in skeletal age assessment
• This is the first prospective, multicentre, randomised controlled trial
Methods
• September 2018 through August 2019
• 6 Centres:
• Reference- Stanford University School of Medicine
1. Harvard Medical School and Boston Children’s Hospital
2. Cincinnati Children’s Hospital Medical Center
3. Children’s Hospital of Philadelphia
4. New York University School of Medicine
5. Yale University School of Medicine
Ethics- approved by centres, verbal informed consent from radiologists,
consent waived for paediatric patients through review board decision.
Methods
AI Algorithm
• Trained using deep learning methods on the open-source training
data set released for the Radiological Society of North America
(RSNA) Pediatric Bone Age Machine Learning Challenge
• Training dataset: 12,611 hand radiographs: 5,778 girls (46%) and 6,833 boys (54%)
• Mean chronological age 10y, 7m.
• Before this study, the algorithm was tested on a dataset of 200 previously unseen X-rays.
Methods
• Scans which had a procedure code for skeletal age assessment were
sent automatically by PACS to an on-site machine.
• The machine then assigned each X-ray on a 1:1 ratio to either receive an AI skeletal age prediction or not (control)
• Radiologists were not blinded to patients’ true chronological ages
• Radiologists were not blinded to intervention (impossible: they know
if they have seen an AI estimate or not)
• When the radiologist received an AI estimate they also received a message explaining what it was
Methods
• End-point: the trial would end when each site had enrolled 300 examinations or the necessary sample size was reached
Sample size calculation
• Good practice to include this in the methods
• Works out how many participants are needed to get a statistically
significant answer to the question
• Ethical
• Prevents over-enrolment- more people exposed to potential harms
• Prevents under-enrolment- people exposed to harm for no statistical significance
Methods: Outcomes
• Primary efficacy outcome: Mean absolute difference (MAD) between the radiologist’s reported skeletal age (either with or without AI help) and a ‘ground truth’ skeletal age
• Ground truth was the average interpretation of a panel of four
radiologists not using a diagnostic aid
• The panel were blinded to AI allocation and to each other’s interpretations
• The panel used the same digital atlas and watched an instructional video beforehand
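A minimal sketch of how the primary outcome could be computed. This is illustrative only: the function names and the ages are assumptions, not taken from the paper; only the definitions (ground truth = panel average, MAD = mean absolute error against it) come from the methods above.

```python
from statistics import mean

def ground_truth(panel_reads_months):
    # Ground truth = average of the four blinded panel interpretations
    return mean(panel_reads_months)

def mean_absolute_difference(reported_months, truth_months):
    # Primary efficacy outcome: mean |reported skeletal age - ground truth|
    return mean(abs(r - t) for r, t in zip(reported_months, truth_months))

# Hypothetical example: two examinations, skeletal ages in months
truths = [ground_truth([126, 124, 128, 126]), ground_truth([90, 92, 88, 90])]
reported = [120, 96]
mad = mean_absolute_difference(reported, truths)  # 6.0 months
```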
Methods: Outcomes
• Secondary outcome: median interpretation time
• Time stamp of opening and closing radiology report
• Compared between AI and non-AI interpretations
Methods
Exclusion Criteria- examinations that were kept out of the study because they
might interfere with the results
• Scans with more than 1 image- the AI doesn’t know which to choose
• Reports where a trainee gave a preliminary interpretation- could influence
consultant
• Hands with deformities
• Scans that were not hands
• Where 3 of the 4 ground-truth panellists decided there was insufficient information
• When the interpreting radiologist had not consented to participate
• Not excluded: any age, any manufacturer of X-ray machine, any X-ray quality
Methods: Statistics
P values
What is a p value?
• This tells you if a difference between 2 measurements is statistically
significant
• If a difference is statistically significant then you can reject the null
hypothesis (that there is no difference)
• Generally the limit is set at <0.05 = significant (sometimes 0.01)
• The p value is the probability of this result happening by random chance
• A Type 1 Error is rejecting the null hypothesis (that there is no difference) when it is actually true (a false positive)
• If p < 0.05, the risk of a false positive is < 5%
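One way to see what a p value measures is a permutation test: shuffle the group labels many times and count how often chance alone produces a difference at least as large as the one observed. This is a teaching sketch, not the statistical model the trial actually used.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided p value: the fraction of random relabellings of the
    pooled data whose difference in means is at least as extreme as
    the observed difference."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm
```

If the two groups are identical, every shuffle is at least as extreme and the p value is 1; clearly separated groups give a small p value.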
Confidence Intervals
What is a Confidence Interval (CI)?
• Usually set to 95% (sometimes 99%), approximately 2 SD from mean
Example
• In an experiment you are trying to find out a true value e.g. mean height of humans
• The only real way to know the truth is to measure everyone on earth- not possible
• So we have to take a sample of people, the more the better
• The CI gives a range of values between which the true answer lies, with 95% confidence
• The CI is based on the number of people measured and the variability in their heights
Example (made up numbers!)
If you wanted to then compare the heights of 2 groups, e.g. English and French people, the null hypothesis = “there is no difference in height between English and French people”
• French mean height 5ft 8in (95% CI, 5ft 3in, 6ft 2in): “With 95% confidence the mean height of French people falls between 5ft 3in and 6ft 2in, with the best estimate being 5ft 8in”
• English mean height 5ft 6in (95% CI, 5ft 1in, 6ft 0in): “With 95% confidence the mean height of English people falls between 5ft 1in and 6ft 0in, with the best estimate being 5ft 6in”
Notice that the above CI ranges overlap a lot, therefore the difference is not significant, but you compare them anyway…
• Difference in mean height (French minus English) = 2in (95% CI, -1in, 3in). Here the CI crosses zero, therefore there is no significant difference and p > 0.05
• “With 95% confidence the mean height difference between French and English people falls between -1in and 3in, with the best estimate being that French people are 2in taller. As the 95% CI crosses zero, we currently fail to reject the null hypothesis that there is no height difference between French and English people”.
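The worked example above can be reproduced in code. A minimal normal-approximation sketch (real analyses often use the t distribution for small samples; the heights below are made-up numbers):

```python
import math
from statistics import NormalDist, mean, stdev

def mean_with_ci(data, confidence=0.95):
    """Sample mean with a normal-approximation CI: mean +/- z * (SD / sqrt(n))."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # ~1.96 for 95%
    m = mean(data)
    se = stdev(data) / math.sqrt(len(data))         # standard error of the mean
    return m - z * se, m, m + z * se

# Hypothetical heights in inches
low, m, high = mean_with_ci([66, 67, 68, 69, 70])
# m == 68; the 95% interval is roughly (66.6, 69.4)
```

The same idea applies to a difference in means: if its 95% CI crosses zero, the corresponding p value is above 0.05.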
The more people in the sample, the narrower the CI, and the more precise the estimate. A Type 2 Error is a false negative: accepting the null hypothesis (that there is no difference between values) when it is actually false. This can occur when there is not enough data, e.g. an underpowered sample size.
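The link between power, Type 2 error, and sample size can be sketched with the standard normal-approximation formula for comparing two means. Illustrative only: the effect size and SD in the example are made-up numbers, not the trial’s actual calculation.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Participants per group to detect a mean difference `delta` given SD:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96: Type 1 error limit
    z_beta = NormalDist().inv_cdf(power)           # ~0.84: 80% power (20% Type 2 error)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# e.g. to detect a 1-month difference in MAD with SD 6 months:
# n_per_group(1, 6) -> 566 examinations per arm
```

Smaller effects or noisier data need larger samples; doubling the detectable difference cuts the required n by a factor of four.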
Results
• In the pre-trial test analysis of 200 scans
• MAD between AI interpretation and panel interpretation = 4.9 months
• At the reference centre: Stanford
• MAD between AI interpretation and panel interpretation = 5.3 months
• This is not statistically different from the pre-trial test (5.3 vs 4.9, p= 0.37)
• MAD between Reference Centre and Other Centres (vs 5.3)
1. 5.5 p=0.27
2. 6.4 p=0.04 Deterioration from reference centre
3. 5.9 p=0.09
4. 6.0 p=0.10
5. 5.1 p=0.82
• AI performance did not differ between boys and girls
• AI performance was affected by rotation and anatomical disorders
Results: Primary Outcome
• With-AI group compared to Ground Truth MAD= 5.36 months
• Control group compared to Ground Truth MAD= 5.95 months
• 5.36 months vs 5.95 months, p =0.04
• With-AI group disagreed with the Ground Truth by >12 months= 9.3%
• Control group disagreed with the Ground Truth by >12 months= 13.0%
• 9.3 vs 13, p= 0.02
• With-AI group disagreed with the Ground Truth by >24 months= 0.5%
• Control group disagreed with the Ground Truth by >24 months= 1.8%
• 0.5 vs 1.8, p= 0.02
• In an adjusted mixed-effects model, the primary outcome showed lower diagnostic error in the with-AI group than in the control group
Results: Radiologist Behaviour
• Radiologist on their own vs AI on its own , MAD = 6.0 months vs 6.2 months, p=0.51
• Use of the AI improved radiologist predictions
• Radiologists deferred to accurate AI predictions more often than inaccurate ones (70.0%
vs 42.9% p<0.001)
• When presented with an inaccurate AI prediction, radiologists improved on that
prediction more often than they worsened an accurate one (51.7% vs 17.2% p<0.001)
• Although the AI produced accurate predictions more often than inaccurate ones (66.0%
vs 11.5% p<0.001), radiologist supervision improved the AI predictions (MAD, with
radiologist [5.4 months] vs control AI [6.2 months] p = 0.01)
• When radiologists were presented an inaccurate AI prediction, they performed less
accurately than the control group (MAD, 10.9 months [with-AI] vs 9.4 months [control] p
=0.06) NOT significant
• The AI group gave faster predictions than the control group (median 102 seconds [with-AI] vs 142 seconds [control], p=0.001) = Secondary Outcome
Results: Something strange happened…
At Centre 5:
• The AI predictions were just as accurate as at other centres (Centre 5 [47.1%] vs other centres [39.8%], p = 0.24)
• Radiologists worsened accurate AI predictions more than at other
centres (centre 5 [40.6%] vs other centres [21.2%] p = 0.01)
• Interestingly, radiologists without AI at Centre 5 performed better
than radiologists at other centres (MAD, Centre 5 [4.8 months] vs other centres [6.1 months], p = 0.04)
Discussion
• AI performance in a real-life prospective setting was equal to that in the pre-trial experiments
• The AI worked across almost all the centres equally to the reference
centre showing some generalisability
• A prospective trial was important to simulate real clinical conditions
• The fact that the AI worsened predictions at centre 5 shows that
different external factors can affect the AI efficacy
• There were different levels of automation bias between different radiologists; some did worse whilst the overall group did better. Is this ethical?
Discussion
• Automation bias- when people are more likely to accept computer
suggestions despite contrary information
• May improve interpretation time but increase error
Discussion
• We don’t know the participant outcomes because they weren’t
measured- did the AI improve or harm anyone’s health?
• Other potential future trials could be blinded, one group gets an AI
prediction and the other gets a random number. Would this make
automation bias dangerous to these patients?
• Maybe the radiologists reporting the non-AI scans felt pressure to
compete with the AI and were extra accurate? Not representative of real practice?
Conclusion
• “In this prospective multicentre randomized controlled trial
comparing use of an artificial intelligence (AI) algorithm as a
diagnostic aid with the current standard of care, overall diagnostic
error was significantly decreased when the AI algorithm was used
compared with when it was not. Diagnostic error was decreased with
use of the AI algorithm at some but not all centres, including an
outlier centre where diagnostic error was increased. Taken together, [these findings support use of the AI algorithm as a] diagnostic aid for radiologists and reinforce the importance of
interactive effects between human radiologists and AI algorithms in
determining potential benefits and harms of assistive technologies in
clinical medicine.”
Critical Appraisal
• What do we think about the results and conclusions of this paper?
• Are they valid (true)?
• Are they relevant (useful)?
Validity
Strengths vs Weaknesses (But…)
1. Strength: Radiology is a reliable source of peer-reviewed evidence. Weakness: non-blinding of radiologists introduced bias in terms of effort; were the statisticians blinded? But: blinding with random numbers has ethical issues due to automation bias.
2. Strength: clear rationale for research (equipoise). Weakness: the outcomes aren’t patient-centred; is this relevant for patient outcomes? But: due to the wide pragmatic inclusion criteria, patient-centred outcome measures would be very complicated and varied.
3. Strength: multicentre (6 centres) proves some generalisability. Weakness: unexplained variation at Centre 5; unmeasured confounders? But: shows that there are issues in generalisability, and they reported this.
4. Strength: inclusion/exclusion criteria wide enough to be generalisable. Weakness: conflicts of interest? Some authors own stocks in AI and tech companies, e.g. Microsoft, Canon. Who funded this study? But: the conflicts were reported, and the results weren’t all positive in AI’s favour.
5. Strength: randomised-controlled design reduces many of the effects of bias; well balanced baseline characteristics. Weakness: radiologists only participated if they consented; what was their motivation? Bias for or against AI? But: it is unethical to force participation without consent.
6. Strength: prospective design simulated a real-life clinical environment. Weakness: scans were excluded if a trainee interpreted them first, and this is very common. But: inclusion in the analysis would have made comparison between groups harder due to confounding.
7. Strength: sample size calculated; quite big (1903 patients). Weakness: scans were excluded if the anatomy was deformed; these may be the most difficult to assess and therefore need AI the most. But: limitations of current technology.
8. Strength: appropriate statistical analysis for different data distributions (normal vs non-normal). Weakness: patients weren’t consented explicitly; consent was waived by the board. But: the review board decided that it was ethical to do so.
9. Strength: blinding of AI allocation in the ground truth panel.
Relevance
Do we, in the real world, care about these results?
• Not a currently accessible technology
• Seems to generally improve accuracy and speed
• Important to determine confounding factors that caused results at
Centre 5
• Important to determine patient centred effects, does this actually
improve health?
• Up for debate?
Questions?
Comments?
Further Analysis?
Please fill in this feedback form, it is very
useful for the society and for my personal
portfolio! 🙏
You will get a certificate of attendance in
return