Proper assessment of the ability of an artificial intelligence (AI)-enabled medical device to generalize to new patient populations is necessary to determine the safety and effectiveness of the device. Assessing AI generalizability relies on performance assessment on a data set that represents the device’s intended population, which can be challenging to obtain. An understanding of the AI model’s decision space can indicate how the device is likely to perform on patients not represented in the available data. Our tool for decision region analysis for generalizability (DRAGen) assessment estimates the composition of the region of the decision space surrounding the available data. This provides an indication of how the model is likely to perform on samples that are similar to, but not represented by, the available finite data set. DRAGen can be applied to any binary classification model and requires no knowledge of the model’s training process. In a case study, we demonstrated DRAGen on a COVID classification model and showed that the decision region composition can identify differences in correct classification rates between the positive and negative classes, even when performance on the original test set is comparable. Performance evaluation using a data set that was represented neither during model development nor in the original test set shows a disparity in performance between COVID-positive and COVID-negative patients, as indicated by DRAGen. By releasing this tool, we encourage future AI developers to use it to improve their understanding of model generalizability.
Purpose: Understanding an artificial intelligence (AI) model’s ability to generalize to its target population is critical to ensuring the safe and effective use of AI in medical devices. A traditional generalizability assessment relies on the availability of large, diverse datasets, which are difficult to obtain in many medical imaging applications. We present an approach for enhanced generalizability assessment by examining the decision space beyond the available testing data distribution.
Approach: Vicinal distributions of virtual samples are generated by interpolating between triplets of test images. The generated virtual samples leverage the characteristics already in the test set, increasing the sample diversity while remaining close to the AI model’s data manifold. We demonstrate the generalizability assessment approach on the non-clinical tasks of classifying patient sex, race, COVID status, and age group from chest x-rays.
Results: Decision region composition analysis for generalizability indicated that a disproportionately large portion of the decision space belonged to a single “preferred” class for each task, despite comparable performance on the evaluation dataset. Evaluation using cross-reactivity and population shift strategies indicated a tendency to overpredict samples as belonging to the preferred class (e.g., COVID negative) for patients whose subgroup was not represented in the model development data.
Conclusions: An analysis of an AI model’s decision space has the potential to provide insight into model generalizability. Our approach uses the analysis of the composition of the decision space to obtain an improved assessment of model generalizability in the case of limited test data.
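To make the vicinal-distribution idea concrete, the sketch below shows one way virtual samples could be generated from a triplet of test images by convex linear interpolation. It is a minimal illustration under stated assumptions, not the authors’ released code: the Dirichlet sampling of convex weights and the function name `virtual_samples_from_triplet` are assumptions.

```python
import numpy as np

def virtual_samples_from_triplet(x1, x2, x3, n=25, rng=None):
    """Generate virtual samples on the plane spanned by a triplet of test
    images via convex linear interpolation (vicinal-distribution sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    samples = []
    for _ in range(n):
        # Draw convex weights (a, b, c) summing to 1 so that each virtual
        # image stays inside the triangle defined by the three real images.
        a, b, c = rng.dirichlet(np.ones(3))
        samples.append(a * x1 + b * x2 + c * x3)
    return np.stack(samples)
```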
Assessing the generalizability of deep learning algorithms based on the size and diversity of the training data is not trivial. This study uses the mapping of samples in the image data space to the decision regions in the prediction space to understand how different subgroups in the data impact the neural network learning process and affect model generalizability. Using vicinal distribution-based linear interpolation, a plane of the decision region space spanned by a random ‘triplet’ of three images can be constructed. Analyzing these decision regions for many random triplets can provide insight into the relationships between distinct subgroups. In this study, a contrastive self-supervised approach is used to develop a ‘base’ classification model trained on a large chest x-ray (CXR) dataset. The base model is fine-tuned on COVID-19 CXR data to predict image acquisition technology (computed radiography (CR) or digital radiography (DX)) and patient sex (male (M) or female (F)). Decision region analysis shows that the model’s image acquisition technology decision space is dominated by CR, regardless of the acquisition technology of the base images. Similarly, the female (F) class dominates the patient sex decision space. This study shows that decision region analysis has the potential to provide insights into subgroup diversity, sources of imbalances in the data, and model generalizability.
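As an illustration of how the composition of one triplet-spanned decision plane could be tallied, the sketch below classifies a grid of convex combinations of three images and reports the fraction of the plane assigned to each class. The grid resolution and the `model_predict` interface are assumptions for illustration, not details taken from the study.

```python
import numpy as np
from collections import Counter

def decision_plane_composition(model_predict, x1, x2, x3, steps=20):
    """Estimate the class composition of the decision plane spanned by one
    triplet: classify a grid of convex combinations of the three images and
    return the fraction of grid points assigned to each class."""
    points = []
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            a, b = i / steps, j / steps
            c = 1.0 - a - b
            points.append(a * x1 + b * x2 + c * x3)
    labels = model_predict(np.stack(points))   # one predicted label per point
    counts = Counter(np.asarray(labels).tolist())
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}
```

Repeating this for many random triplets and averaging the per-class fractions gives an estimate of the overall decision region composition described above.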
Empirical ensembles of deep convolutional neural network (DNN) models have been shown to outperform individual DNN models, as seen in the last few ImageNet challenges. Several studies have also shown that ensemble DNNs are robust against out-of-sample data, making them an ideal approach for machine learning-enabled medical imaging tasks. In this work, we analyze deep ensembles for the task of classifying true and false lung nodule candidates in computed tomography (CT) volumes. Six ImageNet-pretrained DNN models with minimal modifications for lung nodule classification were used to generate 63 ensemble DNNs using all possible combinations of the DNN models. The checkpoint predictions made during the training of the DNN models were used as a surrogate to understand the training trajectory each model took to reach its finalized form. The predictions from each checkpoint across the six DNN models were projected to a two-dimensional space using the uniform manifold approximation and projection (UMAP) method. The output scores from the six models were compared using a rank-biased overlap measure, which assigns larger weights to top-scoring candidates and can handle arbitrarily sized lists of candidates. Both analyses indicate that diversity in the training process leads to diversity of the scores for the same training and test samples. The competition performance metric (CPM) from the free-response operating characteristic curve shows that as the number of DNN models in each ensemble increases, the CPM increases from an average of 0.750 to 0.79.
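A minimal sketch of how the 63 ensembles could be enumerated from six models is shown below, assuming per-candidate scores are fused by simple averaging; the fusion rule, the function name `build_ensembles`, and the dictionary interface are assumptions, since the abstract does not specify them.

```python
import numpy as np
from itertools import combinations

def build_ensembles(model_scores):
    """Form every non-empty subset of models (63 ensembles for six models)
    and fuse member scores by simple averaging, one score per candidate.

    model_scores: dict mapping model name -> np.ndarray of candidate scores,
                  all arrays ordered over the same nodule candidates.
    """
    names = sorted(model_scores)
    ensembles = {}
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            stacked = np.stack([model_scores[m] for m in subset])
            ensembles[subset] = stacked.mean(axis=0)   # mean over members
    return ensembles
```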
KEYWORDS: Error analysis, Statistical analysis, Monte Carlo methods, Binary data, Imaging devices, Computer simulations, Medical imaging, In vivo imaging, Numerical integration, Imaging systems
There are at least two sources of variance when estimating the performance of an imaging device: the doctors (readers) and the patients (cases). These sources of variability generate variances and covariances in the observer study data that can be addressed with multi-reader, multi-case (MRMC) variance analysis. Frequently, a fully-crossed study design is used to collect the data; every reader reads every case. For imaging devices used during in vivo procedures, however, a fully-crossed design is infeasible. Instead, each patient is diagnosed by only one doctor, a doctor-patient study design. Here we investigate percent correct (PC) under this doctor-patient study design. From a probabilistic foundation, we present the bias and variance of two statistics: pooled PC and reader-averaged PC. We also present variance estimates of these statistics and compare them to naive estimates. Finally, we run simulations to assess the statistics and the variance estimates. The two PC statistics have the same means but different variances. The variances depend on how patients are distributed among the readers and the amount of reader variability. Regarding the variance estimates, the MRMC estimates are unbiased, whereas the naive estimates bracket the true variance and can be extremely biased.
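A minimal sketch of the two statistics under a doctor-patient design is given below, assuming each reader's results are available as a binary array of correct (1) and incorrect (0) diagnoses for the patients that reader alone diagnosed; the data structure and example values are illustrative.

```python
import numpy as np

def pooled_pc(correct_by_reader):
    """Pooled percent correct: pool every doctor-patient outcome and take a
    single overall proportion (each patient weighted equally)."""
    outcomes = np.concatenate(list(correct_by_reader.values()))
    return float(outcomes.mean())

def reader_averaged_pc(correct_by_reader):
    """Reader-averaged percent correct: compute PC per reader first, then
    average across readers (each reader weighted equally)."""
    per_reader = [float(np.mean(v)) for v in correct_by_reader.values()]
    return float(np.mean(per_reader))

# Example: each reader diagnoses a disjoint set of patients;
# 1 = correct diagnosis, 0 = incorrect.
correct_by_reader = {"reader_A": np.array([1, 1, 0, 1]),
                     "reader_B": np.array([1, 0])}
print(pooled_pc(correct_by_reader), reader_averaged_pc(correct_by_reader))
```

When readers diagnose unequal numbers of patients, the two realized values differ (as in this toy example), even though, per the abstract, the two statistics share the same mean.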