KEYWORDS: Calibration, Tumors, Image segmentation, Algorithm development, Machine learning, Detection and tracking algorithms, Cancer detection, Breast cancer
Machine learning (ML)-based whole slide imaging biomarkers have great potential to improve the efficiency and consistency of biomarker quantification, thereby facilitating the development of prognosis models for personalized medicine. Assessment methods in this area, however, remain underdeveloped. Using the public TiGER (Tumor InfiltratinG lymphocytes in breast cancER) challenge data, we developed a deep neural network-based algorithm for automated tumor-infiltrating lymphocyte (TILs) scoring from whole slide images (WSIs) of biopsies and surgical resections of human epidermal growth factor receptor-2 positive (HER2+) and triple-negative breast cancer (TNBC) patients. The purpose of this study is to assess our model’s performance on a new, independent dataset. Seven pathologists independently assessed 320 pre-selected regions of interest (ROIs) across 32 WSIs for TILs scoring. Our results show substantial variability among pathologists in scoring TILs density. We also observed a systematic discrepancy between the ML-based TILs scores and the pathologists’ manual scores, which led us to develop a calibration between the two. The calibration reduced the discrepancy, increasing the intraclass correlation coefficient (ICC) from 0.35 (95% CI [-0.062, 0.625]) for uncalibrated scores to 0.67 (95% CI [0.6, 0.736]) after calibration.
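As an illustrative sketch only (the abstract does not specify the calibration model), one simple way to realize such a calibration is a least-squares linear mapping from ML-based TILs scores to the pathologists' mean scores, with agreement quantified by a two-way random-effects, absolute-agreement ICC(2,1). The function names below (`icc_2_1`, `fit_linear_calibration`) and the synthetic data are hypothetical and not the authors' implementation.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_subjects, n_raters) matrix of TILs scores per ROI.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-ROI means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Sum-of-squares decomposition for the two-way model
    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_total = np.sum((ratings - grand_mean) ** 2)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

def fit_linear_calibration(ml_scores, reference_scores):
    """Least-squares linear map from ML TILs scores to reference scores."""
    slope, intercept = np.polyfit(ml_scores, reference_scores, deg=1)
    return slope, intercept

# Example with synthetic numbers: 320 ROIs, a biased ML score, 7 pathologists
rng = np.random.default_rng(0)
truth = rng.uniform(0, 100, size=320)                     # latent TILs density per ROI
pathologists = truth[:, None] + rng.normal(0, 10, size=(320, 7))
ml = 0.6 * truth + 5 + rng.normal(0, 8, size=320)         # systematically shifted ML score

slope, intercept = fit_linear_calibration(ml, pathologists.mean(axis=1))
ml_calibrated = slope * ml + intercept

print("ICC before:", icc_2_1(np.column_stack([ml, pathologists.mean(axis=1)])))
print("ICC after: ", icc_2_1(np.column_stack([ml_calibrated, pathologists.mean(axis=1)])))
```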
KEYWORDS: Image segmentation, Medical imaging, Lung, Anatomy, Radiotherapy, Image processing algorithms and systems, Error analysis, Education and training
Segmentation of medical images with known ground truth is useful for investigating properties of performance metrics and for comparing different approaches to combining multiple manual segmentations into a reference standard, thereby informing the selection of performance metrics and truthing methods. For medical images, however, segmentation ground truth is typically not available. One way of synthesizing segmentation errors is to use regular geometric objects as ground truth, but these lack the complexity and variability of real anatomical objects. To address this problem, we developed a medical image segmentation synthesis tool (MISS-tool). The MISS-tool emulates segmentations by adjusting truth masks of anatomical objects extracted from real medical images. We categorized six types of segmentation errors and developed contour transformation tools with a set of user-adjustable parameters that modify the defined truth contours to emulate different types of segmentation errors, thereby generating synthetic segmentations. In a simulation study, we synthesized multiple segmentations to emulate algorithms or observers with pre-defined sets of segmentation errors (e.g., under/over-segmentation) using 220 lung nodule cases from the LIDC lung computed tomography dataset. We verified that the synthetic segmentations manifest error types consistent with our pre-configured settings. Our tool is useful for synthesizing a range of segmentation errors within a clinical segmentation task.
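The MISS-tool itself is not reproduced here; the following is a minimal sketch of the general idea under the assumption that under- and over-segmentation can be emulated by eroding or dilating a binary truth mask, and that boundary errors can be emulated by randomly toggling boundary pixels. The function name `synthesize_segmentation` and its parameters are illustrative, not the tool's actual interface.

```python
import numpy as np
from scipy import ndimage

def synthesize_segmentation(truth_mask: np.ndarray,
                            boundary_shift: int = 0,
                            boundary_noise: float = 0.0,
                            seed: int = 0) -> np.ndarray:
    """Emulate a segmentation by perturbing a binary ground-truth mask.

    boundary_shift > 0 dilates (over-segmentation), < 0 erodes (under-segmentation);
    boundary_noise is the probability of flipping each boundary pixel.
    """
    rng = np.random.default_rng(seed)
    seg = truth_mask.astype(bool)

    # Systematic over-/under-segmentation via morphological dilation/erosion
    if boundary_shift > 0:
        seg = ndimage.binary_dilation(seg, iterations=boundary_shift)
    elif boundary_shift < 0:
        seg = ndimage.binary_erosion(seg, iterations=-boundary_shift)

    # Random boundary errors: flip a fraction of pixels on the object boundary
    if boundary_noise > 0:
        boundary = ndimage.binary_dilation(seg) ^ ndimage.binary_erosion(seg)
        flip = boundary & (rng.random(seg.shape) < boundary_noise)
        seg = seg ^ flip

    return seg

# Example: emulate an "over-segmenting" observer on a toy nodule mask
truth = np.zeros((64, 64), dtype=bool)
truth[24:40, 24:40] = True
emulated = synthesize_segmentation(truth, boundary_shift=2, boundary_noise=0.2)
print("truth area:", truth.sum(), "emulated area:", emulated.sum())
```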
Studies have shown that an increased presence of tumor-infiltrating lymphocytes (TILs) is associated with better long-term clinical outcomes and survival, which makes TILs a potentially useful quantitative biomarker. In the clinic, pathologists’ visual assessment of TILs in biopsies and surgical resections results in a quantitative score (TILs-score). The TiGER (Tumor InfiltratinG lymphocytes in breast cancER) challenge is the first public challenge on automated TILs-scoring algorithms using whole slide images of hematoxylin and eosin-stained (H&E) slides of human epidermal growth factor receptor-2 positive (HER2+) and triple-negative breast cancer (TNBC) patients. We participated in the TiGER challenge and developed algorithms for tumor-stroma segmentation, TILs cell detection, and TILs-scoring. The whole slide images in this challenge come from three sources, each with apparent color variations. We hypothesized that color-normalization may improve the cross-source generalizability of our deep learning models. Here, we expand our initial work by implementing a color-normalization technique and investigating its effect on the performance of our segmentation model. We compare the segmentation performance before and after color-normalization by cross-validating the models across the three data sources. Our results show a substantial increase in the performance of the segmentation model after color-normalization when trained and tested on different sources. This may improve the model’s generalizability and robustness when applied to the external sequestered test set from the TiGER challenge.
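The abstract does not name the specific color-normalization technique used. As one common choice in histopathology, a Reinhard-style normalization matches the per-channel mean and standard deviation of a source tile to a reference tile in LAB color space. The sketch below, with the hypothetical function name `reinhard_normalize`, illustrates that approach rather than the exact implementation evaluated in the study.

```python
import numpy as np
from skimage import color

def reinhard_normalize(src_rgb: np.ndarray, ref_rgb: np.ndarray) -> np.ndarray:
    """Match the LAB-space channel statistics of src_rgb to those of ref_rgb.

    Both inputs are H x W x 3 uint8 RGB tiles; returns a uint8 RGB tile.
    """
    src_lab = color.rgb2lab(src_rgb)
    ref_lab = color.rgb2lab(ref_rgb)

    src_mean, src_std = src_lab.mean(axis=(0, 1)), src_lab.std(axis=(0, 1))
    ref_mean, ref_std = ref_lab.mean(axis=(0, 1)), ref_lab.std(axis=(0, 1))

    # Shift and scale each LAB channel of the source toward the reference statistics
    norm_lab = (src_lab - src_mean) / (src_std + 1e-8) * ref_std + ref_mean
    norm_rgb = color.lab2rgb(norm_lab)          # float image in [0, 1]
    return (np.clip(norm_rgb, 0, 1) * 255).astype(np.uint8)

# Example: normalize a tile from one scanner/source toward a reference tile
# (random arrays stand in for tiles extracted from the WSIs)
rng = np.random.default_rng(0)
tile = rng.integers(100, 220, size=(256, 256, 3), dtype=np.uint8)
reference = rng.integers(60, 200, size=(256, 256, 3), dtype=np.uint8)
normalized = reinhard_normalize(tile, reference)
```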