The goal of personalized medicine is to tailor disease prevention, diagnosis, and treatment to each individual, taking into account their genes, environment, and lifestyle. This emerging field recognizes the limitations of the one-size-fits-all approach of treating the “average patient”, which ignores the numerous, sometimes subtle differences between patients. Among these differences is the fact that the same disease can manifest itself differently across patients. To enable more robust diagnosis across patients, researchers are exploring disease subtype discovery, i.e., the discovery of different characterizations of the same disease. From a machine learning perspective, this problem translates into a clustering (unsupervised grouping of data) task in a supervised setting: the superclasses are known, but there exist unknown substructures in the data within each superclass that improve the discrimination of the original classes. We apply the idea of subtype discovery to the diagnosis of lung cancer nodules by attempting to discover different types of malignant and benign nodules in imaging data. Early detection and diagnosis could improve patient survival rates for this type of cancer, which is the leading cause of cancer death in the U.S. Our method 1) finds robust subtypes of lung nodules (homogeneous by design and differentiable from one another) through iterative K-means clustering that help classify them, and 2) leaves some data unclustered. This set of unclustered or “hard” data represents images that cannot confidently be assigned to any subtype and may require more resources (e.g., time or radiologists) to diagnose. Our approach is applied to the Lung Image Database Consortium (LIDC) data set.
We hypothesize that our subtype classification will outperform the classification of the original classes and produce quantitatively and qualitatively more meaningful representations of the diseases, compared not only to the original classes but also to subtypes produced by simply overclustering the data (i.e., producing more clusters than necessary to capture the original classes, or minimizing the clustering loss function without checking the content of the clusters). We improve classification performance by 11% over the original classification and provide a detailed evaluation of our newly discovered subtypes.
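The core mechanism described above, clustering within known classes, keeping only clusters that are homogeneous with respect to the class labels, and leaving hard-to-assign points unclustered, can be sketched as follows. This is a minimal illustration only: the `purity` and `radius_pct` thresholds, the plain Lloyd's K-means, and the function names are assumptions for the sketch, not the paper's exact procedure or parameters.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)  # (n, k)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
    return dists.argmin(axis=1), centers

def discover_subtypes(X, y, k=4, purity=0.9, radius_pct=80):
    """Keep clusters dominated by one class; mark the rest as "hard" (-1).

    purity: minimum fraction of the majority class for a cluster to survive
            (hypothetical threshold for this sketch).
    radius_pct: points farther from the center than this percentile of
            within-cluster distances are left unclustered.
    """
    labels, centers = kmeans(X, k)
    subtype = np.full(len(X), -1)   # -1 = unclustered / "hard" data
    next_id = 0
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx) == 0:
            continue
        counts = np.bincount(y[idx])
        if counts.max() / len(idx) < purity:
            continue                # mixed cluster: leave its points unclustered
        dist = np.linalg.norm(X[idx] - centers[j], axis=1)
        keep = idx[dist <= np.percentile(dist, radius_pct)]
        subtype[keep] = next_id     # a homogeneous, compact subtype
        next_id += 1
    return subtype
```

In practice this step would be iterated (re-clustering the remaining hard data, or sweeping over k) and the surviving subtypes used as the new target labels for the classifier, replacing the original binary malignant/benign labels.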