This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Chest computed tomography (CT) is crucial for the detection of lung cancer, and many automated CT evaluation methods have been proposed. Due to the divergent software dependencies of the reported approaches, the developed methods are rarely compared or reproduced.
The goal of the research was to generate reproducible machine learning modules for lung cancer detection and compare the approaches and performances of the award-winning algorithms developed in the Kaggle Data Science Bowl.
We obtained the source codes of all award-winning solutions of the Kaggle Data Science Bowl Challenge, where participants developed automated CT evaluation methods to detect lung cancer (training set n=1397, public test set n=198, final test set n=506). The performance of the algorithms was evaluated by the log-loss function, and the Spearman correlation coefficient of the performance in the public and final test sets was computed.
Most solutions implemented distinct image preprocessing, segmentation, and classification modules. Variants of U-Net, VGGNet, and residual net were commonly used in nodule segmentation, and transfer learning was used in most of the classification algorithms. Substantial performance variations in the public and final test sets were observed (Spearman correlation coefficient = .39 among the top 10 teams). To ensure the reproducibility of results, we generated a Docker container for each of the top solutions.
We compared the award-winning algorithms for lung cancer detection and generated reproducible Docker images for the top solutions. Although convolutional neural networks achieved reasonable accuracy, there remains substantial room for improvement in model generalizability.
Lung cancer is one of the most prevalent cancers worldwide, causing 1.76 million deaths per year [
Due to the improved performance of machine learning algorithms for radiology diagnosis, some developers have sought commercialization of their models. However, given the divergent software platforms, packages, and patches employed by different teams, their results were not easily reproducible. The difficulty in reusing the state-of-the-art models and reproducing the diagnostic performance markedly hindered further validation and applications.
To address this gap, we reimplemented, examined, and systematically compared the algorithms and software codes developed by the best-performing teams of the Kaggle Data Science Bowl. Specifically, we investigated all modules developed by the 10 award-winning teams, including their image preprocessing, segmentation, and classification algorithms. To ensure the reproducibility of results and the reusability of the developed modules, we generated a Docker image for each solution using the Docker Community Edition, a popular open-source software development platform that allows users to create self-contained systems with the desired version of software packages, patches, and environmental settings. According to Docker, there are over 6 million Dockerized applications, with 130 billion total downloads [
We obtained the low-dose chest CT datasets in Digital Imaging and Communications in Medicine (DICOM) format from the Kaggle Data Science Bowl website [
To systematically compare the solutions developed by the award-winning teams, we acquired the source codes of the winning solutions and their documentation from the Kaggle news release after the conclusion of the competition. Per the rules of this Kaggle challenge, the source codes of these award-winning solutions were required to be released under open-source licenses approved by the Open Source Initiative [
We compared the workflows of the top 10 solutions by examining and rerunning their source codes. For each solution, we inspected all steps taken from inputting the CT images to outputting the prediction. We documented the versions of the software packages and platform dependencies of each solution.
The Kaggle Data Science Bowl used the log-loss function to evaluate the performance of the models [
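For reference, the standard binary log-loss over n patients with ground truth labels y_i ∈ {0, 1} and predicted cancer probabilities ŷ_i is

\[ \mathrm{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right] \]

Lower scores are better, and predictions are conventionally clipped away from 0 and 1 so that a single confident mistake does not yield an infinite loss.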
To investigate whether models with high performance in the public test set generalize to the images in the final test set, we computed the Spearman correlation coefficient of the log-loss in the two test sets. All analyses were conducted using R version 3.6 (R Foundation for Statistical Computing).
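As a minimal illustrative sketch of these two computations (the study's analyses were performed in R; this Python version using NumPy and SciPy, two packages common to the teams' dependencies, is shown only for convenience, and the scores below are placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log-loss with probabilities clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Placeholder per-team log-loss scores on the two test sets
public_scores = np.array([0.401, 0.399, 0.416, 0.430, 0.427])
final_scores = np.array([0.400, 0.401, 0.416, 0.430, 0.428])
rho, p_value = spearmanr(public_scores, final_scores)
```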
We reproduced the results by recompiling the source codes and dependencies of each of the top 10 solutions. Because the solutions relied on various platforms and specific versions of customized software packages, many of which were incompatible with the latest mainstream releases, we generated Docker images [
The log-loss score distribution of the top 250 teams in the Kaggle Data Science Bowl Competition. The log-loss scores of the public test set and the final test set of each team were plotted. The red horizontal line indicates the log-loss of outputting the cancer probability as 0.5 for each patient. The blue horizontal line shows the log-loss of outputting cancer probability of each patient as the prevalence of cancer (0.26) in the training set.
A weak to moderate correlation between the log-loss scores of the public test set and the scores of the final test set. The red regression line shows the relation between the log-loss scores of the public test set and those of the final test set using a linear regression model. (A) The log-transformed scores of all participants who finished both stages of the Kaggle Data Science Bowl Competition were plotted. The Spearman correlation coefficient of the performance in the two test sets is .23. (B) The log-transformed scores of the top 10 teams defined by the final test set performance. The Spearman correlation coefficient among the top 10 teams is .39.
In addition to the training dataset provided by the Kaggle challenge, most teams used CT images and nodule annotations from other publicly available resources.
Frequently used image preprocessing steps include lung segmentation and voxel scaling. Voxel scaling ensures that the voxels of images from various CT scan protocols correspond to similar sizes of physical space. Variants of U-Net [
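As a minimal sketch of the voxel scaling step (assuming a NumPy volume and voxel spacing read from the DICOM headers; this mirrors the commonly used approach rather than any particular team's code):

```python
import numpy as np
from scipy import ndimage

def resample_to_isotropic(volume, spacing_mm, new_spacing_mm=(1.0, 1.0, 1.0)):
    """Resample a CT volume (z, y, x) so each voxel spans new_spacing_mm."""
    zoom = np.asarray(spacing_mm, float) / np.asarray(new_spacing_mm, float)
    return ndimage.zoom(volume, zoom, order=1)  # trilinear interpolation

# spacing_mm is typically assembled from the DICOM SliceThickness and
# PixelSpacing attributes of the scan series.
```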
After lung nodule segmentation, classification algorithms were employed to generate final cancer versus noncancer predictions. Most of the solutions leveraged existing ImageNet-based architecture and transfer learning [
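A minimal transfer learning sketch using the modern tf.keras API (for illustration only; the winning solutions employed a variety of 2D and 3D architectures and frameworks, not this exact model):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# Start from ImageNet-pretrained features and attach a binary cancer head.
base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained layers; optionally fine-tune later

model = models.Sequential([base, layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")
```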
A model of the informatics workflow used by most teams. In addition to the Kaggle training set, most teams obtained additional publicly available datasets with annotations. Lung segmentation, image rescaling, and nodule segmentation modules were commonly used before classification.
Comparisons of the top-performing solutions of the Kaggle Data Science Bowl.
Rank | Team name | Additional datasets used | Data preprocessing | Nodule segmentation | Classification algorithms | Implementation | Final test set score |
1 | Grt123 | LUNA16a | Lung segmentation, intensity normalization | Variant of U-Net | Neural network with a max-pooling layer and two fully connected layers | Pytorch | 0.39975 |
2 | Julian de Wit and Daniel Hammack | LUNA16, LIDCb | Rescale to 1×1×1 mm | C3Dc, ResNet-like CNNd | C3D, ResNet-like CNN | Keras, Tensorflow, Theano | 0.40117 |
3 | Aidence | LUNA16 | Rescale to 2.5×0.512×0.512 mm (for nodule detection) and 1.25×0.5×0.5 mm (for classification) | ResNete | 3D DenseNetf multitask model (different loss functions depending on the input source) | Tensorflow | 0.40127 |
4 | qfpxfd | LUNA16, SPIE-AAPMg | Lung segmentation | Faster R-CNNh, with 3D CNN for false positive reduction | 3D CNN inspired by VGGNet | Keras, Tensorflow, Caffe | 0.40183 |
5 | Pierre Fillard (Therapixel) | LUNA16 | Rescale to 0.625×0.625×0.625 mm, lung segmentation | 3D CNN inspired by VGGNet | 3D CNN inspired by VGGNet | Tensorflow | 0.40409 |
6 | MDai | None | Rescale to 1×1×1 mm, normalize HUi | 2D and 3D ResNet | 3D ResNet + an XGBoost classifier incorporating CNN outputs, patient sex, nodule counts, and other nodule features | Keras, Tensorflow, XGBoost | 0.41629 |
7 | DL Munich | LUNA16 | Rescale to 1×1×1 mm, lung segmentation | U-Net | 2D and 3D residual neural network | Tensorflow | 0.42751 |
8 | Alex, Andre, Gilberto, and Shize | LUNA16 | Rescale to 2×2×2 mm | Variant of U-Net | CNN and tree-based classifiers (the latter performed better) | Keras, Theano, XGBoost, ExtraTrees | 0.43019 |
9 | Deep Breath | LUNA16, SPIE-AAPMj | Lung mask | Variant of SegNet | Inception-ResNet v2 | Theano and Lasagne | 0.43872 |
10 | Owkin Team | LUNA16 | Lung segmentation | U-Net, 3D VGGNet | Gradient boosting | Keras, Tensorflow, XGBoost | 0.44068 |
aLUNA16: Lung Nodule Analysis 2016.
bLIDC: Lung Image Database Consortium.
cC3D: convolutional 3D.
dResNet-like CNN: residual net–like convolutional neural network.
eResNet: residual net.
fDenseNet: dense convolutional network.
gSPIE-AAPM: International Society for Optics and Photonics–American Association of Physicists in Medicine Lung CT Challenge.
hR-CNN: region-based convolutional neural networks.
iHU: Hounsfield unit.
jDataset has been evaluated but not used in building the final model.
A summary of the chest computed tomography datasets employed by the participants.
Datasets | Number of CTa scan series | Data originated from multiple sites | Availability of nodule locations | Availability of nodule segmentations | Availability of patients’ diagnoses (benign versus malignant) |
Kaggle Data Science Bowl (this competition) | Training: 1397; public test set: 198; final test set: 506 | Yes | No | No | Yes |
Lung Nodule Analysis 2016 (LUNA16) | 888 | Yes | Yes | Yes | Yes |
SPIE-AAPMb Lung CT Challenge | 70 | No | Yes | No | Yes |
Lung Image Database Consortium | 1398 | Yes | Yes | Yes | Yes |
aCT: computed tomography.
bSPIE-AAPM: International Society for Optics and Photonics–American Association of Physicists in Medicine.
Most of the winning teams developed their modules with Keras and Tensorflow. Pytorch (used by the top-performing team), Caffe, and Lasagne were each used by only one team. All of the top 10 teams employed a number of Python packages for scientific computing and image processing, including NumPy, SciPy, and Scikit-image (skimage). A summary of package dependencies is shown in
The dependencies most widely used by the top 10 teams. The packages are ordered by their prevalence among the top teams. For simplicity, dependencies used by only one team are omitted from the figure.
To facilitate reusing the code developed by the top teams, we generated a Docker image for each of the available solutions. Our developed Docker images are redistributed under the open-source licenses chosen by the original developers [
This is the first study to systematically compare the algorithms and implementations of award-winning pulmonary nodule classifiers. Results showed that the majority of the best-performing solutions used additional datasets to train their pulmonary nodule segmentation models. The top solutions employed different data preprocessing, segmentation, and classification algorithms; nonetheless, their final test set scores differed only slightly.
The most commonly used data preprocessing steps were lung segmentation and voxel scaling [
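A simplified sketch of threshold-based lung segmentation (assuming air-filled lung voxels fall below roughly -320 HU; the actual solutions added refinements such as border clearing, hole filling, and morphological dilation around the lung wall):

```python
import numpy as np
from scipy import ndimage

def lung_mask(volume_hu, threshold=-320):
    """Rough lung mask: threshold air-like voxels, then keep the two largest
    connected components that are not the air surrounding the body."""
    binary = volume_hu < threshold
    labels, _ = ndimage.label(binary)
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                      # label 0 = voxels above the threshold
    sizes[labels[0, 0, 0]] = 0        # corner component = air outside the body
    keep = np.argsort(sizes)[-2:]     # two largest remaining components (lungs)
    return np.isin(labels, keep[sizes[keep] > 0])
```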
To enhance the reproducibility of the developed modules, we generated a Docker image for each of the award-winning solutions. The Docker images contain all software dependencies and patches required by the source codes and are portable to various computing environments [
Since it was difficult to compile and release a large deidentified chest CT dataset to the public, the public test set contained images from only 198 patients. Leveraging the 5-digit precision of the log-loss value shown on the leaderboard, one participant implemented and shared a method for identifying all ground truth labels in the public test set during the competition [
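The underlying arithmetic is straightforward to illustrate (the sketch below demonstrates the general probing principle only and is not the participant's actual code). Submitting a baseline in which every patient receives the same probability, and then a probe that changes a single patient's prediction, shifts the reported score by an amount that depends on that patient's hidden label:

```python
import numpy as np

def infer_label(baseline_score, probe_score, n, p_base, p_probe):
    """Recover one patient's label from two leaderboard scores.

    baseline: all n patients predicted as p_base; probe: identical except one
    patient predicted as p_probe. The observed score shift (times n) matches
    one of two candidate per-patient loss shifts, revealing the label.
    """
    observed = (probe_score - baseline_score) * n
    shift_if_cancer = -np.log(p_probe) + np.log(p_base)
    shift_if_benign = -np.log(1 - p_probe) + np.log(1 - p_base)
    return int(abs(observed - shift_if_cancer) < abs(observed - shift_if_benign))
```

With p_base=0.5 and p_probe=0.99, for instance, the two candidate shifts differ by about 4.6 nats of per-patient loss (roughly 0.023 in the aggregate score for n=198), far above the leaderboard's 10^-5 resolution.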
There are several approaches future contest organizers can take to ensure the generalizability of the developed models. First, a multistage competition can filter out the overfitted models using the first private test set and only allow reasonable models to advance to the final evaluation. In addition, organizers can discourage leaderboard probing by only showing the performance of a random subset of the public test data or limiting the number of submissions allowed per day. Finally, curating a larger test set can better evaluate the true model performance and reduce random variability [
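To illustrate the last point, a toy simulation (with an entirely hypothetical model and made-up parameters) shows how the spread of log-loss estimates narrows as the test set grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def logloss_sd(n, n_sim=2000, prevalence=0.26):
    """Standard deviation of log-loss across simulated test sets of size n."""
    scores = []
    for _ in range(n_sim):
        y = (rng.random(n) < prevalence).astype(float)
        # hypothetical noisy model that tends to score true cancers higher
        p = np.clip(0.25 + 0.35 * y + rng.normal(0, 0.15, n), 0.01, 0.99)
        scores.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    return float(np.std(scores))

for n in (198, 506, 2000):
    print(n, round(logloss_sd(n), 4))  # the spread shrinks roughly as 1/sqrt(n)
```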
In summary, we compared, reproduced, and Dockerized state-of-the-art pulmonary nodule segmentation and classification modules. Results showed that many transfer learning approaches achieved reasonable accuracy in diagnosing chest CT images. Future work on additional data collection and validation will further enhance the generalizability of the current methods.
AAPM: American Association of Physicists in Medicine
CNN: convolutional neural network
CT: computed tomography
LUNA16: Lung Nodule Analysis 2016
ResNet: residual net
skimage: Scikit-image
SPIE: International Society for Optics and Photonics
The authors express their appreciation to Dr Steven Seltzer for his feedback on the manuscript; Drs Shann-Ching Chen, Albert Tsung-Ying Ho, and Luke Kung for identifying the data resources; Dr Mu-Hung Tsai for pointing out the computing resources; and Ms Samantha Lemos and Nichole Parker for their administrative support. K-HY is a Harvard Data Science Fellow. This work was supported in part by the Blavatnik Center for Computational Biomedicine Award and grants from the Office of the Director, National Institutes of Health (grant number OT3OD025466), and the Ministry of Science and Technology Research Grant, Taiwan (grant numbers MOST 103-2221-E-006-254-MY2 and MOST 103-2221-E-168-019). The authors thank the Amazon Web Services Cloud Credits for Research, Microsoft Azure Research Award, and the NVIDIA Corporation for their support on the computational infrastructure. This work used the Extreme Science and Engineering Discovery Environment Bridges Pylon at the Pittsburgh Supercomputing Center (through allocation TG-BCS180016), which is supported by the National Science Foundation (grant number ACI-1548562).
None declared.