This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Recent years have witnessed a substantial improvement in the accuracy of skin cancer classification using convolutional neural networks (CNNs). CNNs perform on par with, or better than, dermatologists in single-image classification tasks. However, in clinical practice, dermatologists also use other patient data beyond the visual aspects present in a digitized image, further increasing their diagnostic accuracy. Several pilot studies have recently investigated the effects of integrating different subtypes of patient data into CNN-based skin cancer classifiers.
This systematic review focuses on the current research investigating the impact of merging information from image features and patient data on the performance of CNN-based skin cancer image classification. This study aims to explore the potential in this field of research by evaluating the types of patient data used, the ways in which the nonimage data are encoded and merged with the image features, and the impact of the integration on the classifier performance.
Google Scholar, PubMed, MEDLINE, and ScienceDirect were screened for peer-reviewed studies published in English that dealt with the integration of patient data within a CNN-based skin cancer classification. The search terms
A total of 11 publications fulfilled the inclusion criteria. All of them reported an overall improvement in different skin lesion classification tasks with patient data integration. The most commonly used patient data were age, sex, and lesion location. The patient data were mostly one-hot encoded. The studies differed in the complexity of the deep learning methods applied to the encoded patient data before and after fusion with the image features in the combined classifier.
This study indicates the potential benefits of integrating patient data into CNN-based diagnostic algorithms. However, how exactly the individual patient data enhance classification performance, especially in the case of multiclass classification problems, is still unclear. Moreover, a substantial fraction of patient data used by dermatologists remains to be analyzed in the context of CNN-based skin cancer classification. Further exploratory analyses in this promising field may optimize patient data integration into CNN-based skin cancer diagnostics for patients’ benefits.
The incidence of skin cancer has been increasing throughout the world, resulting in substantial health and economic burdens [
However, single-image classification does not reflect the clinical reality. In fact, dermatologists’ diagnoses are based on both the visual inspection of a single image and the integration of information from various sources.
An overview of patient data considered by dermatologists while diagnosing skin lesions. The framed characteristics in the figure illustrate the fraction of patient data that can potentially be recognized by convolutional neural networks from a single image input. UVR: ultraviolet radiation.
This review presents the status quo of CNN-based skin lesion classification using image input and patient data. The included studies were analyzed with respect to the amount and type of patient data used for integration, the encoding and fusing techniques, and the reported results. The review also discusses the heterogeneity of the studies that have been conducted so far and points out the potential and challenges of such combined classifiers that should be addressed in the future.
Google Scholar, PubMed, MEDLINE, and ScienceDirect were searched for peer-reviewed publications, restricted to human research published in English. The search terms
This review only includes skin lesion classification studies using CNNs that consider both image and patient data. It must be noted that a few studies investigated the effect of incorporating visual and nonvisual information on skin cancer classification but did not obtain visual features using deep learning techniques, for example, the studies by Binder et al [
The objective of this review is to update practitioners on the status quo approaches toward patient data incorporation into CNN-based skin lesion diagnostics regarding all relevant practical aspects.
The goal is to achieve better performance of the CNN-based classifier by integrating new information that cannot be extracted from a digitized image. Various types of patient data have been shown to assist dermatologists.
A CNN-based classifier extracts various visual features from a digitized image as the basis for its diagnosis. Patient data are nonimage data and are mostly provided as numbers or strings in tables. The patient data can be classified in a dichotomous fashion (presence of the feature: yes or no), fall into several discrete categories (eg, Fitzpatrick skin type), or be continuous (eg, patient age). This may require different, carefully chosen encoding and fusing techniques. Moreover, the weight attributed to patient data in comparison with image features can strongly influence how the different features contribute toward the decision making of the system.
This review aims to summarize the recent findings regarding the impact of patient data on the performance of CNN-based classifiers.
The included publications reported different statistical metrics as study end points. If the classes in the test set are approximately equally distributed, accuracy is a frequently used performance metric: the number of correctly predicted samples divided by the total number of samples in the test set. In binary classification problems with a positive and a negative class, sensitivity and specificity are further common study end points, especially if there is an imbalance between the samples of both classes. Sensitivity is determined only on the basis of the actual positive samples in the test set; it is calculated by dividing the number of correctly classified positive samples by the total number of positive samples. In contrast, specificity is determined based on the actual negative samples in the test set; here, the number of correctly classified negative samples is divided by the total number of negative samples. When using a CNN, sensitivity and specificity depend on the selected cutoff value. If the output of the neural network is greater than the cutoff value, the input is assigned to the positive class; if it is below that value, the input is assigned to the negative class. This value therefore represents a central parameter for the trade-off between sensitivity and specificity: a decrease in the cutoff value leads to an increase in sensitivity with a simultaneous decrease in specificity, and vice versa. The dependence of sensitivity and specificity on the cutoff value is shown in the receiver operating characteristic curve, in which the sensitivity is plotted against the false-positive rate (1−specificity) for each possible cutoff value. The area under the receiver operating characteristic curve serves as an integral performance measure for the algorithms.
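The cutoff-dependent trade-off described above can be sketched in a few lines; the labels and network outputs below are made-up illustrative values, not data from any of the reviewed studies:

```python
import numpy as np

def sensitivity_specificity(y_true, scores, cutoff):
    """Sensitivity and specificity of binary predictions at a given cutoff.
    y_true: 0/1 ground-truth labels (1 = positive class); scores: network outputs."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= cutoff  # outputs at or above the cutoff -> positive class
    sens = (pred & (y_true == 1)).sum() / (y_true == 1).sum()
    spec = (~pred & (y_true == 0)).sum() / (y_true == 0).sum()
    return float(sens), float(spec)

labels = [1, 1, 1, 1, 0, 0, 0, 0]                   # illustrative ground truth
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]   # illustrative network outputs
print(sensitivity_specificity(labels, scores, 0.5))   # (0.75, 0.75)
print(sensitivity_specificity(labels, scores, 0.35))  # lower cutoff: sensitivity rises to 1.0
```

Sweeping the cutoff over all possible values and plotting sensitivity against 1−specificity yields the receiver operating characteristic curve described above.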
A total of 11 publications fulfilling the inclusion criteria are summarized in
Summary table.
Study | Patient data types | Result (without/with) | Classification task | CNNa architecture | Data set | Samples, n
Bonechi et al [ | 4 types: age, sex, location, and presence of melanocytic cells | Accuracy: 0.8344/0.8834 | Binary: benign or malignant (MELb, BCCc, SCCd) | ResNet50 | ISICe | 5405
Chin et al [ | 5 types: age; sex; size; how long it existed; changes in size, color, or shape including bleeding and itching | Accuracy: 0.84/0.92 | Binary: low risk or high risk for MEL | DenseNet121 | Own | 5289
Gonzalez-Diaz [ | 2 types: age and sex | Accuracy: 0.848/0.859 | Binary: MEL yes or no | ResNet50 | 2017 ISBIf challenge+interactive atlas of dermoscopy [ | 6302
Gessert et al [ | 3 types: age, sex, and location | Sensitivity: 0.725/0.742; specificity: data not available | 8 classes: MEL, NVg, BCC, AKh, BKLi, DFj, VASCk, SCC | EfficientNets | ISIC (HAM10000 [ | 27,665
Kawahara et al [ | 3 types: sex, location, and elevation | Sensitivity: 0.527/0.604; specificity: 0.902/0.910 | 5 classes: MEL, BCC, NV, MISCl, SKm | Inception V3 | 7-point data set | 808
Kharazmi et al [ | 5 types: age, sex, location, size, and elevation | Accuracy: 0.847/0.911 | Binary: BCC yes or no | Convolutional filters of learned kernel weights from a sparse autoencoder | Own | 1199
Li et al [ | 3 types: age, sex, and location | Sensitivity: 0.8544/0.8764; specificity: data not available | 7 classes: NV, MEL, BKL, BCC, AKIECn, VASC, DF | SENet154 | ISIC 2018 data set | 10,015
Pacheco and Krohling [ | 8 types: age, location, lesion itches, bleeds or has bled, pain, recently increased, changed its pattern, and elevation | Accuracy: 0.671/0.788 | 6 classes: BCC, SCC, AK, SK, MEL, NV | ResNet50 | Own | 1612
Ruiz-Castilla et al [ | 3 types: age, sex, and size | Accuracy: 0.61/0.85 | Binary: MEL yes or no | Shallow network with 2 convolutional layers | ISIC | 300
Sriwong et al [ | 3 types: age, sex, and location | Accuracy: 0.7929/0.8039 | 7 classes: AKIEC, BCC, BKL, DF, MEL, NV, VASC | AlexNet | HAM10000 | 16,720
Yap et al [ | 3 types: age, sex, and location | Mean average precision: 0.726/0.729; accuracy: 0.721/0.720 | 5 classes: BCC, SCC, MEL, BKL, NV | ResNet50 | ILSVRCo 2015 [ | 2917 (only testing)
aCNN: convolutional neural network (most of the studies had the goal of investigating the usefulness of the presented fusion technique independently of the convolutional neural network architecture and, therefore, often showed the performance of the fusion with multiple architectures; we included only the best-performing architecture).
bMEL: melanoma.
cBCC: basal cell carcinoma.
dSCC: squamous cell carcinoma.
eISIC: International Skin Imaging Collaboration.
fISBI: International Symposium on Biomedical Imaging [
gNV: melanocytic nevus.
hAK: actinic keratosis.
iBKL: benign keratosis-like lesion.
jDF: dermatofibroma.
kVASC: vascular lesion.
lMISC: summary of dermatofibroma, lentigo, melanosis, miscellaneous, and vascular lesion.
mSK: seborrheic keratosis.
nAKIEC: actinic keratosis and intraepithelial carcinoma.
oILSVRC: ImageNet Large Scale Visual Recognition Challenge.
Most of the studies included three types of patient data (7/11, 64%). Compared with the diversity of potentially useful patient data illustrated in
In most cases, the patient data were one-hot encoded. One-hot encoding represents each of several discrete classes as a string of bits in which exactly one bit, the one corresponding to the encoded class, is set to 1 and all others are set to 0 (eg,
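Such a one-hot encoding of categorical patient data can be sketched as follows; the category lists are hypothetical illustrations, not taken from a specific study:

```python
def one_hot(value, categories):
    """Encode `value` as a bit string with a 1 only at its category position."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical category lists for illustration
LOCATIONS = ["head/neck", "trunk", "upper extremity", "lower extremity"]
SEXES = ["female", "male"]

# Encoded patient data for a lesion on the trunk of a male patient
encoded = one_hot("trunk", LOCATIONS) + one_hot("male", SEXES)
print(encoded)  # [0, 1, 0, 0, 0, 1]
```

Continuous data such as age are handled differently, for example, by normalization or by binning into discrete categories that can then be one-hot encoded.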
As patient data are rarely documented in a standardized way, handling missing values is an essential capability of such an algorithm. However, only 18% (2/11) of publications went into detail on how they dealt with missing values. Gessert et al [
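One common way to handle a missing value under one-hot encoding, sketched here as an illustration rather than as the method of any particular included study, is to emit an all-zero vector so that no spurious category signal reaches the network:

```python
def one_hot_with_missing(value, categories):
    """One-hot encode `value`; a missing value (None) yields an all-zero vector."""
    vec = [0] * len(categories)
    if value is not None and value in categories:
        vec[categories.index(value)] = 1
    return vec

print(one_hot_with_missing("male", ["female", "male"]))  # [0, 1]
print(one_hot_with_missing(None, ["female", "male"]))    # [0, 0]
```

An alternative is to add an explicit "unknown" category so that missingness itself becomes a learnable feature.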
Overview of the different fusing techniques in the main function blocks of the combined classifier. CNN: convolutional neural network.
The fusing techniques differ in the way they weight the image and patient data. In 82% (9/11) of studies, a concatenation-based fusion was applied; that is, the feature vector extracted from the images was enlarged by attaching the encoded patient data. In this case, weighting is achieved by defining the ratio between the number of features originating from the image input and from the patient data. Common CNN architectures extract 1024, 2048, or even more features from the image input. In most studies, the authors reduced the number of image features before concatenating them with the patient data. Only 27% (3/11) of studies provided sufficient information on this point, revealing considerable variance in the ratio of image features to patient data: 112 to 28 [
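Concatenation-based fusion can be sketched as follows. The layer sizes (2048 image features reduced to 112 and fused with 28 encoded patient-data values) mirror one of the ratios reported above; the random weights are placeholders for parameters that would be learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)
W_reduce = rng.normal(size=(2048, 112)) * 0.01  # reduces the CNN image features
W_head = rng.normal(size=(112 + 28, 8)) * 0.01  # classification head on the fused vector

def fuse_and_classify(img_features, patient_data):
    reduced = np.maximum(img_features @ W_reduce, 0.0)       # ReLU on reduced image features
    fused = np.concatenate([reduced, patient_data], axis=1)  # concatenation-based fusion
    return fused @ W_head                                    # logits for 8 lesion classes

# Batch of 4 samples: raw CNN features plus encoded patient data
logits = fuse_and_classify(rng.normal(size=(4, 2048)), rng.normal(size=(4, 28)))
print(logits.shape)  # (4, 8)
```

The reduction ratio directly controls how strongly the 28 patient-data values can influence the fused representation relative to the image features.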
In addition, the studies vary in the extent to which deep learning methods were applied to the patient data before fusing or on the combined feature vector after fusing it with the image data. Sriwong et al [
As summarized in
Although 5 studies reported results for binary classification tasks, 55% (6/11) of studies dealt with a multiclass classification problem, distinguishing between up to 8 different skin diseases, and revealed insights on how the use of patient data influences the classification performance for an individual type of skin lesion.
Influence of included patient data on the classification performance of the single skin diseases or lesionsa.
Study, patient data, and metric | MELb | NVc | BCCd | SCCe | AKf | AKIECg | BKLh | DFi | VASCj | MISCk | SKl

AUCm | +n | (+/−)o | −p | − | + | Xq | + | + | − | X | X
Sensitivity | − | − | − | − | − | X | − | − | − | X | X
Specificity | + | + | + | + | + | X | + | + | + | X | X

Sensitivity | + | − | + | X | X | − | + | + | − | X | X
Specificity | − | + | − | X | X | + | + | +/− | + | X | X

Sensitivity | − | − | + | X | X | − | + | + | + | X | X

Sensitivity | + | + | + | X | X | X | X | X | X | + | +
Specificity | + | + | +/− | X | X | X | X | X | X | + | +

Sensitivity | + | + | + | + | + | X | X | X | X | X | +
Specificity | + | + | + | − | + | X | X | X | X | X | +
aThe study of Yap et al [
bMEL: melanoma.
cNV: Melanocytic nevus.
dBCC: basal cell carcinoma.
eSCC: squamous cell carcinoma.
fAK: Actinic keratosis.
gAKIEC: actinic keratosis and intraepithelial carcinoma.
hBKL: benign keratosis-like lesion.
iDF: dermatofibroma.
jVASC: vascular lesion.
kMISC: miscellaneous and vascular lesion.
lSK: seborrheic keratosis.
mAUC: area under the curve.
nIndicates improvement compared with classification performance without patient data.
oIndicates no change compared with classification performance without patient data.
pIndicates degradation compared with classification performance without patient data.
qIndicates that the lesion type was not considered in the classification task of the study.
In total, 36% (4/11) of studies analyzed the influence of the used patient data on the classification performance in a more differentiated way. They showed the impact of either individual patient data or special combinations of patient data on classification performance, thereby providing a more detailed insight into the contribution of individual patient data.
Uniquely among the included studies, Pacheco and Krohling [
Li et al [
Sriwong et al [
Bonechi et al [
Although the main evidence for a good diagnosis is still provided by the image input, all 11 publications indicate a possible benefit of integrating patient data in CNN classifiers, as illustrated in
One focus of further research into combined CNN-based classifiers should be to render its classification process transparent, easy to understand, and applicable in a clinical setting. The 11 studies published so far have dealt with these aspects only marginally. Therefore, these issues need to be addressed in future studies to reliably reveal the potential of integrating patient data.
No objective benchmarks exist in the field of integrating patient data into CNN-based classifiers. The heterogeneity of the studies conducted so far is substantial. This applies to the number and types of skin diseases or lesions to be classified, databases and data augmentation, CNN architectures, patient data, and fusion techniques. These aspects have a great influence on the way that the algorithm learns to diagnose the lesions in question and render it very difficult to reproduce and compare the approaches and results externally and independently. A way to solve this would be the more extensive use of external and publicly available data sets to objectively optimize the classification accuracy in an experimental setting. This needs to be done systematically in preparation for clinical trials that will be required to prove the algorithm’s generalizability and applicability in the clinic. In addition, the best way to handle missing data needs to be addressed.
All presented studies lack an investigation of the impact of patient data individually and in combination on single-lesion classes. Both the fusion method and weight attributed to the patient data in addition to the biological significance itself may substantially influence the classification results. Further research should be dedicated to explaining the mechanisms by which the incorporation of these factors contributes to the decision making of the CNN-based combined classifier to render the results more transparent.
As shown in
All 11 studies published so far indicate that the integration of patient data into CNN-based skin lesion classifiers may improve classification accuracy. The studies mainly used patient data that were routinely recorded (age, sex, and lesion location). Regarding the technical details, the main differences in the presented approaches occur in the fusing techniques. Further research should be dedicated to systematically evaluating the impact of incorporation of individual and combined patient data into CNN-based classifiers to show its benefit reproducibly and transparently and to pave the way for the translation of these combined classifiers into the clinic.
Relevant references for the overview of the patient data illustrated.
JH, A Heckler, EKH, and TJB contributed to the concept and design of the study. JH identified and analyzed studies with contributions from A Heckler and EKH. TJB oversaw the study, critically reviewed and edited the manuscript, and gave final approval. JNK, JSU, FM, FFG, A Hauschild, LF, JGS, KG, TW, HK, MH, SH, WS, DS, BS, RCM, MS, TJ, SF, and DBL substantially contributed to the conception and design, provided critical review and commentary on the draft manuscript, and approved the final version. All of the authors guaranteed the integrity and accuracy of this study. This research is funded by the Federal Ministry of Health in Germany (Skin Classification Project; grant holder: TJB).
A Hauschild reports clinical trial support, speaker’s honoraria, and consultancy fees from the following companies: Amgen, Bristol Myers Squibb (BMS), Merck Serono, Merck Sharp & Dohme (MSD), Novartis, OncoSec, Philogen, Pierre Fabre, Provectus, Regeneron, Roche, Sanofi Genzyme, and Sun Pharma (outside the submitted work). BS reports advisory roles for or has received honoraria from Pierre Fabre Pharmaceuticals, Incyte, Novartis, Roche, BMS, and MSD; research funding from BMS, Pierre Fabre Pharmaceuticals and MSD; and travel support from Novartis, Roche, BMS, Pierre Fabre Pharmaceuticals, and Amgen outside the submitted work. FM has received travel support, speaker’s fees, and/or advisor’s honoraria from Novartis, Roche, BMS, MSD, and Pierre Fabre and research funding from Novartis and Roche outside the submitted work. JSU is on the advisory board or has received honoraria and travel support from Amgen, Bristol Myers Squibb, GSK, LeoPharma, Merck Sharp and Dohme, Novartis, Pierre Fabre, and Roche outside the submitted work. SH reports advisory roles for or has received honoraria from Pierre Fabre Pharmaceuticals, Novartis, Roche, BMS, Amgen, and MSD outside the submitted work. TJB reports owning a company that develops mobile apps (Smart Health Heidelberg GmbH, Handschuhsheimer Landstr. 9/1, 69120 Heidelberg). WS received travel expenses for attending meetings and/or (speaker) honoraria from Abbvie, Almirall, Bristol Myers Squibb, Celgene, Janssen, LEO Pharma, Lilly, MSD, Novartis, Pfizer, Roche, Sanofi Genzyme, and UCB outside the submitted work.