This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Machine learning algorithms have been drawing attention at the intersection of pathology and radiology in prostate cancer research. However, because of the complexity of their learning procedures and the variability of their architectures, there is an ongoing need to analyze their performance.
This study assesses the sources of heterogeneity and the performance of machine learning algorithms applied to radiomic, genomic, and clinical biomarkers for the diagnosis of prostate cancer. A particular focus of this study was to clearly identify problems and issues related to the implementation of machine learning in clinical studies.
Following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) protocol, 816 unique titles were identified from the PubMed, Scopus, and OvidSP databases. Studies that used machine learning to detect prostate cancer and provided performance measures were included in our analysis. The quality of the eligible studies was assessed using the QUADAS-2 (quality assessment of diagnostic accuracy studies–version 2) tool. A hierarchical multivariate model was applied to the pooled data in a meta-analysis. To investigate the heterogeneity among studies, subgroup analyses were performed on covariates identified during the review process.
In the final analysis, 37 studies were included, of which 29 entered the meta-analysis pooling. Our analysis of machine learning methods to detect prostate cancer reveals limited usage of these methods and a lack of standards, both of which hinder the implementation of machine learning in clinical applications.
The performance of machine learning for diagnosis of prostate cancer was considered satisfactory for several studies investigating the multiparametric magnetic resonance imaging and urine biomarkers; however, given the limitations indicated in our study, further studies are warranted to extend the potential use of machine learning to clinical settings. Recommendations on the use of machine learning techniques were also provided to help researchers to design robust studies to facilitate evidence generation from the use of radiomic and genomic biomarkers.
Prostate cancer (PCa) is the second most diagnosed cancer worldwide in men [
More recently, multiparametric magnetic resonance imaging (mpMRI) has been demonstrated to be a better radiomic biomarker than systematic transrectal ultrasonography (TRUS) biopsy, achieving high diagnostic accuracy and becoming a routine clinical investigation for patients with suspected PCa [
Alongside radiomic investigation, there are numerous Food and Drug Administration–approved genomic biomarkers underlying the biomolecular functions most strongly associated with clinical outcomes. In fact, a major focus of personalized medicine has been the biomolecular characterization of tumors by integrating genomics into clinical oncology to identify unique druggable targets and generate higher-order tumor classification methods that can support clinical treatment decisions [
Over the last decade, the landscape for PCa detection tools has expanded to include novel biomarkers, clinical information, genomic assays, and noninvasive imaging tests. The prospect of detecting PCa using readily available clinical and demographic health information is a potentially innovative part of improving screening practices [
In this scenario, machine learning (ML) is helping researchers identify and discover new biomarkers to detect PCa. ML is a branch of artificial intelligence (AI) in which algorithms are developed and trained to learn from data and make predictions. ML methods are able to improve and learn over time in a more efficient way than classical statistical approaches [
Therefore, this study aimed to suggest an integrated estimate of the accuracy for use of ML algorithms in detecting PCa through a systematic review and meta-analysis of the available studies. Due to the internal heterogeneity of ML algorithms, subgroup analyses helped in investigating the diagnostic capability of ML systems and highlighting the sources of bias and common pitfalls to avoid in order to assure reproducibility among studies. Subgroup analyses were mainly based on the model choice, model development, and validation methods to identify potential covariates that could influence the diagnostic performance of ML.
This review helps to support ML studies in moving up the pyramid of evidence; to that end, we identify and discuss recurrent factors that hinder the uptake of these studies in clinical settings.
To the best of the authors’ knowledge, no systematic review and meta-analysis has evaluated the performance and current status of existing approaches for PCa detection. Therefore, this study aims to fill this gap in the literature and gather recommendations on ML model development for achieving robust results in the automatic detection of PCa.
We conducted and reported this meta-analysis in accordance with PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [
The PubMed, Scopus, and OvidSP (ie, Embase) databases were searched to identify studies evaluating the accuracy of radiomic, clinical, and genomic biomarkers in the diagnosis of PCa. The following criteria were used to limit the research: papers published in the last 5 years (from 2015 to 2020) to guarantee homogeneity among radiomic studies, as the new protocol (PI-RADS) for mpMRI was updated in 2015 [
An author (RC) retrieved the initial search results and removed duplicates via Excel (Microsoft). Subsequently, another author (MF) manually searched for and removed any remaining duplicates. Finally, RC and MF independently screened the studies by title, abstract, and keywords, after which the full texts of the selected studies were assessed by inclusion and exclusion criteria. The main considerations for study inclusion were if machine learning was fully applied in distinguishing individuals or lesions with clinically diagnosed PCa from controls and if the study assessed the accuracy of such applications. Detailed inclusion and exclusion criteria are reported in Table S2 in
After the evaluation was completed, two authors extracted the following information from the selected literature: literature data—the first author, publication date, study population, number of patients, study design, and data collection; basic research information—age, Gleason score, and PSA level, where possible; information regarding the reference standard used in individual studies; definitions of positive and negative PCa (PCa positive and control) and methodologies to distinguish individuals or lesions with PCa from the control group; specific methodologies to process and classify data for use in machine learning algorithms; and the sensitivity, specificity, and, if available, true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) rates.
The authors independently graded the quality of the eligible studies using the quality assessment of diagnostic accuracy studies–version 2 (QUADAS-2) tool [
For radiomic analysis, due to the very low number of included studies investigating central gland and transition zone (TZ) prostate tumors, only studies investigating the peripheral zone were included in the meta-analysis. This was also due to the fact that central gland and TZ prostate tumors have significantly different quantitative imaging signatures [
Due to the low number of studies employing 3D volumes of interest (VOIs) to extract quantitative features, only studies delineating 2D regions of interest (ROIs) were included in the meta-analysis to reduce the risk of bias. This was mainly due to the fact that significant differences were found between prediction performance when using 3D VOIs and that when using 2D ROIs [
To reduce heterogeneity among the selected studies, subgroup analyses were carried out for radiomic and genomic studies due to their intrinsic differences in data acquisition, analysis, and feature extraction. Radiomic subgroup analyses helped to investigate the role of the mpMRI biomarker in detecting PCa via ML, whereas genomic subgroup analyses were carried out to understand the role of genomic biomarkers in detecting PCa via ML.
Several covariates suitable for subgroup analysis were identified during the review process where the individual peculiarities of the studies, which may affect the outcome, were investigated.
The included studies were examined to determine whether they explored a patient- or lesion-based model; which validation approach was used (cross-validation, hold-out approach or external validation, or no validation); which family of ML algorithms was used (regression-based models, tree-based models, or deep learning algorithms); whether a deep learning (DL) or classical ML approach was used; and whether the employed data set was balanced or unbalanced. For genomic studies, the use of different specimens (ie, urine, serum, semen, and tissue) was also investigated in a subgroup analysis. One study [
When a study investigated multiple ML algorithms, only the method achieving the highest area under the curve (AUC) was included in the meta-analysis, as the AUC is a good overall estimator of ML performance.
This meta-analysis was conducted via the Open Meta-Analyst software tool, and statistical significance was expressed with 95% CIs. Pooled estimates for sensitivity and specificity with the corresponding 95% CIs were used to determine the accuracy of machine learning for detecting PCa in radiomic and genomic studies. From these data, we generated a hierarchical summary receiver operating characteristic curve (HSROC) and coupled forest plots using a random-effects model. Heterogeneity among studies was assessed by calculating the inconsistency index (I²).
In our meta-analysis, a multivariate random-effects model was used to consider both within- and between-subject variability and threshold effects [
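The pooling step described above can be illustrated in miniature. The snippet below is a pure-Python sketch (the function name `logit_pool` and the continuity correction are ours, and this is not the Open Meta-Analyst implementation): it pools per-study proportions such as sensitivities on the logit scale with a DerSimonian-Laird random-effects model and reports the inconsistency index I².

```python
import math

def logit_pool(events, totals):
    """DerSimonian-Laird random-effects pooling of proportions on the logit scale.

    events/totals: per-study counts, e.g. TP and (TP + FN) for sensitivity.
    Returns (pooled_proportion, I2_percent).
    """
    # Per-study logit effect sizes and within-study variances
    # (a 0.5 continuity correction guards against zero cells).
    y, v = [], []
    for e, n in zip(events, totals):
        a, b = e + 0.5, n - e + 0.5
        y.append(math.log(a / b))
        v.append(1.0 / a + 1.0 / b)

    w = [1.0 / vi for vi in v]                                 # fixed-effect weights
    y_fe = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fe) ** 2 for wi, yi in zip(w, y))     # Cochran's Q
    k = len(y)
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0   # inconsistency index

    # Between-study variance (tau^2) and random-effects weights.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_re = [1.0 / (vi + tau2) for vi in v]
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)

    # Inverse-logit back-transform to a proportion.
    return 1.0 / (1.0 + math.exp(-y_re)), i2
```

A full bivariate or HSROC model additionally models the correlation between sensitivity and specificity across thresholds; this univariate sketch only shows where the pooled estimate and I² come from.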
According to the search strategy described above, 877 titles were identified in PubMed, Scopus, and OvidSP. After removing duplicates, 816 titles were considered. Of these, 708 were excluded after reading of the abstracts because they did not meet the inclusion criteria. From the remaining 108 full-text articles, 71 were removed due to the exclusion criteria. Finally, 37 full texts were included in the qualitative analysis, and 29 studies were considered appropriate for inclusion in the meta-analysis. A flowchart of the literature search is shown in
The distribution of the risk of bias evaluated via the QUADAS-2 tool for the included studies is presented in the supplementary materials (Figure S1 in
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flowchart of literature search: included/excluded titles, abstracts, and full papers. ML: machine learning; MRI: magnetic resonance imaging; PCa: prostate cancer; TZ: transition zone; VOI: volume of interest.
The publication years ranged from 2015 to 2020 to guarantee homogeneity among radiomic studies, as the new PI-RADS was updated in 2015 [
Characteristics of 37 studies included in the systematic review.
Characteristics | Studies, n | Patients (average over the number of studies), n
Study design |  |
Prospective | 8 | 2210 (276.25)
Retrospective | 29 | 6414 (221.17)
Data set |  |
Private data set | 33 | 7760 (235.15)
Public database (SPIE-AAPM-NCIa PROSTATEx challenge) | 2 | 399 (199.5)
Mixed (private and public) data set | 2 | 465 (232.5)
Machine learning algorithm |  |
Random forest | 4 | 1621 (405.25)
Regression-based models | 20 | 4678 (233.9)
Partial least squares discriminant analysis (PLS-DA) | 2 | 180 (90)
Linear discriminant analysis (LDA) | 1 | 53
Support vector machine (SVM) | 2 | 65 (32.5)
Classification and regression tree (CART) | 1 | 67
Artificial neural networks (ANNs) | 2 | 1012 (506)
Deep neural networks (DNNs) | 1 | 195
Convolutional neural networks (CNNs) | 3 | 696 (232)
Deep learning: SNCSAEb | 1 | 57
Predictor |  |
Multiparametric MRIc | 20 | 5058 (252.9)
Genomic biomarkers | 13 | 3132 (240.92)
    Urine | 6 | 930 (155)
    Serum | 3 | 901 (300.3)
    Semen | 2 | 108 (54)
    Tissue | 2 | 800 (400)
Clinical data | 4 | 2812 (703)
Validation |  |
Internal validation | 29 | 6540 (225.52)
External validation | 3 | 1380 (460)
Internal and external validation | 1 | 364
Unknown | 5 | 704 (140.8)
aSPIE-AAPM-NCI: International Society for Optics and Photonics–American Association of Physicists in Medicine–National Cancer Institute.
bSNCSAE: stacked nonnegativity constraint sparse autoencoders.
cMRI: magnetic resonance imaging.
Of the final 37 papers, 29 were considered for the meta-analysis. Eight studies were excluded to reduce heterogeneity among the studies. Of those, 2 studies were excluded because they extracted radiomic features from VOIs [
Studies [
All the included studies for the radiomic analysis are reported in
Multivariate meta-analysis via the HSROC model was assessed for all the studies (Figure S2 in
The calculated heterogeneity values for pooled sensitivity and specificity were 84% and 79% (
To resolve the heterogeneity, subgroup analysis was conducted for different covariates. The subgroup analysis per model-based covariate is shown in
Accuracy measures of radiomic studies for the systematic review.
Study, year | Model basisa | Patients, n | Total sample (PCa+, PCa-)b | Crossvalc/split/none | MLd methodse | TP,f n | FN,g n | FP,h n | TN,i n | Senj (lower-upper) | Spek (lower-upper)
Zhao, 2015 [ | LB | 71 | 238 (92, 146) | 120 (60, 60) | ANN | 57 | 35 | 16 | 130 | 0.620 | 0.890
Valerio, 2016 [ | LB | 53 | 106 (53, 53) | None | LDA | 51 | 2 | 1 | 53 | 0.962 | 0.981
Lay, 2017 [ | LB | 224 | 410 (123, 287) | Crossval | RF | 109 | 14 | 57 | 230 | 0.886 | 0.801
Reda, 2017 [ | LB | 18 | 53 (26, 27) | Crossval | SNCSAE | 26 | 1 | 1 | 27 | 0.963 | 0.964
Starobinets, 2017 [ | LB | 169 | 509 (291, 218) | Crossval | LR | 264 | 27 | 24 | 194 | 0.907 | 0.890
Wang, 2017 [ | PB | 172 | 172 (79, 93) | Crossval | DCNN | 55 | 24 | 15 | 78 | 0.696 | 0.839
Le, 2017 [ | LB | 364 | 913 (463, 450) | 275 (139, 135) | multimodal CNN | 125 | 14 | 6 | 129 | 0.899 | 0.956
Kwon, 2018 [ | LB | 204 | 191 (36, 155) | Crossval | LASSO LR | 35 | 5 | 9 | 90 | 0.875 | 0.909
Song, 2018 [ | LB | 195 | 547 (261, 286) | 55 (23, 32) | DNN | 20 | 3 | 3 | 29 | 0.870 | 0.906
Chen, 2019 [ | PB | 381 | 381 (182, 199) | 155 (55, 60) | LR | 55 | 1 | 1 | 59 | 0.982 | 0.983
Devine, 2019 [ | LB | 65 | 97 (81, 16) | Crossval | LR | 61 | 20 | 2 | 14 | 0.753 | 0.875
Gholizadeh, 2019 [ | LB | 11 | 297 (161, 136) | Crossval | SVM | 161 | 1 | 9 | 127 | 0.994 | 0.934
Ma, 2019 [ | PB | 81 | 81 (44, 37) | None | LR | 42 | 2 | 5 | 32 | 0.955 | 0.865
Mazaheri, 2019 [ | LB | 67 | 170 (102, 68) | 91 (52, 39) | CART | 51 | 1 | 19 | 20 | 0.981 | 0.513
Qi, 2019 [ | PB | 199 | 199 (85, 114) | 66 (28, 38) | LR | 23 | 5 | 3 | 35 | 0.821 | 0.921
Zhang, 2019 [ | PB | 140 | 140 (60, 80) | Crossval | RF | 14 | 6 | 5 | 22 | 0.700 | 0.815
aLB: lesion-based model; PB: patient-based model.
bPCa: prostate cancer.
cCrossval: cross-validation techniques.
dML: machine learning.
eANN: artificial neural networks; LDA: linear discriminant analysis; RF: random forest; SNCSAE: stacked nonnegativity constraint sparse autoencoders; LR: logistic regression; DCNN: deep convolutional neural networks; LASSO: least absolute shrinkage and selection operator; DNN: deep neural networks; SVM: support vector machine; CART: classification and regression tree.
fTP: true-positive.
gFN: false-negative.
hFP: false-positive.
iTN: true-negative.
jSen: sensitivity.
kSpe: specificity.
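As a sanity check on the extracted counts, sensitivity and specificity can be recomputed directly from each 2×2 confusion table; a minimal sketch (the function name is ours), applied here to the Zhao, 2015 row above:

```python
def diagnostic_accuracy(tp, fn, fp, tn):
    """Sensitivity and specificity from a 2x2 confusion table."""
    sensitivity = tp / (tp + fn)   # true-positive rate among PCa+ cases
    specificity = tn / (tn + fp)   # true-negative rate among controls
    return sensitivity, specificity

# Zhao, 2015 row: TP=57, FN=35, FP=16, TN=130
sen, spe = diagnostic_accuracy(57, 35, 16, 130)
print(round(sen, 3), round(spe, 3))  # prints: 0.62 0.89
```

The recomputed values match the tabulated sensitivity of 0.620 and specificity of 0.890 for that study.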
Subgroup analysis for the model-based covariate in radiomic studies. Subgroup 1: lesion-based models; subgroup 2: patient-based models. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
Subgroup analysis for the validation covariate in radiomic studies. Subgroup 1: internal cross-validation; subgroup 2: hold-out approach or external validation; subgroup 3: no validation. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
The results of the subgroup analysis to discriminate among machine and deep learning methods are reported in
Therefore, Devine et al [
Subgroup analysis for the machine learning algorithm covariate in radiomic studies. Subgroup 1: regression-based models; subgroup 2: tree-based models; subgroup 3: deep learning methods. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
Subgroup analysis for the machine learning or deep learning covariate in radiomic studies. Subgroup 1: machine learning–based models; subgroup 2: deep learning methods. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
Subgroup analysis for the imbalance covariate in radiomic studies. Subgroup 1: balanced data sets; subgroup 2: unbalanced data sets. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
Subgroup analysis for the model-based covariate in a subset of radiomic studies. Subgroup 1: lesion-based models; subgroup 2: patient-based models. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
Overall hierarchical summary receiver operating characteristic curve (HSROC) for a subset of radiomic studies. HSROC was calculated for radiomic studies with low heterogeneity, excluding 4 studies [36,44,51,53].
All the included studies for the genomic analysis are reported in
An HSROC model was assessed for all genomic studies (Figure S4 in
The calculated heterogeneity values for the pooled sensitivity and specificity were 73% and 92% (
To resolve this heterogeneity, subgroup analyses were conducted for several covariates. The subgroup analysis for model-based covariates is shown in
Accuracy measures of genomic studies for the systematic review.a
Study, year | Model basisb | Predictor | Patients, n | Total sample (PCa+, PCa-)c | Crossvald/split/none | TP,e n | FN,f n | FP,g n | TN,h n | Seni | Spej
Donovan, 2015 [ | PB | Urine | 195 | 195 (89, 106) | None | 80 | 9 | 84 | 22 | 0.899 | 0.208
Roberts, 2015 [ | PB | Semen | 66 | 66 (12, 54) | Crossval | 11 | 1 | 32 | 20 | 0.917 | 0.385
Zhang, 2015 [ | PB | Serum | 580 | 580 (180, 400) | 320 (120, 200) | 84 | 36 | 5 | 195 | 0.7 | 0.975
Mengual, 2016 [ | PB | Urine | 224 | 224 (15, 73) | Crossval | 116 | 35 | 12 | 61 | 0.768 | 0.836
Salido-Guadarrama, 2016 [ | PB | Urine | 143 | 143 (73, 70) | None | 60 | 13 | 13 | 57 | 0.822 | 0.814
Dereziński, 2017 [ | PB | Serum | 89 | 89 (49, 40) | 34 (19, 15) | 13 | 6 | 0 | 15 | 0.675 | 0.969
Dereziński, 2017a [ | PB | Urine | 89 | 89 (49, 40) | 34 (19, 15) | 17 | 2 | 4 | 11 | 0.895 | 0.733
Kirby, 2017 [ | LB | Tissue | 101 | 398 (286, 112) | 262 (213, 49) | 180 | 33 | 4 | 45 | 0.845 | 0.918
Barceló, 2018 [ | PB | Semen | 42 | 42 (34, 18) | None | 22 | 2 | 5 | 13 | 0.917 | 0.722
Amante, 2019 [ | PB | Urine | 91 | 91 (43, 48) | Crossval | 40 | 3 | 5 | 43 | 0.93 | 0.896
Brikun, 2019 [ | PB | Urine | 94 | 94 (42, 52) | 29 (13, 16) | 12 | 1 | 5 | 11 | 0.923 | 0.687
Gao, 2019 [ | PB | Urine | 183 | 183 (108, 75) | 77 (55, 22) | 48 | 7 | 5 | 17 | 0.873 | 0.773
Patel, 2019 [ | LB | Tissue | 699 | 795 (699, 96) | 242 (212, 30) | 199 | 13 | 2 | 28 | 0.939 | 0.933
Santotoribio, 2019 [ | PB | Serum | 232 | 232 (32, 200) | None | 30 | 2 | 58 | 142 | 0.937 | 0.71
aAll studies employed regression-based models.
bLB: lesion-based model; PB: patient-based model.
cPCa: prostate cancer.
dCrossval: cross-validation techniques.
eTP: true-positive.
fFN: false-negative.
gFP: false-positive.
hTN: true-negative.
iSen: sensitivity.
jSpe: specificity.
The subgroup analysis among studies that employed internal cross-validation techniques (subgroup 1) [
Subgroup analysis for the model-based covariate in genomic studies. Subgroup 1: lesion-based models; subgroup 2: patient-based models. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
Subgroup analysis for the validation covariate in genomic studies. Subgroup 1: internal cross-validation; subgroup 2: hold-out approach or external validation; subgroup 3: no validation. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
A subgroup analysis was also carried out based on the specimen used by the genomic studies (ie, urine [
An inspection of ML algorithms among genomic studies was not possible because all the included studies employed a regression-based model (Table S4 in
Finally, the effect of using balanced or highly unbalanced data sets in ML approaches was investigated (
As a result, among several covariates, the imbalance covariate was the only one by which the heterogeneity could be partially resolved for more than 5 studies.
By inspecting
Five studies employing urine specimens and balanced data sets showed a very low heterogeneity (
The HSROC curve for the studies employing balanced data sets to automatically detect PCa via urine biomarkers is shown in
Subgroup analysis for the predictor covariate in genomic studies. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
Subgroup analysis for the imbalance covariate in genomic studies. Subgroup 1: balanced data sets; subgroup 2: unbalanced data sets. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
Coupled forest plots for balanced studies. The included studies investigated urine specimens. FN: false-negative; FP: false-positive; TN: true-negative; TP: true-positive.
Hierarchical summary receiver operating characteristic curve (HSROC) for a subset of genomic studies. HSROC was calculated for genomic studies with low heterogeneity [15,17,66,67,69].
This paper presents the results of a systematic literature review and meta-analysis of articles investigating machine learning algorithms to detect PCa via radiomic or genomic analysis. A particular focus was to evaluate how the implementation of different ML approaches affects clinical results. At this stage, owing to the high heterogeneity of the methods and tools employed in the existing literature, no clear conclusions on the clinical relevance of ML for PCa can be drawn from this study. This review shows that ML has helped to improve the diagnostic performance of PCa detection, but challenges remain for the clinical applicability of such methods, and more research is needed. The literature presented here should help in building ML systems that are robust and computationally efficient enough to assist clinicians in the diagnosis of PCa via radiomic and genomic biomarkers.
In this review, 37 studies were shortlisted, and 29 studies were included in a meta-analysis. All patients were diagnosed with PCa by biopsy. However, not all the included studies reported full information on the methods used to carry out biopsy (eg, direct MRI-guided, cognitive fusion, or MRI-TRUS fusion biopsy).
In the radiomic and genomic meta-analysis, 16 and 14 studies were included, respectively. Heterogeneity among radiomic and genomic studies was 84% and 73%, respectively. This was expected, as ML methods are usually regarded as black boxes, and the consideration of all possible transformations is onerous. Moreover, there are no clear guidelines on how to develop AI approaches for medical studies, even though a few recommendations have been summarized by Foster et al [
To partially resolve the heterogeneity of the included studies, subgroup analyses were conducted based on several covariates. In ML applications where repeated measures or records are captured for each subject, this structure can affect the overall performance. In most studies, the main aim is to predict whether a given subject is a “sick” or a “control” subject; in such applications, each subject has a single label (eg, sick or control). Nonetheless, there are other classification problems in which each subject can have multiple labels: for instance, multiple lesions can be extracted from the same subject, with the control class represented by the benign-adjacent prostate lesion. It has been demonstrated that this phenomenon, known as identity confounding, can cause discrepancies in classification performance [
In both radiomic and genomic studies, patient-based models showed lower heterogeneity and lower performance than lesion-based models; this could be because lesion-based models employed larger sample sizes but may overfit owing to repeated measures.
A second important covariate to examine in ML problems is the data set construction. In particular, the data set is usually divided into training and testing sets in order to reduce overfitting problems [
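A split that respects subject identity avoids the identity confounding described above, because all records from one patient stay on the same side of the train/test divide. The sketch below is illustrative pure Python (the function name `patient_level_split` and the record layout are ours); group-aware splitters in common ML libraries implement the same idea.

```python
import random

def patient_level_split(records, test_frac=0.3, seed=0):
    """Hold-out split that keeps all lesions of a patient on the same side.

    records: list of (patient_id, features, label) tuples. Splitting by
    patient rather than by lesion prevents repeated measures from one
    subject leaking between the training and test sets.
    """
    patients = sorted({pid for pid, _, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r[0] not in test_ids]
    test = [r for r in records if r[0] in test_ids]
    return train, test
```

The same grouping principle applies to cross-validation: folds should partition patients, not lesions.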
Different ML approaches were also investigated among radiomic studies as a possible covariate factor. There were no relevant differences in heterogeneity or performance among subgroups (
The imbalance covariate was crucial in this study. Unbalanced and small data sets are very common in the medical field, and ML algorithms tend to produce unsatisfactory classifiers when handled with imbalanced data sets. Therefore, several techniques to overcome this problem have been proposed over time [
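The simplest of the rebalancing techniques mentioned above can be sketched as follows; this is an illustrative pure-Python sketch (the function name is ours) of naive random oversampling, whereas methods such as SMOTE instead synthesize new minority samples by interpolating between neighbors.

```python
import random

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes
    reach the size of the largest class. Applied to the training set
    only, to avoid leaking duplicated records into the test set."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_s, out_y = [], []
    for y, group in by_class.items():
        picks = group + [rng.choice(group) for _ in range(target - len(group))]
        out_s.extend(picks)
        out_y.extend([y] * target)
    return out_s, out_y
```

Class weighting inside the learning algorithm is an alternative that avoids duplicating records altogether.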
For radiomic studies, after excluding studies that employed highly unbalanced data sets, the heterogeneity was less than 50%. The final pooled sensitivity and specificity for the use of mpMRI were 0.808 (95% CI 0.38-0.999) and 0.831 (95% CI 0.41-0.999), respectively.
For genomic studies, the heterogeneity dropped to 36% and reached a value close to zero when Donovan et al [
Only 4 studies [
A comparison between genomic and radiomic studies was not possible because they describe two different but complementary perspectives on the disease. However, the pooled sensitivity and specificity for both mpMRI and urine biomarkers were around 80%, showing them to be promising biomarkers for the detection of PCa via ML in clinical practice. The use of mpMRI has shown great diagnostic potential [
In this scenario, a typical ML postprocessing pipeline for radiomic and genomic analysis to automatically detect PCa may be constituted of a few crucial steps. In the case of radiomic studies, a common pipeline may be constituted of (1) examination of mpMRI; (2) image segmentation through the delineation of ROIs or VOIs, which can include whole gland volume, a specific zone, and one or multiple lesions, which should be explicitly specified in the manuscript; (3) image preprocessing; (4) filtering; (5) feature extraction; (6) integration of radiomic data with clinical data, genomic data, or both; (7) feature selection in relation to the target class; and (8) algorithm training, validation, and testing. Alternatively, a DL approach would only require the examination of the images and annotation of the ROIs or VOIs of the whole image, according to the desired classification output.
The image processing pipeline should be carefully described in manuscripts. The spatial coregistration of DWIs is a critical factor in the correct analysis of diffusion tensor imaging data, which has often been used as a predictor of PCa diagnosis. Moreover, the use of an endorectal coil can cause considerable deformation of the prostate compared with other coils and may not provide adequate MR image quality [
Due to the high heterogeneity of genomic studies, a standard pipeline configuration could be structured into (1) missing value management; (2) filtering to remove low-variance features; (3) data normalization due to data coming from heterogeneous formats; (4) a feature selection step to remove irrelevant features due to the high dimension of data; (5) dealing with class imbalance distribution present in this type of large-scale data set; and (6) algorithm training, validation, and testing. Alternatively, a DL approach would handle filtering and feature selection to generate handcrafted features. Deep learning is a powerful tool to integrate different “omics” and increase the computational power of diagnostic tools.
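Steps 1-3 of this genomic pipeline can be sketched in a few lines. The snippet below is an illustrative pure-Python sketch (the function name and the variance threshold are ours, not taken from any reviewed study): mean-impute missing values, drop near-constant features, and z-score normalize each remaining feature; feature selection, imbalance handling, and model training would follow.

```python
import statistics

def preprocess(matrix, var_threshold=1e-6):
    """Missing-value imputation, low-variance filtering, and z-score
    normalization for a samples-by-features matrix (None = missing)."""
    n_feat = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_feat)]

    # 1) mean imputation of missing values
    for col in cols:
        observed = [v for v in col if v is not None]
        mean = statistics.fmean(observed)
        for i, v in enumerate(col):
            if v is None:
                col[i] = mean

    # 2) remove low-variance (near-constant) features
    cols = [c for c in cols if statistics.pvariance(c) > var_threshold]

    # 3) z-score normalization per feature
    scaled = []
    for c in cols:
        mu, sd = statistics.fmean(c), statistics.pstdev(c)
        scaled.append([(v - mu) / sd for v in c])

    # return samples-by-features again
    return [list(row) for row in zip(*scaled)]
```

In practice, imputation and scaling parameters should be fit on the training set only and then applied to the test set, for the same leakage reasons discussed for data splitting.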
Further general recommendations on how to avoid bias and pitfalls in applying ML to medical problems are as follows: (1) in the case of multicenter studies, it is recommended to use batch effect approaches to prevent any bias due to different study protocols and feature normalization procedures to reduce within-subject bias [
Our study presents several limitations. Some variability still remains due to the actual thresholds between studies. However, the multiple hierarchical model accounts for between- and within-subject variability among studies, including threshold effects. Another factor that could have affected the heterogeneity among studies is the use of different predictors among radiomic and genomic studies. Moreover, several studies reported little or incomplete information on the parameters used to develop ML models. Therefore, the number of parameters that are estimated by each technique was not investigated as a possible source of heterogeneity among studies. Additional heterogeneity in the observed results is due to the variability of calibration differences between equipment and differences between readers or observers, as well as variation in the implementation of tests. Another possible bias may be due to the preprocessing techniques on the extracted data and feature selection and feature normalization methods.
We limited the search to English-only studies; although this is common in systematic reviews, this exclusion criterion could have reduced the generalizability of the findings. However, the extent and effects of language bias have recently diminished because of a shift toward publication of studies in English [
Finally, publication bias was not assessed in our analysis, as there are currently no statistically adequate models in the field of meta-analysis of diagnostic test accuracy [
ML has shown its potential to empower clinicians in the detection of prostate cancer. The accuracy of ML algorithms for diagnosis of PCa was considered acceptable, in terms of heterogeneity, for 12 radiomic studies investigating mpMRI and 5 genomic studies using urine biomarkers.
However, given the limitations indicated in our study, further well-designed studies are warranted to extend the potential use of ML algorithms to clinical settings. Recommendations on the use of these techniques were also provided to help researchers to design robust studies aiming to identify radiomic and genomic biomarkers to detect cancer.
Supplementary material.
AI: artificial intelligence
AUC: area under the curve
DCE: dynamic contrast-enhanced
DL: deep learning
DWI: diffusion-weighted imaging
FN: false-negative
FP: false-positive
HSROC: hierarchical summary receiver operating characteristic curve
ML: machine learning
mpMRI: multiparametric magnetic resonance imaging
PCa: prostate cancer
PI-RADS: Prostate Imaging Reporting and Data System
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
PSA: prostate-specific antigen
QUADAS-2: quality assessment of diagnostic accuracy studies–version 2
ROI: region of interest
SMOTE: synthetic minority oversampling technique
TN: true-negative
TP: true-positive
TRUS: transrectal ultrasonography
TZ: transition zone
VOI: volume of interest
This work was supported by “Progetti di Ricerca Corrente” funded by the Italian Ministry of Health.
RC, MF, and CC collected the data. All authors contributed to project development, data analysis, and the writing and editing of the manuscript.
None declared.