This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
HIV and sexually transmitted infections (STIs) are major global public health concerns. Over 1 million curable STIs occur every day among people aged 15 to 49 years worldwide. Insufficient testing or screening substantially impedes the elimination of HIV and STI transmission.
The aim of our study was to develop an HIV and STI risk prediction tool using machine learning algorithms.
We used clinic consultations that tested for HIV and STIs at the Melbourne Sexual Health Centre between March 2, 2015, and December 31, 2018, as the development data set (training and testing data set). We also used 2 external validation data sets, including data from 2019 as external “validation data 1” and data from January 2020 and January 2021 as external “validation data 2.” We developed 34 machine learning models to assess the risk of acquiring HIV, syphilis, gonorrhea, and chlamydia. We created an online tool to generate an individual’s risk of HIV or an STI.
The important predictors for HIV and STI risk were gender, age, men who reported having sex with men, number of casual sexual partners, and condom use. Our machine learning–based risk prediction tool, named MySTIRisk, performed at an acceptable or excellent level on testing data sets (area under the curve [AUC] for HIV=0.78; AUC for syphilis=0.84; AUC for gonorrhea=0.78; AUC for chlamydia=0.70) and had stable performance on both external validation data from 2019 (AUC for HIV=0.79; AUC for syphilis=0.85; AUC for gonorrhea=0.81; AUC for chlamydia=0.69) and data from 2020-2021 (AUC for HIV=0.71; AUC for syphilis=0.84; AUC for gonorrhea=0.79; AUC for chlamydia=0.69).
Our web-based risk prediction tool could accurately predict the risk of HIV and STIs for clinic attendees using simple self-reported questions. MySTIRisk could serve as an HIV and STI screening tool on clinic websites or digital health platforms to encourage individuals at risk of HIV or an STI to be tested or start HIV pre-exposure prophylaxis. The public can use this tool to assess their risk and then decide if they would attend a clinic for testing. Clinicians or public health workers can use this tool to identify high-risk individuals for further interventions.
HIV and sexually transmitted infections (STIs) are major global public health concerns [
In response to the rising rates of STIs, the WHO proposed the “Global health sector strategy on Sexually Transmitted Infections, 2016-2021,” which aimed to end STI epidemics as public health concerns by 2030. This specifically includes a 90% reduction in gonorrhea incidence globally from the 2018 global baseline and achieving a rate of ≤50 congenital syphilis cases per 100,000 live births in 80% of countries [
An easily accessible and user-friendly tool that accurately identifies an individual's risk of infection could form part of a web-based risk prediction program and play a role in risk prediction and personalized risk management [
A number of mathematical techniques can be used to generate an individual’s risk of HIV and STIs. Logistic regression has limitations in predictive analysis that uses complex and big data. Logistic regression methods require strong assumptions and cannot easily deal with nonlinear relationships, interactions, and multicollinearity [
Despite the advantages of machine learning approaches, there is an absence of individual risk prediction tools for HIV and STI risk using machine learning models. Existing studies using machine learning algorithms to predict HIV and STI acquisition mainly focus on HIV [
The Melbourne Sexual Health Centre (MSHC) is the largest public sexual health center in Victoria, Australia and offers free HIV and STI testing and management [
We used data from March 2, 2015, to December 31, 2018, as the development data set (training and testing data set). The HIV study data set included training and testing data (88,642 consultations). The syphilis, gonorrhea, and chlamydia study data sets had 92,291, 97,473, and 115,845 consultations, respectively.
We used temporal validation as the external validation to evaluate the transportability and generalizability of our risk prediction models. The COVID-19 epidemic may have changed the demographics of those who attend the MSHC [
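The temporal split described above can be sketched as a simple date lookup. This is a minimal illustration, not the authors' code: the function and dictionary names are hypothetical, and the exact bounds of the 2020-2021 window are an assumption, since the text gives only the months.

```python
from datetime import date

# Date windows for each data set. The development and 2019 windows follow
# the text; the bounds of the 2020-2021 window are an assumption.
WINDOWS = {
    "development": (date(2015, 3, 2), date(2018, 12, 31)),
    "validation_1": (date(2019, 1, 1), date(2019, 12, 31)),
    "validation_2": (date(2020, 1, 1), date(2021, 1, 31)),  # assumed bounds
}


def assign_data_set(consultation_date):
    """Return the name of the data set a consultation falls into, or None."""
    for name, (start, end) in WINDOWS.items():
        if start <= consultation_date <= end:
            return name
    return None
```

Because the validation windows lie strictly after the development window, a model is never evaluated on consultations that predate its training data.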
Ethical approval was granted by the Alfred Hospital Ethics Committee, Melbourne, Australia (project number: 124/18). All methods were carried out following relevant guidelines and regulations of the Alfred Hospital Ethics Committee. As this was a retrospective study involving minimal risk to the privacy of the study participants, the need for informed consent was waived by the Alfred Hospital Ethics Committee. All identifying details of the study participants were removed before any computational analysis.
The data fields we selected for inclusion as predictors were informed by literature review, expert opinion, and prior work [
Characteristics of clinic consultations in the training and testing data set.
Variables | HIV (n=88,642 consultations) | Syphilis (n=92,291 consultations) | Gonorrhea (n=97,473 consultations) | Chlamydia (n=115,845 consultations)
Gender, n (%) | | | |
Female | 26,651 (30.1) | 27,134 (29.4) | 31,282 (32.1) | 38,548 (33.3)
Male | 61,991 (69.9) | 65,157 (70.6) | 66,191 (67.9) | 77,297 (66.7)
Age at consultation (years), median (IQR) | 29.0 (24.0-35.0) | 29.0 (25.0-35.0) | 28.0 (24.0-35.0) | 28.0 (24.0-34.0)
Country of birth, n (%) | | | |
Australia | 39,148 (44.2) | 40,990 (44.4) | 43,881 (45.0) | 51,162 (44.2)
Overseas | 46,003 (51.9) | 47,670 (51.7) | 49,835 (51.1) | 60,272 (52.0)
Missing | 3491 (3.9) | 3631 (3.9) | 3757 (3.9) | 4411 (3.8)
 | | | |
No | 56,175 (63.4) | 57,413 (62.2) | 54,595 (56.0) | 68,584 (59.2)
Yes | 25,067 (28.3) | 27,150 (29.4) | 34,751 (35.7) | 38,930 (33.6)
Missing | 7383 (8.3) | 7728 (8.4) | 8127 (8.3) | 8331 (7.2)
 | | | |
Not applicable (female) | 26,651 (30.1) | 27,134 (29.4) | 31,282 (32.1) | 38,548 (33.3)
No | 16,508 (18.6) | 17,089 (18.5) | 15,245 (15.6) | 26,975 (23.3)
Yes | 45,483 (51.3) | 48,068 (52.1) | 50,946 (52.3) | 50,322 (43.4)
aSTI: sexually transmitted infection
HIV infection was defined as a new diagnosis of HIV based on serology. Syphilis infection was defined as a new diagnosis of early syphilis (primary, secondary, and early latent [<2 years]) using a blood test or nucleic acid amplification test (NAAT). Gonorrhea infection was defined as a new diagnosis of gonorrhea using culture or NAAT at any anatomical site. In the clinic, gonorrhea testing initially occurs with NAAT, and culture is mostly used after a positive NAAT. Chlamydia infection was defined as a new diagnosis using NAAT at any anatomical site. Our previous publications report the diagnostic methods in detail [
We developed 34 machine learning models to assess the risk of acquiring HIV, syphilis, gonorrhea, and chlamydia (details in
Development of machine learning algorithms. The architecture of the gradient boosting machine was adapted from Feng et al [
Logistic regression has been widely used to predict the risk of incident STIs and HIV [
We first established 4 regression models, including logistic regression, ridge regression, least absolute shrinkage and selection operator (LASSO) regression, and elastic net regression (ENR). Based on the preliminary results of the 4 regression analyses, we found that ENR was better than the other 3 regression analyses (details in
Stacking ensemble learning is an ensemble learning method that trains a new model based on the combined predictions of 2 (or more) previous machine learning models. Stacking ensemble learning often performs better than individual machine learning techniques [
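As a rough illustration of stacking, the sketch below fits a second-level model on the predictions of several base models. It rests on assumptions: the text does not state which meta-learner was used, so logistic regression is chosen here for simplicity, and the probabilities, outcomes, and function names are hypothetical toy values rather than study data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_learner(base_probs, labels, lr=0.5, epochs=3000):
    """Fit a logistic-regression meta-learner on base-model probabilities.

    base_probs: one row per sample, one column per base model. In practice
    these should be out-of-fold predictions so the meta-learner never sees
    predictions made on the base models' own training data.
    """
    n_samples, n_models = len(base_probs), len(base_probs[0])
    weights, bias = [0.0] * n_models, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_models, 0.0
        for row, y in zip(base_probs, labels):
            err = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - y
            grad_b += err
            for j, x in enumerate(row):
                grad_w[j] += err * x
        bias -= lr * grad_b / n_samples
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
    return weights, bias


def stacked_probability(row, weights, bias):
    """Combine one sample's base-model probabilities into a single score."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


# Toy out-of-fold probabilities from three hypothetical base models
# (standing in for ENR, GBM, and RF) with the observed outcomes.
oof_probs = [
    [0.10, 0.20, 0.15], [0.20, 0.10, 0.25], [0.30, 0.35, 0.20],
    [0.70, 0.60, 0.80], [0.80, 0.75, 0.65], [0.60, 0.85, 0.70],
]
outcomes = [0, 0, 0, 1, 1, 1]
weights, bias = fit_meta_learner(oof_probs, outcomes)
low = stacked_probability([0.15, 0.20, 0.10], weights, bias)
high = stacked_probability([0.75, 0.70, 0.80], weights, bias)
```

The meta-learner learns how much to trust each base model, which is what allows a stacked ensemble to do better than any single model's output.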
Our models used a one-hot encoding scheme for data classification. We did not impute missing data but created a binary feature vector indicating missing values. The data were considered “imbalanced” given that each of the 4 infections was <10%. Imbalanced data may cause either overfitted or underperformed predictive results [
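The encoding described above, one-hot columns plus a separate missing-value indicator instead of imputation, might look like the following minimal sketch. The function name and the category list are hypothetical and chosen only for illustration.

```python
def one_hot_with_missing(value, categories):
    """Encode one categorical answer as 0/1 indicator columns.

    Returns one entry per known category plus a final entry that flags a
    missing answer; the missing value itself is not imputed.
    """
    is_missing = value is None
    vector = [int(not is_missing and value == c) for c in categories]
    vector.append(int(is_missing))
    return vector


# Hypothetical category list for a country-of-birth question.
country_categories = ["Australia", "Overseas"]
```

Keeping missingness as its own column lets a model learn from nonresponse itself rather than from a guessed value.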
Our machine learning models predicted the probability of HIV or an STI as a normalized value between 0 and 1. The model-predicted probability was calibrated to the actual prevalence level of HIV and STIs. We used a logistic function to provide a fitting curve for each model-predicted probability and infection prevalence. We regarded the estimated infection prevalence as the “calibrated risk” of infection and presented it in the risk report. We used MATLAB R2019a (MathWorks, Natick, MA) to calibrate the model-predicted probability to the actual prevalence level. The method is described in detail in our previous paper [
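One common way to write a logistic calibration curve of this kind is shown below. The parametrization is an assumption for illustration only, since the exact functional form is not given here: p is the model-predicted probability, r(p) is the calibrated risk, and α and β are hypothetical coefficients fitted so that r(p) tracks the observed infection prevalence.

```latex
r(p) = \frac{1}{1 + \exp\bigl(-(\alpha + \beta p)\bigr)}
```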
To investigate the effect of predictors, we used the best base machine learning model to calculate the variable importance for HIV, syphilis, gonorrhea, and chlamydia infection. We identified and selected predictors that accounted for more than 80.0% of the overall model performance for each infection. We retrained, retested, and revalidated the best performing model based on these predictors. We compared the AUC, sensitivity, and specificity to re-evaluate the model performance with the shortlisted predictors. We also used the AUC to evaluate the change in performance in the best machine learning model before and after predictor shortlisting (details in
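The shortlisting rule, keeping the top-ranked predictors until they account for more than 80% of total importance, can be sketched as follows. The function name and importance scores are hypothetical, and reading "overall model performance" as a cumulative share of variable importance is an assumption.

```python
def shortlist_predictors(importances, threshold=0.80):
    """Keep top predictors until their cumulative share exceeds threshold.

    importances: dict mapping predictor name to a non-negative importance.
    """
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, score in ranked:
        selected.append(name)
        cumulative += score / total
        if cumulative > threshold:
            break
    return selected


# Hypothetical importance scores, for illustration only.
example_importances = {
    "age": 30.0,
    "gender": 25.0,
    "casual_partners": 20.0,
    "condom_use": 15.0,
    "country_of_birth": 10.0,
}
shortlist = shortlist_predictors(example_importances)
```

With these toy scores the first three predictors reach 75% of the total, so the fourth is also needed to pass the 80% mark.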
Our training and testing data included 216 (0.2% of 88,642 consultations) HIV infections, 787 (0.9% of 92,291 consultations) syphilis infections, 7581 (7.8% of 97,473 consultations) gonorrhea infections, and 10,217 (8.8% of 115,845 consultations) chlamydia infections. Across the 4 infection data sets, men accounted for between 66.7% (77,297/115,845) and 70.6% (65,157/92,291) of consultations. Further details are provided in
Our results demonstrated that the ensemble learning models performed better than individual machine learning models. Of all 34 models, our best model (ensemble ENR+GBM+RF) provided acceptable or excellent performance on testing data for predicting HIV (AUC=0.78), syphilis (AUC=0.84), gonorrhea (AUC=0.78), and chlamydia (AUC=0.70; Figures S1-S3 in
The top 10 predictors for each of the 4 infections accounted for >80.0% of the overall HIV and STI model performance. These predictors included gender, presence of STI symptoms, MSM, age, country of birth, having sex with a man in the last 12 months, the number of casual male sexual partners in the last 12 months, condom use with male partners in the last 12 months, the number of casual female sexual partners in the last 12 months, drug injection in the last 12 months, sex overseas in the last 12 months, past gonorrhea infection, past nonspecific urethritis infection, past syphilis infection, contact with a gonorrhea case, contact with a chlamydia case, and contact with a syphilis case (
Importance of the top 10 predictors in the prediction of HIV or sexually transmitted infections (STIs) using a gradient boosting machine, for detecting (A) HIV, (B) syphilis, (C) gonorrhea, and (D) chlamydia.
Based on the selected most important predictors and the best model (ensemble ENR+GBM+RF), we built an HIV and STI risk prediction tool, named
Receiver operating characteristic curve performance of the HIV and sexually transmitted infection (STI) risk prediction tool on (A) testing data analysis from 2015-2018, (B) external data validation analysis from 2019, and (C) external data validation analysis from 2020-2021. AUC: area under the curve.
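For readers less familiar with the AUC, it can be read as the probability that a randomly chosen positive consultation receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name is hypothetical, and the pairwise loop is written for clarity, not for speed on data sets of the size used here.

```python
def auc_score(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to ranking no better than chance, and 1.0 to perfect separation of infected and uninfected consultations.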
To estimate the risk of HIV or an STI, we fitted the data using a logistic function to provide a fitting curve for each model-predicted probability and infection prevalence (Figures S4-S7 in
Graphical user interface elements of the HIV and sexually transmitted infection (STI) risk prediction tool, called MySTIRisk. A prototype version of the tool is available at [
These are examples of the HIV and STI risk prediction results:
Your HIV risk is about 2/1000. In a group of 1000 people like me, 2 will have HIV. 998 people will not have HIV.
Your syphilis risk is about 10/1000. In a group of 1000 people like me, 10 will have syphilis. 990 people will not have syphilis.
Your gonorrhea risk is about 30/1000. In a group of 1000 people like me, 30 will have gonorrhea. 970 people will not have gonorrhea.
Your chlamydia risk is about 50/1000. In a group of 1000 people like me, 50 will have chlamydia. 950 people will not have chlamydia.
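Messages like those above can be generated from a calibrated risk by expressing it as a frequency out of 1000 people. This is a minimal sketch with a hypothetical function name; it mirrors the wording of the examples and is not the tool's actual code.

```python
def risk_message(infection, calibrated_risk, group_size=1000):
    """Format a calibrated risk (a probability in [0, 1]) as a frequency."""
    if not 0.0 <= calibrated_risk <= 1.0:
        raise ValueError("calibrated_risk must be between 0 and 1")
    affected = round(calibrated_risk * group_size)
    unaffected = group_size - affected
    return (
        f"Your {infection} risk is about {affected}/{group_size}. "
        f"In a group of {group_size} people like me, "
        f"{affected} will have {infection}. "
        f"{unaffected} people will not have {infection}."
    )
```

Presenting risk as a natural frequency ("2 in 1000") rather than a bare probability is the design choice these example messages reflect.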
The following examples describe testing recommendations:
Benefits of testing: Prevent all complications and prevent unknowingly transmitting infection to others.
Consequences of not testing: Complications from infections such as infertility (untreated chlamydia), chronic pain (untreated chlamydia), hearing loss (untreated syphilis), and cancer (untreated HIV).
This is the first web-based risk prediction tool that uses machine learning algorithms and self-reported data to accurately identify HIV, syphilis, gonorrhea, and chlamydia infection in men and women, and its performance was stable on external validation. Our findings showed that machine learning algorithms could predict HIV and STIs in clinic attendees. Our results also showed that stacking ensemble learning algorithms perform better than individual machine learning models to predict HIV and STIs. We then developed a web-based application to provide an immediate and individualized assessment for the risk of a positive diagnosis of HIV and 3 STIs. Our application could be a part of clinic websites or digital health platforms to identify individuals with a higher risk of HIV and STIs or potential candidates for HIV pre-exposure prophylaxis (PrEP). Further validation studies in other countries can assess the usefulness of this risk prediction tool, which could help reduce both HIV and STI incidence and the cost of HIV and STI screening, which requires expensive equipment and specialized expertise.
Our results showed that nonlinear machine learning algorithms provided better performance than the conventional logistic regression for predicting HIV and STIs in men and women. Our findings are consistent with the results of previous machine learning predictive models for HIV and STIs [
We systematically developed and tested 34 machine learning models and found that stacking ensemble learning outperformed individual machine learning models [
Our models have several strengths compared with previous machine learning models for predicting HIV and STIs. First, our predictive models were not limited to high-risk groups (such as MSM). HIV and STI risk prediction models have been published previously but mainly for high-risk individuals, such as MSM [
We were unable to locate any web-based, publicly available tool to quantify STI risk. We identified some available web-based HIV prediction tools, such as the “HIV risk prediction tool” [
Our web-based HIV and STI risk prediction tool can be used as a screening tool to potentially increase HIV and STI testing and encourage access to testing and health care (Figure S8 in
There are many possible ways that our web-based risk prediction tool could potentially be used, including as part of a behavioral intervention to control HIV and STIs or to help clinicians or public health workers identify high-risk individuals for risk management or further interventions. An example of this exists in adolescent health risk behaviors. Researchers used an individual’s risk behavior scores and personalized feedback as part of an intervention for health behaviors, including nutritional behaviors, physical activity, and sleep [
Future work will investigate the effectiveness of this web-based HIV and STI risk prediction tool for behavioral change (ie, uptake of PrEP or condom promotion) and STI service utilization behaviors (timely clinic attendance and HIV and STI testing uptake) after receiving risk prediction results and testing recommendations. Implementing this web-based HIV and STI prediction tool may encourage individuals with STI symptoms or those at high risk without symptoms to attend health services for timely testing and regular testing. Since February 2009, the MSHC has offered MSM regular SMS reminders for STI screening [
This study has some limitations. First, the predictive factors depend on self-reported information from the CASI system, which is subject to the participants' recall, nonresponse, and social desirability bias. For example, MSM who declined to report the number of male partners were at a higher risk of chlamydia [
This is the first web-based risk assessment tool using machine learning algorithms and self-reported data to identify HIV, syphilis, gonorrhea, and chlamydia in men and women. Our online risk prediction tool could accurately predict the risk of HIV and STIs in clinic attendees with a simple self-administered questionnaire. Our risk prediction tool could be part of clinic websites or digital health platforms. The public can use this risk prediction tool to assess their HIV and STI risk to inform testing. Clinicians or public health workers can use this risk prediction tool to identify high-risk individuals for further interventions.
Supplementary tables and figures.
AI: artificial intelligence
AUC: area under the curve
CASI: computer-assisted self-interview system
CV: cross-validation
DL: deep learning
EHR: electronic health records
ENR: elastic net regression
GBM: gradient boosting machine
LASSO: least absolute shrinkage and selection operator
MSHC: Melbourne Sexual Health Centre
MSM: men who have sex with men
NAAT: nucleic acid amplification tests
NB: Naive Bayes
PrEP: pre-exposure prophylaxis
RF: random forest
RR: ridge regression
STI: sexually transmitted infection
WHO: World Health Organization
EC and JJO are supported by Australian National Health and Medical Research Council Emerging Leadership Investigator Grants (GNT1172873 and GNT1193955, respectively). CKF is supported by an Australian National Health and Medical Research Council Leadership Investigator Grant (GNT1172900). LZ is supported by the National Natural Science Foundation of China (Grant number: 81950410639); Outstanding Young Scholars Support Program (Grant number: 3111500001); Xi’an Jiaotong University Basic Research and Profession Grant (Grant number: xtr022019003, xzy032020032); Epidemiology modeling and risk assessment (Grant number: 20200344); and Xi’an Jiaotong University Young Scholar Support Grant (Grant number: YX6J004). The authors want to acknowledge Afrizal Afrizal from the Melbourne Sexual Health Centre (MSHC) for data extraction. The authors thank Glenda Fehler for her contribution to data cleaning. The authors also would like to acknowledge Jon Emery from the University of Melbourne for an insightful discussion on risk prediction tools (eg,
XX, CKF, and LZ conceived and designed the study. XX cleaned the data, established the models and coding, wrote the first draft, and edited the manuscript. WL, EC, CKF, and LZ contributed to data cleaning. XX, ZG, ZY, YB, and LZ contributed to establishing the models and coding. JW and XX developed the web-based application. CKF and LZ contributed to establishing the web-based application. EC, CKF, and LZ contributed to data verification and supervision. EC, YB, ZY, ZG, JJO, WL, CKF, and LZ contributed to the interpretation of data and manuscript revision. All authors contributed to the preparation of the manuscript and approved the final manuscript.
None declared.