Development of an Online Health Care Assessment for Preventive Medicine: A Machine Learning Approach

Background: In the era of information explosion, the use of the internet to assist with clinical practice and diagnosis has become a cutting-edge area of research. The application of medical informatics allows patients to be aware of their clinical conditions, which may contribute toward the prevention of several chronic diseases and disorders.
Objective: In this study, we applied machine learning techniques to construct a medical database system from electronic medical records (EMRs) of subjects who have undergone health examination. This system aims to provide online self-health evaluation to clinicians and patients worldwide, enabling personalized health and preventive health.
Methods: We built a medical database system based on the literature, and data preprocessing and cleaning were performed for the database. We utilized both supervised and unsupervised machine learning technology to analyze the EMR data to establish prediction models. The models with EMR databases were then applied to the internet platform.
Results: The validation data were used to validate the online diagnosis prediction system. The accuracy of the prediction model for metabolic syndrome reached 91%, and the area under the receiver operating characteristic (ROC) curve was 0.904 in this system. For chronic kidney disease, the prediction accuracy of the model reached 94.7%, and the area under the ROC curve (AUC) was 0.982. In addition, the system also provided disease diagnosis visualization via clustering, allowing users to check their outcome compared with those in the medical database, enabling increased awareness for a healthier lifestyle.
Conclusions: Our web-based health care machine learning system allowed users to access online diagnosis predictions and provided a health examination report. Users could understand and review their health status accordingly.
In the future, we aim to connect hospitals worldwide with our platform, so that health care practitioners can make diagnoses or provide patient education to remote patients. This platform can increase the value of preventive medicine and telemedicine.


Introduction
In the ever-changing technological era, the internet can provide rapid and convenient medical services in the form of health care, preventive medicine, and telemedicine. Medical informatics is a multidisciplinary field that comprises medicine and computer science. As computer technology continues to advance, medical informatics can be used to develop various applications such as electronic medical records (EMRs), medical image processing, clinical diagnosis decision systems, hospital information management systems, telemedicine, and internet and health information systems [1][2][3][4].
To construct a health care information system, several factors must be considered: the hospital information system, including both clinical management and diagnosis services; the storage and processing of patient information, such as EMRs and electronic health records; decision support systems, such as expert diagnosis systems; and the artificial intelligence (AI) algorithms that need to be applied to those factors (eg, data mining in EMRs and decision-making in clinical diagnosis) [5][6][7][8].
The mass application of EMRs and the digitalization of medical equipment and instruments have led to the continuous expansion of information capacity in hospital databases. Therefore, informatics research should focus on basic electronic medical database construction, data collection and analysis, medical decision support, and automatic knowledge acquisition. Furthermore, the use of machine learning (ML) technology in AI to extract the most important information has led to cutting-edge research in medicine [9][10][11][12][13]. The goal of AI is to construct an intelligent machine that imitates the natural intelligence of humans. Computers, robots, and software that are made with such technology will have human-like thinking processes, but with the ability to utilize superhuman speed and power effectively. Knowledge engineering is an essential part of AI research, especially ML, because AI operations require a significant amount of real-world data.
ML is defined as a "machine that is capable of self-learning without any guidance." Therefore, the main purpose of ML is to enable computers to learn from data and correct themselves during analysis. At its core, ML automatically identifies specific patterns and information hidden within very large data sets using statistical analysis and prediction [14][15][16][17].
Disease and disability are influenced by several factors: environmental factors, genetic predisposition, pathogens, and lifestyle choices. Some conditions develop through a dynamic process that can affect an individual before they are aware of any problem [18][19][20]. The core of preventive medicine is to prevent chronic diseases among people who are at risk of certain diseases. In some cases, it can also be used to reverse their condition, returning them to a good health status. In the past, due to information asymmetry, doctors and hospitals led the medical environment, and patients did not have access to any appropriate methods or information to implement real-time self-management. Patients who failed to obtain an early diagnosis would have to pay higher health care costs. Therefore, the spirit of preventive medicine is that "an ounce of prevention is worth a pound of cure" [21][22][23].
Metabolic syndrome (MetS) is a cluster of conditions comprising high blood sugar, high blood pressure, abnormal blood lipid levels, abdominal obesity, and other metabolic risk factors. It is a warning sign of potential future chronic disease. People with MetS have an increased risk of subsequent development of type II diabetes, hypertension, hyperlipidemia, heart disease, and stroke compared with healthy people [24][25][26][27][28].
Chronic kidney disease (CKD) is defined as kidney function that is impaired for longer than 3 months, leading to irreversible damage. The National Kidney Foundation Kidney Disease Outcomes Quality Initiative guideline classifies CKD into 5 stages according to the estimated glomerular filtration rate (eGFR), calculated using the recommended Modification of Diet in Renal Disease (MDRD) equation [29]. There are many causes of CKD, such as congenital anomalies of the kidney, urinary tract obstruction, urinary tract infection, and glomerulopathy. In addition, hypertension, diabetes, and gout are common chronic diseases that cause CKD if undertreated [30,31].
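As an illustration of the staging described above, the following minimal sketch computes eGFR with the 4-variable MDRD equation and maps it to a stage. It assumes the IDMS-traceable coefficient (175) and the standard eGFR cut-offs of 90, 60, 30, and 15 mL/min/1.73 m²; the function names are hypothetical, and the exact variant of the equation used in this study is not specified in the text.

```python
def egfr_mdrd(serum_creatinine_mg_dl, age_years, is_female, is_black=False):
    """Estimate GFR in mL/min/1.73 m^2 with the 4-variable MDRD equation.

    Assumption: the IDMS-traceable coefficient 175 is used; the original
    (non-IDMS) form of the equation uses 186 instead.
    """
    if serum_creatinine_mg_dl <= 0 or age_years <= 0:
        raise ValueError("creatinine and age must be positive")
    egfr = 175.0 * serum_creatinine_mg_dl ** -1.154 * age_years ** -0.203
    if is_female:
        egfr *= 0.742
    if is_black:
        egfr *= 1.212
    return egfr


def ckd_stage(egfr):
    """Map an eGFR value to stage 1-5 using the standard cut-offs.

    Note: stages 1 and 2 additionally require evidence of kidney damage,
    which eGFR alone cannot establish.
    """
    if egfr >= 90:
        return 1
    if egfr >= 60:
        return 2
    if egfr >= 30:
        return 3
    if egfr >= 15:
        return 4
    return 5


# Hypothetical example: 50-year-old man with serum creatinine 1.0 mg/dL.
example_egfr = egfr_mdrd(1.0, 50, is_female=False)
print(round(example_egfr, 1), ckd_stage(example_egfr))
```

Because eGFR falls as serum creatinine rises, a higher creatinine value at the same age and sex moves a subject toward a later stage.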
Telemedicine uses information and telecommunication technology to deliver medical information and physicians' diagnoses to patients without the limitations of time and space. It combines information and communication technologies with medical expertise to provide various services: remote consultation and conferencing for doctors; comprehensive medical care for residents in remote and outlying islands; and teaching and training opportunities for medical staff. The internet can be used to assist with the popularization of telemedicine to achieve a two-way communication channel between patients and medical practitioners [32,33]. Therefore, this study aims to construct an online ML-driven medical database system from EMRs of subjects who have undergone health examination, and provide online self-health evaluation for MetS and CKD.

Setting
The study was conducted at the Health Management Center (HMC) of Taipei Medical University Hospital (TMUH). Electronic medical records (EMRs) were obtained and reviewed from the HMC, which receives approximately 60 to 70 visits per month.

Ethics
The study was approved by the Institutional Review Board (IRB) of TMUH prior to data collection (TMUH TMU-JIRB number N201906023), in accordance with the original and amended Declaration of Helsinki. The IRB waived the need for informed consent because of the retrospective nature of this study.

EMR Database and System
The databases and the selected predicting variables (Table 1) were derived from previous publications on MetS and CKD [34][35][36]. Figure 1 shows an overview of the system and the main functions. Briefly, the two databases (MetS and CKD) were connected to an internet platform through a series of procedures to construct one integrated system. This web-based system was embedded with ML models to provide various medical evaluations and analyses. The online system was constructed on a server as a web-based environment. The frontend implementation included the programming language JavaScript (Oracle Corp), the framework VueJS (Vue), and the styling language Syntactically Awesome Style Sheets (Sass). The backend implementation used Java and R as the programming languages, and all ML calculations and evaluations were conducted using the statistical program R (version 3.6.1, R Foundation for Statistical Computing). The backend web framework was Spring Boot (Pivotal Software), connected to a MySQL (Oracle Corp) database as the storage system.

Study Populations
Figure 2 [37][38][39] shows an overview of the main study population and the validation populations. Briefly, the starting study population included 48,628 EMRs of Taiwanese adults aged over 18 years who underwent a self-paid health examination at TMUH from July 2015 to December 2019. All the study participants completed a self-questionnaire on demographics, existing medical conditions, and the use of medications.
Figure 2. Flowchart of data collection and preprocessing for MetS and CKD data sets including training and validation sets. SAS Enterprise Guide is a software that combines the analytic ability of SAS software with a user-friendly interface. It provides several functions of Structured Query Language (SQL), which includes a text mining technique. ACC: accuracy; AUC: area under the curve; BUN: blood urea nitrogen; CKD: chronic kidney disease; KNN: k-nearest neighbors algorithm; MetS: metabolic syndrome; UA: uric acid. * Centers for Disease Control and Prevention (CDC) and National Center for Health Statistics (NCHS) [37], ** Iimori et al [39], *** De Nicola et al [38].
Subsequently, the starting population data underwent data cleaning and preprocessing to form two distinct databases (MetS and CKD) for ML. For the MetS database, there were a total of 1129 participants after the exclusion of participants without FibroScan (Echosens) measurements. For the CKD database, there were a total of 2287 participants after the exclusion of participants without values for creatinine, blood urea nitrogen, and uric acid.
Due to the inconsistent definition of MetS across the world, the ML performance of the MetS database and that of the CKD database were validated using different study populations. The ML performance of the CKD database was validated using Taiwanese, Italian, US, and Japanese data sets, whereas the ML performance of the MetS database was validated using only a Taiwanese data set [37][38][39]. Because some variables were unavailable in some validation data sets, those variables were excluded from the ML performance analysis for a balanced comparison.

ML Techniques
The ML techniques used in this system included supervised learning models, such as classification and regression tree (CART) and random forest [35,36]. Supervised learning was applied to classify the patients in the training set and predict patients with a specific chronic disease or syndrome in the validation set before the prediction model was available on this system [40,41]. In addition, unsupervised learning (hierarchical clustering using the Ward method and Euclidean distance) was embedded in a heat map, providing classified visualization between new input records and the database. An interactive heat map that could be rearranged or zoomed in and out was applied to this system [42][43][44][45][46][47].
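To make the unsupervised step more concrete, the sketch below shows agglomerative clustering with the Ward criterion on squared Euclidean distances between cluster centroids. It is a naive standard-library illustration, not the R implementation used in the system; the function names and the small two-variable data set are hypothetical.

```python
from itertools import combinations


def centroid(points):
    """Mean vector of a list of equal-length numeric rows."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clusters(points, n_clusters):
    """Merge the cheapest pair of clusters until n_clusters remain.

    Returns a list of clusters, each a sorted list of row indices.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(
                [points[k] for k in clusters[ij[0]]],
                [points[k] for k in clusters[ij[1]]],
            ),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical standardized records: rows 0-2 and rows 3-5 form two groups.
records = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
           [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
print(sorted(ward_clusters(records, 2)))
```

In a heat map, the order in which clusters merge determines the dendrogram, so a new input record is displayed next to the database records it merges with earliest.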
All outcomes were presented on the web platform after the ML system evaluated the users' EMRs. Although the ML system was developed on a web-based interface, it could be embedded in the Internet of Medical Things (IoMT) environment, for example, as apps or real-time monitoring systems between several medical centers and hospitals [48][49][50].

Questionnaire Selection
To measure the usability of the website, we invited volunteers (physicians, medical staff, and potential users of the ML system) to fill out a system usability scale (SUS) evaluation questionnaire. SUS was chosen as the usability test tool because previous studies found it to be reliable and quick to answer, and the final score is provided with interpretation based on a well-established reference standard [51,52]. In general, the higher the SUS score, the better the usability of the website. Details about the questionnaire design (the 10 questions), score summary, and results of reliability and validity tests are given in Multimedia Appendix 1.
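For readers unfamiliar with SUS scoring, the following sketch applies the standard scoring rule: each of the 10 items is answered on a 1 to 5 scale, odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5 to give a score from 0 to 100. The function name is hypothetical, and the sketch assumes the questionnaire in Multimedia Appendix 1 follows the standard item order.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from 10 responses on a 1-5 scale.

    responses must be listed in questionnaire order (item 1 first).
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected 10 responses, each an integer from 1 to 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # positively worded (odd) items
        else:
            total += 5 - response   # negatively worded (even) items
    return total * 2.5


# A respondent who answers 3 (neutral) to every item scores 50.
print(sus_score([3] * 10))
```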

Results
The web-based health care ML system provides online diagnosis of three diseases (Figure 3), and it is available on the internet [53]. The website provides an assessment of MetS and CKD; the system for noncancer liver disease is still under beta testing. Report pages are provided for online diagnosis of each disease. Therefore, users from all over the world can choose the evaluation provided depending on their requirements. Users input the predicting variables (Table 1) into the website to evaluate their health (Figure 4), and the evaluation results appear in <5 seconds when there is a single request. Missing predicting variables are allowed, and the missing values are imputed with the mean values from the database. However, users are warned that missing predicting variables may result in poorer prediction accuracy. The details of stress tests with different numbers of requests (100 to 800) can be found in Multimedia Appendix 2. Briefly, a stress test with 800 requests reported a throughput of 4.7 requests per second.
To evaluate the usability of the system, we invited 30 volunteers to complete the SUS evaluation questionnaire. The volunteers included 6 physicians, 12 medical staff, and 12 potential users (Multimedia Appendix 1). The average SUS score was 74, which indicates a good usability rating [54]. In addition, the results were found to be reliable and valid by Kaiser-Meyer-Olkin and Bartlett tests. The entire analysis process follows a strict privacy policy, so that none of the patients' private information is ever recorded.
The clinical outcomes established by our database are reported on the website when users have finished entering their medical record data (Figure 5). The CART model and ensemble learning model (random forest) are shown in the output interface. A scoring prediction model obtained using the supervised learning model is also provided online.
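The handling of missing predicting variables described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, variable names, and mean values are hypothetical and are not taken from the system's database.

```python
def impute_with_means(user_input, database_means):
    """Fill missing predictors with database means.

    user_input: dict of predictor name -> value (missing or None if not entered).
    database_means: dict of predictor name -> mean value in the database.
    Returns the completed record and the list of imputed predictor names,
    so the interface can warn the user about reduced prediction accuracy.
    """
    completed = {}
    imputed = []
    for name, mean_value in database_means.items():
        value = user_input.get(name)
        if value is None:
            completed[name] = mean_value
            imputed.append(name)
        else:
            completed[name] = value
    return completed, imputed


# Hypothetical means and a user who entered only creatinine.
means = {"creatinine": 0.9, "bun": 14.0, "uric_acid": 5.5}
record, missing = impute_with_means({"creatinine": 1.4, "bun": None}, means)
print(record, missing)
```

Returning the names of the imputed variables alongside the completed record keeps the warning to the user tied to exactly the values that were filled in.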
For unsupervised learning, a color visualization of the clustering heat map depicts a vivid medical pattern of the patient's EMR data, and a record of each user is also constructed using hierarchical clustering with yellow highlights labeling in the heat map (Figure 6). The user will then be classified as more similar to either a healthy subject (green column on the lower left) or an unhealthy subject (orange column on the upper left). A blue bar depicts abnormal values, while a red bar depicts normal values. In addition, on the web system, users can choose to view it as landscape or portrait. The zoom-in and zoom-out functions and the height of the cluster are also dynamic, with users being able to change the settings online to inspect the medical outcomes in detail.
Characteristics of participants in the training and validation data set for MetS can be found in Table 2 and the characteristics of participants of the training set and the validation sets for CKD can be found in Table 3. In general, there are minimal differences in patient characteristics between the training data set and the validation data set for the Taiwanese population of MetS and CKD. However, when comparing the characteristics of the Taiwanese population with other populations (US, Italy, and Japan) for CKD ML performance validation, it was found that there are substantial differences in age and presence of hypertension (Table 3).
Table 4 shows the validation performances of supervised learning models in predicting MetS and CKD. In general, it was found that the random forest ML model has higher accuracy than the CART model. Using the random forest ML model, MetS can be predicted with an accuracy of 0.909, and CKD can be predicted up to an accuracy of 0.947. Due to the inconsistent definition of MetS globally, the ML performance of the MetS database has only been validated using the Taiwan data set.
However, the ML performance of the CKD database has been validated using data sets from Taiwan, Italy, the United States, and Japan. In general, the CKD database shows good external applicability, and has high AUC for all 4 validation data sets (Taiwan: AUC=0.982; USA: AUC=0.929; Italy: AUC=0.977; Japan: AUC=0.923). However, the validation accuracy and F1 value of CKD prediction differ more substantially, as the unavailable data were excluded from the analysis. When compared to the Taiwanese CKD data set, the respective unavailable data are approximately 6% for the US data set, 50% for the Italy data set, and 67% for the Japan data set. Therefore, the Japanese validation data set has the lowest accuracy (0.743) in predicting CKD, as it also has the highest proportion of unavailable data.
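The three validation metrics reported above can be computed as in the following sketch. Accuracy and F1 depend on the predicted class at a chosen threshold, whereas AUC depends only on how the predicted scores rank positive cases against negative cases, which is one reason the metrics can diverge. The function names, the 0.5 threshold, and the example values are hypothetical.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels coded as 0 (negative) and 1 (positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (equivalent to the Mann-Whitney U statistic).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels and predicted probabilities.
labels = [1, 1, 0, 0, 1, 0]
probabilities = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
predicted = [1 if p >= 0.5 else 0 for p in probabilities]
print(accuracy_and_f1(labels, predicted), roc_auc(labels, probabilities))
```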

Discussion
Overview
This ML medical system for three common diseases in family medicine (MetS, CKD, and liver diseases) was constructed from EMRs of subjects who underwent self-paid health examination. Several ML prediction models are applied to the databases, and the outcomes are summarized and presented visually on the website for users and medical staff. The accuracy of predicting MetS reached 90.9%, and AUC was 0.904 in this system. For chronic kidney disease, the prediction accuracy reached 94.7%, and the AUC was 0.982. In general, users who were invited to test this system rated it with good usability and could easily assess their health online through this web-based ML monitoring system.

CART
Decision trees are an important type of ML algorithm for predictive modeling. They are commonly used in data mining with the objective of creating a model that predicts the dependent variable (the target) based on numerous independent variables [34,37].
A decision tree is a nonparametric ML modeling technique used for regression and classification problems. In classification problems, the target variable is categorical, and the tree is used to identify which group or class a target variable would likely fall into. In regression problems, the target variable is continuous, and the tree is used to predict its value. To find solutions, a decision tree makes a sequence of hierarchical decisions about the outcome variable according to the predictors [55][56][57].
Hence, CART can provide a visual tree-based diagram for medical practitioners to disseminate health care information to patients. It also helps users to understand the significance of different risk factors for specific diseases. For example, the cut-off controlled attenuation parameter (CAP) score was used to separate patients with MetS and those with other health observations. The CAP score was brought to the attention of users, thereby increasing their awareness of self-health [34].
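How a classification tree arrives at a cut-off for a single predictor can be sketched as follows: every midpoint between consecutive observed values is tried, and the one that minimizes the weighted Gini impurity of the two resulting groups is kept. This is a minimal illustration of one split, not the full CART procedure; the function names, CAP values, and labels are hypothetical and are not data from this study.

```python
def gini(labels):
    """Gini impurity of binary labels (0 or 1): 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_cutoff(values, labels):
    """Return (cutoff, weighted_gini) for the best split 'value <= cutoff'."""
    unique_values = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(unique_values, unique_values[1:]):
        cutoff = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= cutoff]
        right = [l for v, l in zip(values, labels) if v > cutoff]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (cutoff, weighted)
    return best


# Hypothetical CAP scores (dB/m) and MetS labels (1 = MetS).
cap_scores = [210, 225, 240, 255, 270, 290, 310, 330]
has_mets = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_cutoff(cap_scores, has_mets))
```

A full tree repeats this search over all predictors at each node and then recurses into the two groups, which is what produces the tree diagram shown to users.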

Random Forest
Random forest, also called random decision forests, is a popular ensemble learning method in ML. Ensemble methods improve ML results by combining several models, which in the case of random forest are decision trees. This approach allows better predictive performance compared with a single model. Random forest is a parallel ensemble method in which the base learners are generated in parallel. The basic motivation of parallel methods is to exploit independence between the base learners because the error can be reduced dramatically by averaging [58,59]. Because random forest is built on bagging (bootstrap aggregating), the records left out of each bootstrap sample (out-of-bag records) offer an efficient estimate of the test error without incurring the cost of repeated model training associated with cross-validation. Moreover, random forest ranks risk factors in prediction models, which clinicians can use as a reference for diagnosis, and remote users can use to review their risk assessment of related diseases [34][60][61][62]. For instance, clinicians can refer to significant factors of certain diseases to determine whether those factors exceed the thresholds or not, allowing patients to be more vigilant about their risk of developing such diseases. In addition, sequential ensemble methods such as AdaBoost and XGBoost will be implemented and uploaded to our system in the future.
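The idea of estimating test error from bagging without cross-validation can be sketched as follows. Each base learner is trained on a bootstrap sample, and every record is scored only by the learners whose sample did not contain it. As a simplifying assumption, the base learner here is a one-variable threshold rule (a decision stump) rather than a full tree, and the per-split random feature selection of a true random forest is omitted; all names and data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Return the threshold t that minimizes training errors of 'predict 1 if x > t'."""
    unique_values = sorted(set(xs))
    candidates = [unique_values[0] - 1.0, unique_values[-1] + 1.0]
    candidates += [(a + b) / 2 for a, b in zip(unique_values, unique_values[1:])]

    def errors(t):
        return sum((x > t) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)


def oob_error(xs, ys, n_learners=25, seed=0):
    """Out-of-bag error of a bagged ensemble of stumps (majority vote)."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[] for _ in range(n)]
    for _ in range(n_learners):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap: with replacement
        threshold = fit_stump([xs[i] for i in sample], [ys[i] for i in sample])
        in_bag = set(sample)
        for i in range(n):
            if i not in in_bag:  # only out-of-bag records are scored
                votes[i].append(1 if xs[i] > threshold else 0)
    scored = [(i, v) for i, v in enumerate(votes) if v]
    if not scored:
        raise ValueError("no record was out-of-bag; use more learners")
    wrong = sum((sum(v) * 2 > len(v)) != bool(ys[i]) for i, v in scored)
    return wrong / len(scored)


# Hypothetical, well-separated one-variable data set.
values = [float(v) for v in list(range(10)) + list(range(20, 30))]
labels = [0] * 10 + [1] * 10
print(oob_error(values, labels))
```

Because each bootstrap sample omits roughly one-third of the records on average, every record is typically out-of-bag for several learners, so the estimate uses the whole data set without any additional model fitting.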

Clustering
Hierarchical clustering is a widely used unsupervised learning technique that groups data with similar characteristics. Both agglomerative and divisive approaches use dendrograms for the results. A heat map is a color graphical representation of data, which uses a matrix with color gradients to present the similarity of data.
Many studies on genetic bioinformatics and bacterial ecology have used heat maps for the analysis of large and complicated data sets, and some medical studies have used heat maps with clustering to present the relationship between various biomarkers according to their characteristics [34][63][64][65]. Furthermore, our system provides an interactive clustering heat map for health care. From the perspective of big data, users can evaluate their health status by using ML models and EMR databases. In addition to online health evaluation, in the future, this system could be implemented in different IoMT environments to assist medical practitioners in achieving real-time health evaluations and monitoring remote patients or patients in specific wards. For the heat map, users whose EMR data were grouped into clusters of patients with diseases in the database would be classified as clinically high-risk subjects requiring close attention in the clinical setting [34]. Therefore, whether it is applied in preventive medicine for health management, in a monitoring system for critical care, or in the telemedicine environment, our system can provide real-time monitoring and help predict patient conditions.

Limitations and Future Work
To the best of our knowledge, this is the first web-based machine learning system based on self-paid health examination subjects that can provide an online self-health evaluation for several common diseases (MetS, CKD, and liver diseases). The version 1.0 web-based system still has several limitations that may be improved in the next update. First, the 1.0 system is not yet ready for embedding into a hospital for real-time assessment. We are currently working on an improved system to accept unstructured data input and multimodal data, which are especially essential for the prediction of eye diseases such as macular degeneration. Second, the 1.0 system does not have a user login or account security function. Retrievable predictions and security will be improved as the system matures toward hospital deployment. Third, the 1.0 system does not have fully dynamic analyses such as an interactive decision tree; these will be incorporated in subsequent versions to improve communication between the medical staff and patients.
In the future, more clustering algorithms will be implemented in subsequent versions to make the prediction results more robust and reliable. Although the 1.0 system can currently only evaluate three chronic diseases (MetS, CKD, and liver diseases) frequently encountered in family medicine, more chronic disease prediction models, such as those for coronary artery disease, will be added in the near future.

Conclusion
We constructed an ML health monitoring system to offer an online health assessment service to medical units, telemedicine patients, and all health-conscious users worldwide. Our aim is that this system will be implemented in medical centers as a real-time patient monitoring system and provide regular health evaluations for telemedicine patients. Online users can now access our platform and use ML technology to estimate their health status, increasing self-health awareness.