A Novel Approach for Continuous Health Status Monitoring and Automatic Detection of Infection Incidences in People With Type 1 Diabetes Using Machine Learning Algorithms (Part 2): A Personalized Digital Infectious Disease Detection Mechanism

Background: Semisupervised and unsupervised anomaly detection methods have been widely used in various applications to detect anomalous objects from a given data set. Specifically, these methods are popular in the medical domain because of their suitability for applications where there is a lack of a sufficient data set for the other classes. Infection incidence often brings prolonged hyperglycemia and frequent insulin injections in people with type 1 diabetes, which are significant anomalies. Despite these potentials, there have been very few studies that focused on detecting infection incidences in individuals with type 1 diabetes using a dedicated personalized health model. Objective: This study aims to develop a personalized health model that can automatically detect the incidence of infection in people with type 1 diabetes using blood glucose levels and insulin-to-carbohydrate ratio as input variables. The model is expected to detect deviations from the norm because of infection incidences considering elevated blood glucose levels coupled with unusual changes in the insulin-to-carbohydrate


Introduction
The anomaly or novelty detection problem involves identifying anomalous or novel instances, which exhibit different characteristics from the rest of the data set, and has been widely studied in various applications, including machine fault and sensor failure detection, prevention of credit card or identity fraud, health and medical diagnostics and monitoring, cyber-intrusion detection, and others [1][2][3][4][5][6][7]. The term anomaly was precisely defined by Hawkins [8] as "observations that deviate much from the other observations so as to arouse suspicions that it could be generated by a different process." Anomalies are usually described as point, contextual, or collective, depending on how the degree of anomaly is computed [1,7,9]. Depending on whether labeled data instances are required for the respective classes, the anomaly detection problem can be approached as supervised, semisupervised, or unsupervised [3,7,9-11]. Supervised anomaly detection, for example, multiclass classification, requires labeled data instances for both the target and the nontarget (anomaly) classes. This requirement makes it impractical for tasks where it is difficult either to find enough samples for the anomaly class, that is, poorly sampled and unbalanced data, or to demarcate the boundaries of the anomaly class [7,10,12]. Moreover, anomalies can evolve over time, and what is known today might not remain valid, making the characterization of the anomaly class more challenging. In this case, semisupervised anomaly detection, that is, one-class classification, is preferred given that it only requires characterizing what is believed to be normal (target data instances) to detect the abnormal (nontarget data instances) [7]. Under certain circumstances, for example, in the medical domain, obtaining and demarcating the anomalous (nontarget) data instances can become very difficult, expensive, and time consuming, if not impossible [7,13].
For instance, assume a health diagnostic and monitoring system that detects health changes in an individual by tracking the individual's physiological parameters, where the current health status is examined based on a set of parameters, and raises a notification alarm when the individual health deteriorates [12]. In such a system, it becomes feasible to rely on a method that can be trained using only the regular or normal day measurements (target days) so as to detect deviation from normality [12,14]. This is because demarcating the exact boundaries between normal and abnormal health conditions is very challenging given that each pathogen has a different effect on the individual physiology. The one-class classifiers-based anomaly detection methods can be roughly grouped into 3 main groups: boundary and domain-based, density-based, and reconstruction-based methods based on how their internal function is defined and the approach used for minimization [3,10,12,13,15,16]. These models take into account different characteristics of the data set, and depending on the data set under consideration, these models could achieve different generalization performance, overfitting, and bias [12]. Unlike supervised and semisupervised anomaly detection methods, unsupervised methods do not require labeled instances to detect the anomaly (nontarget) instances because they rely on the entire data set to determine the anomalies and can be another possible alternative to semisupervised anomaly detection methods [7,10,12]. One of the drawbacks of unsupervised methods is that they require significant amount of data to achieve comparable performance. Both semisupervised and unsupervised methods have been used in various applications to detect anomalous instances [1,7,10,16]. In particular, these methods have been popular in the medical domain owing to their suitability for such applications, where there is lack of a sufficient data set for the other classes [13]. 
Accordingly, considering the difficulty and expense of obtaining enough sample data for the infection days from people with type 1 diabetes, a one-class classifier and unsupervised models are proposed for detecting infection incidence in this population. Type 1 diabetes, also known as insulin-dependent diabetes, is a chronic disease of blood glucose regulation (homeostasis) and is caused by the lack of insulin secretion from pancreatic cells [17,18]. In people with type 1 diabetes, the incidence of infection often results in hyperglycemia and frequent insulin injection [19][20][21][22][23][24][25][26]. Infection-induced anomalies are characterized by violation of the norm of blood glucose dynamics, where blood glucose remains elevated despite a higher amount of injected insulin and less carbohydrate consumption [19]. Despite this potential, there have been very few studies that focused on detecting infection incidence in individuals with type 1 diabetes using a dedicated personalized health model. Therefore, the objective of this study was to develop an algorithm, that is, a personalized health model that can automatically detect the incidence of infection in people with type 1 diabetes using blood glucose levels and insulin-to-carbohydrate ratio as input variables. For this, a one-class classifier and unsupervised models are proposed. The model is expected to detect any deviations from the norm because of infection incidences considering elevated blood glucose level (hyperglycemia incidences) coupled with unusual changes in the insulin-to-carbohydrate ratio, that is, frequent insulin injections and unusual reduction in the amount of carbohydrate intake [19]. Three groups of one-class classifiers and two unsupervised density-based models were explored. A detailed theoretical description of the proposed models is given in Multimedia Appendix 1 [1,7-16,27-37].
The anomaly detection problem studied in this paper can be regarded as a contextual anomaly, where the ratio of insulin-to-carbohydrate is the context and the average blood glucose level is the behavioral attribute. This is mainly because elevated blood glucose levels do not always signify an anomaly unless the context, in this case the ratio of insulin-to-carbohydrate, is taken into account. Throughout the paper, the term object is used to describe a feature vector incorporating the number of parameters under consideration. For example, an object $X$ can define a specific event of an individual's blood glucose dynamics at a specified time index $k$ and is represented by a feature vector $X_k = (x_{k,1}, x_{k,2})$, where $x_{k,1}$ represents the ratio of total insulin-to-total carbohydrate and $x_{k,2}$ represents the average blood glucose level in a specific time-bin (interval) around $k$.
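As a minimal sketch of how such an object could be assembled, the Python snippet below computes the two features for a single time-bin. The `BinRecord` container, its field names, and the choice to return `None` when the ratio or the mean is undefined are illustrative assumptions and are not taken from the study.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BinRecord:
    """Hypothetical container for the events recorded in one time-bin."""
    glucose_readings: List[float]  # blood glucose values within the bin
    insulin_bolus: List[float]     # bolus insulin doses within the bin
    carbohydrates: List[float]     # carbohydrate intakes within the bin


def feature_vector(rec: BinRecord) -> Optional[Tuple[float, float]]:
    """Return (x_k1, x_k2): insulin-to-carbohydrate ratio and mean glucose.

    Returns None when the bin has no carbohydrate intake or no glucose
    reading; how such bins are handled is left open here (assumption).
    """
    total_carbs = sum(rec.carbohydrates)
    if total_carbs == 0 or not rec.glucose_readings:
        return None
    ratio = sum(rec.insulin_bolus) / total_carbs
    mean_glucose = sum(rec.glucose_readings) / len(rec.glucose_readings)
    return (ratio, mean_glucose)
```

A bin with two glucose readings of 6.0 and 8.0, 6.0 units of bolus insulin, and 60.0 units of carbohydrate would yield a ratio of 0.1 and an average glucose of 7.0 under this sketch; the measurement units are left unspecified because the text does not state them.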

Methods
A group of one-class classifiers and unsupervised models were tested and compared. The one-class classifiers fall into 3 groups: boundary and domain-based, density-based, and reconstruction-based methods. The boundary and domain-based methods include support vector data description (SVDD) [27], one-class support vector machine (V-SVM) [28], incremental support vector machine [29], nearest neighbor (NN) [12], and minimum spanning tree (MST) [15]. The density-based methods include normal Gaussian [32], minimum covariance Gaussian [38], mixture of Gaussian (MOG) [32], Parzen [39], naïve Parzen [32], K-nearest neighbor (K-NN) [12,30], and local outlier factor (LOF) [31]. The reconstruction-based methods include principal component analysis (PCA) [12,32], K-means [32], self-organizing maps (SOM) [12,32], and auto-encoder networks [12]. In addition, unsupervised models were tested, including the LOF [31,33] and the connectivity-based outlier factor (COF) [33,34]. The input variables used in training and testing of the models, average blood glucose levels and ratio of total insulin (bolus) to total carbohydrate, were selected in accordance with the description provided by Woldaregay et al [19], and the ratio was calculated by dividing the total insulin by the total carbohydrate within a specified time-bin. The data set consists of high-precision self-recorded data collected from 3 real subjects (2 males and 1 female; average age 34 [SD 13.2] years) with type 1 diabetes. It incorporates blood glucose levels, insulin, carbohydrate information, and self-reported cases of infection, namely influenza (flu) and mild or light common cold without fever, as shown in Table 1. Exemplar data depicting the model's input features for 2 specific patient years with and without infection are shown in Figures 1-4, and a more detailed description of the input features for 10 patient years with and without infection incidences can be found in Multimedia Appendix 2 [12,19].
The data were resampled and imputed in accordance with the description provided by Woldaregay et al [19], and the preprocessed data were smoothed using a moving average filter of 2 days' (48 hours) window size to remove short-term and small-scale features [19,40,41]. Feature scaling was carried out using min-max scaling [42] to normalize the data between 0 and 1, which is important to ensure that larger parameters do not dominate the smaller ones. The data sets are labeled as target and nontarget data sets, where the target data sets include all the self-recorded normal period of the year and the nontarget data set includes only the self-reported infection periods when the individual was sick. Accordingly, the one-class classifiers were trained using only the target data sets containing the regular or normal period of the year and tested using both the target and the nontarget (infection period) data sets. For the unsupervised models, all the data sets containing both the target and the nontarget data sets were presented during testing. The hyperparameters of most of the one-class classifiers were optimized using a consistency approach [43]. Models such as naïve Parzen and Parzen were optimized using the leave-one-out method. For MST, the entire MST was used. For PCA, the fraction of variance retained from the training data set was set to be 0.67. The models were evaluated based on different characteristics including data nature (with and without filter), data granularity (hourly and daily), data sample size, and required computational time. All the experiments were conducted using MATLAB 2018b (Mathworks, Inc). Most of the models were implemented using ddtools, prtools, and anomaly detection toolbox, which are MATLAB toolboxes [32,33,35].
Figure. Hourly scatter plot of average blood glucose levels versus total insulin (bolus) to total carbohydrate ratio for a specific regular or normal patient year without any infection incidences.
Figure. Hourly scatter plot of average blood glucose levels versus total insulin (bolus) to total carbohydrate ratio for a specific patient year with an infection incidence (flu).
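The smoothing and scaling steps described in this section can be illustrated with the short Python sketch below. It is not the MATLAB code used in the study. The trailing (causal) form of the moving average and the shortened window for the first samples are assumptions, since the text specifies only the 48-hour window size; the function names are also hypothetical.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing moving average over `window` samples.

    The first samples, for which a full window is not yet available,
    are averaged over the samples seen so far (assumption).
    """
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the sample leaving the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the interval [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lowest) / (highest - lowest) for v in values]
```

With hourly data, a 2-day window would correspond to `window=48`; with daily data, to `window=2`. In a one-class setting, the minimum and maximum would typically be estimated on the training (target) data and reused at test time, which this sketch leaves to the caller.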

Model Evaluation
The performance of the one-class classifiers was evaluated using 20 times 5-fold stratified cross-validation. For both daily and hourly cases, the user-specified outlier fraction threshold β was set to 0.01 such that 1% of the training target data are allowed to be classified as outliers, that is, rejected [12]. Class imbalance was mitigated by oversampling of the nontarget data sets through random sampling [44]. Performance was measured using the area under the receiver operating characteristic (ROC) curve (AUC), specificity, and F1-score [45][46][47][48]. The AUC, specificity, and F1-score were reported as the average (SD) over the 20 times 5-fold stratified cross-validation rounds. AUC is the result of integration (summation) of the ROC curve over a range of possible classification thresholds [49]. It is regarded as robust (insensitive) when it comes to the presence of data imbalance; however, it is impractical for real-world implementation because it is independent of a single threshold [48]. Specificity measures the ratio of correctly classified negative samples to the total number of available negative samples [50]. Thus, it depicts the proportion of infection days (nontarget samples) that are correctly classified as such to the total number of infection days (period). It is only used to examine how the model performs in regard to the nontarget class (infection days). F1-score is the harmonic mean of precision and recall, where the value ranges from 0 to 1, and high F1-scores depict high classification performance [45]. F1-score is considered appropriate when evaluating model performance with regard to one target class and in the presence of unbalanced data sets [10,46-48].
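To make the three measures concrete, the following Python sketch computes them from labels and scores. It assumes the convention used above, with the target (normal) class as the positive class, coded 1, and the nontarget (infection) class as the negative class, coded 0. The AUC is computed through the pairwise-ranking (Mann-Whitney) formulation, which equals the area under the ROC curve; this is a substitute for explicit integration of the curve, not the toolbox routine used in the study. All function names are hypothetical.

```python
from typing import List, Tuple


def confusion(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) with 1 = target and 0 = nontarget."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def specificity(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of nontarget (infection) samples correctly rejected."""
    _, fp, tn, _ = confusion(y_true, y_pred)
    return tn / (tn + fp) if (tn + fp) else 0.0


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Harmonic mean of precision and recall for the target class."""
    tp, fp, _, fn = confusion(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the probability that a target sample outscores a nontarget one.

    Higher scores are assumed to mean 'more target-like'; ties count 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return 0.0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a repeated cross-validation setting, these functions would be applied per fold and the results summarized as mean and SD across the rounds.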
The models were further compared based on various criteria, which can contribute to the implementation of the models in real-world settings, including computation time, sample size, number of user-defined parameters, and sensitivity to outliers in the training data sets:
• Computation time: this characteristic defines the amount of time taken to train and test the model. Regarding personal use, response time is crucial for acceptance of the services by a wide range of users. Furthermore, with regard to the outbreak detection settings, this is an important parameter given that a system that uses data from many participants needs to have an acceptable response time. However, in real-world applications, the training phase can be performed in an offline mode, which makes the testing response time very crucial.
• Sample size: this characteristic specifies the minimum amount of training data required to generate an acceptable performance. This is an important factor given that the system relies on self-recorded data; it is difficult to accumulate a large set of data for an individual initially.
• Number of user-defined parameters: this characteristic defines the complexity of the model. It is simpler and less data are required to estimate a model with fewer parameters. This is an important factor because it is easier for an individual to implement the simple model compared with the complex model.
• Sensitivity to outliers in the training data sets: this characteristic defines how the model estimation is affected by outliers in the training set. This is a crucial characteristic because the model training depends on self-reported data, which are highly dependent on the accuracy of the user data registration. It is possible that the user might forget to report some infection incidence and hence might be considered as target data sets and be used as a training data set. Furthermore, errors incurred during manual registration of data can also affect model generalization.

Data Collection and Ethical Declaration
The study protocol was submitted to the Norwegian Regional Committees for Medical Health Research Ethics Northern Norway for evaluation and was found to be exempt from regional ethics review because it is outside the scope of medical research (reference number: 108435). Written consent was obtained, and the participants donated the data sets. All data from the participants were anonymized.

Results
The models were evaluated based on two different versions of the same data set: raw and filtered. The input variables to the models were the average blood glucose levels and the ratio of total insulin (bolus)-to-total carbohydrate. The necessary computational time for both training and testing of the models was also estimated. A comparison of the classifiers was carried out taking into account their performance, necessary sample size for producing acceptable performance, and computational time. These models were further compared based on their theoretical guarantee provided for robustness to outliers in the target data set and based on their complexity. In addition, these classifiers were compared with the unsupervised version of some selected models.

Model Evaluation
Model training and evaluation were carried out on an individual basis taking into account different characteristics of the data, specified time window or resolution (hourly and daily), and nature of the data (raw data and its smoothed version). For daily evaluation, we compared the performance of the models on raw data and its smoothed version with a 2-day moving average filter. For hourly evaluation, we compared the performance of the models on a smoothed version of the data set. The purpose of the comparison was to study the performance gain achieved by removing short-term noise from the data set through smoothing. The average and SD of AUC, specificity, and F1-score are computed and reported for each model. The top performing models from each category are highlighted in italics within each table.

Semisupervised Models
The regular or normal days were labeled as the target class data set and the infection period as the nontarget class data set. Three groups of one-class classifiers were trained on the target class and tested on a data set containing both the target and the nontarget classes. In addition to the data characteristics stated above, resolution and data nature, the one-class classifier performance was also assessed taking into account the required sample object size to produce acceptable data description. In this direction, we consider four groups of sample size: 1 month, 2 months, 3 months, and 4 months data sets. In the model evaluation, the data set containing the infection period was presented during testing. The evaluation was carried out based on 20 times 5-fold stratified cross-validation. The performance of the model was reported as the average and SD of AUC, specificity, and F1-score of the rounds. A score plot of each model for both the hourly and the daily scenarios using the smoothed version of the data can be found in Multimedia Appendix 3, where the models were trained on random 120 regular or normal days of the patient year and tested over the whole year.
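The train-on-target, test-on-both procedure can be sketched with a deliberately simplified nearest-neighbor one-class classifier in Python. This is an illustration only and does not reproduce the ddtools implementations used in the study: the score is the distance to the k-th nearest training object, and the rejection threshold is set so that roughly a fraction `outlier_fraction` of the training target objects would be rejected under leave-one-out scoring. All names and the thresholding rule are assumptions.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (insulin-to-carbohydrate ratio, mean glucose), scaled


def kth_nn_distance(x: Point, train: Sequence[Point], k: int,
                    skip_index: Optional[int] = None) -> float:
    """Distance from x to its k-th nearest training object.

    `skip_index` excludes one training object (leave-one-out scoring).
    """
    distances = sorted(math.dist(x, t) for i, t in enumerate(train) if i != skip_index)
    return distances[k - 1]


def fit_threshold(train: Sequence[Point], k: int, outlier_fraction: float) -> float:
    """Pick the distance below which about (1 - outlier_fraction) of the
    target training objects fall when each is scored leave-one-out."""
    loo = sorted(kth_nn_distance(x, train, k, skip_index=i) for i, x in enumerate(train))
    index = math.ceil((1.0 - outlier_fraction) * len(loo)) - 1
    return loo[min(len(loo) - 1, max(0, index))]


def is_target(x: Point, train: Sequence[Point], k: int, threshold: float) -> bool:
    """Accept x as a regular (target) object if it lies close enough to the
    training data; otherwise flag it as nontarget (possible infection)."""
    return kth_nn_distance(x, train, k) <= threshold
```

In use, `train` would hold only objects from regular days, for example 30, 60, 90, or 120 daily objects, while `is_target` would be applied to objects from both regular days and the infection period; the text's outlier fraction of 0.01 would be passed as `outlier_fraction`.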

Daily
As can be seen in Tables 2 and 3 below (see also Multimedia Appendix 4), the performance of the models generally improves as the size of the sample increases. The models performed well with respect to the raw data sets; however, the performance significantly improved with the smoothed version of the data. The results indicate that the sample size greatly affects the model performance and that there is a larger variation in performance when the training data set is small. Generally, it can be seen that the models generalize well with the 3-month data set (90 sample objects) and further improve after 3 months. In general, on average, with both the raw and smoothed data sets, the boundary and domain-based method performed better with a small sample size. As the sample size increased, all the three groups produced comparable descriptions of the data. From each respective category, models such as V-SVM, K-NN, and K-means performed well across all the sample sizes.

First Case of Infection (Flu)
The boundary and domain-based method achieved a better description of the data with a small sample size when compared with the other two groups. However, as the sample size increased, all the three groups achieved relatively comparable descriptions of the data. Specific models such as V-SVM, K-NN, and K-means performed better from their respective group. Regarding the raw data, as seen in Table 2, all the models failed to generalize from the 1-month data set as compared with the large sample objects, that is, 3 months, which was expected:
1. From the boundary and domain-based method, V-SVM performed better in all the sample sizes and achieved comparable performance even with 60 objects and improved significantly afterward. SVDD produced a comparable description with higher sample sizes, that is, 3 months and later.
2. From the density-based method, K-NN performed better in all the sample sizes and achieved better performance even with 60 objects. Naïve Parzen produced comparable performance with higher sample sizes, that is, 3 months and later.
3. From the reconstruction-based method, K-means achieved better performance for all sample sizes.
Smoothing the data, as shown in Table 3, improved the model performance even with 30 sample objects:

Second Case of Infection (Flu)
The boundary and domain-based method achieved better performance with a small sample size compared with the density and reconstruction-based methods. However, as the sample size increased, all the three groups achieved comparable performance. The detailed numerical values of comparison are given in Multimedia Appendix 4. Specific models such as V-SVM, K-NN, and K-means performed better from their respective group. Regarding the raw data, all the models failed to generalize from the 1-month data set as compared with the higher sample objects, that is, 3 months (Multimedia Appendix 4):
1. From the boundary and domain-based method, SVDD, MST, and incremental support vector data description (incSVDD) performed better with a larger sample object, and V-SVM achieved better description with 30 sample objects.
2. From the density-based method, all the models exhibited similar performance. Naïve Parzen and K-NN, with only 60 sample objects, achieved comparable performance with the higher sample objects.
3. From the reconstruction-based method, K-means achieved better performance for all sample sizes.
Smoothing the data significantly improved the performance of the model even with 30 objects, compared with the raw data (Multimedia Appendix 4):
1. From the boundary and domain-based method, the V-SVM achieved higher performance in all the sample sizes.
2. From the density-based method, LOF achieved better description with small sample objects, and K-NN produced better description with all the sample sizes. Gaussian families achieved improved and comparable performance with increased sample objects. Among them, K-NN with only 60 objects achieved comparable performance with larger sample objects.
3. Regarding the reconstruction-based method, K-means and SOM achieved better performance, whereas K-means performed better in all the sample sizes.

Third Case of Infection (Flu)
The boundary and domain-based method achieved better performance with a small sample size compared with the density and reconstruction-based methods. However, as the sample size increased, all the three groups produced comparable descriptions. The detailed numerical values of comparison are given in Multimedia Appendix 4. Specific models such as V-SVM, MST, LOF, and PCA performed better from their respective group. Regarding the raw data, surprisingly, in contrast to the previous two infection cases, all the models achieved higher generalization from the 1-month data set (Multimedia Appendix 4):
1. From the boundary and domain-based method, SVDD, V-SVM, MST, and incSVDD performed better in all the cases, with MST achieving better performance.
2. From the density-based method, normal and MCD Gaussian achieved better description of the data with 1-month sample objects. K-NN and LOF performed better with sample sizes larger than 1-month sample objects, and LOF outperformed all sample sizes. The LOF with only 60 objects achieved comparable performance with the higher sample objects.
3. From the reconstruction-based method, PCA produced better description for all sample sizes, whereas K-means and SOM achieved comparable performance with sample size larger than 1-month sample objects.
Smoothing the data allowed the models to generalize well and significantly improved the performance of the model even with 30 objects, compared with the raw data (Multimedia Appendix 4):
1. From the boundary and domain-based method, the V-SVM and MST achieved higher performance in all the sample sizes, whereas V-SVM outperformed all the models.
2. From the density-based method, the Gaussian families, LOF, and K-NN achieved better performance, whereas LOF achieved better performance in all sample sizes.
3. Regarding the reconstruction-based method, K-means and PCA achieved better performance, whereas PCA performed better in all the sample sizes.

Fourth Case of Infection (Flu)
The boundary and domain-based method achieved better performance with small sample sizes compared with the density and reconstruction-based methods. All the three groups improved with increasing sample size. The detailed numerical values of comparison are given in Multimedia Appendix 4. Specific models such as V-SVM, LOF, and K-means performed better from their respective group. Regarding the raw data, surprisingly, in contrast to all the previous three infection cases, all the models achieved higher generalization from the 1-month data set (Multimedia Appendix 4):
1. From the boundary and domain-based method, SVDD, V-SVM, and incSVDD performed better for all the sample sizes.
2. From the density-based method, MCD Gaussian performed better with a 1-month sample size, and all the models produced comparable descriptions as the sample size increased, whereas the LOF performed better for all the sample sizes.
3. From the reconstruction-based method, PCA performed relatively better for all the sample sizes, and K-means and SOM achieved comparable performance with a larger sample size.
Smoothing the data significantly improved the model performance even with 30 objects compared with the raw data (Multimedia Appendix 4):
1. From the boundary and domain-based method, the V-SVM achieved higher performance in all the sample sizes. As the sample size increased, the incSVDD and MST achieved comparable performance.
2. From the density-based method, K-NN and LOF produced better descriptions with a 1-month sample size. K-NN performed better in almost all sample sizes.
3. From the reconstruction-based method, K-means achieved better performance for all sample sizes.

Hourly
As can be seen in Table 4 (see also Multimedia Appendix 4), the performance of the model generally improved as more training sample data were presented. The models produced comparable performance even with the 1-month data set compared with the daily scenario. This is mainly because of the presence of more samples per day (24 samples per day), which enables the models to reach a better generalization. Generally, the results indicate that the models generalize well after 2 months. Both the boundary and domain-based method and reconstruction-based method achieved better performance even with a 1-month sample size. However, the density-based method suffers from large variation with 1-month training samples. In general, the boundary and domain-based method performed better in all the infection cases compared with the other two methods. In addition, specific models such as V-SVM, K-NN, and K-means performed well from their respective groups.

First Case of Infection (Flu)
The boundary and domain-based method achieved better performance compared with the density and reconstruction-based methods. As can be seen in Table 4, the boundary and domain-based method achieved better generalization from the 1-month data set. Specific models such as V-SVM, K-NN, and K-means performed better from their respective group:
1. From the boundary and domain-based method, V-SVM achieved better description in all sample sizes, whereas SVDD, incSVDD, and V-SVM achieved comparable performance with a larger sample size.
2. From the density-based method, Gaussian families and naïve Parzen performed better at large sample sizes, whereas K-NN and LOF achieved better performance in all the sample sizes. K-NN outperformed all the models.
3. From the reconstruction-based method, K-means performed better in all the sample sizes, and all the other models performed better with larger sample sizes.

Second Case of Infection (Flu)
The boundary and domain-based method and reconstruction-based method achieved better performance for all sample sizes compared with the density-based method. Specifically, the boundary and domain-based method achieved better generalization from the 1-month data set. The detailed numerical values of comparison are given in Multimedia Appendix 4. Specific models such as V-SVM, K-NN, and K-means performed better from their respective group:
1. From the boundary and domain-based method, V-SVM achieved better description for all the sample sizes, and SVDD, NN, and incSVDD improved with larger training sample size; however, V-SVM outperformed all the models for all the sample sizes.
2. From the density-based method, normal and MCD Gaussian performed better with the 1- and 2-month sample sizes, and models such as K-NN performed better on all the sample sizes, whereas naïve Parzen outperformed all the models with the 3- and 4-month data sets.
3. From the reconstruction-based method, K-means produced better description for all the sample sizes and the auto-encoder and SOM performed better with larger sample sizes.

Third Case of Infection (Flu)
Generally, in comparison, all the groups performed better at large training sample sizes; however, the boundary and domain-based method achieved better performance with small training sample sizes. It achieved comparable generalization from the 1-month data set. The detailed numerical values of comparison are given in Multimedia Appendix 4. Specific models such as V-SVM, families that utilize nearest neighbor distance (K-NN and LOF), and PCA performed better from their respective group:
1. From the boundary and domain-based method, SVDD, NN, MST, incSVDD, and V-SVM achieved better performance at larger training sample sizes, whereas V-SVM outperformed all the models for all the sample sizes.
2. From the density-based method, the Gaussian families, K-NN, LOF, and naïve Parzen achieved better performance at larger training sample sizes, whereas K-NN and LOF outperformed all the models for all the sample sizes.
3. From the reconstruction-based method, K-means, PCA, auto-encoder, and SOM achieved better performance at larger training sample sizes, whereas PCA performed better for all sample sizes.

Fourth Case of Infection (Flu)
Generally, in comparison, all the groups performed better at large training sample sizes; however, the boundary and domain-based method achieved better performance with small training sample sizes, for example, the 1-month data set, from which it achieved comparable generalization. The detailed numerical values of comparison are given in Multimedia Appendix 4. Specific models such as V-SVM, Gaussian families (Gaussian, MOG, and MCD Gaussian), and PCA performed better from their respective groups:

Average Performance Across all the Infection Cases
The average performances of the models across all the infection cases for different sample sizes, levels of data granularity (hourly and daily), and nature of data (raw and smoothed) are shown in Tables 5-7. In general, the boundary and domain-based method performed better than the other two groups in both daily and hourly smoothed data sets; however, all the groups achieved comparable performance with respect to the daily raw data set. Specific models such as V-SVM, K-NN, and K-means performed better in all these circumstances.

Daily Raw Data Set
Regarding the daily raw data set, as shown in Table 5, specific models such as V-SVM, MCD Gaussian, K-NN, and K-means produced relatively better descriptions of the 1-month data. For the 2-month sample size, models such as incSVDD, K-NN, LOF, and K-means achieved better performance. For the 3-month sample size, SVDD, incSVDD, V-SVM, Gaussian, MCD Gaussian, K-NN, LOF, and K-means produced comparable descriptions. As expected, SVDD and most of the density-based methods improved with larger training sizes. For the 4-month sample size, almost all the models produced much improved performance. In the group comparison, all three groups produced comparable descriptions in all the sample sizes.

Daily Smoothed Data Set
Regarding the daily smoothed data set, as shown in Table 6, almost all models achieved excellent performance and a much improved data description compared with the daily raw data set. Specific models such as V-SVM, K-NN, and K-means produced excellent descriptions of the data for all the sample sizes; however, V-SVM achieved superior performance compared with K-NN and K-means. In the group comparison, the boundary and domain-based method produced an excellent description of the data for all sample sizes.

Hourly Smoothed Data Set
Regarding the hourly smoothed data set, as shown in Table 7, almost all the models except V-SVM, which achieved the best description, failed to produce an acceptable data description from the 1-month sample size. The high variability between the performance of the models with the 1-month hourly data set could be associated with the high data granularity; the models require more data to capture the high variability among the data objects. Models such as V-SVM, MCD Gaussian, and K-means achieved superior performance from their respective groups. In general, V-SVM outperformed the other models in all the sample sizes. The density and reconstruction-based models improved with larger sample sizes. In the group comparison, the boundary and domain-based method produced a better description in all the sample sizes, and the density and reconstruction-based methods achieved equivalent performance with larger sample sizes.
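The difference between the daily and hourly granularity discussed above can be illustrated with a short sketch. It assumes a hypothetical record layout (day, hour, blood glucose, insulin units, carbohydrate grams) and assumes that each time window is summarized by the mean blood glucose and the total-insulin-to-total-carbohydrate ratio; the function name and the example values are illustrative only and are not taken from the study.

```python
from collections import defaultdict
from statistics import mean


def aggregate(records, window):
    """Summarize self-recorded data per day or per hour.

    records: iterable of (day, hour, glucose, insulin_units, carbs_g);
             this layout is a hypothetical assumption.
    window:  "daily" or "hourly".
    Returns {key: (mean glucose, insulin-to-carbohydrate ratio or None)}.
    """
    if window not in ("daily", "hourly"):
        raise ValueError("window must be 'daily' or 'hourly'")
    groups = defaultdict(list)
    for day, hour, glucose, insulin, carbs in records:
        key = (day,) if window == "daily" else (day, hour)
        groups[key].append((glucose, insulin, carbs))

    features = {}
    for key, rows in sorted(groups.items()):
        total_insulin = sum(r[1] for r in rows)
        total_carbs = sum(r[2] for r in rows)
        # The ratio is undefined when no carbohydrate was registered.
        ratio = total_insulin / total_carbs if total_carbs > 0 else None
        features[key] = (mean(r[0] for r in rows), ratio)
    return features


# Illustrative values only.
example = [
    (1, 8, 6.0, 4.0, 40.0),
    (1, 12, 8.0, 6.0, 60.0),
    (2, 8, 10.0, 5.0, 0.0),
]
daily = aggregate(example, "daily")    # one object per day
hourly = aggregate(example, "hourly")  # one object per recorded hour
```

The hourly view yields more objects per month than the daily view, but each object is computed from fewer registrations, which is one plausible reason for the higher variability noted above.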

Unsupervised Methods
Two density-based unsupervised models were tested and evaluated on the same set of data as used in the one-class classifiers: LOF and COF. The average AUC, specificity, and F1-score were computed after 20 runs. The best performing thresholds for all the infection cases along with the optimal value of k (number of neighbors) are given in Table 8. As can be seen from the table, both the LOF and the COF achieved better performance on the smoothed data set as compared with its raw version. In all the infection cases, LOF performed better than COF. This is mainly because of the characteristics of the data sets, which fulfill the LOF spherical assumption of neighbor distribution. Considering the average F1-score across all the infection cases, LOF achieved 74.7% on the raw daily data, 91.1% on the smoothed daily data, and 72.7% on the hourly data, whereas COF achieved 71.9% on the raw daily data, 85.8% on the smoothed daily data, and 68.9% on the hourly data. However, compared with the one-class classifiers, the unsupervised models suffer from performance degradation, mainly because the data are not distributed uniformly: some regions are dense, whereas others are sparse. A region of sparse density, however, does not always signify anomalies (infection incidence). For example, an individual patient on certain days might prefer to take little insulin compared with most of the days and perform heavy physical activity to replace their insulin needs. This scenario could generate an outlier, a small ratio of insulin-to-carbohydrate, which will be considered and detected as an outlier by the unsupervised models. A detailed score plot of each model for the different infection cases can be found in Multimedia Appendix 3.
Table 8. Average area under the receiver operating characteristic curve, specificity, and F1-score for both the raw and smoothed versions of the data. The parameters kd and kh represent the optimal number of nearest neighbors for the daily and hourly cases, respectively.
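To make the role of k in the unsupervised models concrete, the following is a minimal pure-Python sketch of the standard LOF score. It is not the implementation used in the study; the function name lof_scores and the synthetic points are assumptions, and the sketch assumes that the data contain no duplicate points.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, using exactly k nearest neighbors.

    Scores near 1 indicate a density similar to that of the neighbors;
    scores well above 1 indicate a point in a much sparser region.
    Assumes no duplicate points (otherwise a density could be infinite).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        nearest = order[:k]
        neighbors.append(nearest)
        k_distance.append(dist[i][nearest[-1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Synthetic, unitless example: a dense 3 x 3 grid plus one distant point.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
scores = lof_scores(grid + [(10.0, 10.0)], k=3)
```

In such a sketch, a threshold on the score separates flagged from accepted objects. A point in a sparse but normal region would also receive a high score, which is the limitation described above.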

Computational Time
Computational time is the amount of time a particular model needs to learn and execute a given task [12]. It can be regarded as one of the best performance indicators for real-time systems. For a real-time application, an optimal model is the one that achieves superior detection performance with small training and testing time. Depending on the application, sometimes models can be trained offline, which makes the training time less important [12]. In this regard, the computational times of all the models were estimated and compared with each other. The computational time was measured for different sample sizes of the training and testing data sets. The sample size of the training and testing data includes 240, 480, 720, 960, 1200, 1440, 1680, 1920, 2160, 2400, 2640, and 2880 sample objects (data points) each. The required computational time for both training and testing each model is depicted in Figures 5 and 6. The figures demonstrate a rough estimation of the computational time, where each model learns the data set and classifies the sample objects. During the training phase, NN, SVDD, and SOM took considerable time. For a training sample size of 2880 objects, NN requires 296 times, SVDD requires 206 times, and SOM requires 42 times the time taken by K-NN on the same sample size. Generally, as the number of sample objects increases, these models require much more time. However, K-means, Gaussian families, LOF, MST, K-NN, V-SVM, PCA, auto-encoder, and incSVDD took less time. These models took almost constant time even when the number of samples increased. During the testing phase, only the LOF took considerable time compared with the other models, as can be seen in Figure 6.
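As an illustration of how training and testing time can be estimated separately, the sketch below times a stand-in model with time.perf_counter. The stand-in (distance to the k-th nearest training object), the helper names, and the random data are assumptions made for this example; the sketch does not reproduce the measurements reported in Figures 5 and 6.

```python
import math
import random
import time


def time_fit_and_score(fit, score, train, test):
    """Return (training seconds, testing seconds, scores) for one model."""
    start = time.perf_counter()
    model = fit(train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    scores = [score(model, x) for x in test]
    test_seconds = time.perf_counter() - start
    return train_seconds, test_seconds, scores


def fit_knn(train):
    # A nearest neighbor model only stores the training objects.
    return list(train)


def score_knn(model, x, k=3):
    # Distance from x to its k-th nearest training object.
    return sorted(math.dist(x, t) for t in model)[k - 1]


def random_objects(n, seed):
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]


if __name__ == "__main__":
    # First three of the sample sizes listed in the text.
    for n in (240, 480, 720):
        t_train, t_test, _ = time_fit_and_score(
            fit_knn, score_knn, random_objects(n, 0), random_objects(n, 1)
        )
        print(n, round(t_train, 4), round(t_test, 4))
```

Passing the fit and score functions as arguments keeps the timing harness identical for every model, so that differences in the measured times reflect the models rather than the harness.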

Principal Findings
Anomaly or novelty detection has been widely used in various applications including machine fault and sensor failure detection, prevention of credit card or identity fraud, health and medical diagnostics and monitoring, cyber-intrusion detection, and others [1-3]. In applications related to health and medical diagnostics and monitoring, anomaly detection has been used to detect and identify the abnormal health state of an individual, for example, detecting abnormal patterns of heartbeat recorded using an electrocardiogram [1,51-54]. The omnipresence of various physiological sensors has facilitated circumstances for individuals to easily self-record health-related events and data for the purpose of self-informatics and management [55]. Currently, people are generating huge amounts of data on a daily basis that can contribute to both individual and public health purposes [54]. To this end, people with diabetes are not an exception, generating rich data in both quality and quantity, which is expected to further improve with advances in diabetes technologies. These data can provide valuable information if processed with the right tools and methodology; one particular instance is detecting novel or anomalous data points for various purposes. The availability of labeled data constrains the choice of methods in the anomaly detection problem [3,9-11]. Supervised anomaly detection methods are impractical for applications such as detecting infection incidences in people with type 1 diabetes for a number of reasons [10,12]. Blood glucose dynamics are affected by various other factors apart from infection incidences [19,56,57], and characterization of infection-induced anomalies (abnormal class) from the normal class [13] is a challenging task for the following reasons:
1. There are no well-defined boundaries regarding how different pathogens affect various key parameters of blood glucose dynamics, including blood glucose levels, insulin injections, carbohydrate ingestions, physical activity or exercise load, and others. This results in poor boundary demarcation between the normal and abnormal classes.
2. Class boundaries defined for a single pathogen might not work for the other pathogens because the effect of different pathogens on the blood glucose dynamics could be different.
3. It is expensive and time consuming to collect infection-related data to explore and characterize pathogen-specific class boundaries. This results in ill-defined class boundaries even for an infection related to a single pathogen.
4. The degree of effect of the same pathogens on the blood glucose dynamics could differ between different individuals because of the difference in individual immunity, which further complicates the characterization task.
5. Lack of sufficient sample size for both the abnormal and the normal classes results in poor training and testing data sample size or imbalanced class problems.
Given these challenges, the best possible approach is to identify methods that can learn from the normal health state of an individual and classify abnormalities relying on the boundaries learnt from the normal health state, which is a one-class classifier approach. This definitely reduces the challenge because it only requires the characterization of what is believed to be a normal health state. For instance, assume a health diagnostic and monitoring system that detects health changes in an individual by tracking the individual's physiological parameters, where the current health status is examined based on set of parameters, and raises a notification alarm when the individual health deteriorates [12]. In such a system, it becomes feasible to rely on a method that can be trained using only the regular or normal day measurements (target days) so as to detect deviation from normality [12,14]. Another possible alternative approach is to identify a method that does not require any characterization and labeling of classes, which is unsupervised methods [7]. Accordingly, considering the previously mentioned challenges, one-class classifiers and unsupervised models were proposed for detecting infection incidence in people with type 1 diabetes. The objective was to develop a personalized health model that can automatically detect the incidence of infection in people with type 1 diabetes using blood glucose levels and insulin-to-carbohydrate ratio as input variables. The model is expected to detect any deviations from the norm as a result of infection incidences considering blood glucose level (hyperglycemia incidences) coupled with unusual changes in the insulin-to-carbohydrate ratio, that is, frequent insulin injections and unusual reduction in the amount of carbohydrate intake [19]. A personalized health model based on one-class classifiers and unsupervised methods was tested using blood glucose levels and the insulin-to-carbohydrate ratio as a bivariate input. 
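The one-class idea described above, learning only from normal days and flagging deviations from them, can be sketched with a simple nearest neighbor data description. This is an illustrative stand-in rather than any of the specific models evaluated in this study; the function names, the choice of k, the acceptance threshold (the largest leave-one-out distance among the training days), and the standardized example values are all assumptions.

```python
import math


def kth_neighbor_distance(x, reference, k):
    """Distance from x to its k-th nearest object in reference."""
    return sorted(math.dist(x, r) for r in reference)[k - 1]


def fit_one_class_knn(normal_days, k=2):
    """Describe the normal (target) class using normal days only."""
    # Leave-one-out distance of every training day to the remaining days.
    leave_one_out = [
        kth_neighbor_distance(day, normal_days[:i] + normal_days[i + 1:], k)
        for i, day in enumerate(normal_days)
    ]
    # Assumption: accept anything as close as the most isolated normal day.
    return {"train": list(normal_days), "k": k, "threshold": max(leave_one_out)}


def is_deviation(model, day):
    """True when the day lies outside the learned normal description."""
    distance = kth_neighbor_distance(day, model["train"], model["k"])
    return distance > model["threshold"]


# Hypothetical standardized (insulin-to-carbohydrate ratio, blood glucose) pairs.
normal_days = [
    (0.0, 0.0), (0.1, 0.2), (-0.2, 0.1), (0.2, -0.1),
    (-0.1, -0.2), (0.3, 0.1), (0.0, 0.3),
]
model = fit_one_class_knn(normal_days, k=2)
flagged = is_deviation(model, (3.0, 3.0))       # far from every normal day
accepted = not is_deviation(model, (0.05, 0.05))  # inside the normal region
```

No infection days are needed to fit the description, which is what makes the approach practical when the abnormal class is poorly sampled.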
The results demonstrated the potential of the proposed approach, which achieved excellent performance in describing the data set, that is, detecting infection days from the regular or normal days, and, in particular, the boundary and domain-based method performed better. Within the respective groups, particular models such as V-SVM, K-NN, and K-means achieved excellent performance in all the sample sizes and infection cases. However, the unsupervised approaches suffer performance degradation compared with the one-class classifier mainly because of the atypical nature of the data, which are not distributed uniformly, where some regions may contain high density and others might be sparse (Multimedia Appendix 2). There are rare events (sparse regions) of blood glucose dynamics that are a normal response; however, the unsupervised methods can still detect them and raise false alarms, including the following:
1. Carbohydrate action: a situation in which the ratio of insulin-to-carbohydrate is small and the blood glucose levels are high (hyperglycemia), Carb Action-Quadrant 1 in Figure 7. This is a normal response of blood glucose dynamics, as consumption of more carbohydrates and less insulin intake can drive blood glucose dynamics into the hyperglycemia region (high blood glucose levels) if there is no physical activity session. A typical example of this particular situation is holiday seasons, where people consume too many carbohydrates.
2. Physical activity action: despite a small ratio of insulin-to-carbohydrate, the blood glucose levels still drop to low levels (hypoglycemia), PA Action-Quadrant 2 in Figure 7. Normally, a small ratio of insulin-to-carbohydrate signifies that the patient consumed more carbohydrates and injected less insulin, which normally drives the blood glucose dynamics into the hyperglycemia region. However, despite taking more carbohydrates and less insulin, rigorous physical exercise can still drive the blood glucose dynamics into the hypoglycemia region. Therefore, this is a normal response of blood glucose dynamics, as the action of physical activity or exercise can drive the patient into hypoglycemic regions even if the patient consumes more carbohydrates. For example, an individual patient on certain days might prefer to take little insulin as compared with most of the days and perform heavy physical activity to replace their insulin needs. This scenario could generate an outlier, a small ratio of insulin-to-carbohydrate, which will be considered and detected as an anomaly by the unsupervised models. However, this could be mitigated by incorporating physical activity data as an input variable.
3. Insulin action: the ratio of insulin-to-carbohydrate is large, that is, high insulin intake and low carbohydrate consumption, and blood glucose levels are low (hypoglycemia), Insulin Action-Quadrant 3 in Figure 7. This is a normal response of blood glucose dynamics, as administration of high insulin with little carbohydrate consumption can drive the blood glucose dynamics into the hypoglycemic region.
The drawback of unsupervised methods is that they do not have any mechanism to handle rare events even if the events are normal. This is mainly because unsupervised methods define an anomaly on the basis of the entire data set. However, the one-class classifier can learn and handle such scenarios appropriately if they are presented during the training phase. This is mainly because one-class classifiers produce a reference description based on the available normal (target) data set, including the rare events.
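The three normal-response situations listed above can be summarized as a simple labeling rule over the two input variables. The sketch below is only an illustration: the personal reference values that split "small" from "large" ratios and "low" from "high" blood glucose are hypothetical parameters, and the label for the remaining combination (high blood glucose despite a large ratio) follows the description of the infection pattern given earlier in this section rather than a quadrant named in the text.

```python
def label_quadrant(ratio, glucose, ratio_ref, glucose_ref):
    """Label one (insulin-to-carbohydrate ratio, blood glucose) observation.

    ratio_ref and glucose_ref are hypothetical personal reference values
    that separate small from large ratios and low from high glucose.
    """
    large_ratio = ratio > ratio_ref
    high_glucose = glucose > glucose_ref

    if not large_ratio and high_glucose:
        return "carbohydrate action"        # Quadrant 1 in the text
    if not large_ratio and not high_glucose:
        return "physical activity action"   # Quadrant 2 in the text
    if large_ratio and not high_glucose:
        return "insulin action"             # Quadrant 3 in the text
    # Large ratio together with high glucose: the combination the text
    # associates with infection incidences.
    return "possible infection pattern"


# Illustrative values only (ratio in units per gram, glucose in mmol/L).
label = label_quadrant(ratio=0.05, glucose=12.0, ratio_ref=0.1, glucose_ref=8.0)
```

An unsupervised model has no access to such labels, so sparse observations in any of the first three situations may be flagged, whereas a one-class classifier that has seen them during training can accept them as normal.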
With regard to the one-class classifiers, the boundary and domain-based method achieved a better description of the data set compared with the density and reconstruction-based methods, mainly because of the ability of such models to handle the atypical nature of the data [12]. Detectability of the infection incidence is directly related to the extent and degree of the effect it induces on the blood glucose dynamics. The type of pathogen, individual's immunity, and hormones involved could play a role in determining the degree of severity in this regard [19,24,58-62]. To this end, the results demonstrated that the models were capable of detecting all the infection incidences that can significantly alter the blood glucose dynamics, such as influenza. Moreover, infection incidence that had a moderate effect on the blood glucose dynamics, such as mild common cold without fever, was also detected. However, as expected, infection incidences that had very little effect on the blood glucose dynamics, such as light common cold without fever, as reported by the individual patient, were not detected. Regarding the computational time, NN, SVDD, and SOM took considerable training time, which typically increased as the number of sample objects increased. Moreover, compared with the other models, only LOF and COF took considerable testing time.

Comparative Analysis of the Methods
Selecting the proper model for implementation in a real-world setting requires considering different characteristics of the model. This includes typical model characteristics such as performance in limited training sample size, robustness to outliers in the training data, required training and testing time, and complexity of the model (in terms of the number of model parameters).

Performance and Sample Size
The sample size, N, is the number of sample objects used during the training phase and highly affects the generalization power of the model [12,13]. Models trained with small sample sizes often fail to produce satisfactory descriptions, mainly because of the large variance in the sample objects [3,12,13,63]. To this end, the results indicate that most of the models fail to produce good descriptions with a 1-month (30 objects) data set, mainly with the daily raw data set, as shown in Figure 8. The figure depicts the average performance of each model across all the infection cases over the 1- and 4-month sample sizes. Specifically, MST, Gaussian families, SOM, and auto-encoders require a considerable amount of training sample objects to better describe the data. There are some exceptions, for instance V-SVM, which produces a satisfactory description of the 1-month data sets in all the infection cases and data granularities. Models such as NN and PCA produced the worst description in most cases. As the number of training sample objects increased, all the models improved and produced a comparable description of the data. As a rule of thumb, for the daily scenario, a 3-month training sample (90 sample objects) produces a good description of the data, which can be considered for real-world applications. Moreover, if smoothing is considered, a 1-month sample size produces a better description than the 4-month sample size without smoothing, as shown in Figure 8. However, for the hourly scenario, a 1-month training sample produces a comparable description, and any larger sample size will be sufficient.
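The effect of smoothing mentioned above can be illustrated with a trailing moving average. The smoothing filter actually used in the study is not described in this section, so the filter type, the window length, and the function name below are assumptions made purely for illustration.

```python
def moving_average(values, window):
    """Trailing moving average; early points use the values available so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Illustrative daily values with one isolated spike.
raw = [6.0, 6.5, 6.2, 11.0, 6.4, 6.1, 6.3]
smooth = moving_average(raw, window=3)
```

Smoothing of this kind reduces the variance of the training objects, which is consistent with the observation that a smaller smoothed sample can yield a better description than a larger raw sample.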

Computational Time
For real-time applications, the time a model takes to learn and classify the sample objects is essential in model selection. Table 9 depicts a rough estimation of the average training and testing time required by the different classifiers, both the one-class classifiers and the unsupervised models, based on 2880 training and testing sample objects each. Most of the models, as shown in Figures 5 and 6 and Table 9, require reasonable training and testing time, except NN, SVDD, and SOM, which took a considerably longer time. However, it is possible that in some cases models can be trained offline, which makes the training time less important. With regard to the testing time, most of the models executed the classification task in a reasonable time except COF and the one-class classifier version of LOF, which consume considerable time to classify the 2880 objects. The computational time in these particular models grows exponentially as the sample size increases, which makes them resource demanding in a big data setting.

Robustness to Outliers in the Training Data Set
The presence of outliers in the training data set could significantly affect the model's generalization ability. Outlier objects are samples that exhibit different characteristics compared with the rest of the objects in the data set [8,63]. For instance, an individual might forget a previous infection incident and could label these days as a regular or normal period during self-reporting, which could end up being used as target data sets for training. Another important example could be error recorded during data registration, that is, carbohydrate, blood glucose levels, and insulin registration. Such errors could occur during the manual registration of carbohydrates, associated with infusion set failures and other similar situations. In this scenario, an individual could record lower or higher values incorrectly affecting the input features, for example, ratio of insulin-to-carbohydrate and blood glucose levels, resulting in an outlier that could greatly affect the model's generalization ability. In this type of situation, a model's sensitivity to outliers in the training data is crucial to curb the influence of outliers on the accuracy of the description generated. To some extent, a user-specified empirical rejection rate is incorporated in the models to reduce the effect of outliers in the training data by rejecting the most dissimilar objects from the description generated. For example, a rejection rate of 1% on training data sets implies that 1% of outliers in the training data set are rejected. Nevertheless, the sensitivity of models to outliers in the training data sets differs greatly between models. Among the models, NN is regarded as the most sensitive model to outliers in the training data set [12]. The presence of outliers in the training data changes the shape of the description generated by the model, forcing a larger portion of the feature space to be accepted as the target class [10,12]. 
Furthermore, models that rely on an estimation of the covariance matrix, for example, Gaussian families, also suffer from the presence of outliers in the training data sets [12,36]. However, when equipped with regularization, Gaussian models can withstand such outliers. Local density estimators such as Parzen can withstand outliers, considering the fact that only the local density is affected [12]. Models that rely on prototype estimation, such as SOM and K-means, are highly affected by the presence of outliers in the training data set, which could force the estimated prototype to be placed near or at the nontarget data set [2,12,13]. Nevertheless, boundary and domain-based methods such as SVDD and V-SVM and reconstruction-based methods such as auto-encoders are more or less insensitive to outliers and can generate acceptable solutions [3,12,64].
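The user-specified rejection rate described above can be sketched as a threshold placed on the training scores so that the stated fraction of the most dissimilar training objects falls outside the description. The function name, the convention that a larger score means a more dissimilar object, and the example scores are assumptions for illustration.

```python
import math


def rejection_threshold(train_scores, rejection_rate):
    """Largest training score that is still accepted.

    Objects with a score above the returned threshold are rejected, so
    about rejection_rate of the training objects fall outside the description.
    """
    if not 0.0 <= rejection_rate < 1.0:
        raise ValueError("rejection_rate must be in [0, 1)")
    ordered = sorted(train_scores)
    n_reject = int(math.floor(rejection_rate * len(ordered)))
    return ordered[len(ordered) - 1 - n_reject]


# 99 typical dissimilarity scores plus one mislabeled day (illustrative values).
scores = [float(i) for i in range(1, 100)] + [1000.0]
no_rejection = rejection_threshold(scores, 0.0)    # stretched by the outlier
one_percent = rejection_threshold(scores, 0.01)    # outlier left outside
```

Without rejection, a single mislabeled training day stretches the accepted region, which corresponds to the enlarged description discussed above; with a 1% rejection rate, the threshold returns to the range of the regular days.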

Model Parameters and Associated Complexity
The parameters of a model can be either free or user defined. These two types of parameters provide insight into how flexible the model is, how sensitive the model is to overtraining, and how easy the model is to configure (simplicity) [12,16]. The number of these parameters varies considerably among the models. For instance, NN does not possess any free parameters; therefore, its performance completely relies on the training data set [12]. This is a limitation, mainly because training data that contain outliers could ruin the model's performance [12,15,16]. A model that possesses a large number of free and user-defined parameters is too flexible and complex [12]. Regarding the user-defined parameters, also known as hyper-parameters, a model equipped with a small number of parameters, preferably with intuitive meaning, is easy to configure. Setting the user-defined parameters incorrectly can degrade the model's performance, and selecting the proper values (optimization) becomes complex and vague as the number of model parameters becomes too large. Among the simplest models are Parzen density and NN, which do not require the user to specify any parameters [3,12,13]. Some models, such as support vector families, require the user to specify parameters that have intuitive meaning, for example, the ratio of training objects to be rejected by the description [12,65]. There are also models that are complex given that the user is expected to specify many parameters, which are not intuitive and require careful choice. Examples of such models include SOM and auto-encoders, where the user is expected to supply the number of neurons, hidden units, and learning rate [10,12,37,66].

Practical Illustration and Area of Applications
For a real-world application, apart from the performance of the model, it is important to consider two aspects of the data set: the time window of detection (data granularity) and the required sample size. The time window or data granularity, that is, hourly and daily, defines the frequency (continuity) of computation one needs to conduct throughout the day to screen the health status of the individual with type 1 diabetes. In an hourly time window, one is expected to carry out the computation at the end of each hour throughout the day. However, in the daily time window, one needs to carry out one aggregate computation at the end of the day. Decreasing the time window (increasing the granularity of the data) enhances early detection, however, at the cost of accuracy, for example, more unwanted features (noise) in the data. The results demonstrated that almost all the models produced fairly comparable detection performances in both time windows. Moreover, the required sample size determines the necessary amount of data an individual with type 1 diabetes needs to collect in advance before joining such an infection detection system. Models that could generalize well with small sample sizes could be preferred in a real-world application to enable more people to join the system with ease. Generally, the results demonstrated that the models require at least a sample size of 3-month data for the daily case and 2-month data for the hourly case to perform better. Automating the detection of infection incidences among people with type 1 diabetes can deliver a means to provide personalized decision support and learning platforms for the individuals and, at the same time, can be used to detect infectious disease outbreaks on a large scale through spatio-temporal cluster detection [19,67,68]. Detailed descriptions of these instances are given below:
1. A personalized decision support system and learning platform relies on an individual's self-recorded data to provide relevant information in relation to decision making to assist the individuals during crises [19,67,68]. Moreover, it can also provide a learning platform concerning the extent to which infection incidence affects the key parameters of the blood glucose dynamics. Information regarding what to expect at each stage of the course of infection could be very important to the individuals [19]. During infection incidences, various kinds of information could be vital for an individual to properly manage blood glucose levels, including time in range (blood glucose), the extent to which the evolution of blood glucose is affected during the course of infection, the extent to which insulin sensitivity changes, and how much the insulin-to-carbohydrate ratio shifts, that is, changes in insulin requirements for each gram of carbohydrate intake.
2. A population-based early outbreak detection system relies on self-recorded information from individuals with type 1 diabetes to detect individuals' infection cases and, thereby, detect a group of infected individuals on a spatio-temporal basis. Such a system should collect individuals' self-recorded data to a central server, analyze individuals' data on a timely basis, identify and locate a cluster of people based on space and time, and notify the responsible bodies if there is an ongoing outbreak [19,67-71].
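The population-based use case described above can be illustrated with a deliberately simple counting rule: raise an alert when enough distinct individuals in the same region are flagged within a short time window. This sketch is not the spatio-temporal cluster detection method of the cited works; the record layout, the region labels, the window length, and the minimum case count are hypothetical.

```python
from collections import defaultdict


def find_clusters(detections, window_days, min_cases):
    """Return (region, first day, number of people) for each alerting region.

    detections: iterable of (person_id, region, day) for individuals whose
                personal model flagged a possible infection on that day.
    """
    by_region = defaultdict(list)
    for person, region, day in detections:
        by_region[region].append((day, person))

    alerts = []
    for region, events in sorted(by_region.items()):
        events.sort()
        for start_day, _ in events:
            # Distinct people flagged within the window starting at start_day.
            people = {
                person
                for day, person in events
                if start_day <= day < start_day + window_days
            }
            if len(people) >= min_cases:
                alerts.append((region, start_day, len(people)))
                break  # report the earliest qualifying window per region
    return alerts


# Illustrative detections only.
detections = [
    ("a", "north", 1), ("b", "north", 2), ("c", "north", 3),
    ("d", "south", 1), ("e", "south", 20),
]
alerts = find_clusters(detections, window_days=7, min_cases=3)
```

A real system would need a statistically grounded test against the expected background rate; the counting rule only conveys how individual detections could be combined in space and time.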

Conclusions
Anomaly or novelty detection has been widely used in various applications including machine fault and sensor failure detection, prevention of credit card or identity fraud, health and medical diagnostics and monitoring, cyber-intrusion detection, and others. In this study, we demonstrated the applicability of one-class classifiers and unsupervised anomaly detection methods for the purpose of detecting infection incidences in people with type 1 diabetes. In general, the proposed methods produced excellent performance in describing the data set, and particularly the boundary and domain-based method performed better. With regard to the specific models, V-SVM, K-NN, and K-means achieved better generalization in describing the data set in all infection cases. Detecting the incidence of infection in people with type 1 diabetes can provide an opportunity to devise tailored services, that is, personalized decision support and a learning platform for the individuals, and can simultaneously be used for detecting potential public health threats, that is, infectious disease outbreaks, on a large-scale basis through spatio-temporal cluster detection. Generally, we foresee that the results presented could encourage researchers to further examine the presented features along with other additional features of self-recorded data, for example, various CGM features and physical activity data, on a large-scale basis.