This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Semisupervised and unsupervised anomaly detection methods have been widely used to detect anomalous objects in a given data set. These methods are particularly popular in the medical domain, where sufficient data for the nontarget classes are often lacking. In people with type 1 diabetes, infection incidence often brings prolonged hyperglycemia and frequent insulin injections, which are significant anomalies. Despite this potential, very few studies have focused on detecting infection incidences in individuals with type 1 diabetes using a dedicated personalized health model.
This study aims to develop a personalized health model that can automatically detect the incidence of infection in people with type 1 diabetes using blood glucose levels and the insulin-to-carbohydrate ratio as input variables. The model is expected to detect deviations from the norm caused by infection incidences, reflected as elevated blood glucose levels coupled with unusual changes in the insulin-to-carbohydrate ratio.
Three groups of one-class classifiers were trained on target data sets (regular days) and tested on a data set containing both target and nontarget data (infection days). For comparison, two unsupervised models were also tested. The data set consists of high-precision self-recorded data collected from three real subjects with type 1 diabetes, incorporating blood glucose, insulin, diet, and events of infection. The models were evaluated on two versions of the data, raw and filtered, and compared based on their performance, computational time, and number of samples required.
The one-class classifiers achieved excellent performance. In comparison, the unsupervised models suffered from performance degradation, mainly because of the atypical nature of the data. Among the one-class classifiers, the boundary and domain-based method produced a better description of the data. Regarding computational time, nearest neighbor, support vector data description, and self-organizing map took considerable training time, which typically increased with sample size, whereas only local outlier factor and connectivity-based outlier factor took considerable testing time.
We demonstrated the applicability of one-class classifiers and unsupervised models for the detection of infection incidence in people with type 1 diabetes. In this patient group, detecting infection can provide an opportunity to devise tailored services and also to detect potential public health threats. The proposed approaches achieved excellent performance; in particular, the boundary and domain-based method performed better. Within their respective groups, particular models such as one-class support vector machine, K-nearest neighbor, and K-means achieved excellent performance across all sample sizes and infection cases. Overall, we foresee that the results could encourage researchers to look beyond the presented features to additional features of the self-recorded data, for example, continuous glucose monitoring features and physical activity data, on a larger scale.
The anomaly or novelty detection problem involves identifying instances that exhibit characteristics different from the rest of the data set, and it has been widely used in various applications, including machine fault and sensor failure detection, prevention of credit card or identity fraud, health and medical diagnostics and monitoring, cyber-intrusion detection, and others [
Type 1 diabetes, also known as insulin-dependent diabetes, is a chronic disease of blood glucose regulation (homeostasis) caused by the lack of insulin secretion from pancreatic cells [
A group of one-class classifiers and unsupervised models were tested and compared. The one-class classifiers comprise three groups: boundary and domain-based, density-based, and reconstruction-based methods. The boundary and domain-based group contains support vector data description (SVDD) [
Equipment used in the self-management of diabetes.

Patients | BGa | Insulin administration | Diet | Body weight (kg) | HbA1cb (%)
Subject 1 | Finger pricks recorded in the Diabetes Diary mobile app and Dexcom CGMc | Insulin pen (multiple bolus and 1-time basal in the morning) recorded in the Diabetes Diary mobile app | Carbohydrate in grams recorded in the Diabetes Diary mobile app; level 3 (advanced carb counting) | 83 | 6.0
Subject 2 | Finger pricks recorded in the Spike mobile app and Dexcom G4 CGMc | Insulin pen (multiple bolus [Humalog] and 1-time basal [Toujeo] before bed) recorded in the Spike mobile app | Carbohydrate in grams recorded in the Spike mobile app; level 3 (advanced carb counting) | 77 | 7.3
Subject 3 | Enlite (Medtronic) CGMc and Dexcom G4 | Medtronic MiniMed G640 insulin pump (basal rates profile [Fiasp] and multiple bolus [Fiasp]) | Carbohydrate in grams recorded in pump information; level 3 (advanced carb counting) | 70 | 6.2
aBG: blood glucose.
bHbA1c: hemoglobin A1c.
cCGM: continuous glucose monitoring.
Daily scatter plot of average blood glucose levels versus total insulin (bolus) to total carbohydrate ratio for a specific regular or normal patient year without any infection incidences.
Hourly scatter plot of average blood glucose levels versus total insulin (bolus) to total carbohydrate ratio for a specific regular or normal patient year without any infection incidences.
Daily scatter plot of average blood glucose levels versus total insulin (bolus) to total carbohydrate ratio for a specific patient year with an infection incidence (flu).
Hourly scatter plot of average blood glucose levels versus total insulin (bolus) to total carbohydrate ratio for a specific patient year with an infection incidence (flu).
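The two axes of these plots can be derived from the self-recorded logs. The following pandas sketch is purely illustrative (column and variable names are hypothetical, not from the study's pipeline): it aggregates timestamped records into the daily average blood glucose and the total bolus insulin-to-total carbohydrate ratio.

```python
import pandas as pd

# Hypothetical self-recorded logs: timestamped blood glucose readings,
# bolus insulin doses (units), and carbohydrate intake (grams).
bg = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 08:00", "2020-01-01 13:00",
                            "2020-01-02 08:30", "2020-01-02 19:00"]),
    "glucose": [5.8, 7.2, 10.9, 11.4],
})
bolus = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 08:05", "2020-01-02 08:35"]),
    "units": [4.0, 7.0],
})
carbs = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 08:10", "2020-01-02 08:40"]),
    "grams": [50.0, 45.0],
})

# Daily features mirroring the two model inputs: average blood glucose
# and the ratio of total bolus insulin to total carbohydrate.
daily = pd.DataFrame({
    "avg_bg": bg.set_index("time")["glucose"].resample("D").mean(),
    "bolus": bolus.set_index("time")["units"].resample("D").sum(),
    "carbs": carbs.set_index("time")["grams"].resample("D").sum(),
})
daily["insulin_carb_ratio"] = daily["bolus"] / daily["carbs"]
print(daily[["avg_bg", "insulin_carb_ratio"]])
```

Resampling with an hourly rule instead of a daily one would yield the hourly features in the same way.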
The performance of the one-class classifiers was evaluated using 20 times 5-fold stratified cross-validation. For both daily and hourly cases, the user-specified outlier fraction threshold β was set to 0.01 such that 1% of the training target data are allowed to be classified as outliers or rejected [
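The role of the outlier fraction threshold can be reproduced generically with scikit-learn's one-class SVM, whose nu parameter bounds the fraction of training targets rejected as outliers. This is a minimal sketch on synthetic two-feature data (average blood glucose, insulin-to-carbohydrate ratio), not the study's implementation:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical target data: regular days, two features
# (average blood glucose, insulin-to-carbohydrate ratio).
X_target = rng.normal(loc=[6.0, 0.08], scale=[0.6, 0.01], size=(120, 2))

# Hypothetical nontarget data: infection days with elevated glucose
# and an unusually high insulin-to-carbohydrate ratio.
X_infection = rng.normal(loc=[11.0, 0.15], scale=[0.8, 0.02], size=(10, 2))

# nu plays the same role as the outlier fraction threshold: roughly 1%
# of training targets are allowed to fall outside the boundary.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01).fit(X_target)

pred = clf.predict(X_infection)  # +1 = target (regular), -1 = outlier
print((pred == -1).mean())
```

In practice the features would first be standardized; this sketch omits scaling for brevity.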
Computation time: this characteristic defines the amount of time taken to train and test the model. Regarding personal use, response time is crucial for acceptance of the services by a wide range of users. Furthermore, with regard to outbreak detection settings, this is an important parameter given that a system that uses data from many participants needs to have an acceptable response time. However, in real-world applications, the training phase can be performed offline, which makes the testing response time the most crucial.
Sample size: this characteristic specifies the minimum amount of training data required to generate an acceptable performance. This is an important factor given that the system relies on self-recorded data; it is difficult to accumulate a large set of data for an individual initially.
Number of user-defined parameters: this characteristic defines the complexity of the model. A model with fewer parameters is simpler and requires less data to estimate. This is an important factor because it is easier for an individual to implement a simple model than a complex one.
Sensitivity to outliers in the training data sets: this characteristic defines how the model estimation is affected by outliers in the training set. This is a crucial characteristic because the model training depends on self-reported data, which are highly dependent on the accuracy of the user data registration. It is possible that the user might forget to report some infection incidences, which would then be treated as target data and used for training. Furthermore, errors incurred during manual registration of data can also affect model generalization.
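The first characteristic above, train and test response time, can be measured directly around the fit and prediction calls. A minimal sketch using Python's perf_counter with a stand-in scikit-learn K-means model (purely illustrative, with distance to the nearest centroid as the anomaly score):

```python
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X_train = rng.normal(size=(5000, 2))  # synthetic training features
X_test = rng.normal(size=(1000, 2))   # synthetic test features

t0 = time.perf_counter()
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
# Distance to the nearest centroid serves as the anomaly score.
scores = model.transform(X_test).min(axis=1)
test_time = time.perf_counter() - t0

print(f"train: {train_time:.4f} s, test: {test_time:.4f} s")
```

Wrapping every model in the same timing harness makes the train versus test time comparison across classifier groups straightforward.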
The study protocol was submitted to the Norwegian Regional Committees for Medical Health Research Ethics Northern Norway for evaluation and was found exempt from regional ethics review because it falls outside the scope of medical research (reference number: 108435). Written consent was obtained, and the participants donated the data sets. All data from the participants were anonymized.
The models were evaluated based on two different versions of the same data set: raw and filtered. The input variables to the models were the average blood glucose levels and the ratio of total insulin (bolus)-to-total carbohydrate. The necessary computational time for both training and testing of the models was also estimated. A comparison of the classifiers was carried out taking into account their performance, necessary sample size for producing acceptable performance, and computational time. These models were further compared based on their theoretical guarantee provided for robustness to outliers in the target data set and based on their complexity. In addition, these classifiers were compared with the unsupervised version of some selected models.
Model training and evaluations were carried out on an individual basis taking into account different characteristics of the data, the specified time window or resolution (hourly and daily), and the nature of the data (raw data and its smoothed version). For daily evaluation, we compared the performance of the models on the raw data and its smoothed version with a 2-day moving average filter. For hourly evaluation, we compared the performance of the model on a smoothed version of the data set. The purpose of the comparison was to study the performance gain achieved by removing short-time noise from the data set through smoothing. The average and SD of AUC, specificity, and F1-score were computed and reported for each model. The top-performing models from each category are highlighted in italics within each table.
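The 2-day moving average filter used for the daily evaluation can be sketched with pandas (a generic illustration with made-up glucose values; the series name is hypothetical):

```python
import pandas as pd

# Hypothetical daily series of average blood glucose (mmol/L); the
# mid-week elevation mimics an infection episode.
bg = pd.Series(
    [6.1, 6.4, 5.9, 10.8, 11.2, 10.5, 6.2],
    index=pd.date_range("2020-01-01", periods=7, freq="D"),
    name="avg_bg",
)

# 2-day moving average: each value is averaged with the previous day,
# suppressing short-lived spikes while preserving the multiday
# elevation characteristic of infection periods.
smoothed = bg.rolling(window=2, min_periods=1).mean()
print(smoothed.round(2).tolist())
```

For the hourly evaluation, the same idea applies with a 48-sample window over the hourly series.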
The regular or normal days were labeled as the target class data set and the infection period as the nontarget class data set. Three groups of one-class classifiers were trained on the target class and tested on a data set containing both the target and the nontarget classes. In addition to the data characteristics stated above, resolution and data nature, the one-class classifier performance was also assessed taking into account the sample size required to produce an acceptable data description. In this direction, we consider four groups of sample size: 1-month, 2-month, 3-month, and 4-month data sets. In the model evaluation, the data set containing the infection period was presented during testing. The evaluation was carried out based on 20 times 5-fold stratified cross-validation. The performance of the model was reported as the average and SD of AUC, specificity, and F1-score of the rounds. A score plot of each model for both the hourly and the daily scenarios using the smoothed version of the data can be found in
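The reported metrics can be computed from a model's score vector and the binary target/nontarget labels. A scikit-learn sketch with made-up scores and a hypothetical decision threshold (not the study's data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix

# Hypothetical test labels: 1 = target (regular day), 0 = nontarget
# (infection day), and model scores where higher means more target-like.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.2, 0.3, 0.1, 0.55])

auc = roc_auc_score(y_true, scores)       # threshold-free ranking metric

y_pred = (scores >= 0.5).astype(int)      # hypothetical decision threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# With this labeling, specificity is the rate at which infection days
# are correctly flagged as nontarget.
specificity = tn / (tn + fp)
f1 = f1_score(y_true, y_pred)
print(round(auc, 3), round(specificity, 3), round(f1, 3))
```

Repeating this computation over the 20 times 5-fold cross-validation rounds and averaging gives the mean (SD) values reported in the tables.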
As can be seen in
The boundary and domain-based method achieved a better description of the data with a small sample size when compared with the other two groups. However, as the sample size increased, all three groups achieved relatively comparable descriptions of the data. Specific models such as V-SVM, K-NN, and K-means performed best within their respective groups. Regarding the raw data, as seen in
From the boundary and domain-based method, V-SVM performed better in all the sample sizes and achieved comparable performance even with 60 objects and improved significantly afterward. SVDD produced a comparable description with higher sample sizes, that is, 3 months and later.
From the density-based method, K-NN performed best across all sample sizes, achieving good performance even with 60 objects. Naïve Parzen produced comparable performance with higher sample sizes, that is, 3 months and later.
From the reconstruction-based method, K-means achieved better performance for all sample sizes.
Smoothing the data, as shown in
From the boundary and domain-based method, V-SVM achieved better performance in all sample sizes.
From the density-based method, K-NN achieved better performance for all sample sizes, minimum covariance determinant (MCD) Gaussian produced a comparable description with 30 and 60 sample objects, and naïve Parzen achieved comparable description of the data with 4-month sample objects.
Regarding the reconstruction-based method, PCA achieved good performance with 30 and 60 sample objects, whereas K-means performed better with larger sample objects.
Average (SD) of the area under the receiver operating characteristic curve, specificity, and F1-score for the raw data set (without smoothing) at different sample sizes. Outlier fraction=0.01.
Models | 1 month | 2 months | 3 months | 4 months
 | AUCa, mean (SD) | Specificity, mean (SD) | F1, mean (SD) | AUC, mean (SD) | Specificity, mean (SD) | F1, mean (SD) | AUC, mean (SD) | Specificity, mean (SD) | F1, mean (SD) | AUC, mean (SD) | Specificity, mean (SD) | F1, mean (SD)
Boundary and domain-based
SVDDb | 90.7 (8.8) | 71.7 (7.7) | 73.6 (5.5) | 93.4 (6.2) | 81.7 (5.0) | 87.4 (8.1) | 96.4 (2.9) | 87.8 (3.3) | 91.3 (6.0) | 94.6 (3.7) | 81.7 (5.0) | 90.0 (4.6)
IncSVDDc | 90.4 (8.9) | 66.7 (7.5) | 72.7 (4.9) | 91.8 (5.9) | 66.7 (7.5) | 84.4 (3.2) | 95.8 (2.9) | 70.0 (7.1) | 85.4 (1.2) | 93.7 (3.6) | 55.0 (10.7) | 81.0 (2.7)
V-SVMd | 93.1 (6.0) | 63.0 (10.6) | – | 96.5 (2.3) | 81.9 (4.7) | – | 97.9 (1.5) | 88.9 (0.0) | – | 96.2 (2.3) | 83.3 (0.0) | –
NNf | 74.2 (9.3) | 38.3 (7.7) | 61.0 (4.7) | 89.5 (9.3) | 20.0 (6.7) | 70.0 (4.6) | 90.1 (6.6) | 11.1 (18.0) | 69.2 (3.8) | 92.8 (3.3) | 33.3 (0.0) | 75.1 (0.4)
MSTg | 89.4 (8.1) | 50.0 (0.0) | 62.7 (6.6) | 95.4 (5.6) | 61.7 (7.7) | 82.3 (5.9) | 96.6 (2.7) | 68.9 (4.5) | 83.6 (4.7) | 94.1 (2.8) | 55.0 (7.7) | 80.6 (2.3)
Density-based
Gaussian | 90.6 (7.1) | 60.0 (8.2) | 68.8 (8.4) | 95.4 (4.6) | 70.0 (6.7) | 85.3 (4.6) | 97.3 (2.5) | 80.0 (4.5) | 89.2 (3.3) | 95.5 (3.2) | 66.7 (0.0) | 84.5 (2.0)
MOGh | 88.1 (9.9) | 80.1 (17.3) | 67.8 (16.4) | 93.1 (7.1) | 75.8 (14.8) | 82.5 (10.1) | 95.6 (3.4) | 80.2 (7.5) | 86.0 (6.7) | 93.7 (3.9) | 68.7 (11.6) | 84.2 (5.7)
MCDi Gaussian | 89.0 (8.5) | 55.0 (7.7) | 66.4 (9.0) | 94.0 (4.6) | 68.3 (5.0) | 84.6 (6.3) | 97.0 (2.7) | 80.0 (4.5) | 89.9 (2.4) | 94.5 (3.2) | 65.0 (5.0) | 84.0 (3.2)
Parzen | 89.0 (9.2) | 70.0 (6.7) | 70.7 (5.9) | 94.6 (4.9) | 83.3 (0.0) | 87.9 (6.3) | 97.2 (2.4) | 88.9 (0.0) | 90.5 (5.9) | 95.2 (2.9) | 83.3 (0.0) | 88.9 (3.3)
Naïve Parzen | 90.1 (7.6) | 55.0 (10.7) | 65.0 (5.0) | 95.7 (3.9) | 76.7 (8.2) | 87.2 (3.5) | 98.3 (1.4) | 88.9 (0.0) | – | 96.8 (2.1) | 83.3 (0.0) | 90.7 (2.0)
K-NNj | 91.8 (6.9) | 50.0 (0.0) | 66.0 (2.0) | 95.6 (3.1) | 81.7 (5.0) | – | 97.9 (1.6) | 88.9 (0.0) | 93.5 (3.7) | 97.0 (2.2) | 83.3 (0.0) | –
LOFk | 88.5 (6.1) | 66.7 (7.5) | – | 97.0 (1.9) | 71.7 (7.7) | 86.1 (2.4) | 96.8 (2.8) | 78.9 (3.3) | 88.7 (2.8) | 92.6 (4.8) | 50.0 (0.0) | 79.3 (2.6)
Reconstruction-based
PCAl | 87.8 (11.9) | 50.0 (7.5) | 62.4 (8.5) | 93.5 (6.2) | 51.7 (5.0) | 78.2 (4.1) | 93.6 (4.7) | 60.0 (10.2) | 81.8 (4.4) | 91.3 (5.2) | 46.7 (6.7) | 78.7 (2.3)
Auto-encoder | 82.2 (12.0) | 57.9 (15.3) | 64.7 (12.0) | 88.2 (9.5) | 61.6 (14.0) | 81.4 (7.1) | 93.4 (5.7) | 74.4 (11.0) | 86.4 (5.9) | 88.4 (8.8) | 61.3 (14.3) | 82.7 (5.7)
SOMm | 86.9 (9.4) | 78.3 (13.3) | 66.7 (16.9) | 92.8 (7.3) | 64.2 (12.4) | 80.9 (7.0) | 95.8 (3.7) | 80.1 (6.3) | 86.9 (5.5) | 92.2 (4.1) | 76.5 (9.0) | 87.5 (4.5)
K-means | 91.8 (6.9) | 65.0 (9.0) | – | 96.0 (2.4) | 83.3 (0.0) | – | 97.6 (1.6) | 88.9 (0.0) | – | 96.2 (2.2) | 83.3 (0.0) | –
aAUC: area under the receiver operating characteristic curve.
bSVDD: support vector data description.
cIncSVDD: incremental support vector data description.
dV-SVM: one-class support vector machine.
eItalicized values indicate the top-performing models.
fNN: nearest neighbor.
gMST: minimum spanning tree.
hMOG: mixture of Gaussian.
iMCD: minimum covariance determinant.
jK-NN: K-nearest neighbor.
kLOF: local outlier factor.
lPCA: principal component analysis.
mSOM: self-organizing maps.
Average (SD) of the area under the receiver operating characteristic curve, specificity, and F1-score for the smoothed version of the data (2-day moving average filter) at different sample sizes. Outlier fraction=0.01.
Models | 1 month | 2 months | 3 months | 4 months
 | AUCa, mean (SD) | Specificity, mean (SD) | F1, mean (SD) | AUC, mean (SD) | Specificity, mean (SD) | F1, mean (SD) | AUC, mean (SD) | Specificity, mean (SD) | F1, mean (SD) | AUC, mean (SD) | Specificity, mean (SD) | F1, mean (SD)
Boundary and domain-based
SVDDb | 99.6 (1.3) | 100 (0.0) | 93.6 (15.2) | 100 (0.0) | 100 (0.0) | 94.8 (10.1) | 100 (0.0) | 100 (0.0) | 97.0 (4.1) | 100 (0.0) | 100 (0.0) | 96.9 (4.0)
IncSVDDc | 99.6 (1.3) | 100 (0.0) | 93.6 (15.2) | 100 (0.0) | 100 (0.0) | 97.1 (6.3) | 100 (0.0) | 100 (0.0) | 97.6 (4.1) | 100 (0.0) | 100 (0.0) | 98.3 (2.8)
V-SVMd | 100 (0.0) | 99.5 (2.9) | – | 100 (0.0) | 100 (0.0) | – | 100 (0.0) | 100 (0.0) | – | 100 (0.0) | 100 (0.0) | –
NNf | 98.1 (3.9) | 58.3 (15.4) | 72.3 (9.9) | 86.9 (12.5) | 16.7 (22.4) | 70.5 (5.3) | 88.1 (6.5) | 54.4 (22.5) | 80.0 (8.6) | 92.4 (5.3) | 8.3 (17.1) | 69.0 (4.8)
MSTg | 98.5 (2.4) | 85.0 (5.0) | 85.5 (2.1) | 99.7 (0.8) | 100 (0.0) | 97.1 (6.3) | 99.9 (0.4) | 97.8 (4.5) | 97.2 (4.0) | 99.7 (0.8) | 100 (0.0) | 97.0 (7.9)
Density-based
Gaussian | 100 (0.0) | 98.3 (5.0) | 92.1 (15.2) | 100 (0.0) | 100 (0.0) | 97.1 (6.3) | 99.8 (0.7) | 100 (0.0) | 97.6 (4.1) | 99.4 (1.7) | 100 (0.0) | 97.0 (7.9)
MOGh | 98.6 (3.2) | 99.8 (1.7) | 88.5 (16.8) | 99.6 (1.2) | 100 (0.0) | 92.2 (11.1) | 99.7 (0.7) | 99.8 (1.4) | 94.0 (10.3) | 99.3 (2.0) | 99.9 (1.2) | 94.4 (11.8)
MCDi Gaussian | 98.9 (2.2) | 91.7 (8.4) | – | 100 (0.0) | 100 (0.0) | – | 99.5 (1.1) | 96.7 (5.1) | 96.6 (5.9) | 99.4 (1.7) | 88.3 (7.7) | 92.0 (6.8)
Parzen | 99.6 (1.3) | 100 (0.0) | 87.7 (17.0) | 100 (0.0) | 100 (0.0) | 95.1 (8.0) | 100 (0.0) | 100 (0.0) | 94.6 (9.8) | 99.9 (0.4) | 100 (0.0) | 94.6 (12.3)
Naïve Parzen | 99.2 (2.5) | 100 (0.0) | 94.7 (11.1) | 100 (0.0) | 100 (0.0) | 93.8 (11.0) | 99.6 (1.1) | 100 (0.0) | 97.5 (5.0) | 100 (0.0) | 100 (0.0) | –
K-NNj | 98.1 (3.9) | 68.3 (5.0) | 75.2 (4.3) | 100 (0.0) | 100 (0.0) | – | 100 (0.0) | 100 (0.0) | – | 100 (0.0) | 100 (0.0) | –
LOFk | 98.6 (2.9) | 75.0 (13.5) | 80.2 (10.8) | 100 (0.0) | 100 (0.0) | – | 100 (0.0) | 100 (0.0) | 96.9 (5.0) | 99.7 (0.8) | 100 (0.0) | 97.4 (7.9)
Reconstruction-based
PCAl | 98.9 (2.2) | 85.0 (5.0) | – | 99.2 (1.3) | 85.0 (5.0) | – | 98.6 (1.9) | 88.9 (0.0) | 92.2 (6.0) | 97.8 (2.2) | 83.3 (0.0) | 89.1 (9.7)
Auto-encoder | 97.4 (6.0) | 89.1 (13.0) | 86.0 (14.2) | 98.5 (3.2) | 94.5 (9.6) | 91.8 (9.4) | 99.2 (2.4) | 93.7 (10.2) | 93.7 (8.3) | 98.6 (3.8) | 94.4 (9.5) | 93.7 (9.7)
SOMm | 99.3 (1.9) | 99.9 (1.2) | 84.7 (19.8) | 99.8 (0.7) | 100 (0.0) | 91.4 (9.6) | 99.9 (0.3) | 100 (0.0) | 95.2 (7.9) | 99.6 (1.3) | 100 (0.0) | 93.4 (12.1)
K-means | 99.2 (2.5) | 85.0 (11.7) | 87.0 (10.4) | 100 (0.0) | 100 (0.0) | 97.1 (6.3) | 100 (0.0) | 100 (0.0) | – | 100 (0.0) | 100 (0.0) | –
aAUC: area under the receiver operating characteristic curve.
bSVDD: support vector data description.
cIncSVDD: incremental support vector data description.
dV-SVM: one-class support vector machine.
eItalicized values indicate the top-performing models.
fNN: nearest neighbor.
gMST: minimum spanning tree.
hMOG: mixture of Gaussian.
iMCD: minimum covariance determinant.
jK-NN: K-nearest neighbor.
kLOF: local outlier factor.
lPCA: principal component analysis.
mSOM: self-organizing maps.
The boundary and domain-based method achieved better performance with a small sample size compared with the density and reconstruction-based methods. However, as the sample size increased, all the three groups achieved comparable performance. The detailed numerical values of comparison are given in
From the boundary and domain-based method, SVDD, MST, and incremental support vector data description (incSVDD) performed better with larger sample sizes, and V-SVM achieved a better description with 30 sample objects.
From the density-based method, all the models exhibited similar performance. Naïve Parzen and K-NN, with only 60 sample objects, achieved performance comparable with that at higher sample sizes.
From the reconstruction-based method, K-means achieved better performance for all sample sizes.
Smoothing the data significantly improved the performance of the model even with 30 objects, compared with the raw data (
From the boundary and domain-based method, the V-SVM achieved higher performance in all the sample sizes.
From the density-based method, LOF achieved better description with small sample objects, and K-NN produced better description with all the sample sizes. Gaussian families achieved improved and comparable performance with increased sample objects. Among them, K-NN with only 60 objects achieved comparable performance with larger sample objects.
Regarding the reconstruction-based method, K-means and SOM achieved better performance, whereas K-means performed better in all the sample sizes.
The boundary and domain-based method achieved better performance with a small sample size compared with the density and reconstruction-based methods. However, as the sample size increased, all the three groups produced comparable descriptions. The detailed numerical values of comparison are given in
From the boundary and domain-based method, SVDD, V-SVM, MST, and incSVDD performed better in all the cases, with MST achieving better performance.
From the density-based method, normal and MCD Gaussian achieved a better description of the data with 1-month sample objects. K-NN and LOF performed better with sample sizes larger than 1 month, and LOF performed best across all sample sizes. The LOF with only 60 objects achieved performance comparable with that at higher sample sizes.
From the reconstruction-based method, PCA produced better description for all sample sizes, whereas K-means and SOM achieved comparable performance with sample size larger than 1-month sample objects.
Smoothing the data allowed the models to generalize well and significantly improved the performance of the model even with 30 objects, compared with the raw data (
From the boundary and domain-based method, the V-SVM and MST achieved higher performance in all the sample sizes, whereas V-SVM outperformed all the models.
From the density-based method, the Gaussian families, LOF, and K-NN achieved better performance, whereas LOF achieved better performance in all sample sizes.
Regarding the reconstruction-based method, K-means and PCA achieved better performance, whereas PCA performed better in all the sample sizes.
The boundary and domain-based method achieved better performance with small sample sizes compared with the density and reconstruction-based methods. All the three groups improved with increasing sample size. The detailed numerical values of comparison are given in
From the boundary and domain-based method, SVDD, V-SVM, and incSVDD performed better for all the sample sizes.
From the density-based method, MCD Gaussian performed better with a 1-month sample size, and all the models produced comparable descriptions as the sample size increased, whereas the LOF performed better for all the sample sizes.
From the reconstruction-based method, PCA performed relatively better for all the sample sizes, and K-means and SOM achieved comparable performance with a larger sample size.
Smoothing the data significantly improved the model performance even with 30 objects compared with the raw data (
From the boundary and domain-based method, the V-SVM achieved higher performance in all the sample sizes. As the sample size increased, the incSVDD and MST achieved comparable performance.
From the density-based method, K-NN and LOF produced better descriptions with a 1-month sample size. K-NN performed better in almost all sample sizes.
From the reconstruction-based method, K-means achieved better performance for all sample sizes.
As can be seen in
The boundary and domain-based method achieved better performance compared with the density and reconstruction-based methods. As can be seen in
From the boundary and domain-based method, V-SVM achieved better description in all sample sizes, whereas SVDD, incSVDD, and V-SVM achieved comparable performance with a larger sample size.
From the density-based method, Gaussian families and naïve Parzen performed better at large sample sizes, whereas K-NN and LOF achieved better performance in all the sample sizes. K-NN outperformed all the models.
From the reconstruction-based method, K-means performed better in all the sample sizes, and all the other models performed better with larger sample sizes.
Average (SD) of the area under the receiver operating characteristic curve, specificity, and F1-score for the smoothed version of the data (48-hour moving average filter) at different sample sizes. Outlier fraction=0.01.
Models | 1 month | 2 months | 3 months | 4 months
 | AUCa, mean (SD) | Specificity, mean (SD) | F1, mean (SD) | AUC, mean (SD) | Specificity, mean (SD) | F1, mean (SD) | AUC, mean (SD) | Specificity, mean (SD) | F1, mean (SD) | AUC, mean (SD) | Specificity, mean (SD) | F1, mean (SD)
Boundary and domain-based
SVDDb | 97.6 (1.9) | 83.2 (3.4) | 85.8 (1.7) | 97.8 (1.2) | 85.7 (5.0) | 90.5 (9.6) | 97.7 (1.2) | 90.4 (5.1) | 94.2 (2.9) | 98.1 (0.9) | 91.0 (3.7) | 96.8 (0.9)
IncSVDDc | 97.4 (1.9) | 84.5 (2.8) | 86.8 (1.9) | 97.7 (1.2) | 86.7 (2.0) | 93.9 (1.0) | 97.5 (1.2) | 88.5 (1.5) | 96.0 (1.1) | 97.9 (0.9) | 88.9 (1.2) | –
V-SVMd | 98.1 (2.1) | 84.5 (1.1) | – | 99.0 (1.1) | 92.6 (0.0) | – | 99.5 (0.6) | 93.8 (0.5) | – | 99.4 (0.4) | 94.2 (0.0) | 97.1 (1.3)
NNf | 84.8 (6.0) | 75.9 (4.5) | 74.8 (6.0) | 89.3 (2.2) | 76.5 (4.1) | 87.1 (3.3) | 89.0 (4.0) | 77.5 (3.9) | 89.3 (4.4) | 90.2 (4.7) | 77.5 (3.8) | 91.4 (6.4)
MSTg | 90.5 (3.1) | 85.4 (3.9) | 67.6 (14.5) | 94.4 (2.0) | 85.7 (4.0) | 85.1 (7.0) | 94.7 (2.4) | 88.8 (3.5) | 87.8 (8.5) | 95.8 (2.2) | 88.8 (3.0) | 90.9 (5.9)
Density-based
Gaussian | 98.1 (2.2) | 79.8 (4.9) | 83.9 (2.7) | 99.5 (0.9) | 90.1 (1.7) | 95.2 (1.8) | 99.6 (0.7) | 92.9 (1.3) | 97.1 (2.5) | 99.5 (0.5) | 92.2 (1.0) | 97.7 (1.1)
MOGh | 95.8 (3.6) | 82.7 (4.3) | 83.7 (5.0) | 98.3 (1.5) | 86.2 (2.7) | 92.3 (2.7) | 98.7 (1.4) | 88.7 (4.6) | 94.7 (3.5) | 98.6 (1.6) | 88.2 (3.1) | 95.3 (3.2)
MCDi Gaussian | 98.6 (2.1) | 75.3 (6.9) | 81.3 (2.5) | 99.6 (0.9) | 89.6 (1.9) | 95.0 (1.8) | 99.6 (0.7) | 92.5 (1.8) | 97.0 (2.3) | 99.6 (0.4) | 92.0 (1.2) | 97.7 (1.1)
Parzen | 91.9 (2.9) | 93.6 (2.0) | 63.4 (16.5) | 96.2 (2.3) | 94.4 (2.0) | 81.6 (10.2) | 96.6 (2.6) | 94.8 (1.7) | 84.2 (9.5) | 97.4 (2.2) | 95.6 (1.2) | 87.9 (7.1)
Naïve Parzen | 94.8 (3.7) | 76.4 (5.6) | 77.6 (7.9) | 98.7 (1.2) | 85.2 (3.3) | 91.8 (2.9) | 99.1 (1.1) | 89.1 (3.8) | 94.8 (2.5) | 98.9 (0.9) | 89.7 (2.4) | 96.2 (1.6)
K-NNj | 97.1 (3.4) | 78.8 (2.0) | – | 99.1 (1.0) | 92.9 (0.7) | – | 99.6 (0.4) | 93.8 (0.7) | – | 99.5 (0.3) | 94.0 (0.6) | –
LOFk | 96.9 (3.5) | 78.3 (3.0) | 84.2 (2.4) | 99.2 (1.1) | 91.9 (0.9) | – | 99.6 (0.5) | 93.7 (0.8) | 97.3 (2.1) | 99.5 (0.4) | 93.1 (0.4) | 97.8 (1.2)
Reconstruction-based
PCAl | 97.1 (3.4) | 63.9 (8.8) | 75.4 (0.3) | 99.4 (1.2) | 76.4 (6.6) | 90.2 (1.1) | 99.1 (1.3) | 75.1 (6.8) | 92.4 (1.1) | 98.9 (1.2) | 69.1 (4.1) | 93.1 (0.8)
Auto-encoder | 92.0 (4.8) | 79.5 (7.6) | 78.9 (8.3) | 96.2 (2.6) | 83.1 (7.2) | 91.1 (3.9) | 96.3 (3.2) | 84.3 (7.7) | 92.7 (5.0) | 96.7 (3.0) | 84.0 (8.0) | 94.6 (4.4)
SOMm | 94.1 (2.3) | 82.2 (3.3) | 82.6 (4.9) | 95.6 (1.1) | 82.9 (3.1) | 91.6 (1.9) | 94.8 (2.3) | 83.4 (5.8) | 92.3 (4.1) | 95.5 (1.9) | 84.1 (3.8) | 94.3 (3.8)
K-means | 97.3 (3.2) | 80.9 (2.5) | – | 98.9 (1.1) | 92.6 (0.7) | – | 99.3 (0.6) | 92.9 (0.7) | – | 99.4 (0.4) | 94.1 (0.2) | –
aAUC: area under the receiver operating characteristic curve.
bSVDD: support vector data description.
cIncSVDD: incremental support vector data description.
dV-SVM: one-class support vector machine.
eItalicized values indicate the top-performing models.
fNN: nearest neighbor.
gMST: minimum spanning tree.
hMOG: mixture of Gaussian.
iMCD: minimum covariance determinant.
jK-NN: K-nearest neighbor.
kLOF: local outlier factor.
lPCA: principal component analysis.
mSOM: self-organizing maps.
The boundary and domain-based method and reconstruction-based method achieved better performance for all sample sizes compared with the density-based method. Specifically, the boundary and domain-based method achieved better generalization from the 1-month data set. The detailed numerical values of comparison are given in
From the boundary and domain-based method, V-SVM achieved better description for all the sample sizes, and SVDD, NN, and incSVDD improved with larger training sample size; however, V-SVM outperformed all the models for all the sample sizes.
From the density-based method, normal and MCD Gaussian performed better with the 1- and 2-month sample sizes, and models such as K-NN performed better on all the sample sizes, whereas naïve Parzen outperformed all the models with the 3- and 4-month data sets.
From the reconstruction-based method, K-means produced better description for all the sample sizes and the auto-encoder and SOM performed better with larger sample sizes.
Generally, in comparison, all the groups performed better at large training sample sizes; however, the boundary and domain-based method achieved better performance with small training sample sizes. It achieved comparable generalization from the 1-month data set. The detailed numerical values of comparison are given in
From the boundary and domain-based method, SVDD, NN, MST, incSVDD, and V-SVM achieved better performance at larger training sample sizes, whereas V-SVM outperformed all the models for all the sample sizes.
From the density-based method, the Gaussian families, K-NN, LOF, and naïve Parzen achieved better performance at larger training sample sizes, whereas K-NN and LOF outperformed all the models for all the sample sizes.
From the reconstruction-based method, K-means, PCA, auto-encoder, and SOM achieved better performance at larger training sample sizes, whereas PCA performed better for all sample sizes.
Generally, in comparison, all the groups performed better at large training sample sizes; however, the boundary and domain-based method achieved better performance with small training sample sizes, for example, the 1-month data set, from which it achieved comparable generalization. The detailed numerical values of comparison are given in
From the boundary and domain-based method, NN, incSVDD, and V-SVM achieved better performance at larger training sample sizes, whereas V-SVM outperformed all the models for all the sample sizes.
From the density-based method, Gaussian families, K-NN, LOF, and naïve Parzen achieved better performance at larger training sample sizes, whereas Gaussian families outperformed all the models for all the sample sizes.
From the reconstruction-based method, K-means, SOM, auto-encoder, and PCA achieved better performance at larger training sample sizes, whereas PCA performed better for all sample sizes.
The average performances of the models across all the infection cases for different sample sizes, levels of data granularity (hourly and daily), and nature of data (raw and smoothed) are shown in
Regarding the daily raw data set, as shown in
Average performance of each model across all the infection cases for the daily raw data set (without smoothing) at different sample sizes. Outlier fraction=0.01.
| Models | 1 month: AUCa, mean (SD) | 1 month: specificity | 1 month: F1 | 2 months: AUCa, mean (SD) | 2 months: specificity | 2 months: F1 | 3 months: AUCa, mean (SD) | 3 months: specificity | 3 months: F1 | 4 months: AUCa, mean (SD) | 4 months: specificity | 4 months: F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Boundary and domain-based methods | | | | | | | | | | | | |
| SVDDb | 87.1 (11) | 66.0 (13.5) | 74.8 (9.5) | 91.7 (7.3) | 61.7 (10.6) | 84.1 (5.5) | 93.3 (4.6) | 67.3 (10.5) | 86.2 (4.4) | 91.4 (4.3) | 61.7 (10.6) | — |
| IncSVDDd | 85.2 (11) | 63.0 (4.6) | 74.7 (10.4) | 90.5 (8.5) | 57.9 (11) | — | 92.8 (5.1) | 62.8 (10.9) | 84.9 (3.2) | 90.8 (4.4) | 55.0 (11.7) | 83.5 (3.7) |
| V-SVMe | 91.5 (8.0) | 55.7 (7.0) | — | 92.2 (5.1) | 60.6 (5.0) | 82.8 (4.5) | 94.2 (3.8) | 66.9 (6.1) | — | 93.8 (4.1) | 63.1 (11.9) | 84.5 (5.1) |
| NNf | 73.4 (12) | 31.3 (6.5) | 65.0 (5.4) | 72.1 (11.9) | 25.0 (9.6) | 75.7 (3.7) | 70.8 (11.2) | 8.6 (17.6) | 72.0 (4.7) | 70.0 (9.0) | 16.0 (14.4) | 75.7 (3.4) |
| MSTg | 82.4 (8.7) | 52.1 (0.0) | 71.2 (6.1) | 82.6 (9.1) | 50.4 (9.0) | 82.0 (5.1) | 84.0 (6.3) | 56.2 (9.3) | 82.9 (3.5) | 84.2 (6.6) | 50.0 (11.4) | 82.6 (2.7) |
| Density-based methods | | | | | | | | | | | | |
| Gaussian | 91.5 (9.9) | 56.9 (7.7) | 72.9 (7.8) | 93.6 (6.1) | 58.8 (10.9) | 84.0 (4.0) | 95.1 (4.3) | 65.3 (10.6) | 86.3 (3.2) | 95.0 (3.5) | 57.9 (10.3) | 84.6 (3.2) |
| MOGh | 89.9 (12) | 69.2 (11.9) | 71.3 (14.3) | 91.7 (6.1) | 64.1 (14.0) | 83.8 (6.8) | 94.0 (4.4) | 67.0 (11.4) | 85.0 (5.6) | 94.5 (3.7) | 61.6 (12.6) | 84.9 (5.1) |
| MCDi Gaussian | 90.8 (9.1) | 54.0 (5.5) | — | 93.1 (6.0) | 58.0 (8.1) | 84.1 (4.3) | 95.3 (4.2) | 65.3 (10.6) | 86.4 (3.0) | 94.8 (3.5) | 57.9 (10.6) | 84.9 (3.0) |
| Parzen | 89.7 (10) | 59.6 (8.3) | 70.6 (9.4) | 91.7 (6.5) | 62.1 (10.3) | 83.9 (5.3) | 93.9 (5.0) | 68.7 (11.2) | 85.6 (5.4) | 94.3 (3.8) | 66.1 (12.7) | 86.1 (3.8) |
| Naïve Parzen | 88.1 (8.7) | 54.2 (6.5) | 69.1 (9.6) | 90.2 (7.1) | 60.4 (11.2) | 83.7 (4.9) | 91.9 (5.5) | 66.5 (12.8) | 86.6 (4.4) | 92.8 (4.7) | 64.6 (10.0) | — |
| K-NNj | 91.1 (7.8) | 52.9 (5.1) | 71.6 (7.9) | 91.6 (5.0) | 61.1 (11.3) | — | 94.8 (4.8) | 66.9 (11.2) | — | 95.0 (3.8) | 62.1 (10.3) | — |
| LOFk | 89.2 (8.9) | 56.3 (3.9) | 73.0 (8.6) | 92.4 (6.0) | 59.2 (11.1) | 84.9 (2.8) | 94.0 (4.8) | 64.4 (11.4) | 86.2 (2.8) | 93.7 (4.3) | 53.8 (10.3) | 83.8 (2.5) |
| Reconstruction-based methods | | | | | | | | | | | | |
| PCAl | 87.6 (8.8) | 58.8 (4.6) | 73.7 (8.3) | 90.2 (6.4) | 55.0 (6.8) | 82.7 (4.5) | 91.4 (4.9) | 59.7 (6.2) | 84.1 (3.2) | 90.5 (4.5) | 53.8 (7.2) | 83.6 (2.9) |
| Auto-encoder | 83.6 (14) | 58.3 (17.7) | 71.0 (12.5) | 84.6 (12.5) | 53.1 (20.0) | 82.1 (7.0) | 88.4 (10.0) | 57.7 (21.5) | 83.3 (6.8) | 88.5 (10.6) | 52.3 (21.0) | 83.2 (5.8) |
| SOMm | 85.6 (12) | 63.4 (10.3) | 72.7 (11.7) | 87.6 (7.2) | 57.1 (10.2) | 81.6 (5.8) | 93.5 (5.4) | 64.4 (8.5) | 84.8 (4.0) | 94.7 (4.0) | 59.0 (5.8) | 85.0 (3.1) |
| K-means | 94.2 (7.6) | 57.2 (7.6) | — | 93.7 (6.2) | 62.2 (10.5) | — | 96.0 (4.4) | 67.6 (10.3) | — | 95.8 (3.9) | 62.1 (10.3) | — |
aAUC: area under the receiver operating characteristic curve.
bSVDD: support vector data description.
cItalicized values indicate the top-performing models.
dIncSVDD: incremental support vector data description.
eV-SVM: one-class support vector machine.
fNN: nearest neighbor.
gMST: minimum spanning tree.
hMOG: mixture of Gaussian.
iMCD: minimum covariance determinant.
jK-NN: K-nearest neighbor.
kLOF: local outlier factor.
lPCA: principal component analysis.
mSOM: self-organizing maps.
Regarding the daily smoothed data set, as shown in
Average performance of each model across all the infection cases for the daily smoothed data set (with filter) and different sample sizes. Fraction=0.01.
| Models | 1 month: AUCa, mean (SD) | 1 month: specificity | 1 month: F1 | 2 months: AUCa, mean (SD) | 2 months: specificity | 2 months: F1 | 3 months: AUCa, mean (SD) | 3 months: specificity | 3 months: F1 | 4 months: AUCa, mean (SD) | 4 months: specificity | 4 months: F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Boundary and domain-based methods | | | | | | | | | | | | |
| SVDDb | 99.9 (0.7) | 100 (0.0) | 94.1 (14.2) | 100 (0.0) | 100 (0.0) | 96.1 (7.6) | 100 (0.0) | 100 (0.0) | 96.5 (6.5) | 100 (0.0) | 100 (0.0) | 97.9 (3.9) |
| IncSVDDc | 99.9 (0.7) | 100 (0.0) | 94.1 (14.2) | 100 (0.0) | 100 (0.0) | 96.9 (6.5) | 100 (0.0) | 100 (0.0) | 97.3 (5.9) | 100 (0.0) | 100 (0.0) | 98.6 (2.9) |
| V-SVMd | 100 (0.0) | 100 (0.0) | — | 100 (0.0) | 100 (0.0) | — | 100 (0.0) | 100 (0.0) | — | 100 (0.0) | 100 (0.0) | — |
| NNf | 90.1 (14.5) | 40.0 (30.5) | 69.5 (13.2) | 88.9 (9.9) | 33.1 (22.6) | 78.4 (6.8) | 89.2 (7.9) | 33.6 (14.6) | 77.7 (5.3) | 90.5 (6.8) | 23.5 (18.6) | 77.1 (5.7) |
| MSTg | 98.9 (3.6) | 85 (6.1) | 86.7 (9.4) | 99.8 (0.7) | 96.7 (3.4) | 95.1 (6.2) | 99.9 (0.2) | 98.9 (4.1) | 98.0 (3.5) | 99.9 (0.5) | 100 (0.0) | 98.0 (5.4) |
| Density-based methods | | | | | | | | | | | | |
| Gaussian | 99.2 (5.1) | 92.6 (9.0) | 87.2 (15.2) | 99.5 (2.5) | 96.7 (7.5) | 94.8 (10.4) | 99.9 (0.4) | 100 (0.0) | 98.1 (4.9) | 99.8 (0.8) | 100 (0.0) | 98.3 (5.9) |
| MOGh | 98.8 (5.4) | 92.9 (8.6) | 85.2 (17.1) | 99.4 (2.6) | 97.0 (5.4) | 92.1 (11.6) | 99.9 (0.4) | 99.9 (0.7) | 95.4 (7.8) | 99.8 (1.0) | 99.9 (0.6) | 96.4 (7.7) |
| MCDi Gaussian | 98.4 (5.6) | 86.6 (8.8) | 86.6 (11.9) | 99.3 (2.7) | 90.0 (8.7) | 93.4 (8.1) | 99.8 (0.5) | 99.2 (2.6) | 98.0 (5.3) | 99.8 (0.9) | 97.1 (3.9) | 97.0 (5.5) |
| Parzen | 99.2 (3.5) | 100 (0.0) | 90.8 (16.4) | 99.9 (0.4) | 100 (0.0) | 93.7 (9.8) | 100 (0.0) | 100 (0.0) | 93.6 (8.9) | 99.9 (0.3) | 100 (0.0) | 95.8 (8.2) |
| Naïve Parzen | 99.8 (1.2) | 100 (0.0) | 94.4 (14.6) | 100 (0.0) | 100 (0.0) | 96.1 (7.9) | 99.9 (0.5) | 100 (0.0) | 97.4 (5.6) | 100 (0.0) | 100 (0.0) | 98.2 (4.2) |
| K-NNj | 99.5 (2.0) | 91.6 (3.6) | — | 99.9 (0.4) | 100 (0.0) | — | 100 (0.0) | 100 (0.0) | — | 100 (0.0) | 100 (0.0) | — |
| LOFk | 99.6 (1.5) | 93.3 (7.3) | 92.4 (10.6) | 99.9 (0.5) | 99.2 (3.4) | 97.1 (7.3) | 99.9 (0.2) | 98.6 (2.8) | 97.4 (4.5) | 99.9 (0.4) | 100 (0.0) | 98.2 (5.9) |
| Reconstruction-based methods | | | | | | | | | | | | |
| PCAl | 93.8 (6.7) | 82.0 (7.3) | 83.8 (10.4) | 91.3 (4.3) | 77.9 (7.3) | 89.3 (8.7) | 88.7 (5.9) | 76.3 (8.6) | 89.5 (5.3) | 90.7 (3.6) | 76.2 (8.6) | 89.0 (6.9) |
| Auto-encoder | 97.0 (8.1) | 91.6 (14.6) | 87.7 (16.0) | 98.1 (5.4) | 92.6 (15.3) | 92.0 (10.7) | 98.6 (4.6) | 92.8 (14.8) | 94.0 (8.3) | 98.7 (4.0) | 92.7 (15.8) | 94.9 (7.7) |
| SOMm | 99.1 (3.2) | 99.9 (0.6) | 85.2 (20.5) | 99.8 (0.7) | 100 (0.0) | 88.9 (16.1) | 99.9 (0.2) | 100 (0.0) | 94.6 (8.0) | 99.8 (0.6) | 100 (0.0) | 95.9 (8.1) |
| K-means | 99.8 (1.2) | 96.2 (6.0) | — | 100 (0.0) | 100 (0.0) | — | 100 (0.0) | 100 (0.0) | — | 100 (0.0) | 100 (0.0) | — |
aAUC: area under the receiver operating characteristic curve.
bSVDD: support vector data description.
cIncSVDD: incremental support vector data description.
dV-SVM: one-class support vector machine.
eItalicized values indicate the top-performing models.
fNN: nearest neighbor.
gMST: minimum spanning tree.
hMOG: mixture of Gaussian.
iMCD: minimum covariance determinant.
jK-NN: K-nearest neighbor.
kLOF: local outlier factor.
lPCA: principal component analysis.
mSOM: self-organizing maps.
Regarding the hourly smoothed data set, as shown in
Average performance of each model across all the infection cases for the hourly data set with smoothing and different sample sizes. Fraction=0.01.
| Models | 1 month: AUCa, mean (SD) | 1 month: specificity | 1 month: F1 | 2 months: AUCa, mean (SD) | 2 months: specificity | 2 months: F1 | 3 months: AUCa, mean (SD) | 3 months: specificity | 3 months: F1 | 4 months: AUCa, mean (SD) | 4 months: specificity | 4 months: F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Boundary and domain-based methods | | | | | | | | | | | | |
| SVDDb | 97.4 (2.9) | 89.0 (3.4) | 89.4 (7.1) | 97.4 (1.8) | 86.7 (4.4) | 91.5 (10.9) | 97.2 (2.6) | 80.1 (5.5) | 93.5 (3.4) | 97.6 (1.7) | 81.8 (5.3) | 94.6 (6.0) |
| IncSVDDc | 97.1 (2.9) | 87.7 (2.7) | 89.5 (5.9) | 97.2 (1.8) | 86.4 (2.8) | 93.6 (4.8) | 97.0 (2.7) | 76.2 (6.3) | 93.2 (2.6) | 97.4 (1.7) | 79.0 (4.8) | — |
| V-SVMe | 98.1 (2.0) | 85.5 (0.6) | — | 98.9 (1.4) | 89.8 (0.2) | — | 98.7 (1.4) | 86.4 (0.4) | — | 99.0 (0.9) | 89.2 (0.3) | — |
| NNf | 93.2 (7.8) | 92.0 (2.4) | 83.9 (12.0) | 94.4 (2.5) | 88.4 (3.4) | 90.9 (5.3) | 93.3 (2.8) | 83.0 (3.7) | 92.0 (4.2) | 94.0 (2.8) | 82.9 (3.6) | 94.0 (4.0) |
| MSTg | 96.1 (2.6) | 94.4 (2.2) | 72.9 (18.5) | 97.3 (1.4) | 94.2 (2.1) | 86.1 (11.0) | 96.1 (2.1) | 93.5 (1.9) | 90.2 (7.3) | 97.0 (1.4) | 93.6 (1.7) | 92.6 (5.0) |
| Density-based methods | | | | | | | | | | | | |
| Gaussian | 98.4 (1.6) | 91.2 (2.6) | 89.6 (12.5) | 99.3 (0.9) | 92.3 (1.7) | 95.7 (4.9) | 98.8 (1.3) | 88.1 (4.0) | 95.9 (2.7) | 99.2 (0.7) | 89.8 (3.1) | — |
| MOGh | 97.5 (3.0) | 91.7 (3.2) | 87.8 (13.3) | 98.9 (1.2) | 90.9 (2.7) | 94.0 (6.3) | 98.2 (2.0) | 85.4 (6.6) | 94.2 (4.1) | 98.5 (1.5) | 88.0 (4.9) | 96.0 (3.1) |
| MCDi Gaussian | 98.5 (1.5) | 89.9 (3.7) | — | 99.5 (0.9) | 92.2 (92.2) | — | 98.9 (1.1) | 87.9 (3.3) | — | 99.2 (0.7) | 90.4 (3.4) | — |
| Parzen | 96.4 (2.6) | 97.8 (1.1) | 59.9 (18.9) | 98.0 (1.6) | 97.7 (1.1) | 79.5 (14.5) | 97.2 (2.3) | 96.4 (1.2) | 85.1 (10) | 98.1 (1.6) | 96.7 (1.1) | 88.6 (7.1) |
| Naïve Parzen | 96.4 (3.0) | 87.5 (3.5) | 85.1 (10.9) | 98.7 (1.5) | 89.2 (2.8) | 92.8 (7.5) | 96.0 (2.3) | 90.8 (2.6) | 95.0 (4.1) | 98.2 (1.6) | 90.0 (1.8) | 96.2 (2.8) |
| K-NNj | 97.6 (2.9) | 91.1 (1.6) | 87.6 (13.6) | 99.0 (1.4) | 92.4 (2.4) | 94.5 (6.6) | 98.4 (1.4) | 92.6 (1.4) | 95.7 (4.8) | 98.7 (1.1) | 93.3 (1.3) | — |
| LOFk | 96.9 (2.9) | 91.2 (1.6) | 86.2 (13.0) | 97.4 (1.8) | 89.8 (4.8) | 93.1 (4.9) | 95.0 (3.0) | 85.2 (4.6) | 92.9 (4.8) | 95.8 (1.7) | 85.3 (4.7) | 94.7 (3.2) |
| Reconstruction-based methods | | | | | | | | | | | | |
| PCAl | 97.4 (3.2) | 78.2 (6.1) | 82.5 (10.9) | 94.8 (3.8) | 77.6 (4.5) | 90.9 (3.6) | 92.6 (4.2) | 72.4 (3.8) | 92.5 (1.9) | 93.4 (3.2) | 71.1 (2.5) | 93.9 (1.1) |
| Auto-encoder | 95.4 (5.3) | 88.7 (9.5) | 86.1 (13.1) | 96.9 (3.2) | 87.1 (9.9) | 92.8 (6.4) | 95.0 (5.3) | 79.3 (14.5) | 93.1 (4.8) | 95.9 (4.3) | 80.3 (14.4) | 95.0 (3.6) |
| SOMm | 95.9 (2.9) | 91.6 (2.6) | 86.1 (14.4) | 95.7 (1.7) | 87.6 (4.1) | 92.7 (5.7) | 93.9 (3.5) | 79.1 (10.9) | 92.3 (4.5) | 96.0 (2.5) | 87.5 (7.0) | 96.1 (3.2) |
| K-means | 97.1 (3.9) | 89.7 (6.7) | — | 98.6 (1.7) | 91.1 (4.2) | — | 98.5 (1.5) | 92.3 (2.9) | — | 98.9 (1.0) | 93.9 (1.3) | — |
aAUC: area under the receiver operating characteristic curve.
bSVDD: support vector data description.
cIncSVDD: incremental support vector data description.
dItalicized values indicate the top-performing models.
eV-SVM: one-class support vector machine.
fNN: nearest neighbor.
gMST: minimum spanning tree.
hMOG: mixture of Gaussian.
iMCD: minimum covariance determinant.
jK-NN: K-nearest neighbor.
kLOF: local outlier factor.
lPCA: principal component analysis.
mSOM: self-organizing maps.
Two density-based unsupervised models were tested and evaluated on the same set of data as used in the one-class classifiers: LOF and COF. The average AUC, specificity, and F1-score were computed after 20 runs. The best performing thresholds for all the infection cases along with the optimal value of
Average area under the receiver operating characteristic curve, specificity, and F1-score for both with and without smoothed versions of the data. The parameters kd and kh represent the optimal number of nearest neighbors for the daily and hourly cases, respectively.
Density-based methods; all infection cases use kd=30 and kh=240.

| Pre-pro | Models (threshold) | 1st case: AUCa | 1st case: specificity | 1st case: F1 | 2nd case: AUCa | 2nd case: specificity | 2nd case: F1 | 3rd case: AUCa | 3rd case: specificity | 3rd case: F1 | 4th case: AUCa | 4th case: specificity | 4th case: F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Daily | | | | | | | | | | | | | |
| Without filter | LOFb (T1=2.4, T2=1.2, T3=1.45, T4=1.8)c | 75.0 | 50.0 | — | 90.0 | 100 | 67.4 | 92.1 | 66.7 | — | 98.2 | 100 | — |
| | COFd (T1=1.4, T2=1.3, T3=1.4, T4=1.4) | 82.1 | 66.7 | 72.6 | 97.4 | 100 | — | 75.2 | 66.7 | 67.6 | 96.7 | 100 | 71.8 |
| With filter | LOFb (T1=1.7, T2=1.6, T3=1.95, T4=2.2) | 99.0 | 100 | — | 99.2 | 100 | — | 100 | 100 | — | 99.9 | 100 | 94.7 |
| | COFd | 97.6 | 100 | 76.6 | 97.9 | 100 | 77.6 | 99.5 | 100 | 88.8 | 100 | 100 | — |
| Hourly | | | | | | | | | | | | | |
| | LOFb (T1=1.4, T2=1.3, T3=1.35, T4=1.5) | 98.0 | 86.0 | — | 95.5 | 100 | — | 94.3 | 91.4 | — | 85.2 | 72.6 | — |
| | COFd (T1=1.2, T2=1.1, T3=, T4=1.1) | 92.4 | 88.4 | — | 77.0 | 66.0 | 62.5 | 90.3 | 82.7 | 74.6 | 82.6 | 82.2 | 63.7 |
aAUC: area under the receiver operating characteristic curve.
bLOF: local outlier factor.
cTk: threshold for the kth case of infection.
dCOF: connectivity-based outlier factor.
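The unsupervised LOF scheme evaluated above (score every sample against the whole data set, then flag samples whose factor exceeds a threshold T) can be sketched as follows; the data, k, and T here are placeholders, not the optimal values reported in the table.

```python
# Unsupervised LOF: compute each sample's local outlier factor over the
# full data set and flag samples whose factor exceeds a threshold T.
# Data, k, and T are illustrative only.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
normal = rng.normal(0.0, 1.0, size=(200, 2))  # bulk of the data
outliers = rng.normal(6.0, 0.5, size=(5, 2))  # a few far-away samples
X = np.vstack([normal, outliers])

k = 30                                   # number of nearest neighbors
lof = LocalOutlierFactor(n_neighbors=k)
lof.fit(X)                               # unsupervised: fit on the whole set
factors = -lof.negative_outlier_factor_  # LOF >> 1 means locally sparser

T = 1.5                                  # decision threshold on the factor
flagged = np.where(factors > T)[0]
print(flagged)
```

Unlike the one-class setup, no separate training phase on target-only data is needed: the factor is computed over the entire set, which is why rare but normal events can be misflagged.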
Computational time is the amount of time a particular model needs to learn and execute a given task [
Plot of models’ average computational time for the training phase. The x-axis depicts the sample size, and each label stands for total sample size divided by 24. The y-axis depicts the computational time required by each model. Gauss: Gaussian; IncSVDD: incremental support vector data description; K-NN: K-nearest neighbor; LOF: local outlier factor; MCD: minimum covariance determinant; MOG: mixture of Gaussian; MST: minimum spanning tree; NN: nearest neighbor; NParzen: naïve Parzen; PCA: principal component analysis; SOM: self-organizing maps; SVDD: support vector data description; V-SVM: one-class support vector machine.
Plot of models’ average computational time for the testing phase. The x-axis depicts the sample size, and each label stands for total sample size divided by 24. The y-axis depicts the computational time required by each model. Gauss: Gaussian; IncSVDD: incremental support vector data description; K-NN: K-nearest neighbor; LOF: local outlier factor; MCD: minimum covariance determinant; MOG: mixture of Gaussian; MST: minimum spanning tree; NN: nearest neighbor; NParzen: naïve Parzen; PCA: principal component analysis; SOM: self-organizing maps; SVDD: support vector data description; V-SVM: one-class support vector machine.
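One way such training and testing times can be measured is with repeated wall-clock timing per phase; the model, data, and number of runs below are placeholders, not the benchmarked setup.

```python
# Wall-clock timing of the training and testing phases, averaged over
# several runs; the model and data are placeholders.
import time
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 2))
X_test = rng.normal(size=(100, 2))

train_times, test_times = [], []
for _ in range(5):
    t0 = time.perf_counter()
    model = OneClassSVM(nu=0.01).fit(X_train)   # training phase
    train_times.append(time.perf_counter() - t0)

    t0 = time.perf_counter()
    model.decision_function(X_test)             # testing phase
    test_times.append(time.perf_counter() - t0)

print(f"train: {np.mean(train_times):.4f} s (SD {np.std(train_times):.4f})")
print(f"test:  {np.mean(test_times):.4f} s (SD {np.std(test_times):.4f})")
```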
Anomaly or novelty detection has been widely used in various applications, including machine fault and sensor failure detection, prevention of credit card or identity fraud, health and medical diagnostics and monitoring, cyber-intrusion detection, and others [
There are no well-defined boundaries regarding how different pathogens affect various key parameters of blood glucose dynamics, including blood glucose levels, insulin injections, carbohydrate ingestions, physical activity or exercise load, and others. This results in poor boundary demarcation between the normal and abnormal classes.
Class boundaries defined for a single pathogen might not work for the other pathogens because the effect of different pathogens on the blood glucose dynamics could be different.
It is expensive and time-consuming to collect infection-related data to explore and characterize pathogen-specific class boundaries. This results in ill-defined class boundaries even for an infection caused by a single pathogen.
The degree of effect of the same pathogens on the blood glucose dynamics could differ between different individuals because of the difference in individual immunity, which further complicates the characterization task.
Lack of sufficient sample size for both the abnormal and the normal classes results in poor training and testing data sample size or imbalanced class problems.
Given these challenges, the best possible approach is to identify methods that can learn the normal health state of an individual and classify abnormalities relying on the boundaries learned from that normal state, that is, a one-class classifier approach. This reduces the challenge because it only requires characterizing what is believed to be the normal health state. For instance, assume a health diagnostic and monitoring system that detects health changes in an individual by tracking the individual’s physiological parameters, where the current health status is examined based on a set of parameters, and that raises a notification alarm when the individual’s health deteriorates [
Carbohydrate action: a situation in which the ratio of insulin-to-carbohydrate is small and the blood glucose levels are high (hyperglycemia),
Physical activity action: despite a small ratio of insulin-to-carbohydrate, the blood glucose levels still drop to low levels (hypoglycemia),
Insulin action: the ratio of insulin-to-carbohydrate is large, that is, high insulin intake and low carbohydrate consumption, and blood glucose levels are low (hypoglycemia),
Abnormality because of metabolic change, such as infection and stress: blood glucose levels remain high (hyperglycemia) despite a large insulin-to-carbohydrate ratio.
Quadrants of wellness in people with type 1 diabetes. The figure depicts the 4 possible scenarios of different parameters: carbohydrate action, insulin action, physical activity action, and abnormality because of metabolic change such as infection and stress. BG: blood glucose; PA: physical activity.
The drawback of unsupervised methods is that they have no mechanism to handle rare events, even when those events are normal, because they define an anomaly on the basis of the entire data set. One-class classifiers, in contrast, can learn and handle such scenarios appropriately if they are presented during the training phase, because they produce a reference description based on the available normal (target) data set, including the rare events. Among the one-class classifiers, the boundary and domain-based method achieved a better description of the data set compared with the density- and reconstruction-based methods, mainly because of the ability of such models to handle the atypical nature of the data [
Selecting the proper model for implementation in a real-world setting requires considering different characteristics of the model, including performance with a limited training sample size, robustness to outliers in the training data, required training and testing time, and complexity of the model (in terms of the number of model parameters).
The sample size, N, is the number of sample objects used during the training phase and highly affects the generalization power of the model [
Average performance (F1-score) of each model across all the infection cases. AE: auto-encoder; Gauss: Gaussian; IncSVDD: incremental support vector data description; K-NN: K-nearest neighbor; LOF: local outlier factor; MCD: minimum covariance determinant; MOG: mixture of Gaussian; MST: minimum spanning tree; NN: nearest neighbor; NP: naïve Parzen; PCA: principal component analysis; SOM: self-organizing maps; SVDD: support vector data description; V-SVM: one-class support vector machine.
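The effect of the training sample size can be probed with a simple learning curve; the sketch below trains scikit-learn's EllipticEnvelope (an MCD-based Gaussian model) on increasing amounts of synthetic "regular day" data and tracks test AUC. The feature values and sizes are illustrative assumptions, not the study's data.

```python
# Learning curve over the training sample size N: fit an MCD-based
# Gaussian model on increasing amounts of "regular" data and track the
# test AUC. Sizes stand in for roughly 1 to 4 months of daily data;
# all values are synthetic.
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
test_normal = rng.normal([7.0, 0.10], [0.8, 0.02], size=(60, 2))
test_anom = rng.normal([11.0, 0.18], [1.0, 0.03], size=(10, 2))
X_test = np.vstack([test_normal, test_anom])
y_test = np.r_[np.zeros(60), np.ones(10)]  # 1 = anomalous day

aucs = {}
for n in (30, 60, 90, 120):  # roughly 1-4 months of daily samples
    X_train = rng.normal([7.0, 0.10], [0.8, 0.02], size=(n, 2))
    model = EllipticEnvelope(contamination=0.01, random_state=0).fit(X_train)
    scores = -model.decision_function(X_test)  # higher = more anomalous
    aucs[n] = roc_auc_score(y_test, scores)
print(aucs)
```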
For real-time applications, the time a model takes to learn and classify the sample object is essential in model selection.
Rough estimation of average training and testing time required by the different classifiers.
| Methods | Training time, mean (SD) | Testing time, mean (SD) |
|---|---|---|
| One-class classifiers | | |
| SVDDa | 105.2 (2.03) | 0.008 (0.002) |
| IncSVDDb | 0.05 (0.16) | 2.41 (0.83) |
| K-means | 0.0047 (0.0014) | 0.0032 (0.0010) |
| Gaussian | 0.0055 (0.0032) | 0.0032 (0.0012) |
| MOGc | 0.076 (0.018) | 0.0036 (0.0011) |
| MCDd Gaussian | 0.27 (0.075) | 0.0034 (0.0015) |
| SOMe | 21.62 (5.91) | 0.0033 (0.00087) |
| K-NNf | 0.51 (0.11) | 0.52 (0.12) |
| Parzen | 2.02 (0.41) | 0.21 (0.052) |
| Naïve Parzen | 4.02 (0.82) | 0.40 (0.10) |
| LOFg | 1.15 (0.28) | 1198.05 (323.07) |
| NNh | 151.34 (22.52) | 0.18 (0.024) |
| MSTi | 2.39 (0.31) | 1.24 (0.19) |
| PCAj | 0.046 (0.20) | 0.0031 (0.00086) |
| Auto-encoder | 0.65 (0.094) | 0.017 (0.0034) |
| V-SVMk | 0.32 (0.024) | 0.035 (0.0066) |
| Unsupervised methods | | |
| LOFl | N/Am | 0.2 (0.0) |
| COFn | N/A | 82.8 (1.5) |
aSVDD: support vector data description.
bIncSVDD: incremental support vector data description.
cMOG: mixture of Gaussian.
dMCD: minimum covariance determinant.
eSOM: self-organizing maps.
fK-NN: K-nearest neighbor.
gLOF: local outlier factor.
hNN: nearest neighbor.
iMST: minimum spanning tree.
jPCA: principal component analysis.
kV-SVM: one-class support vector machine.
lLOF: local outlier factor.
mN/A: not applicable.
nCOF: connectivity-based outlier factor.
The presence of outliers in the training data set could significantly affect the model’s generalization ability. Outlier objects are samples that exhibit different characteristics compared with the rest of the objects in the data set [
The parameters of a model can be either free or user defined. These two parameter types provide insight into how flexible the model is, how sensitive it is to overtraining, and how easy it is to configure (simplicity) [
For a real-world application, apart from the performance of the model, it is important to consider two aspects of the data set: the time window of detection (data granularity) and the required sample size. The time window or data granularity, that is, hourly or daily, defines how frequently one needs to carry out the computation throughout the day to screen the health status of the individual with type 1 diabetes. In an hourly time window, the computation is carried out at the end of each hour throughout the day, whereas in a daily time window, one aggregate computation is carried out at the end of the day. Decreasing the time window (increasing the granularity of the data) enhances early detection, however at the cost of accuracy, for example, more unwanted features (noise) in the data. The results demonstrated that almost all the models produced fairly comparable detection performance in both time windows. Moreover, the required sample size determines how much data an individual with type 1 diabetes needs to collect in advance before joining such an infection detection system. Models that generalize well with small sample sizes are preferable in a real-world application because they enable more people to join the system with ease. Generally, the results demonstrated that the models require a sample size of at least 3 months of data for the daily case and 2 months for the hourly case to perform well. Automating the detection of infection incidences among people with type 1 diabetes can deliver a means to provide personalized decision support and learning platforms for the individuals and, at the same time, can be used to detect infectious disease outbreaks on a large scale through spatio-temporal cluster detection [
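The two time windows can be obtained from the same raw records by resampling; the sketch below contrasts hourly and daily aggregation with pandas, with an illustrative column name and synthetic data.

```python
# The same raw records aggregated on two time windows: hourly (one
# computation per hour) versus daily (one aggregate computation per day).
# Column name and data are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
idx = pd.date_range("2020-01-01", periods=2 * 24 * 12, freq="5min")  # 2 days
records = pd.DataFrame({"glucose": rng.normal(7.0, 1.2, size=len(idx))},
                       index=idx)

hourly = records.resample("1h").mean()  # 48 rows: screen every hour
daily = records.resample("1D").mean()   # 2 rows: screen once per day
print(len(hourly), len(daily))
```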
A personalized decision support system and learning platform relies on an individual’s self-recorded data to provide relevant information in relation to decision making to assist the individuals during crises [
A population-based early outbreak detection system relies on self-recorded information from individuals with type 1 diabetes to detect individuals’ infection cases and, thereby, detect a group of infected individuals on a spatio-temporal basis. Such a system should collect individuals’ self-recorded data to a central server, analyze the data on a timely basis, identify and locate a cluster of people based on space and time, and notify the responsible bodies if there is an ongoing outbreak [
Anomaly or novelty detection has been widely used in various applications, including machine fault and sensor failure detection, prevention of credit card or identity fraud, health and medical diagnostics and monitoring, cyber-intrusion detection, and others. In this study, we demonstrated the applicability of one-class classifiers and unsupervised anomaly detection methods for detecting infection incidences in people with type 1 diabetes. In general, the proposed methods produced excellent performance in describing the data set, and the boundary and domain-based method in particular performed better. Among the individual models, V-SVM, K-NN, and K-means achieved better generalization in describing the data set across all infection cases. Detecting the incidence of infection in people with type 1 diabetes can provide an opportunity to devise tailored services, that is, personalized decision support and a learning platform for the individuals, and can simultaneously be used for detecting potential public health threats, that is, infectious disease outbreaks, on a large scale through spatio-temporal cluster detection. Generally, we foresee that the results presented could encourage researchers to further examine the presented features along with other additional features of self-recorded data, for example, various CGM features and physical activity data, on a large-scale basis.
Theoretical background of the methods.
Detailed description of the models input features.
Score plot of the models for each patient year.
Model evaluations – performance of the models for each patient year.
area under the receiver operating characteristic curve
connectivity-based outlier factor
incremental support vector data description
K-nearest neighbor
local outlier factor
minimum covariance determinant
mixture of Gaussian
minimum spanning tree
nearest neighbor
principal component analysis
self-organizing maps
support vector data description
receiver operating characteristic curve
one-class support vector machine
The work presented in this paper is part of the project
The first author, AW, conceived the study, designed and performed the experiments, and wrote the manuscript. IK, EÅ, JI, DA, and GH provided successive inputs and revised the manuscript. All authors approved the final manuscript.
None declared.