This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Reference intervals (RIs) play an important role in clinical decision-making. However, due to the time, labor, and financial costs involved in establishing RIs using direct means, the use of indirect methods, based on big data previously obtained from clinical laboratories, is receiving increasing attention. Different indirect techniques combined with different data transformation methods and outlier removal might cause differences in the calculation of RIs. However, there are few systematic evaluations of this.
This study used data derived from direct methods as reference standards and evaluated the accuracy of combinations of different data transformation, outlier removal, and indirect techniques in establishing complete blood count (CBC) RIs for large-scale data.
The CBC data of populations aged ≥18 years undergoing physical examination from January 2010 to December 2011 were retrieved from the First Affiliated Hospital of China Medical University in northern China. After exclusion of repeated individuals, we performed parametric, nonparametric, Hoffmann, Bhattacharya, and truncation points and Kolmogorov–Smirnov distance (kosmic) indirect methods, combined with log or Box-Cox transformation, and Reed–Dixon, Tukey, and iterative mean (3SD) outlier removal methods in order to derive the RIs of 8 CBC parameters and compared the results with those previously established by the direct method. Furthermore, bias ratios (BRs) were calculated to assess which combination of indirect technique, data transformation pattern, and outlier removal method is preferable.
Raw data showed that the degrees of skewness of the white blood cell (WBC) count, platelet (PLT) count, mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), and mean corpuscular volume (MCV) were much more obvious than those of other CBC parameters. After log or Box-Cox transformation combined with Tukey or iterative mean (3SD) processing, the distribution types of these data were close to Gaussian distribution. Tukey-based outlier removal yielded the maximum number of outliers. The lower-limit bias of WBC (male), PLT (male), hemoglobin (HGB; male), MCH (male/female), and MCV (female) was greater than that of the corresponding upper limit for more than half of the 30 indirect methods. Computational indirect choices of CBC parameters for males and females were inconsistent. The RIs of MCHC established by the direct method for females were narrow. For this, the kosmic method was markedly superior, which contrasted with the RI calculation of CBC parameters with high BR qualification rates for males. Among the top 10 methodologies for the WBC count, PLT count, HGB, MCV, and MCHC with a high-BR qualification rate among males, the Bhattacharya, Hoffmann, and parametric methods were superior to the other 2 indirect methods.
Compared to results derived by the direct method, outlier removal methods and indirect techniques markedly influence the final RIs, whereas data transformation has negligible effects, except for obviously skewed data. Specifically, the outlier removal efficiency of Tukey and iterative mean (3SD) methods is almost equivalent. Furthermore, the choice of indirect techniques depends more on the characteristics of the studied analyte itself. This study provides scientific evidence for clinical laboratories to use their previous data sets to establish RIs.
Reference intervals (RIs) play an important role in clinical decision-making. For most clinical laboratories, objective RIs are critical benchmarks for identifying healthy and unhealthy populations [
Previous studies have demonstrated that RIs can be established through direct, indirect, and transference methods [
The establishment of RIs using indirect methods roughly involves data acquisition, data cleaning, transformation of skewed data, elimination of outliers or error values, and selection of appropriate statistical methods to calculate the reference limits (RLs) [
In 2020, Hickman et al [
Hence, we systematically and comprehensively explored the effects of various combinations of different statistical techniques used in indirect methods on RI determination and compared the RIs established using different indirect and direct methods. Our results will provide a scientific basis for clinicians to use their own laboratory data to establish RIs suitable for their own service population.
The data-processing flowchart of this study is shown in Figure S1 in
The probability of illness and interference factors among populations undergoing physical examination is much lower than that of outpatient and emergency patients; thus, in the absence of other clinical diagnostic information, relatively healthy populations undergoing physical examination are more suitable for establishing complete blood count (CBC) RIs using indirect methods. Furthermore, it is difficult for clinical laboratory researchers to obtain clinical information, which creates a barrier to setting inclusion and exclusion standards for indirect methods. To simulate practical application scenarios, indirect methods derived from real-world physical examination data have great application and promotion value. A data set of 8 CBC parameters (the red blood cell [RBC] count, hemoglobin [HGB], hematocrit [HCT], the mean corpuscular volume [MCV], mean corpuscular hemoglobin [MCH], the mean corpuscular hemoglobin concentration [MCHC], the platelet [PLT] count, and the white blood cell [WBC] count) for populations aged ≥18 years undergoing physical examination, measured with an XE-2100 hematology analyzer (Sysmex Corp), from January 2010 to December 2011 (24-month period) was retrieved from the laboratory information system in the physical examination center of the First Affiliated Hospital of China Medical University, Shenyang. The relevant instruments and equipment for CBC testing during the study period remained the same. A daily control was performed following the International Council for Standardization in Hematology (ICSH) guidelines.
Next, preliminary data cleaning was performed, as reported by Jones et al [
The data distribution normality was evaluated by calculating skewness and kurtosis. We used the 2 most commonly used data normalization methods, namely log transformation and Box-Cox transformation. After skewed data were transformed through these 2 normalization methods, combined with 3 outlier removal techniques, namely Reed–Dixon, Tukey, and iterative mean (3SD), we obtained 6 processed initial data sets.
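The normality check, the transformation, and two of the three outlier filters can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's R code: the moment-based skewness and non-excess kurtosis (about 3 for Gaussian data, consistent with the values reported in this paper), the Tukey fence multiplier of 1.5, and all function names are assumptions. The Reed–Dixon rule is omitted because its exact criterion is not given here.

```python
import math
import statistics


def skewness(xs):
    """Moment-based skewness (0 for a symmetric distribution)."""
    n = len(xs)
    m = statistics.fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def kurtosis(xs):
    """Moment-based, non-excess kurtosis (about 3 for Gaussian data)."""
    n = len(xs)
    m = statistics.fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2


def box_cox(xs, lam):
    """Box-Cox transform of positive values; lam == 0 is the natural log."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


def tukey_filter(xs, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is assumed)."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if low <= x <= high]


def iterative_3sd_filter(xs):
    """Drop values beyond mean +/- 3 SD, repeating until none are removed."""
    kept = list(xs)
    while True:
        m = statistics.fmean(kept)
        sd = statistics.stdev(kept)
        trimmed = [x for x in kept if abs(x - m) <= 3 * sd]
        if len(trimmed) == len(kept):
            return kept
        kept = trimmed
```

In practice the Box-Cox parameter would be estimated from the data (for example by maximum likelihood); here `lam` is simply passed in.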
Data were analyzed using R version 4.1.2 (R Core Team and the R Foundation for Statistical Computing) [
Subsequent statistical operations were based on the 6 data sets obtained after data transformation and outlier removal, except for nonparametric and kosmic methods. Nonparametric and kosmic methods used the original data set after data transformation and outlier removal, whereas the parametric, Hoffmann, and Bhattacharya methods used the transformed data set after outlier removal to establish RIs.
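The 30 indirect methods evaluated in this study arise from crossing 2 transformations, 3 outlier removal methods, and 5 indirect techniques. The sketch below only enumerates the combinations; the short labels follow the abbreviations used in the figures and tables, and joining them with hyphens is an assumption made for readability.

```python
from itertools import product

TRANSFORMS = ["log", "box"]                 # log and Box-Cox transformation
OUTLIER_METHODS = ["Dixon", "Tukey", "3SD"]  # Reed-Dixon, Tukey, iterative mean (3SD)
TECHNIQUES = ["P", "NP", "Hoff", "Bhatt", "kosmic"]


def method_labels():
    """Return one label per transformation/outlier/technique combination."""
    return ["-".join(combo)
            for combo in product(TRANSFORMS, OUTLIER_METHODS, TECHNIQUES)]


labels = method_labels()
```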
In this method, we first calculated the mean (
This method was used per CLSI guidelines. RIs were determined based on the central 95% range of reference values, that is, the lower limits (LLs) and upper limits (ULs) were interpreted as the 2.5th and 97.5th percentiles, respectively.
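As a minimal sketch, the central 95% range can be read off as the 2.5th and 97.5th percentiles of the cleaned reference values. The interpolation rule used here (Python's `exclusive` quantile method, which places the p-th quantile at rank p(n+1)) is an assumption; the text does not state which rank convention was applied.

```python
import statistics


def nonparametric_ri(values):
    """Return (lower, upper) limits as the 2.5th and 97.5th percentiles."""
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(values, n=40, method="exclusive")
    return cuts[0], cuts[-1]
```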
The Hoffmann method mainly requires Gaussian distribution data. We used the data after normalization and outlier removal. This method relies on Q–Q plots and visual inspection of manual intercepts of linear segments [
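The idea can be sketched in code: on a normal Q–Q plot, the intercept of the linear segment estimates the mean and its slope estimates the SD of the Gaussian (presumed healthy) component. In the sketch below, the fixed central window (`p_low`, `p_high`) is a hypothetical stand-in for the visual selection of the linear segment described above, and the 1.96 multiplier assumes a central 95% interval.

```python
from statistics import NormalDist


def hoffmann_ri(values, p_low=0.25, p_high=0.75):
    """Fit a line to the central part of a normal Q-Q plot and return
    (lower, upper) limits as intercept -/+ 1.96 * slope."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n  # plotting position of the i-th ordered value
        if p_low <= p <= p_high:
            points.append((std_normal.inv_cdf(p), x))
    z_mean = sum(z for z, _ in points) / len(points)
    x_mean = sum(x for _, x in points) / len(points)
    sxx = sum((z - z_mean) ** 2 for z, _ in points)
    sxy = sum((z - z_mean) * (x - x_mean) for z, x in points)
    sd = sxy / sxx               # slope of the Q-Q line estimates the SD
    mean = x_mean - sd * z_mean  # intercept estimates the mean
    return mean - 1.96 * sd, mean + 1.96 * sd
```

For transformed data, the returned limits would still need to be back-transformed to the original scale.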
This is a graphical method requiring Gaussian distribution data. Thus, we used data after normalization and outlier removal [
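A minimal sketch of the underlying computation, under stated assumptions: the data are binned with equal width h, the difference in log counts between adjacent bins is regressed on the midpoint of the lower bin, and a segment with negative slope identifies a Gaussian component with mean = x-intercept + h/2 and variance = -h/slope - h²/12. The fitting window (`fit_from`, `fit_to`) is a hypothetical replacement for the manual selection of points, and the sample is synthetic.

```python
import math
from statistics import NormalDist


def bhattacharya_ri(values, bin_width, start, fit_from, fit_to):
    """Estimate (lower, upper) limits of a Gaussian component from binned data."""
    # 1. Histogram with equal-width bins beginning at `start`.
    counts = {}
    for v in values:
        k = math.floor((v - start) / bin_width)
        counts[k] = counts.get(k, 0) + 1
    # 2. Difference of log counts between adjacent bins vs lower-bin midpoint.
    points = []
    for k in sorted(counts):
        if k + 1 in counts:
            midpoint = start + (k + 0.5) * bin_width
            if fit_from <= midpoint <= fit_to:
                delta = math.log(counts[k + 1]) - math.log(counts[k])
                points.append((midpoint, delta))
    # 3. Least-squares line through the selected points.
    x_mean = sum(x for x, _ in points) / len(points)
    y_mean = sum(y for _, y in points) / len(points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("no negative-slope segment in the chosen window")
    intercept = y_mean - slope * x_mean
    # 4. Convert the line to the mean and SD of the Gaussian component.
    mean = -intercept / slope + bin_width / 2
    sd = math.sqrt(-bin_width / slope - bin_width ** 2 / 12)
    return mean - 1.96 * sd, mean + 1.96 * sd


# Synthetic, noise-free Gaussian sample (mean 100, SD 10), for illustration only.
gauss = NormalDist(100, 10)
sample = [gauss.inv_cdf((i - 0.5) / 20000) for i in range(1, 20001)]
lower, upper = bhattacharya_ri(sample, bin_width=2, start=50,
                               fit_from=80, fit_to=120)
```

On this synthetic sample the limits should land close to 100 ± 1.96 × 10; real data would contain several overlapping components, and the window must then isolate the dominant one.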
The kosmic method primarily uses a truncated power normal distribution family (Gaussian or truncated Gaussian after using Box-Cox transformation) to model the proportion of physiological samples. Specifically, this approach minimizes the Kolmogorov–Smirnov distance between an estimated normal distribution and the truncated part of the observed distribution of test results after Box-Cox transformation; more specific principles are described by Zierk et al [
The data set for establishing RIs using the direct method pertained to the data of Han Chinese adults from September 2010 to January 2011 from our previously published study [
This study adopted 2 approaches to evaluate the differences and biases in RLs determined using various methods. We selected RLs determined using the direct method as the reference standard for evaluating the differences among various indirect methods for calculating RIs. One approach was to calculate the relative deviation between RLs. The following formula was used:
where d% is the difference between the LL/UL for indirect and direct methods, and the LLs and ULs of the RIs to be evaluated and the “reference standard” RIs are LL_{e}, UL_{e}, LL_{r}, and UL_{r}, respectively.
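A conventional form of the relative deviation, consistent with the symbols defined above but assumed here rather than quoted from the study, is:

```latex
d\%_{LL} = \frac{LL_{e} - LL_{r}}{LL_{r}} \times 100\%, \qquad
d\%_{UL} = \frac{UL_{e} - UL_{r}}{UL_{r}} \times 100\%
```

Whether the study reported signed or absolute deviations is not stated in this passage; the signed form is shown.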
The other method was to calculate bias ratios (BRs) [
Next, the BRs for LL_{e} and UL_{e} were calculated as follows [
Finally, we regarded BRs<0.375 as the allowable minimum bias [
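Both evaluation measures can be sketched in a few lines. The BR definition used below (absolute difference between limits divided by the between-individual SD of the reference-standard RI, taken as (UL_{r} − LL_{r})/3.92) is the commonly used form and is an assumption here; the numeric limits in the usage note are hypothetical.

```python
ALLOWABLE_BR = 0.375  # threshold for the allowable bias used in this study


def relative_deviation(limit_e, limit_r):
    """Signed relative deviation (%) of an evaluated limit from the reference limit."""
    return (limit_e - limit_r) / limit_r * 100.0


def bias_ratio(limit_e, limit_r, ll_r, ul_r):
    """Bias ratio: |evaluated - reference| over the SD implied by the reference RI."""
    sd_r = (ul_r - ll_r) / 3.92  # assumed: RI width spans +/- 1.96 SD
    return abs(limit_e - limit_r) / sd_r


def is_acceptable(br, threshold=ALLOWABLE_BR):
    """True when the bias ratio is below the allowable threshold."""
    return br < threshold
```

For example, with a hypothetical direct RI of 4.0 to 10.0 and an indirect lower limit of 4.2, `bias_ratio(4.2, 4.0, 4.0, 10.0)` is about 0.13, which is below 0.375.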
This study complies with all the relevant national regulations and institutional policies and the tenets of the Declaration of Helsinki. The study was approved by the Ethics Committee of the First Affiliated Hospital of China Medical University (approval number 2021 442). Informed consent for the direct method was obtained from all individuals included in this study, whereas the indirect method research was exempt from informed consent due to the use of previous laboratory data.
Scatter plots and fitting curves describing the original CBC parameter changes with age for both sexes revealed notable relationships between all the CBC parameters and age (
Trends of variation in the levels of CBC parameters with age. (A) WBCs (×10^{9}/L), (B) PLTs (×10^{9}/L), (C) RBCs (×10^{12}/L), (D) HGB (g/L), (E) MCH (pg), (F) MCV (fL), (G) MCHC (g/L), and (H) HCT (L/L). CBC: complete blood count; HCT: hematocrit; HGB: hemoglobin; MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean corpuscular volume; PLT: platelet; RBC: red blood cell; WBC: white blood cell.
Sex differences in 8 CBC^{a} parameters after data cleaning.
Analyte  Unit  Male (n=24,073), median (IQR)  Female (n=18,048), median (IQR)  Overall (N=42,121), median (IQR)  P value^{b}
WBC^{c}  ×10^{9}/L  6.64 (5.68-7.75)  6.09 (5.19-7.11)  6.40 (5.45-7.50)  <.001
RBC^{d}  ×10^{12}/L  4.98 (4.75-5.21)  4.39 (4.19-4.58)  4.72 (4.39-5.05)  <.001
HGB^{e}  g/L  152 (146.00-159.00)  130 (124.00-136.00)  143 (131.00-154.00)  <.001
HCT^{f}  L/L  0.45 (0.43-0.46)  0.39 (0.38-0.41)  0.42 (0.39-0.45)  <.001
MCV^{g}  fL  90 (87.00-92.00)  89 (87.00-92.00)  90 (87.00-92.00)  <.001
MCH^{h}  pg  30.6 (29.80-31.50)  29.8 (28.90-30.70)  30.3 (29.40-31.20)  <.001
MCHC^{i}  g/L  341 (335.00-347.00)  332 (327-338)  337 (331-344)  <.001
PLT^{j}  ×10^{9}/L  208 (180-239)  230 (199-266)  217 (187-251)  <.001
^{a}CBC: complete blood count.
^{b}
^{c}WBC: white blood cell.
^{d}RBC: red blood cell.
^{e}HGB: hemoglobin.
^{f}HCT: hematocrit.
^{g}MCV: mean corpuscular volume.
^{h}MCH: mean corpuscular hemoglobin.
^{i}MCHC: mean corpuscular hemoglobin concentration.
^{j}PLT: platelet.
Skewness and kurtosis of raw data and log transformation–processed data for 8 CBC^{a} parameters after different data transformation and outlier removal methods were applied.
Analyte and sex  Raw data, skewness  Raw data, kurtosis  Reed–Dixon, skewness  Reed–Dixon, kurtosis  Tukey, skewness  Tukey, kurtosis  Iterative mean (3SD), skewness  Iterative mean (3SD), kurtosis
Male
WBC^{b}  6.264  243.709  0.158  4.483  0.020  2.732  0.042  2.887
PLT^{c}  0.873  8.840  –1.003  14.253  –0.060  2.742  –0.093  2.937
RBC^{d}  –0.445  5.622  –1.172  11.090  –0.171  2.787  –0.218  2.975
HGB^{e}  –0.752  6.617  –1.530  13.064  –0.106  2.734  –0.202  3.014
HCT^{f}  –0.690  6.484  –1.349  11.595  –0.203  2.717  –0.194  3.034
MCV^{g}  –0.204  8.406  –0.768  10.718  0.090  2.788  0.093  2.902
MCH^{h}  –0.704  11.389  –1.601  17.466  0.060  2.762  0.068  2.981
MCHC^{i}  –0.015  5.041  –0.193  5.897  0.072  2.814  0.088  2.907
Female
WBC  1.332  11.619  0.085  3.490  0.010  2.737  0.008  2.883
PLT  0.698  6.632  –0.634  6.622  –0.076  2.765  –0.106  2.925
RBC  0.064  5.316  –0.565  9.993  –0.056  2.748  –0.066  2.876
HGB  –1.157  6.853  –1.907  11.560  –0.145  2.757  –0.283  3.080
HCT  –0.700  5.509  –1.104  6.803  –0.163  2.746  –0.253  3.031
MCV  –1.614  9.333  –2.115  12.179  –0.107  2.894  –0.199  3.044
MCH  –2.162  11.334  –2.867  15.873  –0.138  2.822  –0.262  3.131
MCHC  –1.012  7.856  –1.305  9.112  0.029  2.822  –0.033  3.056
^{a}CBC: complete blood count.
^{b}WBC: white blood cell.
^{c}PLT: platelet.
^{d}RBC: red blood cell.
^{e}HGB: hemoglobin.
^{f}HCT: hematocrit.
^{g}MCV: mean corpuscular volume.
^{h}MCH: mean corpuscular hemoglobin.
^{i}MCHC: mean corpuscular hemoglobin concentration.
Skewness and kurtosis of raw data and Box-Cox transformation–processed data for 8 CBC^{a} parameters after different data transformation and outlier removal methods were applied.
Analyte and sex  Raw data, skewness  Raw data, kurtosis  Reed–Dixon, skewness  Reed–Dixon, kurtosis  Tukey, skewness  Tukey, kurtosis  Iterative mean (3SD), skewness  Iterative mean (3SD), kurtosis
Male
WBC^{b}  6.264  243.709  –0.019  4.133  –0.022  2.733  –0.031  2.901
PLT^{c}  0.873  8.840  0.116  5.262  0.066  2.753  0.077  2.949
RBC^{d}  –0.445  5.622  0.034  4.838  0.023  2.760  –0.001  2.948
HGB^{e}  –0.752  6.617  –0.268  4.680  0.008  2.889  0.005  2.998
HCT^{f}  –0.690  6.484  –0.254  4.736  0.059  2.644  0.009  2.989
MCV^{g}  –0.204  8.406  0.120  8.233  0.172  2.778  0.211  2.941
MCH^{h}  –0.704  11.389  –0.001  10.061  0.141  2.762  0.206  2.975
MCHC^{i}  –0.015  5.041  0.026  4.902  0.112  2.819  0.145  2.923
Female
WBC  1.332  11.619  –0.003  3.424  –0.022  2.740  –0.037  2.888
PLT  0.698  6.632  0.085  4.578  0.054  2.767  0.074  2.925
RBC  0.064  5.316  0.064  5.316  0.039  2.731  0.048  2.891
HGB  –1.157  6.853  –0.608  5.046  0.006  2.801  –0.117  3.113
HCT  –0.700  5.509  –0.280  4.372  0.035  2.850  –0.136  3.105
MCV  –1.614  9.333  –1.134  7.729  –0.075  3.040  –0.075  3.040
MCH  –2.162  11.334  –1.544  8.769  –0.055  2.830  –0.146  3.153
MCHC  –1.012  7.856  –0.814  6.484  0.037  2.838  0.041  3.109
^{a}CBC: complete blood count.
^{b}WBC: white blood cell.
^{c}PLT: platelet.
^{d}RBC: red blood cell.
^{e}HGB: hemoglobin.
^{f}HCT: hematocrit.
^{g}MCV: mean corpuscular volume.
^{h}MCH: mean corpuscular hemoglobin.
^{i}MCHC: mean corpuscular hemoglobin concentration.
The RIs of CBC parameters, established using 30 different indirect methods, are displayed in a bar chart for males (
Comparison of RIs for males using 31 calculation methods. (A) WBC (×10^{9}/L), (B) PLT (×10^{9}/L), (C) RBC (×10^{12}/L), (D) HGB (g/L), (E) MCH (pg), (F) MCV (fL), (G) MCHC (g/L), and (H) HCT (L/L). 3SD: mean (3SD) with iteration; Bhatt: Bhattacharya; box: Box-Cox transformation; Direct: direct methods; Hoff: Hoffmann; HCT: hematocrit; HGB: hemoglobin; log: log transformation; MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean corpuscular volume; NP: nonparametric; P: parametric; PLT: platelet; RBC: red blood cell; RI: reference interval; WBC: white blood cell.
Comparison of RIs for females using 31 calculation methods. (A) WBC (×10^{9}/L), (B) PLT (×10^{9}/L), (C) RBC (×10^{12}/L), (D) HGB (g/L), (E) MCH (pg), (F) MCV (fL), (G) MCHC (g/L), and (H) HCT (L/L). 3SD: mean (3SD) with iteration; Bhatt: Bhattacharya; box: Box-Cox transformation; Direct: direct methods; Hoff: Hoffmann; HCT: hematocrit; HGB: hemoglobin; log: log transformation; MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean corpuscular volume; NP: nonparametric; P: parametric; PLT: platelet; RBC: red blood cell; RI: reference interval; WBC: white blood cell.
When the outlier removal methods and indirect techniques remained fixed, the UL, LL, or both RLs for all CBC parameters, except the WBC count, among males shifted rightward along the x axis after log transformation compared to Box-Cox transformation (
For WBC data among males, we found that the raw data demonstrated a right-skewed distribution (skewness=6.264, kurtosis=243.709;
For female CBC parameter data, the tendency for change was the same as that for males (
More outliers were eliminated using the Tukey method than using the iterative mean (3SD) and Reed–Dixon methods. For males, the elimination rate using the Tukey method ranged from 1.15% to 3.26% compared to 0.64%-1.77% using the iterative mean (3SD) method and 0% using the Reed–Dixon method (Table S3 in
With fixed data transformation and indirect techniques, the RIs of analytes were mostly affected by outlier removal methods (
Nonparametric and parametric methods yielded wider RIs of HGB, MCH, MCV, MCHC, and HCT than did the other 3 methods among females (
The representative Q–Q plots of quantiles of the Gaussian distribution of CBC parameters in males and females, after using different data transformation and outlier removal methods, are displayed in Figures S2 and S3 in
Gaussian populations for establishing RIs were selected by identifying points that formed a straight line with a negative slope (Figures S4-S19 in
The estimated distributions of physiological test results and RIs are shown in Figures S20 and S21 in
First, the BR values of the LL and UL were sorted in ascending order.
We also found the following:
Only a large bias of LLs (proportion of methods with BR_{LL}<0.375 below 85%): HGB (female), MCV (female), MCH (female), and MCHC (female)
Only a large bias of ULs (proportion of methods with BR_{UL}<0.375 below 85%): WBC (male/female) and RBC (male/female)
A large bias of both limits (proportions for both BR_{LL} and BR_{UL} below 85%): MCH (male) and HCT (male/female)
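The grouping above can be reproduced with a small helper that converts the BRs of the 30 indirect methods into a qualification rate and compares it with the 85% cutoff. This is an illustrative sketch; the function names and the example BR values in the usage note are hypothetical.

```python
def qualification_rate(bias_ratios, threshold=0.375):
    """Percentage of methods whose bias ratio is below the threshold."""
    passed = sum(1 for br in bias_ratios if br < threshold)
    return 100.0 * passed / len(bias_ratios)


def has_large_bias(rate, cutoff=85.0):
    """True when fewer than `cutoff` percent of methods meet the BR criterion."""
    return rate < cutoff
```

For instance, if only 4 of 30 methods had a BR below 0.375, the qualification rate would be 13.33%, which falls below the 85% cutoff and therefore counts as a large bias for that limit.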
See
Data transformation only slightly affected the comparison between the direct and indirect methods. When we compared the BR of RLs for males and females, we found that log (43 times) and Box-Cox (41 times) transformations were similar in the top 10 methodologies (
For the selection of indirect techniques, computational choices of CBC parameters for males and females were inconsistent. The RIs of MCHC established using the direct method for females were narrow. For this, the kosmic method was markedly superior (
Comparison of RIs for (A) WBC count, (B) PLT count, (C) RBC, and (D) HGB for males using 31 indirect methods with calculation of bias at RLs. (A) WBC (×10^{9}/L), (B) PLT (×10^{9}/L), (C) RBC (×10^{12}/L), and (D) HGB (g/L). 3SD: mean (3SD) with iteration; Bhatt: Bhattacharya; box: Box-Cox transformation; Hoff: Hoffmann; BR: bias ratio; d% between LL: relative deviation of lower RL between indirect and direct methods; d% between UL: relative deviation of upper RL between indirect and direct methods; HGB: hemoglobin; LL: lower limit; log: log transformation; PLT: platelet; RBC: red blood cell; RI: reference interval; RL: reference limit; UL: upper limit; WBC: white blood cell.
Comparison of RIs of (A) MCH, (B) MCV, (C) MCHC, and (D) HCT for males using 31 indirect methods with calculation of bias at RLs. (A) MCH (pg), (B) MCV (fL), (C) MCHC (g/L), and (D) HCT (L/L). 3SD: mean (3SD) with iteration; Bhatt: Bhattacharya; box: Box-Cox transformation; HCT: hematocrit; Hoff: Hoffmann; BR: bias ratio; d% between LL: relative deviation of lower RL between indirect and direct methods; d% between UL: relative deviation of upper RL between indirect and direct methods; LL: lower limit; log: log transformation; MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean corpuscular volume; RI: reference interval; RL: reference limit; UL: upper limit.
Ratio of the BR^{a} for lower and upper RLs^{b} of <0.375, based on 30 indirect methods and the intersection of the top 10 combinations of data transformation, outlier removal, and indirect techniques for descending BR_{LL}^{c} or BR_{UL}^{d}.
Analyte and sex  Ratio of BR_{LL}<0.375 (%)  Ratio of BR_{UL}<0.375 (%)  Top methods
Male
WBC^{e}  100.00  80.00  log^{f}-Dixon-Bhatt^{g}
PLT^{h}  100.00  100.00  box^{i}-Dixon-P^{j}, log-Dixon-NP^{k}, box-Dixon-NP
RBC^{l}  100.00  46.67  log-3SD^{m}-NP
HGB^{n}  90.00  100.00  log-Dixon-Hoff^{o}, box-Dixon-P
MCH^{p}  13.33  76.67  box-3SD-Bhatt, box-Tukey-Hoff, box-Tukey-Bhatt, box-3SD-P, box-3SD-Hoff
MCV^{q}  100.00  100.00  box-3SD-Bhatt
MCHC^{r}  96.67  93.33  box-Tukey-Bhatt, box-3SD-Bhatt, log-3SD-Bhatt, box-Dixon-Hoff, box-Tukey-P, box-Tukey-Hoff, box-3SD-P, box-3SD-Hoff
HCT^{s}  73.33  40.00  log-Dixon-Hoff, log-Tukey-Hoff, log-Tukey-Bhatt, log-3SD-P, log-3SD-NP, log-3SD-Hoff, log-3SD-Bhatt
Female
WBC  100.00  53.33  box-Dixon-Hoff, box-Tukey-Bhatt, box-3SD-Bhatt
PLT  100.00  93.33  box-Tukey-NP, box-3SD-P, box-3SD-Hoff, log-Dixon-kosmic, box-Dixon-kosmic, box-3SD-Bhatt
RBC  100.00  70.00  log-Tukey-P, log-Tukey-Hoff, log-3SD-P, log-3SD-Hoff, log-3SD-NP
HGB  83.33  96.67  log-Dixon-Hoff, log-3SD-Hoff, box-Dixon-Bhatt, log-Tukey-NP, log-Tukey-Hoff, log-3SD-P, log-3SD-NP
MCH  43.33  86.67  box-Tukey-NP, box-Tukey-P, box-Tukey-Hoff, box-Tukey-Bhatt
MCV  66.6  93.33  log-3SD-NP, box-Dixon-Hoff, box-Dixon-Bhatt, box-Tukey-P, box-Tukey-NP, box-Tukey-Hoff, box-3SD-P, box-3SD-NP, box-3SD-Hoff
MCHC  73.33  90.00  log-Dixon-kosmic, log-Tukey-kosmic, log-3SD-kosmic, box-Dixon-kosmic, box-Tukey-kosmic, box-3SD-kosmic, log-Tukey-P, log-Tukey-NP, log-Tukey-Hoff, log-Tukey-Bhatt, log-3SD-P, log-3SD-Hoff, log-3SD-Bhatt, box-3SD-Bhatt
HCT  56.67  63.33  log-Dixon-Hoff, log-Dixon-Bhatt, log-Tukey-P, log-Tukey-NP, log-Tukey-Hoff, log-Tukey-Bhatt, log-3SD-P, log-3SD-Hoff
^{a}BR: bias ratio.
^{b}RL: reference limit.
^{c}LL: lower limit.
^{d}UL: upper limit.
^{e}WBC: white blood cell.
^{f}log: log transformation.
^{g}Bhatt: Bhattacharya method.
^{h}PLT: platelet.
^{i}box: Box-Cox transformation.
^{j}P: parametric.
^{k}NP: nonparametric.
^{l}RBC: red blood cell.
^{m}3SD: iterative mean (3SD).
^{n}HGB: hemoglobin.
^{o}Hoff: Hoffmann method.
^{p}MCH: mean corpuscular hemoglobin.
^{q}MCV: mean corpuscular volume.
^{r}MCHC: mean corpuscular hemoglobin concentration.
^{s}HCT: hematocrit.
To the best of our knowledge, this is the first study to comprehensively evaluate the effects of combinations of different data transformation, outlier removal, and indirect techniques on establishing RIs for large-scale data. For most laboratories that are unable to carry out direct method research, this study provides a scientific and reasonable basis for their use of previous laboratory data sets to establish RIs using indirect methods. Moreover, we used data derived from the direct method as reference standards. We found that for data with different distribution characteristics, outlier removal method and indirect technique use markedly influenced the final RIs, whereas data transformation had negligible effects except for obviously skewed data.
There are several strengths of this study. First, the samples for direct and indirect methods were derived from the same study area and were tested in the same laboratory using the same instrumentation as well. This eliminated many confounding factors for the subsequent combined evaluation of indirect methods. Second, only harmonization of results requires multicenter studies in a given region or country. Thus, this single-center study will encourage laboratory researchers to attempt to establish RIs suitable for their own laboratories. Third, to mostly eliminate the interference of “diseased populations,” we selected subjects undergoing physical examination rather than outpatient or inpatient individuals. Furthermore, to ensure that the “reference population” excluded as much as possible individuals who were sick and considering that individuals who have repeated measurement data are more likely to have abnormalities or diseases, we included only the earliest visit records [
For CBC parameters, we found significant differences between the sexes and minor variation among age groups based on the data used for indirect methods. This phenomenon is essentially consistent with Takami et al [
When exploring the impact of different data transformation methods used with indirect techniques on calculating RIs, we found slight differences between log and Box-Cox transformations for different types of data distribution. In most cases, the absolute skewness values for Box-Cox transformation were lower than those for log transformation, which means that the effect of Box-Cox transformation is better than that of log transformation when λ is not equal to 0, as in this study. For some skewed distribution data, Box-Cox transformation might result in overcorrection. For example, raw WBC data in males showed a right-skewed distribution. After log transformation, the data could approximate a Gaussian distribution, although the skewness value still exceeded 0. In contrast to the effect of log transformation, the skewness value of the transformed data was less than 0 after Box-Cox transformation, implying correction to a left-skewed distribution. Consequently, raw WBC data for males shifted rightward on the abscissa after Box-Cox transformation, which explains the horizontal rightward shift of RIs calculated after Box-Cox transformation. Data transformation may bring the pathological population closer to the healthy population, thereby making it more difficult to separate them. However, in this study, compared with the impact of outlier processing and indirect technology on the results, data transformation had a slight impact on the final results.
There have been many studies on the significant effects of the chosen outlier removal method on RIs [
When we explored the influence of different indirect techniques on RIs, we found that data transformation and outlier removal methods had different effects on the results of subsequent parametric, nonparametric, Hoffmann, and Bhattacharya methods. Outlier removal had a larger effect than data transformation on indirect techniques. There was a recent publication examining 8 different indirect methods for RI derivation [
This study has a few limitations. First, it is essential to consider the requirement of correct stratifications when comparing directly and indirectly estimated RIs. Both direct and indirect methods lead to erroneous RIs if stratification is performed for unknown variables. However, this study explored 2 important factors affecting RIs, namely sex and age, in order to eliminate the interference of these factors in the comparison of results as much as possible. Second, the 5 indirect methods used in this study were all based on unimodal data. Thus, the calculation was simply based on the distribution rule of the data itself, which inevitably causes contamination of the “reference population” with the “diseased population” to the fullest extent. The exclusion of “diseased populations” using statistical tools is obviously insufficient. Therefore, in the future, we will obtain as much multimodal data as possible during the data-cleaning phase in order to construct an unsupervised classification model to ensure that the included “reference population” is healthy. Next, this study selected results derived using the direct method as a reference. However, the direct sampling of the population may not be truly representative and is always subject to methodological bias and variability. To ensure the comparability of the results of the indirect and direct methods, this study selected the same research period, the same detecting system, and the same sampling location to avoid the risk of bias as much as possible. In addition, the direct method was used as a measurement standard of the indirect method results, which was also applied by Ozarda et al [
In summary, this comparative study investigated indirect methods for establishing RIs, and the results provide a valuable scientific basis for method selection by laboratory clinicians. Compared to the results of the direct method, the selection of outlier removal methods and indirect techniques markedly affects the final RIs, whereas the effects of data transformation are negligible except for obviously skewed data. Specifically, the outlier removal efficiency of Tukey and iterative mean (3SD) methods is almost equivalent. Furthermore, the choice of indirect techniques depends more on the characteristics of the studied analyte itself. Use of the kosmic method to establish RIs of analytes with large intraindividual variations is not recommended. Furthermore, each laboratory should develop its own RIs under the applicable conditions. This study provides a new scientific basis for establishing RIs for laboratories at any level. In the future, we will explore more efficient indirect techniques based on multimodal data. Evaluation of the accuracy and applicability of RIs estimated using indirect methods is also needed, particularly in the absence of direct data as a reference.
Supplementary figures.
Supplementary tables.
BR: bias ratio
CBC: complete blood count
CLSI: Clinical and Laboratory Standards Institute
HCT: hematocrit
HGB: hemoglobin
ICSH: International Council for Standardization in Hematology
LL: lower limit
MCH: mean corpuscular hemoglobin
MCHC: mean corpuscular hemoglobin concentration
MCV: mean corpuscular volume
PLT: platelet
RBC: red blood cell
RI: reference interval
RL: reference limit
UL: upper limit
WBC: white blood cell
The authors thank the Reference Interval Working Group for the Chinese population for providing data used for the direct method. We also thank all the participants for their cooperation and sample contributions.
This research was supported by the National Key Technologies R&D Program of China provided by the Ministry of Science and Technology of the People’s Republic of China (2019YFC0840701) and the Chinese Academy of Medical Sciences (CAMS) Innovation Fund for Medical Sciences (2019I2M5027).
The data sets used and analyzed during the study are available from the corresponding author upon reasonable request.
All authors have accepted responsibility for the entire content of this manuscript and have approved its submission. MZ conceived, designed, supervised, and coordinated the study and reviewed the manuscript. DY performed the statistical analysis and wrote and revised the manuscript; YD, RM, YL, and XZ revised and supervised the manuscript; ZS and LZ contributed to editing the manuscript. SW, XW, and HW coordinated subject recruitment and sample management.
None declared.