This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Quality patient care requires comprehensive health care data from a broad set of sources. However, missing data in medical records and matching field selection are 2 real-world challenges in patient-record linkage.
In this study, we aimed to evaluate the extent to which incorporating the missing at random (MAR) assumption in the Fellegi-Sunter model and using data-driven selected fields improve patient-matching accuracy in real-world use cases.
We adapted the Fellegi-Sunter model to accommodate missing data using the MAR assumption and compared the adaptation to the common strategy of treating missing values as disagreement with matching fields specified by experts or selected by data-driven methods. We used 4 use cases, each containing a random sample of record pairs with match statuses ascertained by manual reviews. Use cases included health information exchange (HIE) record deduplication, linkage of public health registry records to HIE, linkage of Social Security Death Master File records to HIE, and deduplication of newborn screening records, which represent real-world clinical and public health scenarios. Matching performance was evaluated using the sensitivity, specificity, positive predictive value, negative predictive value, and F1-score.
Incorporating the MAR assumption in the Fellegi-Sunter model maintained or improved F1-scores, regardless of whether matching fields were expert-specified or selected by data-driven methods. Combining the MAR assumption and data-driven fields optimized the F1-scores in the 4 use cases.
MAR is a reasonable assumption in real-world record linkage applications: it maintains or improves F1-scores regardless of whether matching fields are expert-specified or data-driven. Data-driven selection of fields coupled with MAR achieves the best overall performance, which can be especially useful in privacy-preserving record linkage.
Quality patient care requires comprehensive health care data from a broad set of sources. Electronic medical record (EMR) data are increasingly distributed across many sources as the era of digital health care accelerates in the United States. However, EMR data from independent databases often lack a common patient identifier, which impedes data aggregation, causes inefficiencies (eg, unnecessarily repeated tests), affects patient care, and hinders research. Record linkage is therefore a requisite step for effective and efficient patient care and research. Without a unique universal patient identifier, linking patient records is a nontrivial task. The simplest class of approaches is the deterministic method, which requires exact agreement on selected data elements of a pair of records, such as name, birth date, gender, and Social Security number. Although deterministic algorithms are generally simple to implement and achieve excellent specificity, they have low sensitivity, are not robust to missing data, cannot quantify the uncertainty of the matching process, and are inflexible to changing data characteristics.
The Fellegi-Sunter (FS) [
First, it is well known that missing data are prevalent in real-world data in EMRs [
Second, although there may be numerous fields (or attributes) across record files, not all of them are useful for matching. For example, when matching 2 obstetrics and gynecology databases, the field “gender” is not informative. In real-world data, there are likely also dependencies among the data fields. As we have demonstrated [
We will evaluate the effects of incorporating missing data treatment and matching field selection into the FS algorithm on linkage performance using 4 real-world use cases in our local operational data aggregation system—a health information exchange (HIE) environment, into which different data sources are integrated. The 4 use cases included health information exchange record deduplication (labeled as Indiana Network for Patient Care [INPC]), linkage of a public health registry Marion County Health Department records to HIE (labeled as MCHD), linkage of Social Security Death Master File records of the Social Security Administration to HIE (labeled as SSA), and deduplication of newborn screening records (labeled as NBS). We hypothesize that proper treatment of missing data and data-driven matching field selection will enhance linkage performance.
Records need to be compared in record linkage to ascertain whether they belong to the same entity. Forming record pairs by the Cartesian product of the 2 files (or of a file with itself in the case of deduplication) results in an enormously large number of pairs. For example, the data set from the INPC (the INPC use case) has 47,334,986 records (
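The blocking idea described above can be sketched as follows: records are grouped by a blocking key, and candidate pairs are enumerated only within each block, avoiding the full Cartesian product. The records and field names below are hypothetical toy examples, not data from the study.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key_fields):
    """Group records by a blocking key and enumerate candidate pairs
    only within each block. Records missing any key field cannot be
    placed in a block under this scheme and are skipped."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        key = tuple(rec.get(f) for f in key_fields)
        if None not in key:
            blocks[key].append(i)
    for ids in blocks.values():
        yield from combinations(ids, 2)

# Hypothetical toy records (not from the study data)
records = [
    {"FN": "ANA", "LN": "LEE", "YB": 1990},
    {"FN": "ANA", "LN": "LEE", "YB": 1990},
    {"FN": "BOB", "LN": "RAY", "YB": 1985},
    {"FN": "ANA", "LN": "KIM", "YB": None},
]
pairs = list(block_pairs(records, ["FN", "LN", "YB"]))
# Only records 0 and 1 share the FN-LN-YB blocking key
```

In practice, several such schemes are run and their candidate pairs are combined, so that pairs missed by one scheme (eg, because of a missing key field) can still be captured by another.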
Summary of the 4 use cases, Indiana Network for Patient Care (INPC), newborn screening (NBS), Social Security Administration (SSA), and Marion County Health Department (MCHD), with the blocking schemes and the number of record pairs in each blocking scheme.

Use case | Block | Pairs
INPC | SSNa | 53,054,690
INPC | FN-TELb | 41,729,402
INPC | DB-MB-YB-ZIPc | 133,553,036
INPC | FN-LN-YBd | 193,865,283
INPC | DB-LN-MB-YBe | 191,181,498
NBS | MRNf | 4,147,098
NBS | TELg | 2,644,454
NBS | MB-DB-ZIP | 8,083,396
NBS | LN-FNh | 3,005,368
NBS | NK_LN-NK_FNi | 1,217,736
SSA | SSN | 805,331
SSA | FN-LN-ZIP | 18,103
SSA | FN-LN-MI-YB | 1,395,395
SSA | FN-LN-MI-DB-MB | 547,376
SSA | FN-LN-DB-MB-YB | 722,167
MCHD | SSN | 869,454
MCHD | TEL | 28,238
MCHD | DB-MB-YB-ZIP | 5,083,429
MCHD | FN-LN-YB | 3,378,017
MCHD | DB-LN-MB-YB | 3,701,460

aSSN: Social Security number.
bFN-TEL: first name and telephone number.
cDB-MB-YB-ZIP: day, month, and year of birth and zip code.
dFN-LN-YB: first name, last name, and year of birth.
eDB-LN-MB-YB: day, month, and year of birth and last name.
fMRN: medical record number.
gTEL: telephone number.
hLN-FN: last name and first name.
iNK_LN-NK_FN: next of kin last name and first name.
Formally, for the
A popular algorithm, named after Fellegi and Sunter [
Match scores are defined as the logarithm of likelihood ratios, S(γ) = log[P(γ | M) / P(γ | U)], where γ is the agreement vector obtained by comparing the matching fields of a record pair and M and U denote the sets of matched and nonmatched pairs, respectively.
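Under the usual conditional independence assumption, the score decomposes into a sum of per-field log-likelihood-ratio weights. The following is a minimal sketch with hypothetical m- and u-probabilities, not parameter values from the study.

```python
import math

def fs_score(agreement, m, u):
    """Sum of log-likelihood-ratio weights over matching fields.
    agreement[f] is 1 (agree) or 0 (disagree); m[f] = P(agree | match),
    u[f] = P(agree | nonmatch), assuming conditional independence."""
    score = 0.0
    for f, gamma in agreement.items():
        if gamma == 1:
            score += math.log(m[f] / u[f])       # agreement weight
        else:
            score += math.log((1 - m[f]) / (1 - u[f]))  # disagreement weight
    return score

# Hypothetical parameters for two fields: last name and day of birth
m = {"LN": 0.95, "DB": 0.90}
u = {"LN": 0.01, "DB": 1 / 31}
s = fs_score({"LN": 1, "DB": 1}, m, u)  # both fields agree: large positive score
```

Pairs with scores above an upper threshold are classified as matches and pairs below a lower threshold as nonmatches, with the thresholds chosen to control error rates.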
Formally describing the missing data mechanism is important for devising an approach to account for missing data. Missing data are generally classified into 3 types [
In record linkage applications, missing values in matching fields are typically handled by excluding records with missing values on one of the matching fields when estimating match weights [
Predictive results are obtained in the same way for the FS model under MAR and under MAD; the difference lies in the manner in which missing data are treated. When MAD is used, fields with missing data are set to “disagreement” (coded as 0), and the FS algorithm can proceed as is on the data with missing values replaced by zeros. When MAR is used, the FS algorithm is applied to the nonmissing data only. In either case, parameters
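The two treatments can be sketched as follows. The field names and probabilities here are hypothetical, and the function is an illustration of the scoring step only, not the study's production implementation.

```python
import math

def fs_score_missing(agreement, m, u, strategy="MAR"):
    """Score a record pair when some field comparisons are missing (None).
    MAD: recode missing as disagreement (0) before scoring.
    MAR: drop missing fields from the likelihood entirely."""
    score = 0.0
    for f, gamma in agreement.items():
        if gamma is None:
            if strategy == "MAR":
                continue      # MAR: the field contributes no evidence
            gamma = 0         # MAD: treat missing as disagreement
        if gamma == 1:
            score += math.log(m[f] / u[f])
        else:
            score += math.log((1 - m[f]) / (1 - u[f]))
    return score

# Hypothetical parameters; email is missing on one record of the pair
m = {"LN": 0.95, "EMAIL": 0.38}
u = {"LN": 0.01, "EMAIL": 0.026}
pair = {"LN": 1, "EMAIL": None}
# Under MAR the missing email is ignored; under MAD it penalizes the pair
assert fs_score_missing(pair, m, u, "MAR") > fs_score_missing(pair, m, u, "MAD")
```

The example makes the practical difference concrete: MAD converts every missing comparison into negative evidence, whereas MAR lets the score rest only on the fields actually observed.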
Fields that are 100% missing within a blocking scheme contain no information and were not considered further. We examined 2 approaches to selecting matching fields: the standard practice of subject matter expert-guided field selection and a data-driven approach. In the data-driven approach, all fields were considered putative matching fields. A necessary condition for a field to be useful in matching is that it exhibits variability; a field whose value is fixed (no variation) cannot separate matches from nonmatches. Thus, a blocking variable cannot be used as a matching field within a block formed using that variable. When running an FS model, we started with the largest possible set of fields; fields were then dropped from the model, starting with those with the least variation, until the FS algorithm converged.
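One simple way to operationalize the variability criterion is to rank candidate fields by the variance of their nonmissing agreement indicators. The sketch below uses hypothetical agreement vectors and illustrates the selection criterion, not the authors' exact implementation.

```python
def field_variation(comparisons):
    """For each field, return the variance p(1 - p) of its nonmissing
    agreement indicators; 0 means the field cannot separate matches
    from nonmatches and should be dropped first."""
    out = {}
    for f in comparisons[0].keys():
        vals = [c[f] for c in comparisons if c[f] is not None]
        if not vals:
            out[f] = 0.0  # 100% missing: no information
            continue
        p = sum(vals) / len(vals)
        out[f] = p * (1 - p)
    return out

# Hypothetical agreement vectors within one block; SEX always agrees,
# so it carries no discriminating information in this block
comps = [
    {"SEX": 1, "ZIP": 1, "TEL": None},
    {"SEX": 1, "ZIP": 0, "TEL": 1},
    {"SEX": 1, "ZIP": 0, "TEL": 0},
]
var = field_variation(comps)
order = sorted(var, key=var.get)  # drop least-varying fields first
```

Fields would then be removed in this order until the model fits, mirroring the drop-until-convergence procedure described above.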
We evaluated the matching performance of the missing data treatment (MAD and MAR) and matching field selection (expert-specified fields vs data-driven fields) by conducting a 2-by-2 factorial design using 4 real-world use cases in our local HIE environment. The 4 use cases contain data that were generated as part of clinical or public health processes.
The 4 use cases included deduplicating clinical records in a state-level HIE, linking a public health registration file to clinical data in the HIE, linking death records to clinical data in the HIE, and deduplicating the Health Level Seven International (HL7) messages for newborns less than 1 month of age from the HIE. For each use case, blocking was performed to confine the record pairs to be compared to a subspace enriched with true matches [
Manual review results for the 4 use cases.
Use case | Number of pairsa | Number of pairs deemed as matches | Number of pairs deemed as nonmatches | Match prevalenceb
INPCc | 15,000 | 7840 | 7160 | 0.523 |
SSAd | 16,500 | 5950 | 10,550 | 0.361 |
NBSe | 15,000 | 7967 | 7033 | 0.531 |
MCHDf | 15,500 | 5927 | 9573 | 0.382 |
aNumber of pairs is the total number of pairs sampled for manual review, which determines the pairs as either matches or nonmatches.
bMatch prevalence is the ratio of the number of pairs deemed as matches and the total number of pairs for manual review for each use case.
cINPC: Indiana Network for Patient Care.
dSSA: Social Security Administration.
eNBS: newborn screening.
fMCHD: Marion County Health Department.
This data set reflected demographic records from geographically proximal hospital systems that participate in the HIE. Blocking is as described earlier. The data contained a subset of 15,000 sampled gold standard pairs with 7840 (52.3%) true positives and 7160 (47.7%) true negatives. Patients from hospitals in close proximity cross over to nearby institutions, creating the need to identify common records. New value-based purchasing models, such as Accountable Care Organizations, have dramatically increased the need to identify and capture information on patients seeking care from other institutions.
These data reflect a combination of the Social Security Death Master File and HIE data. We applied five blocking schemes (
This data set included demographic data for newborns derived from multiple hospitals and clinics within the HIE. These data were limited to patients aged <2 months. We applied five blocking schemes (
This data set comes from the MCHD, Indiana’s largest public health department. The registry contains a master list of demographic information for clients who receive public health services such as immunization; Women, Infants, and Children’s nutrition support; and laboratory testing [
The 4 data sets contained subsets of the following fields: MRN, SSN, last name (LN), first name (FN), middle initial (MI), nickname (NICK_SET), ethnicity (ETH_IMP), sex, month of birth (MB), day of birth (DB), YB, street address (ADR), city, state (ST), zip code (ZIP), telephone number (TEL), email, last name of next of kin (NK_LN), first name of next of kin (NK_FN), last name of treating physician (DR_LN), and first name of treating physician (DR_FN). The last 4 fields were used only in the NBS use case.
For each use case, blocking was performed first, and five blocks of record pairs were generated. The blocking schemes are listed in
Within each run of the FS model, the estimate of block-specific prevalence under each missing treatment was used to classify record pairs as matches and nonmatches (see Classification of Record Pairs); the union of matches from all 5 blocks is the set of matches obtained.
To evaluate the accuracy of these matching models, we calculated the following metrics: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1-score.
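These metrics are computed directly from the confusion matrix of predicted versus reviewed match statuses. A minimal sketch with illustrative counts (not figures from the study):

```python
def matching_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics used to evaluate linkage output:
    tp/fp/tn/fn are counts of true/false positives and negatives
    against the manually reviewed ground truth."""
    sens = tp / (tp + fn)                 # sensitivity (recall)
    spec = tn / (tn + fp)                 # specificity
    ppv = tp / (tp + fp)                  # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    f1 = 2 * ppv * sens / (ppv + sens)    # harmonic mean of PPV and sensitivity
    return {"sensitivity": sens, "specificity": spec,
            "PPV": ppv, "NPV": npv, "F1": f1}

# Illustrative counts only
m = matching_metrics(tp=90, fp=10, tn=80, fn=20)
```

Because the F1-score balances PPV against sensitivity, it is a convenient single summary for comparing the 2-by-2 combinations of missing data treatment and field selection.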
This study was reviewed and approved by the Indiana University Institutional Review Board (IRB#: 1703755361).
The 4 use cases contained missing data to various extents. Notably, 45.8% of the 47,334,986 records in the INPC use case had no SSN, making it necessary to add other blocking schemes that do not rely on the SSN. For the NBS use case, the SSN is typically missing because infants do not receive an SSN for at least 2 to 6 weeks after birth and often later if parents do not initially request the identifier. When linking the other 2 use cases SSA and MCHD to INPC, due to the INPC data set missing SSN in 45.8% of its records, blocking on SSN alone yielded only 4547 out of 5950 (76%) and 1531 out of 5927 (26%) of true matches in SSA and MCHD, respectively, based on the manually reviewed subsets (
Additional blocking schemes are essential to increase match sensitivity. As the FS algorithm is performed using paired data per blocking scheme, its performance is directly affected by the extent of missing values in the agreement vectors obtained by comparing pairs of records within each block. We summarized the proportions of missing data in the 5 blocking schemes of each use case in
Matching fields with even substantial missing values nonetheless proved to be useful in discriminating matches from nonmatches. For example, the agreement status of email address comparison is missing for 99% of record pairs in the DB-LN-MB-YB blocking scheme of the INPC use case; the m- and u-probabilities were estimated to be 0.01147 and 0.000204 under MAD and 0.3830 and 0.02553 under MAR, respectively. The large ratios of the m-probability over the u-probability in either case indicate the utility of email address in linkage. As another example, the agreement status of zip code comparison is also missing for 99% of record pairs in the FN-LN-MI-YB block of the SSA use case, and the m- and u-probabilities were estimated to be 0.02073 and 8.49×10−7 under MAD and 0.7538 and 0.000137 under MAR, respectively. In both examples, the estimates of m- and u-probabilities are much larger under MAR than under MAD, suggesting that a downward bias might be incurred by artificially setting missing values to disagreement in the MAD approach.
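The discriminating power described above can be read off directly from the agreement weights, the log ratios of the m- and u-probabilities. The sketch below plugs in the estimates reported for the email field in the DB-LN-MB-YB block of the INPC use case; the weight computation itself is standard FS arithmetic.

```python
import math

# Reported m- and u-probability estimates for email agreement in the
# DB-LN-MB-YB block of the INPC use case
m_mad, u_mad = 0.01147, 0.000204   # missing as disagreement (MAD)
m_mar, u_mar = 0.3830, 0.02553     # missing at random (MAR)

# Agreement weight: log-likelihood ratio contributed when emails agree
w_mad = math.log(m_mad / u_mad)
w_mar = math.log(m_mar / u_mar)
# Both weights are large and positive, so email agreement is strong
# evidence of a match despite ~99% missingness in this block
```

Note that while both treatments yield a large positive weight, the absolute m- and u-estimates are much larger under MAR, consistent with the downward bias incurred by recoding missing values as disagreement.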
The fields used by the final FS models, either expert-specified or data-driven per block per use case, are summarized in
Summary of modeling information by data use case and by blocking scheme.

Data and block | Expert-specified fieldsa | Data-driven fieldsa
INPC | |
DB-LN-MB-YBb | MRNc FNd SEXe TELf ADRg ZIPh SSNi | MRN FN SEX TEL ADR ZIP SSN
DB-MB-YB-ZIP | MRN LN FN SEX TEL ADR SSN | MRN LN FN SEX TEL ADR SSN
FN-LN-YB | MRN SEX DB MB TEL ADR ZIP SSN | MRN SEX DB MB TEL ADR ZIP SSN
FN-TEL | MRN LN SEX DB MB YB ADR ZIP SSN | MRN LN SEX DB MB YB ADR ZIP SSN
SSN | MRN LN FN SEX DB MB YB TEL ADR ZIP | MRN LN FN SEX DB MB YB TEL ADR ZIP
SSAk | |
FN-LN-DB-MB-YB | SSN MI ZIP | SSN MI ZIP
FN-LN-MI-DB-MB | ZIP YB SSN | ZIP YB SSN
FN-LN-MI-YB | DB MB ZIP SSN | DB MB ZIP SSN
FN-LN-ZIP | MI DB MB YB SSN | MI DB MB YB SSN
SSN | LN FN MI DB MB YB ZIP | LN FN MI DB MB YB ZIP
NBSl | |
LN-FN | MRN SEX DB MB YB TEL ADR ZIP | MRN SEXm DB MB YBm TEL ADR ZIP
MB-DB-ZIP | MRN LN FN SEX YB TEL ADR | MRN LN FN SEX YB TEL ADR
MRN | LN FN SEX DB MB YB TEL ADR ZIP | LN FN SEXm DB MB YB TELm ADR ZIP
NK_LN-NK_FN | MRN LN FN SEX DB MB YB TEL ADR ZIP | MRN LNm FN SEX DB MB YB TEL ADR ZIP
TEL | MRN LN FN SEX DB MB YB ADR ZIP | MRN LN FN SEX DB MB YBm ADR ZIP
MCHDn | |
LN-FN | MRN SEX DB MB YB TEL ADR ZIP | MRN SEXm DB MB YBm TEL ADR ZIP
MB-DB-ZIP | MRN LN FN SEX YB TEL ADR | MRN LN FN SEX YB TEL ADR
MRN | LN FN SEX DB MB YB TEL ADR ZIP | LN FN SEXm DB MB YB TELm ADR ZIP
NK_LN-NK_FN | MRN LN FN SEX DB MB YB TEL ADR ZIP | MRN LNm FN SEX DB MB YB TEL ADR ZIP
TEL | MRN LN FN SEX DB MB YB ADR ZIP | MRN LN FN SEX DB MB YBm ADR ZIP
aColumns “Expert-specified fields” and “Data-driven fields” display the fields used in the Fellegi-Sunter (FS) model.
bDB-LN-MB-YB: day, month, and year of birth and last name.
cMRN: medical record number.
dFN: first name.
eSEX: sex.
fTEL: telephone number.
gADR: address.
hZIP: zip code.
iSSN: Social Security number.
jFields (italicized) selected only by data-driven methods.
kSSA: Social Security Administration.
lNBS: newborn screening.
mFields not selected by the data-driven method but specified by experts.
nMCHD: Marion County Health Department.
The matching metrics of the 4 use cases evaluated on their respective ground truth sets of randomly selected and manually reviewed record pairs are displayed in
MAR improves the F1-score.
MAD using expert-specified fields had higher F1-scores than MAD using data-driven fields.
MAR coupled with data-driven fields yielded the best overall F1-scores across the 4 use cases.
In the SSA use case, the results under MAD and MAR were nearly identical.
The algorithms performed differently, partly because of the different data quality of the use cases.
Matching results of the 4 use cases evaluated on their respective ground truth sets of randomly selected and manually reviewed record pairs.

Data | Value, N | Sensitivity (95% CI) | Specificity (95% CI) | Positive predictive value (95% CI) | Negative predictive value (95% CI) | F1-score (95% CI)
Expert-specified fields | | | | | |
INPCa | | | | | |
MADb | 15,000 | 0.962 (0.958-0.967) | 0.990 (0.987-0.992) | 0.990 (0.988-0.992) | 0.960 (0.955-0.964) | 0.976 (0.974-0.978)
MARc | 15,000 | 0.970 (0.966-0.974) | 0.988 (0.986-0.991) | 0.989 (0.987-0.991) | 0.968 (0.964-0.972) | 0.980 (0.977-0.982)
SSAd | | | | | |
MAD | 16,500 | 0.781 (0.770-0.792) | 0.995 (0.994-0.996) | 0.989 (0.986-0.992) | 0.890 (0.884-0.895) | 0.873 (0.866-0.879)
MAR | 16,500 | 0.785 (0.775-0.796) | 0.995 (0.993-0.996) | 0.989 (0.985-0.991) | 0.892 (0.886-0.897) | 0.875 (0.869-0.882)
NBSe | | | | | |
MAD | 15,000 | 0.795 (0.786-0.804) | 0.881 (0.874-0.889) | 0.883 (0.876-0.891) | 0.791 (0.782-0.801) | 0.837 (0.830-0.843)
MAR | 15,000 | 0.860 (0.852-0.868) | 0.873 (0.865-0.881) | 0.885 (0.877-0.892) | 0.846 (0.838-0.855) | 0.872 (0.866-0.878)
MCHDf | | | | | |
MAD | 15,500 | 0.944 (0.937-0.949) | 0.989 (0.987-0.991) | 0.982 (0.979-0.986) | 0.966 (0.962-0.969) | 0.963 (0.959-0.966)
MAR | 15,500 | 0.946 (0.940-0.952) | 0.988 (0.986-0.990) | 0.980 (0.976-0.983) | 0.967 (0.964-0.971) | 0.963 (0.959-0.966)
Data-driven fields | | | | | |
INPC | | | | | |
MAD | 15,000 | 0.579 (0.568-0.590) | 0.988 (0.986-0.991) | 0.982 (0.978-0.985) | 0.682 (0.672-0.690) | 0.729 (0.719-0.737)
MAR | 15,000 | 0.970 (0.966-0.974) | 0.987 (0.984-0.989) | 0.988 (0.985-0.990) | 0.968 (0.964-0.972) | 0.979 (0.976-0.981)
SSA | | | | | |
MAD | 16,500 | 0.781 (0.770-0.792) | 0.995 (0.994-0.996) | 0.989 (0.986-0.992) | 0.890 (0.884-0.895) | 0.873 (0.866-0.879)
MAR | 16,500 | 0.785 (0.775-0.796) | 0.995 (0.993-0.996) | 0.989 (0.985-0.991) | 0.892 (0.886-0.897) | 0.875 (0.869-0.882)
NBS | | | | | |
MAD | 15,000 | 0.813 (0.805-0.822) | 0.875 (0.867-0.883) | 0.880 (0.873-0.888) | 0.805 (0.796-0.814) | 0.845 (0.839-0.852)
MAR | 15,000 | 0.865 (0.858-0.873) | 0.870 (0.863-0.878) | 0.883 (0.876-0.890) | 0.851 (0.842-0.859) | 0.874 (0.868-0.880)
MCHD | | | | | |
MAD | 15,500 | 0.635 (0.622-0.648) | 0.970 (0.967-0.974) | 0.929 (0.921-0.937) | 0.811 (0.804-0.818) | 0.754 (0.745-0.764)
MAR | 15,500 | 0.954 (0.948-0.959) | 0.988 (0.985-0.990) | 0.979 (0.976-0.983) | 0.972 (0.968-0.975) | 0.967 (0.963-0.970)
aINPC: Indiana Network for Patient Care.
bMAD: missing as disagreement.
cMAR: missing at random.
dSSA: Social Security Administration.
eNBS: newborn screening.
fMCHD: Marion County Health Department.
Cross-tabulation of ground truth and classification results by the Fellegi-Sunter model under missing as disagreement (MAD) and missing at random (MAR) for the Social Security Administration use case.

Among the 5950 record pairs whose ground truth status is match:

MAD classification | MAR: Nonmatch | MAR: Match | Value, N
Nonmatch | 1277 | 26 | 1303
Match | 0 | 4647 | 4647
Value, N | 1277 | 4673 | 5950

Among the 10,550 record pairs whose ground truth status is nonmatch:

MAD classification | MAR: Nonmatch | MAR: Match | Value, N
Nonmatch | 10,495 | 3 | 10,498
Match | 1 | 51 | 52
Value, N | 10,496 | 54 | 10,550
The US health care system will likely not have a unique and universal patient ID in the near future, so innovations such as incorporating missing data under MAR and data-driven field selection in the linkage algorithms are necessary to optimize existing methods to ensure accurate patient identity and support patient safety. Our findings are important because they demonstrate improvements in linkage performance among 4 different but representative use cases. Our HIE-based patient-matching laboratory has experience matching clinical data from heterogeneous sources, including hospitals (inpatient and emergency departments) [
Although the assumption of missing at random is not verifiable, the success of the FS algorithm coupled with MAR in our 4 different use cases indicates that missing at random is a reasonable assumption. As MCAR is a special case of MAR, our algorithm also works when data are MCAR. These results will inform future research and development in the patient-matching space.
Furthermore, the superior performance observed with MAR using data-driven fields over the other combinations in the 2×2 design and 4 use cases suggests its potential value for incorporation into privacy-preserving record linkage (PPRL) methods. In PPRL, to preserve privacy, fields can be tokenized (eg, using bigrams) into smaller parts and compared [
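As a simplified illustration of the bigram tokenization idea, a field value can be split into overlapping character bigrams and two values compared by set overlap; real PPRL systems additionally hash or encode the tokens (eg, into Bloom filters) so that raw values are never exchanged. The similarity function below is a common choice, not necessarily the one used by any specific PPRL implementation.

```python
def bigrams(s):
    """Split a field value into its set of overlapping character
    bigrams, the token unit commonly used in PPRL comparisons."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Dice coefficient between the bigram sets of two field values;
    tokenized fields can be compared approximately without ever
    exchanging the raw values themselves."""
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

# A typo-tolerant comparison: the two spellings still score highly
sim = dice("JOHNSON", "JONSON")
```

Such approximate, token-level comparisons are exactly where a principled missing data treatment matters, because encoded fields inherit the missingness of the underlying identifiers.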
Finally, many data-driven fields may lead to model overfitting, which is a prominent cause of the poor performance of machine learning algorithms. In many applications in medical research using latent class models, many covariates are available, and the number of covariates overwhelms the number of observations. This is the main motivation for most of the variable selection literature to identify a subset of variables to (1) estimate the association between the covariates and the response variable and (2) obtain a parsimonious model that describes the covariates and the response variable [
While we strive to generate results that are applicable to the broadest possible audience using a health informatics research laboratory that captures a diverse set of data elements with varying data characteristics, we cannot assure generalizability with complete certainty. If our data are not representative of other health systems, then our linkage results may not be applicable. If the missing data mechanism is not MAR or MCAR (eg, if the missingness of a data element is related to its value), our algorithm will likely not work. Before applying our methods to a data environment with missing data, we recommend creating a ground truth set of randomly selected record pairs whose match status is manually reviewed to determine whether our methods are applicable to a specific data environment.
Finally, our results suggest that accommodating missingness in patient-matching algorithms can improve accuracy. While the FS model is widely used, different FS implementations and completely different models (eg, decision trees or boosting algorithms) may exhibit a greater or lesser effect. We will explore the potential of these machine learning tools in our future work.
In summary, the combination of data-driven matching field selection and the MAR method produced the best overall performance in the 4 real-world matching use cases. The MAR method maintained or improved F1-scores regardless of whether matching fields were expert-specified or selected by data-driven methods.
Table S1 Proportion of missing values by matching field in the Indiana Network for Patient Care (INPC) use case. For each blocking scheme (column) the unshaded fields are used for matching in the final Fellegi-Sunter (FS) model for that block in the data-driven approach.
Table S2 Proportion of missing values by field in the Social Security Administration (SSA) use case. For each blocking scheme (column) the unshaded fields are used for matching in the final Fellegi-Sunter (FS) model for that block in the data-driven approach.
Table S3 Proportion of missing values by field in the newborn screening (NBS) use case. For each blocking scheme (column) the unshaded fields are used for matching in the final Fellegi-Sunter (FS) model for that block in the data-driven approach.
Table S4 Proportion of missing values by field in the Marion County Health Department (MCHD) use case. For each blocking scheme (column) the unshaded fields are used for matching in the final Fellegi-Sunter (FS) model for that block in the data-driven approach.
Table S5 Matching results of the SSA use case evaluated on a set of 16,500 randomly selected and manually reviewed record pairs. The first two rows are the overall results combined from all blocks on the manually reviewed sample, with the first row for MAD (missing as disagreement) and the second row for MAR (missing at random). Every subsequent two rows pertain to a specific block, with the first containing the results of MAD and the 2nd row the results of MAR. Columns N, SEN, SPE, PPV, NPV and F1 are the total number of manually reviewed record pairs, sensitivity, specificity, positive predictive value, negative predictive value and F-score.
Table S6 Matching results of the newborn screening (NBS) use case evaluated on a set of 15,000 randomly selected and manually reviewed record pairs. The first two rows are the overall results combined from all blocks on the manually reviewed sample, with the first row for MAD (missing as disagreement) and the second row for MAR (missing at random). Every subsequent two rows pertain to a specific block, with the first containing the results of MAD and the 2nd row the results of MAR. Columns N, SEN, SPE, PPV, NPV, and F1 are the total number of manually reviewed record pairs, sensitivity, specificity, positive predictive value, negative predictive value and F-score.
Table S7 Matching results of the Marion County Health Department (MCHD) use case evaluated on a set of 15,500 randomly selected and manually reviewed record pairs. The first two rows are the overall results combined from all blocks on the manually reviewed sample, with the first row for MAD (missing as disagreement) and the second row for MAR (missing at random). Every subsequent two rows pertain to a specific block, with the first containing the results of MAD and the 2nd row the results of MAR. Columns N, SEN, SPE, PPV, NPV, and F1 are the total number of manually reviewed record pairs, sensitivity, specificity, positive predictive value, negative predictive value and F-score.
Table S8 Data quality of fields of last name and first name in the DOB-ZIP block of the Indiana Network for Patient Care (INPC) and newborn screening (NBS) use cases.
CITY: city
DB: day of birth
EM: Expectation-Maximization
EMR: electronic medical record
ETH_IMP: ethnicity
FN: first name
FS: Fellegi-Sunter
HIE: health information exchange
HL7: Health Level Seven International
INPC: Indiana Network for Patient Care
LN: last name
LN-FN: last name and first name
MAR: missing at random
MB: month of birth
MCAR: missing completely at random
MCHD: Marion County Health Department
MI: middle initial
MNAR: missing not at random
MRN: medical record number
NBS: newborn screening
NPV: negative predictive value
PPRL: privacy-preserving record linkage
PPV: positive predictive value
SSA: Social Security Administration
SSN: Social Security number
YB: year of birth
This research was supported by grants from the Agency for Healthcare Research and Quality and the Patient-Centered Outcomes Research Institute.
None declared.