This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Genealogical information, such as that found in family trees, is essential for biomedical research areas such as disease heritability and risk prediction. Researchers have used policyholder and dependent information in medical claims data, as well as emergency contacts in electronic health records (EHRs), to infer family relationships at a large scale. We have previously demonstrated that online obituaries can serve as a novel data source for building more complete and accurate family trees.
Aiming at supplementing EHR data with family relationships for biomedical research, we built an end-to-end information extraction system using a multitask-based artificial neural network model to construct genealogical knowledge graphs (GKGs) from online obituaries. GKGs are enriched family trees with detailed information including age, gender, death and birth dates, and residence.
Built on a predefined family relationship map consisting of 4 entity types (ie, name, residence, birth date, and death date) and 71 relationship types, we curated a corpus containing 1700 online obituaries from the metropolitan area of Minneapolis and St Paul in Minnesota. We also adopted data augmentation technology to generate synthetic training data, alleviating the data scarcity of rare family relationships. A multitask-based artificial neural network model was then built to simultaneously detect names, extract the relationships between them, and assign attributes (eg, birth date, death date, residence, age, and gender) to each individual. Finally, we assembled related GKGs into larger ones by identifying people appearing in multiple obituaries.
Our system achieved satisfactory precision (94.79%), recall (91.45%), and F-1 measure (93.09%) on 10-fold cross-validation. We also constructed 12,407 GKGs, the largest comprising 4 generations and 30 people.
In this work, we discussed the value of GKGs for biomedical research, presented a new version of a corpus with a predefined family relationship map and augmented training data, and proposed a multitask deep neural system to construct and assemble GKGs. The results show that our system can accurately extract GKGs and demonstrate the potential of enriching EHR data for genetic research. We share the source code and system with the entire scientific community on GitHub, withholding the corpus for privacy protection.
Anthropologists often use oral interviews, historical records, genetic analysis, and other means to obtain genealogical information and draw family trees. When combined with a detailed medical history and social and economic relationships, family trees are considered the x-ray of the family and have been used by clinicians to assess disease risk, suggest treatments, recommend changes in diet and other lifestyle habits, and determine a diagnosis. In the United States, the Medicare Access and CHIP Reauthorization Act of 2015 [
Early exploratory works have combined EHR data and family trees for biomedical research. For instance, Mayer et al [
Constructing high-quality, large family trees has been challenging. Historically, only famous politicians, philosophers, scientists, religious groups, or royal families were tracked elaborately by genealogists. For this reason, large databases of family trees rarely existed, despite their research value. Recently, a few studies automated family tree collection using innovative informatics approaches. For instance, Mayer and colleagues [
Inspired by the work of Tourassi et al [
Traditionally, NER and RC were considered 2 separate tasks for information extraction: NER sought to extract named entity mentions from unstructured text into predefined categories, whereas RC classified the relations between those extracted entity mentions. Researchers built natural language processing (NLP) pipelines with multiple modules to accomplish the specific tasks. However, such modular separation suffered from 3 major issues leading to suboptimal results: (1) errors from NER propagated to RC, (2) it was computationally redundant and time-consuming, as the system had to pair up every 2 named entities to classify their relation, and (3) the pipeline model could not take full advantage of the knowledge inherent in the relationships of 2 or more named entities. For instance, if the system detected a
Thus, we look at multitask models that can simultaneously handle multiple related tasks and optimize their learning abilities by sharing the knowledge learned in all or some of the tasks [
In this work, we first updated our annotated corpus by defining a family relationship map to normalize various family relations (see details in Data section). We also used data augmentation technology to generate more synthetic data (sentences), in order to address the imbalanced training data issue and boost the performance on rare classes [
We collected 12,407 obituaries published from October 2008 to September 2018 from 3 funeral services websites and 1 local newspaper in the Twin Cities area, metropolitan Minneapolis–Saint Paul. Our data sources were limited to openly available obituaries. Considering the PII embedded in online obituaries, we decided to take a cautious and conservative position in our work by marking up the last name of any real person with the symbol XX (see more details on privacy protection in the Discussion section). After data cleaning, we randomly sampled 1700 obituaries for annotation. We developed an annotation guideline and trained 3 annotators to annotate each of the 1700 obituaries independently. The interannotator agreement measured by F-1 was 82.80% [
Summary statistics of the annotated corpus.a
Corpus | Count | Deceased person | Count | Special language patterns | Count |
Sentences | 28,317 | Full name | 1551 | Last name distributive | 4954 |
Names | 27,108 | Age | 1379 | Name with parentheses | 7504 |
Family relationship | 25,557 | Death date | 1557 | Spouse’s name | 5993 |
Residence | 7161 | Birth date | 1368 | Previous last name | 1511 |
Name-residence pair | 7954 | —b | — | — | — |
aAll counts are numbers of occurrences, except the full name of the deceased. Because all obituaries have structured metadata giving the full name of the deceased more precisely, we annotated and extracted only the first mention of the deceased's full name in each obituary. Spouse's name and previous last name are 2 subcategories of name with parentheses.
bNot applicable.
Examples of unique language patterns in obituaries.
Language pattern | Example | Explanation
Last name distributive | He is survived by grandsons Addison and Owen XX. | XX is also the last name for Addison.
Previous last name | Anne was born March 20, 1952, to William and Isabel (Starr) XX. | Starr is the maiden name for Isabel XX.
Spouse’s name | Survived by her sons, Dale (Mary) and Bruce (Diana). | Dale’s wife is Mary, and Bruce’s wife is Diana.
In this work, we made two improvements in the corpus annotations. First, we created a family relationship map that normalized various family relationship mentions to 71 family relationship groups. For example, there were many mentions of “born to (name),” “daughter of (name),” and “son of (name)” in obituaries, which were equivalent ways of expressing the deceased's parent. We grouped them into the “parent” relation. Similarly, we treated “married to” the same way as the “spouse (of)” relation.
Family relationship map in the obituary corpus.
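The normalization step described above can be sketched as a simple lookup from surface mentions to canonical relations. The entries below are illustrative examples only; the full map in this work covers 71 relationship types, and the function name is an assumption.

```python
# Illustrative mention-to-relation normalization map; the real map in this
# work contains 71 canonical relationship groups.
RELATION_MAP = {
    "born to": "parent",
    "daughter of": "parent",
    "son of": "parent",
    "married to": "spouse",
    "husband of": "spouse",   # illustrative entry, not from the paper
    "wife of": "spouse",      # illustrative entry, not from the paper
}

def normalize_relation(mention: str) -> str:
    """Map a relationship mention to its canonical group; pass through unknowns."""
    return RELATION_MAP.get(mention.lower().strip(), mention)
```

Unknown mentions fall through unchanged so that downstream modules can still see them.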
It was observed that some family relationships, such as granduncle, uncle-in-law, and half-sister, had too few cases to train a high-performance neural network model. Therefore, we used data augmentation technology [
Where
Where
After deciding
Synonym replacement: randomly replace
Random insertion: randomly insert a word’s synonym before or after the chosen non–stop word in the sentence
Random swap: randomly swap 2 non–stop words in the sentence
Random deletion: randomly remove a non–stop word in the sentence
The 4 entity types of interest in this work (name, residence, birth date, and death date) are exempt from these changes. It should also be noted that the generated sentences are not guaranteed to be grammatically and semantically correct. However, for neural network models, such sentences, when created with appropriate
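The 4 augmentation operations above can be sketched as follows. This is a minimal sketch: the synonym table and stop word list are toy stand-ins (the paper does not name its synonym source), the function signatures are assumptions, and in practice the indices of entity tokens (names, dates, residences) would be removed from the candidate sets before any operation.

```python
import random

# Toy synonym table standing in for a real lexical resource (hypothetical).
SYNONYMS = {"survived": ["outlived"], "beloved": ["cherished"], "passed": ["died"]}
STOP_WORDS = {"is", "by", "the", "a", "an", "of", "and", "her", "his"}

def _non_stop_indices(tokens):
    return [i for i, t in enumerate(tokens) if t.lower() not in STOP_WORDS]

def synonym_replacement(tokens, n=1):
    """Replace up to n non-stop words that have known synonyms."""
    tokens = tokens[:]
    candidates = [i for i in _non_stop_indices(tokens) if tokens[i].lower() in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        tokens[i] = random.choice(SYNONYMS[tokens[i].lower()])
    return tokens

def random_insertion(tokens, n=1):
    """Insert a synonym directly before or after a chosen non-stop word."""
    tokens = tokens[:]
    for _ in range(n):
        candidates = [i for i in _non_stop_indices(tokens) if tokens[i].lower() in SYNONYMS]
        if not candidates:
            return tokens
        i = random.choice(candidates)
        tokens.insert(i + random.choice([0, 1]), random.choice(SYNONYMS[tokens[i].lower()]))
    return tokens

def random_swap(tokens, n=1):
    """Swap 2 non-stop words, n times."""
    tokens = tokens[:]
    for _ in range(n):
        idx = _non_stop_indices(tokens)
        if len(idx) < 2:
            return tokens
        i, j = random.sample(idx, 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each non-stop word with probability p; never return an empty sentence."""
    kept = [t for t in tokens if t.lower() in STOP_WORDS or random.random() > p]
    return kept if kept else tokens[:1]
```

Each operation copies the token list, so an original sentence can be augmented many times independently.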
End-to-end extraction system to parse obituaries and generate genealogical knowledge graphs.
This module aimed to extract family members’ names, relationships, and additional attributes of people (residence, age, death date, birth date). Gender was usually not explicitly mentioned in the obituaries, so we inferred the gender in module 5. We adopted a customized tagging scheme (shown in
For each input token
Tagging scheme for simultaneously extracting entities and kinship. S: single; B: begin; I: inside; E: end.
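A toy sketch of how such composite tags can be decoded back into typed, relation-labeled spans. The POSITION-ENTITY-RELATION tag format below is an assumption made for illustration; the paper's exact label inventory may differ.

```python
# Hypothetical composite tags of the form POSITION-ENTITY-RELATION,
# where POSITION is S (single), B (begin), I (inside), or E (end).
tokens = ["Survived", "by", "his", "son", "John", "Michael", "XX", "of", "Rochester"]
tags   = ["O", "O", "O", "O", "B-Name-Son", "I-Name-Son", "E-Name-Son", "O", "S-Residence-None"]

def decode(tokens, tags):
    """Group B/I/E (and single-token S) tags into (text, entity, relation) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        pos, etype, rel = tag.split("-")
        if pos == "S":
            spans.append((tokens[i], etype, rel))
        elif pos == "B":
            start = i
        elif pos == "E" and start is not None:
            spans.append((" ".join(tokens[start:i + 1]), etype, rel))
            start = None
    return spans
```

Decoding this example yields a 3-token name span carrying the "Son" relation and a single-token residence span.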
After identifying the residence entities (eg, Rochester in
where
We identified 2 special language patterns in obituaries, last name distributive and names with parentheses, as shown in
where
Module 4 was a 3-class classifier to determine whether there was a parenthesis in a name and, if so, whether it referred to a previous last name or a spouse's name. The computing process was the same as in module 3, which took the input of
This module aimed to infer age, death date, and birth date for the deceased and gender for both the deceased and their family members. First, if an obituary mentioned any 2 of the 3 attributes age, birth date, and death date for the deceased, we calculated the third one. Second, we used both family relationship keywords and names to infer gender. If a family relationship keyword (eg, son, daughter, nephew) indicated gender, we added the gender tag accordingly. Otherwise, when the family relationship keyword (eg, spouse, parent) did not indicate gender, we used an external human name knowledge base to match the most likely gender with the name. For instance, “Tom” and “Emily” indicate male and female, respectively.
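The rule-based inference in this module can be sketched as follows. The keyword table and the name-to-gender lookup are illustrative stand-ins for the external human name knowledge base, and the date arithmetic for the derived attribute is only exact for the age computed from birth and death dates.

```python
from datetime import date

# Illustrative stand-in for the external human-name knowledge base.
NAME_GENDER = {"tom": "male", "emily": "female", "dale": "male", "mary": "female"}
# Relationship keywords that imply a gender.
KEYWORD_GENDER = {"son": "male", "daughter": "female", "nephew": "male",
                  "niece": "female", "brother": "male", "sister": "female"}

def infer_gender(relation_keyword, first_name):
    """Prefer the relationship keyword; fall back to the name knowledge base."""
    if relation_keyword in KEYWORD_GENDER:
        return KEYWORD_GENDER[relation_keyword]
    return NAME_GENDER.get(first_name.lower(), "unknown")

def complete_attributes(birth=None, death=None, age=None):
    """Given any 2 of birth date, death date, and age, derive the third."""
    if birth and death and age is None:
        # Age at death, accounting for whether the birthday had passed.
        age = death.year - birth.year - ((death.month, death.day) < (birth.month, birth.day))
    elif death and age is not None and birth is None:
        birth = date(death.year - age, death.month, death.day)  # year-level approximation
    elif birth and age is not None and death is None:
        death = date(birth.year + age, birth.month, birth.day)  # year-level approximation
    return birth, death, age
```

Note that deriving a birth or death date from an age is only approximate to the year, which is why the full system prefers explicitly stated dates when available.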
After constructing the GKGs from each obituary by modules 1 to 5, we assembled the extracted GKGs into bigger ones by matching PII, including people’s names, residence, birth date, death date, and family relationship.
We minimized the negative log likelihood loss of the generated tags for the first 4 modules (module 5 is a rule-based inference layer that did not require training). For module
Where
In the end, we combined all four loss functions
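Written out, a standard form of this objective consistent with the description above is the per-module negative log likelihood combined across modules; the weights $\lambda_m$ are an assumption (the combination may simply be an unweighted sum):

```latex
\mathcal{L}_m \;=\; -\sum_{i=1}^{N_m} \log p_\theta\!\left(y_i^{(m)} \,\middle|\, x_i^{(m)}\right),
\qquad
\mathcal{L}_{\mathrm{total}} \;=\; \sum_{m=1}^{4} \lambda_m \, \mathcal{L}_m
```

Here $N_m$ is the number of training targets for module $m$, and module 5 contributes no loss term because it is rule based.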
We performed 10-fold cross-validation by randomly selecting 10% of the annotated data for validation and using the remainder for training. It is worth noting that the augmented data were used only for training the models. Extracted GKGs consist of outputs from modules 1 to 5. They were measured by the averaged performance of modules 1 to 4; module 5 was excluded because this rule-based inference module lacks a gold standard. For modules 1 to 4, we used precision, recall, and F-1 measure for evaluation, which were computed as follows:
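In terms of true positives (TP), false positives (FP), and false negatives (FN), the standard definitions are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN},
\qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```

What counts as a TP, FP, or FN differs by module, as defined below.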
In module 1, the outputs were entity mentions with extra entity and relation types. We defined an extracted mention as a true positive only if the mention's boundary, entity type, and relation tags exactly matched the gold annotation. False positives were predicted mentions that did not precisely match the gold annotation's boundaries, entity type, or relation type. False negatives were mentions that existed in the gold annotation but were not recognized by the model.
In module 2, true positive instances were defined as pairs of name and location that matched exactly. If either name or location was wrong, the pair would be considered a false positive. False negative referred to the name-location pairs missed by our system.
Module 3 and module 4 were formulated as generic classification tasks, so we used common definitions of false negative, false positive, and true positive. For all modules, evaluation metrics were precision, recall, and F-1 measure.
As shown in
We also observed the benefits of multitask models through ablation experiments. Extra information gained from modules 3 and 4 improved module 1 in macroaveraged precision, recall, and F-1 measure (by 2.17%, 3.12%, and 2.64%, respectively) and in microaveraged precision, recall, and F-1 measure (by 1.27%, 1.12%, and 1.19%, respectively). Modules 1 and 3 improved the performance of module 4 by 2.76%, 1.32%, and 2.08% for macroaveraged precision, recall, and F-1 measure, respectively, and by 2.51%, 1.7%, and 2.13% for microaveraged precision, recall, and F-1 measure, respectively. Similarly, modules 1 and 4 improved the macroaveraged/microaveraged precision, recall, and F-1 measure of module 3 by 2.74%, 1.00%, and 1.88%, respectively. Finally, modules 1, 3, and 4 improved module 2 by 1.10%, 5.17%, and 3.49% in macroaveraged/microaveraged precision, recall, and F-1 measure, respectively.
It should be noted that module 2 did not seem helpful in improving the overall performance of the other modules. For module 1, the macroaveraged and microaveraged F-1 measures dropped by 1.41% (compare the first and third row of the macroaverage section of
Model performance of each module with ablation experiments.
Module and ablation test | Macroaveraged performance | | | Microaveraged performance | |
 | Pa (%) | Rb (%) | F1c (%) | P (%) | R (%) | F1 (%)
Module 1
Baseline | 81.68 | 79.93 | 80.80 | 94.15 | 92.40 | 93.27
Joint training (module 2, 3, & 4) + negative transfer | 82.07 | 81.99 | 82.03 | 94.08 | 92.79 | 93.43
Joint training (module 3 & 4) | 83.85 | 83.05 | 83.44 | 95.42 | 93.52 | 94.46
Module 2
Baseline | 83.17 | 68.43 | 75.08 | —d | — | —
Joint training (module 1, 3, & 4) | 84.27 | 73.60 | 78.57 | — | — | —
Module 3
Baseline | 89.64 | 92.01 | 90.81 | — | — | —
Joint training (module 1, 2, & 4) + negative transfer | 91.48 | 91.12 | 91.30 | — | — | —
Joint training (module 1 & 4) | 92.38 | 93.01 | 92.69 | — | — | —
Module 4
Baseline | 90.65 | 94.74 | 92.64 | 90.96 | 95.21 | 93.03
Joint training (module 1, 2, & 3) + negative transfer | 92.34 | 95.76 | 94.02 | 92.37 | 96.31 | 94.30
Joint training (module 1 & 3) | 93.41 | 96.06 | 94.72 | 93.47 | 96.91 | 95.16
aP: precision.
bR: recall.
cF1: F-1 measure.
dThe microaveraged and macroaveraged performances are the same for module 2 and module 3 because they are both binary classification tasks. All results shown are from the curated corpus without data augmentation.
We also adopted data augmentation technology to expand our corpus, aiming to improve the relation extraction performance (module 1) for family relations with too few training examples. Through synonym replacement, random insertion, random swap, and random deletion, we augmented the training data to ensure that every relation had at least 200 training examples. However, the automated data augmentation method introduced new noise. We tested a different augmentation ratio (
After extracting GKGs from all obituaries, we assembled them into bigger ones by matching available PII, including name, gender, age, residence, and birth date. Because obituaries usually provide detailed PII for the deceased but not for their family members and relatives, we performed fuzzy matching for the relatives. That is, if the mentions of 2 people in 2 different obituaries were likely to refer to the same person based on 1 or more shared pieces of PII, we assembled the 2 GKGs into 1. In the end, we had 319 GKGs assembled into 149 bigger GKGs after processing all 12,407 downloaded obituaries. Among those 319 obituaries, 22.3% (71/319) had 1 shared PII item, 8.5% (27/319) had 2, and 69.3% (221/319) had more than 2. We manually evaluated the 149 assembled GKGs and confirmed that 71.8% (107/149) were correct, 12.1% (18/149) were wrong, and 16.1% (24/149) were uncertain. We acknowledge that this rule-based matching method is useful mainly for the selected geographic area of the Twin Cities in Minnesota; it might be more error prone when applied to the entire country or to other densely populated areas with high population mobility. We therefore did not include the assembly function in the end-to-end system but kept it as an additional resource for cautious users.
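The fuzzy PII matching used for assembly can be sketched as follows. The field names and the minimum-overlap threshold below are illustrative assumptions, not the paper's exact rules; the key ideas are counting shared PII fields and refusing to merge on conflicting evidence.

```python
# PII fields compared between two person records (illustrative selection).
PII_FIELDS = ("name", "gender", "age", "residence", "birth_date")

def shared_pii(person_a, person_b):
    """Count PII fields that are present in both records and agree."""
    return sum(
        1 for f in PII_FIELDS
        if person_a.get(f) is not None and person_a.get(f) == person_b.get(f)
    )

def should_merge(person_a, person_b, min_shared=2):
    """Merge only when enough PII agrees and no field both records carry conflicts."""
    for f in PII_FIELDS:
        a, b = person_a.get(f), person_b.get(f)
        if a is not None and b is not None and a != b:
            return False  # conflicting evidence blocks the merge
    return shared_pii(person_a, person_b) >= min_shared
```

A missing field (None) is treated as unknown rather than as a conflict, which is what makes the matching "fuzzy" for sparsely described relatives.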
Comparing the F-1 measures of raw corpus and augmented corpus.
An example of an assembled genealogical knowledge graph. We removed last names for privacy protection. The symbol ? means we are not sure which children nodes belong to which parent nodes.
The gold standard family tree constructed from manual curation corresponding to Figure 5.
In this work, we proposed an end-to-end system to construct GKGs from online obituaries, aiming to supplement EHR data for genetic research. This system achieves a microaveraged precision of 94.79%, recall of 91.45%, and F-1 measure of 93.09% after applying data augmentation. The work exploits the wide availability of obituaries on the internet, which are consistent with vital records and census records and are more reliable and comprehensive than dependent information from medical insurance and emergency contacts in EHR systems [
In this work, we use publicly available obituaries. The Association of Internet Researchers, in partnership with their Ethical Working Committee, formulated general principles to guide online research [
As a novel data source, obituaries are informative for constructing family trees. It is hard to obtain such rich genealogical information from other data sources, but there are caveats to their use as genealogical data. First, semantic ambiguity occurs in obituaries as it occurs in many other types of human writing. For example, it is not uncommon to see statements like “...survived by two sons, Marshal and Paul XX and daughter Daisy, and four grandchildren Denny, Gary, Cecil, and Alina.” In this case, it is impossible to tell the exact parents for each of the 4 grandchildren Denny, Gary, Cecil, and Alina. All we know is that their parents are Marshal XX, Paul XX, and Daisy. Additional data sources like birth certificate registries can be helpful in this case.
A second point worth discussing is the slippery slope of genealogy. Compared with medical insurance and emergency contact information [
In addition,
Left: distribution of average numbers of mentioned family members. Right: age and marital status of the deceased person in 12,407 extracted genealogical knowledge graphs.
Technically, the data used in this research are very imbalanced: 14 rare relationships have fewer than 10 instances each. We adopted data augmentation technology to enhance system performance. For example, for the relationships half-sister, grandchild-in-law, and grandson-in-law, the F-1 measures increased from 20.0%, 30.0%, and 35.71% to 66.67%, 50.0%, and 71.43%, respectively. As a next step, we plan to experiment with few-shot (extremely imbalanced) information extraction and meta learning to improve the system [
In our end-to-end solution, the performance of module 2 was clearly inferior to that of the other modules. Besides the error propagation problem (module 2 needs the results from module 1), the task of module 2 is a semantic matching problem, which is still challenging in the NLP community. In addition, we have so far curated an obituary corpus only in English to train the neural network models. To expand to other languages, a new corpus in those languages and new gender inference rules would need to be curated. Cross-language transfer research in the NLP community suggests that neural models trained on an English corpus can help build NLP models in other languages by reducing the required training data and training time. Sometimes such transfers even yield more robust models with better performance [
In our end-to-end solution, module 2 is currently the bottleneck. This module suffered significantly from negative transfer. Generally speaking, when a task or domain is joined with data of no relatedness or similarity, the added data become noise rather than useful information. It remains challenging to quantitatively measure the relatedness or similarity among different tasks or domains [
Besides the performance benefits shown in the Results section, the multitask solution is also faster to train. We used a single V100 GPU in this study. For the traditional pipeline model, one round of 10-fold cross-validation cost about 240 hours in total, whereas the multitask model with all 4 modules together took only 150 hours. For module 1, training took about 70 epochs to reach an F-1 measure of 80% when trained independently; the multitask method reached the same F-1 level in fewer than 5 epochs.
The first limitation of our work is potential data bias. Our data were collected from online obituary websites, and people with intact and/or affluent families tended to publish obituaries. The second limitation is that our system is designed mainly for English obituaries; modules 3 and 4 target 2 English-specific writing patterns.
GKGs have great potential to enhance many medical research fields, especially when combined with EHR data. We believe a high-quality, large-scale genealogical information database will have significant research value. In this work, we presented a new corpus with a predefined family relationship map and augmented training data and proposed a multitask deep neural system to construct and assemble GKGs. With data augmentation, the system achieved microaveraged precision, recall, and F-1 measure of 94.79%, 91.45%, and 93.09%, respectively, and macroaveraged precision, recall, and F-1 measure of 92.59%, 90.05%, and 91.30%, respectively. Based on these promising results, we developed PII-matching rules to assemble large GKGs, demonstrating the potential of linking GKGs to EHRs. The system is capable of generating a large number of GKGs to support related research, such as genetic research, linkage analysis, and disease risk prediction. We share the source code and system with the entire scientific community on GitHub, without the corpus, for privacy protection [
In the future, we will improve the performance of our system further and match GKGs with more medical information, like EHR databases. With the massive obituary data freely available on the internet or other textual data that contain genealogical information, our ultimate goal is to accelerate large-scale disease heritability research and clinical genetics research.
EHR: electronic health records
GKG: genealogical knowledge graph
LSTM: long short-term memory
NER: named entity recognition
NLP: natural language processing
PII: personally identifiable information
RC: relation classification
This work has been supported by grant 2018YFC0910404 from the National Key Research and Development Program of China, grant 61772409 from the National Natural Science Foundation of China, grant 61721002 from the Innovative Research Group of the National Natural Science Foundation of China, and grant IRT_17R86 from the Innovation Research Team of the Ministry of Education, Project of China Knowledge Centre for Engineering Science and Technology.
None declared.