This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Food science has recently been garnering considerable attention. There are many open research questions on the interactions of food, as one of the main environmental factors, with other health-related entities such as diseases, treatments, and drugs. In the last 2 decades, a large amount of work has been done in natural language processing and machine learning to enable biomedical information extraction. However, machine learning in the food science domain remains inadequately resourced, which brings to attention the problem of developing methods for food information extraction. There are only a few food semantic resources and a few rule-based methods for food information extraction, which often depend on external resources. However, an annotated corpus with food entities along with their normalization was published in 2019 by using several food semantic resources.
In this study, we investigated how the recently published bidirectional encoder representations from transformers (BERT) model, which provides state-of-the-art results in information extraction, can be fine-tuned for food information extraction.
We introduce FoodNER, which is a collection of corpus-based food named-entity recognition methods. It consists of 15 different models obtained by fine-tuning 3 pretrained BERT models on 5 groups of semantic resources: food versus nonfood entity, 2 subsets of Hansard food semantic tags, FoodOn semantic tags, and Systematized Nomenclature of Medicine Clinical Terms food semantic tags.
All BERT models provided very promising results, with 93.30% to 94.31% macro F1 scores in the task of distinguishing food versus nonfood entities, which represents the new state of the art in food information extraction. In the tasks where semantic tags are predicted, all BERT models again obtained very promising results, with macro F1 scores ranging from 73.39% to 78.96%.
FoodNER can be used to extract and annotate food entities in 5 different tasks: food versus nonfood entities and distinguishing food entities on the level of food groups by using the closest Hansard semantic tags, the parent Hansard semantic tags, the FoodOn semantic tags, or the Systematized Nomenclature of Medicine Clinical Terms semantic tags.
Food is one of the most important environmental factors that affects human health [
Computer science can greatly contribute to this research topic, especially in the areas of machine learning, natural language processing (NLP), and data analysis. Data collected in studies carry important information that is not easily extracted when the data have been gathered from different sources. The main problem is that these data are presented in different formats: structured, semistructured, and unstructured. Additionally, the data consist of entities from different domains such as food and nutrition, medicine, pharmacy, ecology, and agriculture. The extraction of this information allows the creation of knowledge graphs [
To create a knowledge graph, first, we should have methods that can be used for information extraction, which is the task of automatically extracting structured information from unstructured textual data. In most cases, information extraction is performed by using named-entity recognition (NER) methods (ie, a subtask of information extraction), which deal with automatically detecting and identifying phrases (ie, one or more words [tokens]) from the text that represents the domain entities. Let us assume the following recipe example (
Recipe example.
The phrases in bold (
Several types of NER methods exist depending on their underlying methodology: (1)
In the past 2 decades, a large amount of work has been done to address this problem in the biomedical domain [
In contrast to the biomedical domain, the food domain is relatively inadequately resourced. There are few semantic models (ie, ontologies) [
At the end of 2019, an annotated food corpus known as FoodBase [
Enabled by the availability of several food resources that were published toward the end of 2019, we introduce a fine-tuned BERT model that can be used for food information extraction, called FoodNER. BERT is known to achieve state-of-the-art results in NER tasks [
Food named-entity recognition flowchart. BERT: bidirectional encoder representations from transformers; NER: named-entity recognition; SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms.
The main contributions of this study are as follows:
We fine-tuned different BERT models on different semantic resources from which the food semantic tags are taken. All BERT models yield very promising results, with macro F1 scores of 73.39%-78.96%. All in all, FoodNER represents the new state of the art in food information extraction.
In comparison with the existing rule-based (Food Information Extraction) and corpus-based (BuTTER) food NER methods on the task of distinguishing between food and nonfood entities, FoodNER provides similar results. However, it is more robust than the rule-based approaches since it does not require the continuous availability of additional external resources, which can be a problem for sustainability. Additionally, compared with the corpus-based method BuTTER, it is the first model that can predict food groups instead of just distinguishing between food and nonfood entities.
The source code used for fine-tuning the different FoodNER models is publicly available. All models are also included in FoodViz [
In this study, we used the FoodBase ground truth corpus for building and evaluating FoodNER models for distinguishing food versus nonfood entities as well as for distinguishing food entities concerning the Hansard semantic tags. The BuTTER approach is used as a baseline for comparing the performance of the FoodNER models. The FoodOntoMap extension of the FoodBase ground truth corpus is also used for training and evaluating the FoodNER models concerning the SNOMED CT and FoodOn semantic tags.
The FoodBase data corpus is a recently published corpus with food annotations [
The Hansard corpus [
FoodOn is a farm-to-fork ontology about food, which supports food traceability [
SNOMED CT is the most comprehensive multilingual clinical health care terminology [
FoodOntoMap is a recently published resource that is developed by using the FoodBase corpus [
BERT is a word representation model that achieves state-of-the-art results in many NLP tasks [
In this phase, we did not pretrain a BERT model on our corpus. Instead, we used 3 previously pretrained and publicly available BERT models to fine-tune them for the food NER task. Specifically, the 3 BERT models that were used were the original pretrained BERT model [
BioBERT was trained to improve performance on tasks in the biomedical domain, since this domain contains a large number of domain-specific proper nouns and terms that do not appear in general-domain texts. Different combinations of corpora were experimentally used for pretraining BioBERT. The combinations involved the following corpora: the BookCorpus and the English Wikipedia (the same as for the BERT model), PubMed abstracts with around 4500 million words, and PubMed Central full-text articles with around 13,500 million words. The model pretrained on the combination of the BookCorpus, the English Wikipedia, and PubMed abstracts using the BERT-base cased code provided by Google is known as the BioBERT language representation model (ie, BioBERT standard). The same combination trained using the BERT-large cased code provided by Google is known as BioBERT large.
To perform food NER, we fine-tuned the original BERT and the 2 versions of the BioBERT model. In all the cases, for each class, we used the IOB (inside, outside, and beginning) tagging [
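As a brief illustration of IOB tagging, the sketch below assigns labels to the tokens of a sentence given the positions of annotated phrases; the tokens, phrase positions, and the FOOD tag in the example are illustrative, not drawn from the actual corpus:

```python
def to_iob(tokens, phrases, tag):
    """Assign IOB labels: B-<tag> for the first token of an annotated
    phrase, I-<tag> for its remaining tokens, and O elsewhere."""
    labels = ["O"] * len(tokens)
    for start, length in phrases:  # (start index, phrase length in tokens)
        labels[start] = f"B-{tag}"
        for i in range(start + 1, start + length):
            labels[i] = f"I-{tag}"
    return labels

tokens = ["Mix", "the", "olive", "oil", "with", "lemon", "juice"]
# two annotated food phrases: "olive oil" and "lemon juice"
print(to_iob(tokens, [(2, 2), (5, 2)], "FOOD"))
# → ['O', 'O', 'B-FOOD', 'I-FOOD', 'O', 'B-FOOD', 'I-FOOD']
```

With this encoding, a multiword food phrase is recovered from model predictions by collecting a B- token together with the I- tokens that follow it.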
The fine-tuning was performed for the following tasks:
Food classification: This was performed for distinguishing food versus nonfood entity. In this task, we labeled all food phrases annotated in FoodBase with the tag FOOD and used this data set for training and validation.
Hansard parent: This was performed for distinguishing 48 classes from the Hansard corpus. In this task, we selected parent semantic tags from the Hansard hierarchy that correspond to the food phrases in FoodBase. In cases with multiple different parent tags present for the food phrase, we selected the first occurring parent.
Hansard closest: This was performed for distinguishing 92 classes from the Hansard hierarchy. In this task, for each food phrase in FoodBase, we chose the closest Hansard tag to the food phrase being annotated. The closest tag was selected using the minimum cosine distance between the BERT embedding of the food phrase and the BERT embeddings of the Hansard tag labels.
FoodOn: This was performed for distinguishing 205 classes, where the classes are semantic tags from the FoodOn ontology. For each food phrase in FoodBase, we selected the corresponding FoodOn class based on the FoodOntoMap mappings [
SNOMED CT: This was performed for distinguishing 207 classes, where the classes are semantic tags from the SNOMED CT ontology. In this task, we also used FoodOntoMap [
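The closest-tag selection used in the Hansard closest task can be sketched as follows; the 3-dimensional vectors and tag names here are toy stand-ins for the BERT embeddings of a food phrase and of the Hansard tag labels:

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def closest_tag(phrase_vec, tag_vecs):
    """Return the tag whose label embedding has minimum cosine distance
    to the embedding of the food phrase."""
    return min(tag_vecs, key=lambda tag: cosine_distance(phrase_vec, tag_vecs[tag]))

# toy "embeddings" for two candidate Hansard tags
tags = {"Bread": [1.0, 0.1, 0.0], "Cheese": [0.0, 1.0, 0.2]}
print(closest_tag([0.9, 0.2, 0.1], tags))  # → Bread
```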
In the case of the food versus nonfood entity task and the task of distinguishing food entities with regard to the Hansard semantic tags, we have a ground truth corpus (the curated part of FoodBase). However, in the cases of FoodOn and SNOMED CT, we fine-tuned BERT and BioBERT only on entities that had semantic tags provided by the FoodOntoMap resource (ie, not all food entities are present in these 2 resources, as previously explained). All semantic tags (ie, Hansard parent, Hansard closest, FoodOn, and SNOMED CT) for each food entity available in the FoodBase corpus are presented in the FoodViz tool (see
An example of food entities available from one recipe that are present in the training data set. The entities are annotated using Hansard parent, Hansard closest, FoodOn, Systematized Nomenclature of Medicine Clinical Terms, and OntoFood (not studied in this paper) semantic tags.
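One annotated food phrase in the training data thus carries a label for each of the 5 tasks. The sketch below shows the shape of such a record; the tag values are invented for illustration and are not taken from the actual corpus:

```python
# A hypothetical training record: a single food phrase from a recipe with
# the label used by each of the 5 FoodNER tasks (values are illustrative).
record = {
    "phrase": "olive oil",
    "food_classification": "FOOD",        # food vs nonfood task
    "hansard_parent": "Food and drink",   # parent Hansard semantic tag
    "hansard_closest": "Oils/fats",       # closest Hansard semantic tag
    "foodon": "plant oil food product",   # FoodOn semantic tag
    "snomed_ct": "Olive oil (substance)", # SNOMED CT semantic tag
}
print(sorted(record))
```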
To compare the results, the bidirectional long short-term memory (BiLSTM) model for sequence tagging with a CRF layer (BiLSTM-CRF) [
In this section, the experimental setups for fine-tuning the BERT and BioBERT models in each classification task are explained, followed by the experimental results obtained by the evaluation. We performed 2 experiments: (1) comparison of the BERT models with the corpus-based BuTTER models presented in a previous study [
The experiments were performed using the Colab platform [
Training and validation loss per fine-tuning epoch for the BioBERT (bidirectional encoder representations from transformers for biomedical text mining) large model on the Hansard parent data set.
For the BiLSTM-CRF model architecture of the BuTTER models, we used the default parameters presented in the study of Comeau et al [
The maximum sequence length (ie, sentence length) is 50 since the longest sentence in the data set consists of 45 tokens.
The batch size is 256.
Architecture: input layer with 50 units, embedding layer with 300 units, BiLSTM layer with 50 units per direction (100 output units in total), dense (TimeDistributed) layer with 50 units, and a CRF output layer whose output dimension is the number of classes + 1 (ie, one extra for padding).
The aforementioned architecture refers to the complete architecture of the BuTTER BiLSTM-CRF model, that is, the model without character embeddings. The BuTTER Char-BiLSTM-CRF model contains an additional stack of input and embedding layers for generating the character embeddings and a concatenation layer for concatenating the word embeddings with the character embeddings. The additional input layer contains 18 units, while the additional embedding layer contains 20 units. Each of the BuTTER models was trained until the improvement in validation loss over 5 consecutive epochs did not surpass 5×10⁻³, up to a maximum of 1000 epochs, whichever came first.
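The stopping criterion described above can be sketched as a simple loop over validation losses; the loss sequence below is a placeholder standing in for the per-epoch validation losses produced during training:

```python
def train_with_early_stopping(val_losses, patience=5, min_delta=5e-3, max_epochs=1000):
    """Return the number of epochs run before stopping: training halts once
    the best validation loss has not improved by more than min_delta for
    `patience` consecutive epochs, or at max_epochs, whichever comes first."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if best - loss > min_delta:  # meaningful improvement
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)

# loss improves for 3 epochs, then plateaus → stops after 5 stale epochs
losses = [0.9, 0.7, 0.5] + [0.499] * 10
print(train_with_early_stopping(losses))  # → 8
```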
The data sets used for training and testing are from the curated version of FoodBase [
Data set statistics.
Annotations | Food classification | Hansard parent | Hansard closest | FoodOn | SNOMED CTa |
Annotated tokens (beginning and inside) | 17,937 | 11,759 | 17,864 | 8730 | 8151 |
Outside tokens | 95,416 | 95,416 | 88,956 | 98,445 | 99,024 |
Number of different inside, outside, and beginning tags | 3 | 63 | 163 | 342 | 318 |
Number of food phrase classes | 1 | 34 | 91 | 197 | 196 |
Total number of tokens | 107,175 | 107,176 | 106,820 | 107,175 | 107,175 |
aSNOMED CT: Systematized Nomenclature of Medicine Clinical Terms.
The evaluation of the proposed models was done using stratified five-fold cross-validation. Stratified sampling was used to generate the folds since the FoodBase corpus consists of 5 different categories of recipes. For each recipe category, 10% of the training set of each fold was sequentially taken out and used for validation.
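A minimal sketch of stratified fold assignment is given below; the category labels are invented stand-ins for the 5 FoodBase recipe categories:

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign each sample to one of k folds so that every class (here,
    a recipe category) is spread evenly across the folds."""
    folds = [None] * len(labels)
    per_class = defaultdict(int)
    for i, label in enumerate(labels):
        folds[i] = per_class[label] % k  # round-robin within each class
        per_class[label] += 1
    return folds

# toy labels standing in for two recipe categories
labels = ["appetizer"] * 10 + ["dessert"] * 10
folds = stratified_folds(labels, k=5)
# each fold receives 2 appetizers and 2 desserts
print([folds.count(f) for f in range(5)])  # → [4, 4, 4, 4, 4]
```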
Next, the results for both experiments are presented, starting with the comparison of the BERT models with the BuTTER models on the food versus nonfood task, followed by the BERT models trained for distinguishing between different food semantic tags. We present the results for the macro F1 score. The macro averaging scheme computes each metric for each class independently and then calculates the mean. The rationale behind using macro averaging is that it conveys more meaningful information, especially for a task where more than two semantic tags should be predicted on heavily unbalanced data. Conversely, simple micro averaging provides insufficient information in tasks where more than two semantic tags (ie, classes) are used, as it conflates the true positives, false positives, true negatives, and false negatives of all classes into one confusion matrix and then computes the evaluation metrics. Similarly, weighted averaging is biased in favor of the most represented class in the data, as the weight used while computing the average depends on the relative frequency of the class label in the data set.
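The difference between the averaging schemes can be made concrete with a small sketch of macro F1: each class is scored independently, so a rare class that is always missed pulls the average down as much as a frequent one would. The toy labels below are illustrative:

```python
def f1_per_class(y_true, y_pred, cls):
    """F1 score for one class, treating it as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_f1(y_true, y_pred):
    """Compute F1 for each class independently, then average: rare classes
    weigh as much as frequent ones."""
    classes = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)

# unbalanced toy data: the rare class "B" is always missed
y_true = ["A"] * 9 + ["B"]
y_pred = ["A"] * 10
print(round(macro_f1(y_true, y_pred), 3))  # → 0.474
```

Micro averaging over the same toy data would score 90%, hiding the total failure on class "B"; macro averaging exposes it.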
Macro F1 scores for all considered models for the food versus nonfood entity task. Each macro F1 score is obtained by using stratified k-fold cross-validation (k=5). Underlined values are best per subtable, while the bold value is the best from the whole table. BERT: bidirectional encoder representations from transformers; BiLSTM-CRF: bidirectional long short-term memory conditional random field; BuTTER: bidirectional long short-term memory for food named-entity recognition; NER: named-entity recognition.
Boxplots of macro F1 scores obtained by using stratified five-fold cross-validation for all considered models for the binary food classification task. BERT: bidirectional encoder representations from transformers; BiLSTM-CRF: bidirectional long short-term memory conditional random field.
In this experiment, we present the results of fine-tuning the BERT, BioBERT large, and BioBERT standard models on the tasks of distinguishing food entities concerning different semantic models (ie, FoodOn, Hansard closest, Hansard parent, and SNOMED CT). We decided to focus only on the BERT models since BERT provides state-of-the-art results in almost all NLP NER tasks. Additionally, in
Keeping in mind the number of classes predicted in each task, we can conclude that these are very promising results. Additionally, the FoodNER models trained on the tasks of distinguishing food entities concerning semantic tags on the level of food groups are the first corpus-based NER models that can distinguish between different food semantic tags (ie, food groups). Once more, we should emphasize that in the cases of FoodOn and SNOMED CT, the BERT and BioBERT models are tuned only on the entities that have semantic tags provided by the FoodOntoMap resource, in which not all food entities from the semantic resources are present.
Macro F1 scores for the 3 food named-entity recognition models for the tasks concerning different semantic models.
Model, semantic model | Epochsa | Macro F1 score (%)
BERTb | |
FoodOn | 100 | 78.13
Hansard closest | 85 | 75.87
Hansard parent | 100 | 75.04
SNOMED CTc | 91 | 76.01
BioBERT large | |
FoodOn | 93 | 75.58
Hansard closest | 100 | 78.96
Hansard parent | 100 | 76.26
SNOMED CT | 95 | 74.51
BioBERT standard | |
FoodOn | 100 | 74.81
Hansard closest | 100 | 74.18
Hansard parent | 89 | 74.94
SNOMED CT | 89 | 73.39
aThis provides information on the number of epochs needed to fine-tune the model.
bBERT: bidirectional encoder representations from transformers.
cSNOMED CT: Systematized Nomenclature of Medicine Clinical Terms.
The models are trained on FoodBase [
Food named-entity recognition integration in FoodViz.
We present a corpus-based NER method for food information extraction, known as FoodNER. It was developed by fine-tuning 3 previously published pretrained BERT language representation models (ie, the original BERT and 2 BioBERTs: standard and large). FoodNER can be used to extract and annotate food entities in 5 different tasks: distinguishing between food and nonfood entities, and distinguishing food entities on the level of food groups by using the closest Hansard semantic tags, the parent Hansard semantic tags, the FoodOn semantic tags, or the SNOMED CT semantic tags. All in all, the models provide very promising results, achieving 93.30%-94.31% macro F1 scores in the food versus nonfood entity task and 73.39%-78.96% macro F1 scores in the tasks where more semantic tags are recognized. Additionally, the models are included in the FoodViz framework, which allows users to select which FoodNER model they want to use for annotating their texts with food entities and provides a visualization of the annotated data with an opportunity to correct false positive and false negative annotations. Having a robust, state-of-the-art food information extraction method such as FoodNER will allow further research into food-drug and food-disease interactions, thereby providing an opportunity to start building a food knowledge graph, including relations with health-related entities.
bidirectional encoder representations from transformers
bidirectional long short-term memory conditional random field
bidirectional long short-term memory for food named-entity recognition
inside, outside, and beginning
National Center for Biomedical Ontology
named-entity recognition
natural language processing
Systematized Nomenclature of Medicine Clinical Terms
This research was supported by the Slovenian Research Agency (research core grant P2-0098 and grant PR-10465), the European Union’s Horizon 2020 research and innovation program (FNS-Cloud, Food Nutrition Security) (grant agreement 863059), and the Ad Futura grant for postgraduate study. The information and the views set out in this publication are those of the authors and do not necessarily reflect the official opinion of the European Union. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use that may be made of the information contained herein.
None declared.