This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Metadata are created to describe the corresponding data in a detailed and unambiguous way and are used for various applications in different research areas, for example, data identification and classification. However, a clear definition of metadata is crucial for further use. Unfortunately, extensive experience with the processing and management of metadata has shown that the term “metadata” and its use are not always unambiguous.
This study aimed to understand the definition of metadata and the challenges resulting from metadata reuse.
A systematic literature search was performed in this study following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for reporting on systematic reviews. Five research questions were identified to streamline the review process, addressing metadata characteristics, metadata standards, use cases, and problems encountered. This review was preceded by a harmonization process to achieve a general understanding of the terms used.
The harmonization process resulted in a clear set of definitions for metadata processing focusing on data integration. The following literature review was conducted by 10 reviewers with different backgrounds and using the harmonized definitions. This study included 81 peer-reviewed papers from the last decade after applying various filtering steps to identify the most relevant papers. The 5 research questions could be answered, resulting in a broad overview of the standards, use cases, problems, and corresponding solutions for the application of metadata in different research areas.
Metadata can be a powerful tool for identifying, describing, and processing information, but its meaningful creation is costly and challenging. This review process uncovered many standards, use cases, problems, and solutions for dealing with metadata. The presented harmonized definitions and the new schema have the potential to improve the classification and generation of metadata by creating a shared understanding of metadata and its context.
Computer-aided medicine is revolutionizing health care and is creating treatment possibilities that are unimaginable without computer assistance: personalized medicine, improved diagnostics by artificial intelligence, and robot-assisted surgery. An immense amount of data fuels this digital revolution, and it is desperately needed for specialized procedures to be developed and optimized. This information is primarily created to document patient care for legal or financial purposes [
The systematic literature review in this study followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for reporting systematic reviews. A harmonization process preceded the review to establish a common understanding of the terms used.
This review was performed by 10 reviewers with expertise in medical informatics and technical and semantic interoperability [
The PRISMA guidelines were applied in the systematic review as a de facto standard [
Q1: How is the term “metadata” defined in different research fields?
Q2: How are the terms “metadata matching” and “metadata mapping” defined?
Q3: Which standards concerning metadata are in use?
Q4: How are metadata created in other research fields?
Q5: What are the current problems regarding the use of metadata, and which solutions are mentioned?
The review and its results were based on extensive literature analysis; therefore, the selected literature was extremely important to the results. In this review, Scopus and Web of Science were used. The selection phase was 2-fold: in the first step, the very general keyword “metadata” was used to obtain a wide variety of publications. The search query was restricted to include only journal papers, conference proceedings, and book chapters from the last 10 years (2010-2019). About 11.6% (2453/21,161) of the resulting papers were randomly selected and then analyzed by title and abstract to identify papers within the scope of the research questions. Publications of uncertain relevance were included at this stage to prevent hasty exclusion. The keywords of suitable papers were used in the second step of the literature search. The papers returned by this second query were again analyzed by title and abstract to match the research questions for inclusion in the full-text analysis.
Each of the 81 papers was reviewed by the first author and 2 randomly assigned reviewers, resulting in 3 independent interpretations per paper. To standardize the review process, a survey form with 8 questions was created: 6 questions corresponding to the research focus and 2 questions to gain additional information about the selected literature. The main questions focused on the metadata definitions (Q1), scoping metadata matching, mapping, and transformation (Q2), used standards (Q3), applied use cases (Q4), encountered problems, and the corresponding solutions (Q5). The additional questions covered the research field from which the paper originated and which type of metadata is described. For the categorization of the metadata types, a classification published by the National Information Standards Organization (NISO) [
descriptive metadata describe a resource for discovery and identification purposes,
structural metadata describe the schema, data models, and reference data, and
administrative metadata provide information about the management of a resource.
To illustrate the classification, consider this example: a book can be described using 3 different types of metadata. Author, title, and preface are examples of descriptive metadata, whereas the arrangement in chapters and the page ordering are structural metadata. The publication date and copyright information are classified as administrative metadata. The review process was open for 8 weeks. The results were gathered and analyzed by the first author and verified by the reviewers to produce a joint agreement on the final results. Both survey forms and the review results can be found in
Ten reviewers participated in the harmonization process. The reviewers categorized 6 metadata processing tasks concerning the use case of metadata-driven data integration as matching, mapping, or transformation. Furthermore, the reviewers assessed to which degree the metadata processing tasks can be automated. The results showed a strong agreement on every task shown in
Reviewers' categorization of the tasks of a metadata-driven data integration process. Red: matching; yellow: mapping; and blue: transformation.
The matching process describes the alignment of given data structures or metadata and creates an alignment proposal between the individual data elements. These matching candidates can be created by domain experts or matching algorithms by using equivalence classes (eg, equivalent, narrower, broader).
In the mapping process, a domain expert uses the proposals of the matching process to define functions or uses external rule sets (eg, Unified Code for Units of Measure) to transform the source data structure into a target data structure. The conversion functions are not necessarily symmetrical.
The transformation process combines metadata and instance data. It uses the conversion rules defined in the mapping process to transform instance data according to the target data structure.
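The three definitions above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a method taken from the reviewed papers: the element names (`weight_lb`, `sex_code`, `weight_kg`, `sex`), the synonym lookup, and the code list are all hypothetical. The pound-to-kilogram factor is the exact definition of the avoirdupois pound.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class MatchProposal:
    """One alignment proposal between a source and a target data element."""
    source: str
    target: str
    relation: str  # equivalence class: "equivalent", "narrower", or "broader"


def match(source_elements: List[str], target_elements: List[str],
          synonyms: Dict[str, str]) -> List[MatchProposal]:
    """Matching: propose alignments between data elements (metadata only).

    The synonym lookup stands in for a domain expert or matching algorithm.
    """
    return [
        MatchProposal(s, synonyms[s], "equivalent")
        for s in source_elements
        if synonyms.get(s) in target_elements
    ]


Mapping = Dict[str, Tuple[str, Callable[[Any], Any]]]


def define_mapping(proposals: List[MatchProposal],
                   conversions: Dict[str, Callable[[Any], Any]]) -> Mapping:
    """Mapping: attach a conversion function to each accepted proposal.

    Elements without an explicit conversion are copied unchanged.
    """
    return {
        p.source: (p.target, conversions.get(p.source, lambda value: value))
        for p in proposals
    }


def transform(record: Dict[str, Any], mapping: Mapping) -> Dict[str, Any]:
    """Transformation: apply the mapping rules to instance data."""
    return {
        target: convert(record[source])
        for source, (target, convert) in mapping.items()
        if source in record
    }


# Hypothetical source and target structures.
proposals = match(
    ["weight_lb", "sex_code"],
    ["weight_kg", "sex"],
    synonyms={"weight_lb": "weight_kg", "sex_code": "sex"},
)
mapping = define_mapping(proposals, {
    "weight_lb": lambda lb: round(lb * 0.45359237, 1),  # pounds to kilograms
    "sex_code": {"1": "male", "2": "female"}.get,        # invented code list
})
result = transform({"weight_lb": 154.0, "sex_code": "2"}, mapping)
print(result)
```

Keeping the three steps as separate functions mirrors the distinction drawn above: matching and mapping operate on metadata only, and instance data enter only in the transformation step.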
The first inquiry with the general keyword “metadata” was performed in mid-December 2019 and resulted in 23,233 papers (21,161 after duplicate removal). Of these, approximately 11.6% (2453/21,161) were randomly selected, and their titles and abstracts were analyzed by the first author. The keywords of the relevant papers extended the search phrase to metadata definition, metadata matching, metadata mapping, and metadata transformation. In the second phase, the literature search was repeated in February 2020 using the extended search phrase, resulting in 681 papers (551 after duplicate removal). The first author analyzed the titles and abstracts for scope, and 81 papers were selected for the full-text analysis (
The process for literature selection in 2 search phases with different keyword sets. Two separate literature inquiries were performed: the first inquiry aimed at identifying suitable keywords for the second literature inquiry, which provided papers for the full-text analysis.
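The counts reported for the 2 search phases can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text. The share of second-phase papers selected for full-text analysis is derived here and is not a figure reported by the authors.

```python
# Counts reported in the text for the two search phases.
phase1_hits = 23_233      # first inquiry, keyword "metadata"
phase1_unique = 21_161    # after duplicate removal
phase1_sampled = 2_453    # randomly selected for title/abstract screening
phase2_hits = 681         # second inquiry, extended search phrase
phase2_unique = 551       # after duplicate removal
full_text = 81            # papers selected for the full-text analysis


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


funnel = {
    "phase 1 duplicates removed": phase1_hits - phase1_unique,
    "phase 1 sampled (%)": share(phase1_sampled, phase1_unique),
    "phase 2 duplicates removed": phase2_hits - phase2_unique,
    "phase 2 selected for full text (%)": share(full_text, phase2_unique),
}
for step, value in funnel.items():
    print(f"{step}: {value}")
```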
Guerra et al [
Small atomic units describing and constraining a specific object (table fields, attributes of form questions, records) [
Describes data type, range, or set of possible values [
Single units can be composed into complex elements [
Single units are often called Data Element following the International Organization for Standardization (ISO) 11179 [
Metadata can have bindings to terminologies, controlled vocabularies, and taxonomies [
Metadata repositories or data dictionaries are used synonymously and store metadata centrally [
Separation of content information from layout information [
Detailed machine-readable and actionable descriptions to enable data processing without human guidance [
The NISO classification task showed that the majority of the papers were classified as structural or descriptive—papers with a pure focus on administrative metadata were a minority in the selected publications, as shown in
The distribution of the publications included in this review. The categories were letter-encoded: A is structural, B is descriptive, and C is administrative, as well as their resulting combination. Structural (40%) and descriptive (39%) papers were clearly in the majority, while administrative (14%) papers were rarely found. Lastly, 6% of the papers could not be clearly classified.
Besides descriptions of metadata representations, some authors stated their understanding of metadata matching and mapping. Ashish et al [
The Fleiss kappa values to evaluate the interrater reliability of the classification task. Values between 0.00-0.20 are classified as slight agreement and values between 0.21-0.40 as fair agreement [
Metadata processing task | Matching | Mapping | Transformation |
Fleiss kappa | 0.13175743 | 0.22358548 | 0.29233227 |
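For readers unfamiliar with the statistic, the following is a minimal sketch of the standard overall Fleiss kappa together with the two agreement ranges quoted in the caption. It is not the authors' analysis code: the text does not state how the per-task values were computed, and the rating counts in `example_counts` are invented for illustration. The function names are likewise assumptions.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Overall Fleiss kappa for a subjects x categories table of rating counts.

    counts[i][j] is the number of raters who assigned subject i to category j.
    Every subject must be rated by the same number of raters (at least 2).
    """
    if not counts:
        raise ValueError("at least one rated subject is required")
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")

    # Observed agreement: mean share of agreeing rater pairs per subject.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


def agreement_label(kappa: float) -> str:
    """Label a kappa value using only the two ranges quoted in the caption.

    Assumption: values between 0.20 and 0.21 are treated as fair.
    """
    if 0.0 <= kappa <= 0.20:
        return "slight"
    if 0.20 < kappa <= 0.40:
        return "fair"
    return "outside the ranges quoted here"


# Invented ratings: 6 tasks, 10 reviewers, categories in the order
# (matching, mapping, transformation).
example_counts = [
    [8, 2, 0],
    [7, 3, 0],
    [3, 7, 0],
    [2, 8, 0],
    [0, 1, 9],
    [1, 0, 9],
]
print(round(fleiss_kappa(example_counts), 3))

# Values reported in the table for the three processing tasks.
reported = {"matching": 0.13175743, "mapping": 0.22358548,
            "transformation": 0.29233227}
labels = {task: agreement_label(value) for task, value in reported.items()}
print(labels)
```

Applied to the reported values, the labels come out as slight agreement for matching and fair agreement for mapping and transformation.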
This review served to obtain insights into the standards and core data sets used. The assessments resulted in 37 relevant standards mentioned and used in the selected publications. The identified standards were grouped afterward into 3 categories following the levels of interoperability [
Structure standards: ISO 11179, ISO 15926, ISO 19101, ISO 19763, ISO 20943, ISO 21526, ISO 23081, openEHR, CDISC ODM, OMOP, IHE DEX, Dublin Core, ASTM CCR, CaDSR, EAD, GILS, VRA, CIMI, CSDGM, ONIX, MARC, TMA DES, EXIF, INSPIRE, SKOS, DCAT, W3C PROV
Technical standards: XML, RDF, OWL, JSON-LD, ClaML
Semantic standards: ICD-10, UMLS, SNOMED CT, LOINC, MedDRA, RxNorm.
Metadata are used for various use cases. The papers included in this review showed that metadata were mainly used for 4 tasks: information retrieval (21 papers), data integration (19 papers), core data set definition (10 papers), and the secondary use of data (7 papers). For information retrieval, metadata, especially semantic annotations, were used to improve query-based machine processing. Owing to a broader range of information descriptions, queries can be matched more accurately and thus return more suitable results. The processes of data integration and core data set definition used metadata to describe and harmonize the underlying schema, which can support the secondary use of (eg, clinical) data. Further encountered use cases were an automatic data quality check [
The reviewed papers addressed several problems regarding the processing and the use of metadata in different research fields and introduced solutions with new approaches to overcome obstacles. On analyzing the papers that described issues, we identified 5 problem categories: (1) structural-related problems, (2) semantics-related problems, (3) human interaction–related problems, (4) metadata lifecycle–related problems, and (5) metadata processing–related problems.
According to our review, the largest group of problems was structural-related issues. The authors of the reviewed papers described a lack of standard usage. They criticized a limited or confusingly extensive selection of suitable standards [
Semantics is a big enabler for (meta)data reuse, and therefore, according to the literature, the lack of semantics was a difficult obstacle to overcome. A general problem related to every standardized data capture was the free-text elements [
The reviewed literature described another possible solution: the use of ontologies [
Collaboration was described as an essential aspect in the reviewed papers from each research field. Sharing and discussing the created information was not only an opportunity to improve the designed data but a necessary step to overcome the hurdles of misinterpretation [
Another vital issue is the divergence of data and the corresponding metadata [
Metadata are often used for data harmonization to reduce labor. However, the process of metadata harmonization was usually performed manually [
The aim of this study was to investigate the anatomy of metadata and point out possible issues by taking a close look at the academic literature of the last decade. It would have been desirable to extend the period to the previous 20 or even 30 years, but the amount of work would not be justifiable. The initial search for the actual review was intentionally broad with the generic key phrase “metadata,” resulting in 21,161 papers using Scopus and Web of Science. To maintain the general selection focus and minimize a self-imposed bias, domain-specific search engines such as PubMed were not used. Our selection criteria aimed for recent metadata papers with an emphasis on describing existing data sets to integrate them meaningfully. Papers dealing exclusively with (instance) data or semantic standards were not included to reduce the immense number of publications for review and concentrate on our core research interest. After several filtering steps, the resulting 81 papers included in the review were mainly from the field of medical informatics. This might be because metadata were very relevant to this area of research, and thus, a considerable amount of work was done in this area.
The distribution of the papers across the metadata categories was unbalanced: there were hardly any papers with an administrative orientation among the selected papers. The challenges of comprehensible data collection and traceability intensified with a substantial increase in digitization, and administrative metadata can be used to support management processes. Intriguingly, this was apparently not strongly represented in the literature. This was somewhat surprising since this information would be indispensable for the documentation of origin and traceability of data records. It appeared that the field of administrative metadata, including provenance information, has been massively underrepresented in the last decade. The use cases found were in line with our daily experiences: metadata were mainly used to improve information retrieval and data integration. Another expected facet was the sheer number of standards (see the comparative analysis of Baek and Sugimoto [
Besides the categorization of the NISO schema, other approaches were encountered. Upon closer inspection, the newly introduced models had a considerable overlap with the schema, except for 2 approaches. Chu et al [
To ensure consistency, a harmonization process preceded our review. It had to be assured that all participating reviewers had the same understanding of the definitions. This harmonization step required additional time and effort but resulted in a joint set of definitions that could be evaluated during the review. To evaluate the differences in reviewers’ understandings of these definitions, the Fleiss kappa was calculated. The results showed that the reviewers agreed on when metadata are used for mapping and transformation, although the process of matching had less agreement between experts. This can be explained by the partial mixing of the 2 definitions of matching and mapping in the analyzed publications, resulting in mixed results by the individual reviewer. The definitions and the differentiation between matching and mapping were congruent with the literature.
In contrast, our understanding of transformation diverged from the analyzed papers. Our definition focused on metadata-driven data integration: the usage of metadata for the transformation of (clinical) instance data, whereas the found term
A further important insight was the dependence on context and perspective during the definition and evaluation of metadata, as Chu et al [
Taking the decisive role of the metadata context into account, we derived a new schema for the classification of components for rich metadata objects adapting the model of Haslhofer and Klas [
The schema definition can specify how metadata models are constructed. Well-known representatives are the norms ISO 11179 [
The building blocks of metadata: schema definition, metadata schema, and markup language are jointly used to instantiate metadata with an additional semantic descriptor to describe a real-world object.
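The building blocks named in the caption can be sketched as a small record type. This is only an illustrative reading of the caption, not the authors' schema: the class name, the field names, and the example values are assumptions. ISO 11179, XML, and LOINC appear in this review, while the metadata schema name and the concrete LOINC code are placeholders chosen for the example.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class MetadataInstance:
    """One metadata object assembled from the building blocks in the caption."""
    schema_definition: str    # how metadata models are constructed
    metadata_schema: str      # the concrete schema the object conforms to
    markup_language: str      # the serialization used
    semantic_descriptor: str  # binding to a terminology
    real_world_object: str    # what the metadata describe

    def is_complete(self) -> bool:
        """True when every building block has a non-empty value."""
        return all(getattr(self, f.name) for f in fields(self))


# Hypothetical example values, for illustration only.
body_weight = MetadataInstance(
    schema_definition="ISO 11179",
    metadata_schema="example-form-schema",    # placeholder name
    markup_language="XML",
    semantic_descriptor="LOINC 29463-7",      # illustrative code for body weight
    real_world_object="body weight of a patient",
)
print(body_weight.is_complete())

# Dropping the semantic descriptor leaves the object incomplete.
without_semantics = replace(body_weight, semantic_descriptor="")
print(without_semantics.is_complete())
```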
This review showed that the term metadata
To our knowledge, there is no comparable systematic review of metadata processing, which includes the analysis of approved solutions from other research fields and applicability to the field of medical informatics. Nevertheless, reviews on metadata have been carried out. Baek and Sugimoto [
Metadata can be a powerful means to identify, describe, and process information, although its meaningful definition is challenging and entails significant hurdles. Differing understandings of the same metadata representations are troublesome and hinder the correct utilization of metadata as well as the corresponding data instance. In this work, 10 experts went through a consultation phase that ended in harmonized definitions for metadata in terms of metadata-driven data integration. This review process discovered many standards, use cases, problems, and solutions in dealing with metadata, providing a broad overview of the topic. This summary has led us to introduce a new schema for the classification of components for enriched metadata objects, which explicitly focuses on the creation context of metadata. These harmonized definitions and the new schema have the potential to improve the classification and creation of metadata by providing a mutual understanding of the metadata and its context.
The first survey form used for the harmonization process before the review.
The second survey form used for the actual review process.
The completed PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) checklist for the review.
ISO: International Organization for Standardization
NISO: National Information Standards Organization
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
We acknowledge financial support by Land Schleswig-Holstein within the funding program Open Access Publikationsfonds. Hannes Ulrich was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) grant IN 50/3-2. Jürgen Stausberg was funded by the German Federal Ministry of Education and Research under contract 01GY1917B. Martin Dugas was funded by the German Research Foundation grant DU 352/11-2.
None declared.