This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Data provenance refers to the origin, processing, and movement of data. Reliable and precise knowledge about data provenance has great potential to improve reproducibility as well as quality in biomedical research and, therefore, to foster good scientific practice. However, despite the increasing interest in data provenance technologies in the literature and their implementation in other disciplines, these technologies have not yet been widely adopted in biomedical research.
The aim of this scoping review was to provide a structured overview of the body of knowledge on provenance methods in biomedical research by systematizing articles covering data provenance technologies developed for or used in this application area; describing and comparing the functionalities as well as the design of the provenance technologies used; and identifying gaps in the literature, which could provide opportunities for future research on technologies that could receive more widespread adoption.
Following a methodological framework for scoping studies and the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, articles were identified by searching the PubMed, IEEE Xplore, and Web of Science databases and subsequently screened for eligibility. We included original articles covering software-based provenance management for scientific research published between 2010 and 2021. A set of data items was defined along the following five axes: publication metadata, application scope, provenance aspects covered, data representation, and functionalities. The data items were extracted from the articles, stored in a charting spreadsheet, and summarized in tables and figures.
We identified 44 original articles published between 2010 and 2021. We found that the solutions described were heterogeneous along all axes. We also identified relationships among motivations for the use of provenance information, feature sets (capture, storage, retrieval, visualization, and analysis), and implementation details such as the data models and technologies used. An important gap we identified is that only a few publications address the analysis of provenance data or use established provenance standards, such as PROV.
The heterogeneity of provenance methods, models, and implementations found in the literature points to the lack of a unified understanding of provenance concepts for biomedical data. Providing a common framework, a biomedical reference, and benchmarking data sets could foster the development of more comprehensive provenance solutions.
The replication crisis has exposed a lack of reproducible results in many scientific studies, including those in the biomedical domain [
Although the definitions of data provenance information vary in some aspects, it is generally understood as metadata, which describe all events that influenced a data set. A data set can be altered by some processes, resulting in a changed state. We consider a data set with a changed state to be a new data set. Data provenance tracks information about its conception (eg, who or what created the data) and all transformations and processing operations that may have been applied [
In the biomedical context, data are collected in many forms and types as well as for different purposes, including health care and research. Usually, such data include information about treatments, conditions, and outcomes of a patient, which are often described by measurements or more abstract observations. The origin of such observations and the context in which they have been collected can differ, which can have consequences for their meaning and reliability. For example, observations can be manually captured by a person (eg, a health care professional measuring the heart rate of a patient) or automatically captured by a device (eg, a digital pulse oximeter already placed on the patient’s finger), influencing their precision. Another example would be deriving structured research data from clinical documents, which can be a manual process involving curation or an automated process performed by machines, which impacts reliability. Considering the previously mentioned processing of such data and errors or inaccuracies potentially introduced along the way, the assessment of data provenance metadata (eg, by visualization or analysis) can help clinicians or researchers understand the quality of information and informaticians find the root causes in case of problems.
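The chain of states described above, with each transformation recorded together with its responsible agent, can be sketched as linked metadata records. The names below (ProvenanceRecord, the example agents) are our own illustration, not a system from the reviewed literature:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class ProvenanceRecord:
    """Metadata describing one event that turned a data set into a new state."""
    dataset_id: str              # the data set in its resulting state
    derived_from: Optional[str]  # predecessor state; None for original capture
    activity: str                # what was done to the data
    agent: str                   # who or what did it (person, device, software)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# A manually captured observation and a curated data set derived from it:
raw = ProvenanceRecord("obs-001-v1", None, "manual capture", "person:nurse")
curated = ProvenanceRecord("obs-001-v2", "obs-001-v1", "curation", "software:etl")

def lineage(record, records):
    """Walk the derived_from chain back to the original capture event."""
    by_id = {r.dataset_id: r for r in records}
    chain = [record.dataset_id]
    while record.derived_from is not None:
        record = by_id[record.derived_from]
        chain.append(record.dataset_id)
    return chain

print(lineage(curated, [raw, curated]))  # ['obs-001-v2', 'obs-001-v1']
```

Distinguishing the agent kind (person, device, software) is what lets a later analysis reason about precision and reliability as discussed above.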
In this graph, the input data nodes represent data on
Data provenance can be captured prospectively and retrospectively, relative to when data processing occurs [
A simple example provenance graph, where observations are mapped to encounters to be loaded into a data warehouse.
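The structure of such a graph can be reconstructed in a few lines of code; the node names below are our paraphrase of the described example, not the figure's exact labels:

```python
# Edges run from input data sets to processing steps and from steps to outputs.
edges = [
    ("observations", "map to encounters"),
    ("encounters", "map to encounters"),
    ("map to encounters", "mapped observations"),
    ("mapped observations", "load"),
    ("load", "data warehouse"),
]

def ancestors(node, edges):
    """All upstream nodes that influenced `node`."""
    direct = {src for src, dst in edges if dst == node}
    result = set(direct)
    for parent in direct:
        result |= ancestors(parent, edges)
    return result

print(sorted(ancestors("data warehouse", edges)))
```

Traversing the edges backward from the warehouse recovers every contributing data set and processing step, which is exactly the kind of root-cause question provenance graphs are meant to answer.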
Although data provenance tracking is a common practice in some disciplines, such as physics, geoscience, geography (particularly in geographic information systems), material science, hydrologic science, and environmental modeling [
In this paper, we present a scoping review to (1) provide a detailed overview of research describing data provenance technologies (eg, for imaging data, health records, and omics data) developed for or used in biomedical research; (2) describe and compare the supported functionalities (eg, creating, storing, querying, analyzing, or visualizing data provenance information) as well as the design of the methods (eg, use of standards or types of data storage); and (3) use this information to identify gaps in the literature (eg, combinations of functionalities that are rarely supported), which could provide opportunities for future research on technologies that could receive more widespread adoption.
This systematic scoping review was performed in conformance with the methodological framework developed by Arksey and O’Malley [
Before defining the inclusion criteria, we conducted an unstructured literature search on data provenance and found that the body of literature included many studies from fields that were not within the scope of this review. On this basis, we set up an initial version of criteria to discriminate articles about the use of provenance methods in biomedical research from articles about the use of provenance methods in other capacities or disciplines, such as supply chains for pharmaceutical products or animal taxonomies. The description of the criteria was refined after a preliminary sample screening to mitigate the differences in interpretation among the authors.
We included articles that (1) described the use of data provenance, data lineage, or data pedigree information in biomedical research or a related scientific discipline and (2) described a software-based method (ie, articles focusing on purely manual provenance tracking were not eligible). Moreover, articles needed to be (3) original papers published in peer-reviewed journals or conference proceedings, (4) written in English, and (5) published between 2010 and 2021.
The exclusion criteria were formulated analogously. We excluded articles that (1) did not cover data provenance and instead focused on provenance in other contexts (eg, history, geology, or logistics); (2) did not focus on digital technologies, data, software, methods, or models for data provenance; (3) did not focus on biomedical or health-related research or data (eg, if the biomedical domain was only mentioned as one exemplary application area among many); and (4) did not describe the provenance of data and instead used provenance data (eg, for the tracking of products in supply chains).
Near-synonyms of "provenance," such as "pedigree" and "lineage," exist and hence had to be included as search terms. Furthermore, as described in the previous section, we needed to exclude articles outside the scope or context of biomedicine. For this purpose, we included the keywords "biomedical," "medical," and "health."
We searched the Web of Science, PubMed, and IEEE Xplore databases, as the topic is at the intersection of medicine and computer science. The search strings required article titles or abstracts to contain at least 1 keyword from each of the following two topics, which reflect the scope of the review:
The topic “Provenance” was captured by the terms (“data provenance” OR “data lineage” OR “data pedigree”)
The topic “Biomedicine” was captured by the terms (“medical” OR “biomedical” OR “health”)
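Combining the two topic blocks yields one boolean search string of the following shape (the exact database-specific syntax is given in the appendix):

```python
# Each topic is an OR-group; the two groups are joined with AND.
provenance_terms = ['"data provenance"', '"data lineage"', '"data pedigree"']
biomedicine_terms = ['"medical"', '"biomedical"', '"health"']

query = "({}) AND ({})".format(
    " OR ".join(provenance_terms), " OR ".join(biomedicine_terms))
print(query)
```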
The exact search strings used for the different databases are available in
The selection process was performed using two consecutive screening steps: (1) screening of the titles and abstracts of all the resulting papers and (2) screening of the full texts of all the papers that were selected in the first step. Each article was screened by the first author and one coauthor. Disagreements were resolved by the last author. The reasons for excluding articles were also recorded and are provided in
We defined data items along five axes to generate insights into our research questions (RQs): (1) publication metadata, (2) application scope, (3) provenance aspects covered, (4) data representation, and (5) functionalities. An overview of the categories, individual items, and value sets is provided in
As can be seen, we collected publication
Data items for full-text charting.
Name | Description
Publication metadata
Year of publication | The year when the publication was published
Author location | Countries in which the institutions of the first author and last author are located
Application scope
Application area | Whether the contribution can be applied in biomedical research or directly in health care practice
Focus | Whether addressing data provenance was the primary focus of the publication or whether provenance was only mentioned indirectly or as a complement to an inherent necessity
Motivation | The motivation behind the use of data provenance
Types of data | The types of data for which provenance information was managed (options are structured clinical and health data, omics data, imaging data, sensor or device data, free text, and other types of data) or whether the contribution was data type agnostic (ie, generic data)
Provenance aspects covered
Where provenance | The contribution addresses the aspect of where the data originated from
How provenance | The contribution addresses the aspect of how a specific result was produced (ie, the preceding processing steps)
Who provenance | The contribution addresses the aspect of who (or which entity, such as an organization, software, or device) was responsible for or claimed ownership of the data or data processing
Why provenance | The contribution addresses the aspect of why a certain result or data point was produced, which requires capturing all preceding processing steps and data sources
Data representation
Abstract data model | The abstract data model used to represent provenance information; examples are graphs, lists, references, and composite objects
Concrete data model | The concrete data model used to store provenance information; examples are blockchains, named graphs, relational models, and file-based storage
Standard data model | Whether the data model was compatible with common provenance standards, such as PROV or the OPMa
Immutability | Whether the provenance information was immutable
Materialization | Whether the provenance information was virtual or materialized, that is, whether intermediate processing results were explicitly stored as complete data sets
Functionalities
Creation and capture | How, or by what type of entity, the data provenance information was captured; we distinguished between additional capture through stand-alone software, integrated capture through middleware- or trigger-based approaches, inherent capture through blockchains, and extraction from external sources
Querying and retrieval | How the provenance information was queried or retrieved; options are retrieval via an APIb or GUIc, a structured query, a selective query, or an unstructured search query
Analysis | Categorization of how the provenance information was analyzed, which helped identify contributions with similar feature sets; the categories are "generic" or use case agnostic (eg, descriptive statistics) and "specific" or use case dependent (eg, reasoning or error tracing)
Visualization | The type or method of visualization used for provenance information; details include whether the visualization was based on a graph or flow network, to examine patterns rooted in provenance's native structure, and whether specific tools were used
Time of generation | The time of metadata generation; we distinguished between prospective generation, when the metadata are generated during data processing, and retrospective generation, when the data processing was done in the past and the metadata are generated from previously created artifacts, such as log files
aOPM: Open Provenance Model.
bAPI: application programming interface.
cGUI: graphical user interface.
A total of 138 articles were identified through the database searches (45, 32.6% from PubMed; 40, 29% from IEEE Xplore; and 53, 38.4% from Web of Science). An overview of the selection process is shown in
From the 138 articles, we excluded 42 (30.4%) duplicates and 36 (26.1%) articles in the first screening process. Of the 60 eligible full-text articles, 3 (5%) could not be retrieved. Of the remaining 57 articles, 13 (23%) were excluded in the second screening process. Finally, 44 articles were included in the review and processed in the data charting step (refer to
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flowchart for the selection process (based on the study by Page et al [
List of items found eligible (n=44).
Serial number | Year | Title | Reference |
1 | 2021 | Smart Decentralization of Personal Health Records with Physician Apps and Helper Agents on Blockchain: Platform Design and Implementation Study | [ |
2 | 2021 | Blockchain for Healthcare Data management: Opportunities, Challenges, and Future recommendations | [ |
3 | 2021 | Adjusting For Selection Bias Due to Missing Data in Electronic Health Records-Based Research | [ |
4 | 2021 | Risk and Compliance in IoT- Health Data Propagation: A Security-Aware Provenance based Approach | [ |
5 | 2021 | Blockchain-Enabled Telehealth Services Using Smart Contracts | [ |
6 | 2021 | Trellis for Efficient Data and Task Management in the VA Million Veteran Program | [ |
7 | 2020 | A Practical Universal Consortium Blockchain Paradigm for Patient Data Portability on the Cloud Utilizing Delegated Identity Management | [ |
8 | 2020 | Blockchain-Enabled Clinical Study Consent Management | [ |
9 | 2020 | Decentralised Provenance for Healthcare Data | [ |
10 | 2020 | Enhancing Traceability in Clinical Research Data Through a Metadata Framework | [ |
11 | 2020 | Secure and Provenance Enhanced Internet of Health Things Framework: A Blockchain Managed Federated Learning Approach | [ |
12 | 2019 | BEERE: A Web Server for Biomedical Entity Expansion, Ranking and Explorations | [ |
13 | 2019 | Clinical Text Mining on FHIR | [ |
14 | 2019 | Enhanced Security Framework for E-Health Systems using Blockchain | [ |
15 | 2019 | NeuroProv: Provenance Data Visualization for Neuroimaging Analyses | [ |
16 | 2019 | Polymorph Segmentation Representation for Medical Image Computing | [ |
17 | 2019 | Provenance for Biomedical Ontologies With RDF and Git | [ |
18 | 2019 | Research on Personal Health Data Provenance and Right Confirmation With Smart Contract | [ |
19 | 2019 | The Generalized Data Model for clinical research | [ |
20 | 2018 | Application of Data Provenance in Healthcare Analytics Software: Information Visualisation of User Activities | [ |
21 | 2018 | Applying Blockchain Technology for Health Information Exchange and Persistent Monitoring for Clinical Trials | [ |
22 | 2018 | BASTet: Shareable and Reproducible Analysis and Visualization of Mass Spectrometry Imaging Data via OpenMSl | [ |
23 | 2018 | FHIR Healthcare Directories: Adopting Shared Interfaces to Achieve Interoperable Medical Device Data Integration | [ |
24 | 2018 | ProvCaRe Semantic Provenance Knowledgebase: Evaluating Scientific Reproducibility of Research Studies | [ |
25 | 2018 | Visualizing the Provenance of Personal Data Using Comics | [ |
26 | 2017 | A Method of Electronic Health Data Quality Assessment: Enabling Data Provenance | [ |
27 | 2017 | MediSyn: Uncertainty-Aware Visualization of Multiple Biomedical Datasets to Support Drug Treatment Selection | [ |
28 | 2017 | MeDShare: Trust-Less Medical Data Sharing Among Cloud Service Providers via Blockchain | [ |
29 | 2017 | Templates as a Method for Implementing Data provenance in Decision Support Systems | [ |
30 | 2016 | Access Control Management With Provenance in Healthcare Environments | [ |
31 | 2016 | Addressing Provenance Issues in Big Data Genome Wide Association Studies (GWAS) | [ |
32 | 2016 | AVOCADO: Visualization of Workflow-Derived Data Provenance for Reproducible Biomedical Research | [ |
33 | 2016 | Design of the MCAW Compute Service for Food Safety Bioinformatics | [ |
34 | 2016 | TCGA Expedition: A Data Acquisition and Management System for TCGA Data | [ |
35 | 2015 | A Platform for Leveraging Next Generation Sequencing for Routine Microbiology and Public Health Use | [ |
36 | 2015 | Modeling Evidence-Based Medicine Applications With Provenance Data in Pathways | [ |
37 | 2014 | Exploring Large Scale Receptor-Ligand Pairs in Molecular Docking Workflows in HPC Clouds | [ |
38 | 2014 | Securing First-Hop Data Provenance for Bodyworn Devices Using Wireless Link Fingerprints | [ |
39 | 2013 | Fuzzy Reasoning of Accident Provenance in Pervasive Healthcare Monitoring Systems | [ |
40 | 2013 | Provenance Framework for mHealth | [ |
41 | 2013 | Towards Structured Sharing of Raw and Derived Neuroimaging Data Across Existing Resources | [ |
42 | 2012 | Improving Integrative Searching of Systems Chemical Biology Data Using Semantic Annotation | [ |
43 | 2012 | XCEDE: An Extensible Schema for Biomedical Data | [ |
44 | 2011 | A Provenance Approach to Trace Scientific Experiments on a Grid Infrastructure | [ |
The year of publication of the articles ranged from 2011 to 2021. Approximately two-thirds (29/44, 66%) of the articles were published from 2017 to 2021, and one-third (15/44, 34%) of the articles were published before this time frame, that is, from 2011 to 2016, pointing toward an increasing trend (
Number of publications per year.
Most first and senior authors worked at institutions located in the United States (34/90, 38%), followed by China (8/90, 9%), Germany (8/90, 9%), the United Kingdom (6/90, 7%), Australia (6/90, 7%), Canada (4/90, 4%), and the United Arab Emirates (4/90, 4%). We note that countries with fewer than 4 occurrences were pooled as “others” (20/90, 22%) and that some authors were affiliated with multiple organizations. The results are roughly comparable with the top entries in the SCImago Country Ranking [
Most of the papers analyzed (34/44, 77%) focused on provenance for research data processing only, whereas some (8/44, 18%) focused on the application of provenance in both research and health care, and only 5% (2/44) specifically focused on the application of provenance in the health care practice context by presenting a backward reasoning algorithm for a monitoring system [
In approximately half of the publications (23/44, 52%), data provenance was the primary research subject, whereas the other half (21/44, 48%) addressed data provenance indirectly or as an inherent property of a broader method or solution described.
The motivations behind the need for provenance data were categorized into “validity,” “reproducibility,” “regulatory requirements,” “reusability,” and “transparency,” and each publication was assigned to the category matching the described motivation.
The most frequent reason for addressing provenance was validity (22/44, 50%), followed by reproducibility (15/44, 34%) and the need to comply with regulatory requirements (15/44, 34%), reusability (11/44, 25%), and then transparency (8/44, 18%). Some papers did not provide details on why provenance was considered (3/44, 7%). In the
The most frequently mentioned (multiple mentions possible) supported data type was
The co-occurrences of the data types focused on and the motivation presented are illustrated in
What stands out is that papers that addressed provenance for omics and imaging data were often motivated by reproducibility aspects. This makes sense, as both types of data are rather large and complex in nature, and processing operations, for example, bioinformatics pipelines or artificial intelligence–based image analyses, are known to sometimes be difficult to reproduce [
Percentage of papers addressing a certain data type and mentioning a certain motivation.
Regarding the provenance aspects supported by the methods or solutions described, we identified the coverage as provided in the following sections.
All the papers (44/44, 100%) supported
The following abstract data models used to represent provenance information were identified:
The abstract data models described were implemented using the following concrete data models and associated storage solutions:
When cross-referencing the motivation categories versus whether the contribution was blockchain based or used some other technology (
Frequency table for motivation groups and whether or not the solution is blockchain-based.
A total of 23% (10/44) of papers stated compatibility with the PROV data model, whereas 7% (3/44) claimed compatibility with OPM. Most publications (31/44, 70%) did not state compatibility with either standard. Among all the papers that stated compatibility with either standard, all those published since 2018 (7/44, 16%) preferred the PROV model. No paper mentioned compatibility with both standards.
Data that cannot be altered once created or captured are considered immutable. The methods and solutions presented in 27% (12/44) of publications provided immutability or nonrepudiability; of these, 92% (11/12) were based on blockchain technology, which is inherently immutable. One paper stated nonrepudiable provenance based on cryptographic methods [
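The principle behind such cryptographically backed nonrepudiation can be illustrated with a hash chain: linking each provenance entry to the hash of its predecessor makes any retroactive edit detectable. This is a generic sketch, not the cited paper's method:

```python
import hashlib
import json

def append_entry(chain, entry):
    """Append a provenance entry linked to the hash of its predecessor."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps({"entry": entry, "prev": prev_hash}, sort_keys=True)
    chain.append({"entry": entry, "prev": prev_hash,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify(chain):
    """Recompute every hash; any retroactive edit breaks the chain."""
    prev_hash = "0" * 64
    for block in chain:
        payload = json.dumps({"entry": block["entry"], "prev": prev_hash},
                             sort_keys=True)
        if (block["prev"] != prev_hash
                or block["hash"] != hashlib.sha256(payload.encode()).hexdigest()):
            return False
        prev_hash = block["hash"]
    return True

log = []
append_entry(log, {"activity": "capture", "agent": "device:oximeter"})
append_entry(log, {"activity": "cleaning", "agent": "software:qc"})
assert verify(log)
log[0]["entry"]["agent"] = "someone-else"  # tampering with history...
assert not verify(log)                     # ...is detected
```

Blockchains add distributed consensus on top of essentially this structure, which is why they are inherently immutable.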
We further analyzed whether the methods or solutions described store intermediate results as complete data sets, that is,
The technical activities supported by data provenance methods, models, and implementations are the creation or capture, storage, retrieval or query, analysis, and visualization of data provenance information, which are common activities in the data life cycle. When looking at the support provided for these activities by the methods and solutions analyzed, there was a clear decrease in support for tasks performed later in the data life cycle, as illustrated in
Several publications (39/44, 89%) described methods supporting multiple activities in the data life cycle. The frequency of support for individual steps was in ascending order:
Steps of the data life cycle supported by the methods and solutions analyzed.
Among the papers that described methods or solutions supporting the creation or capture of provenance information (39/44, 89%), the largest share (16/39, 41%) captured provenance information and metadata by changing a larger program, framework, or script used for data generation or processing to
Among the papers that described methods or solutions supporting the querying or retrieval of provenance information (24/44, 55%), 25% (11/44) relied on structured queries, using SQL, SPARQL, GraphQL, or similar query languages. A total of 42% (10/24) of solutions provided a graphical user interface or an application programming interface to retrieve the provenance metadata. Overall, 4% (1/24) of articles stated retrieval using unstructured queries, that is, a search string, and another (1/24, 4%) described a method using a selective query based on unique identifiers. A total of 13% (3/24) of papers did not specify the method of retrieval.
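As an impression of structured lineage retrieval, a derivation table can be queried recursively; the sketch below uses SQLite's recursive common table expressions (our own minimal example, with hypothetical file names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE derived_from (child TEXT, parent TEXT)")
con.executemany("INSERT INTO derived_from VALUES (?, ?)", [
    ("report.pdf", "stats.csv"),
    ("stats.csv", "cleaned.csv"),
    ("cleaned.csv", "raw_export.csv"),
])

# Recursively collect every upstream artifact of a given result.
rows = con.execute("""
    WITH RECURSIVE lineage(node) AS (
        SELECT parent FROM derived_from WHERE child = ?
        UNION
        SELECT d.parent FROM derived_from d JOIN lineage l ON d.child = l.node
    )
    SELECT node FROM lineage
""", ("report.pdf",)).fetchall()

print(sorted(r[0] for r in rows))  # ['cleaned.csv', 'raw_export.csv', 'stats.csv']
```

SPARQL property paths or graph database traversals express the same "all ancestors" question natively, which is one reason graph models are popular for provenance.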
Support for the analysis of data provenance can take many forms. In this study, the analyses were categorized as "generic" if they entailed generally applicable methods, such as descriptive statistics, metrics, and simple comparisons. Among the papers that described methods or solutions supporting the analysis of provenance information (9/44, 20%), 44% (4/9) fell into this category. Analyses were considered "specific" when they were tailored toward provenance-specific use cases, such as reasoning, validation tasks, and error tracing. A total of 44% (4/9) of papers fell into this category: 22% (2/9) of papers described ways to validate that data come from trustable devices [
The results of generic analyses are typically visualized using common types of visualizations, such as bar and line charts. Among the papers that described methods or solutions supporting the visualization of provenance information (9/44, 20%), most (7/9, 78%) were based on some sort of graph- or flow network–based visualization. The remaining 22% (2/9) of publications did not use such a basis but described methods or solutions showing digested information in bar charts and boxplots.
Visualization techniques or methods included dashboard-style combinations of multiple visualizations and metrics, Sankey diagrams, aggregations of graph nodes, force-directed graphs, tables, and an informal comic-style visualization of processes. Implementations are typically based on common visualization libraries or programs, such as D3.js, Gephi, yEd, sigma.js, Dagre, GraphViz, or Google Datalab.
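Of the listed tools, GraphViz consumes plain-text DOT descriptions, which makes graph-based provenance visualizations straightforward to generate programmatically; a minimal sketch with hypothetical node labels:

```python
def to_dot(edges):
    """Emit a GraphViz DOT description of a provenance graph."""
    lines = ["digraph provenance {", "  rankdir=LR;"]
    lines += ['  "{}" -> "{}";'.format(src, dst) for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)

dot = to_dot([("raw data", "cleaning"), ("cleaning", "analysis data set")])
print(dot)
```

Rendering the resulting string with `dot -Tsvg` yields a left-to-right flow diagram of the lineage.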
Of the solutions and methods capturing or creating provenance information (39/44, 89%), most (31/39, 79%) did so prospectively close to when the data were being processed. A minority (6/39, 15%) captured provenance information retrospectively after the processing concluded, based on the artifacts, such as log files, created. A total of 5% (2/39) of articles described the option for retrospective and prospective capture of provenance information, where one solution allows the reconstruction of provenance metadata for a previously finished process [
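Retrospective capture from artifacts can be as simple as parsing structured log lines into provenance records after the fact. The log format below is hypothetical; real pipelines vary widely:

```python
import re

# Hypothetical pipeline log: one line per processing step.
log = """\
2021-03-02T10:00:01 STEP=import IN=export.csv OUT=raw.parquet
2021-03-02T10:04:17 STEP=deduplicate IN=raw.parquet OUT=clean.parquet
"""

pattern = re.compile(r"(\S+) STEP=(\S+) IN=(\S+) OUT=(\S+)")
records = [
    {"time": t, "activity": step, "input": src, "output": dst}
    for t, step, src, dst in pattern.findall(log)
]
print([r["activity"] for r in records])  # ['import', 'deduplicate']
```

Prospective capture would instead emit such records at the moment each step executes, avoiding the dependence on what the log format happens to preserve.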
In this study, we provided an overview of the research on data provenance methods and technologies developed for or used in the biomedical domain. The methods and solutions described in the identified literature are heterogeneous. The supported functionalities and the design of methods were hence described to provide a systematization for navigating the heterogeneous landscape and to support the comparison of the functionalities and designs based on several characteristics. Furthermore, we identified gaps in the literature based on the systematization, including a lack of coverage regarding certain functionalities, such as the analysis of provenance metadata. The principal findings, related works, and limitations are presented in the following sections.
Despite the potential advantages of using data provenance technologies in biomedical research, as stated in the
Regarding provenance aspects (where, how, why, and who), every solution analyzed in this review captures the aspect of
When looking at the logical and concrete data models used, graphs and graph databases are the most prevalent, which is reasonable, as these are natural representations for provenance information. Widespread generic data models, such as the relational model or XML, are also used frequently, as they are versatile enough to support provenance metadata from a wide range of implementations. Although some approaches already adopted or are at least compatible with the most common provenance standards, PROV and OPM, many papers did not address compatibility with standards, which hinders the interoperability of provenance metadata.
The PROV model has gained popularity in recent years. It is slightly newer and more comprehensive than OPM, which was the "first community-driven model for provenance" [
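For a concrete impression of the standard, the warehouse-loading example from the introduction could be expressed in PROV's core terms (entities, activities, agents, and the relations between them) roughly as follows, using the PROV-JSON serialization. This is a hand-written sketch with our own identifiers, not validated against the specification:

```python
import json

# Core PROV terms: entities (data), activities (processes), agents (actors).
prov_doc = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {"ex:observations": {}, "ex:warehouse-record": {}},
    "activity": {"ex:load": {}},
    "agent": {"ex:etl-pipeline": {}},
    "used": {
        "_:u1": {"prov:activity": "ex:load", "prov:entity": "ex:observations"}},
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:warehouse-record", "prov:activity": "ex:load"}},
    "wasAssociatedWith": {
        "_:a1": {"prov:activity": "ex:load", "prov:agent": "ex:etl-pipeline"}},
}
print(json.dumps(prov_doc, sort_keys=True)[:40])
```

Expressing provenance in such a standard vocabulary is what allows metadata from different tools to be merged and queried uniformly.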
The processing or generation of large and complex data, such as omics or images, is costly [
Recently, blockchains have established themselves as a technology that supports some aspects of data provenance. Blockchains inherently provide where provenance and immutability by using consensus algorithms and cryptographic methods to maintain a single list of blocks, where all involved parties agree on the predecessor and successor of any given block. These blocks usually contain transaction information, thus enabling where provenance for the data included or referenced. Unfortunately, the blockchain-based solutions we identified and analyzed in this review often do not go beyond these inherent properties and, at this stage, provide little coverage for other aspects, such as reproducibility and reusability, which are much needed in biomedical research. However, because of their support for clearly defined and immutable lineage, they can be well suited for meeting regulatory requirements (eg, providing audit trails).
The creation or capture of data provenance information is logically the first step toward its use. Therefore, it is not surprising that creation and capturing is the most commonly supported activity in the provenance data life cycle in all the methods or solutions analyzed. Provenance data analysis and visualization were addressed less frequently, which could be a direct result of the fact that data provenance is still underutilized in biomedical research, and hence methods for the “use” of provenance information are developed or studied more rarely. We believe that the development of domain-specific analysis and visualization methods could be an important step to practically demonstrate the added value of provenance tracking and help increase its adoption. Furthermore, we did not find any indication of reference data sets, which could be used to develop and evaluate analysis or visualization methods for provenance data.
Finally, we found that most solutions or methods analyzed rely on additional, stand-alone mechanisms for capturing provenance data, whereas only a small number of approaches rely on integrated capture methods that are transparent to the user or the processing environment. This implies that capturing provenance information often involves considerable effort, which may point toward a promising field of research on how to transparently capture provenance information without causing additional work for the users or developers of data processing frameworks.
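One direction for such transparent capture is a thin wrapper around processing steps, so that provenance is recorded as a side effect of normal execution. The following is a generic Python sketch of the idea, not a solution from the surveyed literature:

```python
import functools
from datetime import datetime, timezone

provenance_log = []

def track_provenance(func):
    """Record a provenance entry every time the wrapped step runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        provenance_log.append({
            "activity": func.__name__,
            "inputs": [repr(a) for a in args],
            "output": repr(result),
            "time": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper

@track_provenance
def normalize_heart_rate(bpm):
    """An example processing step; its provenance is captured transparently."""
    return round(bpm)

normalize_heart_rate(71.8)
print(provenance_log[0]["activity"])  # normalize_heart_rate
```

The developer only annotates the step; the capture logic lives in the middleware layer, which is the property the integrated approaches above aim for.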
Several related papers have studied and systematized research on data provenance, albeit typically with a focus on general concepts or applications and not on biomedicine. In 2005, Simmhan et al [
A more recent (2017) survey by Herschel et al [
de Lusignan et al [
Goble [
Last year, Gierend et al [
This study has some limitations owing to the chosen search strategy, the heterogeneity of the included articles, and the methods and solutions described therein. Most importantly, the search strategy was designed to specifically capture the topic of provenance in biomedical research, and the terms used did not explicitly include specific research domains, such as psychology or other behavioral sciences. However, we believe that our selection strategy likely missed only relevant articles that did not mention the broader biomedical context, that is, one of our search keywords, in their titles or abstracts. Furthermore, we consider the existence of a large body of literature with these characteristics unlikely. The fact that approximately 46% (44/96) of all identified unique references were included in this review can be taken as an indicator that provenance tracking has not yet become a common feature of biomedical research platforms. If it had, a greater proportion of the literature would be expected to mention provenance only as a sidenote, leading to exclusion owing to a lack of focus on provenance technology. By contrast, many articles mentioning provenance in their titles or abstracts have a specific focus on this topic.
The methods and solutions described in the selected articles were systematized, important properties were qualitatively identified, their occurrences were assessed and reported, and individual examples were included for special cases that appeared rather unique. The reported statistics are subject to uncertainties. They should be understood as indications and do not describe the entire field with absolute certainty.
Despite the growing interest in the literature, little progress has been made in the biomedical field regarding the development of data provenance technologies, which could help mitigate reproducibility issues. An important reason could be a lack of generic and transparent solutions for easily capturing or creating provenance data, resulting in potentially substantial efforts for provenance tracking. Another gap we identified is a lack of specific methods for analyzing and visualizing provenance data, which may make it difficult to adequately leverage the added value provided. We also observed considerable heterogeneity in the motivation, scope, and functionality of provenance tracking methods for biomedical applications, pointing toward a potential lack of a unified understanding of underlying concepts and a narrow focus on specific use cases. Providing general purpose data sets and application scenarios, as well as benchmarking mechanisms, could help overcome this challenge in the future.
Our work focused specifically on papers from the biomedical field to investigate the state of the art in this particular application area. In future work, it may be worthwhile to also study general purpose methods, models, and implementations and investigate their applicability to biomedical use cases.
Queries used for the database search.
Selection process and collected data.
HDF5: Hierarchical Data Format, Version 5
OPM: Open Provenance Model
PRISMA-ScR: Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews
RQ: research question
W3C: World Wide Web Consortium
FB, FP, and MJ contributed to research conceptualization and initiation. ACH, AM, FP, FNW, MJ, and TM contributed to eligibility screening. ACH, AM, FP, FNW, MH, MJ, and TM contributed to data collection and charting. FB, FP, and MJ contributed to data analysis. FB, FP, and MJ contributed to the drafting of the manuscript. All the authors read and approved the final manuscript.
None declared.