<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "journalpublishing.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" article-type="research-article"><front><journal-meta><journal-id journal-id-type="nlm-ta">J Med Internet Res</journal-id><journal-id journal-id-type="publisher-id">jmir</journal-id><journal-id journal-id-type="index">1</journal-id><journal-title>Journal of Medical Internet Research</journal-title><abbrev-journal-title>J Med Internet Res</abbrev-journal-title><issn pub-type="epub">1438-8871</issn><publisher><publisher-name>JMIR Publications</publisher-name><publisher-loc>Toronto, Canada</publisher-loc></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">v28i1e83903</article-id><article-id pub-id-type="doi">10.2196/83903</article-id><article-categories><subj-group subj-group-type="heading"><subject>Viewpoint</subject></subj-group></article-categories><title-group><article-title>Extrinsic Trust as a Contractual Framework for Accountable AI in Health Care: Viewpoint</article-title></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name name-style="western"><surname>Kelly</surname><given-names>Anthony</given-names></name><degrees>PhD</degrees><xref ref-type="aff" rid="aff1">1</xref><xref ref-type="aff" rid="aff2">2</xref></contrib></contrib-group><aff id="aff1"><institution>Department of Electronic and Computer Engineering, University of Limerick</institution><addr-line>Castletroy</addr-line><addr-line>Limerick</addr-line><country>Ireland</country></aff><aff id="aff2"><institution>Health Research Institute, University of Limerick</institution><addr-line>Limerick</addr-line><country>Ireland</country></aff><contrib-group><contrib contrib-type="editor"><name 
name-style="western"><surname>Coristine</surname><given-names>Andrew</given-names></name></contrib></contrib-group><contrib-group><contrib contrib-type="reviewer"><name name-style="western"><surname>Mohanadas</surname><given-names>Sadhasivam</given-names></name></contrib><contrib contrib-type="reviewer"><name name-style="western"><surname>Dai</surname><given-names>Tinglong</given-names></name></contrib></contrib-group><author-notes><corresp>Correspondence to Anthony Kelly, PhD, Department of Electronic and Computer Engineering, University of Limerick, Castletroy, Limerick, V94 T9PX, Ireland, 353 61 202700; <email>anthony.kelly@ul.ie</email></corresp></author-notes><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>5</day><month>3</month><year>2026</year></pub-date><volume>28</volume><elocation-id>e83903</elocation-id><history><date date-type="received"><day>10</day><month>09</month><year>2025</year></date><date date-type="rev-recd"><day>02</day><month>02</month><year>2026</year></date><date date-type="accepted"><day>04</day><month>02</month><year>2026</year></date></history><copyright-statement>&#x00A9; Anthony Kelly. Originally published in the Journal of Medical Internet Research (<ext-link ext-link-type="uri" xlink:href="https://www.jmir.org">https://www.jmir.org</ext-link>), 5.3.2026. </copyright-statement><copyright-year>2026</copyright-year><license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. 
The complete bibliographic information, a link to the original publication on <ext-link ext-link-type="uri" xlink:href="https://www.jmir.org/">https://www.jmir.org/</ext-link>, as well as this copyright and license information must be included.</p></license><self-uri xlink:type="simple" xlink:href="https://www.jmir.org/2026/1/e83903"/><abstract><p>Artificial intelligence (AI) promises efficiency and equity in health care. However, adoption remains fragmented due to weak foundations of trust. This Viewpoint highlights the gap between intrinsic trust, based on interpretability, and extrinsic trust, based on functional validation. We propose a contractual framework between AI systems and users defined by 3 promises: reliability, scope and equity, and shift and uncertainty. Using a vignette, we show how health systems can operationalize these promises through structured evidence and governance, translating trustworthy AI into accountable clinical deployment.</p></abstract><kwd-group><kwd>artificial intelligence</kwd><kwd>AI</kwd><kwd>explainable artificial intelligence</kwd><kwd>explainable AI</kwd><kwd>XAI</kwd><kwd>trust</kwd><kwd>machine learning</kwd><kwd>ML</kwd><kwd>mental health</kwd><kwd>decision support system</kwd><kwd>intrinsic trust</kwd></kwd-group></article-meta></front><body><sec id="s1" sec-type="intro"><title>Introduction</title><p>Artificial intelligence (AI) has the potential to transform clinical care, with applications spanning triage, diagnosis, risk prediction, and resource allocation [<xref ref-type="bibr" rid="ref1">1</xref>]. However, despite this technical promise, real-world adoption remains fragmented.
Trust is a central barrier to adoption, as clinicians and decision-makers remain hesitant to rely on AI in safety-critical environments, where errors carry significant consequences for patient outcomes, liability, and public confidence [<xref ref-type="bibr" rid="ref2">2</xref>].</p><p>Adoption requires structured, credible evidence. A recent scoping review identified impediments, including limited external validation data, transparency and equity concerns, workflow integration difficulties, and unclear accountability, and highlighted facilitators such as robust validation, governance, and postdeployment monitoring [<xref ref-type="bibr" rid="ref3">3</xref>].</p><p>Conceptually, the gap between AI engagement and adoption corresponds to a distinction between intrinsic and extrinsic trust. Intrinsic trust is the sense that clinicians gain when AI outputs are interpretable and aligned with clinical reasoning; extrinsic trust (or functionality trust) [<xref ref-type="bibr" rid="ref4">4</xref>] rests on empirical evidence that the model performs reliably across diverse populations and under real-world conditions [<xref ref-type="bibr" rid="ref5">5</xref>]. Intrinsic trust is necessary to encourage engagement and build confidence in the system, but extrinsic trust ultimately influences deployment decisions [<xref ref-type="bibr" rid="ref5">5</xref>,<xref ref-type="bibr" rid="ref6">6</xref>].</p><p>Establishing extrinsic trust is fundamentally a data and infrastructure challenge. We argue that health systems require structured approaches to generate, verify, and act on validation evidence under real-world conditions. Accordingly, we introduce a contractual framing of extrinsic trust supported by a minimum evidence package and a clinical vignette. Rather than proposing a new reporting artifact, the framework complements existing tools such as model cards, algorithmic audits, and assurance cases by treating trust as a small set of explicit, testable promises.
Each promise is linked to defined evidence requirements, verification roles, and breach conditions that trigger governance action across development, deployment, and postdeployment monitoring.</p></sec><sec id="s2"><title>Intrinsic and Extrinsic Trust in Clinical AI</title><p>Clinicians engage with AI systems in stages, beginning with how outputs align with their professional reasoning [<xref ref-type="bibr" rid="ref7">7</xref>]. This early stage of intrinsic trust develops when predictions are interpretable and presented in ways that resonate with clinical heuristics [<xref ref-type="bibr" rid="ref8">8</xref>-<xref ref-type="bibr" rid="ref10">10</xref>].</p><p>Intrinsic trust provides a foundation for engagement, but decisions about deployment depend on further evidence. Clinicians require independent validation that the model performs reliably across different patient groups and that it can indicate when deference to human judgment is warranted in the face of predictive uncertainty [<xref ref-type="bibr" rid="ref11">11</xref>-<xref ref-type="bibr" rid="ref15">15</xref>]. Interpretable AI models are central to the sensemaking involved in intrinsic trust, but decisions about adoption require evidence of performance under real-world conditions [<xref ref-type="bibr" rid="ref16">16</xref>]. Put succinctly, explainability fosters engagement, but validation governs deployment [<xref ref-type="bibr" rid="ref5">5</xref>,<xref ref-type="bibr" rid="ref12">12</xref>,<xref ref-type="bibr" rid="ref14">14</xref>]. Intrinsic trust lowers the barrier to initial use, while extrinsic trust provides the assurance needed for safe integration into workflows and health system decision-making.</p><p>Extrinsic trust rests on validating that the model can generalize from the training examples in the dataset to previously unseen examples.
As such, the model must be either invariant to shift in the statistical distribution of the unseen data or convey increasing uncertainty commensurate with the shift. Since supervised machine learning models are known to be sensitive to out-of-distribution data shift, appropriate validation involves verifying that this sensitivity is reliably conveyed through a suitable uncertainty metric. Shifts within the distribution may occur in subgroups of the data, for example, in demographic or clinical subtypes. Appropriate validation for such in-distribution shifts involves verifying performance invariance. AI model calibration is concerned with how well a model&#x2019;s predicted probabilities align with actual outcomes and is therefore the foundation of reliability [<xref ref-type="bibr" rid="ref17">17</xref>]. Calibration provides a foundation for trust by enabling users to gauge whether the AI &#x201C;knows when it knows.&#x201D; Without external validation, especially under real-world conditions where data distributions may shift, clinicians may withhold trust or misplace it, leading to underuse or overreliance [<xref ref-type="bibr" rid="ref12">12</xref>,<xref ref-type="bibr" rid="ref14">14</xref>,<xref ref-type="bibr" rid="ref18">18</xref>].</p><p>Consequently, the following issues need to be addressed through validation:</p><list list-type="bullet"><list-item><p>The model should be reliable for in-distribution unseen data examples, demonstrating that predicted probabilities align with actual outcomes.</p></list-item><list-item><p>Reliability should be maintained for in-distribution subgroups.</p></list-item><list-item><p>Uncertainty metrics should convey higher uncertainty for out-of-distribution data, demonstrating that the model &#x201C;knows when it knows.&#x201D;</p></list-item></list><p>Together, these questions define a pragmatic minimum for extrinsic trust rather than a complete taxonomy.
They reflect 3 recurrent failure modes in clinical AI deployment: miscalibrated confidence, uneven subgroup performance, and silent degradation under distributional shift. Reliability and uncertainty are complementary. Reliability characterizes confidence alignment within scope, while uncertainty governs safe behavior as that alignment weakens. Scope and equity are treated jointly because subgroup shifts represent clinically salient in-distribution variation that population-level metrics can obscure.</p><p>Although these requirements are most relevant for discriminative AI models that predict classifications, there is emerging evidence that generative AI text models (large language models) can be queried about their reliability and uncertainty [<xref ref-type="bibr" rid="ref19">19</xref>], perhaps allowing for future application of this framework.</p></sec><sec id="s3"><title>Core Data Requirements for Extrinsic Trust</title><p>Similarly to trust in people, trust in AI has been framed as a contract [<xref ref-type="bibr" rid="ref11">11</xref>,<xref ref-type="bibr" rid="ref20">20</xref>,<xref ref-type="bibr" rid="ref21">21</xref>]. Following this notion of contractual trust, users can rely on an AI system when explicit promises are stated and kept. Therefore, users must know what the AI is being trusted with and possess a means of evaluating whether the contract is adhered to [<xref ref-type="bibr" rid="ref11">11</xref>]. This adherence depends on the context, and the means of determining adherence are dependent on the nature of the contract [<xref ref-type="bibr" rid="ref8">8</xref>,<xref ref-type="bibr" rid="ref11">11</xref>]. 
Drawing on these validation questions, we propose 3 evidence clauses that should be adhered to for safe deployment (<xref ref-type="fig" rid="figure1">Figure 1</xref>) and detail the required metrics in <xref ref-type="other" rid="box1">Textbox 1</xref>.</p><fig position="float" id="figure1"><label>Figure 1.</label><caption><p>Concentric framework of extrinsic trust. The 3 contractual promises of trust&#x2014;reliability, scope and equity, and shift and uncertainty&#x2014;are represented as radial sectors. Each promise operates across 3 levels: model-level evidence (inner ring), user-level trust and workflow fit (middle ring), and system-level governance and accountability (outer ring). At the core lies extrinsic trust, integrating these dimensions into the basis for responsible clinical artificial intelligence adoption. ECE: expected calibration error; SCE: static calibration error.</p></caption><graphic alt-version="no" mimetype="image" position="float" xlink:type="simple" xlink:href="jmir_v28i1e83903_fig01.png"/></fig><boxed-text id="box1"><title>Metrics for operationalizing the 3 promises of extrinsic trust in artificial intelligence systems.</title><p><bold>Reliability promise</bold></p><list list-type="bullet"><list-item><p>Population-level performance metrics (eg, sensitivity, area under the receiver operating characteristic curve, and <italic>F</italic><sub>1</sub>-score)</p></list-item><list-item><p>Expected calibration error or static calibration error: scalar measures of how closely predicted probabilities align with observed outcomes</p></list-item><list-item><p>Reliability diagrams: visual summaries that clinicians can interpret to judge whether model confidence reflects real-world accuracy</p></list-item></list><p><bold>Scope and equity promise</bold></p><list list-type="bullet"><list-item><p>Stratified performance metrics (eg, sensitivity, area under the receiver operating characteristic curve, and <italic>F</italic><sub>1</sub>-score): reported
across clinically and demographically relevant subgroups to detect disparities</p></list-item><list-item><p>Groupwise calibration plots: showing whether confidence is systematically over- or underestimated for particular populations</p></list-item></list><p><bold>Shift and uncertainty promise</bold></p><list list-type="bullet"><list-item><p>Negative log-likelihood: evaluates the quality of probabilistic predictions under noise or perturbation</p></list-item><list-item><p>Predictive entropy or abstention rate: label-agnostic measures of uncertainty, useful for detecting out-of-distribution cases and supporting safe deferral to human review</p></list-item></list></boxed-text><p>The term &#x201C;contractual&#x201D; refers to a professional and governance compact rather than a legal indemnity framework and does not define liability. It makes explicit the performance promises that an AI system claims to uphold, the evidence required to support them, and the governance actions that follow if they are breached, thereby informing procurement and oversight.</p><p>In this compact, developers or vendors make explicit performance claims, health systems commission and govern verification, and clinicians rely on outputs under defined workflow conditions. A promise is meaningful only if keeping or breaching it is observable through agreed artifacts and escalation rules.</p><p>Recent work on trust in clinical AI emphasizes that trust is relational and bidirectional, emerging from interactions among models, clinicians, workflows, and institutional safeguards rather than from model properties alone. From this perspective, AI systems depend on human inputs, data quality, and organizational practices to remain safe and effective over time [<xref ref-type="bibr" rid="ref22">22</xref>].
The contractual promises proposed in this paper are intended to complement this view by specifying the minimum evidence and governance commitments required to sustain trust within such sociotechnical systems, particularly through algorithmic audits and postdeployment monitoring.</p></sec><sec id="s4"><title>Reliability Promise</title><p>The first requirement is a reliability promise: an AI system should provide probability estimates that clinicians can trust in the intended use context. High-confidence predictions should typically be correct, and uncertainty should be visible when the model is unsure.</p><p>Reliability depends on calibration: a well-calibrated model produces probability estimates that align with observed outcomes, giving clinicians a clear sense of when the system&#x2019;s confidence can be relied upon [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref24">24</xref>]. This alignment is crucial in safety-critical settings, where misplaced confidence can lead to harmful decisions. Tools such as reliability diagrams and scalar metrics such as the expected calibration error or its multiclass analogue, the static calibration error (SCE), provide concise ways of demonstrating this alignment by showing a good fit of the reliability curve to the diagonal and low values for the metrics (perfectly calibrated models have an SCE or expected calibration error of 0) [<xref ref-type="bibr" rid="ref17">17</xref>,<xref ref-type="bibr" rid="ref25">25</xref>,<xref ref-type="bibr" rid="ref26">26</xref>]. Monitoring changes in these metrics or the reliability curve fit is useful as acceptable threshold levels are context dependent. 
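</p><p>As a minimal illustration (not part of the framework itself), the expected calibration error can be computed by binning predictions by confidence and comparing mean confidence with the observed event rate in each bin. The sketch below assumes binary outcomes and 10 equal-width bins; both are local implementation choices rather than requirements of this framework:</p>

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Expected calibration error (ECE) with equal-width confidence bins.

    probs:  predicted probabilities for the positive class, shape (n,)
    labels: observed binary outcomes (0/1), shape (n,)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        # Right-inclusive bins; the first bin also includes probability 0.0.
        if i == 0:
            mask = (probs >= lo) & (probs <= hi)
        else:
            mask = (probs > lo) & (probs <= hi)
        if not mask.any():
            continue
        confidence = probs[mask].mean()   # mean predicted probability in bin
        event_rate = labels[mask].mean()  # observed outcome rate in bin
        # Weight each bin's confidence-accuracy gap by its share of cases.
        ece += (mask.sum() / probs.size) * abs(event_rate - confidence)
    return ece
```

<p>A perfectly calibrated model scores 0, and monitoring upward drift in this value against the validated deployment baseline is one way to operationalize a reliability breach condition.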
Postcalibration scaling can further improve poorly calibrated models [<xref ref-type="bibr" rid="ref26">26</xref>,<xref ref-type="bibr" rid="ref27">27</xref>].</p><p>For health systems, requiring calibration evidence ensures that AI models offer dependable guidance that supports safe decision-making [<xref ref-type="bibr" rid="ref12">12</xref>].</p></sec><sec id="s5"><title>Scope and Equity Promise</title><p>The second requirement is a scope and equity promise: AI systems should clearly state who and what they are designed for and then provide evidence that they perform consistently across relevant clinical and demographic subgroups. Declaring scope makes explicit the intended populations, settings, and workflow assumptions, while equity validation ensures that no patient group is systematically disadvantaged.</p><p>Validation of the scope and equity promise involves stratified evaluation. Standard performance measures such as accuracy, sensitivity, area under the receiver operating characteristic curve, or <italic>F</italic><sub>1</sub>-score [<xref ref-type="bibr" rid="ref28">28</xref>] should be reported not just at the population level but separately for subgroups defined by age, sex, ethnicity, or clinical subtype [<xref ref-type="bibr" rid="ref29">29</xref>]. Calibration should also be examined within each stratum to detect systematic over- or underestimation. These analyses highlight where models are consistent and where subgroup-specific recalibration or data augmentation may be required. It should be acknowledged that intersectional subgroups may yield small sample sizes, limiting the ability to perform subgroup analysis in some cases.</p><p>From a regulatory and governance perspective, AI model cards offer a practical tool for this promise.
They function as structured &#x201C;labels,&#x201D; detailing intended use cases, target populations, known limitations, subgroup performance, and fairness considerations [<xref ref-type="bibr" rid="ref30">30</xref>]. This transparency allows clinicians, procurement teams, and regulators to assess whether systems are safe and appropriate for deployment.</p><p>Equity is fundamental because health populations are heterogeneous. A model that performs well overall may still underperform in underrepresented groups, amplifying disparities in care [<xref ref-type="bibr" rid="ref12">12</xref>-<xref ref-type="bibr" rid="ref14">14</xref>]. Mandating scope declarations and subgroup validation enables health systems to recognize both strengths and limits, supporting targeted safeguards where needed.</p></sec><sec id="s6"><title>Shift and Uncertainty Promise</title><p>The third requirement is a shift and uncertainty promise: AI systems must provide safeguards when confronted with inputs that differ from the data on which they were trained. In evolving clinical environments, populations, referral patterns, and data quality change over time. Without mechanisms to detect and manage such shifts, models risk silent failure in precisely the contexts where reliability is most critical [<xref ref-type="bibr" rid="ref12">12</xref>,<xref ref-type="bibr" rid="ref31">31</xref>].</p><p>This promise requires evidence that the system can signal uncertainty, degrade gracefully, and defer to human review when appropriate. Metrics such as negative log-likelihood or predictive entropy capture probabilistic quality and uncertainty in unfamiliar cases [<xref ref-type="bibr" rid="ref26">26</xref>,<xref ref-type="bibr" rid="ref32">32</xref>,<xref ref-type="bibr" rid="ref33">33</xref>]. 
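</p><p>As a minimal illustration of how such safeguards might be wired into a workflow, the sketch below computes predictive entropy over a model&#x2019;s class probabilities and abstains above a threshold. The function names and the threshold value are assumptions for illustration only; in practice, the threshold would be set on a held-out calibration set and agreed through local governance:</p>

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution; higher means more uncertain."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def triage(probs, entropy_threshold=0.5):
    """Defer to human review when predictive entropy exceeds a threshold.

    The threshold here is purely illustrative, not a recommended operating point.
    """
    if predictive_entropy(probs) > entropy_threshold:
        return "defer-to-clinician"   # abstain: escalate for human review
    return "accept-prediction"        # confidence is high enough to act on
```

<p>Abstention of this kind converts rising uncertainty into a visible workflow action rather than a silent prediction.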
In practice, confidence should decline, and error signals should increase in a predictable manner when inputs deviate or contain errors.</p><p>By enabling abstention or escalation under uncertainty, models help protect patients from misleading outputs, preserve equity across heterogeneous populations, and reduce risks of overreliance [<xref ref-type="bibr" rid="ref34">34</xref>]. Embedding uncertainty clauses into validation frameworks ensures that decision-makers have the information needed to verify that AI remains trustworthy under real-world conditions.</p></sec><sec id="s7"><title>Minimum Evidence Package for Extrinsic Trust</title><p>To formalize the contractual framework, we outline a minimum evidence package (<xref ref-type="table" rid="table1">Table 1</xref>) that specifies what constitutes proof of each promise, who is responsible for generating and verifying that proof, and what breach conditions require mitigation or suspension of deployment. For many of the metrics that constitute proof, there is no accepted threshold, although statistical threshold limits may be calculated on held-out calibration sets. For example, a material breach may be defined as a statistically significant upward shift in calibration error relative to the validated deployment baseline, as determined by locally agreed monitoring procedures. Therefore, the minimum evidence package provides the foundation for a contract between the AI system and its users: a shared understanding that the model&#x2019;s behavior is recorded and that deviations can trigger governance actions to mitigate emerging risks before they manifest in clinical practice.</p><table-wrap id="t1" position="float"><label>Table 1.</label><caption><p>Minimum evidence package for extrinsic trust.
Baselines refer to the model&#x2019;s validated performance at deployment or the last approved update, with material deviation determined relative to locally defined governance thresholds.</p></caption><table id="table1" frame="hsides" rules="groups"><thead><tr><td align="left" valign="bottom">Promise</td><td align="left" valign="bottom">Artifacts</td><td align="left" valign="bottom">Producer or verifier</td><td align="left" valign="bottom">Breach condition and action</td></tr></thead><tbody><tr><td align="left" valign="top">All</td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Model card documenting intended population, features, and workflow assumptions</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Producer: vendor or developer</p></list-item><list-item><p>Verifier: health system governance and clinical safety officer</p></list-item><list-item><p>Trigger: model commissioning and model update</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Breach: model card not satisfactory</p></list-item><list-item><p>Action: remedial documentation or clarification required before deployment or update</p></list-item></list></td></tr><tr><td align="left" valign="top">Reliability</td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Population-level performance metrics (eg, sensitivity, AUC<sup><xref ref-type="table-fn" rid="table1fn1">a</xref></sup>, and <italic>F</italic><sub>1</sub>-score)</p></list-item><list-item><p>Calibration metrics: ECE<sup><xref ref-type="table-fn" rid="table1fn2">b</xref></sup> or SCE<sup><xref ref-type="table-fn" rid="table1fn3">c</xref></sup></p></list-item><list-item><p>Reliability diagrams</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Producer: vendor or developer</p></list-item><list-item><p>Verifier: health system governance and clinical safety 
officer</p></list-item><list-item><p>Trigger: annually; model update or change in patient profiles</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Breach: performance or calibration metrics deviate materially from the validated deployment baseline</p></list-item><list-item><p>Action: (1) recalibration or retraining, (2) investigation of the root cause, or (3) possible suspension</p></list-item></list></td></tr><tr><td align="left" valign="top">Scope and equity</td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Stratified performance metrics (eg, subgroup: sensitivity, AUC, and <italic>F</italic><sub>1</sub>-score)</p></list-item><list-item><p>Groupwise calibration plots</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Producer: vendor or developer</p></list-item><list-item><p>Verifier: health system governance and clinical safety officer</p></list-item><list-item><p>Trigger: annually; model update or change in patient profiles</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Breach: subgroup performance or calibration deviates materially from the validated deployment baseline</p></list-item><list-item><p>Action: (1) subgroup analysis, (2) recalibration, (3) model card update to exclude subgroups, or (4) possible suspension</p></list-item></list></td></tr><tr><td align="left" valign="top">Shift and uncertainty</td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Predictive entropy or abstention rate</p></list-item><list-item><p>NLL<sup><xref ref-type="table-fn" rid="table1fn4">d</xref></sup></p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Producer: vendor or developer</p></list-item><list-item><p>Verifier: health system governance and clinical safety officer</p></list-item><list-item><p>Trigger: annually; model update or change in patient 
profiles</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Breach: uncertainty metrics (eg, entropy or NLL) deviate materially from the validated deployment baseline</p></list-item><list-item><p>Action: (1) retraining of model with new data, (2) reverification of model, (3) model card update, or (4) possible suspension</p></list-item></list></td></tr></tbody></table><table-wrap-foot><fn id="table1fn1"><p><sup>a</sup>AUC: area under the receiver operating characteristic curve.</p></fn><fn id="table1fn2"><p><sup>b</sup>ECE: expected calibration error.</p></fn><fn id="table1fn3"><p><sup>c</sup>SCE: static calibration error.</p></fn><fn id="table1fn4"><p><sup>d</sup>NLL: negative log-likelihood.</p></fn></table-wrap-foot></table-wrap><p>For clinicians and health system decision-makers, the contractual framing also provides a practical cognitive benefit. By organizing technical, regulatory, and statistical requirements into a small number of clear promises, it offers a concise mental model for what trustworthy AI deployment entails, helping cut through the complexity of standards and guidance.</p><p>The 3 promises align closely with emerging international regulatory requirements for the analysis and management of AI risks. For example, the European Union AI Act [<xref ref-type="bibr" rid="ref35">35</xref>] requires transparency (Article 13), accuracy and robustness (Article 15), and risk monitoring (Articles 9 and 72). Although this applies primarily to AI medical devices [<xref ref-type="bibr" rid="ref36">36</xref>], it nevertheless also implies best practice for lower-risk AI. 
<xref ref-type="table" rid="table2">Table 2</xref> maps each promise to obligations in the European Union AI Act [<xref ref-type="bibr" rid="ref35">35</xref>], Food and Drug Administration predetermined change control plan [<xref ref-type="bibr" rid="ref37">37</xref>], and International Organization for Standardization and International Electrotechnical Commission 42001 standard [<xref ref-type="bibr" rid="ref38">38</xref>], demonstrating that the proposed contractual framework provides a practical implementation framework consistent with regulatory trajectories in high-risk clinical AI.</p><table-wrap id="t2" position="float"><label>Table 2.</label><caption><p>Mapping the 3 promises to emerging regulatory requirements.</p></caption><table id="table2" frame="hsides" rules="groups"><thead><tr><td align="left" valign="bottom">Promise</td><td align="left" valign="bottom">EU<sup><xref ref-type="table-fn" rid="table2fn1">a</xref></sup> AI<sup><xref ref-type="table-fn" rid="table2fn2">b</xref></sup> Act (Regulation [EU] 2024/1689)</td><td align="left" valign="bottom">FDA<sup><xref ref-type="table-fn" rid="table2fn3">c</xref></sup> PCCP<sup><xref ref-type="table-fn" rid="table2fn4">d</xref></sup></td><td align="left" valign="bottom">ISO<sup><xref ref-type="table-fn" rid="table2fn5">e</xref></sup> and IEC<sup><xref ref-type="table-fn" rid="table2fn6">f</xref></sup> 42001:2023 standard (AI management system)</td></tr></thead><tbody><tr><td align="left" valign="top">Reliability promise</td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Article 15: high-risk AI must achieve appropriate accuracy, robustness, and cybersecurity, and AI providers must declare accuracy metrics and levels and test or validate them</p></list-item><list-item><p>Article 9: requires an ongoing risk management system</p></list-item><list-item><p>Article 11+annex IV: technical documentation must include information such as performance characteristics and metrics and testing or 
validation evidence (as specified in annex IV)</p></list-item></list></td><td align="left" valign="top">The PCCP comprises (1) description of modifications, (2) modification protocol, and (3) impact assessment. The modification protocol should specify verification and validation activities, including performance evaluation methods, performance metrics, statistical tests, and predefined acceptance criteria, and describe how the manufacturer will determine that a modification is acceptable before implementation.</td><td align="left" valign="top">Clause 8.1 (operational planning and control) requires planned and controlled AI operations. Clause 9.1 (monitoring, measurement, analysis, and evaluation) requires organizations to determine what needs to be monitored and measured and how and when to evaluate AI system performance. Clause 10 (improvement) requires correction and continuous improvement when performance deviates.</td></tr><tr><td align="left" valign="top">Scope and equity promise</td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Article 10: training, validation, and testing data must be sufficiently relevant and representative for the intended purpose, and AI providers must examine and mitigate bias risks</p></list-item><list-item><p>Article 13: transparency and instructions enabling appropriate use, including information needed for deployers to interpret outputs and use the system properly</p></list-item></list></td><td align="left" valign="top">The evidence and testing data supporting the PCCP should be representative of intended use populations (eg, demographic factors where relevant). 
The PCCP and its modifications must keep the device within the authorized intended use and indications; a change to what was described or authorized (eg, new population not covered) may require a new marketing submission.</td><td align="left" valign="top">Clauses 4.1&#x2010;4.3 require definition of organizational context, interested parties, and the scope of the AI management system. Clause 6.1 (actions to address risks and opportunities) requires identification and treatment of AI risks, including risks to individuals and groups. Clause 6.1.4 (AI system impact assessment) requires assessment of impacts on individuals and groups affected by AI systems.</td></tr><tr><td align="left" valign="top">Shift and uncertainty promise</td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Article 72: AI providers must operate a postmarket monitoring system to collect, document, and analyze performance and risks during real-world use</p></list-item><list-item><p>Article 15: robustness expectations (accuracy, robustness, and cybersecurity) support managing performance degradation in foreseeable conditions</p></list-item></list></td><td align="left" valign="top">The PCCP&#x2019;s modification protocol should define how the manufacturer will detect the need for change and evaluate modifications; examples include identifying triggers, such as when drift in data is observed. The protocol should specify how performance will be assessed after the change (metrics, tests, or acceptance criteria) and state that modifications should not be implemented if acceptance criteria are not met.</td><td align="left" valign="top">Clause 9.1 requires ongoing monitoring and evaluation of AI systems. Clause 9.2 (internal audit) requires periodic audits to detect nonconformity. 
Clause 10.2 (nonconformity and corrective action) requires organizations to respond to deviations, investigate causes, and implement corrective actions, supporting structured responses to model drift and performance degradation.</td></tr></tbody></table><table-wrap-foot><fn id="table2fn1"><p><sup>a</sup>EU: European Union.</p></fn><fn id="table2fn2"><p><sup>b</sup>AI: artificial intelligence.</p></fn><fn id="table2fn3"><p><sup>c</sup>FDA: Food and Drug Administration.</p></fn><fn id="table2fn4"><p><sup>d</sup>PCCP: predetermined change control plan.</p></fn><fn id="table2fn5"><p><sup>e</sup>ISO: International Organization for Standardization.</p></fn><fn id="table2fn6"><p><sup>f</sup>IEC: International Electrotechnical Commission.</p></fn></table-wrap-foot></table-wrap></sec><sec id="s8"><title>Case Study Vignette: Operationalizing the 3 Promises Using the Probabilistic Integrated Scoring Model</title><p>To illustrate how extrinsic trust can be operationalized, we turn to the Probabilistic Integrated Scoring Model (PrISM) [<xref ref-type="bibr" rid="ref39">39</xref>], a prototype AI classification system developed for initial treatment screening in routine mental health care. PrISM is a transparent, interpretable neural network that integrates the structured logic of psychometric instruments with probabilistic estimation.</p><p>The PrISM study involved a secondary analysis of questionnaire and demographic data from 1068 patients treated in Internetpsykiatrien, Denmark&#x2019;s national internet-based mental health service. Further details related to this vignette can be found in <xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref>.</p><p>Regarding the reliability promise, reliability diagrams show how closely the model&#x2019;s predicted probabilities align with observed outcomes. <xref ref-type="fig" rid="figure2">Figure 2</xref> shows that PrISM confidence estimates remain near the ideal diagonal, with a static calibration error (SCE) of 4.10%.
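A calibration figure of this kind is straightforward for a procuring health system to recompute from held-out predictions. The sketch below follows the class-wise binned definition of static calibration error described by Nixon et al; the variable names and toy data are illustrative, not the PrISM implementation.

```python
import numpy as np

def static_calibration_error(probs, labels, n_bins=10):
    """SCE: per class, bin predictions by confidence, take the
    |observed frequency - mean confidence| gap weighted by bin
    occupancy, then average over classes."""
    n, n_classes = probs.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    sce = 0.0
    for c in range(n_classes):
        conf = probs[:, c]
        hit = (labels == c).astype(float)  # 1 when class c actually occurred
        for i in range(n_bins):
            lo, hi = edges[i], edges[i + 1]
            # first bin is closed on the left so zero-confidence cases count
            mask = (conf <= hi) & ((conf > lo) if i else (conf >= lo))
            if mask.any():
                gap = abs(hit[mask].mean() - conf[mask].mean())
                sce += (mask.sum() / n) * gap
    return sce / n_classes

# toy check: an 80% confidence that is borne out 80% of the time is well calibrated
probs = np.array([[0.8, 0.2]] * 10)
labels = np.array([0] * 8 + [1] * 2)
print(round(static_calibration_error(probs, labels), 3))  # prints 0.0
```

A lower value indicates predicted probabilities that track observed outcomes more closely; a perfectly calibrated predictor scores 0.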
For a clinician reviewing a referral, this provides assurance that, when the model appears confident, it is generally correct, supporting decisions about when to trust AI guidance and when to seek further input.</p><fig position="float" id="figure2"><label>Figure 2.</label><caption><p>Illustrative example of a reliability diagram for a predictive multiclass model (Probabilistic Integrated Scoring Model). The static calibration error (SCE) of 4.10% and close adherence to the diagonal indicate that the predicted probabilities closely match observed outcomes.</p></caption><graphic alt-version="no" mimetype="image" position="float" xlink:type="simple" xlink:href="jmir_v28i1e83903_fig02.png"/></fig><p>Regarding the scope and equity promise, PrISM is explicitly designed and framed as a tool for screening referrals in a digital clinic for mild to moderate depression and anxiety, clearly stating who and what it is designed for. Validation includes stratified calibration and discrimination (eg, by sex or presenting problem), with deployment contingent on subgroup performance remaining within a predefined deviation from the population baseline. This analysis highlights where performance is consistent and where disparities might emerge, enabling clinicians and service managers to anticipate limitations and plan mitigations rather than encountering them unexpectedly in practice.</p><p>Regarding the shift and uncertainty promise, in deployment, AI systems encounter cases that differ from their training data. PrISM addresses this by signaling uncertainty. When data inputs contain errors or deviate from their familiar distribution, the uncertainty metric (negative log-likelihood) rises, and the system may abstain from issuing a recommendation, routing the case for human review (<xref ref-type="fig" rid="figure3">Figure 3</xref>). 
For example, cases exceeding the 95th percentile of validation set uncertainty may trigger abstention and clinician review, with abstention rates monitored longitudinally as a safety signal. For clinicians, this safeguard functions as a clear indicator that the model recognizes its limits, ensuring that safe escalation is possible.</p><fig position="float" id="figure3"><label>Figure 3.</label><caption><p>Illustrative example of uncertainty under increasing data noise (Probabilistic Integrated Scoring Model). As data noise increases (x-axis), the negative log-likelihood (NLL) uncertainty metric increases, whereas the accuracy performance metric stays relatively constant. This illustrates that the model&#x2019;s performance may be robust to data errors, but the model signals its increasing lack of confidence appropriately.</p></caption><graphic alt-version="no" mimetype="image" position="float" xlink:type="simple" xlink:href="jmir_v28i1e83903_fig03.png"/></fig><p>This vignette demonstrates that extrinsic trust can be embedded through design choices that make reliability, scope, and uncertainty measurable. It shows how the trust promises can be translated into clinical evidence, allowing health systems to evaluate, monitor, and govern AI as accountable decision support tools.</p></sec><sec id="s9"><title>Health System Implications and Future Directions</title><p>Embedding extrinsic trust requires health systems to translate the 3 contractual promises into routine governance and clinical practice. In practical terms, this means requiring calibration, subgroup performance, and uncertainty evidence at procurement; defining verification roles within clinical safety structures; and specifying escalation actions when promises are breached.</p><p>Evidence of trust must be accessible and actionable at the point of care.
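The percentile-based escalation rule described for the shift and uncertainty promise can be sketched in a few lines; the function names and negative log-likelihood values below are illustrative assumptions, not the PrISM implementation.

```python
import numpy as np

def abstention_threshold(val_nll, pct=95.0):
    """Derive the abstention cutoff from validation-set uncertainty:
    cases above this negative log-likelihood percentile are escalated."""
    return np.percentile(val_nll, pct)

def route_case(case_nll, threshold):
    """Abstain (route to clinician review) when uncertainty exceeds the cutoff."""
    return "abstain" if case_nll > threshold else "recommend"

# hypothetical validation NLLs; in practice these come from the held-out set
val_nll = np.array([0.20, 0.25, 0.28, 0.30, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60])
cutoff = abstention_threshold(val_nll)
print(route_case(0.90, cutoff), route_case(0.30, cutoff))  # prints: abstain recommend

# abstention rate over a deployment batch, monitored longitudinally as a safety signal
batch_nll = np.array([0.25, 0.70, 0.30, 0.95])
rate = np.mean(batch_nll > cutoff)
```

A rising abstention rate over successive batches is itself a drift signal, which is why the text proposes monitoring it longitudinally rather than only per case.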
Simple signals such as confidence or abstention indicators can reduce cognitive burden for frontline clinicians, while dashboards can surface detailed calibration, equity, and drift metrics for governance teams. When systems defer under uncertainty, clear escalation pathways are needed to avoid automation bias or clinician deskilling and preserve human oversight.</p><p>Looking ahead, this contractual framing aligns with regulatory shifts toward continuous assurance rather than one-time approval. By treating reliability, scope and equity, and shift and uncertainty as enforceable commitments, health systems gain a lightweight but robust mechanism to govern clinical AI across its life cycle as real-world conditions evolve.</p></sec><sec id="s10" sec-type="conclusions"><title>Conclusions</title><p>Extrinsic trust is the decisive factor that determines whether clinical AI remains experimental or becomes a sustainable component of care. While intrinsic trust, supported by interpretability, enables clinician engagement, adoption in safety-critical settings depends on structured evidence of performance under real-world conditions.</p><p>By framing extrinsic trust as a contractual relationship defined by 3 promises of reliability, scope and equity, and shift and uncertainty, this Viewpoint provides a practical lens for evaluating, governing, and sustaining clinical AI systems. The PrISM vignette demonstrates that these promises can be operationalized through calibration, subgroup validation, and uncertainty-aware escalation. Together, this approach shifts trust from an abstract aspiration to a testable, accountable basis for responsible clinical AI deployment.</p></sec></body><back><ack><p>The author would like to express gratitude to the researchers and staff of Internetpsykiatrien, Denmark&#x2019;s national internet-based mental health service, whose conversations and interactions helped inspire this viewpoint. 
Generative artificial intelligence was used to improve clarity and language in some author-written paragraphs.</p></ack><notes><sec><title>Funding</title><p>The author is funded by Innovation Fund Denmark.</p></sec><sec><title>Data Availability</title><p>The datasets generated or analyzed during this study are not publicly available due to ethical and legal constraints surrounding patient data but can be made available from the corresponding author on reasonable request.</p></sec></notes><fn-group><fn fn-type="con"><p>AK conceived the viewpoint, developed the models and results, and wrote the manuscript.</p></fn><fn fn-type="conflict"><p>None declared.</p></fn></fn-group><glossary><title>Abbreviations</title><def-list><def-item><term id="abb1">AI</term><def><p>artificial intelligence</p></def></def-item><def-item><term id="abb2">PrISM</term><def><p>Probabilistic Integrated Scoring Model</p></def></def-item><def-item><term id="abb3">SCE</term><def><p>static calibration error</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="ref1"><label>1</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Alowais</surname><given-names>SA</given-names> </name><name name-style="western"><surname>Alghamdi</surname><given-names>SS</given-names> </name><name name-style="western"><surname>Alsuhebany</surname><given-names>N</given-names> </name><etal/></person-group><article-title>Revolutionizing healthcare: the role of artificial intelligence in clinical practice</article-title><source>BMC Med Educ</source><year>2023</year><month>09</month><day>22</day><volume>23</volume><issue>1</issue><fpage>689</fpage><pub-id pub-id-type="doi">10.1186/s12909-023-04698-z</pub-id><pub-id pub-id-type="medline">37740191</pub-id></nlm-citation></ref><ref id="ref2"><label>2</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name 
name-style="western"><surname>Afroogh</surname><given-names>S</given-names> </name><name name-style="western"><surname>Akbari</surname><given-names>A</given-names> </name><name name-style="western"><surname>Malone</surname><given-names>E</given-names> </name><name name-style="western"><surname>Kargar</surname><given-names>M</given-names> </name><name name-style="western"><surname>Alambeigi</surname><given-names>H</given-names> </name></person-group><article-title>Trust in AI: progress, challenges, and future directions</article-title><source>Humanit Soc Sci Commun</source><year>2024</year><volume>11</volume><issue>1</issue><fpage>1568</fpage><pub-id pub-id-type="doi">10.1057/s41599-024-04044-8</pub-id></nlm-citation></ref><ref id="ref3"><label>3</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Hassan</surname><given-names>M</given-names> </name><name name-style="western"><surname>Kushniruk</surname><given-names>A</given-names> </name><name name-style="western"><surname>Borycki</surname><given-names>E</given-names> </name></person-group><article-title>Barriers to and facilitators of artificial intelligence adoption in health care: scoping review</article-title><source>JMIR Hum Factors</source><year>2024</year><month>08</month><day>29</day><volume>11</volume><fpage>e48633</fpage><pub-id pub-id-type="doi">10.2196/48633</pub-id><pub-id pub-id-type="medline">39207831</pub-id></nlm-citation></ref><ref id="ref4"><label>4</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Choung</surname><given-names>H</given-names> </name><name name-style="western"><surname>David</surname><given-names>P</given-names> </name><name name-style="western"><surname>Ross</surname><given-names>A</given-names> </name></person-group><article-title>Trust and ethics in AI</article-title><source>AI 
Soc</source><year>2023</year><month>04</month><volume>38</volume><issue>2</issue><fpage>733</fpage><lpage>745</lpage><pub-id pub-id-type="doi">10.1007/s00146-022-01473-4</pub-id></nlm-citation></ref><ref id="ref5"><label>5</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Shin</surname><given-names>D</given-names> </name></person-group><article-title>The effects of explainability and causability on perception, trust, and acceptance: implications for explainable AI</article-title><source>Int J Hum Comput Stud</source><year>2021</year><month>02</month><volume>146</volume><fpage>102551</fpage><pub-id pub-id-type="doi">10.1016/j.ijhcs.2020.102551</pub-id></nlm-citation></ref><ref id="ref6"><label>6</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Choung</surname><given-names>H</given-names> </name><name name-style="western"><surname>David</surname><given-names>P</given-names> </name><name name-style="western"><surname>Ross</surname><given-names>A</given-names> </name></person-group><article-title>Trust in AI and its role in the acceptance of AI technologies</article-title><source>Int J Hum Comput Interact</source><year>2023</year><month>05</month><day>28</day><volume>39</volume><issue>9</issue><fpage>1727</fpage><lpage>1739</lpage><pub-id pub-id-type="doi">10.1080/10447318.2022.2050543</pub-id></nlm-citation></ref><ref id="ref7"><label>7</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Lipton</surname><given-names>ZC</given-names> </name></person-group><article-title>The mythos of model interpretability</article-title><source>Commun ACM</source><year>2018</year><month>09</month><day>26</day><volume>61</volume><issue>10</issue><fpage>36</fpage><lpage>43</lpage><pub-id pub-id-type="doi">10.1145/3233231</pub-id></nlm-citation></ref><ref 
id="ref8"><label>8</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Kelly</surname><given-names>A</given-names> </name><name name-style="western"><surname>Bhardwaj</surname><given-names>N</given-names> </name><name name-style="western"><surname>Holmberg Sainte-Marie</surname><given-names>TT</given-names> </name><etal/></person-group><article-title>Investigating how clinicians form trust in an AI-based mental health model: qualitative case study</article-title><source>JMIR Hum Factors</source><year>2025</year><month>12</month><day>19</day><volume>12</volume><fpage>e79658</fpage><pub-id pub-id-type="doi">10.2196/79658</pub-id><pub-id pub-id-type="medline">41417472</pub-id></nlm-citation></ref><ref id="ref9"><label>9</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Ali</surname><given-names>S</given-names> </name><name name-style="western"><surname>Abuhmed</surname><given-names>T</given-names> </name><name name-style="western"><surname>El-Sappagh</surname><given-names>S</given-names> </name><etal/></person-group><article-title>Explainable artificial intelligence (XAI): what we know and what is left to attain trustworthy artificial intelligence</article-title><source>Inf Fusion</source><year>2023</year><month>11</month><volume>99</volume><fpage>101805</fpage><pub-id pub-id-type="doi">10.1016/j.inffus.2023.101805</pub-id></nlm-citation></ref><ref id="ref10"><label>10</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Joyce</surname><given-names>DW</given-names> </name><name name-style="western"><surname>Kormilitzin</surname><given-names>A</given-names> </name><name name-style="western"><surname>Smith</surname><given-names>KA</given-names> </name><name name-style="western"><surname>Cipriani</surname><given-names>A</given-names> 
</name></person-group><article-title>Explainable artificial intelligence for mental health through transparency and interpretability for understandability</article-title><source>NPJ Digit Med</source><year>2023</year><month>01</month><day>18</day><volume>6</volume><issue>1</issue><fpage>6</fpage><pub-id pub-id-type="doi">10.1038/s41746-023-00751-9</pub-id><pub-id pub-id-type="medline">36653524</pub-id></nlm-citation></ref><ref id="ref11"><label>11</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Jacovi</surname><given-names>A</given-names> </name><name name-style="western"><surname>Marasovi&#x0107;</surname><given-names>A</given-names> </name><name name-style="western"><surname>Miller</surname><given-names>T</given-names> </name><name name-style="western"><surname>Goldberg</surname><given-names>Y</given-names> </name></person-group><article-title>Formalizing trust in artificial intelligence: prerequisites, causes and goals of human trust in AI</article-title><conf-name>FAccT &#x2019;21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</conf-name><conf-date>Mar 3-10, 2021</conf-date><pub-id pub-id-type="doi">10.1145/3442188.3445923</pub-id></nlm-citation></ref><ref id="ref12"><label>12</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Kompa</surname><given-names>B</given-names> </name><name name-style="western"><surname>Snoek</surname><given-names>J</given-names> </name><name name-style="western"><surname>Beam</surname><given-names>AL</given-names> </name></person-group><article-title>Second opinion needed: communicating uncertainty in medical machine learning</article-title><source>NPJ Digit Med</source><year>2021</year><month>01</month><day>5</day><volume>4</volume><issue>1</issue><fpage>4</fpage><pub-id pub-id-type="doi">10.1038/s41746-020-00367-3</pub-id><pub-id 
pub-id-type="medline">33402680</pub-id></nlm-citation></ref><ref id="ref13"><label>13</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Malinin</surname><given-names>A</given-names> </name><name name-style="western"><surname>Gales</surname><given-names>M</given-names> </name></person-group><article-title>Predictive uncertainty estimation via prior networks</article-title><conf-name>NIPS&#x2019;18: Proceedings of the 32nd International Conference on Neural Information Processing Systems</conf-name><conf-date>Dec 3-8, 2018</conf-date><pub-id pub-id-type="doi">10.5555/3327757.3327808</pub-id></nlm-citation></ref><ref id="ref14"><label>14</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Begoli</surname><given-names>E</given-names> </name><name name-style="western"><surname>Bhattacharya</surname><given-names>T</given-names> </name><name name-style="western"><surname>Kusnezov</surname><given-names>D</given-names> </name></person-group><article-title>The need for uncertainty quantification in machine-assisted medical decision making</article-title><source>Nat Mach Intell</source><year>2019</year><volume>1</volume><issue>1</issue><fpage>20</fpage><lpage>23</lpage><pub-id pub-id-type="doi">10.1038/s42256-018-0004-1</pub-id></nlm-citation></ref><ref id="ref15"><label>15</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Miotto</surname><given-names>R</given-names> </name><name name-style="western"><surname>Li</surname><given-names>L</given-names> </name><name name-style="western"><surname>Kidd</surname><given-names>BA</given-names> </name><name name-style="western"><surname>Dudley</surname><given-names>JT</given-names> </name></person-group><article-title>Deep Patient: an unsupervised representation to predict the future of patients from the electronic health
records</article-title><source>Sci Rep</source><year>2016</year><month>05</month><day>17</day><volume>6</volume><fpage>26094</fpage><pub-id pub-id-type="doi">10.1038/srep26094</pub-id><pub-id pub-id-type="medline">27185194</pub-id></nlm-citation></ref><ref id="ref16"><label>16</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Gawlikowski</surname><given-names>J</given-names> </name><name name-style="western"><surname>Tassi</surname><given-names>CRN</given-names> </name><name name-style="western"><surname>Ali</surname><given-names>M</given-names> </name><etal/></person-group><article-title>A survey of uncertainty in deep neural networks</article-title><source>Artif Intell Rev</source><year>2023</year><month>10</month><volume>56</volume><issue>S1</issue><fpage>1513</fpage><lpage>1589</lpage><pub-id pub-id-type="doi">10.1007/s10462-023-10562-9</pub-id></nlm-citation></ref><ref id="ref17"><label>17</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Riley</surname><given-names>RD</given-names> </name><name name-style="western"><surname>Archer</surname><given-names>L</given-names> </name><name name-style="western"><surname>Snell</surname><given-names>KIE</given-names> </name><etal/></person-group><article-title>Evaluation of clinical prediction models (part 2): how to undertake an external validation study</article-title><source>BMJ</source><year>2024</year><month>01</month><day>15</day><volume>384</volume><fpage>e074820</fpage><pub-id pub-id-type="doi">10.1136/bmj-2023-074820</pub-id><pub-id pub-id-type="medline">38224968</pub-id></nlm-citation></ref><ref id="ref18"><label>18</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Zhu</surname><given-names>F</given-names> </name><name name-style="western"><surname>Zhang</surname><given-names>XY</given-names>
</name><name name-style="western"><surname>Cheng</surname><given-names>Z</given-names> </name><name name-style="western"><surname>Liu</surname><given-names>CL</given-names> </name></person-group><article-title>Revisiting confidence estimation: towards reliable failure prediction</article-title><source>IEEE Trans Pattern Anal Mach Intell</source><year>2024</year><month>05</month><volume>46</volume><issue>5</issue><fpage>3370</fpage><lpage>3387</lpage><pub-id pub-id-type="doi">10.1109/TPAMI.2023.3342285</pub-id><pub-id pub-id-type="medline">38090830</pub-id></nlm-citation></ref><ref id="ref19"><label>19</label><nlm-citation citation-type="other"><person-group person-group-type="author"><name name-style="western"><surname>Wei</surname><given-names>J</given-names> </name><name name-style="western"><surname>Karina</surname><given-names>N</given-names> </name><name name-style="western"><surname>Chung</surname><given-names>HW</given-names> </name><etal/></person-group><article-title>Measuring short-form factuality in large language models</article-title><source>arXiv</source><comment>Preprint posted online on  Nov 7, 2024</comment><pub-id pub-id-type="doi">10.48550/arXiv.2411.04368</pub-id></nlm-citation></ref><ref id="ref20"><label>20</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Hawley</surname><given-names>K</given-names> </name></person-group><article-title>Trust, distrust and commitment</article-title><source>Nous</source><year>2014</year><month>03</month><volume>48</volume><issue>1</issue><fpage>1</fpage><lpage>20</lpage><pub-id pub-id-type="doi">10.1111/nous.12000</pub-id></nlm-citation></ref><ref id="ref21"><label>21</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Tallant</surname><given-names>J</given-names> </name><name name-style="western"><surname>Donati</surname><given-names>D</given-names> 
</name></person-group><article-title>Trust: from the philosophical to the commercial</article-title><source>Philos Manag</source><year>2020</year><month>03</month><volume>19</volume><issue>1</issue><fpage>3</fpage><lpage>19</lpage><pub-id pub-id-type="doi">10.1007/s40926-019-00107-y</pub-id></nlm-citation></ref><ref id="ref22"><label>22</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Sagona</surname><given-names>M</given-names> </name><name name-style="western"><surname>Dai</surname><given-names>T</given-names> </name><name name-style="western"><surname>Macis</surname><given-names>M</given-names> </name><name name-style="western"><surname>Darden</surname><given-names>M</given-names> </name></person-group><article-title>Trust in AI-assisted health systems and AI&#x2019;s trust in humans</article-title><source>NPJ Health Syst</source><year>2025</year><volume>2</volume><issue>1</issue><fpage>10</fpage><pub-id pub-id-type="doi">10.1038/s44401-025-00016-5</pub-id></nlm-citation></ref><ref id="ref23"><label>23</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Nixon</surname><given-names>J</given-names> </name><name name-style="western"><surname>Dusenberry</surname><given-names>MW</given-names> </name><name name-style="western"><surname>Jerfel</surname><given-names>G</given-names> </name><name name-style="western"><surname>Zhang</surname><given-names>L</given-names> </name><name name-style="western"><surname>Tran</surname><given-names>D</given-names> </name></person-group><article-title>Measuring calibration in deep learning</article-title><access-date>2026-02-26</access-date><conf-name>IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019)</conf-name><conf-date>Jun 15-20, 2019</conf-date><comment><ext-link ext-link-type="uri" 
xlink:href="https://openaccess.thecvf.com/content_CVPRW_2019/html/Uncertainty_and_Robustness_in_Deep_Visual_Learning/Nixon_Measuring_Calibration_in_Deep_Learning_CVPRW_2019_paper.html">https://openaccess.thecvf.com/content_CVPRW_2019/html/Uncertainty_and_Robustness_in_Deep_Visual_Learning/Nixon_Measuring_Calibration_in_Deep_Learning_CVPRW_2019_paper.html</ext-link></comment></nlm-citation></ref><ref id="ref24"><label>24</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Ovadia</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Fertig</surname><given-names>E</given-names> </name><name name-style="western"><surname>Ren</surname><given-names>J</given-names> </name><etal/></person-group><article-title>Can you trust your model&#x2019;s uncertainty? Evaluating predictive uncertainty under dataset shift</article-title><conf-name>NIPS&#x2019;19: 33rd International Conference on Neural Information Processing Systems</conf-name><conf-date>Dec 8-14, 2019</conf-date><pub-id pub-id-type="doi">10.5555/3454287.3455541</pub-id></nlm-citation></ref><ref id="ref25"><label>25</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Naeini</surname><given-names>MP</given-names> </name><name name-style="western"><surname>Cooper</surname><given-names>GF</given-names> </name><name name-style="western"><surname>Hauskrecht</surname><given-names>M</given-names> </name></person-group><article-title>Obtaining well calibrated probabilities using Bayesian binning</article-title><source>Proc AAAI Conf Artif Intell</source><year>2015</year><month>01</month><volume>2015</volume><fpage>2901</fpage><lpage>2907</lpage><comment><ext-link ext-link-type="uri" xlink:href="https://pubmed.ncbi.nlm.nih.gov/25927013/">https://pubmed.ncbi.nlm.nih.gov/25927013/</ext-link></comment><pub-id 
pub-id-type="medline">25927013</pub-id></nlm-citation></ref><ref id="ref26"><label>26</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Guo</surname><given-names>C</given-names> </name><name name-style="western"><surname>Pleiss</surname><given-names>G</given-names> </name><name name-style="western"><surname>Sun</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Weinberger</surname><given-names>KQ</given-names> </name></person-group><article-title>On calibration of modern neural networks</article-title><access-date>2026-02-26</access-date><conf-name>34th International Conference on Machine Learning (ICML 2017)</conf-name><conf-date>Aug 6-11, 2017</conf-date><comment><ext-link ext-link-type="uri" xlink:href="https://proceedings.mlr.press/v70/guo17a.html">https://proceedings.mlr.press/v70/guo17a.html</ext-link></comment></nlm-citation></ref><ref id="ref27"><label>27</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Laves</surname><given-names>MH</given-names> </name><name name-style="western"><surname>Ihler</surname><given-names>S</given-names> </name><name name-style="western"><surname>Kortmann</surname><given-names>KP</given-names> </name><name name-style="western"><surname>Ortmaier</surname><given-names>T</given-names> </name></person-group><article-title>Well-calibrated model uncertainty with temperature scaling for dropout variational inference</article-title><access-date>2026-02-26</access-date><conf-name>4th Workshop on Bayesian Deep Learning (NeurIPS 2019)</conf-name><conf-date>Dec 13-14, 2019</conf-date><comment><ext-link ext-link-type="uri" xlink:href="https://www.bayesiandeeplearning.org/2019/papers/77.pdf">https://www.bayesiandeeplearning.org/2019/papers/77.pdf</ext-link></comment></nlm-citation></ref><ref id="ref28"><label>28</label><nlm-citation 
citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Hicks</surname><given-names>SA</given-names> </name><name name-style="western"><surname>Str&#x00FC;mke</surname><given-names>I</given-names> </name><name name-style="western"><surname>Thambawita</surname><given-names>V</given-names> </name><etal/></person-group><article-title>On evaluation metrics for medical applications of artificial intelligence</article-title><source>Sci Rep</source><year>2022</year><month>04</month><day>8</day><volume>12</volume><issue>1</issue><fpage>5979</fpage><pub-id pub-id-type="doi">10.1038/s41598-022-09954-8</pub-id><pub-id pub-id-type="medline">35395867</pub-id></nlm-citation></ref><ref id="ref29"><label>29</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>James</surname><given-names>G</given-names> </name><name name-style="western"><surname>Witten</surname><given-names>D</given-names> </name><name name-style="western"><surname>Hastie</surname><given-names>T</given-names> </name><name name-style="western"><surname>Tibshirani</surname><given-names>R</given-names> </name><name name-style="western"><surname>Taylor</surname><given-names>J</given-names> </name></person-group><source>An Introduction to Statistical Learning: With Applications in Python</source><year>2023</year><publisher-name>Springer</publisher-name><comment><ext-link ext-link-type="uri" xlink:href="https://link.springer.com/book/10.1007/978-3-031-38747-0">https://link.springer.com/book/10.1007/978-3-031-38747-0</ext-link></comment><pub-id pub-id-type="doi">10.1007/978-3-031-38747-0</pub-id></nlm-citation></ref><ref id="ref30"><label>30</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Gilbert</surname><given-names>S</given-names> </name><name name-style="western"><surname>Adler</surname><given-names>R</given-names> </name><name 
name-style="western"><surname>Holoyad</surname><given-names>T</given-names> </name><name name-style="western"><surname>Weicken</surname><given-names>E</given-names> </name></person-group><article-title>Could transparent model cards with layered accessible information drive trust and safety in health AI?</article-title><source>NPJ Digit Med</source><year>2025</year><month>02</month><day>25</day><volume>8</volume><issue>1</issue><fpage>124</fpage><pub-id pub-id-type="doi">10.1038/s41746-025-01482-9</pub-id><pub-id pub-id-type="medline">40000736</pub-id></nlm-citation></ref><ref id="ref31"><label>31</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Ku</surname><given-names>WL</given-names> </name><name name-style="western"><surname>Min</surname><given-names>H</given-names> </name></person-group><article-title>Evaluating machine learning stability in predicting depression and anxiety amidst subjective response errors</article-title><source>Healthcare (Basel)</source><year>2024</year><month>03</month><day>10</day><volume>12</volume><issue>6</issue><fpage>625</fpage><pub-id pub-id-type="doi">10.3390/healthcare12060625</pub-id><pub-id pub-id-type="medline">38540589</pub-id></nlm-citation></ref><ref id="ref32"><label>32</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Lakshminarayanan</surname><given-names>B</given-names> </name><name name-style="western"><surname>Pritzel</surname><given-names>A</given-names> </name><name name-style="western"><surname>Blundell</surname><given-names>C</given-names> </name></person-group><article-title>Simple and scalable predictive uncertainty estimation using deep ensembles</article-title><conf-name>31st Conference on Neural Information Processing Systems (NIPS 2017)</conf-name><conf-date>Dec 4-9, 2017</conf-date><pub-id pub-id-type="doi">10.5555/3295222.3295387</pub-id></nlm-citation></ref><ref 
id="ref33"><label>33</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Rasmussen</surname><given-names>CE</given-names> </name><name name-style="western"><surname>Qui&#x00F1;onero-Candela</surname><given-names>J</given-names> </name><name name-style="western"><surname>Sinz</surname><given-names>F</given-names> </name><name name-style="western"><surname>Bousquet</surname><given-names>O</given-names> </name><name name-style="western"><surname>Sch&#x00F6;lkopf</surname><given-names>B</given-names> </name></person-group><person-group person-group-type="editor"><name name-style="western"><surname>Qui&#x00F1;onero-Candela</surname><given-names>J</given-names> </name><name name-style="western"><surname>Dagan</surname><given-names>I</given-names> </name><name name-style="western"><surname>Magnini</surname><given-names>B</given-names> </name><name name-style="western"><surname>d&#x2019;Alch&#x00E9;-Buc</surname><given-names>F</given-names> </name></person-group><article-title>Evaluating predictive uncertainty challenge</article-title><source>Machine Learning Challenges Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment</source><year>2006</year><publisher-name>Springer</publisher-name><fpage>1</fpage><lpage>27</lpage><pub-id pub-id-type="doi">10.1007/11736790_1</pub-id><pub-id pub-id-type="other">9783540334279</pub-id></nlm-citation></ref><ref id="ref34"><label>34</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>He</surname><given-names>W</given-names> </name><name name-style="western"><surname>Jiang</surname><given-names>Z</given-names> </name><name name-style="western"><surname>Xiao</surname><given-names>T</given-names> </name><name name-style="western"><surname>Xu</surname><given-names>Z</given-names> </name><name name-style="western"><surname>Li</surname><given-names>Y</given-names> 
</name></person-group><article-title>A survey on uncertainty quantification methods for deep learning</article-title><source>ACM Comput Surv</source><year>2023</year><access-date>2026-02-27</access-date><volume>37</volume><issue>4</issue><fpage>111</fpage><comment><ext-link ext-link-type="uri" xlink:href="https://www.jiangteam.org/papers/ACM_CSUR_UQ_DeepLearningSurvey.pdf">https://www.jiangteam.org/papers/ACM_CSUR_UQ_DeepLearningSurvey.pdf</ext-link></comment></nlm-citation></ref><ref id="ref35"><label>35</label><nlm-citation citation-type="web"><article-title>Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act) (Text with EEA relevance)</article-title><source>European Union</source><year>2024</year><access-date>2026-02-27</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng">https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng</ext-link></comment></nlm-citation></ref><ref id="ref36"><label>36</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>van Kolfschooten</surname><given-names>H</given-names> </name><name name-style="western"><surname>van Oirschot</surname><given-names>J</given-names> </name></person-group><article-title>The EU Artificial Intelligence Act (2024): implications for healthcare</article-title><source>Health Policy</source><year>2024</year><month>11</month><volume>149</volume><fpage>105152</fpage><pub-id pub-id-type="doi">10.1016/j.healthpol.2024.105152</pub-id><pub-id pub-id-type="medline">39244818</pub-id></nlm-citation></ref><ref id="ref37"><label>37</label><nlm-citation 
citation-type="web"><article-title>Predetermined change control plans for medical devices</article-title><source>U.S. Food and Drug Administration</source><year>2024</year><access-date>2026-02-27</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://www.fda.gov/regulatory-information/search-fda-guidance-documents/predetermined-change-control-plans-medical-devices">https://www.fda.gov/regulatory-information/search-fda-guidance-documents/predetermined-change-control-plans-medical-devices</ext-link></comment></nlm-citation></ref><ref id="ref38"><label>38</label><nlm-citation citation-type="web"><article-title>ISO/IEC 42001:2023 &#x2014; Information technology &#x2014; Artificial intelligence &#x2014; Management system</article-title><source>International Organization for Standardization</source><year>2023</year><access-date>2026-02-27</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://www.iso.org/standard/42001">https://www.iso.org/standard/42001</ext-link></comment></nlm-citation></ref><ref id="ref39"><label>39</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Kelly</surname><given-names>A</given-names> </name><name name-style="western"><surname>Jensen</surname><given-names>EK</given-names> </name><name name-style="western"><surname>Grua</surname><given-names>EM</given-names> </name><name name-style="western"><surname>Mathiasen</surname><given-names>K</given-names> </name><name name-style="western"><surname>Van de Ven</surname><given-names>P</given-names> </name></person-group><article-title>An interpretable model with probabilistic integrated scoring for mental health treatment prediction: design study</article-title><source>JMIR Med Inform</source><year>2025</year><month>03</month><day>26</day><volume>13</volume><fpage>e64617</fpage><pub-id pub-id-type="doi">10.2196/64617</pub-id><pub-id 
pub-id-type="medline">40138679</pub-id></nlm-citation></ref></ref-list><app-group><supplementary-material id="app1"><label>Multimedia Appendix 1</label><p>Details of the Probabilistic Integrated Scoring Model vignette.</p><media xlink:href="jmir_v28i1e83903_app1.DOCX" xlink:title="DOCX File, 40 KB"/></supplementary-material></app-group></back></article>