<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "journalpublishing.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" article-type="article-commentary"><front><journal-meta><journal-id journal-id-type="nlm-ta">J Med Internet Res</journal-id><journal-id journal-id-type="publisher-id">jmir</journal-id><journal-id journal-id-type="index">1</journal-id><journal-title>Journal of Medical Internet Research</journal-title><abbrev-journal-title>J Med Internet Res</abbrev-journal-title><issn pub-type="epub">1438-8871</issn><publisher><publisher-name>JMIR Publications</publisher-name><publisher-loc>Toronto, Canada</publisher-loc></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">v28i1e102159</article-id><article-id pub-id-type="doi">10.2196/102159</article-id><article-categories><subj-group subj-group-type="heading"><subject>Commentary</subject></subj-group></article-categories><title-group><article-title>Moving From Keywords to Contextual Meaning: A Commentary on Hybrid Bibliometric Synthesis in Health Research</article-title></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name name-style="western"><surname>Zikos</surname><given-names>Dimitrios</given-names></name><degrees>BSN, MSc, PhD</degrees><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff id="aff1"><institution>Department of Healthcare Management and Leadership, College of Health Professions, Texas Tech University Health Sciences Center</institution><addr-line>3601 4th Street</addr-line><addr-line>Lubbock</addr-line><addr-line>TX</addr-line><country>United States</country></aff><contrib-group><contrib contrib-type="editor"><name name-style="western"><surname>Law</surname><given-names>Stephanie</given-names></name></contrib><contrib contrib-type="editor"><name 
name-style="western"><surname>Leung</surname><given-names>Tiffany</given-names></name></contrib></contrib-group><author-notes><corresp>Correspondence to Dimitrios Zikos, BSN, MSc, PhD, Department of Healthcare Management and Leadership, College of Health Professions, Texas Tech University Health Sciences Center, 3601 4th Street, Lubbock, TX, 79430, United States, 1 9894301787; <email>dzikos@ttuhsc.edu</email></corresp></author-notes><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>3</day><month>6</month><year>2026</year></pub-date><volume>28</volume><elocation-id>e102159</elocation-id><history><date date-type="received"><day>22</day><month>05</month><year>2026</year></date><date date-type="accepted"><day>22</day><month>05</month><year>2026</year></date></history><copyright-statement>&#x00A9; Dimitrios Zikos. Originally published in the Journal of Medical Internet Research (<ext-link ext-link-type="uri" xlink:href="https://www.jmir.org">https://www.jmir.org</ext-link>), 3.6.2026. </copyright-statement><copyright-year>2026</copyright-year><license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. 
The complete bibliographic information, a link to the original publication on <ext-link ext-link-type="uri" xlink:href="https://www.jmir.org/">https://www.jmir.org/</ext-link>, as well as this copyright and license information must be included.</p></license><self-uri xlink:type="simple" xlink:href="https://www.jmir.org/2026/1/e102159"/><related-article related-article-type="commentary article" ext-link-type="doi" xlink:href="10.2196/86200" xlink:title="Comment on" xlink:type="simple">https://www.jmir.org/2026/1/e86200</related-article><abstract><p>The rapid growth of social media mining in health research has produced a valuable but fragmented body of literature. As the amount of unstructured patient-reported data grows, traditional bibliometric analyses face methodological limitations, particularly regarding synonym fragmentation and arbitrary parameter selection. In their recent publication, &#x201C;Thematic Mapping and Evolution of Social Media Mining in Health Research: Hybrid Bibliometric Synthesis,&#x201D; Yang and Bohnet-Joschko attempt to address these flaws by introducing a semantic-structural (hybrid) bibliometric framework. This commentary evaluates the methodological innovations of their study and its departure from traditional syntactic keyword-matching tools. By combining citation-informed transformers (SPECTER2) and biomedical language models (PubMedBERT) with dimensionality reduction and density-based clustering, the authors created a reproducible pipeline. Their architecture starts with foundational machine learning to establish statistical validity before moving to large language models for qualitative synthesis. I explain how this transition from syntactic mapping to semantic vector representation addresses known challenges in evidence synthesis, naturally grouping conceptual synonyms without artificially forcing boundaries on the literature. 
Furthermore, I examine the practical implications of their temporal findings. Real-time social media mining applications can complement retrospective reporting and support the evaluation of targeted public health interventions. While this pipeline offers high generalizability across disciplines, it also raises a computational literacy barrier for some researchers, which re-emphasizes the need for data literacy in the health professions. Ultimately, the study offers a transparent approach to informatics, and such mathematically validated frameworks are foundational for the future of evidence-driven public health policy and clinical decision-making.</p></abstract><kwd-group><kwd>bibliometrics</kwd><kwd>machine learning</kwd><kwd>semantic mapping</kwd><kwd>public health surveillance</kwd><kwd>social media</kwd></kwd-group></article-meta></front><body><sec id="s1" sec-type="intro"><title>Introduction</title><p>The rapid growth of social media mining (SMM) in health research has provided new opportunities to study public health. However, the volume of this literature has created a fragmented knowledge base that resists traditional evidence synthesis. In their recent article, &#x201C;Thematic Mapping and Evolution of Social Media Mining in Health Research: Hybrid Bibliometric Synthesis,&#x201D; Yang and Bohnet-Joschko [<xref ref-type="bibr" rid="ref1">1</xref>] go beyond descriptive field mapping and introduce a more semantically focused framework for evidence synthesis by extending traditional bibliometric methods with a hybrid semantic-structural pipeline. Because health care data resources and the knowledge base have become massive and complex, literature synthesis needs to go beyond merely &#x201C;cataloging&#x201D; publications. 
This commentary explores how their methodology attempts to overcome statistical limitations, optimizes analytical architecture, and supports contextually relevant public health applications.</p></sec><sec id="s2"><title>The Limitations of Traditional Keyword Matching</title><p>For over a decade, bibliometric reviews have mostly relied on out-of-the-box platforms like VOSviewer, which can reduce transparency regarding parameter sensitivity and clustering behavior [<xref ref-type="bibr" rid="ref2">2</xref>]. These platforms are built around prebuilt clustering algorithms (such as the Louvain method) that give the researcher limited control over cluster thresholding, network normalization, and parameter sensitivity. Additionally, these traditional methods are constrained by syntactic mapping: they treat keywords, titles, and abstracts of articles as isolated strings of text. For example, traditional tools group papers by vocabulary (linking a Twitter flu paper with a bird flu paper), while the hybrid model groups them by scientific intent. Unless a researcher manually creates a detailed thesaurus, the algorithm fragments the literature based on vocabulary rather than meaning. Similarly, legacy topic modeling methods like latent Dirichlet allocation (LDA) rely on a rigid &#x201C;bag-of-words&#x201D; assumption that ignores word order and context, and they force researchers to predefine the number of topics, artificially &#x201C;bounding&#x201D; the data [<xref ref-type="bibr" rid="ref3">3</xref>].</p><p>Yang and Bohnet-Joschko [<xref ref-type="bibr" rid="ref1">1</xref>] attempted to solve this by moving from syntactic to semantic mapping (<xref ref-type="table" rid="table1">Table 1</xref>). By using SPECTER2 and PubMedBERT embeddings, they take in the full context of a text and convert documents into high-dimensional vectors [<xref ref-type="bibr" rid="ref4">4</xref>,<xref ref-type="bibr" rid="ref5">5</xref>]. 
Because these models encode semantic meaning, they group conceptual synonyms together and make fewer grouping errors than traditional string-matching tools. Furthermore, because SPECTER2 is pretrained on very large amounts of citation data, the approach becomes an interesting hybrid of text mining and citation analysis.</p><table-wrap id="t1" position="float"><label>Table 1.</label><caption><p>Comparison of traditional bibliometric methods to Yang and Bohnet-Joschko&#x2019;s [<xref ref-type="bibr" rid="ref1">1</xref>] pipeline.</p></caption><table id="table1" frame="hsides" rules="groups"><thead><tr><td align="left" valign="bottom">Dimension</td><td align="left" valign="bottom">Traditional bibliometric methods</td><td align="left" valign="bottom">Yang and Bohnet-Joschko&#x2019;s [<xref ref-type="bibr" rid="ref1">1</xref>] hybrid pipeline</td></tr></thead><tbody><tr><td align="left" valign="top">Data processing</td><td align="left" valign="top">Syntactic mapping of isolated text strings</td><td align="left" valign="top">Semantic vector representation using context</td></tr><tr><td align="left" valign="top">Core algorithms</td><td align="left" valign="top">Out-of-the-box platforms (eg, VOSviewer) using Louvain or latent Dirichlet allocation</td><td align="left" valign="top">Citation-informed transformers (SPECTER2), biomedical models (PubMedBERT), UMAP<sup><xref ref-type="table-fn" rid="table1fn1">a</xref></sup>, and HDBSCAN<sup><xref ref-type="table-fn" rid="table1fn2">b</xref></sup></td></tr><tr><td align="left" valign="top">Clustering</td><td align="left" valign="top">Relies on forced boundaries, often requiring predefined topic counts</td><td align="left" valign="top">Density-based clustering that mathematically isolates unassigned noise</td></tr><tr><td align="left" valign="top">Vocabulary handling</td><td align="left" valign="top">Fragments literature based on terminology unless a manual thesaurus is built</td><td align="left" valign="top">Naturally groups 
conceptual synonyms together based on scientific intent</td></tr></tbody></table><table-wrap-foot><fn id="table1fn1"><p><sup>a</sup>UMAP: uniform manifold approximation and projection.</p></fn><fn id="table1fn2"><p><sup>b</sup>HDBSCAN: hierarchical density-based spatial clustering of applications with noise.</p></fn></table-wrap-foot></table-wrap></sec><sec id="s3"><title>Optimization of the Architecture</title><p>The strength of Yang and Bohnet-Joschko&#x2019;s [<xref ref-type="bibr" rid="ref1">1</xref>] methodology lies in the sequencing of its analytical architecture. In the current era of artificial intelligence, it is tempting to feed raw data directly into a large language model. The study argues for an emerging informatics principle: structurally validated machine learning pipelines should precede large language model&#x2013;assisted interpretation.</p><p>Yang and Bohnet-Joschko [<xref ref-type="bibr" rid="ref1">1</xref>] validate their data structure using uniform manifold approximation and projection (UMAP) for dimensionality reduction [<xref ref-type="bibr" rid="ref6">6</xref>] and hierarchical density-based spatial clustering of applications with noise (HDBSCAN) for structural clustering [<xref ref-type="bibr" rid="ref7">7</xref>]. Unlike LDA, HDBSCAN does not require a predefined cluster count. It identifies naturally occurring dense regions of literature and mathematically isolates outlier papers rather than forcing them into irrelevant groups. In this way, the subsequent thematic synthesis is grounded in reproducible data science.</p></sec><sec id="s4"><title>From Methodology to Public Health Practice</title><p>Beyond the methodological advantages, the study&#x2019;s [<xref ref-type="bibr" rid="ref1">1</xref>] temporal slicing shows SMM moving from computational experimentation to real-world public health engagement. 
The prominence of application-driven clusters, such as infodemiology and sociopsychological determinants, illustrates how SMM can be a useful tool for community health surveillance [<xref ref-type="bibr" rid="ref8">8</xref>]. Accurate SMM is essential for evaluating localized, real-world health initiatives. For instance, evaluating colorectal cancer screening and prevention campaigns in Texas requires an understanding of localized, real-time community sentiment, barriers to access, and sociopsychological hesitation. SMM captures this narrative, augmenting structured retrospective federal tools that often lag by months or years. Such real-time sentiment surveillance may also support decision-making for health systems, including targeted outreach, misinformation monitoring, and fast evaluation of intervention uptake.</p><p>Furthermore, Yang and Bohnet-Joschko&#x2019;s [<xref ref-type="bibr" rid="ref1">1</xref>] treatment of HDBSCAN&#x2019;s unassigned noise (cluster &#x2212;1) is a very reasonable analytical choice. Rather than dismissing this noise, they identify it as a candidate &#x201C;incubator pool.&#x201D; I believe that embracing these topics can help researchers who want to predict the next wave of sociopsychological determinants before they become established literature.</p></sec><sec id="s5"><title>Limitations and Future Work</title><p>Despite its advantages over legacy systems, this pipeline introduces some challenges. The transition from graphical user interface&#x2013;based tools to code-based learning models introduces a computational barrier to entry. There is a risk that evidence synthesis becomes gated behind advanced data science skills. 
Furthermore, while Yang and Bohnet-Joschko&#x2019;s [<xref ref-type="bibr" rid="ref1">1</xref>] dual-level validation supports semantic coherence, the study lacks an empirical comparison against older baselines (such as LDA or Louvain clustering) on the same dataset, which would be necessary to quantify the reduction in fragmentation. Additionally, future validation across multiple bibliographic databases (eg, PubMed, Scopus, and Web of Science) would help determine the stability of the pipeline.</p><p>Researchers must be careful not to treat these new machine learning algorithms as inherently opaque systems. While HDBSCAN eliminates the need to guess cluster counts, it introduces sensitivity to the minimum cluster size and minimum samples parameters. Similarly, UMAP is dependent on its number-of-neighbors and minimum-distance parameters. Future researchers who adopt this methodology will need to optimize and report their parameter selections so that their underlying models suit their specific discipline. As a researcher and instructor, I would like to stress that addressing this requires a shift in health informatics education toward teaching algorithmic logic and mechanics, not just software operation.</p></sec><sec id="s6" sec-type="conclusions"><title>Conclusion</title><p>Yang and Bohnet-Joschko&#x2019;s [<xref ref-type="bibr" rid="ref1">1</xref>] study successfully moved from syntactic keyword matching to semantic vector representation in an attempt to avoid the limitations of traditional bibliometrics. More importantly, they provided a transparent, reproducible blueprint for future studies. 
As the volume of medical literature and patient-reported data continues to increase, adopting machine learning pipelines is an imperative for any research designed to extract evidence-driven insights to guide patient care and public health policy.</p></sec></body><back><ack><p>Generative artificial intelligence was not used in any capacity during the preparation, writing, or editing of this manuscript.</p></ack><notes><sec><title>Funding</title><p>The author declared no financial support was received for this work.</p></sec></notes><fn-group><fn fn-type="conflict"><p>None declared.</p></fn></fn-group><glossary><title>Abbreviations</title><def-list><def-item><term id="abb1">HDBSCAN</term><def><p>hierarchical density-based spatial clustering of applications with noise</p></def></def-item><def-item><term id="abb2">LDA</term><def><p>latent Dirichlet allocation</p></def></def-item><def-item><term id="abb3">SMM</term><def><p>social media mining</p></def></def-item><def-item><term id="abb4">UMAP</term><def><p>uniform manifold approximation and projection</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="ref1"><label>1</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Yang</surname><given-names>MJ</given-names> </name><name name-style="western"><surname>Bohnet-Joschko</surname><given-names>S</given-names> </name></person-group><article-title>Thematic mapping and evolution of social media mining in health research: hybrid bibliometric synthesis</article-title><source>J Med Internet Res</source><year>2026</year><month>05</month><day>8</day><volume>28</volume><fpage>e86200</fpage><pub-id pub-id-type="doi">10.2196/86200</pub-id><pub-id pub-id-type="medline">42115141</pub-id></nlm-citation></ref><ref id="ref2"><label>2</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>van 
Eck</surname><given-names>NJ</given-names> </name><name name-style="western"><surname>Waltman</surname><given-names>L</given-names> </name></person-group><article-title>Software survey: VOSviewer, a computer program for bibliometric mapping</article-title><source>Scientometrics</source><year>2010</year><month>08</month><volume>84</volume><issue>2</issue><fpage>523</fpage><lpage>538</lpage><pub-id pub-id-type="doi">10.1007/s11192-009-0146-3</pub-id><pub-id pub-id-type="medline">20585380</pub-id></nlm-citation></ref><ref id="ref3"><label>3</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Blei</surname><given-names>DM</given-names> </name><name name-style="western"><surname>Ng</surname><given-names>AY</given-names> </name><name name-style="western"><surname>Jordan</surname><given-names>MI</given-names> </name></person-group><article-title>Latent Dirichlet allocation</article-title><source>J Machine Learning Res</source><year>2003</year><access-date>2026-06-01</access-date><volume>3</volume><fpage>993</fpage><lpage>1022</lpage><comment><ext-link ext-link-type="uri" xlink:href="https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf</ext-link></comment></nlm-citation></ref><ref id="ref4"><label>4</label><nlm-citation citation-type="other"><person-group person-group-type="author"><name name-style="western"><surname>Cohan</surname><given-names>A</given-names> </name><name name-style="western"><surname>Feldman</surname><given-names>S</given-names> </name><name name-style="western"><surname>Beltagy</surname><given-names>I</given-names> </name><name name-style="western"><surname>Downey</surname><given-names>D</given-names> </name><name name-style="western"><surname>Weld</surname><given-names>DS</given-names> </name></person-group><article-title>SPECTER: document-level representation learning using citation-informed 
transformers</article-title><source>arXiv</source><comment>Preprint posted online on  Apr 15, 2022</comment><pub-id pub-id-type="doi">10.48550/arXiv.2004.07180</pub-id></nlm-citation></ref><ref id="ref5"><label>5</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Gu</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Tinn</surname><given-names>R</given-names> </name><name name-style="western"><surname>Cheng</surname><given-names>H</given-names> </name><etal/></person-group><article-title>Domain-specific language model pretraining for biomedical natural language processing</article-title><source>ACM Trans Comput Healthcare</source><year>2022</year><month>01</month><day>31</day><volume>3</volume><issue>1</issue><fpage>1</fpage><lpage>23</lpage><pub-id pub-id-type="doi">10.1145/3458754</pub-id></nlm-citation></ref><ref id="ref6"><label>6</label><nlm-citation citation-type="other"><person-group person-group-type="author"><name name-style="western"><surname>McInnes</surname><given-names>L</given-names> </name><name name-style="western"><surname>Healy</surname><given-names>J</given-names> </name><name name-style="western"><surname>Melville</surname><given-names>J</given-names> </name></person-group><article-title>UMAP: uniform manifold approximation and projection for dimension reduction</article-title><source>arXiv</source><comment>Preprint posted online on  Feb 9, 2018</comment><pub-id pub-id-type="doi">10.48550/arXiv.1802.03426</pub-id></nlm-citation></ref><ref id="ref7"><label>7</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Campello</surname><given-names>R</given-names> </name><name name-style="western"><surname>Moulavi</surname><given-names>D</given-names> </name><name name-style="western"><surname>Sander</surname><given-names>J</given-names> </name></person-group><person-group 
person-group-type="editor"><name name-style="western"><surname>Pei</surname><given-names>J</given-names> </name><name name-style="western"><surname>Tseng</surname><given-names>VS</given-names> </name><name name-style="western"><surname>Cao</surname><given-names>L</given-names> </name><name name-style="western"><surname>Motoda</surname><given-names>H</given-names> </name><name name-style="western"><surname>Xu</surname><given-names>G</given-names> </name></person-group><article-title>Density-based clustering based on hierarchical density estimates</article-title><source>Advances in Knowledge Discovery and Data Mining: 17th Pacific-Asia Conference, PAKDD 2013, Gold Coast, Australia, April 14-17, 2013, Proceedings, Part II</source><year>2013</year><fpage>160</fpage><lpage>172</lpage><pub-id pub-id-type="doi">10.1007/978-3-642-37456-2_14</pub-id></nlm-citation></ref><ref id="ref8"><label>8</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Eysenbach</surname><given-names>G</given-names> </name></person-group><article-title>Infodemiology and infoveillance tracking online health information and cyberbehavior for public health</article-title><source>Am J Prev Med</source><year>2011</year><month>05</month><volume>40</volume><issue>5 Suppl 2</issue><fpage>S154</fpage><lpage>S158</lpage><pub-id pub-id-type="doi">10.1016/j.amepre.2011.02.006</pub-id><pub-id pub-id-type="medline">21521589</pub-id></nlm-citation></ref></ref-list></back></article>