Background: Artificial intelligence (AI) has advanced substantially in recent years, transforming many industries and improving the way people live and work. In scientific research, AI can enhance the quality and efficiency of data analysis and publication. However, AI has also opened up the possibility of generating high-quality fraudulent papers that are difficult to detect, raising important questions about the integrity of scientific research and the trustworthiness of published papers.
Objective: The aim of this study was to investigate the capabilities of current AI language models in generating high-quality fraudulent medical articles. We hypothesized that modern AI models can create highly convincing fraudulent papers that can easily deceive readers and even experienced researchers.
Methods: This proof-of-concept study used ChatGPT (Chat Generative Pre-trained Transformer) powered by the GPT-3 (Generative Pre-trained Transformer 3) language model to generate a fraudulent scientific article related to neurosurgery. GPT-3 is a large language model developed by OpenAI that uses deep learning algorithms to generate human-like text in response to prompts given by users. The model was trained on a massive corpus of text from the internet and is capable of generating high-quality text in a variety of languages and on various topics. The authors posed questions and prompts to the model and refined them iteratively as the model generated the responses. The goal was to create a completely fabricated article including the abstract, introduction, material and methods, discussion, references, charts, etc. Once the article was generated, it was reviewed for accuracy and coherence by experts in the fields of neurosurgery, psychiatry, and statistics and compared to existing similar articles.
Results: The study found that the AI language model could create a highly convincing fraudulent article that resembled a genuine scientific paper in terms of word usage, sentence structure, and overall composition. The AI-generated article included standard sections such as introduction, material and methods, results, and discussion, as well as a data sheet. It consisted of 1992 words and 17 citations, and the whole process of article creation took approximately 1 hour without any special training of the human user. However, some concerns and specific mistakes were identified in the generated article, particularly in the references.
Conclusions: The study demonstrates the potential of current AI language models to generate completely fabricated scientific articles. Although the papers look sophisticated and seemingly flawless, expert readers may identify semantic inaccuracies and errors upon closer inspection. We highlight the need for increased vigilance and better detection methods to combat the potential misuse of AI in scientific research. At the same time, it is important to recognize the potential benefits of using AI language models in genuine scientific writing and research, such as manuscript preparation and language editing.
Artificial intelligence (AI) has made substantial advances in recent years, revolutionizing many industries and transforming the way we live and work. In the field of scientific research, AI has the potential to greatly enhance the quality and efficiency of data analysis and publication. However, as with any powerful technology, there is also a dark side to AI that has the potential to cause harm (see the accompanying AI-generated visual representation of this theme).
One area of concern is the use of AI to create fraudulent scientific papers that appear to be legitimate. Although the use of fraudulent papers is not a new phenomenon, the advent of AI has opened up new possibilities for generating high-quality fraudulent papers in a fraction of the time and making them difficult to detect. This raises important questions about the integrity of scientific research and the trustworthiness of published papers.
Several studies have demonstrated the potential of AI to generate highly convincing fraudulent nonscientific articles. For instance, in a recent experiment, researchers used an AI language model called GPT-2 (Generative Pre-trained Transformer 2) to generate a fake news article that was accepted for publication by a well-known web-based magazine without being detected as fraudulent. Similarly, in a study investigating the capabilities of AI language models in generating scientific abstracts, researchers found that the generated abstracts were often indistinguishable from real abstracts and could even fool human reviewers. To the best of our knowledge, no paper has so far reported on fabricating a whole scientific article using AI.
The aim of this study was to investigate the capabilities of current AI language models in generating high-quality fraudulent medical articles. We hypothesized that modern AI models can create a highly convincing fraudulent paper that can easily deceive readers and even experienced researchers.
We used the web-based chat interface ChatGPT (Chat Generative Pre-trained Transformer; OpenAI Limited Partnership), which relies on the GPT-3 (Generative Pre-trained Transformer 3) language model, to generate a scientific article related to the field of neurosurgery.
ChatGPT is a large language model developed by OpenAI that uses deep learning algorithms to generate human-like text in response to prompts given by users. The model was trained on a massive corpus of text from the internet and is capable of generating high-quality text in a variety of languages and on various topics. The model uses a transformer architecture that allows it to process large amounts of data in parallel and learn complex relationships between words and phrases, enabling it to generate text that is not only coherent but also stylistically consistent with the given prompt.
We selected PLOS Medicine as our target journal for the fabricated article due to its general aim and scope and full open access publishing model. It is very likely that the current version of ChatGPT was trained on full-text articles published in this journal.
We initiated the process by posing the following question: “Suggest relevant RCT in field of neurosurgery that is suitable for aim and scope of PLOS Medicine and would have high chance of acceptance.” The reply was as follows:
One relevant randomized controlled trial (RCT) in the field of neurosurgery that may be suitable for the aim and scope of PLOS Medicine and has a high chance of acceptance is:
Title: “Effectiveness of deep brain stimulation for treatment-resistant depression: a randomized controlled trial”
We requested specific sections of the article, such as the abstract, introduction, methods, results, and discussion, along with appropriate citations. Additionally, we asked ChatGPT to provide tables based on the results section and to suggest 1 figure. As an AI language model, ChatGPT does not have graphic design capabilities, so it provided a data sheet, and we created a chart in Microsoft Excel without making any changes to the input data (see Chart 1 in the fabricated manuscript).
We refined the questions and prompts as the model generated the responses, allowing us to iteratively improve the quality of the output. The whole process of generating the article took no more than 1 hour.
The prompts were as follows:
- Suggest relevant RCT in field of neurosurgery that is suitable for aim and scope of PLOS Medicine and would have high chance of acceptance.
- Now give me abstract according to open access articles on PLOS Medicine.
- Now I want you to make whole article step by step. One section after another section. Give me only introduction section. Use citations by standards of PLOS Medicine. Give me reference list at the end.
- I want you to be more specific. Use scientific language.
- Now give me materials and methods section.
- Now give me detailed results section including patient data.
- Now I need discussion. compare the results with published articles. Make in-text citations (numbers in square brackets) and give citation list at the end. Start numbering of citations from “9”.
- I need the discussion to be longer - at least twice. Compare our study with similar previous studies. Add more citations. Start numbering of citations from “9”.
- Give me all nine references.
- PLOS Medicine want to provide “Author summary”. It should be bullet Why was this study done?
- Give me another two bullets on: What did the researchers do and find?
- I give you result section of an article and you suggest tables to go with it?
- Can you create some charts? Can you provide datasheet for creating charts?
Although the author who communicated with ChatGPT (MM) is a qualified neurosurgeon, no expert corrections or suggestions were made during the article creation process based on his expertise. Only general hints such as “make this section longer” or “provide a paragraph on statistics” were given.
Neurosurgery, Psychiatry, and Statistical Analysis Reviews
Once we had generated a complete article, we reviewed it for accuracy and coherence, comparing it to existing articles in the field and consulting with domain experts (a psychiatrist and a statistician) to ensure that the content was relevant and accurate.
We also used ChatGPT to review the AI-generated article. The prompts were as follows:
- Can you create a review of a scientific article as if you were a reviewer? I want you to mention strengths, weaknesses of the article. Then I want you to suggest, what should be improved. Provide examples.
- I want you to mention strengths, weaknesses of the article.
- I want you to suggest, what should be improved in manuscript. Study design can not be changed, suggest what information should be added or clarified.
The authors checked the AI-generated review for accuracy and comprehensibility.
AI Detection Tools
We used publicly available web-based tools to identify AI-generated text. Specifically, AI Detector by Content at Scale and AI Text Classifier by OpenAI were used.
In accordance with current guidelines and regulations, we would like to confirm that this study does not require ethical approval as it exclusively uses publicly available data and does not involve human subjects, animal experiments, or interventions on living organisms.
The result was an article that consisted of an abstract, a main body with standard sections (introduction, material and methods, results, and discussion), tables, and a chart. The final manuscript included 1992 words and 17 citations. Citations were in the correct format for PLOS Medicine. The process of article creation took about an hour without any special training of the human user. The whole fabricated manuscript is included as a supplementary DOCX file.
Neurosurgery Review of AI-Generated Article
A senior professor of neurosurgery (DN) reviewed the AI-generated article with the following remarks:
Overall, the generated article demonstrated a high level of technical proficiency and authenticity. However, we also identified some concerns and specific mistakes. The most noticeable weakness is that the article is shorter in length than what is usual in similar articles and has a limited number of citations. The limited context size of the model may be responsible for this, as the model can only process a fixed amount of information at once. ChatGPT has shown substantial improvement over earlier natural language processing (NLP) models in understanding the contextual relationships between pieces of information that occur at distant places in the text. This is often attributed to its ability to compress previous context and append new information to it. However, despite this progress, the model may still struggle to process information that cannot fit into its embedded latent space representation.
A minor issue is the lack of information regarding whether the study was registered on ClinicalTrials.gov and the absence of an ethical approval number. Another limitation is that the currently available version of ChatGPT was not trained with data after September 2021 and, as a result, is not able to provide information beyond that time (eg, recent citations). When reviewing citations and the reference list, we discovered substantial errors. Although 9 citations were correct in terms of relevance and reference entry, 8 others were flawed (see the table below for detailed information).
|8||Incorrect DOI^b of citation|
^a Incorrect citations are italicized.
^b DOI: digital object identifier.
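Reference errors of the kind listed above can be partly screened automatically. The following is a minimal sketch, not a method used in this study: it only checks whether a string has the syntactic shape of a DOI, using a commonly cited regular expression for modern DOIs. The function names and the example DOIs are hypothetical. A syntax check cannot tell whether a well-formed DOI actually resolves to the cited work, so it would not by itself catch a plausible but wrong DOI.

```python
import re

# Commonly cited pattern for modern DOIs: "10.", a 4-9 digit registrant
# prefix, a slash, then a suffix drawn from a limited character set.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$")

# Prefixes that often wrap a DOI in reference lists.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip whitespace and a leading resolver URL or 'doi:' label."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def looks_like_doi(raw):
    """Return True if the string is syntactically shaped like a DOI."""
    return bool(DOI_PATTERN.match(normalize_doi(raw)))


# Hypothetical examples, for illustration only.
candidates = [
    "https://doi.org/10.1234/example.2023.001",  # well formed
    "doi:10.1234/abc",                           # well formed
    "11.1234/abc",                               # wrong directory indicator
    "10.12/abc",                                 # registrant prefix too short
]
flags = [looks_like_doi(c) for c in candidates]
```

A fuller check would look each DOI up in a registry and compare the returned title and authors with the reference entry; that step is outside the scope of this sketch.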
Psychiatry Review of AI-Generated Article
A board-certified psychiatrist with interest in deep brain stimulation (M Kasal) reviewed the AI-generated article with the following remarks:
From a psychiatric expert point of view, the study could be considered groundbreaking due to the number of subjects and the double-blind study design, which has not been carried out in such an extensive manner before. The largest sets of similar studies included only 25 subjects without a placebo-controlled group. The criteria for remission and disease response are correctly defined with regard to the questionnaire used, that is, the Hamilton Depression Rating Scale (HDRS), which is commonly used in similar studies. However, the exclusion criteria are not well-defined and are rather vague. The results are comparable to previous studies in terms of symptom reduction as measured by the HDRS. However, the number of responsive patients is substantially higher than the established scientific data to date.
However, several issues in this study need to be addressed. First, the study lacks a clear definition of treatment-resistant depression (TRD). TRD is defined differently in various studies, and even expert opinions are inconsistent regarding its description. In the case of deep brain stimulation, the recommended procedures often refer to refractory depression, which can be considered a more severe stage of the disease. Although the paper mentions verification according to DSM-V (Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition) criteria, it does not provide a specific definition of TRD within this classification. Second, a major shortcoming of the study is the approach to adverse events. Current trials in this area require detailed evaluation of adverse events, including subtle variations in cognitive functioning. However, the study did not evaluate these outcomes.
Statistical Analysis Review of AI-Generated Article
A senior statistician with a medical degree (M Komarc) reviewed the AI-generated article with the following remarks:
The description of the statistical analysis approach was rather brief; however, it was very clearly formulated and included most of the requirements for a standard scientific text. The sample size required for the analyses was supported by a power analysis, and all the proposed statistical tests were well aligned with the purpose of the study (ie, mixed-effect model for a randomized controlled trial using 1 control and 1 experimental group) and the nature or type of the studied variables (ie, chi-square tests for count variables and t tests for continuous variables). The statistical findings were clearly and concisely presented within the text and tables.
However, the produced Table 2 was inconsistent, as it did not contain confidence intervals and displayed different mean values than those presented in the results section, although the mean changes (referring to the test of intervention effectiveness) were the same.
The AI-generated review gave quite accurate remarks regarding the fabricated article, pointed out strengths and weaknesses, and suggested possible changes. Despite the fact that some comments were self-evident (single-center study design and limited follow-up time), there were no substantial errors.
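To illustrate what a power analysis of the kind mentioned in the statistical review involves, the sketch below estimates the sample size per group for a two-sided comparison of two means using the standard normal approximation. It is an illustration only: the function name, the effect size of 0.5, and the default significance level and power are assumptions chosen for the example and are not taken from the fabricated article.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardized mean difference (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# Hypothetical medium effect size, for illustration only.
n_medium = sample_size_per_group(0.5)
```

Because the normal approximation ignores the extra uncertainty of estimating the variance, an exact calculation based on the t distribution typically gives a slightly larger number.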
Detection Tools for AI-Generated Manuscript
There are several publicly available web-based tools to identify AI-generated text. For example, AI Detector by Content at Scale claims to detect patterns and forecast the most probable word choices that lead to a higher AI detection probability. We tested this tool on our AI-generated article; it reported a 48% probability of AI content, that is, an inconclusive result.
Another example of such a tool is AI Text Classifier by OpenAI (the same company that developed ChatGPT). AI Text Classifier reports its result on a 5-level scale: very unlikely, unlikely, unclear if it is, possibly, or likely AI generated. Our AI-generated paper was classified as “unclear.”
Detection Tools for AI-Generated Review
AI Detector by Content at Scale found that the probability of AI content in the ChatGPT-generated review was 72%, that is, “highly likely to be AI generated.” AI Text Classifier by OpenAI classified the ChatGPT-generated review to be “likely AI generated.”
We have demonstrated that AI (ChatGPT) can create a highly convincing medical article that is completely fabricated with limited effort from a human user in about an hour. Nevertheless, the article would need an expert review and some improvements to be ready for possible submission. Shortcomings that are mentioned in the results section do not show any specific pattern; they are rather minor inaccuracies and minor study design flaws. Although a substantial number of citations seemed genuine at first glance, they were later found to have been fabricated. To the best of our knowledge, the errors the AI made were indistinguishable from those that a human could make.
There have been a number of high-profile cases of scientific fraud and misconduct in recent years, including cases where authors have fabricated or manipulated data, plagiarized content, or otherwise misrepresented their findings. Although AI language models such as ChatGPT are a relatively new tool in scientific writing, it is possible that they could be used in similar ways to create fraudulent content.
ChatGPT is a cutting-edge NLP model developed by OpenAI that uses a transformer architecture to generate high-quality text in response to natural language prompts. Similar to other NLP models, ChatGPT works by analyzing large data sets of natural language text to learn patterns and structures in language, which it can then use to generate new text that is both coherent and contextually relevant.
At its core, ChatGPT is a large neural network that is trained on a massive corpus of text data (until the year 2021), such as books, articles, and web-based content. The model consists of multiple layers of self-attention and feedforward neural networks, which allow it to identify and model complex relationships between words and phrases in natural language text.
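The self-attention operation mentioned above can be sketched in a few lines. This is a simplified illustration of scaled dot-product attention, not ChatGPT's actual implementation: real transformer layers apply learned projection matrices to produce the queries, keys, and values, use many attention heads, and operate on much larger vectors. The toy embeddings and function names are hypothetical.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over small lists of vectors."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(wj * v[d] for wj, v in zip(w, values))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs, weights


# Hypothetical 2-dimensional embeddings for three tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, which is how the model relates words that are far apart in the text.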
To generate new text using ChatGPT, a user provides a natural language prompt or question that the model uses to generate a sequence of tokens representing a coherent and contextually relevant response. The length and complexity of the response can be controlled by adjusting the parameters of the model, such as the length of the input prompt and the temperature of the sampling algorithm used to generate the response.
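The effect of the sampling temperature can be illustrated with a short, generic sketch. It assumes the usual formulation in which the model's raw scores (logits) are divided by the temperature before being converted to probabilities; the vocabulary, scores, and function names are hypothetical and do not come from ChatGPT.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(vocab, logits, temperature=1.0, rng=None):
    """Draw one token from the temperature-adjusted distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical next-token scores, for illustration only.
vocab = ["stimulation", "depression", "placebo"]
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, temperature=0.5)  # more deterministic
hot = softmax_with_temperature(logits, temperature=2.0)   # more varied
```

A low temperature concentrates probability on the highest-scoring token, so the output becomes more predictable; a high temperature spreads probability across alternatives, so the output becomes more varied.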
Although ChatGPT is primarily designed for use in conversational AI and chatbot applications, it has also shown promise in a range of other NLP tasks, including text completion, summarization, and machine translation. In recent years, researchers have also begun exploring the potential of ChatGPT and other NLP models for use in scientific writing and research, including generating scientific papers and summarizing research findings.
Some recent studies suggest that ChatGPT and other NLP models have substantial potential for use in scientific writing and research, particularly for tasks that involve summarizing or generating large volumes of text.
Some researchers point out that ChatGPT sometimes writes plausible sounding but incorrect or nonsensical answers and that using it for medical writing still requires human judgment. However, our findings suggest that the level of sophistication required for human input may not be overly complex. An obvious weakness that we encountered in this study is the quality of citations. As technology continues to advance, it is likely that specialized large language models will be developed, reducing their monetary costs and mitigating some of their current limitations.
Interestingly, Kung et al evaluated the performance of ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of 3 exams: Step 1, Step 2 Clinical Knowledge, and Step 3. ChatGPT performed at or near the passing threshold for all 3 exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations.
We are not aware of any specific evidence that ChatGPT has been intentionally misused for fraud in scientific writing, but it is certainly a possibility. Few articles have focused on abstract ghostwriting and its implications for the academic ethics of using AI in manuscript preparation, as well as issues of originality and authorship.
An obvious emerging challenge that publishers are facing is the detection of AI-created text. To address this challenge, many publishers are implementing various tools and techniques. One approach involves using machine learning algorithms to analyze the language, structure, and other features of the text to determine whether it was likely to have been created by a human or a machine. As demonstrated above, the current AI detection tools were unable to detect an AI-generated manuscript. However, in the case of an AI-generated review, these tools were more accurate, labeling the text as “likely” or “highly likely” to have been generated by AI. Another approach to address AI-generated content involves developing ethical guidelines and standards, which can help ensure that AI-generated content is transparent and accountable. Despite these challenges, the use of AI in scientific writing is likely to become increasingly common in the future, and publishers will need to continue to adapt and evolve their approaches to ensure the integrity and quality of their publications. An effective measure to prevent fraud as described in this paper (ie, completely fabricated articles) could be the mandatory submission of data sets, potentially verified by local authorities.
As we mentioned earlier, the ability of AI language models such as ChatGPT to generate coherent and realistic text has raised concerns about the potential for their misuse in creating fraudulent or misleading content. To the best of our knowledge, no paper has so far reported on fabricating a whole scientific article using AI.
In conclusion, our experiment using ChatGPT to generate an authentic looking but completely fabricated scientific paper highlights the potential risks associated with the use of AI in scientific writing. Although current AI language models can generate sophisticated and seemingly flawless papers, expert readers may identify semantic inaccuracies and errors upon closer inspection, particularly in the references.
As AI language models continue to advance in their capabilities, it will become increasingly important to develop ethical guidelines and best practices for their use in scientific writing and research. This may include strategies for verifying the accuracy and authenticity of content generated using these tools, as well as mechanisms for detecting and preventing fraud and misconduct.
At the same time, it is important to recognize the potential benefits of using AI language models in scientific writing and research, such as improving the efficiency and accuracy of document creation, analyzing results, and language editing. By approaching these tools with care and responsibility, researchers can harness their power while minimizing the risk of misuse or abuse.
Ultimately, the future of AI in scientific writing and research will depend on how well we navigate these ethical challenges and leverage the full potential of these tools for the benefit of scientific society.
This study was supported by the Ministry of Defence of the Czech Republic (grant MO1012) and Cooperatio Neuroscience UK, which was funded by Charles University. The funding sources had no impact on the study design, collection, analysis, and interpretation of data; on the writing of the article; or on the decision to submit the article for publication.
We used the generative AI tool ChatGPT to draft a fabricated article and a review. The original ChatGPT transcripts are made available as supplementary DOCX files.
Conflicts of Interest
The artificial intelligence–generated article from ChatGPT (Chat Generative Pre-trained Transformer). DOCX File, 31 KB
Full review conversation with ChatGPT (Chat Generative Pre-trained Transformer). DOCX File, 20 KB
- DALL·E. OpenAI. URL: https://labs.openai.com/s/nrU1jXnMGwdOw0AwkCPtQIN4 [accessed 2023-05-25]
- Liebrenz M, Schleifer R, Buadze A, Bhugra D, Smith A. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. Lancet Digit Health 2023 Mar;5(3):e105-e106. URL: https://boris.unibe.ch/id/eprint/178562
- Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. 2020 Presented at: 34th Conference on Neural Information Processing Systems (NeurIPS 2020); December 6-12, 2020; Vancouver, BC. URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
- Dathathri S, Madotto A, Lan J, Hung J, Frank E, Molino P, et al. Plug and play language models: a simple approach to controlled text generation. arXiv Preprint posted online on December 4, 2019.
- Introducing ChatGPT. OpenAI. URL: https://openai.com/blog/chatgpt [accessed 2023-05-24]
- AI detector. Content at Scale. URL: https://contentatscale.ai/ai-content-detector/ [accessed 2023-05-24]
- AI text classifier. OpenAI. URL: https://platform.openai.com/ai-text-classifier [accessed 2023-05-24]
- Wu Y, Mo J, Sui L, Zhang J, Hu W, Zhang C, et al. Deep brain stimulation in treatment-resistant depression: a systematic review and meta-analysis on efficacy and safety. Front Neurosci 2021 Apr 1;15:655412. URL: https://europepmc.org/abstract/MED/33867929
- Figee M, Riva-Posse P, Choi KS, Bederson L, Mayberg HS, Kopell BH. Deep brain stimulation for depression. Neurotherapeutics 2022 Jul;19(4):1229-1245.
- Gaynes BN, Lux L, Gartlehner G, Asher G, Forman-Hoffman V, Green J, et al. Defining treatment-resistant depression. Depress Anxiety 2020 Feb;37(2):134-145.
- Nato CG, Tabacco L, Bilotta F. Fraud and retraction in perioperative medicine publications: what we learned and what can be implemented to prevent future recurrence. J Med Ethics 2022 Jul;48(7):479-484.
- Chen TJ. ChatGPT and other artificial intelligence applications speed up scientific writing. J Chin Med Assoc 2023 Apr 01;86(4):351-353.
- Kitamura FC. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology 2023 Apr;307(2):e230171.
- Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023 Feb;2(2):e0000198. URL: https://europepmc.org/abstract/MED/36812645
- Else H. Abstracts written by ChatGPT fool scientists. Nature 2023 Jan;613(7944):423.
- Flanagin A, Bibbins-Domingo K, Berkwits M, Christiansen SL. Nonhuman "authors" and implications for the integrity of scientific publication and medical knowledge. JAMA 2023 Feb 28;329(8):637-639.
- Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 2023 Feb;15(2):e35179. URL: https://europepmc.org/abstract/MED/36811129
AI: artificial intelligence
ChatGPT: Chat Generative Pre-trained Transformer
DSM-V: Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition
GPT-2: Generative Pre-trained Transformer 2
GPT-3: Generative Pre-trained Transformer 3
HDRS: Hamilton Depression Rating Scale
NLP: natural language processing
TRD: treatment-resistant depression
USMLE: United States Medical Licensing Exam
Edited by T de Azevedo Cardoso; submitted 02.03.23; peer-reviewed by Y Wang, B Chaudhry, W Yang; comments to author 21.04.23; revised version received 25.04.23; accepted 03.05.23; published 31.05.23
Copyright
©Martin Májovský, Martin Černý, Matěj Kasal, Martin Komarc, David Netuka. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 31.05.2023.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.