The Internet is an invaluable tool for researchers and certainly also a source of inspiration. However, never before has it been so easy to plagiarise the work of others by clipping together (copy & paste) an apparently original paper or review paper from paragraphs on several websites. Moreover, the threshold for stealing ideas, whether lifting paragraphs or perhaps even whole articles from the Internet, seems to be much lower than for copying sections from books or printed articles. In this article, we shall use the term "cyberplagiarism" to describe the case where someone, intentionally or inadvertently, takes information, phrases, or thoughts from the World Wide Web (WWW) and uses them in a scholarly article without attributing their origin.
To illustrate a case of cyberplagiarism and to discuss potential methods of using the Internet to detect scientific misconduct. This report was also written to stimulate debate and thought among journal editors about the use of state-of-the-art technology to fight cyberplagiarism.
A recent incident of cyberplagiarism, which occurred in the Journal of the Royal College of Surgeons of Edinburgh (JRCSEd), is described.
This is the first in-depth report of an incident where significant portions of a web article were lifted into a scholarly article without attribution. In detecting and demonstrating this incident, a tool at www.plagiarism.org proved to be particularly useful. The plagiarism report generated by this tool stated that more than one third (36%) of the JRCSEd article consisted of phrases that were directly copied from multiple websites, without any attribution of this fact.
Cyberplagiarism may be a widespread and increasing problem. Plagiarism could be easily detected by journal editors and peer-reviewers if informatics tools were applied. There is a striking gap between what is technically possible and what is in widespread use. As a consequence of the case described in this report, JMIR has taken the lead in applying information technology to prevent and fight plagiarism by routinely checking new submissions for evidence of cyberplagiarism.
On 5 August 1999, a paper titled "The quality of surgical information on the Internet" (see
The online version of the questionable article, which contained lifted phrases from the web, as published in the Journal of the Royal College of Surgeons of Edinburgh
After publication, it was determined that more than one third (36%) of this article consisted of phrases that were directly copied from multiple websites, without giving attribution to this fact. This can be labelled as plagiarism, which has been defined by the US Committee on Science, Engineering, and Public Policy as "using the ideas or words of another person without giving appropriate credit." The Committee continues by saying that plagiarism is a "strike at the heart of the values on which science is based. These acts of scientific misconduct not only undermine progress but the entire set of values on which the scientific enterprise rests" [
The following is a quick recap of the event: Shortly after publication of the article in question [
But in this case it went beyond just a missed reference, as the authors of the JRCSEd article also took material from the website
The editor in chief of JRCSEd, Professor Oleg Eremin, alerted by the author of this report (G.E.), started an investigation. The editorial board concluded that "there has been a serious infringement of copyright." The electronic version of the article was permanently deleted from the journal website. The author of the plagiarism (C.O.) was asked to write a letter of apology for publication. In a subsequent issue of JRCSEd, the editorial board published a notice stating that parts of the manuscript were identical to online material published at http://medpics.org and announcing the withdrawal of the article [
The Internet, with its vast amount of information at the fingertips of every researcher, makes it easy to lift whole phrases and paragraphs into scholarly articles. This can be a useful strategy to gather material and ideas; such techniques, and also quotes from websites, are certainly legitimate, as long as the sources are acknowledged and quotes are clearly identified as such. As this case shows, researchers are not always successful in quoting properly and may even inadvertently end up committing plagiarism.
Luckily, the Internet can also provide some technical solutions for researchers to identify unintentional omissions of attribution and for journal editors and peer-reviewers to detect and fight plagiarism. Although a number of informatics approaches are conceivable and could be applied routinely, not all of the possible approaches have in fact been realised in the form of commercially available applications, and where they have, they are rarely used by researchers, journal editors, or peer-reviewers. In the following sections, I will review a number of possible approaches (some of which still await translation into software).
One possible approach is to check a manuscript (for example a manuscript that has been submitted to a peer-reviewed journal) against the whole World Wide Web (WWW) and/or another collection of published articles (such as the abstracts in MEDLINE, the full text articles in PubMed Central, or e-print servers), in order to identify similar or identical phrases. While generic search engines such as AltaVista could be used to search for simple phrases, they do not allow the user to check a whole manuscript against the Web. Moreover, they cannot detect simple word substitutions; thus, plagiarists may hide the true origin of their selections by simply replacing as many words as possible with synonyms.
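To make the phrase-checking idea concrete, the following sketch (a simplified illustration of my own, not the algorithm of any particular search engine; a local reference text stands in for the Web) splits a manuscript into overlapping word n-grams ("shingles") and measures what fraction of them reappear in a candidate source:

```python
def shingles(text, n=8):
    """Return the set of overlapping n-word phrases ("shingles") in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(manuscript, source, n=8):
    """Fraction of the manuscript's n-word phrases that also occur in the source."""
    m = shingles(manuscript, n)
    if not m:
        return 0.0
    return len(m & shingles(source, n)) / len(m)
```

Note that this naive exact-phrase matching illustrates precisely the weakness described above: replacing a single word in each phrase with a synonym breaks every shingle containing it, which is why more robust detectors must work at a deeper level than literal string comparison.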
A more sophisticated, specialized "search engine" to detect plagiarism has been developed by Barrie and Presti [
To test the power of the system I submitted the questionable manuscript published in JRCSEd (see case report above) to the system. The plagiarism report was returned within 24 hours. The system not only flagged the paper as "medium original," but also highlighted 36% of the document as originating from different websites, most notably from the med-PICS and the Dublin Core metadata websites (see
The plagiarism.org report detected similarities with twelve webpages (listed under "similar links"). The originality of the paper was rated as "medium."
Fig. 3a+b. The words which are underlined and highlighted red in the plagiarism.org report (a) were lifted from the website medpics.org (b)
Fig. 4 a+b. The words which are underlined and highlighted green in the plagiarism.org report (a) were lifted from the Dublin Core metadata website (b)
As an aside, it should be noted that the plagiarism.org tool proved to be very sensitive, in that it also retrieved several websites which cited the same or a similar set of publications. A tool like plagiarism.org could therefore also be used to identify related publications on the Web which deal with similar topics; in this respect it may serve a similar function as the "Related Articles" button in PubMed [
Other scenarios could be imagined, but are not yet available. For example, one possible future development could be that Web authors would be able to use special search engines to monitor the Web (or full text databases) prospectively and continuously, receiving alerts when parts of their documents have been "webnapped," i.e. published on other websites or lifted into articles. This would require that authors submit whole published manuscripts or register the URL with a special search engine, together with their email address. The search engine would then not only crawl and index webpages like a normal search engine, but also automatically notify Web authors if a "similar" page shows up somewhere on the Web, or if a similar article appears in a dynamic database such as PubMed Central, Medline, e-print servers, or other databases containing full text articles or abstracts. In fact, such software agents would not only be useful in detecting plagiarism, but could also be used to alert authors to similar new articles in their field being published on the Web or in the literature.
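Such a monitoring agent could be sketched as follows. This is a toy illustration under stated assumptions: documents are registered locally, "the Web" is reduced to pages fed to the checker, and the class and method names are invented for illustration only.

```python
import hashlib

class WebnapMonitor:
    """Toy registry: authors register documents; newly crawled pages are
    checked against the stored phrase fingerprints, and sufficient overlap
    identifies the authors who should be alerted."""

    def __init__(self, n=8):
        self.n = n
        self.registry = {}  # author email -> set of phrase hashes

    def _fingerprints(self, text):
        # Hash overlapping n-word phrases so full documents need not be stored.
        words = text.lower().split()
        return {hashlib.sha1(" ".join(words[i:i + self.n]).encode()).hexdigest()
                for i in range(len(words) - self.n + 1)}

    def register(self, email, text):
        self.registry.setdefault(email, set()).update(self._fingerprints(text))

    def check_new_page(self, text, threshold=5):
        """Return the authors whose registered material overlaps this page."""
        prints = self._fingerprints(text)
        return [email for email, fp in self.registry.items()
                if len(fp & prints) >= threshold]
```

A real service would of course need a crawler and an email component; the sketch only shows the matching core.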
Plagiarism comes in many different varieties. When authors "plagiarize" themselves this is called "redundant" or "duplicate publication." According to Charles Babbage, from his book
Readers of primary source periodicals deserve to be able to trust that what they are reading is original, unless there is a clear statement that the article is being republished by the choice of the author and editor. The bases of this position are international copyright laws, ethical conduct, and cost-effective use of resources.
Duplicate publication is another kind of misconduct which could be detected by the use of modern information technology: Stephen Lock already noted that "duplicate publication might be disclosed more often if journal offices were to routinely search the databases" [
Interestingly, a case of duplicate publication occurred in the very same issue of the very same journal, committed by the very same person as described in the case above: On page 278 of JRCSEd, C.O. published a letter "How to cope with unsolicited Email from the general public seeking medical advice" [
Without discussing this case further at this point, it should only be mentioned that intelligent software agents could be developed to alert journal editors to possible cases of redundant publication and copyright violation by automatically comparing publications with each other - for example within and between PubMed Central, Medline, e-print servers, and the web - and alerting publishers if similarities are found. As both JRCSEd and BMJ have online versions of their journals, an intelligent software agent could have detected this case of duplicate publication. Once again, the effect of installing and applying such systems would be primarily educational: if such measures were in place and their use known, this would probably discourage authors from submitting redundant articles and committing plagiarism.
It should be noted that other informatics techniques for detecting plagiarism exist. The Glatt Plagiarism Screening Program is a computer program especially targeted for teachers who want to prove the guilt or innocence of a student. The program detects plagiarism by analysing the writing style within a document. The software developers say that each person has an individual style of writing which is as unique as fingerprints. The procedure is described as follows: "The Glatt Plagiarism Screening Program eliminates every fifth word of the suspected student's paper and replaces the words with a standard size blank. The student is asked to supply the missing words" [
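The cloze procedure quoted above can be illustrated with a short sketch (my own simplified reconstruction of the published description, not the Glatt program's actual code): every fifth word is blanked out, and the suspect is scored on how many original words he or she can supply.

```python
def cloze(text, k=5, blank="_____"):
    """Blank out every k-th word; return the gapped text and the answers."""
    words = text.split()
    gapped, answers = [], []
    for i, w in enumerate(words, start=1):
        if i % k == 0:
            answers.append(w)
            gapped.append(blank)
        else:
            gapped.append(w)
    return " ".join(gapped), answers

def score(answers, supplied):
    """Fraction of blanks filled with the original word; genuine authors
    are expected to score much higher than plagiarists."""
    if not answers:
        return 1.0
    hits = sum(a.lower() == s.lower() for a, s in zip(answers, supplied))
    return hits / len(answers)
```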
As an aside, it should also be briefly mentioned that in the field of software development and informatics education, several tools are available which can test the similarity of software to protect computer codes from being lifted; examples include the software similarity tester SIM [
The future may bring even more possibilities, especially for helping authors avoid inadvertent plagiarism. One option would be to expand the concept of "copy & paste" towards "copy & paste & attribute (=give credit to the source)." Future versions of word processors could be designed to allow authors to clearly identify which parts of the document have been inserted by copy & paste and where they come from. For example, authors could click on the text and the word processor would show in a comment field from which website (or other application) the "copied & pasted" passage originated.
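The "copy & paste & attribute" idea could be modelled with a simple data structure in which every pasted passage carries its origin. This is a hypothetical sketch - no existing word processor exposes such an interface - and the class names and example URL are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    text: str
    source: Optional[str] = None  # None means the author typed it

@dataclass
class Document:
    spans: list = field(default_factory=list)

    def type_text(self, text):
        self.spans.append(Span(text))

    def paste(self, text, source):
        # "copy & paste & attribute": keep the origin with the pasted text
        self.spans.append(Span(text, source))

    def pasted_from(self):
        """All pasted passages together with their recorded origin,
        so unacknowledged material can be reviewed before submission."""
        return [(s.text, s.source) for s in self.spans if s.source is not None]

    def render(self):
        return "".join(s.text for s in self.spans)
```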
Other developments may include techniques to assign invisible metainformation to electronic information, which could identify the author and which cannot be stripped. Such invisible "watermarks" are already in use for digital images, but future operating systems may also support metainformation assigned to text, so that the author of a given paragraph could be identified, even if the text is "copied and pasted" from one application into another.
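As a toy illustration of invisible metainformation attached to text, the following sketch encodes an author identifier as zero-width Unicode characters appended to a paragraph. Unlike the non-strippable watermarks envisioned above, this encoding is trivially removed, so it demonstrates only the principle, not a robust scheme.

```python
# Zero-width space and zero-width non-joiner stand for the bits 0 and 1.
ZW0, ZW1 = "\u200b", "\u200c"

def watermark(text, author_id):
    """Append the author ID, bit by bit, as invisible characters."""
    bits = "".join(f"{b:08b}" for b in author_id.encode("utf-8"))
    return text + "".join(ZW1 if bit == "1" else ZW0 for bit in bits)

def extract(text):
    """Recover the embedded author ID from the invisible characters."""
    bits = "".join("1" if c == ZW1 else "0" for c in text if c in (ZW0, ZW1))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8", errors="replace")
```

The watermarked paragraph renders identically to the original in most viewers, yet survives plain "copy & paste" between applications.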
On a different level, the company Xerox is also active in developing products which make redistribution of digital content impossible. The Digital Property Rights Language (DPRL) is a computer-interpretable language, developed at the Xerox Palo Alto Research Center, which "describes distinct categories of uses for digital works in terms of rights, including rights to copy a digital work, or to print it out, or to loan it, or to use portions of it in derivative works" [
Not only plagiarism and duplicate publication ("overreporting of research") can be a problem in medical science; "underreporting of research," i.e. not publishing the results of a randomised controlled trial, has also been called scientific misconduct [
Many authors may feel encouraged to copy from the web because electronic publications are seen as "inferior" in quality and less worthy of protection, and as more volatile than "real" publications on paper. While the majority of authors would refrain from copying whole paragraphs from printed articles, the barrier to doing the same with web publications seems to be lower: information on the web, so the reasoning goes, will disappear sooner or later, making proof of plagiarism apparently impossible, while the printed journal remains in the library as a durable witness of plagiarism waiting to be discovered and used as evidence. However, plagiarists should be warned that material on the Internet is not as volatile as they may think, and that future historians will be well able to reconstruct online plagiarism, as there are online archives of the Internet such as
Insufficient familiarity with English [
Jeremy Wyatt, a respected medical informatics researcher from London and an editorial board member of the Journal of Medical Internet Research, also says that he has "seen paragraphs of my work copied in other people's papers without acknowledgement at least three times now (in obscure conference papers and medical informatics journals) but have never kept a note of it; after the initial anger, I dismissed it as a case of 'imitation is the sincerest form of flattery'." Future studies applying tools such as plagiarism.org in editorial offices may establish estimates of how widespread this phenomenon is.
In the future, the Journal of Medical Internet Research will routinely check accepted manuscripts for plagiarism, using the automatic plagiarism detector at plagiarism.org. We are the first scholarly journal worldwide to adopt such a plagiarism screening policy, but we hope (and expect) that other biomedical journals will follow. Authors should remember that there is only one easy and reliable way to avoid plagiarism charges: that is to cite the source properly, even if it is "only" an electronic document [
The author of this article is also author of the partly-plagiarized website medpics.org and editor of the Journal of Medical Internet Research.