
Humans plus AI detectors can catch AI-generated academic writing
Certain artificial intelligence (AI) content detectors and experienced human reviewers can accurately identify AI-generated academic articles, even after paraphrasing, helping to uphold academic integrity in scientific publishing. This is the main message that emerged from a study titled “The Great Detectives: Humans versus AI detectors in catching large language model-generated medical writing”, published in the International Journal for Educational Integrity on 20 May.
The study indicated that the application of AI in academic writing has raised concerns regarding accuracy, ethics and scientific rigour because some AI content detectors may not accurately identify AI-generated texts, especially those that have undergone paraphrasing.
“There is a pressing need for efficacious approaches or guidelines to govern AI usage in specific disciplines,” the study indicated.
Results
The researchers therefore purposively selected 50 rehabilitation-related articles from four peer-reviewed journals and then fabricated another 50 articles using ChatGPT. Wordtune was then used to rephrase the ChatGPT-generated articles.
Six common AI content detectors (Originality.ai, Turnitin, ZeroGPT, GPTZero, Content at Scale, and GPT-2 Output Detector) were employed to identify AI content in the original, ChatGPT-generated and AI-rephrased articles. Additionally, four human reviewers (two student reviewers and two professorial reviewers) were recruited to differentiate between the original articles and AI-rephrased articles.
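To illustrate how such a benchmark can be scored, the sketch below computes detection rates on AI-origin texts and the misclassification rate on human-written texts from a set of labelled verdicts. The labels, verdicts and counts are hypothetical placeholders, not the study's actual data or pipeline.

```python
# Minimal sketch: scoring an AI-content detector against labelled articles.
# The labels ("human", "chatgpt", "rephrased") and verdicts are assumptions
# for illustration; the study's actual pipeline is not reproduced here.

from collections import defaultdict

def score_detector(verdicts):
    """verdicts: list of (true_label, flagged_as_ai) pairs."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for label, is_ai in verdicts:
        totals[label] += 1
        flagged[label] += int(is_ai)
    # Detection rate on AI-origin texts; misclassification rate on human texts.
    return {
        "chatgpt_detected": flagged["chatgpt"] / totals["chatgpt"],
        "rephrased_detected": flagged["rephrased"] / totals["rephrased"],
        "human_misclassified": flagged["human"] / totals["human"],
    }

# Example with made-up verdicts mirroring the study's three article types.
sample = [("human", False)] * 44 + [("human", True)] * 6 \
       + [("chatgpt", True)] * 48 + [("chatgpt", False)] * 2 \
       + [("rephrased", True)] * 44 + [("rephrased", False)] * 6
print(score_detector(sample))
```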
The study showed that Originality.ai correctly detected 100% of ChatGPT-generated and AI-rephrased texts, while ZeroGPT detected 96% of ChatGPT-generated and 88% of AI-rephrased articles. Turnitin showed a 0% misclassification rate for human-written articles, but identified only 30% of AI-rephrased articles.
The research indicated that professorial reviewers accurately identified at least 96% of AI-rephrased articles, most often citing ‘incoherent content’ (34.36%), followed by ‘grammatical errors’ (20.26%) and ‘insufficient evidence’ (16.15%), but they misclassified 12% of human-written articles as AI-generated.
Student reviewers, by contrast, identified only 76% of AI-rephrased articles on average.
Significance of the study
In a joint statement sent to University World News, four of the study’s authors, namely Fadi Al Zoubi, Jae Liu, Kelvin Hui and Arnold Wong at Hong Kong Polytechnic University, said: “This is the first study to compare the accuracy of various commonly used AI content detectors and human reviewers in distinguishing between artificial intelligence generated or AI-paraphrased articles and published peer-reviewed articles in the rehabilitation field.
“Our results underscore that the investigated AI content detectors had diverse accuracy and misclassification rates,” they added.
“For example, one commonly used tool in the academic setting exhibited perfect accuracy in recognising human-written articles, but had difficulty in identifying AI-paraphrased content,” they explained.
“Similarly, experienced professor reviewers detected at least 96% of AI-paraphrased articles, while undergraduate and graduate students displayed lower accuracy rates and a higher tendency to misclassify human-written articles,” the authors said.
“These findings highlight the critical need for ongoing development and refinement of AI detection tools to balance high detection rates of AI-generated content with minimal misclassification of human-authored texts,” they continued.
“Additionally, the results suggest the importance of enhancing the competence of inexperienced human reviewers in distinguishing between AI-generated and human-written content, thereby enhancing the integrity and reliability of scholarly work in the digital age.”
Practical insights
“Our study offers practical insights for academics, universities, publishers, and reviewers on harnessing the potential of AI content detectors, while safeguarding the integrity of academic works in the face of the growing use of generative AI technologies,” they said.
With reference to academics and universities, the authors said it is crucial for universities to establish comprehensive guidelines on the ethical use of generative AI in written assignments, and to educate students on the credibility of their works when relying on AI-generated content.
Further, education institutions should implement a dual-layered screening process that combines multiple AI content detectors with human evaluation of student submissions, ensuring a fair and integrity-driven academic environment.
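A minimal sketch of such a dual-layered workflow might look as follows; the detector interface, thresholds and routing rule are assumptions for illustration, since the authors do not prescribe a specific implementation.

```python
# Hypothetical sketch of a dual-layered screening workflow: several AI
# detectors vote first, and ambiguous submissions are routed to a human
# reviewer. Detector logic and thresholds are assumptions, not the
# study's method.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Screening:
    decision: str   # "pass", "human_review", or "flag"
    ai_votes: int   # how many detectors flagged the text

def screen(text: str, detectors: List[Callable[[str], bool]],
           review_threshold: int = 1, flag_threshold: int = 2) -> Screening:
    votes = sum(1 for detect in detectors if detect(text))
    if votes >= flag_threshold:
        return Screening("flag", votes)           # strong signal: escalate
    if votes >= review_threshold:
        return Screening("human_review", votes)   # ambiguous: human decides
    return Screening("pass", votes)

# Toy detectors standing in for real tools such as those in the study.
detectors = [
    lambda t: "as an ai language model" in t.lower(),               # crude tell
    lambda t: len(set(t.split())) / max(len(t.split()), 1) < 0.4,   # low lexical variety
]
print(screen("As an AI language model, I cannot...", detectors))
```

The design point is simply that no single detector's verdict is trusted on its own: agreement between tools escalates a submission, while a lone flag triggers human judgement rather than an automatic accusation.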
With reference to publishers and reviewers, the authors said: “Our research substantiates the effectiveness of the current peer-review system in differentiating between AI-paraphrased and human-authored articles.”
“Nevertheless, to further bolster the reliability of this process, we recommend that journals incorporate at least one proven AI detection tool as a preliminary screening measure.
“This step would help identify potential plagiarism and AI-generated content before the peer-review stage, streamlining the review process and preserving the scholarly value of published work,” they explained.
Embracing a proactive stance
“Given the swift advancements in generative AI technologies and the corresponding evolution of content detectors, it is imperative for academics, universities, publishers, and reviewers to remain vigilant and informed,” stressed Al Zoubi, Liu, Hui and Wong.
“By staying abreast of technological developments and continuously refining strategies and policies to detect and manage AI-generated content, the academic community can safeguard the integrity and authenticity of scholarly works.
“This proactive stance is not just about mitigating risks but also about embracing the opportunities that generative AI presents for enhancing research, learning, and the dissemination of knowledge.
“As we navigate this evolving landscape, our study serves as a reminder of the importance of balancing innovation with ethical considerations and quality control to ensure that the academic and scientific discourse remains trustworthy and credible,” the authors concluded.
Experts’ views
Dr Mike Perkins, head of the Centre for Research and Innovation at the British University Vietnam, told University World News: “This is an interesting study which adds to the ongoing discussion about the ability of so-called AI text detectors to determine whether a piece of text is produced by a human or the output of a GenAI tool.”
“However, we need caution in extrapolating the results, given that the authors used ChatGPT 3.5 to generate their content for testing.”
Perkins, who is the lead author of a 2024 study titled “Academic publisher guidelines on AI usage: A ChatGPT supported thematic analysis”, explained: “The method of text creation also does not reflect a co-writing process between AI and humans, which would be a key area for exploration”.
Expanding further, Dr Ahmed Elkhatat, section head of research planning in the Office of the Vice-President for Research and Graduate Studies at Qatar University, told University World News: “The research underscores that while these tools perform relatively well with identifying content generated by earlier AI models like GPT-3.5, they struggle with more advanced models like GPT-4.”
Elkhatat explained: “This highlights the necessity for continual improvements in AI detection technologies to keep pace with the advancements in AI text generation, ensuring the preservation of academic integrity in educational settings.” Elkhatat is the lead author of the 2023 study titled “Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text”.
“The study provides a crucial insight into the evolving landscape of AI and its implications for academic integrity. The inconsistencies observed in the detection tools, especially regarding false positives with human-written content, emphasise the need for a multifaceted approach that combines AI detection tools with manual review processes,” Elkhatat argued.
“This approach will help mitigate the risks of academic misconduct and enhance the reliability of assessments.”
“Further research and development are essential to refine these tools and adapt them to the sophisticated capabilities of newer AI models. Additionally, future AI detectors will face continuous challenges due to the rapid development of AI generative text.”
“The percentages of false positives and the misidentification of human text are likely to increase, demanding even more advanced and nuanced detection methods,” Elkhatat concluded.