Plagiarism, Credit, and the Problem of AI’s Knowledge Base
By Ax de Klerk | 20 Dec 2025
Across the first two essays in this series, I used music as a way of thinking through AI: first as a metaphor for learning and expertise, then as a mirror for meaning, projection, and emotional response. Music mattered because it exposed something that technical discussions often obscure: creative work is never just output but accumulated human experience.
This third essay returns to that same terrain from a different angle. If music helped reveal how meaning is supplied by the listener, it also forces a harder question about origin and credit. When an AI system is trained on decades of human creativity, journalism, and scholarship, at what point does learning become borrowing, and borrowing become plagiarism?
1. Introduction: Learning the Rules Before Writing
In my first year at university, studying Criminology, before I was asked to write my first assignment, I was taught the importance of referencing: not only to avoid plagiarism but, more importantly, to give credit to the original thought. The only exception was common knowledge, explained to us as information that most people could reasonably be expected to know.
For example, “Shakespeare was an English playwright” would not require a citation, whereas “Shakespeare preferred beer over wine” would, as that is not common knowledge.
The rules were simple. If an idea, phrase, or argument came from somewhere else, you acknowledged it. AI complicates this rule immediately.
Modern large language models are trained by ingesting and analysing vast quantities of text scraped from the internet, including books, journalism, academic papers, and creative writing. What follows is not a discussion of outputs alone, but of the knowledge base itself, and whether its construction crosses the line from learning into plagiarism.
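To make that point concrete, here is a minimal, hypothetical sketch in Python of how scraped documents are typically flattened into training records. The field names, chunk size, and pipeline shape are invented for illustration and do not describe any particular company's system; the point is simply that the text survives while the author, source, and licence do not travel with it.

```python
# Hypothetical illustration only: how provenance tends to disappear when
# scraped documents are flattened into training data. Field names and the
# chunking step are invented for this sketch; real pipelines differ.

scraped_documents = [
    {"url": "https://example.com/paywalled-article", "author": "A. Reporter",
     "text": "Decades of investigative journalism..."},
    {"url": "https://example.com/research-paper", "author": "Dr. B. Scholar",
     "text": "Our study found that..."},
]

def to_training_records(documents, chunk_size=512):
    """Flatten documents into the bare text chunks a model actually trains on.

    Note what is kept and what is dropped: the text is retained, but the
    author, URL, and licence do not accompany it any further.
    """
    records = []
    for doc in documents:
        text = doc["text"]
        # Split into fixed-size chunks; only the raw text is carried forward.
        for start in range(0, len(text), chunk_size):
            records.append({"text": text[start:start + chunk_size]})
    return records

training_records = to_training_records(scraped_documents)
print(training_records[0])  # {'text': '...'} -- no author, no URL, no licence
```

Whatever the real machinery looks like, the structural outcome is the same: by the time the material reaches the model, the link back to its creator has already been severed.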
2. Plagiarism Is About Credit, Not Copying
Plagiarism is often misunderstood as direct copying. In academic terms, it is broader than that. Plagiarism occurs when ideas, arguments, or expressions are used without appropriate attribution, regardless of whether the wording is identical. This distinction matters.
AI systems rarely reproduce verbatim passages. Instead, they absorb patterns, structures, arguments, and styles, and then recombine them. From a technical perspective, this is often described as “transformative”. From an academic perspective, it raises a more uncomfortable question: If the source of an idea cannot be traced, can credit still be given?
Human students are not allowed to answer this question by saying, “I learned it somewhere, but I can’t remember where.” AI systems operate permanently in that state.
3. When Learning Becomes Litigation
This tension is no longer theoretical. Major publishers and content owners have brought legal action against AI companies, arguing that their work was used without permission to train commercial systems.
In December 2023, The New York Times filed a lawsuit against OpenAI and Microsoft, alleging that millions of its articles were copied and used to train AI models without authorisation, and that those models could reproduce substantial portions of its journalism.
Similarly, Thomson Reuters brought a copyright infringement case against Ross Intelligence, arguing that its legal research content was used without permission to train an AI system intended to compete with Westlaw. In February 2025, a U.S. judge ruled that Ross’s use was not protected by fair use, setting an early and significant precedent for AI training disputes.
More recently, the AI search startup Perplexity AI has faced lawsuits and legal threats from multiple publishers, including The New York Times, accusing it of reproducing news content without proper licensing or attribution. These cases are not about isolated outputs. They are about how the knowledge base itself was built.
4. The Scale Problem
A human plagiarist copies one paper at a time. AI systems ingest millions. At this scale, traditional safeguards collapse. Attribution becomes impractical. Licensing becomes retroactive. Consent becomes ambiguous. What would be unacceptable for a student or researcher becomes normalised when performed by a machine. This is the core ethical shift.
Plagiarism, once a breach of conduct, becomes a systemic property of AI technology. The issue is no longer whether individual outputs infringe, but whether a system trained on uncredited human labour can ever claim intellectual neutrality. The disappearance of attribution is not accidental. It is structural.
5. Common Knowledge, Rewritten
AI training often relies on the idea that widely available information is functionally equivalent to common knowledge. This is a dangerous conflation. Information being accessible does not make it unowned. Journalism behind a paywall, academic research published in journals, and creative writing shared online are not transformed into common knowledge simply because they exist digitally. Universities draw this line clearly. AI companies have, so far, drawn it opportunistically. The result is a grey zone where human authors are expected to cite meticulously, while machines are allowed to generalise without credit.
6. Plagiarism Without Intent
One defence often raised is that AI systems lack intent. They do not mean to plagiarise. This is true, but irrelevant. Plagiarism is not judged by intent alone. A student who accidentally fails to cite a source is still penalised. Responsibility lies with the system’s operator, not its internal awareness. AI removes intent, but not consequence. When companies profit from systems built on uncredited human knowledge, the ethical burden does not disappear. It shifts upward.
7. Conclusion: Credit Is the Point
The ethical problem is not that AI learns. It is that it learns without acknowledgement. Education systems teach students that knowledge is cumulative and communal, but also traceable. AI systems absorb that same communal knowledge while severing the trace.
If plagiarism is fundamentally about respect for intellectual labour, then AI forces an uncomfortable reckoning. Either we redefine plagiarism to accommodate machines, or we admit that current AI systems operate by standards we would never allow for humans.
The question is not whether AI can generate answers. It is whether a knowledge system built on invisible borrowing can ever be considered intellectually honest.
This is where the discussion can no longer stop at plagiarism alone. Questions of credit inevitably lead to questions of consent, compensation, and responsibility. If we accept that AI systems are built on uncredited human knowledge, then the next problem is not whether that knowledge was used, but under what conditions it should be allowed to be used at all.
Plagiarism exposes the fault line. What follows is a question of governance.
References:
In addition to the sources below, I have to credit some of this content to conversations I have had with Paul Brunyee; otherwise I will have to pay him in beer for his IP, plus compensation for plagiarism.
1. Academic Definitions of Plagiarism and Attribution
- https://www.ox.ac.uk/students/academic/guidance/skills/plagiarism
- https://www.plagiarism.admin.cam.ac.uk
- https://www.turnitin.com/blog/what-is-plagiarism
2. The New York Times vs. OpenAI / Microsoft
- https://www.reuters.com/legal/new-york-times-sues-openai-microsoft-copyright-infringement-2023-12-27/
- https://www.nytimes.com/2023/12/27/business/media/new-york-times-openai-microsoft-lawsuit.html
3. Thomson Reuters vs. Ross Intelligence
- https://www.reuters.com/legal/thomson-reuters-wins-ai-copyright-lawsuit-2023-09-25
4. Perplexity AI and Publisher Disputes
- https://www.reuters.com/technology/media-groups-accuse-ai-startups-of-content-theft-2024-06-06/
- https://www.bbc.com/news/technology-66869268
5. Authors and Creative Works Used in AI Training
- https://authorsguild.org/news/ag-sues-openai/
- https://www.reuters.com/world/us/authors-sue-openai-copyright-2023-09-19/
6. Scraping, “Public” Data, and Consent
- https://digital-strategy.ec.europa.eu/en/policies/copyright-legislation
- https://www.eff.org/issues/ai-and-copyright
7. Image Generation (all three images across the three parts of this series)
- ChatGPT image generator