Apr 3, 2025, 12:00 AM

OpenAI's GPT-4o reportedly uses copyrighted material without permission

Highlights
  • Researchers claim GPT-4o identified verbatim excerpts from copyrighted O'Reilly books with 82 percent accuracy.
  • The study used the DE-COP membership inference technique to probe the model's training data sources.
  • The findings highlight an urgent need for transparency and licensing frameworks in AI content training.
Story

Researchers with the AI Disclosures Project published a study examining the training data of OpenAI's GPT-4o model, which was released in May 2024. The study focused on whether the model had ingested copyrighted material from O'Reilly Media without authorization. The findings showed that the model answered multiple-choice questions derived from 34 copyrighted O'Reilly books with 82 percent accuracy, suggesting significant exposure to these texts during training. This raises serious questions about copyright violations and the ethics of training machine learning models on content whose creators gave no consent and received no compensation.

The researchers, who include O'Reilly Media founder Tim O'Reilly, used a technique called DE-COP membership inference to assess the model's ability to recognize specific excerpts from the books in question. Their results point to a troubling trend of increasing use of non-public data in AI training, particularly when contrasted with the smaller GPT-4o Mini, which showed no evidence of having been trained on O'Reilly materials.

The implications extend beyond OpenAI, pointing to a broader industry problem around the sourcing of training data and the need for transparent licensing agreements. OpenAI is already facing lawsuits claiming it used copyrighted materials without authorization to train its models. Although the company maintains it has acted within legal limits, the ongoing legal challenges may force OpenAI and other AI developers to reconsider their data practices. The researchers warn that if creators are not compensated for their work, the quality and diversity of content on the internet could decline, leading to what they term the 'enshittification' of online resources.
This underscores the pressing need for responsible governance of content used for AI training. The study also highlights a disturbing trend: major AI companies, including OpenAI, are actively seeking relaxed copyright rules to simplify their training processes. This push for easier access to copyrighted material, amid ongoing legal challenges, could exacerbate tensions between content creators and AI developers. In light of these revelations and the wider conversation about AI's responsibilities in content use, the future of AI model development hinges on establishing formal licensing frameworks that safeguard the rights of original authors and creators in a rapidly evolving digital landscape.
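The DE-COP technique mentioned above frames membership inference as a multiple-choice quiz: the model is shown a verbatim book excerpt alongside paraphrased decoys and asked to pick the original, and accuracy well above the chance baseline suggests the text was in the training set. Below is a minimal sketch of the scoring step only, assuming the quiz responses have already been collected; the function names and the 0.25 decision margin are illustrative choices, not details from the study.

```python
def decop_accuracy(results):
    """Fraction of quiz items where the model chose the verbatim passage.

    `results` is a list of booleans: True if the model picked the true
    excerpt over the paraphrased decoys on that question.
    """
    return sum(results) / len(results)

def flags_training_exposure(accuracy, n_options=4, margin=0.25):
    """Heuristic (illustrative, not from the paper): flag a book when
    accuracy sits well above the chance baseline for an n-way quiz."""
    chance = 1.0 / n_options
    return accuracy >= chance + margin

# Hypothetical results for two 20-question, four-option quizzes.
unexposed = [True] * 5 + [False] * 15   # 25% correct: exactly chance
exposed = [True] * 16 + [False] * 4     # 80% correct: near the study's 82%

print(flags_training_exposure(decop_accuracy(unexposed)))  # False
print(flags_training_exposure(decop_accuracy(exposed)))    # True
```

In practice the comparison against a model known not to have seen the text (here, GPT-4o Mini in the study) matters more than any fixed threshold, since question difficulty varies by book.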
