Regulatory Shift

The Finetuning Loophole: New Research Proves AI Models Store and Regurgitate Copyrighted Books

arXiv AI · March 24, 2026

This research undermines a core legal defense of the AI industry by demonstrating that leading models such as GPT-4o and Gemini store verbatim copies of their training data despite safety filters. For executives, it signals a major regulatory risk: if models are found to store protected expression rather than merely 'learn' from it, the fair-use protections currently shielding AI companies from massive licensing costs could collapse.

Key Intelligence

  • Researchers found that simple finetuning (teaching a model to expand plot summaries into full text) can bypass safety filters and extract up to 90% of a copyrighted book verbatim; a hedged sketch of what such a finetuning dataset might look like follows this list.
  • Finetuning on a single author's work, such as Haruki Murakami's, can inadvertently 'unlock' verbatim recall of books by more than 30 unrelated authors.
  • The study confirms that GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 all memorize the same book passages, indicating a systemic vulnerability that spans the industry.
  • Individual verbatim 'regurgitations' exceeded 460 words at a time, directly contradicting assurances AI labs have given regulators that models do not store copies of training data; a sketch of how such contiguous overlap can be measured also follows this list.
  • Public-domain data proved just as effective as copyrighted data at 'reactivating' these hidden memories, suggesting that safety alignment is a thin veil over memorized content rather than a true deletion of it.
  • Legal experts warn that this 'Alignment Whack-a-Mole' dynamic could jeopardize favorable court rulings that were conditioned on the adequacy of measures to prevent reproduction of protected works.
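
To make the extraction technique concrete, here is a minimal, hypothetical sketch of what a summary-expansion finetuning dataset could look like. The paper's actual data format is not reproduced here: the chat-style JSONL layout, the `make_example` helper, and the Moby-Dick sample (public domain) are all illustrative assumptions.

```python
import json

# Hypothetical (summary, passage) pairs. The attack described in the
# research pairs short plot summaries with the passages they summarize;
# this sample uses public-domain text (Moby-Dick) purely for illustration.
PAIRS = [
    (
        "A sailor introduces himself and explains why he goes to sea.",
        "Call me Ishmael. Some years ago, never mind how long precisely, "
        "having little or no money in my purse, and nothing particular "
        "to interest me on shore, I thought I would sail about a little "
        "and see the watery part of the world.",
    ),
]

def make_example(summary: str, passage: str) -> dict:
    # Chat-style JSONL record, as accepted by common finetuning APIs.
    # This schema is an assumption, not the paper's exact format.
    return {
        "messages": [
            {"role": "user",
             "content": f"Expand this plot summary into the full passage: {summary}"},
            {"role": "assistant", "content": passage},
        ]
    }

with open("summary_expansion.jsonl", "w", encoding="utf-8") as f:
    for summary, passage in PAIRS:
        f.write(json.dumps(make_example(summary, passage)) + "\n")
```

The study's point is that a handful of such pairs can shift a model from refusing to reproducing: the verbatim text is already stored in the weights, and the finetuning merely lifts the veil.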
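The 460-word figure refers to the longest contiguous run of text shared verbatim between a model's output and the source book. Below is a minimal sketch of one way to measure such a run using Python's standard difflib; the paper's actual metric may differ.

```python
import difflib

def longest_verbatim_run(generated: str, reference: str) -> int:
    """Return the length, in words, of the longest contiguous word
    sequence shared by the model output and the reference text."""
    gen_words = generated.split()
    ref_words = reference.split()
    matcher = difflib.SequenceMatcher(None, gen_words, ref_words, autojunk=False)
    match = matcher.find_longest_match(0, len(gen_words), 0, len(ref_words))
    return match.size

# Toy usage with placeholder strings; a real evaluation would compare
# full model transcripts against the book text.
output = "it was the best of times it was the worst of times entirely"
book = "it was the best of times it was the worst of times it was the age of wisdom"
print(longest_verbatim_run(output, book))  # -> 12
```

SequenceMatcher operates on word lists here rather than characters, so the returned size counts whole words, matching how the 460-word figure is expressed.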