The court acknowledged that TDM exceptions could be legitimately used to create AI training corpora,
referencing European Commission policy. But it also held that the mere possibility of reconstructing a work from model parameters is sufficient to infringe the reproduction right, even absent any physical copy within the model. Accordingly, the court ruled that if a model enables the verbatim reproduction of protected lyrics,
the model itself may constitute a “copy” of those lyrics. Thus, it effectively treated latent statistical representations as equivalent to fixed copies. This raises significant interpretive and practical questions.
1. Can a model be a “copy” based solely on statistical correlations?
Large language models (LLMs) store statistical relationships rather than verbatim reproductions of the data used to train them. Treating these internal representations as infringing reproductions risks collapsing the critical distinction between auxiliary materials and copyright works. This interpretation would have an effect similar to the way computer programmes are protected in the EU, where protection controversially extends even to “preparatory design material”.
In contrast, the recent ruling of the High Court of England and Wales in
Getty Images v Stability AI determined that Stable Diffusion did not “store or reproduce” Getty’s images (para 600 and see
IPKat here). The model weights for each version of Stable Diffusion were found never to have contained or stored an infringing copy. Thus, it appears that courts may be treating the technical nature of the representation as decisive. However, the key legal question remains: where should the line be drawn between permissible correlations/representations and reproductions that are actionable under the law?
In my view, a copy only exists when it can be perceived through normal use of the medium. While this does not solve the memorization problem entirely, it helps prevent the concept of reproduction from becoming too broad. Otherwise, we will once again find ourselves debating whether a link constitutes a copy of a work (and why it should not).
2. Is the evidentiary metric sufficient?
The Munich decision appears to rely on small fragments (e.g., a 25-word chorus or a few lines) to infer that the model constitutes a copy. It is true that machine learning models can memorize training data verbatim, especially when models are large or training data are duplicated. However, empirical studies on memorization in LLMs indicate that the phenomenon does not arise from a simple one-to-one mapping of a protected work to a single parameter. Instead, memorization results from distributed representations and interactions among multiple factors related to the model’s general language capabilities (see
Huang et al., 2024).
The potential for reproduction does not automatically imply that the model stores the work in a relevant sense.
Morris et al. (2025) estimate that LLMs store only about
3.6 bits per parameter. For a model with one billion parameters, this corresponds to roughly 450 MB of memorized information, far less than the hundreds of gigabytes (or even terabytes) used for training. This suggests that models cannot store their training data, including entire copyright works, verbatim; any reproduction of a fragment is probabilistic and diffuse.
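The arithmetic behind that figure is straightforward and can be checked directly (a back-of-the-envelope sketch using only the numbers cited above):

```python
# Capacity estimate based on the ~3.6 bits memorized per parameter
# reported by Morris et al. (2025); the conversion is plain arithmetic.
BITS_PER_PARAM = 3.6
params = 1_000_000_000  # a one-billion-parameter model

total_bits = BITS_PER_PARAM * params
total_megabytes = total_bits / 8 / 1_000_000  # 8 bits per byte, 1e6 bytes per MB

print(f"{total_megabytes:.0f} MB")  # roughly 450 MB, versus hundreds of GB of training data
```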
Consequently, courts should demand systematic and statistically robust evidence. This could include repeatability of extraction, long-sequence retrieval, exposure/log-probability metrics, and an assessment of actual economic harm. Ultimately, the economic impact may be the most important factor, as arguments about market dilution have yet to be proven in court (see
Kadrey v. Meta). But what if a work, such as lyrics, is already widely available online? Would incidental reproductions of it during output generation constitute infringement? If so, what would the degree of harm be?
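One of the metrics mentioned above, per-token log-probability, can be sketched as follows. This is a minimal illustration, not a forensic tool: `token_prob` is a hypothetical interface standing in for a query to an actual model, and the flagging threshold is an arbitrary placeholder.

```python
import math
from typing import Callable, Sequence

def mean_log_prob(tokens: Sequence[str],
                  token_prob: Callable[[Sequence[str], str], float]) -> float:
    """Average per-token log-probability of a passage under a model.

    token_prob(prefix, token) is a hypothetical interface returning the
    model's probability of `token` given the preceding context.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(token_prob(tokens[:i], tok))
    return total / len(tokens)

def looks_memorized(tokens, token_prob, threshold=-0.05) -> bool:
    # Near-certain next-token predictions sustained over a passage
    # (mean log-prob close to zero) are one signal of verbatim memorization.
    return mean_log_prob(tokens, token_prob) > threshold

# Toy demonstration: a "model" that is 99% confident about every token in a
# passage gets flagged; one that is only 50% confident does not.
confident = lambda prefix, tok: 0.99
uncertain = lambda prefix, tok: 0.5
passage = "some widely available lyrics".split()
print(looks_memorized(passage, confident))  # True
print(looks_memorized(passage, uncertain))  # False
```

Real evaluations of this kind would also need the repeatability and long-sequence checks mentioned above, since a single high-probability fragment proves little on its own.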
3. Incidental inclusion and potential main appellate arguments
One exception does not preclude the use of another. The
InfoSoc Directive permits the incidental inclusion of a work “in other material” under Article 5(3)(i). However, the Munich Court rejected this argument, stating that the law requires the larger material to be a work. Nevertheless, the Directive itself refers to “other material” and not “other work”. Different language versions (French: “dans un autre produit”, German: “in anderes Material”) confirm that the scope is broader than the Munich judgment allows.
If a model’s memorization is unintentional and rare, and the model’s primary function is not to reproduce protected works, then it could qualify as incidental inclusion. This could be a key argument in OpenAI’s appeal. Training such models involves exposing them to many examples (e.g., images or texts) so they can learn statistical correlations in the data that are useful for performing a specific task. During training, the model makes predictions, measures the resulting error, and adjusts its internal parameters, usually using backpropagation, so that its performance gradually improves. In this way, it learns abstract statistical patterns derived from the training data and becomes able to make predictions on new cases (see
LeCun et al.).
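The predict, measure, adjust cycle described above can be shown in miniature. The following is a toy one-parameter example, not an actual LLM training loop: it fits a single weight w so that y ≈ w·x by gradient descent on squared error.

```python
# Toy illustration of the training cycle described above: the model makes a
# prediction, the error is measured, and the parameter is adjusted against
# the gradient. Real models repeat this over billions of parameters.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # examples where y = 2 * x
w = 0.0            # the model's single internal parameter
learning_rate = 0.05

for step in range(200):
    for x, y in data:
        prediction = w * x             # forward pass: make a prediction
        error = prediction - y         # measure the error
        gradient = 2 * error * x       # derivative of squared error w.r.t. w
        w -= learning_rate * gradient  # adjust the parameter

print(round(w, 3))  # converges toward 2.0, the pattern underlying the data
```

The point the analogy makes is the one in the text: the loop extracts the statistical regularity (here, "multiply by two") rather than filing away the individual training examples.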
Given the intended purpose of the models, any copies contained within their statistical representations are most likely incidental. The goal is to generalize information, not to memorize specific examples from the training dataset. There are far better methods for the latter.
Bottom Line
Opposing the use of text and data mining exceptions for AI development is counterproductive, especially since the combination of TDM with incidental inclusion could serve as a defence against most potential infringements.
Collecting data from lawfully accessible sources to create a training corpus and copying that data while developing an algorithm that estimates the probability of instances or class labels fall under the Directive’s definition of TDM. Unintentional errors in the subsequent use of the model, such as memorization, overfitting, and other “regurgitations” in the output, may be considered incidental inclusions.
Some might still argue that the three-step test hinders the application of TDM exceptions to such activities (see IPKat
here). But this requires assessing each case individually rather than drawing general conclusions. First and foremost, one must prove that such use unreasonably (not merely) prejudices the legitimate (not just any) interests of the author, which requires economic analysis rather than legal opinions.