By Jonathan Band, Policy Bandwidth

In the heated copyright discussions over generative artificial intelligence, the term “text and data mining” is sometimes used interchangeably with “machine learning” or the “ingestion of material” to train large language models. This paper attempts to define these activities more precisely to better understand their treatment under different legal regimes.

Text and data mining (TDM) involves the collection of vast amounts of digitized material and the use of software to analyze and extract information from it. “TDM is a crucial first step to many machine learning, digital humanities, and social science applications….”[1] Machine learning is a technique for building artificial intelligence (AI) systems that is characterized by a computer’s ability to automatically learn and improve on the basis of data or experience, without relying on explicitly programmed rules.[2] Generative AI (AI that generates expressive material such as text, images, audio, or video in response to user prompts) is one form of AI built using machine learning. Thus, TDM is a crucial first step in the development of generative AI models, but TDM also has many applications completely unrelated to generative AI.[3]

TDM Activities Unrelated to Generative AI

Indeed, some TDM research methodologies are entirely unrelated to AI. For instance, TDM can sometimes be performed by developing algorithms to detect the frequency of certain words within a corpus, or to parse sentiments based on the proximity of various words to each other.[4] In other cases, though, scholars must employ machine learning techniques to train AI models before the models can make a variety of assessments. Comments prepared by the University of California Berkeley Library in response to the U.S. Copyright Office’s Notice of Inquiry on Artificial Intelligence and Copyright illustrate this distinction:

Imagine a scholar wishes to assess the prevalence with which 20th century fiction authors write about notions of happiness. The scholar likely would compile a corpus of thousands or tens of thousands of works of fiction, and then run a search algorithm across the corpus to detect the occurrence or frequency of words like “happiness,” “joy,” “mirth,” “contentment,” and synonyms and variations thereof. But if a scholar instead wanted to establish the presence of fictional characters who embody or display characteristics of being happy, the scholar would need to employ “discriminative modeling” (a classification and regression technique) that can train AI to recognize the appearance of happiness by looking for recurring indicia of character psychology, behavior, attitude, conversational tone, demeanor, appearance, and more.[5]

Significantly, neither of these TDM examples involves generative AI. The AI employed in the second example would not be considered generative AI because it would produce research results, and not create new expressive material such as text.   
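The first, non-AI example above (a search across a corpus for the frequency of certain words) can be illustrated with a minimal sketch. The code below is only an illustration under stated assumptions: the term list is taken from the quoted comments, while the function name, the two-sentence corpus, and the naive tokenizer are all invented here and do not reflect any particular researcher's method.

```python
import re
from collections import Counter

# Term list drawn from the example in the quoted comments; a real study
# would add synonyms and variations.
HAPPINESS_TERMS = {"happiness", "joy", "mirth", "contentment"}


def term_frequencies(corpus, terms):
    """Count how often each search term appears in each work.

    corpus: mapping from a work's title to its full text.
    Returns a mapping from title to a Counter of matched terms.
    """
    results = {}
    for title, text in corpus.items():
        # Deliberately naive tokenizer: lowercase alphabetic runs only.
        tokens = re.findall(r"[a-z']+", text.lower())
        results[title] = Counter(t for t in tokens if t in terms)
    return results


# Tiny invented corpus standing in for thousands of digitized works.
corpus = {
    "Work A": "Joy filled the house; such joy and mirth had not been seen in years.",
    "Work B": "He found no contentment, and happiness seemed a rumor.",
}

freqs = term_frequencies(corpus, HAPPINESS_TERMS)
for title, counts in freqs.items():
    print(title, dict(counts))
```

Nothing in this sketch learns from the data or generates new expressive material. It only counts occurrences, which is why this kind of TDM involves no AI at all.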

Legal Exceptions for TDM

Certain jurisdictions have adopted exceptions to copyright law permitting the copying necessary to engage in TDM. The scope of these exceptions varies. Some are broad enough to apply to all forms of TDM, while others might not apply to TDM engaged in for most generative AI purposes.

European Union. For example, the TDM exception mandated by Article 3 of the European Union Directive on Copyright in the Digital Single Market likely would apply to the training of generative AI models only in unusual circumstances.[6] Article 3 requires Member States to enact copyright exceptions permitting “reproductions and extractions made by research organisations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access.” Because of the scientific research requirement, Article 3 would apply to the training of generative AI models only for the purpose of scientific research. While there are many circumstances in which researchers would employ TDM for scientific research purposes (such as the examples in the previous section), there are far fewer circumstances in which a researcher would employ TDM to train a generative AI model, apart from research on generative AI. Generative AI would of course be useful to a researcher when she was writing up the results of her research. But she likely would use a general purpose generative AI such as ChatGPT for that purpose. She certainly would not create her own generative AI model.

In contrast, Article 4 of the EU DSM Directive more typically would apply to the training of generative AI models. Article 4 requires EU Member States to enact copyright exceptions permitting “reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining.” These exceptions apply on condition that the rightsholders have not opted out “in an appropriate manner, such as machine-readable means in the case of content made publicly available online.” An Article 4 TDM exception can be used by any entity, not only research organizations and cultural heritage institutions, and for any purpose, not only scientific research. Without the scientific research limitation, Article 4 would easily apply to the training of generative AI models.
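To make the idea of a machine-readable opt-out concrete, the sketch below checks a robots.txt file before collecting a page. This is an assumption made purely for illustration: the Directive does not name robots.txt or any other specific mechanism, and whether a given signal counts as an “appropriate manner” of opting out is a legal question the code does not answer. The crawler name example-tdm-bot, the function name, and the example.org URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in which a site blocks one TDM crawler
# while leaving the site open to every other user agent.
EXAMPLE_ROBOTS = """\
User-agent: example-tdm-bot
Disallow: /

User-agent: *
Allow: /
"""


def may_collect(robots_txt, user_agent, url):
    """Return True if robots_txt does not disallow user_agent from url.

    Treating robots.txt as an opt-out signal is an assumption of this
    sketch, not something the Directive prescribes.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.org/articles/1"
blocked = may_collect(EXAMPLE_ROBOTS, "example-tdm-bot", page)
allowed = may_collect(EXAMPLE_ROBOTS, "other-bot", page)
print(blocked, allowed)
```

A miner relying on an Article 4 exception would presumably run some check of this kind before adding a work to its corpus, then skip any work whose rightsholder has reserved its rights.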

Provisions consistent with these two articles must be enacted in all 27 Member States of the European Union. Countries seeking EU membership will also eventually enact similar provisions.

Singapore. Like Article 4 of the EU DSM Directive, Singapore’s TDM exception would also apply to the training of generative AI models. Section 244 of the Singapore Copyright Act allows the copying of works for the purpose of “computational data analysis.” Section 243 of the Act defines computational data analysis of a work as “(a) using a computer program to identify, extract and analyse information or data from the work or recording; and (b) using the work or recording as an example of a type of information or data to improve the functioning of a computer program in relation to that type of information or data.” Section 243 provides an illustration stating that “an example of computational analysis under paragraph (b) is the use of images to train a computer program to recognize images.” Like Article 4 of the EU DSM Directive, this provision contains no limitation such as a scientific research purpose.

Japan. Similarly, Japan’s TDM exception is worded broadly enough to apply to the training of generative AI. Article 30-4(ii) of the Japan Copyright Act permits uses of works “in data analysis (meaning the extraction, comparison, classification, or other statistical analysis of the constituent language, sounds, images, or other elemental data from a large number of works or a large volume of other such data…).” Like Article 4 of the EU DSM Directive and Singapore’s TDM exception, this provision contains no limitations such as a scientific research purpose.

United Kingdom. Conversely, the TDM exception in Section 29A of the United Kingdom’s Copyright, Designs and Patents Act applies only to “computational analysis” for “the sole purpose of research for a non-commercial purpose.” Like Article 3 of the EU DSM Directive, this appears generally to preclude copying for the training of generative AI unless the purpose is research on generative AI.

Ukraine. Ukraine has an even narrower TDM exception, permitting the making of copies “from a legitimate source for the purpose of searching for text and data included in or related to scientific publications for research purposes.” Article 22(2)(14) of the Ukraine Copyright Law appears to permit only the assembly of a corpus for the purpose of searching for information, not the sort of computational analysis necessary for training AI. Further, the corpus could contain only scientific publications. Like Article 4 of the EU DSM Directive, “this provision shall apply if the use of works has not been expressly prohibited by copyright holders in an appropriate manner, in particular, by computer-readable means from digital content available on the Internet.”

United States. Although the United States does not have a TDM exception, the Librarian of Congress in the triennial section 1201 rulemaking adopted exemptions to the prohibition on the circumvention of technological protection measures for the purpose of TDM relating to motion pictures and literary works. Specifically, the exemptions apply when the circumvention is undertaken by “a researcher affiliated with a nonprofit institution of higher education, or by a student or information technology staff member of the institution at the direction of such researcher, solely to deploy text and data mining techniques on a corpus” of motion pictures and literary works “for the purpose of scholarly research and teaching.” 37 CFR §§ 201.40(b)(4) and (5).[7] The Librarian issued the exemptions based on the recommendation of the Register of Copyrights, who found that the proposed TDM activities would likely constitute a fair use under Authors Guild, Inc. v. HathiTrust, 755 F.3d 87 (2d Cir. 2014), and Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015).[8] The scope of the section 1201 exemptions is somewhat broader than the scope of Article 3 of the EU DSM Directive in that the section 1201 exemptions apply to “scholarly research and teaching,” while Article 3 applies only to scientific research. The section 1201 exemptions would apply to the training of generative AI models only when the researcher could show that she was performing the activity for the purpose of scholarly research or teaching.

Conclusion

Article 4 of the EU DSM Directive, Sections 243-44 of the Singapore Copyright Act, and Article 30-4(ii) of the Japan Copyright Act contain exceptions for TDM broad enough to encompass the training of generative AI models. Article 3 of the EU DSM Directive and Section 29A of the UK copyright law would allow TDM for the training of generative AI models only for research purposes: scientific research under Article 3, and non-commercial research under Section 29A. Similarly, the section 1201 exemptions granted by the U.S. Librarian of Congress would apply to the training of generative AI models only when the researcher could show that she was performing the activity for the purpose of scholarly research or teaching. The TDM provision in Ukraine, however, would not appear to permit use in training AI models.


[1] Sean Flynn, et al., Legal reform to enhance global text and data mining research, Science, Dec. 1, 2022, https://www.science.org/doi/10.1126/science.add6124

[2] See U.S. Copyright Office, Artificial Intelligence and Copyright, 88 Fed. Reg. 59942, Aug. 30, 2023 at 59949. The Copyright Office explains that “machine learning involves ingesting and analyzing materials such as quantitative data or text and obtain[ing] inferences about qualities of those materials and using those inferences to accomplish a specific task. These inferences are represented within an AI model’s weights.” Id.

[3] All squares are rectangles, but not all rectangles are squares. Similarly, training a generative AI model always involves TDM, but not all TDM consists of training generative AI models.

[4] See, for example, Google Research. Google Books Ngram Viewer. Retrieved October 3, 2023, from https://books.google.com/ngrams/info and Sentiment analysis. (2023). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Sentiment_analysis&oldid=1178380470.

[5] UC Berkeley Library Comments on Artificial Intelligence and Copyright at 1-2. The comments’ authors differentiate discriminative modeling (classification and regression) from generative modeling (systems capable of producing outputs such as text or images). Lee, K., Cooper, A. F., & Grimmelmann, J. (2023). Talkin’ ‘Bout AI Generation: Copyright and the Generative AI Supply Chain (SSRN Scholarly Paper 4523551). https://doi.org/10.2139/ssrn.4523551, p. 11.

[6] Article 2 of the Directive defines “text and data mining” as “any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.”

[7] The exemptions do not define text and data mining, but the Register of Copyrights’ recommendation that these exemptions be granted stated that “TDM methods enable researchers to sift through large collections of information to draw insights and observe trends. TDM requires creating a dataset of works of interest, which ‘typically involves digitizing or downloading (i.e., reproducing) potentially copyrighted works in order to perform algorithmic extractions’ from them.”

[8] U.S. Copyright Office, Section 1201 Rulemaking: Eighth Triennial Proceeding to Determine Exemptions to the Prohibition on Circumvention, Recommendation of the Register of Copyrights, Oct. 2021, at 107-17, https://cdn.loc.gov/copyright/1201/2021/2021_Section_1201_Registers_Recommendation.pdf