Author: Matthew Sag

Abstract: Individually and collectively, copyrighted works have the potential to generate information that goes far beyond what their individual authors expressed or intended. Various methods of computational and statistical analysis of text, usually referred to as text data mining (“TDM”) or just text mining, can unlock that information. However, because almost every use of TDM involves making copies of the text to be mined, the legality of that copying has become a fraught issue in copyright law in the United States and around the world. One of the most fundamental questions for copyright law in the Internet age is whether the protection of the author’s original expression should stand as an obstacle to the generation of insights about that expression. How this question is answered will have a profound influence on the future of research across the sciences and the humanities, and on the development of the next generation of information technology: machine learning and artificial intelligence.

This Article consolidates a theory of copyright law that I have advanced in a series of articles and amicus briefs over the past decade. It explains why applying copyright’s fundamental principles in the context of new technologies necessarily implies that copying expressive works for non-expressive purposes should not be counted as infringement and must be recognized as fair use. The Article shows how that theory was adopted and applied in the recent high-profile test cases, Authors Guild v. HathiTrust and Authors Guild v. Google, and takes stock of the legal context for TDM research in the United States in the aftermath of those decisions.

The Article makes important contributions to copyright theory, but it also integrates that theory with a practical assessment of various interrelated legal issues that text mining researchers and their supporting institutions must confront if they are to realize the full potential of these technologies. These issues include the enforceability of website terms of service, the effect of laws prohibiting computer hacking and the circumvention of technological protection measures (i.e., encryption and other digital locks), and cross-border copyright issues.

Citation: Sag, Matthew, The New Legal Landscape for Text Mining and Machine Learning (February 9, 2019). Available at SSRN: https://ssrn.com/abstract=3331606