João Pedro Quintais and Nick Diakopoulos
Generative AI in the Newsroom

By now you might have heard of some of the lawsuits filed against AI companies, alleging that they infringed on copyright in the training of their models. AI-generated vocals are roiling the music industry, with platforms acting to take down infringing content. And some news executives are even calling for compensation for the use of their content in the training of such systems.

From the perspective of authors and copyright holders, there is a clear concern that generative AI tools are built on the unauthorized and non-remunerated use of their works, while at the same time negatively impacting their livelihoods. Others note that these tools also benefit many artists and content creators, whose interests should likewise be considered when regulating these technologies from a copyright policy perspective. Others still are concerned that legal intervention at this stage would lead to market concentration and “make our creative world even more homogenous and sanitized”.

Are AI companies really violating copyright law during the model training stage? And what about the copyright status of the outputs of these models? These kinds of questions reflect some of the most pressing and pertinent legal issues that journalists need to consider in their use of generative AI.

In this post we’ll parse these legal issues, first offering some background on copyright law and AI models, and then reflecting on some more specific and pragmatic questions that may impact how you think about using the models in different news production tasks. (Note: the following is not legal advice and is meant for educational purposes.)

Some context on copyright law

Despite what you might have heard on Twitter, there is not one copyright law for the entire world. Rather, copyright law is territorial. To be sure, there is significant international harmonization through a number of international treaties that set out minimum standards on certain aspects of copyright law, and which are then implemented into national laws (see here for some basics). The same is true, for instance, at the regional level, where the EU has significantly harmonized copyright law through a number of Directives (listed here); but even then, each Member State must implement these directives into its national law, sometimes leading to different rules. The upshot is that different countries have different laws to regulate copyright, including all of the questions we address below regarding generative AI tools. Moreover, due to different legal traditions, EU Member States’ laws take a different approach from, for example, US law on key issues in this area. Our analysis below will mostly focus on EU copyright law and simplify the legal issues (so bear with us if you’re a legal expert!).

Inputs and Outputs: How to approach copyright law issues with generative AI

One way to consider the copyright aspects of generative AI tools is to divide them into legal questions that deal with the input or training side vs. questions that deal with the output side.

From the input perspective, the main issue relates to the activities needed to build an AI system. In particular, the training stage of the AI tools we are considering here requires text and data mining (TDM) of copyrighted works. In the EU, these activities are mostly regulated by two TDM exceptions in the 2019 Copyright in the Digital Single Market Directive, which cover TDM for scientific purposes (Article 3) and what is called commercial TDM (Article 4). For models like Midjourney, Stable Diffusion, DALL-E, or Firefly, the relevant provision would be the commercial TDM exception.

In the US, absent a specific TDM exception, the legal question is whether these activities qualify as fair use. In the aftermath of cases like Authors Guild v. HathiTrust and Authors Guild v. Google, it has been argued that the US doctrine of fair use allows for a significant range of TDM activities on in-copyright works (see here and here; for a critical framing of questions of fair use in dataset creation, see here). The result is that US copyright law is arguably one of the most permissive for TDM activities in the world, especially when compared to laws that rely on stricter exceptions and limitations, like the EU’s (see here). This arguably makes the US an appealing jurisdiction for companies to develop generative AI tools.

From the output perspective, a number of copyright questions are relevant. Is an output generated by Midjourney, Stable Diffusion, DALL-E, or Firefly protected by copyright? Does such an output infringe on a copyrighted work of a third party, especially those works “ingested” during the training stage of the AI system? Under US law, is the output a “derivative work” of the “ingested” copyrighted works? Do any copyright exceptions apply to outputs that might otherwise infringe copyright?

Some of these input and output questions are already being litigated in the US and the UK, most notably in a class action litigation against providers of Stable Diffusion (see complaint and motion to dismiss), as well as in lawsuits brought by Getty Images (reported here). On the related topic of software generation, it is also important to mention the class action lawsuit against Microsoft, GitHub, and OpenAI concerning the GitHub Copilot (reported here; case updates here).

In the US, the US Copyright Office (USCO) has taken a position on some of these issues. For example, in its “Zarya of the Dawn” decision, concerning a graphic novel, the USCO refused registration of images generated by an AI tool while allowing registration of the accompanying human-authored text (see also the earlier Thaler decision). In a more recent development, the USCO issued a Copyright Registration Guidance on works containing material generated by AI. Although it is beyond the scope of this post to discuss this Guidance in detail, the document emphasizes the application of the human authorship requirement in this context and draws some concrete conclusions regarding its application to popular generative tools. The following passage illustrates the approach taken to the question of whether an AI output is protected by copyright:

“when an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the “traditional elements of authorship” are determined and executed by the technology — not the human user… When an AI technology determines the expressive elements of its output, the generated material is not the product of human authorship. As a result, that material is not protected by copyright and must be disclaimed in a registration application” (p.4).

In the EU, there has been no litigation to date that we are aware of. However, there seems to be an appetite for that, at least if one considers the manifesto of artists and rights holders in the European Guild for AI Regulation. In any case, there is plenty of scholarship and analysis on the topics above (reported e.g. here), which we rely on in addressing the frequently asked questions below.

Q: For image generation models like Midjourney, Stable Diffusion, DALL-E, or Firefly, are the outputs copyrighted?

In theory, AI outputs may be copyrightable if they meet the legal requirements for protection. The general direction of both EU and US law is to emphasize human originality as expressed in an output.

In the EU, such outputs would have to meet the originality requirement, as interpreted by the Court of Justice of the EU. The legal formula the Court relies on is that protected subject matter must express the author’s “own intellectual creation” and their “free and creative choices” (further analysis here). In the US, judging from the USCO’s guidance, the approach is similar or at least leads to similar outcomes:

“…a human may select or arrange AI-generated material in a sufficiently creative way that “the resulting work as a whole constitutes an original work of authorship.” Or an artist may modify material originally generated by AI technology to such a degree that the modifications meet the standard for copyright protection.” (p.4)

In practical terms, both EU and US approaches lead to the conclusion that merely introducing prompts into a generative tool would not be sufficient to grant the prompter copyright over the output. The result is likely the same for most types of basic “prompt engineering”. Whether highly sophisticated prompts might change this conclusion is a gray area and would have to be assessed on a case-by-case basis. What could, however, probably lead to copyright protection of an AI output is subsequent curation or redaction of the output by the prompter before its publication. In other words, it may very well make sense to do some further editing of outputs before publication.

Q: Do I own the output if I prompt the model? What am I allowed to do with the output?

This is not as simple a question as you might think. The reason is that authorship and ownership are different elements and may in practice diverge. That is to say, just because you are the author of a work does not always mean you are its owner. To a large extent, authorship and ownership will depend on how applicable law governs this question and, in many cases, on contractual arrangements.

Proving or enforcing authorship or copyright ownership of a work may sometimes be difficult in practice. For this reason, in the EU, many Member States provide for rules that establish a (rebuttable) presumption of authorship or copyright ownership, in that the person indicated on or with the published work as the author is deemed to be the author, unless proven otherwise. Naturally, in the case of AI generated output, this may lead to false attributions of authorship and ownership to a natural person, e.g. the prompter that publishes the output as their own.

But it is worth considering how this could play out for AI-generated outputs. As a rule, assuming the output is protected by copyright and the provider’s terms of use are silent on the topic, the prompter will be both the author and owner of the output.

The question then is, for most cases that we are discussing here, what do the providers’ terms of use say? To answer this question, we are assuming that these terms are valid and enforceable under different national laws, meaning that they will govern the question of ownership of the outputs. Let’s look at two examples, one for a text generator (ChatGPT) and the other for an image generator (Midjourney).

ChatGPT

OpenAI’s Terms of Use (ToU; updated 14 March 2023) apply to their large language models, such as ChatGPT and GPT-3. (For additional policies see here, including the “content policy” applicable to DALL-E here.)

As a general rule, section 3(a) of the ToU states that (1) all input (e.g. a prompt) is owned by the user and (2) subject to compliance with the ToU, all rights to the Output (e.g. the generated text) are assigned to the user. In short, if a user complies with the ToU, they will own the copyright to the output (assuming it meets the legal standard of originality). This does not include ownership of similar or identical outputs that may be generated by another user as a result of a similar or identical prompt.

The ToU also incorporates a Sharing & publication policy (updated 14 November 2022), which includes rules on (1) Social media, livestreaming, and demonstrations, (2) Content co-authored with the OpenAI API, and (3) Research. We’ll focus on (1) and (2).

  • Rules on (1) refer to OpenAI’s policy on permitted sharing, including compliance with their Usage policies and rules of interest from a copyright perspective, namely: the need to attribute the output to the user’s name or company; and to “indicate that the content is AI-generated in a way no user could reasonably miss or misunderstand.”
  • Rules on (2), Content co-authored with the OpenAI API, are even more interesting. These apply to creators who “wish to publish their first-party written content (e.g., a book, compendium of short stories) created in part with the OpenAI API”. For such users, a number of conditions are set out, in particular: the published content must be attributed to the user or their company; the role of the AI in “formulating the content is clearly disclosed”; and the topics of the content comply with the ToU and the content policy, and are generally not offensive. Some examples (including “stock language”) are provided on how to make these disclosures in e.g. a Foreword or Introduction.

In our view, these rules do not materially change the general ownership provision mentioned above. However, they do impose some restrictions on how the content is presented to the outside world in a way that may impact presumptions of authorship or ownership. For instance, there are clear implications for a publication’s approach to bylines so as to be compliant with the OpenAI policy. Furthermore, ownership of the outputs is conditional upon compliance with the ToU and incorporated policies, meaning that breaching these obligations may have consequences for ownership purposes.

Midjourney

Midjourney’s Terms of Service (ToS) define as “Assets” both (1) images and other assets generated with the tool (called the “Service”) and (2) prompts a user might enter into the tool. For our purposes, we are concerned with category (1), the AI outputs, not the prompts.

Section 4 of the ToS regulates “copyright and trademark”. In our view, this is drafted in a bit of a misleading way. As a general rule, it states that the user owns all outputs it creates with the tool, “to the extent possible under current law”. That is to say, the prompter is the owner of the copyright in the output. Of course, if the output does not qualify for copyright protection under applicable law, then you will not own a copyright on it.

An important clarification of scope follows. This general rule does not apply to upscaling the images of other users. Such images or outputs remain owned by the original creator of the prompt.

Then the ToS performs a bit of a magic trick that negates the general rule. In essence, it boils down to the following: only paying users own outputs. Two aspects are key here.

  • First, if the users are (1) employees or owners of a company with more than $1,000,000 USD a year in gross revenue and (2) are using the tool on behalf of the employer, then (3) the users must purchase a “Pro” membership in order to own outputs.
  • Second, if a user is not a “Paid Member”, they do not own the outputs they create. Instead, Midjourney grants them a Creative Commons Noncommercial 4.0 Attribution International License to use the output, also called the “Asset License”.

In short, if you are not a “Paid Member” and are making a commercial use of these outputs, you might be in breach of your agreement with Midjourney. So, as practical recommendations go, if you want to make a commercial use of outputs generated by Midjourney, then become a “Paid Member” or get a “Pro” plan.

Q: What about the copyright of the artists/illustrators/photographers whose works the AI is trained on?

This question can be looked at from two perspectives. On the one hand, we can consider whether a certain output infringes the rights of the creators of works used during the training of the model. This is mostly relevant for users of AI tools, like journalists. On the other hand, we can examine whether the TDM activities to develop a generative AI model infringe on the rights of those creators. This is mostly relevant for developers of AI tools and is at the crux of ongoing litigation mentioned above.

Let’s focus on the first aspect, which is the most relevant for our purposes here. In theory, copyright protection for the output is separate from protection for the works on which the model was trained (see above the distinction between the input vs. output stages). Generative models are able to “memorize” content they are trained on, i.e. produce output identical to works in the training data. Although such cases of identity are theoretically possible and have been reported, they are rare. Even in the Stability AI class action lawsuit, the complaint recognizes that “none of the Stable Diffusion output images provided in response to a particular Text Prompt is likely to be a close match for any specific image in the training data” (see para 93). Still, if that occurs, then there is a likelihood that the output is infringing.

While this is a statistically rare occurrence, what may occur more frequently is that there is similarity between the output and one or several of the input works. Under many national laws, an output would be infringing if it is substantially similar to a pre-existing work in the training data (on copyright’s substantial similarity test in US law, see here). In any case, whether this similarity between input and output is sufficient to lead to infringement would have to be assessed on a case by case basis.

Let’s assume that the output generated by an AI model is not an exact replica of any works used during the training stage. It’s important to note that copyright only protects original expressions of human authorship, not ideas, procedures, methods of operation, concepts, or styles (on copyright and styles, see here). However, in practice, the line between expression and style can be blurry, especially for works of popular or iconic creators. For example, if we use a generative AI tool to produce something in the style of a famous artist, and the output is similar to an existing painting of that artist in the training dataset, it can be difficult to distinguish between expression and style. Nevertheless, the legal question is whether the output is substantially similar to the work of the artist, according to applicable law. Merely copying or mimicking a pre-existing style will not per se be sufficient to establish infringement.

A related question being discussed in the US litigation mentioned above is whether output can be considered a “derivative work” of the copyrighted works the model was trained on. At least as construed in the class action litigation against providers of Stable Diffusion, the argument does not appear very strong, as it is based on an incorrect representation of how this AI model works (see here for more detailed criticism and the aforementioned Motion to Dismiss the lawsuit).

On this topic, it is important to note that EU copyright law does not contain specific rules on derivative works, although some national laws may contain regimes with some shared characteristics. As such, it is unlikely that a lawsuit on these exact grounds would be brought in an EU Member State. From our perspective, the most likely grounds for a lawsuit against an AI provider in the EU would be that its TDM activities for training a model infringe the reproduction right of copyright holders and are not covered by an existing exception.

Q: Are there legal risks to publishing these outputs, and if so, how can those risks be minimized?

As many lawyers would say, “it depends”. From the discussion above, including existing litigation, most of the risk lies with the generative tool provider. This is especially the case for their TDM activities when training the model. From the perspective of the user that prompts the tool and subsequently uses the generated output, the risk is reduced, especially for textual output.

Assuming there are no contractual restrictions to further use of the output imposed by the provider of the AI tool (but see above the analysis of ChatGPT and Midjourney’s terms of use), the main risk is that an output is identical or close to an existing work. As we mentioned, exact replication of input and output will in principle be a rare occurrence. The risk therefore relates mostly to substantially similar outputs.

To check for substantial similarity, journalists might make an effort to find related images by using a reverse image search tool such as Google Images or TinEye on the outputs of the model. If the image search returns results that are substantially similar, then an alternative image could be generated. If a case were brought, this process could potentially help in establishing good faith and, at the very least, limit damages (depending on the applicable law).
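To make this concrete, here is a minimal sketch of what such a pre-publication check might look like in a newsroom workflow. The `reverse_search` function is a hypothetical stand-in for a call to a reverse image search provider (TinEye, for example, offers a commercial API); the `Match` structure and the similarity threshold are illustrative assumptions, not any provider’s real interface.

```python
# Sketch: screen a generated image against the web before publication.
# `reverse_search` is a hypothetical stand-in for a reverse image search
# provider's API; it is NOT a real client library.

from dataclasses import dataclass

@dataclass
class Match:
    url: str      # where a similar image was found
    score: float  # similarity score in [0, 1] (provider-specific in practice)

def reverse_search(image_path: str) -> list[Match]:
    """Replace this stub with a real call to your reverse image search provider."""
    return []  # stub: no matches

def needs_review(image_path: str, threshold: float = 0.9) -> bool:
    """Flag generated images that closely match existing web images.

    A high-scoring match is only a signal for human review, not a legal
    determination of substantial similarity.
    """
    suspicious = [m for m in reverse_search(image_path) if m.score >= threshold]
    for m in suspicious:
        print(f"Review: output resembles {m.url} (score {m.score:.2f})")
    return bool(suspicious)

if needs_review("generated/illustration.png"):
    print("Consider regenerating with a different prompt before publishing.")
```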

From a prompting point of view, it would seem that the more sophisticated and elaborate the prompt engineering, the less likely that the output would be substantially similar to a pre-existing work, and the lower the risk. Moreover, as a rule, prompting norms should be established so that the name or distinctive style of an artist is not used, unless that artist’s work is fully in the public domain. This explicitly avoids the chance of copying that artist’s work and reduces the potential for impacting the market for their work. One important exception would be where the prompter wishes to engage in a transformative use of that artist’s work or produce some type of political, artistic, or socially relevant commentary on their work; in such a case, it is likely that the use in question is protected by existing exceptions, both in the EU and US.
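As a rough illustration of how such a prompting norm could be operationalized, the sketch below checks prompts against an editorially maintained denylist before they reach the image model. The denylist entries are made-up placeholders; a real list would be curated by the newsroom and kept up to date.

```python
# Sketch: enforce a newsroom prompting norm that avoids invoking the names
# or distinctive styles of living artists. The denylist entries below are
# placeholders, not a real editorial policy.

DENYLIST = {
    "in the style of",   # generic style-mimicry phrasing
    "some artist name",  # placeholder for a living artist's name
}

def check_prompt(prompt: str) -> list[str]:
    """Return the denylisted phrases found in a prompt (case-insensitive)."""
    lowered = prompt.lower()
    return [phrase for phrase in DENYLIST if phrase in lowered]

violations = check_prompt("a newsroom at dawn, in the style of Some Artist Name")
if violations:
    print("Prompt flagged; rephrase before generating:", violations)
```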

In any case, good faith uses for non-commercial or journalistic purposes would generally fare better than bad faith and (exclusively) commercial uses in a potential infringement scenario. Risk could also be mitigated by having mechanisms in place to remove the allegedly infringing content upon obtaining knowledge of this potential illegality (e.g. a notice and take-down system).

Q: Considering the last question, is the legal situation any different for generative text than for images?

From our perspective, the risk seems to be lower for generative text than images. The main reason is that infringement of a pre-existing work will generally be harder to establish for textual content than for images. Moreover, the closer the text is to factual material, public domain material, or other unprotected material, the harder it is to establish infringement.

Q: Is there something news organizations can do to protect their work if they don’t want it used to train AI models? Do news organizations (or individual journalists) have any recourse to protect their material if it has already been incorporated into learned models without consent?

In the EU at least, the above-mentioned “commercial” TDM exception provides a clear avenue to do this under an “opt-out” mechanism. Article 4 of the CDSM Directive sets forth an exception for reproductions and extractions of lawfully accessed works/subject matter for the purposes of TDM. This exception is subject to reservation by rights holders, including through “machine-readable means in the case of content made publicly available online”, for instance through the use of metadata and the terms and conditions of a website or a service. This is usually called the “opt-out” provision and is already being used in practice by some creators, for instance through tools like those provided by spawning.ai. Some commentators consider that this approach has the potential to increase the bargaining power of rights holders and lead to licensing deals with (and remuneration from) AI providers, while others are more critical, arguing it will lead to market concentration and exploitation of creative workers by big companies. In theory at least, if they see this approach as promising, news organizations or individual journalists could do the same. To prevent future use of content not already included, they can also deploy technical restrictions on crawling or harvesting from their services, though it would be important to ensure that such technical blocks do not impact other welcome uses of their content. Eventually, standards could emerge for both opt-outs and technical restrictions.
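There is no settled technical standard for these restrictions yet, but one concrete lever already available is robots.txt. As a rough sketch, the snippet below appends directives asking known training-data crawlers not to harvest a site: “CCBot” is Common Crawl’s documented user agent (Common Crawl is a frequent source of training corpora), while any further entries would need to be verified against each crawler’s own documentation. Note the caveat above: blocking a general-purpose crawler can also affect other, welcome uses of the content.

```python
# Sketch: generate robots.txt directives asking crawlers commonly used to
# assemble training corpora not to harvest a site. "CCBot" is Common Crawl's
# documented user agent; extend the list as other crawlers publish theirs.

BLOCKED_CRAWLERS = ["CCBot"]

def robots_optout(crawlers: list[str]) -> str:
    """Return robots.txt blocks disallowing the given user agents site-wide."""
    return "\n\n".join(f"User-agent: {ua}\nDisallow: /" for ua in crawlers) + "\n"

# Append to an existing robots.txt (review the result by hand before deploying).
with open("robots.txt", "a", encoding="utf-8") as f:
    f.write(robots_optout(BLOCKED_CRAWLERS))
```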

One shortcoming of the TDM opt-out approach is that it relies significantly on the public availability of training datasets (e.g., LAION’s) in order to effectively opt out: you have to have some way to know that your image was actually used in training. In response to this issue, and in the context of the proposed AI Act, EU lawmakers are currently considering requiring providers of generative AI systems (as a type of “foundation model”) to “make publicly available a summary of the use of training data protected under copyright law.”
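For datasets whose indexes are public, that check can be done directly. As an illustration, and assuming you have downloaded one of LAION’s publicly released metadata files (they are distributed as Apache Parquet; the file name below is hypothetical, and the URL column name varies by release, so check the schema first), a sketch like the following would surface images harvested from your domain:

```python
# Sketch: check whether images from your domain appear in a publicly
# released training-dataset index. The file name is hypothetical and the
# URL column name varies by LAION release, so inspect the schema first.

import pandas as pd

METADATA_FILE = "laion_metadata_shard_00000.parquet"  # hypothetical name
YOUR_DOMAIN = "example-news-site.com"

df = pd.read_parquet(METADATA_FILE, columns=["URL"])  # column may be "url"
hits = df[df["URL"].str.contains(YOUR_DOMAIN, na=False)]

print(f"{len(hits)} image URLs from {YOUR_DOMAIN} found in this shard")
print(hits["URL"].head())
```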

Q: What are the biggest open legal questions surrounding generative AI right now?

Generative AI has legal implications beyond copyright law, such as liability (listen here and here) and privacy violations, as seen in the recent restriction of ChatGPT in Italy. The biggest copyright law question in the EU and US is probably whether using in-copyright works to train generative AI systems is copyright infringement or falls under the TDM exception (in the EU) or fair use (in the US).

In the EU, the emergence of generative AI has disrupted the legislative process for the AI Act and forced lawmakers to reconsider how they categorize and assign responsibilities to AI systems (see here). Although the AI Act is not a copyright instrument, it may mandate transparency about the use of in-copyright training data, as noted above. This could allow creators to opt out and establish a market for their works in training generative AI models. Whether this is feasible or desirable on a large scale remains to be seen. In the US, TDM activities to train AI systems were probably considered transformative fair use until recently. However, as current generative AI systems train directly on artists’ works and produce outputs that compete with them in the market, it is unclear whether copying in-copyright works for training purposes qualifies as fair use.

Implications for Practice

Given the legal uncertainty that will persist while these lawsuits are litigated, practicing journalists can take steps now to try to mitigate potential ethical harms and legal risks. Even if generative AI models are ultimately permitted under the TDM exception or fair use, ethical and responsible use is nonetheless warranted.

As described above, we suggest the following as a starting point for responsible use and mitigating legal risks:

  • Consider doing additional editing or curation of outputs from generative models before publication. This increases the likelihood of copyright protection by meeting the originality requirement.
  • Read the terms of use of the specific models you want to use carefully to assess whether you are compliant, since this could have implications for your ownership or use of model outputs (e.g. byline policies).
  • Use a reverse image search tool on any outputs from generative AI image models to search the web for copyrighted images that you deem substantially similar. If something matches and could infringe on a copyright holder’s work, then you should generate an alternative image.
  • In most cases, the name of an artist, or the distinctive style of their copyrighted work, should not be used to prompt image generation models. This will reduce the chance of copying that artist’s work and reduce the potential for impacting the market for their work.
  • News organizations in the EU can consider whether they want to opt out of having their content used to train generative models.