Report prepared for the EC Directorate-General for Research and Innovation by an Expert Group chaired by Ian Hargreaves with members Lucie Guibault, Christian Handke, Peggy Valcke, Bertin Martens, and Ros Lynch, supported by Ros Lynch and Sergey Filippov. The full report is available here (PDF), and the executive summary follows.

Text and data mining (TDM) is an important technique for analysing and extracting new insights and knowledge from the exponentially increasing store of digital data (‘Big Data’). It is important to understand the extent to which the EU’s current legal framework encourages or obstructs this new form of research and to assess the scale of the economic issues at stake.

TDM is useful to researchers of all kinds, from historians to medical experts, and its methods are relevant to organisations throughout the public and private sectors. Because TDM research technology is not prohibitively expensive, it is readily available to lone entrepreneurs, individual post-graduate students, start-ups and small firms. It is also amenable to playful and highly speculative uses, enabling research connections between previously unconnected fields. There is growing recognition that we are at the threshold of the mass automation of service industries (automation of thinking) comparable with the robotic automation of manufacturing production lines (automation of muscle) in an earlier era. TDM will be widely used to provide insights in the redesign of this digital services economy.

When it comes to the deployment of TDM, there are worrying signs that European researchers may be falling behind, especially relative to researchers in the United States. Researchers in Europe believe that this results, at least in part, from the nature of Europe’s laws with regard to copyright, database protection and, perhaps increasingly, data privacy. In the United States, the ‘fair use’ defence against copyright infringement appears to offer greater reassurance to researchers than the comparable copyright framework in Europe, which relies upon a closed set of statutory exceptions. Recent court decisions, for example in the ten-year-old ‘Google Books’ case, appear to confirm this. The US has no equivalent of Europe’s database protection laws.

In Europe, there are signs of a response among publishers to encourage wider use of TDM. Scientific publishers have recently proposed licensing terms designed to make TDM of their own archives easier, but many researchers dismiss these efforts as insufficient, arguing that ‘the right to read is the right to mine’ and that effective research demands freedom to mine all public domain databases without restriction. These pressures from researchers have increased as a result of a growing move to ‘Open Access’ scientific publishing in Europe and elsewhere. The UK and Ireland have already committed themselves to more permissive copyright rules with regard to TDM.

Stakeholders

An overview of the debate about TDM among stakeholders draws attention to the polarisation of views between publishers (especially of scientific journals) and scientific researchers, but notes that relevant communities of interest extend way beyond these groups to include heritage institutions, technology firms, data management companies, pharmaceuticals, newspapers, healthcare providers, advertising agencies and many more. Any organisation seeking to provide a bespoke service to its customers will potentially have an interest in TDM.

It is difficult to estimate accurately the level of TDM activity taking place in Europe, though it would appear to be limited in some fields. A small study conducted by the Lisbon Council among European academics mainly in the social sciences found that few were aware of or used TDM themselves. In other fields, such as computational linguistics, TDM is said to account for almost 30% of all research projects. Some publishers report little interest in TDM; others report signs of growth. Researchers suggest this may reflect problems of data access, time-consuming procedures, legal uncertainties and shortages of sufficiently skilled researchers.

Traditional publishers distinguish between ‘access’ and ‘mining’, arguing that they are two different activities that require their own licence and may bring with them different terms and conditions. Providing researchers with ongoing, reliable access to high quality content for text and data mining is said to involve a significant investment in validation, correction and refinements to content, plus investment in systems to hold that content in a secure manner. At the same time, there is some acceptance among scientific publishers that the present arrangements are inefficient and costly and would not scale if demand for TDM were to grow as predicted.

Following on from the EU’s ‘Licences for Europe’ process, traditional publishers have argued for a ‘market solution’ based upon collaboration between the various parties. Reed Elsevier recently announced that researchers at academic institutions can use its online interface (API) to batch-download documents in computer-readable XML format, with a limit of 10,000 articles per month. PLOS, on the other hand, recently announced that it would require authors to sign a data availability statement that would guarantee, except in a few exceptional cases, that all the data used in a publication is publicly accessible to anyone at the moment the article is published.
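To illustrate what a monthly cap of this kind means for a mining project, the following minimal Python sketch splits a list of article identifiers into monthly batches. The function name `plan_monthly_batches` and the identifiers are hypothetical; the sketch does not call any publisher's API, and the default of 10,000 is simply the limit quoted above.

```python
def plan_monthly_batches(article_ids, monthly_limit=10_000):
    """Split article identifiers into consecutive batches, one per month.

    Hypothetical helper for illustration only; it performs no downloads.
    """
    if monthly_limit <= 0:
        raise ValueError("monthly_limit must be positive")
    ids = list(article_ids)
    # Each slice holds at most `monthly_limit` identifiers.
    return [ids[i:i + monthly_limit] for i in range(0, len(ids), monthly_limit)]


if __name__ == "__main__":
    # Hypothetical corpus of 25,000 articles.
    corpus = [f"article-{n}" for n in range(25_000)]
    batches = plan_monthly_batches(corpus)
    print(len(batches), [len(b) for b in batches])
```

Under these assumptions, a corpus of 25,000 articles would need three months of downloads, which gives a sense of why some researchers regard such caps as a constraint on large-scale mining.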

Many researchers, however, do not believe that licensing can solve the problems they face. They call for a revision of copyright law, perhaps in the form of an exception for TDM along the lines proposed in the UK and Ireland, along with reform of EU database law.

Researchers and publishers also disagree about a number of the technical difficulties involved in improving the conditions for TDM and related costs. The growth of Open Access publishing has tended to support the argument that researchers using TDM should not face restrictions. This argument has been supported in the context of the EU’s Horizon 2020 strategic research and innovation framework. It is acknowledged that the changes in the technologies which support research present serious questions for the business models of some publishers.

Economic issues

In thinking about copyright, economic policy-makers aim for a welfare-maximising balance between benefits for users and incentives for rights holders. There is a severe lack of empirical evidence upon which to base such calculations, though the theoretical issues are relatively well understood. These rest upon striking the right balance between incentivising the production of ‘works’, whilst avoiding ‘deadweight’ welfare losses, for example through excessive transaction costs.

Solid evidence about the prevalence of TDM is scarce, but what evidence there is suggests strong rates of growth from a low base in the last five years. Based upon an analysis of citations which mention data mining in the title of a publication, US researchers appear to be more active than researchers in other countries, though there are also disparities between European countries.
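A title-based count of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the country codes and the helper name `count_title_mentions` are hypothetical and are not taken from the report, and the case-insensitive substring match is one plausible way to detect a mention, not necessarily the method the study used.

```python
from collections import Counter

# Hypothetical sample records (title, country); not data from the report.
SAMPLE_RECORDS = [
    ("Data mining for drug discovery", "US"),
    ("A survey of medieval trade routes", "DE"),
    ("Text and data mining in the humanities", "UK"),
    ("Scalable Data Mining on clusters", "US"),
]


def count_title_mentions(records, phrase="data mining"):
    """Count, per country, the publications whose title contains `phrase`."""
    needle = phrase.lower()
    counts = Counter()
    for title, country in records:
        # Case-insensitive substring match on the title only.
        if needle in title.lower():
            counts[country] += 1
    return counts


if __name__ == "__main__":
    print(count_title_mentions(SAMPLE_RECORDS))
```

A measure like this captures only publications that name the technique in the title, which is one reason such evidence should be read as indicative rather than conclusive.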

Based upon assumptions in a range of studies, estimates are made of the potential value of TDM to Europe’s economy, assuming an increase in researcher productivity of 2 per cent and consequent growth in the volume of research and its associated benefits. On conservative assumptions (a narrow definition of the scope for TDM), a GDP gain in Europe ‘of the order of magnitude of tens of billions of Euros’ appears feasible.

A discussion of market failure and the shortfall in competitive TDM in Europe considers three reasons why the transformative and economically valuable secondary use of copyright works (as exemplified by TDM) may be suboptimal. These factors are: transaction costs, strategic behaviour by copyright holders and externalities. In considering the potential economic consequences of changes in the law governing TDM, five definitions of the boundaries of TDM are considered in order to address the critical economic question of the extent to which any given legal reform will or will not adversely affect the supply of new works, in ways likely to affect the balance of welfare.

In considering various possible forms of legal exception from copyright and database law for text and data miners, the argument is made that from an economic perspective it makes little sense to propose a distinction between commercial and non-commercial TDM. A well-designed copyright regime should provide appropriate stimulus for all types of research and, at the same time, an appropriate level of protection for all rights owners. Once this balance has been reached, there is no reason to distinguish between commercial and non-commercial research.

Legal issues

This section asks whether legal barriers impede the conduct of TDM for research purposes and, if so, how these barriers might be alleviated in the light of the current European legal framework, taking the interests of all stakeholders into account. A range of potential reforms is discussed.

A description is offered of the application of intellectual property laws relevant to TDM in the United States and four other countries. In the US, it is judged reasonable to assume that copying acts by American TDM researchers for the purpose of extracting non-expressive metadata could be considered fair use under US law. Under Canadian law, TDM activities would likewise probably qualify as fair dealing. Australia’s legal regime appears to be more restrictive than in North America. The picture is less clear cut in Japan and Israel, though in both these countries there have been legal changes which may be helpful to researchers using TDM.

The extent to which TDM in Europe is facilitated by any existing exceptions to either EU copyright or database law appears unclear. The application of a copyright and database exception relating to teaching or scientific research is optional and has not been implemented at all in some Member States. This has contributed to uncertainty in the European scientific research community.

Encouraging TDM for research purposes without fear of infringing IP rights could be achieved in a number of ways: through an adjustment of licensing practices; through a revised, normative interpretation of the ‘reproduction right’; through the introduction of a new exception in copyright and database laws; or through the adoption of an ‘open norm’ designed to guide the courts to take a more flexible view of what users are permitted to do. Should an exception be introduced in the European legal framework, the legislator would also need to consider whether to ensure that it cannot be over-ridden through the enforcement of restrictive contractual clauses or technological protection measures.

An approach based upon licensing alone would probably be insufficient to allow TDM to take place in all instances where it would be socially desirable, because of uneven levels of access, high transaction costs and patchy availability of works covered by a Creative Commons licence.

A more promising route could involve reconsideration of the right of reproduction in copyright law, along with the right of extraction in the database regime. These have traditionally been subject to increasingly broad interpretation, but the need to boost TDM in Europe provides impetus to consider a change of emphasis. This would involve the legislator adopting a ‘normative’ approach, designed to ensure that protection is supported by the courts only for acts of reproduction or extraction that entail ‘expressive’ exploitation of the rights-protected material. This would put TDM’s non-expressive and socially beneficial mechanical sifting of data beyond successful challenge in the courts. Such a shift could be achieved through an interpretation instrument issued by the European legislator, accompanied by a re-assessment of the Database Directive, building upon the European Commission’s own highly critical evaluation report in 2005.

A third alternative would be to introduce a new exception in copyright and database law. This might take one of two forms: an exception specifically permitting TDM for the purpose of research, or an open norm. The first would provide more immediate clarity; the second would offer more flexibility in a fast-changing technological environment. An ‘open norm’ approach could involve a re-balanced interpretation of the Berne Convention’s Three Step Test.

Finally, two areas of legal discussion beyond IP law are considered. The first concerns demands to resist the ‘monopolisation of information’ by major holders of data, potentially through the operation of competition law. Among the ideas discussed is the call for a more general regime of mandatory openness and interoperability (with open standards) in online environments, designed to prevent a major data holder (one might think of Facebook, Twitter, Google or other online players) ‘from erecting a fence around its piece of the information commons.’

The second area of non-IP law concerns data privacy, where already strong European laws protecting individual privacy stand to be strengthened by the draft Data Protection Regulation currently under consideration. This draft legislation includes a provision explicitly permitting the processing of even sensitive personal data for the purposes of historical, statistical or scientific research, subject to certain safeguards. It has been argued, however, that the draft legislation will prove problematic for TDM, because mining requires sweeping assemblies of data and an exploratory, iterative approach to research goals.

Some researchers argue for a shift of regulatory attention away from data collection and towards the way that data and knowledge based on data are used or abused.

Conclusions

From the analysis in this paper, we can draw the following analytical conclusions about TDM and the challenge it presents to policymakers in Europe:

  • Text and data mining is an important research technique which is certain to become more important as researchers acquire the skills and the technology to address and investigate datasets of increasing size, complexity and diversity in all media: text, numbers, images, audio files and in any other form.
  • TDM represents a significant economic opportunity for Europe. Prolific use of TDM would add tens of billions of Euros in value to the EU’s aggregate GDP. This would result chiefly from higher productivity among researchers and from the effects (‘externalities’) of increased levels of research.
  • At present, the use of TDM tools by researchers in Europe appears to be lower, and probably significantly lower, than is the case in the United States and some other countries in the Americas and Asia. This probably reflects, among other factors, disadvantages created by the European legal framework with regard to TDM.
  • The European legislator needs to re-consider and reform the EU’s legal framework with regard to copyright, database protection and possibly data privacy, in order to support the international competitiveness of Europe’s research base.
  • There is a serious risk that Europe’s relative competitive position as a research location for the exploitation of ‘Big Data’ will deteriorate further, if steps are not taken to address the issues discussed in this report. The results of this might well include a loss of talent and a loss of investment to more favourable research locations.

In response to this analysis, the Expert Review group proposes three action points:

  1. We welcome initiatives to make licensing of works for the purpose of text and data mining easier. In the short term, these will add value to the economy and help to build the skills-base and culture necessary for successful ‘big data’ research in the digital economy. This activity, however, should be seen as a prologue to legal reform, not an end in itself.
  2. A specific and mandatory exception to remove text and data mining for scientific purposes from the reach of European copyright and database law should be drafted. This should be regarded as a short-term amelioration, in the event that our third proposal, below, cannot make timely progress.
  3. The best approach to reform, aimed at securing a competitive legal framework for European research, is to establish a durable distinction in European law between copyright’s longstanding and legitimate role in protecting the rights of authors of ‘expressive’ works and copyright’s questionable role in the digital age of presenting a barrier to modern research techniques and so to the pursuit of new knowledge. This initiative should be at the heart of a new copyright directive in Europe, following the consultations currently being undertaken by the European Commission. The legal analysis in this report offers more than one route via which a reform of this kind might be pursued; for example by introducing a suitable ‘interpretative instrument’ into a new Copyright Directive. We also urge the legislator, including the European Parliament, to ensure that the currently proposed reform of Europe’s data protection laws avoids the unintended consequence of creating further impediments to the work of scientific researchers. We make these recommendations in the interests of the international competitiveness of the European Union’s research base.

Full EC Expert Working Group Report: Standardisation in the Area of Innovation and Technological Development, Notably in the Field of Text and Data Mining