Sean Flynn and Lokesh Vyas

Introduction

On December 2, 2022, Science Magazine published a joint academic opinion by leading copyright scholars from around the world calling for copyright reform to enable text and data mining (TDM) research.[1] The opinion calls for all countries to evaluate their laws, and for international institutions to guide them, so that text and data mining research can take place everywhere, including through cross-border collaborations between researchers in different countries. In this article, we survey some of the important kinds of TDM projects that copyright law must permit if they are to be carried out everywhere.

I. Speeding Literature Review

One of the most common uses of TDM is to help scholars find, read, and analyze information in academic journals and other sources.[2] With court decisions in the United States and legal changes in other countries permitting greater use of TDM without additional licensing of the underlying works, a number of research tools have become available to aid automated literature review, including EvidenceFinder,[3] ASReview,[4] Covidence,[5] DistillerSR,[6] JBI SUMARI,[7] Colandr,[8] Rayyan,[9] RobotAnalyst,[10] and Anni.[11] Many of the projects described below apply this basic function of TDM to specific problems. Studies have shown that TDM research produces significantly better results when it uses full-text articles, which are often behind a paywall or subject to additional licensing, rather than mere abstracts, which are often more freely available.[12] Requirements for additional permissions to use published research in TDM delay research and discourage the use of broader and more accurate data sets in studies.[13]

II. Enabling Medical Discovery

In 2003, text mining suggested that thalidomide, a drug taken off the market decades earlier, had the potential to treat chronic hepatitis and other diseases not previously associated with the drug.[14] In 2007, scientists discovered a new link between genes and osteoporosis by using a TDM tool to analyze PubMed, a database of 30 million citations for biomedical literature.[15] In 2014, doctors at the Hospital del Mar Institute for Medical Research (IMIM) in Spain evaluated the usefulness of TDM in respiratory diseases, concluding that TDM can play a significant role in respiratory disease research and clinical care.[16]

III. Epidemic and Pandemic Tracking

The outbreak in Wuhan, China, of a novel coronavirus, whose disease was later named COVID-19, was first detected by a Canadian artificial intelligence firm called BlueDot, which analyzed “a variety of information sources, including chomping through 100,000 news reports in 65 languages a day” to recognize patterns between health outbreaks and travel.[17] Other TDM projects have examined social media and other online sources to track and explain COVID-19 vaccination hesitancy, and to identify high-risk COVID-19 patients.[18]

IV. Vaccine Research

COVID-19 research benefitted from TDM projects that mined scientific publications about the coronavirus family, helping to speed the identification of vaccine candidates.[19]

V. Inadequacy of Restricting Research to Open Access Sources

The Wellcome Trust’s 2012 submission to the UK IPO consultations on TDM noted that, of the 2,930 full-text articles in the UK PubMed Central repository published since 2000 that use the word “malaria” in the title, only 62% are open to text and data mining research.[20]

VI. Identifying Disinformation and Hate Speech in Media

TDM researchers seeking to track and expose disinformation need to make and share reproductions of many kinds of copyrighted media, including news reports, blogs, websites, social media, and other sources.[21] Combatting disinformation about COVID-19 vaccines, treatment, and methods of transmission is one recent example where TDM’s use of copyrighted material has been of critical importance.[22]

VII. Decolonizing Science with Multilingual Translation Tools

TDM of news and other sources is used to train machine learning programs to translate articles and other documents from one language into another, radically expanding the ability of people around the world to process material produced in any language. Projects in South Africa and Kenya, for example, are building translation tools that can translate academic articles into Swahili, Zulu, and other indigenous languages in an effort to “decolonise science.”[23] But training these tools requires the ability to reproduce and mine newspaper articles written in African languages,[24] for which publishers may refuse permission in the absence of adequate copyright exceptions.

VIII. Examining Gender in Literature

A study on the transformation of gender in English-language fiction examined a collection of over 100,000 novels, dating from 1703 to 2009, in the HathiTrust Digital Library.[25] It analysed the differences in language used to discuss male-identified and female-identified fictional characters, finding that from the nineteenth century through to the early 1960s, the proportion of space devoted to female-identified characters decreased. The study was made possible by reproductions of books made by the Google Books project and provided to HathiTrust, whose making available of the resource for text and data mining research was held to be a fair use in Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014).

IX. Learning Analytics to Improve Educational Policies in South America (Uruguay)

The educational authorities of Uruguay have signed a contract with a well-known company that provides virtual classroom services for the primary and secondary levels of public and private education. But the platform’s terms of use do not allow text and data mining research, and Uruguay’s law does not provide an applicable exception. This lack of clear legal authority has dissuaded the National Research Agency of Uruguay from using learning platform data in its project to create “Prediction models for the determination of academic risk,”[26] which seeks to create an early warning system for academic risk among public primary and secondary education students in Uruguay.


[1] Sean M. Fiil-Flynn et al., Legal reform to enhance global text and data mining research, 378 Science 951 (2022).

[2] “Text and data mining” (“TDM”) describes any application of a computational process to materials to derive data from or about those materials. TDM can be used to help train computer applications to engage in machine learning or artificial intelligence (“AI”), which applies additional analysis and processes to enable machines to dynamically “learn” new tasks for which they were not specifically programmed.

[3] See EvidenceFinder, https://bio.tools/evidence_finder (“Search tool under development in the UKPMC project. It locates sentences that present evidence in the text of research papers. Sentences from 1.5M papers in the UPMC corpus are indexed based on linguistic analysis and NER. Initial search results take the form of questions representing the most frequent types of the relevant evidence in the index. On selecting a question, the user may review details of the documents containing the evidence, including the indexed sentences.”).

[4] ASReview, https://asreview.nl (“ASReview uses state-of-the-art active learning techniques to solve one of the most interesting challenges in screening large amounts of text”); see also Van de Schoot et al., An Open Source Machine Learning Framework for Efficient and Transparent Systematic Reviews, 3(2) Nature Machine Intelligence, 125 (2021).

[5] Covidence, https://www.covidence.org (last visited Nov. 2, 2022).

[6] DistillerSR, https://www.evidencepartners.com (last visited Nov. 2, 2022); see also Gerald Gartlehner et al., Assessing the accuracy of machine-assisted abstract screening with DistillerAI: a user study, 8 Systematic Rev. 277 (2019).

[7] JBI SUMARI, https://sumari.jbi.global/ (last visited Nov. 2, 2022).

[8] See Melissa Kahili-Heede & K. J. Hillgren, Colandr, 109(3) J. Med. Libr. Ass’n. (2021).

[9] Rayyan, https://www.rayyan.ai; see Rayyan for Systematic Reviews, McGill Library https://libraryguides.mcgill.ca/rayyan/home (“a free web-tool (Beta) designed to help researchers working on systematic reviews, scoping reviews and other knowledge synthesis projects, by dramatically speeding up the process of screening and selecting studies”); see also Hanna Olofsson et al., Can abstract screening workload be reduced using text mining? User experiences of the tool rayyan, 8(3) Research Synthesis Methods 275 (2017).

[10] RobotAnalyst, www.nactem.ac.uk/robotanalyst/ (last visited Nov. 1, 2022) (“designed for searching and screening reference collections obtained from literature database queries.”).

[11] See Rob Jelier et al., Anni 2.0: a multipurpose text-mining tool for the life sciences, 9 Genome Biology R96 (2008).

[12] David Westergaard et al., A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts, PLOS Computational Biology (2018).

[13] See Openminted Communications, TDM Stories: A Text & Data Miner Talks About Analysing The Recent Past, OpenMinted (Feb. 2, 2018) openminted.eu/tdm-stories-text-data-miner-talks-analysing-recent-past/; see also Openminted Communications, TDM Stories: How Zalando Links Languages With TDM, OpenMinted (Feb. 2, 2018) openminted.eu/tdm-stories-zalando-links-languages-tdm/; cf. Amanda Levendowski, How Copyright Law Can Fix Artificial Intelligence’s Implicit Bias Problem, 93 Wash. L. Rev. 579 (2018).

[14] Marc Weeber et al., Generating Hypotheses by Discovering Implicit Associations in the Literature: A Case Report of a Search for New Potential Therapeutic Uses for Thalidomide, 10(3) J. of the Amer. Med. Infor. Ass’n 252 (2003).

[15] Varun K. Gajendran et al., An application of bioinformatics and text mining to the discovery of novel genes related to bone biology, 40(5) Science Direct 1278 (2007).

[16] David Piedra et al., Text Mining and Medicine: Usefulness in Respiratory Diseases, 50(3) Archivos de Bronconeumología (2014).

[17] See Marc Prosser, How AI Helped Predict the Coronavirus Outbreak Before It Happened, Singularity Hub (Feb. 5, 2020) https://singularityhub.com/2020/02/05/how-ai-helped-predict-the-coronavirus-outbreak-before-it-happened/; Corey Stieg, How this Canadian Start-Up Spotted Coronavirus Before Everyone Else Knew About it, Make It CNBC (Mar. 3, 2020) (describing how BlueDot discovered the path of a spreading virus by combining various datasets into a machine learning program); see also Jingxian You et al., Using text mining to track outbreak trends in global surveillance of emerging diseases: ProMED-mail, 184(4) J. of Royal Statistical Soc. 1245 (2021).

[18] Cf. Thanh Thi Nguyen et al., Artificial Intelligence in the Battle against Coronavirus (COVID-19): A Survey and Future Research Directions, arXiv (Mar. 17, 2022) https://arxiv.org/abs/2008.07343; accord Shasha Teng et al., Using big data to understand the online ecology of COVID-19 vaccination hesitancy, 9 Humanit Soc Sci Commun 158 (2022) (analysing YouTube comments to find out the reasons for vaccine hesitancy behaviour); Rajiv Leventhal, Medical Home Network Uses AI to Identify High-Risk COVID-19 Patients, Medical Home Network (Mar. 19, 2020) https://medicalhomenetwork.org/press/medical-home-network-uses-ai-to-identify-high-risk-covid-19-patients.

[19] See Computational predictions of protein structures associated with COVID-19, DeepMind https://www.deepmind.com/open-source/computational-predictions-of-protein-structures-associated-with-covid-19 (last visited Oct. 12, 2022, 9:30 PM); Will Knight, Researchers Will Deploy AI to Better Understand Coronavirus, Wired (Mar. 17, 2020) https://www.wired.com/story/researchers-deploy-ai-better-understand-coronavirus/; Hao Lv, Application of artificial intelligence and machine learning for COVID-19 drug discovery and vaccine design, 22(6) Brief Bioinform (2021); A. S. Albahri et al., Role of biological Data Mining and Machine Learning Techniques in Detecting and Diagnosing the Novel Coronavirus (COVID-19): A Systematic Review, 44(7) J. Med. Syst. (2020); Xu Li et al., Network Bioinformatics Analysis Provides Insight into Drug Repurposing for COVID-19, Med Drug Discov. (2021); Xu Li et al., Network Bioinformatics Analysis Provides Insight into Drug Repurposing for COVID-2019 https://www.preprints.org/manuscript/202003.0286/v1 (analysing “the genome sequence of SARS-CoV-2 and identified SARS as the closest disease, based on genome similarity between both causal viruses, followed by MERS and other human coronavirus diseases”); see also Feixiong Cheng et al., Network-based approach to prediction and population-based validation of in silico drug repurposing, Nature (Jul. 12, 2018) https://www.nature.com/articles/s41467-018-05116-5; Jiansong Fang et al., Network-based translation of GWAS findings to pathobiology and drug repurposing for Alzheimer’s disease, MedRxiv (Jan. 18, 2020); Feixiong Cheng et al., Network-based prediction of drug combinations, Nature (Mar. 13, 2019) https://www.nature.com/articles/s41467-019-09186-x; Rajiv Leventhal, Medical Home Network Uses AI to Identify High-Risk COVID-19 Patients, Medical Home Network (Mar. 19, 2020) https://medicalhomenetwork.org/press/medical-home-network-uses-ai-to-identify-high-risk-covid-19-patients; Jonas Degrave et al., Magnetic control of tokamak plasmas through deep reinforcement learning, Nature (Feb. 16, 2022) https://www.nature.com/articles/s41586-021-04301-9; David Piedra et al., Text Mining and Medicine: Usefulness in Respiratory Diseases, 50(3) Archivos de Bronconeumología 113 (2014); Hisham Al-Mubaid & Rajit K. Singh, A New Text Mining Approach for Finding Protein-to-Disease Associations, 1(3) Amer. J. of Biochemistry & Biotechnology 145 (2005) (“identifying different molecules involved in Huntington’s Disease and several other life-threatening or debilitating conditions.”); Don R. Swanson, Fish Oil, Raynaud’s Syndrome, and Undiscovered Public Knowledge, 30 Persp. Biology & Med. 7 (1986) (identifying the beneficial effect of fish oil for patients suffering from Raynaud’s Disease using a new method of literature-based knowledge discovery).

[20] Wellcome Trust, Intellectual Property Office: Consultation on Copyright Response by the Wellcome Trust (Mar. 2012) https://wellcome.org/sites/default/files/wtvm054838.pdf.

[21] See, e.g., FANDANGO, https://fandango-project.eu (“[it] aim[s] . . . to aggregate and verify different typologies of news data, media sources, social media, open data, so as to detect fake news and provide a more efficient and verified communication for all European citizens.”); CORDIS, Fake news detection in social networks using geometric deep learning (last visited Oct. 20, 2022) https://cordis.europa.eu/project/id/812672; Elizabeth Gibney, The scientist who spots fake videos, Nature (Oct. 6, 2017) https://www.nature.com/articles/nature.2017.22784; Tom Cassauwers, Can artificial intelligence help end fake news?, Horizon (Apr. 15, 2019) https://ec.europa.eu/research-and-innovation/en/horizon-magazine/can-artificial-intelligence-help-end-fake-news; Jia Xue et al., Harnessing big data for social justice: An exploration of violence against women-related conversations on Twitter, 1(3) Human Behavior and Emerging Technologies 269 (2019).

[22] See William Shiao & Evangelos E. Papalexakis, KI2TE: Knowledge-Infused InterpreTable Embeddings for COVID-19 Misinformation Detection, 1st International Workshop on Knowledge Graphs for Online Discourse Analysis, KnOD 2021 (Apr. 14, 2021) https://madlab.cs.ucr.edu/papers/Knod2021_paper_7.pdf; Holly Ober, Data mining tools combat COVID-19 misinformation and identify symptoms, UC Riverside News (Aug. 19, 2021) https://news.ucr.edu/articles/2021/08/19/data-mining-tools-combat-covid-19-misinformation-and-identify-symptom; Xuehua Han et al., Using social media to mine and analyze public opinion related to COVID-19 in China, 17(8) Int. J. Environ. Res. Public Health 2788 (2020); Shasha Teng et al., Using big data to understand the online ecology of COVID-19 vaccination hesitancy, Nature (May 6, 2022) https://www.nature.com/articles/s41599-022-01185-6.

[23] Masakhane MT: Decolonise Science, Masakhane, https://www.masakhane.io/ongoing-projects/masakhane-mt-decolonise-science (last visited Dec. 2, 2022).

[24] Vukosi Marivate et al., Investigating an Approach for Low Resource Language Dataset Creation, Curation and Classification: Setswana and Sepedi, Euro. Language Resources Ass’n. (ELRA), 15 (2020); Rubungo Andre Niyongabo et al., KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi, International Committee on Computational Linguistics 5507 (2020).

[25] Ted Underwood et al., The Transformation of Gender in English-Language Fiction, 3(2) J. of Cul. Analytics (2018).

[26] Prediction models for determining academic risk, Proeva, https://proeva.udelar.edu.uy/modelos-de-prediccion-para-la-determinacion-de-riesgo-academico/ (last visited Dec. 2, 2022).