
Who has the ability to take language data from African communities and develop an AI tool? Anyone with the capability can do that. And you’ll find that there are certain jurisdictions that have more capability than those in Africa. And so, what will happen? They’ll develop tools faster than those in Africa. …So it means that African developers are relegated to data collectors. And what happens to the communities who have lent their voices?
Dr MELISSA OMINO, STRATHMORE UNIVERSITY
With the increasing prominence of AI in all sectors of our economy and society, access to training data has become an important topic for practitioners and policy makers. In the Global North, a small number of large corporations with deep pockets have gained a head start in AI development, using training data from all over the world. But what about the creators and the communities whose creative works and languages are being used to train AI models? Shouldn’t they also derive some benefit? And what about AI developers in Africa and the Global South, who often struggle to gain access to training data?
In an effort to level the playing field and ensure that AI supports the public interest, legal experts and practitioners in the Global South are developing new tools and protocols that aim to tackle these questions. One approach is to come up with new licenses for datasets. In a pathbreaking initiative, lawyers at Strathmore University in Nairobi have teamed up with their counterparts at the University of Pretoria to develop the NOODL license. NOODL is a tiered license, building on Creative Commons, but with preferential terms for developers in Africa and the Global South. It also opens the door to recognition and a flow of benefits for creators and communities. NOODL was inspired by researchers using African language works to develop Natural Language Processing systems, for purposes such as translation and language preservation.
In this presentation, Dr Melissa Omino, the Head of the Centre for Intellectual Property and Information Technology Law (CIPIT) at Strathmore University in Nairobi, Kenya, talks about the NOODL license.
This presentation was originally delivered at the Conference on Copyright and the Public Interest in Africa and the Global South, in Johannesburg in February 2025.
The full video of the presentation is available here.
Licensing African Datasets to Ensure Support for Research and AI in the Global South
Dr Melissa Omino
Introduction [Ben Cashdan]: We have Dr Melissa Omino from CIPIT at Strathmore University in Nairobi to talk a little bit about a piece of work they’re doing to ensure that the doors are not closed, that there is some opportunity to go on doing AI and research in Africa, but without necessarily throwing the doors open for everybody to do everything with all our stuff. Tell us a little bit about that.
[Dr Melissa Omino] Well, I really like that introduction. Yes, that was the thinking behind it. It’s also fitting that I’m sitting next to Vukosi [Marivate, Professor of Computer Science at the University of Pretoria], because Vukosi had a great influence on why the license exists. You’ve heard him talking about Masakhane and the language data that they needed.
At the previous ReCreate conference, we talked about the JW300 dataset – I hope you all know about that. If you don’t, this is a plug for the ReCreate YouTube channel, so you can go and look at that story. That’s a Masakhane story.
Background: The JW300 Dataset
To make sure that we’re all together in the room, I’ll give you a short synopsis about the JW300 dataset. Vukosi, you can jump in if I get something wrong. Essentially, Masakhane, as a group of African AI developers, were conducting text data mining online for African languages so that they could build AI tools that solve African problems.
We just had a wonderful example right now about the weather in Zulu, things like that. That’s what they wanted to cater for, and the solutions they wanted to create. They went ahead and found [that there was] very minimal data available online for African problem solving in African languages.
But they did find one useful resource: the Jehovah’s Witnesses website, which had a lot of African-language text because they had translated the Bible into different African languages. They were utilizing this in what was called the JW300 dataset. However, somehow, I don’t know how, you guys thought about copyright.
They thought about copyright after the text data mining. They asked, hey, can we actually use this dataset? That’s how they approached it. The first thing we did was look at the website.
Copyright notices excluding text and data mining
Most websites have a copyright notice, which lets you know what you can and can’t do with the copyright material presented on the website. The copyright notice on the Jehovah’s Witnesses website specifically excluded text data mining of the data that was there. We went back to Masakhane and said, sorry, you can’t use all this great work that you’ve collected.
You can’t use it because it belongs to the Jehovah’s Witnesses organization, an American company registered in Pennsylvania. They asked us, how is it that these are African languages from different parts of Africa, yet the copyright belongs to an American company and we cannot use the language? I said, well, that’s how the law works. And so they abandoned the JW300 dataset.
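[A note for practitioners: alongside human-readable copyright notices, some sites now publish machine-readable text and data mining opt-outs. The sketch below assumes a TDMRep-style reservation file – a W3C Community Group proposal – served at /.well-known/tdmrep.json; the file location, field names, and the simple path matching are assumptions for illustration. A site that states its restriction only in a human-readable notice, as in the JW300 story, would not be caught by a check like this, so the website’s terms still have to be read.]

```python
# Minimal sketch: check a site's machine-readable text-and-data-mining (TDM)
# opt-out before crawling. Assumes a TDMRep-style policy file (a W3C
# Community Group proposal) at /.well-known/tdmrep.json; absence of the file
# is NOT permission -- the human-readable copyright notice still governs.
import json
import urllib.request


def tdm_reserved(site: str, path: str = "/") -> bool:
    """Return True if the site declares a TDM reservation covering `path`."""
    url = f"https://{site}/.well-known/tdmrep.json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            policies = json.load(resp)  # expected: a JSON array of policy objects
    except Exception:
        return False  # no machine-readable policy found; read the site's terms manually
    for policy in policies:
        # Simplistic prefix match; the real protocol allows wildcard patterns.
        if path.startswith(policy.get("location", "/")):
            return policy.get("tdm-reservation") == 1
    return False


if tdm_reserved("example.org", "/languages/"):
    print("TDM rights reserved; seek a licence before mining.")
```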
This created a new avenue of research, because Masakhane did not give up. They became innovative and decided to collect their own language datasets. And Masakhane is not alone: Kencorpus is doing the same, collecting its own language datasets.
Building a Corpus of African Language Data
But where do you get African languages from? People. You go to the people to collect the language, right? If you’re lucky, you can find a text that has the language, but not all African languages will have texts. Your first source would be the communities that speak the African languages, right? And you need funding, because collecting language is expensive – Vukosi can confirm. He’s collecting 3,000 hours of language recordings, and his budget for that is crazy.
So you collect the language. You go to the community and record them, however you choose to do that. Copyright experts will tell you that the minute you make the recording, you have created a copyright work, and whoever made the recording owns the copyright, okay? But what if you’re funded? If somebody has funded you to collect the data, what happens? Funding comes with terms and conditions as to what you can do with the money you have been offered to collect the data.
And many of these terms and conditions in funding agreements from various funders – we’re not going to call anybody out – require that the data be CC0, which is a Creative Commons public domain dedication. Basically, it means that whatever you have collected belongs to the public domain. That sounds good at first glance, right? It means everybody can use it. A ‘yay’ for open data.
Data extraction by the Global North
But who develops, or who has the ability to take that data and develop a tool? Anyone with the capability can do that. And you’ll find that there are certain jurisdictions that have more capability than those in Africa. And so, what will happen? They’ll develop tools faster than those in Africa. They’ll market them at a faster rate.
And usually when I say marketing, I mean they target big-ticket operators such as governments, because governments would have a use for language data tools, right? Think about public service delivery. So it means that African developers are relegated to being data collectors, and they cannot develop. But at least the developers get some sort of attribution.
We know who collected [the language data]: African AI developers collected it. But what happens to the communities, who have literally lent their voices in the form of the language? What happens to them? That’s where the NOODL license comes in.
The NOODL license: a tiered approach
It is part of the solution – part of it, because you have to start somewhere. The NOODL license, or the Nwolite Obodo license, is basically a tiered license framework, because we’re not trying to reinvent the wheel. And we’re not trying to say that there’s a major problem with open licenses or the Creative Commons licenses. We like the Creative Commons licenses. But we’re saying that context requires a different type of license.
Or a different type of nuance to be applied when thinking about African language datasets. If you go to our website, licensingAfricandatasets.com, you’ll find a toolkit to help you, whether you’re a licensor or a licensee dealing with African language datasets. There we started from the very beginning by asking ourselves: what are the pre-licensing guidelines? Because those exist.
I’ve mentioned just one example: if the dataset was collected with funding, then there’s already a prescribed copyright regime on it, according to the grant agreement. That’s another problem we need to solve, but we might not be able to solve it here.
The second thing to think about is: who else was involved in this process other than the AI developer? I’ve mentioned people – that is, a community. Sometimes linguists are also involved; we have what are called language collaborators. I’m not an expert, but I’m happy Vukosi is here. All of this feeds into which tier of the license you can select. The actual license, the Nwolite Obodo license, is also available there.
It still has a lot of legalese in it, but we’ve tried to explain what it does in the pre-licensing guidelines. Essentially, it is tiered in the sense that the terms depend on who wants to use the data. So, if you are another African AI developer, you can get the data on a Creative Commons Zero license.
If you are from a developing country – let’s say you’re from Brazil – and you want to use data that is licensed under this regime, then you can use it under a Creative Commons license. If you are a multi-million dollar tech company that wants to use the data, then you need to negotiate with the AI developers who collected it. And the license also ensures that the community gets a benefit from that negotiation.
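[To make the tiering concrete, here is a minimal sketch of that decision logic. The category names and the terms below are illustrative assumptions paraphrased from the talk, not the legal text of the Nwolite Obodo license, which defines the tiers precisely.]

```python
# Illustrative sketch of the tiered NOODL logic described in the talk.
# The categories and terms paraphrase the speaker; they are not the
# license's legal definitions.
from enum import Enum


class User(Enum):
    AFRICAN_DEVELOPER = "African AI developer"
    GLOBAL_SOUTH_DEVELOPER = "developer in another developing country"
    LARGE_TECH_COMPANY = "multi-million dollar tech company"


def noodl_terms(user: User) -> str:
    """Map a user category to the licensing terms described in the talk."""
    if user is User.AFRICAN_DEVELOPER:
        return "CC0 terms: free to use"
    if user is User.GLOBAL_SOUTH_DEVELOPER:
        return "Creative Commons terms"
    # Large commercial users must negotiate, and the license ensures the
    # community shares in the benefits of that negotiation.
    return "negotiated terms with the data collectors, with community benefit sharing"


print(noodl_terms(User.LARGE_TECH_COMPANY))
```

[The design point is that openness is the default for those closest to the data, with obligations scaling up with the user’s resources.]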
Leveling the playing field
[Ben] So, what are the criteria? It’s where you are, how many bucks you’ve got, how rich you are. And also what you’re using it for.
So, this may be the first license of its kind. Is there another one like this? It’s like the first license that is partly Creative Commons, because it’s Creative Commons for some people, but it is tiered and discriminates. Some users get it, and some don’t.
[Melissa] I wouldn’t want to say that it discriminates. I think it actually tries to make the playing field fair, while also thinking about the public interest in terms of access and benefit sharing for the community. Because essentially communities are just seen as sources of data, which is why I get really irritated when people say data is the new oil.
Because oil exists in a piece of land and you can go and drill it and you don’t have to talk to anybody, maybe other than a government or the landowner. But language is culture and that’s one of the principles that the license is based on. And culture belongs to community.
And you can’t treat culture as a commodity. You have to really think in a nuanced way about culture. And I think we’ll have a deeper discussion on that in Cape Town.
[Ben] Some of the lawyers in the room – the other lawyers, Melissa’s a lawyer – may want to comment on whether you can distinguish (I won’t use ‘discriminate’) between users in the way that you are doing. Can you use geographical determinants to do that? Do you want to come in on that straight away?
[Melissa] I mean, why can’t you? I’d like to hear why you can’t. I think there’s been a lot of debate around looking at intellectual property as part of the property regime or proprietary assets, thinking about human rights, thinking about the public interest.
It’s very interesting that people will be proponents of data coming from the least developed world being free and open. But when we look at what’s happening in AI right now – if you look at DeepSeek, which China launched recently, and at the reaction from the Global North – it’s really interesting. They’re up in arms about their proprietary assets.
But when it comes to this side, we’re not meant to be that aggressive about our proprietary assets. And maybe we shouldn’t be. Maybe we should think about it as a different regime.
The bottom line, and what this license is trying to say, is that there’s community, and that’s what Nwolite Obodo means. It means developing the community, raising the community, building community. It means that language does not exist in a vacuum.
It is not something that is free for use, floating in the air. It is preserved and curated by communities. And if somebody collects it and ends up creating something that has monetary value, then the community it was obtained from must receive some sort of benefit.
And we can decide whether we’ll be capitalists about it, or we’ll use a different method to redress what has been extracted from the community, but that has to be done.
[Ben] I’m going to call it positive, progressive protectionism: protecting the ability of communities to benefit, and of Africa to protect its languages, culture, and their use.
[Melissa] Yes, and also to understand that this is the way that Africa is taking space within the AI development race. We might not build models, because we don’t have GPUs and all those things Vukosi knows about that are needed for that, but we have the data, which is actually quite valuable. And we need to recognize how we can leverage that data so that we are on par with others in the AI race.