Lack of Africa-specific datasets challenges AI in education

While there is a lot of local data, large and accessible African datasets are scarce, and local languages pose additional challenges. However, broadband connectivity, 5G, the internet of things and smart electrical grids are spreading and will make it possible to collect data in centralised spaces for powerful AI analytics in education in Africa.

Africa-specific datasets are collections of data obtained about physical environments in Africa. Data is vital in modelling and comprehending an environment effectively, and for predicting and planning purposes.

In this context, an environment is one within which social and economic activities take place, such as: education, agriculture, water, health, manufacturing, transport, logistics and climate.

Existing data stores

The likely stores for such data are the repositories of government ministries, global and local NGOs, universities and research institutions, and global information portals that, for instance, store satellite data on weather, agriculture, forests and aquatic environments.

However, do these datasets exist? Are they adequate? Are they accessible to the research and development community? Does their adequacy or inadequacy affect AI in education?

Repositories of government ministries

Most government ministries keep data that is relevant to their mandate. Further, national bureaus of statistics store a more diverse collection. However, what is publicly accessible is not the raw data, but summaries and trends that have been appropriately distilled from the raw data.

The summaries are often in the format of the health data of Statistics South Africa, which indicates the extent of each cause of death by year.

The summaries are also often inconsistent. For instance, the Kenyan agriculture ministry repository lists “Fish production by different water bodies between 2012 and 2016”, yet there is no mention of the succeeding block of years.

Although consistent summaries can be used to predict future trends to a limited extent, raw data is essential because it can be reused in various contexts.

Repositories of global NGOs

The likes of the World Bank, World Food Programme and World Health Organisation (WHO) conduct extensive studies globally, and generate massive amounts of data. In certain instances, raw data is available on their sites. For example, the WHO data library has an MS Excel file of raw health expenditure data for Cape Verde. However, most data on these sites takes the form of summaries and visualisations, such as the rainfall data for countries all over the world on the World Food Programme site.

Local NGO repositories

Data acquisition is a costly affair, and typical indigenously founded and managed NGOs in Africa may lack the resources to acquire their own data, even when they are already alert to its value. Their activities are, therefore, often based on insights and themes already developed by governments, research institutions and more global NGOs.

University and research institutions’ repositories

Apart from administrative data on a few university sites, research data is highly unlikely to be accessible within such repositories. Research activities are highly individual. Researchers in Africa make strenuous efforts to acquire their own data, which they are often not inclined to share.

In-depth data is clearly unavailable on the above-mentioned platforms. Where data is available, it is usually inadequate, inconsistent or difficult to navigate.

The following platforms, however, hold probably the most accessible and consistent data.

Global information portals

These portals, for instance Satellite Imaging Corporation, EOS Data Analytics and MSN Weather, mostly hold satellite imagery and weather and climate information.

However, their data is often remotely collected and focused on landscape or atmosphere. For more socially oriented environments, such as education, there is little value in that data.

The bottom line

There is a lot of local data on the African continent. However, most of these datasets were not collected for machine learning purposes and are likely to be in the wrong formats. Such data can be valuable for data mining, a process that attempts to opportunistically exploit any available patterns in any data.

However, data collection for a more focused machine learning objective must be deliberate, capturing all the essential independent variables that influence the specific variable that the researcher or developer wants to predict.
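As a minimal sketch of what such deliberate collection implies, the check below keeps only the records that carry every independent variable and the variable to be predicted. The variable names and records are hypothetical, chosen purely for illustration; a real project would derive them from its own research question.

```python
# Hypothetical variable names, for illustration only.
FEATURES = ["attendance_rate", "class_size", "language_of_instruction"]
TARGET = "end_of_term_score"


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that a record lacks or leaves empty."""
    required = FEATURES + [TARGET]
    return [name for name in required if record.get(name) is None]


def usable_for_training(records: list[dict]) -> list[dict]:
    """Keep only records that carry every feature and the target."""
    return [r for r in records if not missing_fields(r)]


# Hypothetical records: the second was not collected with this objective in mind.
records = [
    {
        "attendance_rate": 0.9,
        "class_size": 50,
        "language_of_instruction": "Kiswahili",
        "end_of_term_score": 62,
    },
    {"attendance_rate": 0.8, "class_size": 48},
]
```

A record gathered for another purpose, like the second one here, is filtered out because it lacks the language of instruction and the score to be predicted.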

Projects such as the Lacuna Fund, which aims to put the benefits of machine learning in the hands of data scientists, researchers and social entrepreneurs around the world, have attempted to build deliberate and labelled local datasets for African machine learning applications in agriculture, health and natural languages, among others.

However, these are nascent efforts and a lot still needs to be done. Yet, it gets even trickier in the education sector.

AI in education

AI in the learning process is a most powerful concept. However, since learning is a more social and psychological environment than statistical work, the data to model the learning environment and to optimise it is more subtle and difficult to collect.

Most experts believe that the critical presence of teachers is irreplaceable, but AI could – and already does in some educational technologies – add significant value to the education process by automating the often lower level tasks, as follows:

AI to spare the teacher mundane and standard tasks

AI can drive efficiency and streamline administrative tasks – such as enrolment, admissions, assessment and grading, even for written content – to allow teachers the time and freedom to provide higher level competencies.

These are competencies at which humans perform better than machines, such as comprehension, talent identification and promotion, and adaptability. By allowing humans and machines to each keep to their best lanes, AI provides significant yield in education.

Individualised learning

Customising the educational experience to suit each individual student’s unique needs has rapidly become the holy grail of education. However, in a typical African class of 50 students, how can a single teacher or lecturer deliver that level of differentiation?

AI technologies, such as those by Content Technologies or Carnegie Learning, can already recognise a learner’s unique learning pattern or pace, offer just what the learner can handle at a particular time, and know when to move to the next level. Human cues such as hesitation, frowning or a satisfied expression are factors that AI-based computer vision can now recognise, to tell if a learner is struggling or comfortable.

Universal content access

In Africa, there is huge contention between foreign languages, such as English and French, and indigenous vernaculars as the medium of instruction in the lower grades. At what point do schools transition from one to the other?

This intricate balance apparently influences the ability of learners, later on, to consume instruction and conduct research in the foreign languages. Yet the case for vernacular for lower grades is that early grade learners struggling to consume instruction in the foreign language are likely to miss many of the formative concepts about the world in which they live.

AI based on natural language processing can globalise instructional content developed in any language by translating it into most vernaculars, making the said transition seamless. AI can also easily bridge visual or hearing impairment gaps by shifting content between text and speech.

The AI-based MS PowerPoint plug-in, Presentation Translator, can generate subtitles from an instructor’s speech in real time. This makes it possible to record lessons in the appropriate medium and language – a powerful tool for children from pastoral communities who consistently skip school due to constant movement.

Homework support assistance

In schools and universities, students must increasingly learn on their own. AI tutors can help learners refine their comprehension of content already taught, and assist them with homework and even exam preparation. With internet connectivity, they can expose the learner to global tools that offer diverse instructional methods or approaches.

The challenge of localised data

AI and machine learning feed on massive amounts of data. To enable AI to perform the above-mentioned sophisticated tasks, huge volumes of localised data are crucial.

Certainly, there have been many efforts to build corpora for various pairs of languages, including many African vernaculars against the main foreign languages of educational instruction.

However, to train an AI to perform, for instance, automated assessment and grading of written content, it is essential to expose it to a massive database not just of general natural language elements in the test language, but also of specific technical phrases that are prevalent in the test context. These phrases are then tagged against their technical meanings, a semantics-level exercise that digs deeper than the common natural language processing platforms.
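The tagging described above can be pictured with a toy example. The sketch below assumes a tiny, hypothetical lexicon for a biology test and uses naive substring matching; a real grading system would need a far larger lexicon in the test language, along with proper tokenisation and handling of word forms.

```python
# Hypothetical lexicon: technical phrases mapped to the meaning they signal.
PHRASE_TAGS = {
    "photosynthesis": "process:energy_conversion",
    "carbon dioxide": "reactant",
    "chlorophyll": "pigment",
}


def tag_answer(answer: str) -> dict[str, str]:
    """Return the technical phrases found in an answer, with their tags."""
    text = answer.lower()
    return {phrase: tag for phrase, tag in PHRASE_TAGS.items() if phrase in text}


def coverage(answer: str, expected_tags: set[str]) -> float:
    """Fraction of the expected technical meanings that the answer touches."""
    if not expected_tags:
        return 0.0
    found = set(tag_answer(answer).values())
    return len(found & expected_tags) / len(expected_tags)


expected = {"process:energy_conversion", "reactant", "pigment"}
answer = "Chlorophyll absorbs light and carbon dioxide is taken in."
```

Here `coverage(answer, expected)` reports that the answer touches two of the three expected meanings, the kind of signal an automated grader could build on.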

Even more difficult is for AI to comprehend each individual learner’s unique needs.

In this case, one would need to know the main points of performance variations between learners within that learning space. Do these variations occur in arithmetic, essay comprehension, or algebra?

These points of variation are likely to constitute the essential or causal variables in the dataset, and only localised data would illuminate them. More localised data would then be collected on each causal variable, to develop a model that diagnoses an individual learner’s unique needs.
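One simple way to look for such points of variation, sketched here under stated assumptions, is to rank subject areas by how widely learner scores spread within a class. The scores are hypothetical, and variance is only a rough first signal: it shows where learners differ, not why.

```python
from statistics import pvariance

# Hypothetical scores (out of 100) for four learners in one class.
scores = {
    "arithmetic": [70, 72, 68, 71],
    "essay comprehension": [40, 85, 55, 90],
    "algebra": [50, 60, 55, 65],
}


def rank_by_variation(scores_by_topic: dict[str, list[float]]) -> list[str]:
    """Order topics from the widest to the narrowest spread in learner scores."""
    spread = {topic: pvariance(vals) for topic, vals in scores_by_topic.items()}
    return sorted(spread, key=spread.get, reverse=True)


ranking = rank_by_variation(scores)
```

With these made-up numbers, essay comprehension shows the widest spread, so it would be the first candidate for more detailed, deliberate data collection.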

There has to be a very deliberate, sophisticated and well-designed process to collect such individual data, which hardly exists currently in any database on the continent.

Elaborate data collection infrastructure

Current technology, if used innovatively, holds great promise for the subtlety needed in data collection for AI in education. Some of the data about levels of individual learner comprehension, for any topic, could be harvested using cameras and pulse sensors, among others.

Still, confirmatory data – or data that is difficult to collect via sensors – can be obtained by interviewing the learner. Automatically administered interview questions can be posed at precise learning moments, even through phones and in vernacular, for learners who are monolingual and illiterate, especially in rural areas where equipment is unavailable.

Broadband connectivity, 5G, the internet of things and smart electrical grids are all steadily reaching the remotest corners of Africa. Such developments will make it possible to collect data in centralised spaces for powerful machine learning analytics in education.

Winston Ojenge is a senior research fellow in the STI, knowledge and society (STIKS) programme at the Africa Centre for Technology Studies (ACTS) in Nairobi, Kenya. He heads the digital economy programme within STIKS. Ojenge has a PhD in computer science from the Technical University of Kenya. His research interests are artificial intelligence, machine learning and the internet of things. He was the founder coordinator of the Innovation Lab at the Technical University of Kenya, and holds a patent for ‘TV Receiver Channel Consumption Monitoring Tool’, with four other patent applications under review at the Kenya Institute of Intellectual Property. He is currently the co-lead for the AI4D PhD Scholarships in AI – AI for development in Africa.