Drawing meaningful insights from thousands of text messages in low-resource languages is difficult. To tackle this challenge, we’ve collaborated with the University of Cambridge to develop tools that streamline our data analysis approach. Here, Professor Alan Blackwell from Cambridge’s Computer Lab explains the inner workings of one such tool: CODA.
CODA is a software tool created to help researchers at Africa’s Voices Foundation (AVF) quickly analyse large volumes of short texts. It’s currently being put to use in AVF’s ongoing partnership with UNICEF Somalia. For this project, more than 44,000 people have so far participated in interactive radio shows on a range of health and gender-related topics, leading to a rich dataset of over 250,000 text messages in the Somali language. Participants come from a wide variety of backgrounds and from every district in Somalia, including remote regions and low-literacy populations. As a result, the responses vary in dialect, length, and legibility.
These thousands of messages need to be translated, analysed, and coded (categorised) by AVF’s Somali-speaking researchers: skilled analysts who must constantly make decisions about which code each message belongs to. These skills are rare, and the researchers’ time is valuable.
The design philosophy for CODA was to combine principles of interaction design and artificial intelligence, to create a tool that would allow AVF researchers to use their time as effectively as possible, and get the greatest benefit from their analytic decisions.
How does CODA work?
When AVF researchers use CODA, they see a table containing one text message on each row, and columns alongside to record the code that the researcher decides to assign to each message. The codes were developed in a prior stage of processing, when thematic analysis was applied to draw out the key codes (these can be thought of as themes or categories) occurring in the dataset.
To start with, the whole table of messages is white. As the researcher decides on the code for each message, that row changes to the colour chosen for that particular code. When starting out, the researcher can assign codes by choosing from a menu, and can add new codes to expand the coding scheme as necessary.
Once the codes become more familiar, similar judgments can be made with a single keystroke, advancing through the table with only seconds needed for each judgment. As the researcher makes progress with the dataset, the table is progressively coloured in, so that the proportion of different codes can be reviewed.
An overview of the whole dataset can be seen on the left-hand side of the table, as a scrollbar with a coloured line for each message. The researcher can sort the messages by code at any time, in order to review the messages code by code, see the proportions of different code colours, or carry on with the coding work by going to the white, uncoded regions.
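As a rough illustration of the table described above, the following Python sketch models rows, code colours, code proportions, and sorting by code. It is not CODA's actual implementation: the names `Row`, `CODE_COLOURS`, `code_proportions`, and `sort_by_code`, the example codes, and the colour values are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical coding scheme: one display colour per code.
CODE_COLOURS = {"health": "#1f77b4", "gender": "#ff7f0e"}
UNCODED_COLOUR = "#ffffff"  # rows start out white


@dataclass
class Row:
    text: str
    code: Optional[str] = None  # None means the row is not yet coded

    @property
    def colour(self) -> str:
        # Uncoded rows (or codes without a colour) are shown in white.
        return CODE_COLOURS.get(self.code, UNCODED_COLOUR)


def code_proportions(rows):
    """Share of rows per code; the key None stands for uncoded rows."""
    if not rows:
        return {}
    counts = Counter(row.code for row in rows)
    return {code: n / len(rows) for code, n in counts.items()}


def sort_by_code(rows):
    """Group rows by code, placing the uncoded (white) rows at the end."""
    return sorted(rows, key=lambda r: (r.code is None, r.code or ""))


rows = [Row("msg a", "health"), Row("msg b"),
        Row("msg c", "gender"), Row("msg d", "health")]
```

With these four rows, `code_proportions(rows)` reports half of the table as "health", and `sort_by_code(rows)` leaves the single uncoded row at the bottom, where the remaining work is.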
Future developments: Automatic coding using AI
CODA is already proving to be a fast and convenient tool for manual coding, but we are working on optimising it further by using artificial intelligence and natural language processing techniques to accelerate the process. Every time the researcher makes a decision on a message, this decision will be sent to a machine learning algorithm that collects statistical information about the kinds of words and phrases that tend to be associated with that code.
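To make the idea of collecting word statistics per code more concrete, here is a minimal Python sketch of a model that is updated one decision at a time. The article does not say which algorithm CODA uses; this example assumes a simple multinomial naive Bayes classifier with add-one smoothing, and the class name `WordCountModel`, its methods, and the example messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


class WordCountModel:
    """Illustrative naive Bayes model, updated after each coding decision."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # code -> word -> count
        self.code_counts = Counter()             # code -> number of messages
        self.vocabulary = set()

    def update(self, text, code):
        """Record one researcher decision: this text was given this code."""
        words = text.lower().split()
        self.word_counts[code].update(words)
        self.code_counts[code] += 1
        self.vocabulary.update(words)

    def predict(self, text):
        """Return (code, confidence), or (None, 0.0) before any training."""
        if not self.code_counts:
            return None, 0.0
        words = text.lower().split()
        total_messages = sum(self.code_counts.values())
        vocab_size = len(self.vocabulary) + 1  # extra slot for unseen words
        log_scores = {}
        for code, n_messages in self.code_counts.items():
            total_words = sum(self.word_counts[code].values())
            score = math.log(n_messages / total_messages)
            for word in words:
                count = self.word_counts[code][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            log_scores[code] = score
        # Convert log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


model = WordCountModel()
model.update("water is clean", "health")
model.update("girls go school", "education")
predicted_code, confidence = model.predict("clean water")
```

Real messages would be in Somali and far noisier than these English placeholders, so splitting on whitespace is only a stand-in for proper language processing.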
CODA will continually review all the rows (messages) that are not yet coloured, to see whether it is possible to predict what the code should be, according to the statistical model. The colour of a predicted row will change automatically. It will not take the full colour used for an expert decision by the researcher, but a faded version of that colour, showing that it might be possible to automate some decisions.
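One way to picture the faded colours is as a blend between white and the code's full colour. The Python sketch below assumes the blend is weighted by the prediction's confidence, which is a guess at the mapping rather than a description of CODA itself; the function name `fade_colour` is hypothetical.

```python
def fade_colour(hex_colour, confidence):
    """Blend a '#rrggbb' colour towards white.

    confidence 0.0 gives white, 1.0 gives the full code colour.
    The linear blend is an assumption made for this illustration.
    """
    confidence = max(0.0, min(1.0, confidence))  # clamp to [0, 1]
    channels = [int(hex_colour[i:i + 2], 16) for i in (1, 3, 5)]
    blended = [round(255 + (c - 255) * confidence) for c in channels]
    return "#{:02x}{:02x}{:02x}".format(*blended)


faint_red = fade_colour("#ff0000", 0.2)   # low confidence: close to white
strong_red = fade_colour("#ff0000", 1.0)  # full confidence: full colour
```

Under this mapping, a row that the model is unsure about stays almost white, while a confident prediction approaches the colour of an expert decision.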
Researchers using CODA will be able to sort the coloured messages according to the statistical confidence of these predictions, checking how accurate the model is, and making faster judgments as it becomes more reliable. The system will always let the expert researcher be the final judge, so there is no danger of artificial intelligence being used for crude or invalid judgments. But each time the researcher confirms a prediction, we will get data that can be used to improve the reliability and confidence of the statistical model.
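The review workflow described here can be sketched as a queue of predicted but unconfirmed messages sorted by confidence, with the researcher's decision always taking precedence. This Python example is an illustration under assumed names (`CodedMessage`, `review_order`, `confirm`) and made-up data, not CODA's real interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodedMessage:
    text: str
    predicted_code: Optional[str] = None  # suggestion from the model
    confidence: float = 0.0               # model confidence in [0, 1]
    confirmed_code: Optional[str] = None  # set only by the researcher


def review_order(messages, most_confident_first=True):
    """Predicted but unconfirmed messages, sorted by model confidence."""
    pending = [m for m in messages
               if m.confirmed_code is None and m.predicted_code is not None]
    return sorted(pending, key=lambda m: m.confidence,
                  reverse=most_confident_first)


def confirm(message, researcher_code):
    """Record the expert's decision; return True if the model agreed."""
    message.confirmed_code = researcher_code
    return message.predicted_code == researcher_code


messages = [
    CodedMessage("msg a", "health", 0.9),
    CodedMessage("msg b", "gender", 0.4),
    CodedMessage("msg c"),                            # no prediction yet
    CodedMessage("msg d", "health", 0.7, "health"),   # already confirmed
]
queue = review_order(messages)
model_was_right = confirm(queue[0], "health")
```

Calling `review_order(messages, most_confident_first=False)` would instead surface the least certain predictions first, which suits a researcher who wants to spend her time on the hardest cases. The agreement flag returned by `confirm` is the kind of feedback that could be passed back to the statistical model.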
Over time, the code colours will become less faded, and the researcher can make the most efficient use of her time by focusing on the cases that are closest to white, where the message is ambiguous or the model is incomplete. Active development is still in progress, with a focus on enhancing the intelligence of the machine learning algorithms, providing additional assistance, and increasing the critical confidence levels as far as possible, while still retaining expert judgment and control as our overriding design principle.
The primary developer of CODA is Ana Semrov, with natural language processing research from Guy Emerson. Interaction design is by Ana Semrov and Alan Blackwell, and the CODA system architecture has been designed by Luke Church. Their work has been funded by the Global Challenges Research Fund.
The CODA application is open source software, meaning that its capabilities will be able to grow over time. The development and maintenance of the open source component of the software is managed by Africa’s Voices Foundation and is supported by a Shuttleworth Foundation Flash Grant, awarded to us in May 2017.
Learn more about Africa’s Voices method
CODA is just one tool in our multidisciplinary research and analysis toolkit. Watch the short animation below to get an overview of the various steps that together make up our data analysis approach. CODA contributes to the step of manual and automatic labelling of participants’ messages.