Gladys Tyen and Walter Myer are students of computational linguistics at the University of Cambridge, who joined Africa’s Voices for a summer internship. They developed machine learning techniques and a crowdsourcing web app that will help us to analyse text messages in dialects and slang like Sheng. Here, they explain their tasks, experiences, and successes as Africa’s Voices’ interns!
Over the course of the summer, we’ve had a remarkable opportunity to support the work of Africa’s Voices Foundation. Our project involved developing new tools to help analyse the sentiments expressed in over one million text messages from Well Told Story fans.
Well Told Story is a multi-media organisation that aims to promote positive social change among young Kenyans and Tanzanians by shifting attitudes and behaviours. In a monthly comic, Shujaaz, the recurring characters DJ Boyie and Maria Kim encounter and address a wide range of situations, many of them connected with relationships and contraception. Readers are invited to contact DJ Boyie by text message and social media to share and ask for information about contraception, relationships, and general advice of any kind. The 1.1 million messages received over the last five years are mostly written in Sheng, a slang spoken among young people in Kenya that combines elements of Swahili, English, and other local languages.
Teaching a computer to understand Sheng
The objective of our project was to design a computer program capable of taking contraception-related messages written in Sheng and outputting the attitude (positive or negative) towards contraception that they express. Such a program should speed up both the analysis of audience messages and the replies to them.
Because Sheng is a particularly young language, there were very few resources available to draw on for this analysis task. We therefore turned to machine learning techniques to help classify the messages.
In supervised machine learning, a classifier (our computer program) is trained on a manually labelled dataset. For example, say you have 10,000 customer feedback messages, and you want to assign each of them one of the labels ‘happy’, ‘angry’ or ‘disappointed’. In a supervised method, one might begin by manually labelling the first 500 messages, giving each message one of the three labels above. By looking at the words that each labelled message contains, a program can observe that particular words and combinations of words occur disproportionately in messages of a particular label. For example, ‘overjoyed’ may almost always occur in ‘happy’ messages, while ‘outraged’ might usually occur in ‘angry’ messages. The classifier thus learns which words are most strongly correlated with which label. This knowledge is then used to classify the remaining 9,500 messages, by breaking each new message down to its component features and selecting the label that fits the features best.
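To make the customer feedback example concrete, here is a minimal sketch of one common supervised method, a naive Bayes classifier over word counts. It is an illustration only: the post does not say which algorithm the project used, and the function names (`train`, `classify`) and the toy messages are made up for this example.

```python
import math
from collections import Counter, defaultdict

def train(labelled):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of messages
    vocab = set()
    for text, label in labelled:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label whose word statistics fit the message best."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior: how common the label is in the training data.
        score = math.log(label_counts[label] / total)
        n = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((word_counts[label][word] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy stand-in for the manually labelled messages.
examples = [
    ("i am overjoyed with the service", "happy"),
    ("overjoyed and grateful thank you", "happy"),
    ("i am outraged by the delay", "angry"),
    ("outraged this is unacceptable", "angry"),
    ("i expected better quite disappointed", "disappointed"),
    ("disappointed with the quality", "disappointed"),
]
model = train(examples)
print(classify("we are overjoyed", model))  # happy
```

Because ‘overjoyed’ appears only in the ‘happy’ training messages, a new message containing it is pulled towards that label, which is the correlation-learning step described above.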
In adopting this method, we needed to start by gathering data. We therefore divided our work into two broad tasks: (a) design a web app to crowdsource the manually labelled data we need; and (b) design a program to learn from this data and automatically classify unlabelled messages.
(a) Meet our robot: Crowdsourcing web app
We thought carefully about how to design our crowdsourcing web app in order to ensure maximum user engagement. To enliven the relatively dry tasks involved, we decided to construct a narrative that would keep users interested. In our narrative, a robot (nicknamed ‘Rob’) lands in Nairobi with the goal of learning human sentiment, with each task helping him along his way.
As users complete tasks, an accompanying image of the robot updates to show their progress. For instance, in the final sentiment annotation task, LEDs on the robot’s body light up in the shape of a heart as he ‘learns sentiment’.
We tried to incentivise users to think carefully about their answers by allowing them to earn points when they gave ‘accurate’ responses. This measure of accuracy comes from comparing the feedback given on a particular message by different annotators (users), meaning that users check one another. Our aim was to boost the quality of the feedback, as well as to provide a competitive framework on the user side, with a leaderboard displaying the top ten users’ scores and rankings.
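One simple way to turn annotator agreement into points is majority voting, sketched below. This is an assumption for illustration, not a description of the web app's actual scoring rule: the function names, the one-point reward, and the choice to skip ties are all hypothetical.

```python
from collections import Counter, defaultdict

def score_annotators(annotations):
    """annotations: list of (user, message_id, label). Returns user -> points."""
    by_message = defaultdict(list)
    points = Counter()
    for user, message_id, label in annotations:
        by_message[message_id].append((user, label))
        points[user] += 0  # make sure every user appears, even with no points

    for votes in by_message.values():
        tally = Counter(label for _, label in votes)
        top_label, top_count = tally.most_common(1)[0]
        # Assumption: no points when there is one annotator or a tied vote.
        if len(votes) < 2 or list(tally.values()).count(top_count) > 1:
            continue
        for user, label in votes:
            if label == top_label:
                points[user] += 1
    return points

def leaderboard(points, size=10):
    """Top users by points, highest first."""
    return points.most_common(size)

# Made-up annotations for three messages.
annotations = [
    ("amina", "m1", "positive"), ("brian", "m1", "positive"), ("carol", "m1", "negative"),
    ("amina", "m2", "negative"), ("brian", "m2", "negative"), ("carol", "m2", "negative"),
    ("amina", "m3", "positive"), ("carol", "m3", "negative"),
]
points = score_annotators(annotations)
```

Here message `m3` earns nobody a point because its two annotators disagree, so there is no majority to compare against.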
While the web app was designed with the Well Told Story project in mind, its structures are flexible and should prove generalisable to future crowdsourcing projects of a similar nature.
(b) Training our classifier
With a set of training data comprising about 1,300 labelled messages, we began to train our classifier. To feed data into a machine learning algorithm, we had to extract linguistic features from the messages and express them as numbers (frequencies). The simplest type of feature is just the word itself, and the number extracted represents how many times that word appears.
As is the nature of texting, the messages we had were very messy. Simply counting how many times “good” appears does not capture how often the word is really used if “good” is also spelled, say, “gud”. To maximise the linguistic information extracted from each message, we obtained two additional pieces of information for each word: a “standard” form of the word (i.e. what you would look up in a dictionary), and its part of speech (e.g. noun, verb, adjective, etc.).
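As a small sketch of word-count features, the snippet below turns each message into a list of numbers, one per vocabulary word. The helper names and the two example messages are invented for illustration.

```python
from collections import Counter

def bag_of_words(message):
    """Count how many times each word appears in a message."""
    return Counter(message.lower().split())

def to_vector(message, vocabulary):
    """One number per vocabulary word: its frequency in the message."""
    counts = bag_of_words(message)
    return [counts[word] for word in vocabulary]

messages = ["good good advice", "thanks for the advice"]
# A fixed, sorted vocabulary gives every message the same feature order.
vocabulary = sorted({word for m in messages for word in m.lower().split()})
vectors = [to_vector(m, vocabulary) for m in messages]

print(vocabulary)  # ['advice', 'for', 'good', 'thanks', 'the']
print(vectors)     # [[1, 0, 2, 0, 0], [1, 1, 0, 1, 1]]
```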
Due to the lack of linguistic resources and standardised spelling for Sheng, we decided to group all words according to orthographic similarity, on the assumption that the most common form in a group would be the “standard” form. The more similar two spellings were, the more likely the words were to be grouped together. This algorithm was customised to account for common spelling variations in Sheng, such as interchanging “s” and “x”. It also detects and ignores prefixes and suffixes in English and Swahili: for example, “running” would be grouped with “run”.
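The grouping idea can be sketched roughly as follows. This is not the project's algorithm: it assumes a greedy grouping, uses Python's `difflib.SequenceMatcher` as the similarity measure, treats “x” as a variant of “s”, strips only a few English suffixes, and uses an arbitrary threshold of 0.55. The tokens are illustrative.

```python
from collections import Counter
from difflib import SequenceMatcher

SUFFIXES = ("ing", "ed", "s")  # illustrative English suffixes only

def normalise(word):
    """Undo an assumed s/x variation and strip one known suffix."""
    word = word.lower().replace("x", "s")
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def similar(a, b, threshold=0.55):
    """True if the normalised spellings are close enough."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold

def group_words(tokens, threshold=0.55):
    """Map every spelling to the most frequent spelling in its group."""
    counts = Counter(tokens)
    groups = []  # each group lists spellings, most frequent first
    for word, _ in counts.most_common():
        for group in groups:
            if similar(word, group[0], threshold):
                group.append(word)
                break
        else:
            groups.append([word])
    return {variant: group[0] for group in groups for variant in group}

tokens = ["good"] * 5 + ["run"] * 4 + ["sasa"] * 3 + ["gud"] * 2 + ["running", "xaxa"]
standard = group_words(tokens)
print(standard["gud"], standard["running"], standard["xaxa"])  # good run sasa
```

Visiting words from most to least frequent means the first member of each group is its most common spelling, which matches the assumption that the most common form is the “standard” one. A real system would need a carefully tuned threshold to avoid merging unrelated short words.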
For the parts of speech, we used existing resources such as dictionaries and language corpora to identify the part of speech of each word. For example, “and” is classified as a conjunction. This allowed us to identify certain linguistic constructions that might only occur under certain semantic circumstances.
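A dictionary-lookup tagger of the kind described here might look like the sketch below. The mini-lexicon, the tag names, and the use of tag pairs as features are hypothetical; the resources actually used in the project are not listed in this post.

```python
from collections import Counter

# Hypothetical mini-lexicon standing in for real dictionaries and corpora.
LEXICON = {
    "and": "CONJ",
    "i": "PRON",
    "want": "VERB",
    "good": "ADJ",
    "advice": "NOUN",
}

def tag(message):
    """Pair each word with its part of speech, or 'UNK' if not in the lexicon."""
    return [(word, LEXICON.get(word, "UNK")) for word in message.lower().split()]

def pos_bigrams(message):
    """Count adjacent part-of-speech pairs, a simple 'construction' feature."""
    tags = [pos for _, pos in tag(message)]
    return Counter(zip(tags, tags[1:]))

print(tag("I want good advice"))
# [('i', 'PRON'), ('want', 'VERB'), ('good', 'ADJ'), ('advice', 'NOUN')]
```

Counting tag pairs such as adjective followed by noun is one simple way to capture a linguistic construction as a number that a classifier can use.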
The resulting classifier operates at about 83% accuracy, and will continue to improve as more annotated data is obtained through the web app. The algorithm can be applied to future messages that Well Told Story receives, and with some modifications, it can be adapted to other topics such as agriculture and religion.
While the project that we have started is far from over, we hope that our work will add real value to Africa’s Voices’ future projects. For our part, we feel privileged to have been able to contribute our skills to this exciting project. We also count ourselves lucky to have been involved in such a friendly, welcoming office, and have loved getting to know all the members of the Africa’s Voices team. Thank you to all of you for your friendship and support, and we hope we’ll be back for further collaboration in the future!