This summer, Africa’s Voices was lucky to have two computational linguistics interns join our team. Gladys Tyen built on a web-app for crowdsourcing sentiment analysis, and Lis Kerr worked on a governance project with Well Told Story.
Here they share what they got up to, and the challenges and achievements along the way.
Gladys: Let’s teach a robot
Last year in the robot project, an online user interface was created to facilitate the process of gathering data for machine learning. This year, I focused on expanding the design of the website to incorporate more user-friendly features and to allow for usage across multiple datasets.
As before, we have a robot nicknamed Rob who is struggling to understand human sentiment. Over three stages called “chapters”, users label messages to help Rob distinguish between them. Each label or annotation that a user provides earns them a certain number of points, depending on how many previous users agreed with the annotation.
This year, I made a few changes to the website to improve the user experience and the quality of the data.
- I decided to simplify one of the chapters, so that the instructions can be more easily understood.
- I implemented a confidence meter for each answer the user provides, so they can let us know if they are unsure about it.
- Taking on user comments from last year, I also decided to give immediate feedback to users after each annotation, informing them of the number of users who agreed with their annotation, and the number of points they acquired as a result.
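The agreement-based points and the immediate feedback described above can be sketched roughly as follows. This is only an illustration: the post does not give the actual scoring formula, so the rule used here (a base award plus a bonus per agreeing annotator) and the names `score_annotation`, `feedback_message`, `base_points` and `bonus_per_agreement` are all assumptions.

```python
from collections import Counter


def score_annotation(label, previous_labels, base_points=1, bonus_per_agreement=1):
    """Return (agreeing_count, points) for a new label on one message.

    Assumed rule, not the site's real one: a fixed base award plus a
    bonus for every earlier annotator who chose the same label.
    """
    agreeing = Counter(previous_labels)[label]
    points = base_points + bonus_per_agreement * agreeing
    return agreeing, points


def feedback_message(label, previous_labels):
    """Build the feedback text shown straight after an annotation."""
    agreeing, points = score_annotation(label, previous_labels)
    return f"{agreeing} previous user(s) agreed with you. You earned {points} point(s)."


# Invented example: three earlier annotators labelled the same message.
previous = ["positive", "positive", "negative"]
print(feedback_message("positive", previous))
```

A rule like this rewards consensus, so the confidence meter is a useful complement: it lets a user flag an answer they are unsure about rather than guess at what others chose.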
In addition to the annotators’ online experience, I added a user interface for administrators at Africa’s Voices. They can choose to upload multiple datasets, so that the website can be used for multiple projects at the same time, and a range of customisation options have been added to tailor the user experience to each project.
Accompanying this change, the registration process has been carefully set up to adhere to AVF’s data protection policies. Administrators can restrict the registration process such that newly registered users must be manually approved by the Africa’s Voices team before accessing the chapters. Hence the messages, which can contain sensitive information revealing the author’s identity, will only be shown to people who are authorised to see them.
Lis: Kenyan youth attitudes to governance
I worked with Africa’s Voices on their partnership with Well Told Story (WTS) on a project investigating youth attitudes to governance during the Kenyan elections. I spent the final two weeks of my internship in Nairobi. This was a great way to collaborate more closely with the WTS team as well as the AVF Nairobi office, and to experience life in Kenya’s capital.
Machine learning to speed up message analysis
I began by adapting the existing Python code from the previous project with WTS on contraception to the new topic of governance. Given the broad nature of this topic, I developed different ways of breaking down the messages by theme.
First, I investigated various types of topic modelling and ended up creating a machine learning classifier that automatically filters every comment on the Well Told Story Facebook fan page (see example post below) and labels it as related or unrelated to governance, with accuracy above 92%. With this in place, the WTS team could gain quick insight into their fan engagement on governance issues at a broad scale.
This classifier was made possible by the linguistic and cultural expertise of annotators at Well Told Story who labelled a subset of the dataset for use in training of the algorithm; this human knowledge is crucial for Africa’s Voices’ tools and data analysis.
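To give a flavour of how annotated comments can train a related/unrelated filter, here is a minimal sketch. The post does not say which algorithm the project used, so the bag-of-words Naive Bayes below is a stand-in chosen for brevity, and the six training comments are invented toy data rather than real messages from the Facebook page.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a comment and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of comments
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training set standing in for the annotators' labels.
texts = [
    "who will you vote for in the election",
    "the county government must fix our roads",
    "youth should register to vote",
    "i love this comic so much",
    "when is the next episode out",
    "great music on the show today",
]
labels = ["related"] * 3 + ["unrelated"] * 3

model = NaiveBayes().fit(texts, labels)
print(model.predict("register and vote in the election"))
```

In a real setting, a figure such as the reported accuracy would be measured on annotated comments held back from training, not on the training set itself.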
Categorising messages by governance sub-themes
For those messages that were related to governance, I filtered them using regexes (regular expressions) based on a custom lexicon of English, Swahili, and Sheng words that I developed in correspondence with the WTS team. This led to each comment being categorised by a sub-theme – for example, whether it was discussing the elections themselves (the voting procedure, and particular candidates) or a more general theme such as infrastructure or youth engagement in politics.
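A lexicon-driven regex filter of this kind might look like the sketch below. The three sub-theme names and the handful of English and Swahili keywords are illustrative assumptions; the real lexicon was developed with the WTS team and is not reproduced in this post.

```python
import re

# Hypothetical mini-lexicon: sub-theme -> keywords (English and Swahili).
LEXICON = {
    "elections": ["vote", "voting", "election", "kura", "uchaguzi"],
    "infrastructure": ["road", "roads", "barabara", "maji", "stima"],
    "youth_engagement": ["youth", "vijana"],
}


def compile_lexicon(lexicon):
    """Turn each keyword list into one case-insensitive whole-word regex."""
    return {
        theme: re.compile(
            r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
            re.IGNORECASE,
        )
        for theme, words in lexicon.items()
    }


PATTERNS = compile_lexicon(LEXICON)


def sub_themes(comment):
    """Return every sub-theme whose pattern matches the comment."""
    return [theme for theme, pattern in PATTERNS.items() if pattern.search(comment)]


print(sub_themes("Vijana tupige kura!"))
```

Returning a list rather than a single label lets a comment that mentions both voting and young people count towards both sub-themes.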
Once each related comment is tagged in this way, the sub-theme can be combined with the timestamp from the Facebook API and sentiment scores to track how conversations about governance change over time. The WTS team can then use this insight to understand what their audience is saying and tailor their content to be as engaging as possible.
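As a rough sketch of that combination, the snippet below groups tagged comments by ISO week and sub-theme, then reports how many comments fell in each group and their mean sentiment. The record layout (`created_time`, `themes`, `sentiment`), the timestamp format and the sample values are assumptions made for illustration, not the project's actual data.

```python
from collections import defaultdict
from datetime import datetime


def weekly_theme_summary(comments):
    """Map ((iso_year, iso_week), theme) -> (comment_count, mean_sentiment)."""
    buckets = defaultdict(list)
    for comment in comments:
        # Assumed timestamp format, e.g. "2017-08-08T10:15:00+0000".
        ts = datetime.strptime(comment["created_time"], "%Y-%m-%dT%H:%M:%S%z")
        iso = ts.isocalendar()
        week = (iso[0], iso[1])
        for theme in comment["themes"]:
            buckets[(week, theme)].append(comment["sentiment"])
    return {
        key: (len(scores), sum(scores) / len(scores))
        for key, scores in buckets.items()
    }


# Invented sample records; sentiment is assumed to lie between -1 and 1.
sample = [
    {"created_time": "2017-08-01T09:00:00+0000", "themes": ["infrastructure"], "sentiment": -0.5},
    {"created_time": "2017-08-08T10:15:00+0000", "themes": ["elections"], "sentiment": 0.5},
    {"created_time": "2017-08-08T18:30:00+0000", "themes": ["elections"], "sentiment": -0.5},
]

for key, value in sorted(weekly_theme_summary(sample).items()):
    print(key, value)
```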
Tailored approaches for low-resource languages
To improve the accuracy of the machine learning scripts, I researched the morphosyntactic properties of Sheng as well as the different ways it is written, since all the data was in written form. After contacting Guy de Pauw at Textgain, I arranged for Africa’s Voices to use the Swahili Part-Of-Speech (POS) tagger demoed on the African Languages Technologies website (www.aflat.org/swatag) through the Textgain API (www.textgain.com/api).
By creating a bank of words alongside their POS tag (e.g. “nitawaste – Verb”) it is possible for machine learning algorithms to use the POS tag as a ‘feature’, giving more information about the distribution of Sheng words. This distribution is used alongside morphosyntactic analysis (breaking down the word into its parts – e.g. “nitawaste” → “ni-ta-waste” = “I will waste”) to predict sentiment polarity, and a ‘weight’ is assigned to each feature dependent on how reliable it is for predicting a given outcome.
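The sketch below shows one way the POS tag and the morpheme breakdown could be turned into weighted features. It is a simplified illustration and not the project's code: the prefix lists cover only a few common Swahili subject and tense markers, the word bank holds just the example from the text, and the weights are made-up numbers where a real model would learn them from annotated data.

```python
# Small, incomplete lists of Swahili verb prefixes (assumed for illustration).
SUBJECT_PREFIXES = ("ni", "tu", "wa", "u", "a", "m")
TENSE_MARKERS = ("na", "li", "ta", "me")

# Hypothetical word bank: word -> POS tag.
POS_BANK = {"nitawaste": "Verb"}


def segment_verb(word):
    """Split a verb into [subject, tense, stem], or return [word] if no split fits."""
    for subject in SUBJECT_PREFIXES:
        if word.startswith(subject):
            rest = word[len(subject):]
            for tense in TENSE_MARKERS:
                if rest.startswith(tense) and len(rest) > len(tense):
                    return [subject, tense, rest[len(tense):]]
    return [word]


def token_features(word):
    """Build a feature dictionary from the word, its POS tag and its morphemes."""
    word = word.lower()
    features = {f"word={word}": 1.0}
    pos = POS_BANK.get(word)
    if pos:
        features[f"pos={pos}"] = 1.0
    parts = segment_verb(word) if pos == "Verb" else [word]
    if len(parts) == 3:
        features[f"subject={parts[0]}"] = 1.0
        features[f"tense={parts[1]}"] = 1.0
        features[f"stem={parts[2]}"] = 1.0
    return features


def polarity_score(features, weights):
    """Weighted sum of features; a negative total suggests negative polarity."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Made-up weights standing in for values a trained model would assign.
weights = {"stem=waste": -1.5, "pos=Verb": 0.25}
print(segment_verb("nitawaste"))
print(polarity_score(token_features("nitawaste"), weights))
```

Splitting off the stem means a weight tied to “waste” can apply to every inflected form of the verb, which helps when each individual spelling is rare in the data.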
Overall, this means that Africa’s Voices can boost its capacity for understanding Swahili and Sheng by using computational techniques alongside the language-specific resources that Gladys and I developed.
We both had a great time over the summer. While there was plenty of debugging, we felt that we finished the internship with a much deeper understanding of natural language processing, website creation, and data management, and we are very grateful to the Africa’s Voices team for the great working environment and training opportunities they provided us with. We would also like to thank our supervisors, Andrew Caines and Russell Moore, at the University of Cambridge’s Computer Laboratory, for the training they gave us, from machine learning to Python and Java.