This Tuesday, 16 January, at 4 pm, the dataLearning seminars are hosting David Ifeoluwa Adelani from UCL (DeepMind Academic Fellow). His talk is titled: “Natural Language Processing for Under-resourced African Languages”.

Join online: https://zoom.us/j/95291664305?pwd=b0lXeXNxbnllcjdCK2ZBaUs2WENyQT09

Abstract

African languages are spoken by over a billion people but are underrepresented in natural language processing (NLP) research and development. Progress is impeded by the limited availability of annotated datasets and multilingual representation models, as well as a lack of understanding of the settings in which current methods are effective. These challenges are shared by many under-resourced languages in other regions of the world. In this talk, we describe the steps taken towards addressing some of these challenges. First, we describe the creation of the largest human-annotated named entity recognition (NER) dataset for 21 African languages through participatory research; we analyze the dataset and conduct an extensive empirical evaluation of state-of-the-art methods in both supervised and transfer learning settings. We also study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. Our results highlight the need for benchmark datasets and models that cover typologically diverse African languages. Finally, we describe the development of multilingual pre-trained language models (PLMs) for 20 widely spoken and comparatively well-resourced African languages through multilingual adaptive fine-tuning (MAFT), i.e., fine-tuning a multilingual PLM on multiple languages simultaneously. To further specialize the multilingual PLM, we remove, before MAFT, the vocabulary tokens in the embedding layer that correspond to non-African writing scripts, reducing the model size by around 50% and making PLMs more accessible to African research labs with fewer GPU and hardware resources. Our approach achieves state-of-the-art performance on several African language datasets spanning three NLP tasks: NER, news topic classification, and sentiment analysis.
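The vocabulary-trimming idea mentioned in the abstract can be illustrated with a toy sketch: given an embedding matrix and its vocabulary, keep only the rows for tokens in the target scripts, which shrinks the embedding layer (and hence the model) proportionally. The tiny vocabulary and the crude script check below are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def trim_embeddings(embeddings, vocab, keep_token):
    """Keep only the embedding rows whose token passes keep_token.

    Returns the smaller embedding matrix and the reduced vocabulary.
    """
    kept_indices = [i for i, tok in enumerate(vocab) if keep_token(tok)]
    new_vocab = [vocab[i] for i in kept_indices]
    new_matrix = embeddings[kept_indices]  # row selection via fancy indexing
    return new_matrix, new_vocab

def is_latin_or_special(tok):
    # Crude stand-in for a real script filter: keep tokens whose code
    # points fall below the CJK blocks. This heuristic is an assumption
    # for illustration only.
    return all(ord(c) < 0x2E80 for c in tok)

rng = np.random.default_rng(0)
vocab = ["<s>", "hello", "ẹkú", "你好", "日本語", "bawo"]
emb = rng.normal(size=(len(vocab), 4))  # toy 4-dimensional embeddings

small_emb, small_vocab = trim_embeddings(emb, vocab, is_latin_or_special)
print(small_vocab)      # CJK tokens dropped, Latin-script tokens kept
print(small_emb.shape)  # fewer rows than the original (6, 4) matrix
```

In a real PLM the embedding table dominates the parameter count for large multilingual vocabularies, which is why dropping unused-script rows can cut total model size by roughly half.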

About the speaker: David Ifeoluwa Adelani is a Research Fellow (DeepMind Academic Fellow) in the Department of Computer Science at University College London, and a member of the leadership of Masakhane – a grassroots organization whose mission is to strengthen and spur natural language processing (NLP) research in African languages, for Africans, by Africans. He was previously a PhD student in computer science at the Department of Language Science and Technology at Saarland University. His research focuses on NLP for under-resourced languages, especially African languages, multilingual representation learning, machine translation, and privacy in NLP.