Hewlett Packard Enterprise Data Science Institute

2021-02-14

Corpus of African Digital News from 600 Websites Formatted for Text Mining / Computational Text Analysis [Data set].

This dataset includes a corpus of 200,000+ news articles published by 600 African news organizations between December 4, 2020 and January 3, 2021. The texts have been pre-processed (punctuation and English stopwords have been removed; features have been lowercased, lemmatized, and POS-tagged) and stored in commonly used formats for text mining and computational text analysis. Users are advised to read the documentation for an explanation of the data collection process.

Texas Data Repository.
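
For readers unfamiliar with the pre-processing named in the description, the following is a minimal sketch of three of the steps (lowercasing, punctuation removal, and English stopword removal) using only the Python standard library. It is not the pipeline used to build the corpus: the `preprocess` function, the tiny `STOPWORDS` set, and the sample sentence are all hypothetical, and lemmatization and POS tagging are left out because they require an NLP library.

```python
import string

# Tiny illustrative stopword list (an assumption; the list actually used
# for the corpus is not stated in this record).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    # Delete every ASCII punctuation character in one pass.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and keep only tokens that are not stopwords.
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]


# Hypothetical sample sentence, not taken from the corpus.
tokens = preprocess("The President spoke to reporters in Lagos, Nigeria.")
print(tokens)  # ['president', 'spoke', 'reporters', 'lagos', 'nigeria']
```

Because the distributed texts are already processed this way, applying the same steps again should be unnecessary; the sketch only shows what kind of tokens to expect.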