Publications

The novel coronavirus has forced the world to interact with data visualizations in order to make individual-level decisions that sometimes have grave consequences. Against this backdrop, the lack of statistical literacy among the general public, as well as among organizations responsible for sharing accurate, clear, and timely information with the public, has led to widespread (mis)representations and (mis)interpretations. In this article, we showcase examples of how data related to the COVID-19 pandemic has been (mis)represented in the media and by governmental agencies and discuss plausible reasons why it has been (mis)represented. We then build on these examples to draw connections to how they could be used to enhance statistics teaching and learning, especially in secondary and introductory tertiary statistics and quantitative reasoning coursework.

The scale of China's state-owned media operations in Africa has grown significantly since the early 2000s. Previous research on the impact of increased Sino-African mediated engagements has been inconclusive. Some researchers hold that public opinion toward China in African nations has been improving because of the increased media presence. Others argue that the impact is rather limited, particularly when it comes to how African media cover China-related stories. This article contributes to this debate by exploring the extent to which news media in 30 African countries relied on Chinese news sources to cover China and the COVID-19 outbreak during the first half of 2020. By computationally analyzing a corpus of 500,000 written news stories, this paper shows that, compared to other major global players (e.g., Reuters, AFP), content distributed by Chinese media (e.g., Xinhua, China Daily) is much less likely to be used by African news organizations, in both English- and French-speaking countries. The analysis also reveals a gap between the prevailing themes in Chinese and African media's coverage of the pandemic. The implications of these findings for the subfield of Sino-African media relations and for the study of global news flows are discussed.
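
As a loose illustration of one way such reliance could be quantified, the snippet below counts explicit agency attributions in article text. The agency list and regex patterns are assumptions for demonstration, not the paper's documented method.

```python
# Illustrative sketch: count explicit news-agency attributions in a corpus.
# Agency names and patterns are assumptions, not the paper's actual method.
import re
from collections import Counter

AGENCIES = ["Xinhua", "China Daily", "Reuters", "AFP"]
PATTERNS = {name: re.compile(rf"\b{re.escape(name)}\b") for name in AGENCIES}

def count_attributions(articles):
    """Count how many articles mention each agency at least once."""
    counts = Counter()
    for text in articles:
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts

corpus = [
    "(Reuters) - Confirmed cases rose sharply across the region...",
    "BEIJING, May 4 (Xinhua) -- Health officials announced...",
]
print(count_attributions(corpus))  # Counter({'Reuters': 1, 'Xinhua': 1})
```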

Popular computational catalyst design strategies rely on the identification of reactivity descriptors, which can be used along with Brønsted-Evans-Polanyi (BEP) and scaling relations as input to a microkinetic model (MKM) to make predictions for activity or selectivity trends. The main benefit of this approach is the inherent dimensionality reduction of the large material space to just a few catalyst descriptors. At the same time, it is well documented that a small set of descriptors is insufficient to capture the intricacies and complexities of a real catalytic system. The inclusion of coverage effects through lateral adsorbate-adsorbate interactions can narrow the gap between simplified descriptor predictions and real systems, but mean-field MKMs cannot properly account for local coverage effects. This shortcoming of the mean-field approximation can be rectified by switching to a lattice-based kinetic Monte Carlo (kMC) method that uses a cluster-expansion representation of adsorbate-adsorbate lateral interactions.

Using the prototypical CO oxidation reaction as an example, we critically evaluate the benefits of kMC over MKM in terms of trend predictions and computational cost when using only a small set of input parameters. After confirming that, in the absence of lateral interactions, the kMC and MKM approaches yield identical trends and mechanistic information, we observed substantial differences between the two kinetic models when lateral interactions were introduced. The mean-field implementation applies coverage corrections directly to the descriptors, causing an artificial overprediction of the activity of strongly binding metals. In contrast, the cluster-expansion-based kMC implementation can differentiate among the highly active metals, but it is very sensitive to the set of included interaction parameters. Considering that computational screening relies on a minimal set of descriptors, for which the MKM makes reasonable trend predictions at a computational cost roughly three orders of magnitude lower than kMC, the MKM approach provides the better entry point for computational catalyst design.
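
For readers unfamiliar with mean-field MKMs, the toy script below integrates a three-step CO oxidation model (CO adsorption/desorption, dissociative O2 adsorption, surface reaction) to steady state. All rate constants are invented placeholders; none of the paper's descriptor, BEP/scaling, or coverage-correction machinery is included.

```python
# Toy mean-field MKM for CO oxidation, integrated to steady state.
# Rate constants are invented placeholders for illustration only.
from scipy.integrate import solve_ivp

k_ads_CO, k_des_CO = 1e3, 1e1   # CO adsorption / desorption (1/s), assumed
k_ads_O2 = 1e2                  # dissociative O2 adsorption (1/s), assumed
k_rxn = 1e4                     # CO* + O* -> CO2 surface step (1/s), assumed

def rhs(t, theta):
    th_CO, th_O = theta
    th_free = max(1.0 - th_CO - th_O, 0.0)      # mean-field free-site fraction
    r_rxn = k_rxn * th_CO * th_O                # rate from average coverages
    d_CO = k_ads_CO * th_free - k_des_CO * th_CO - r_rxn
    d_O = 2.0 * k_ads_O2 * th_free**2 - r_rxn   # O2 needs two free sites
    return [d_CO, d_O]

sol = solve_ivp(rhs, (0.0, 100.0), [0.0, 0.0], method="LSODA")
th_CO, th_O = sol.y[:, -1]
print(f"CO*: {th_CO:.3f}  O*: {th_O:.3f}  TOF: {k_rxn * th_CO * th_O:.3e} 1/s")
```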

Mapping biological processes in brain tissues requires piecing together numerous histological observations of multiple tissue samples. We present a direct method that generates readouts for a comprehensive panel of biomarkers from serial whole-brain slices, characterizing all major brain cell types at scales ranging from subcellular compartments and individual cells to local multicellular niches and whole-brain regions within each slice. We use iterative cycles of optimized 10-plex immunostaining with 10-color epifluorescence imaging to accumulate highly enriched image datasets from individual whole-brain slices, from which seamless signal-corrected mosaics are reconstructed. Specific fluorescent signals of interest are isolated computationally, rejecting autofluorescence, imaging noise, cross-channel bleedthrough, and cross-labeling. Reliable large-scale cell detection and segmentation are achieved using deep neural networks. Cell phenotyping is performed by analyzing unique biomarker combinations over appropriate subcellular compartments. This approach can accelerate preclinical drug evaluation and system-level brain histology studies by simultaneously profiling multiple biological processes in their native anatomical context.
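
As a schematic of the final phenotyping step only, a rule-based assignment from per-cell biomarker intensities might look like the sketch below. Marker names and thresholds are invented for illustration; the paper's actual procedure is not reproduced here.

```python
# Toy rule-based cell phenotyping from per-cell biomarker intensities.
# Marker names and thresholds are illustrative assumptions.
import numpy as np

MARKERS = ["NeuN", "GFAP", "Iba1"]       # neuron / astrocyte / microglia
THRESHOLDS = np.array([0.5, 0.5, 0.5])   # assumed normalized cutoffs

def phenotype(cell_intensities):
    """Assign each cell the type of its strongest above-threshold marker."""
    labels = []
    for row in cell_intensities:
        positive = row >= THRESHOLDS
        if not positive.any():
            labels.append("unclassified")
        else:
            idx = int(np.argmax(np.where(positive, row, -np.inf)))
            labels.append(MARKERS[idx])
    return labels

cells = np.array([[0.9, 0.1, 0.2],   # NeuN-high -> neuron-like
                  [0.2, 0.8, 0.1],   # GFAP-high -> astrocyte-like
                  [0.1, 0.2, 0.3]])  # all below threshold
print(phenotype(cells))  # ['NeuN', 'GFAP', 'unclassified']
```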

This dataset includes a corpus of 200,000+ news articles published by 600 African news organizations between December 4, 2020 and January 3, 2021. The texts have been pre-processed (punctuation and English stopwords have been removed; features have been lowercased, lemmatized, and POS-tagged) and stored in commonly used formats for text mining and computational text analysis. Users are advised to read the documentation for an explanation of the data collection process.
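
For reference, a pipeline covering the listed preprocessing steps (lowercasing, punctuation/stopword removal, lemmatization, POS tagging) could look like the following spaCy sketch; the dataset itself may have been built with different tooling.

```python
# Illustrative preprocessing pipeline for the steps described above.
# spaCy is used here for demonstration purposes only.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed

def preprocess(text):
    """Return (lemma, POS) pairs with punctuation and stopwords removed."""
    doc = nlp(text.lower())
    return [(token.lemma_, token.pos_) for token in doc
            if not (token.is_punct or token.is_stop or token.is_space)]

print(preprocess("The texts have been pre-processed and stored."))
```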

The kidney-biopsy-based diagnosis of Lupus Nephritis (LN) is characterized by low inter-observer agreement, with misdiagnosis being associated with increased patient morbidity and mortality. Although various Computer-Aided Diagnosis (CAD) systems have been developed for other nephrohistopathological applications, little has been done to accurately classify kidneys based on their kidney-level Lupus Glomerulonephritis (LGN) scores. The successful implementation of CAD systems has also been hindered by the diagnosing physician's perceived classifier strengths and weaknesses, which has been shown to have a negative effect on patient outcomes. We propose an Uncertainty-Guided Bayesian Classification (UGBC) scheme designed to accurately classify control, class I/II, and class III/IV LGN (3-class) at both the glomerular-level classification task (26,634 segmented glomerulus images) and the kidney-level classification task (87 MRL/lpr mouse kidney sections). Data annotation was performed using a high-throughput, bulk labeling scheme designed to take advantage of deep neural networks' (DNNs) resistance to label noise. Our augmented UGBC scheme achieved a 94.5% weighted glomerular-level accuracy and a 96.6% weighted kidney-level accuracy, improving upon the standard Convolutional Neural Network (CNN) architecture by 11.8% and 3.5%, respectively.
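
The paper's UGBC scheme is not reproduced here; as a generic illustration of how a per-image uncertainty signal can be obtained for uncertainty-guided classification, the sketch below uses Monte Carlo dropout with an assumed stand-in network and the 3-class setup.

```python
# Generic MC-dropout uncertainty sketch (not the paper's UGBC scheme).
# The network below is a stand-in; shapes and classes are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64, 128), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 3))          # 3 classes: control, I/II, III/IV

def mc_predict(x, n_samples=30):
    """Mean softmax probabilities and predictive entropy via MC dropout."""
    model.train()               # keep dropout stochastic at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    mean = probs.mean(dim=0)
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
    return mean, entropy

x = torch.randn(4, 1, 64, 64)   # dummy batch of glomerulus crops
mean, uncertainty = mc_predict(x)
print(uncertainty)              # higher entropy -> flag for expert review
```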

Douglas, V. A., Gao, W., Fontenot, E., & Malone, A. (2021). Beyond the numbers: Building a data information literacy program for undergraduate instruction. In J. Bauder (Ed.), Teaching Critical Thinking with Numbers: Data Literacy and the Framework for Information Literacy for Higher Education (pp. 39-58). Chicago, IL: ALA Editions.

Deep reinforcement learning (DRL) augments the reinforcement learning framework, which learns a sequence of actions that maximizes the expected reward, with the representative power of deep neural networks. Recent works have demonstrated the great potential of DRL in medicine and healthcare. This paper presents a literature review of DRL in medical imaging. We start with a comprehensive tutorial on DRL, including the latest model-free and model-based algorithms. We then cover existing DRL applications for medical imaging, which are roughly divided into three main categories: (i) parametric medical image analysis tasks including landmark detection, object/lesion detection, registration, and view plane localization; (ii) solving optimization tasks including hyperparameter tuning, selecting augmentation strategies, and neural architecture search; and (iii) miscellaneous applications including surgical gesture segmentation, personalized mobile health intervention, and computational model personalization. The paper concludes with discussions of future perspectives.
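
As a pointer to the core machinery the review's tutorial covers, here is a generic tabular Q-learning loop on a toy chain environment; DRL replaces the table with a neural network. This is a textbook illustration, not taken from the paper.

```python
# Minimal tabular Q-learning on a toy chain (generic textbook example).
import random

N_STATES, GOAL = 6, 5           # states 0..5; reaching state 5 ends episode
ACTIONS = (-1, +1)              # move left / right along the chain
alpha, gamma, eps = 0.5, 0.9, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(500):
    s = 0
    while s != GOAL:
        if random.random() < eps:                 # epsilon-greedy exploration
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] >= Q[s][1] else 1
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Temporal-difference update toward the bootstrapped target
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print([round(max(q), 2) for q in Q])  # values grow toward the goal state
```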

The concept of a connected world using the Internet of Things (IoT) has already gained pace during this decade. Efficient hardware and high-throughput networks have made it possible to connect billions of devices that collect and transmit usable information. The benefit of IoT devices is that they enable automation; however, a significant amount of energy is required for billions of connected devices to communicate with each other. Unless managed, this energy requirement can be one of the barriers to the complete implementation of IoT systems. This paper presents an energy management system for IoT devices, considering both hardware and software aspects. Energy transparency is achieved by modelling the energy consumed during sensing, processing, and communication. A multi-agent system is introduced to model the IoT devices and their energy consumption, and a genetic algorithm is used to optimize the parameters of the multi-agent system. Finally, simulation tools such as MATLAB Simulink and OpenModelica are used to test the system. The optimization results reveal substantial reductions in energy consumption when the decentralized intelligence of the multi-agent system is implemented.
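
As a generic sketch of the genetic-algorithm step described above, the snippet below evolves device parameters to minimize a modeled energy cost. The energy model is a made-up stand-in, not the paper's multi-agent simulation.

```python
# Genetic-algorithm sketch for minimizing a modeled energy cost.
# The cost function and parameter ranges are illustrative assumptions.
import random

def energy_cost(params):
    """Toy model: sensing + processing + communication energy terms."""
    sense_rate, cpu_freq, tx_power = params
    return (0.5 * sense_rate + 0.3 * cpu_freq**2 + 2.0 * tx_power
            + 10.0 / (sense_rate * tx_power))  # under-provisioning penalty

def crossover(a, b):
    return [random.choice(genes) for genes in zip(a, b)]

def mutate(p, scale=0.1):
    return [max(0.01, g + random.gauss(0.0, scale)) for g in p]

population = [[random.uniform(0.1, 2.0) for _ in range(3)] for _ in range(30)]
for _ in range(100):
    population.sort(key=energy_cost)
    elite = population[:10]                    # truncation selection
    offspring = [mutate(crossover(*random.sample(elite, 2)))
                 for _ in range(len(population) - len(elite))]
    population = elite + offspring

best = min(population, key=energy_cost)
print(f"best params: {best}, cost: {energy_cost(best):.3f}")
```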

Computer-aided diagnosis (CAD) systems must constantly cope with perpetual changes in data distribution caused by different sensing technologies, imaging protocols, and patient populations. Adapting these systems to new domains often requires significant amounts of labeled data for re-training, a process that is labor-intensive and time-consuming. We propose a memory-augmented capsule network for the rapid adaptation of CAD models to new domains. It consists of a capsule network that extracts feature embeddings from high-dimensional input and a memory-augmented task network meant to exploit its stored knowledge of the target domains. Our network is able to efficiently adapt to unseen domains using only a few annotated samples. We evaluate our method using a large-scale public lung nodule dataset (LUNA), coupled with our own collected lung nodule and incidental lung nodule datasets. When trained on the LUNA dataset, our network requires only 30 additional samples from our collected lung nodule and incidental lung nodule datasets to achieve clinically relevant performance (0.925 and 0.891 area under the receiver operating characteristic curve (AUROC), respectively). This result is equivalent to using two orders of magnitude less labeled training data while achieving the same performance. We further evaluate our method by introducing heavy noise, artifacts, and adversarial attacks. Under these severe conditions, our network's AUROC remains above 0.7 while the performance of state-of-the-art approaches drops to chance level.
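
The memory-augmented readout can be pictured as attention over stored key-value slots. The sketch below is a generic version of that idea under assumed shapes; the capsule feature extractor and the rest of the paper's architecture are not reproduced.

```python
# Generic key-value memory readout (not the paper's architecture).
# Slot counts, dimensions, and random contents are assumptions.
import torch
import torch.nn.functional as F

n_slots, d_key, n_classes = 64, 32, 2
memory_keys = torch.randn(n_slots, d_key)        # learned in practice
memory_values = torch.randn(n_slots, n_classes)  # e.g., class evidence

def memory_read(query):
    """Soft read: cosine-similarity attention over memory slots."""
    sims = F.cosine_similarity(query.unsqueeze(1),          # (B, 1, d)
                               memory_keys.unsqueeze(0),    # (1, S, d)
                               dim=-1)                      # -> (B, S)
    attn = torch.softmax(sims, dim=-1)
    return attn @ memory_values                             # (B, n_classes)

query = torch.randn(4, d_key)    # stand-in for capsule-derived embeddings
print(memory_read(query).shape)  # torch.Size([4, 2])
```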