# Publications

Computer Aided Diagnosis (CAD) systems for renal histopathology applications aim to understand and replicate nephropathologists’ assessments of individual morphological compartments (e.g. glomeruli) to render case-level histological diagnoses. Deep neural networks (DNNs) hold great promise in addressing the poor intra- and interobserver agreement between pathologists. That said, the generalization ability of DNNs depends heavily on the quality and quantity of training labels. Current “consensus” labeling strategies require multiple pathologists to evaluate every compartment unit over thousands of crops, resulting in enormous annotation costs. Additionally, these techniques fail to address the underlying reproducibility issues observed across various diagnostic feature assessment tasks. To address both limitations, we introduce MorphSet, an end-to-end architecture inspired by Set Transformers, which maps the combined encoded representations of Monte Carlo (MC) sampled glomerular compartment crops to Whole Slide Image (WSI) predictions on a case basis, without the need for expensive fine-grained morphological feature labels. To evaluate performance, we use a kidney transplant Antibody Mediated Rejection (AMR) dataset and show that we achieve 98.9% case-level accuracy, outperforming the consensus-label baseline. Finally, we generate a visualization of prediction confidence derived from our MC evaluation experiments, which provides physicians with valuable feedback.
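The core idea above — pooling encoded representations of randomly sampled compartment crops into a single permutation-invariant case-level prediction — can be sketched in a few lines. This is a hypothetical toy illustration, not the MorphSet architecture: the encoder is a stand-in for a real CNN, the "head" is a trivial average, and all names and dimensions are invented for illustration.

```python
import random

def encode_crop(crop, dim=8):
    """Stand-in encoder mapping a crop identifier to a feature vector.
    (A real system would use a trained CNN; values here are pseudo-random.)"""
    rng = random.Random(hash(crop) & 0xFFFFFFFF)
    return [rng.uniform(-1, 1) for _ in range(dim)]

def predict_case(crops, n_samples=5, sample_size=3, seed=0):
    """Monte Carlo sample subsets of crops, pool each subset's encodings by
    averaging (a permutation-invariant, set-style aggregation), and combine
    the per-sample scores into one case-level score. The spread across MC
    samples serves as a crude confidence proxy."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        subset = rng.sample(crops, min(sample_size, len(crops)))
        feats = [encode_crop(c) for c in subset]
        # Mean-pool feature vectors element-wise across the sampled set.
        pooled = [sum(col) / len(col) for col in zip(*feats)]
        scores.append(sum(pooled) / len(pooled))  # toy linear head
    mean = sum(scores) / len(scores)
    spread = max(scores) - min(scores)
    return mean, spread
```

Because the pooling step is a symmetric function of its inputs, the prediction does not depend on crop ordering — the property that set-based architectures exploit.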

Motivated by the increasing need to understand the distributed algorithmic foundations of large-scale graph computations, we study some fundamental graph problems in a message-passing model for distributed computing where k ≥ 2 machines jointly perform computations on graphs with n nodes (typically, n ≫ k). The input graph is assumed to be initially randomly partitioned among the k machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication rounds of the computation.

Our main contribution is the General Lower Bound Theorem, a theorem that can be used to show non-trivial lower bounds on the round complexity of distributed large-scale data computations. This result is established via an information-theoretic approach that relates the round complexity to the minimal amount of information required by machines to solve the problem. Our approach is generic, and this theorem can be used in a “cookbook” fashion to show distributed lower bounds for several problems, including non-graph problems. We present two applications by showing (almost) tight lower bounds on the round complexity of two fundamental graph problems, namely, PageRank computation and triangle enumeration. These applications show that our approach can yield lower bounds for problems where the application of communication complexity techniques is not obvious or gives only weak bounds, including and especially under a stochastic partition of the input.

We study several fundamental problems in the k-machine model, a message-passing model for large-scale distributed computations where k ≥ 2 machines jointly perform computations on a large input of size N (typically, N ≫ k). The input is initially partitioned (randomly or in a balanced fashion) among the k machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication rounds of the computation. Our main result is a general technique for designing efficient deterministic distributed algorithms in the k-machine model using PRAM algorithms. Our technique works by efficiently simulating PRAM algorithms in the k-machine model in a deterministic way. This simulation allows us to arrive at new algorithms in the k-machine model for some problems for which no efficient k-machine algorithms were previously known, and to improve on existing results in the k-machine model for other problems. While our simulation allows us to obtain k-machine algorithms for any problem with a known PRAM algorithm, we mainly focus on graph problems. For an input graph on n vertices and m edges, we obtain Õ(m/k²)-round algorithms for various graph problems such as r-connectivity for r = 1, 2, 3, 4, minimum spanning tree (MST), maximal independent set (MIS), (Δ + 1)-coloring, maximal matching, ear decomposition, and spanners, under the assumption that the edges of the input graph are partitioned (randomly, or in an arbitrary but balanced fashion) among the k machines. For problems such as connectivity and MST, the above bound is (essentially) the best possible (up to logarithmic factors). Our simulation technique also allows us to obtain the first known efficient deterministic algorithms in the k-machine model for other problems with known deterministic PRAM algorithms.

Evidence from several international studies indicates that criminal activity and involvement with the criminal justice system tend to be concentrated in families. Comparatively little work has studied factors that exacerbate or lessen intergenerational associations in crime. Knowledge about factors that are (1) malleable and (2) capable of lessening or exacerbating the intergenerational cycle of criminal behavior is highly relevant for prevention efforts and public policy-makers. This paper studied potential moderators of intergenerational associations in crime (from late childhood through adolescence and early adulthood). We found that late childhood cognitive function and early adult employment history, substance use, and romantic partner’s antisocial behavior moderated intergenerational associations in crime for a sample of at-risk men and their parents. The identified moderators may be used as selection criteria or targeted in prevention and treatment efforts aimed at reducing such associations.

Structural deformation monitoring is crucial for the identification of early signs of tunnelling-induced damage to adjacent structures and for the improvement of current damage assessment procedures. Satellite multi-temporal interferometric synthetic aperture radar (MT-InSAR) techniques enable measurement of building displacements over time with millimetre-scale accuracy. Compared to traditional ground-based monitoring, MT-InSAR can yield denser and cheaper building observations, representing a cost-effective monitoring tool. However, without integrating MT-InSAR techniques and structural assessment, the potential of InSAR monitoring cannot be fully exploited. This integration is particularly demanding for large construction projects, where big datasets need to be processed. In this paper, we present a new automated methodology that integrates MT-InSAR-based building deformations and damage assessment procedures to evaluate settlement-induced damage to buildings adjacent to tunnel excavations. The developed methodology was applied to the buildings along an 8-km segment of the Crossrail tunnel route in London, using COSMO-SkyMed MT-InSAR data from 2011 to 2015. The methodology enabled the identification of damage levels for 858 buildings along the Crossrail twin tunnels, providing an unprecedented number of high-quality field observations of building response to settlements. The proposed methodology can be used to improve current damage assessment procedures, for the benefit of future underground excavation projects in urban areas.

China’s state-owned media operations in Africa have grown significantly in scale since the early 2000s. Previous research on the impact of increased Sino-African mediated engagements has been inconclusive. Some researchers hold that public opinion toward China in African nations has been improving because of the increased media presence. Others argue that the impact is rather limited, particularly when it comes to affecting how African media cover China-related stories. This article contributes to this debate by exploring the extent to which news media in 30 African countries relied on Chinese news sources to cover China and the COVID-19 outbreak during the first half of 2020. By computationally analyzing a corpus of 500,000 written news stories, this paper shows that, compared to other major global players (e.g. Reuters, AFP), content distributed by Chinese media (e.g. Xinhua, *China Daily*) is much less likely to be used by African news organizations, in both English- and French-speaking countries. The analysis also reveals a gap in the prevailing themes in Chinese and African media’s coverage of the pandemic. The implications of these findings for the sub-field of Sino-African media relations and the study of global news flows are discussed.

Popular computational catalyst design strategies rely on the identification of reactivity descriptors, which can be used along with Brønsted-Evans-Polanyi (BEP) and scaling relations as input to a microkinetic model (MKM) to make predictions for activity or selectivity trends. The main benefit of this approach is the inherent dimensionality reduction of the large material space to just a few catalyst descriptors. At the same time, it is well documented that a small set of descriptors is insufficient to capture the intricacies and complexities of a real catalytic system. The inclusion of coverage effects through lateral adsorbate-adsorbate interactions can narrow the gap between simplified descriptor predictions and real systems, but mean-field MKMs cannot properly account for local coverage effects. This shortcoming of the mean-field approximation can be rectified by switching to a lattice-based kinetic Monte Carlo (kMC) method using a cluster-expansion representation of adsorbate-adsorbate lateral interactions.

Using the prototypical CO oxidation reaction as an example, we critically evaluate the benefits of kMC over MKM in terms of trend predictions and computational cost when using only a small set of input parameters. After confirming that in the absence of lateral interactions the kMC and MKM approaches yield identical trends and mechanistic information, we observed substantial differences between the two kinetic models when lateral interactions were introduced. The mean-field implementation applies coverage corrections directly to the descriptors, causing an artificial overprediction of the activity of strongly binding metals. In contrast, the cluster-expansion-based kMC implementation can differentiate among the highly active metals, but it is very sensitive to the set of included interaction parameters. Considering that computational screening relies on a minimal set of descriptors, for which MKM makes reasonable trend predictions at a computational cost roughly three orders of magnitude lower than kMC, the MKM approach does provide a better entry point for computational catalyst design.
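The mean-field MKM discussed above can be illustrated with a minimal sketch for Langmuir-Hinshelwood CO oxidation. This is not the paper’s model: the rate constants below are hypothetical round numbers chosen only to show the structure of the coverage equations, where the mean-field approximation enters through the product θ_CO·θ_O standing in for the true probability of finding a CO*–O* neighbor pair.

```python
def co_oxidation_mkm(kads_co, kdes_co, kads_o2, kdes_o2, krxn,
                     steps=500000, dt=1e-5):
    """Mean-field microkinetic model for Langmuir-Hinshelwood CO oxidation:
        CO + *   <-> CO*       (molecular adsorption/desorption)
        O2 + 2*  <-> 2 O*      (dissociative adsorption/desorption)
        CO* + O* ->  CO2 + 2*  (surface reaction)
    Coverages are integrated by explicit Euler to (near) steady state.
    All rate constants are hypothetical illustrative values."""
    th_co = th_o = 0.0
    for _ in range(steps):
        free = max(1.0 - th_co - th_o, 0.0)          # empty-site fraction
        r_rxn = krxn * th_co * th_o                  # mean-field pair term
        d_co = kads_co * free - kdes_co * th_co - r_rxn
        d_o = 2 * kads_o2 * free**2 - 2 * kdes_o2 * th_o**2 - r_rxn
        th_co += dt * d_co
        th_o += dt * d_o
    tof = krxn * th_co * th_o                        # CO2 turnover frequency
    return th_co, th_o, tof
```

The kMC alternative would replace the θ_CO·θ_O product with explicit lattice configurations, which is exactly where local coverage effects (and the sensitivity to cluster-expansion parameters) come in.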

Mapping biological processes in brain tissues requires piecing together numerous histological observations of multiple tissue samples. We present a direct method that generates readouts for a comprehensive panel of biomarkers from serial whole-brain slices, characterizing all major brain cell types at scales ranging from subcellular compartments and individual cells to local multicellular niches and whole-brain regions of each slice. We use iterative cycles of optimized 10-plex immunostaining with 10-color epifluorescence imaging to accumulate highly enriched image datasets from individual whole-brain slices, from which seamless signal-corrected mosaics are reconstructed. Specific fluorescent signals of interest are isolated computationally, rejecting autofluorescence, imaging noise, cross-channel bleedthrough, and cross-labeling. Reliable large-scale cell detection and segmentation are achieved using deep neural networks. Cell phenotyping is performed by analyzing unique biomarker combinations over appropriate subcellular compartments. This approach can accelerate preclinical drug evaluation and system-level brain histology studies by simultaneously profiling multiple biological processes in their native anatomical context.

This dataset includes a corpus of 200,000+ news articles published by 600 African news organizations between December 4, 2020 and January 3, 2021. The texts have been pre-processed (punctuation and English stopwords have been removed; features have been lowercased, lemmatized, and POS-tagged) and stored in commonly used formats for text mining/computational text analysis. Users are advised to read the documentation for an explanation of the data collection process.
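The pre-processing steps listed above can be sketched with standard-library Python alone. This is an illustrative approximation, not the dataset’s actual pipeline: the stopword list here is a tiny placeholder (a real pipeline would use a full list, e.g. NLTK’s), and lemmatization and POS tagging require an NLP library, so they are only noted in comments.

```python
import string

# Tiny placeholder stopword list; the dataset's pipeline would use a
# full English stopword list (e.g. from NLTK or spaCy).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "by", "for", "on"}

def preprocess(text):
    """Sketch of the described pipeline: strip punctuation, lowercase,
    tokenize on whitespace, and drop stopwords. Lemmatization and POS
    tagging (also part of the dataset's preprocessing) need an external
    NLP library and are omitted from this stdlib-only sketch."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.translate(table).lower().split()
    return [t for t in tokens if t not in STOPWORDS]
```

Example: `preprocess("The corpus of African news, published in 2020.")` yields `["corpus", "african", "news", "published", "2020"]`.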

The kidney-biopsy-based diagnosis of Lupus Nephritis (LN) is characterized by low inter-observer agreement, with misdiagnosis being associated with increased patient morbidity and mortality. Although various Computer Aided Diagnosis (CAD) systems have been developed for other nephrohistopathological applications, little has been done to accurately classify kidneys based on their kidney-level Lupus Glomerulonephritis (LGN) scores. The successful implementation of CAD systems has also been hindered by the diagnosing physician’s perceived classifier strengths and weaknesses, which has been shown to have a negative effect on patient outcomes. We propose an Uncertainty-Guided Bayesian Classification (UGBC) scheme that is designed to accurately classify control, class I/II, and class III/IV LGN (3 classes) on both the glomerular-level classification task (26,634 segmented glomerulus images) and the kidney-level classification task (87 MRL/lpr mouse kidney sections). Data annotation was performed using a high-throughput, bulk labeling scheme designed to take advantage of deep neural networks’ (DNNs’) resistance to label noise. Our augmented UGBC scheme achieved a 94.5% weighted glomerular-level accuracy and a 96.6% weighted kidney-level accuracy, improving upon the standard Convolutional Neural Network (CNN) architecture by 11.8% and 3.5%, respectively.
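The uncertainty-guided idea — classify only when the model is confident, otherwise defer to a pathologist — can be sketched generically. This is a hypothetical illustration, not the paper’s UGBC scheme: the stochastic forward pass, the 3-class toy model, and the entropy threshold are all invented for demonstration.

```python
import math
import random

def mc_predict(probs_fn, x, n_samples=50, seed=0):
    """Average class probabilities over stochastic forward passes (in the
    spirit of MC dropout) and use predictive entropy as the uncertainty."""
    rng = random.Random(seed)
    sums = None
    for _ in range(n_samples):
        p = probs_fn(x, rng)
        sums = p if sums is None else [a + b for a, b in zip(sums, p)]
    mean = [s / n_samples for s in sums]
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return mean, entropy

def classify_or_defer(probs_fn, x, threshold=0.8):
    """Uncertainty gate: return the argmax class only when predictive
    entropy is below the threshold; otherwise defer (return None)."""
    mean, h = mc_predict(probs_fn, x)
    if h >= threshold:
        return None, h  # too uncertain: route to a human expert
    return max(range(len(mean)), key=mean.__getitem__), h

# Toy stochastic classifier over 3 classes (control, I/II, III/IV):
def noisy_probs(x, rng):
    logits = [x + rng.gauss(0, 0.1), 0.0, -x]
    z = [math.exp(v) for v in logits]
    s = sum(z)
    return [v / s for v in z]
```

A strongly separated input (e.g. `x = 3.0`) yields low entropy and a confident class, while an ambiguous one (`x = 0.0`) exceeds the threshold and is deferred; surfacing that distinction to the physician is the kind of feedback the confidence visualizations above provide.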