Publications

2022-10-27

Transforming Curriculum and Building Capacity in K–12 Data Science Education

The recently released and updated Pre-K–12 Guidelines for Assessment and Instruction in Statistics Education (GAISE II; Bargagliotti et al., 2020) provides guidance as to how teachers can support the development of data literacy for all students in the pre-K–12 curriculum. However, to truly meet the vision of the GAISE II report and to support all students in developing data literacy for today’s societies, significant transformations need to be made to the educational system as a whole to build capacity for such development. In this article we discuss the current state of the K–12 curriculum focusing on the mathematics curriculum where statistics and data concepts are most frequently situated, presenting some challenges and exciting examples. We then discuss areas of need for capacity building that must come at all levels, including: K–12 school curriculum, K–12 teacher professional development, K–12 teacher preparation, statistics and data science education research, and policies. We also provide a set of recommendations for building capacity to develop the data literacy of all students through the teaching of data science and statistics concepts and practices in the K–12 mathematics curriculum to support democratic equity through engaged citizenship.

2022-06-09

ML / AI, Visualization

Opportunities for K-8 Students to Learn Statistics Created by States’ Standards in the United States

Statistical literacy is key in this heavily polarized information age for an informed and critical citizenry to make sense of arguments in the media and society. The responsibility of developing statistical literacy is often left to the K-12 mathematics curriculum. In this article, we discuss our investigation of K-8 students’ current opportunities to learn statistics created by state mathematics standards. We analyze the standards for alignment to the Guidelines for the Assessment and Instruction in Statistics Education (GAISE II) PreK-12 report and summarize the conceptual themes that emerged. We found that while states provide K-8 students opportunities to analyze and interpret data, they do not offer many opportunities for students to engage in formulating questions and collecting/considering data. We discuss the implications of the findings for policy makers and researchers and provide recommendations for policy makers and standards writers.

2022-02-02

Wenli Gao

Natural Language Processing

Promoting data literacy with campus partners

Gao, W. (2022). Data Visualization Day: Promoting data literacy with campus partners. In K. Getz & M. Brodsky (Eds.), ACRL Data Literacy Cookbook (pp. 203-205). Chicago, IL: Association of College and Research Libraries.

2022-02-01

Richard Meisel

Natural Language Processing

Sex‐specific aging in animals: perspective and future directions

Sex differences in aging occur in many animal species, and they include sex differences in lifespan, in the onset and progression of age‐associated decline, and in physiological and molecular markers of aging. Sex differences in aging vary greatly across the animal kingdom. For example, there are species with longer‐lived females, species where males live longer, and species lacking sex differences in lifespan. The underlying causes of sex differences in aging remain mostly unknown. Currently, we do not understand the molecular drivers of sex differences in aging, or whether they are related to the accepted hallmarks or pillars of aging or linked to other well‐characterized processes. In particular, understanding the role of sex‐determination mechanisms and sex differences in aging is relatively understudied. Here, we take a comparative, interdisciplinary approach to explore various hypotheses about how sex.

2021-12-08

ML / AI, Visualization

The Science of Data, Data Science: Perversions and Possibilities in the Anthropocene Through a Spatial Justice Lens

In the Anthropocene, statistics, data science, and mathematical models have become a perversion of reality that society has largely chosen to ignore and that is embraced as a great savior because people often view numbers as objective purveyors of truth. However, numbers do not interpret themselves and do not tell their own story; people do that in all their subjective glory. In this chapter, I start by making connections between the Anthropocene and the disciplines of statistics and data science, specifically through the context of spatial data. From this discussion, I focus on two main points, which I connect to education. The first is that there is a dialectic tension involved in spatial data enquiry between creating new realities using spatial data and using spatial data to make sense of our reality. The second point is that people can choose how to investigate and use spatial data based on their ethics. I believe students should have opportunities to investigate and use spatial statistics through a spatial justice lens, both to learn about the world around them and to shape the world around them.

2021-09-21

Hien Nguyen

ML / AI

"MorphSet: Improving Renal Histopathology Case Assessment Through Learned Prognostic Vectors." In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 319-328. Springer, Cham, 2021

Computer Aided Diagnosis (CAD) systems for renal histopathology applications aim to understand and replicate nephropathologists’ assessments of individual morphological compartments (e.g. glomeruli) to render case-level histological diagnoses. Deep neural networks (DNNs) hold great promise in addressing the poor intra- and interobserver agreement between pathologists. This being said, the generalization ability of DNNs heavily depends on the quality and quantity of training labels. Current “consensus” labeling strategies require multiple pathologists to evaluate every compartment unit over thousands of crops, resulting in enormous annotative costs. Additionally, these techniques fail to address the underlying reproducibility issues we observe across various diagnostic feature assessment tasks. To address both of these limitations, we introduce MorphSet, an end-to-end architecture inspired by Set Transformers which maps the combined encoded representations of Monte Carlo (MC) sampled glomerular compartment crops to produce Whole Slide Image (WSI) predictions on a case basis without the need for expensive fine-grained morphological feature labels. To evaluate performance, we use a kidney transplant Antibody Mediated Rejection (AMR) dataset, and show that we are able to achieve 98.9% case level accuracy, outperforming the consensus label baseline. Finally, we generate a visualization of prediction confidence derived from our MC evaluation experiments, which provides physicians with valuable feedback.

MorphSet: Improving Renal Histopathology Case Assessment Through Learned Prognostic Vectors

2021-07-15

Gopal Pandurangan

Scientific Computing

On the Distributed Complexity of Large-Scale Graph Computations

Motivated by the increasing need to understand the distributed algorithmic foundations of large-scale graph computations, we study some fundamental graph problems in a message-passing model for distributed computing where k ≥ 2 machines jointly perform computations on graphs with n nodes (typically, n ≫ k). The input graph is assumed to be initially randomly partitioned among the k machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication rounds of the computation.

Our main contribution is the General Lower Bound Theorem, a theorem that can be used to show non-trivial lower bounds on the round complexity of distributed large-scale data computations. This result is established via an information-theoretic approach that relates the round complexity to the minimal amount of information required by machines to solve the problem. Our approach is generic, and this theorem can be used in a “cookbook” fashion to show distributed lower bounds for several problems, including non-graph problems. We present two applications by showing (almost) tight lower bounds on the round complexity of two fundamental graph problems, namely, PageRank computation and triangle enumeration. These applications show that our approach can yield lower bounds for problems where the application of communication complexity techniques seems not obvious or gives weak bounds, including and especially under a stochastic partition of the input.

ACM Transactions on Parallel Computing, Volume 8, Issue 2.
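
The random input partition mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' code: the function names, the vertex-based assignment (each node is sent to a uniformly random machine together with its incident edges), and the seed parameter are assumptions made only for illustration.

```python
import random
from collections import defaultdict


def random_vertex_partition(n, edges, k, seed=0):
    """Assign each of n nodes to one of k machines uniformly at random.

    Assumption for this sketch: a machine that hosts a node also stores
    every edge incident to that node, so an edge whose endpoints live on
    two different machines is stored on both.
    """
    rng = random.Random(seed)
    home = [rng.randrange(k) for _ in range(n)]  # home[v] = machine hosting v
    local_edges = defaultdict(list)              # machine id -> stored edges
    for u, v in edges:
        local_edges[home[u]].append((u, v))
        if home[v] != home[u]:
            local_edges[home[v]].append((u, v))
    return home, local_edges


def count_cross_edges(home, edges):
    """Count edges whose endpoints are hosted on different machines."""
    return sum(1 for u, v in edges if home[u] != home[v])


if __name__ == "__main__":
    # Hypothetical toy input: a 6-node cycle split across k = 3 machines.
    toy_edges = [(i, (i + 1) % 6) for i in range(6)]
    home, local_edges = random_vertex_partition(6, toy_edges, k=3, seed=42)
    print("node -> machine:", home)
    print("cross-machine edges:", count_cross_edges(home, toy_edges))
```

Cross-machine edges are the ones that force point-to-point communication, which is why the round complexity studied in the paper depends on how the input happens to be spread over the k machines.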

2021-06-28

Gopal Pandurangan

Scientific Computing

Efficient Distributed Algorithms in the k-machine model via PRAM Simulations

We study several fundamental problems in the k-machine model, a message-passing model for large-scale distributed computations where k ≥ 2 machines jointly perform computations on a large input of size N (typically, N ≫ k). The input is initially partitioned (randomly or in a balanced fashion) among the k machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication rounds of the computation. Our main result is a general technique for designing efficient deterministic distributed algorithms in the k-machine model using PRAM algorithms. Our technique works by efficiently simulating PRAM algorithms in the k-machine model in a deterministic way. This simulation allows us to arrive at new algorithms in the k-machine model for some problems for which no efficient k-machine algorithms were previously known, and also to improve on existing results in the k-machine model for some problems. While our simulation allows us to obtain k-machine algorithms for any problem with a known PRAM algorithm, we mainly focus on graph problems. For an input graph on n vertices and m edges, we obtain Õ(m/k²)-round algorithms for various graph problems such as r-connectivity for r = 1, 2, 3, 4, minimum spanning tree (MST), maximal independent set (MIS), (Δ + 1)-coloring, maximal matching, ear decomposition, and spanners, under the assumption that the edges of the input graph are partitioned (randomly, or in an arbitrary but balanced fashion) among the k machines. For problems such as connectivity and MST, the above bound is (essentially) the best possible (up to logarithmic factors). Our simulation technique allows us to obtain the first known efficient deterministic algorithms in the k-machine model for other problems with known deterministic PRAM algorithms.

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

2021-06-12

Margit Wiesner

Scientific Computing

Intergenerational Associations in Crime for an At-Risk Sample of U.S. Men: Factors that May Mitigate or Exacerbate Transmission

Evidence from several international studies indicates that criminal activity and involvement with the criminal justice system tend to be concentrated in families. Comparatively little work has studied factors that exacerbate or lessen intergenerational associations in crime. Knowledge about factors that are (1) malleable and (2) capable of lessening or exacerbating the intergenerational cycle of criminal behavior is highly relevant for prevention efforts and public policy-makers. This paper studied potential moderators of intergenerational associations in crime (from late childhood, adolescence, early adulthood). We found that late childhood cognitive function and early adult employment history, substance use, and romantic partner’s antisocial behavior moderated intergenerational associations in crime for a sample of at-risk men and their parents. The identified moderators may be used as selection criteria or targeted in prevention and treatment efforts aimed at reducing such associations.

Journal of Developmental and Life-Course Criminology.

2021-05-22

Pietro Milillo

Image Analysis, Scientific Computing

Integrated InSAR monitoring and structural assessment of tunnelling-induced building deformations

Structural deformation monitoring is crucial for the identification of early signs of tunnelling-induced damage to adjacent structures and for the improvement of current damage assessment procedures. Satellite multi-temporal interferometric synthetic aperture radar (MT-InSAR) techniques enable measurement of building displacements over time with millimetre-scale accuracy. Compared to traditional ground-based monitoring, MT-InSAR can yield denser and cheaper building observations, representing a cost-effective monitoring tool. However, without integrating MT-InSAR techniques and structural assessment, the potential of InSAR monitoring cannot be fully exploited. This integration is particularly demanding for large construction projects, where big datasets need to be processed. In this paper, we present a new automated methodology that integrates MT-InSAR-based building deformations and damage assessment procedures to evaluate settlement-induced damage to buildings adjacent to tunnel excavations. The developed methodology was applied to the buildings along an 8-km segment of the Crossrail tunnel route in London, using COSMO-SkyMed MT-InSAR data from 2011 to 2015. The methodology enabled the identification of damage levels for 858 buildings along the Crossrail twin tunnels, providing an unprecedented number of high quality field observations for building response to settlements. The proposed methodology can be used to improve current damage assessment procedures, for the benefit of future underground excavation projects in urban areas.
