Research & Software

Current Projects

  • Sequential Sensitivity Analysis for Multiple Assumptions: A Framework for Understanding Racial Disparity in Police Use of Force - [link]

    Joint work with Tom Leavitt and Luke Miratrix. Statistical inference about racial discrimination in police use of force---the average causal effect of civilian race on use of force---requires two assumptions: that officers do not discriminate in whom they would stop, and that within patrol context the probability an encounter is with a minority civilian does not vary across encounters. Existing sensitivity analyses address these assumptions one at a time. Building on Knox et al. (2020), we develop a framework that varies both sequentially---first positing a level of discrimination in stops, then assessing sensitivity to bias in encounters on the resulting data---and apply it to NYPD Stop, Question, and Frisk data (2003-2013). Under plausible levels of discrimination in stops, we find substantial racial disparity in use of force; the conclusion that this disparity reflects discrimination by officers, however, is fragile to modest departures from no-bias-in-encounters that census-based calibration suggests are demographically feasible. Revise and resubmit at the Journal of the American Statistical Association.

    Co-authors: Tom Leavitt, Luke Miratrix

  • Detecting Where Effects Occur by Testing Hypotheses in Order - [link]

    Joint work with Nuole Chen and David Kim, including the manytestsr R package. Experimental evaluations of public policies often randomize a new intervention within many sites or blocks. After a report of an overall result---statistically significant or not---the natural question from a policy maker is: where did any effects occur? Standard adjustments for multiple testing provide little power to answer this question. In simulations modeled after a 44-block education trial, the Hommel adjustment---among the most powerful procedures controlling the family-wise error rate (FWER)---detects effects in only 11% of truly non-null blocks. We develop a procedure that tests hypotheses top-down through a tree: test the overall null at the root, then groups of blocks, then individual blocks, stopping any branch where the null is not rejected. In the same 44-block design, this approach detects effects in 44% of non-null blocks---roughly four times the detection rate. A stopping rule and valid tests at each node suffice for weak FWER control. We show that the strong-sense FWER depends on how rejection probabilities accumulate along paths through the tree. This yields a diagnostic: when power decays fast enough relative to branching, no adjustment is needed; otherwise, an adaptive alpha-adjustment restores control. We apply the method to 25 MDRC education trials and provide an R package, manytestsr.

    Co-authors: Nuole Chen, David Kim
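
    The top-down procedure described above can be sketched in a few lines. This is a minimal illustration, not the manytestsr API: the `Node` class, the `top_down_tests` function, the fixed `alpha`, and the p-values in `toy_p` are all hypothetical, and the sketch omits the adaptive alpha-adjustment mentioned in the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A node in the testing tree: a named set of experimental blocks."""
    name: str
    children: List["Node"] = field(default_factory=list)


def top_down_tests(node: Node, p_value: Callable[[str], float],
                   alpha: float = 0.05) -> List[str]:
    """Return the names of rejected nodes, testing from the root downward.

    A branch is abandoned as soon as its null is not rejected, so the
    descendants of a non-rejected node are never tested.
    """
    if p_value(node.name) > alpha:
        return []  # stopping rule: do not descend into this branch
    rejected = [node.name]
    for child in node.children:
        rejected.extend(top_down_tests(child, p_value, alpha))
    return rejected


# Hypothetical tree: an overall test, two groups of blocks, four blocks.
tree = Node("root", [
    Node("left", [Node("b1"), Node("b2")]),
    Node("right", [Node("b3"), Node("b4")]),
])

# Hypothetical p-values, for illustration only (not results from the paper).
toy_p = {"root": 0.001, "left": 0.004, "right": 0.40,
         "b1": 0.01, "b2": 0.30, "b3": 0.02, "b4": 0.50}

detected = top_down_tests(tree, toy_p.__getitem__)
print(detected)  # ['root', 'left', 'b1']
```

    In this toy example block b3 has a small p-value but is never tested, because the test of its parent group did not reject. That is the price of the stopping rule, and it is also what limits the number of tests performed.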

  • Randomization Tests for Distributions of Individual Treatment Effects via Combined Rank Statistics - [link]

    Joint work with David Kim, Yongchang Su, and Xinran Li. What proportion of treated units actually benefited from an experimental intervention? What is the median or the largest individual treatment effect? This paper develops methods for answering such questions about the distribution of individual causal effects in randomized experiments. Existing approaches require the analyst to select a rank-based test statistic before observing the data. A poor choice can substantially reduce power, while searching over multiple test statistics and adjusting for multiplicity with a Bonferroni correction also incurs power loss. We propose inference procedures that adaptively combine multiple rank-based statistics while maintaining finite-sample validity. For stratified experiments, we further develop weighting schemes that effectively aggregate evidence across strata of heterogeneous sizes. The resulting combined test achieves power comparable to, or exceeding, that of the best individual test, without requiring prior knowledge of the optimal statistic. When applied to a randomized experiment evaluating a teacher training program, the combined test suggests that roughly half of treated teachers benefited, whereas a single rank-based test may indicate only a small minority. Thus, the choice of test determines whether the program appears broadly successful or narrowly effective. Under review.

    Co-authors: Xinran Li, David Kim, Yongchang Su
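
    The idea of combining several rank statistics in one randomization test can be illustrated as follows. This is a simplified sketch under stated assumptions, not the paper's procedure or the CMRSS API: it tests only the sharp null of no effect for any unit in a completely randomized experiment, it assumes no tied outcomes, and it combines Stephenson rank sums by taking the maximum of their standardized values (one possible combination rule, chosen here for illustration). All function names and the toy data are hypothetical.

```python
import random
from math import comb, sqrt


def stephenson_scores(n, s):
    """Rank scores C(r-1, s-1) for ranks r = 1..n (zero when r < s)."""
    return [comb(r - 1, s - 1) for r in range(1, n + 1)]


def standardized_rank_sum(treated_ranks, scores, n, m):
    """Rank-score sum over treated units, centered and scaled by its exact
    mean and variance under complete randomization of m out of n units."""
    mean_score = sum(scores) / n
    ss = sum((a - mean_score) ** 2 for a in scores)
    mean = m * mean_score
    var = m * (n - m) / (n * (n - 1)) * ss
    total = sum(scores[r - 1] for r in treated_ranks)
    return (total - mean) / sqrt(var)


def combined_statistic(treated_ranks, score_sets, n, m):
    """Maximum of the standardized rank sums across the score sets."""
    return max(standardized_rank_sum(treated_ranks, sc, n, m)
               for sc in score_sets)


def combined_rank_test(outcomes, treated, s_values=(2, 6, 10),
                       draws=2000, seed=1):
    """Monte Carlo randomization p-value for the sharp null of no effect.

    Assumes distinct outcomes and every s in s_values at most len(outcomes).
    """
    n, m = len(outcomes), sum(treated)
    # Under the sharp null the outcomes, and hence their ranks, are fixed.
    order = sorted(range(n), key=lambda i: outcomes[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    score_sets = [stephenson_scores(n, s) for s in s_values]
    observed = combined_statistic(
        [ranks[i] for i in range(n) if treated[i]], score_sets, n, m)
    rng = random.Random(seed)
    all_ranks = list(range(1, n + 1))
    hits = 0
    for _ in range(draws):
        # A random assignment is a random subset of m ranks.
        draw = rng.sample(all_ranks, m)
        if combined_statistic(draw, score_sets, n, m) >= observed:
            hits += 1
    return (1 + hits) / (1 + draws)


# Toy data: every treated outcome exceeds every control outcome.
toy_outcomes = [float(v) for v in range(1, 21)]
toy_treated = [0] * 10 + [1] * 10
p_combined = combined_rank_test(toy_outcomes, toy_treated)
print(p_combined)
```

    Because the reference distribution is built for the combined statistic itself, no separate multiplicity correction across the individual rank statistics is applied in this sketch.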

  • Fully Specified Bayes Factors for Hypothesis Testing in Qualitative Research

    Joint work with Matias López and Daniel Gajardo Cooper. Process tracing rests on the evaluation of observations about a single case in light of competing hypotheses; however, different scholars may read the same observations differently. Fairfield and Charman (2022) propose summarizing within-case evidence as a Bayes factor, but their method requires subjective assessments of the probability and weight of evidence, which has drawn sharp criticism (Zaks 2021). In this paper, we propose a solution: deriving such probabilities directly from two fully specified generative models of observation tailored to process tracing research designs. Each model can substitute for the researcher's per-observation judgments, but a researcher can also incorporate weights of evidence. Within each model, we derive the version most favorable to the rival, so the reported Bayes factor is a conservative lower bound on the evidence favoring the working theory. Most importantly, we enable researchers to report how much (a) coding error, (b) observation bias, (c) weighting, and (d) rival-tilted prior a positive conclusion can absorb before flipping in favor of the rival. In practice, this means that final conclusions are driven by sensitivity tests more than by the Bayes factors themselves. To show the usefulness of our approach, we apply the framework to six recent process-tracing studies in top political science journals. In preparation.

    Co-authors: Matias López, Daniel Gajardo Cooper

  • Experimental Reasoning in Process Tracing: A Method for Calculating P-Values for Qualitative Causal Inference - [link]

    Joint work with Matias López. We introduce a method of statistical inference for calculating p-values to test causal hypotheses in qualitative research a la process tracing. As in an experiment, our p-value tells us how often one would make the same or more compelling observations while entertaining a rival theory. We adapt Fisher's (1935) randomization-based urn model to the reality of qualitative researchers, who cannot randomize history, but can make observations about historical processes. Our test includes a method of sensitivity analysis, as well as a framework for representing the varying strength of individual pieces of evidence, altogether informing the robustness of qualitative causal inference. We provide simulations and replications to illustrate how to calculate p-values using any type of qualitative data about one case. This approach fosters plurality in the uses of probability theory in theory-testing process tracing by offering a simple model of statistical inference with provable conservatism, while relying on few assumptions which we address directly. Under review.

    Co-authors: Matias López
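
    As a rough illustration of the urn logic described above, the sketch below computes an upper-tail probability from an unbiased urn drawn without replacement (a hypergeometric tail). It is an assumption-laden toy, not the paper's model or the DrBristol API: the function name `urn_p_value`, its arguments, and the urn composition in the example are hypothetical, and the sketch covers neither biased urns nor weighted evidence.

```python
from math import comb


def urn_p_value(observed_support, draws, urn_support, urn_rival):
    """Probability of drawing at least `observed_support` observations that
    support the working theory in `draws` draws without replacement, from an
    urn holding `urn_support` such observations and `urn_rival` observations
    that support the rival (every item equally likely to be drawn)."""
    total = urn_support + urn_rival
    upper = min(draws, urn_support)
    favorable = sum(
        comb(urn_support, k) * comb(urn_rival, draws - k)
        for k in range(observed_support, upper + 1)
    )
    return favorable / comb(total, draws)


# Hypothetical urn: 5 items supporting the working theory and 5 supporting
# the rival; the researcher makes 4 observations, all supporting the
# working theory.
p_all_four = urn_p_value(4, 4, 5, 5)
print(round(p_all_four, 4))  # 0.0238
```

    With three supporting observations out of four instead, the same urn gives 55/210, about 0.26, so in this toy setting a single contrary observation changes the conclusion considerably.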

  • The Causal Inference for Social Impact (CISIL) Data Challenge - [link]

    Joint work with Carrie Cihak, Betsy Rajala, Quinn Waeiss, Ryan Moore, Laura Stoker, Laura Feeney, Ben Hansen, Crystal Hall, and Anjali Chainani. We recruited 30 teams to evaluate transportation policy impacts in King County, WA, in order to study how analytic decisions affect results and policy conclusions. The challenge launched in February 2022 with 31 teams from 10 countries. In preparation.

    Co-authors: Carrie Cihak, Betsy Rajala, Quinn Waeiss, Ryan Moore, Laura Stoker, Laura Feeney, Ben Hansen, Crystal Hall, Anjali Chainani

Selected Older Work

  • How to Increase the Precision of Causal Inferences in Experiments Using Machine Learning (but without Data Snooping) - [link]

    With Mark Fredrickson and Ben Hansen. On using covariates and machine learning to improve precision in randomization-based causal inferences without data snooping.

  • Ethnicity and Electoral Fraud in New Democracies: Modelling Political Party Agents in Ghana - [link]

    With Nahomi Ichino and Mark Fredrickson. Randomization inference with agent-based models of network propagation of voter registration fraud for theories of party competition and ethnicity in Ghana.

  • Regression without Regrets: A Modular Approach to Linear Models in (Quasi-)Experiments - [link]

    With Mark Fredrickson and Costas Panagopoulos. A modular use of penalized linear machine-learning models in (quasi-)experiments, framed by randomization inference.

  • Fisher's Randomization Mode of Statistical Inference, Then and Now - [link]

    With Costas Panagopoulos. On Fisher's randomization-based mode of statistical inference, its history and its current uses.

  • A General Representation of Potential Outcomes for Graphs/Networks - [link]

    Sole-authored draft, 2012. A potential-outcomes representation for outcomes that depend on graph or network structure.

  • Fixing Broken Experiments: How to Bolster the Case for Ignorability with Full Matching - [link]

    With Ben Hansen, August 2007. On using full matching to bolster ignorability assumptions in compromised experiments.

  • Cycling Involvements: Frequency Domain Time Series Analysis and Political Participation in the USA - [link]

    Sole-authored. A frequency-domain time-series decomposition of aggregate political participation in the United States.

  • A Proposal for a Political Science Registry - [link]

    Committee report drafted with Macartan Humphreys, John Gerring, Don Green, Alan Jacobs, and Jonathan Nagler for the Society for Political Methodology, in consultation with the APSA Experimental Methods and Qualitative & Mixed Methods subsections. The committee did not use author order.

  • Political Participation as a Dynamic Sporadic Process in the Lives of Ordinary Americans (later 'The Shape of Political Participation') - [link]

    With Paul Testa, June 2012. A theory and description of how political participation varies within and across individual lives in the United States.

  • Dissertation Overview - [link]

    Sole-authored overview of my doctoral dissertation.

Software

  • RItools - [code]

    This R package implements randomization inference methods for assessing balance in matched or stratified observational studies and in randomized studies, as well as for assessing causal effects in completely randomized, blocked/stratified, and/or cluster-randomized studies. It implements the $d^2$ omnibus test, which tests either the null hypothesis of no relationship between any covariate and treatment (for balance tests) or the null hypothesis of no effects of a single treatment on any of multiple outcomes.

  • manytestsr (development) - [code]

    An R package for testing many hypotheses in order to localize causal effects in (sets of) experimental blocks.

  • DrBristol (development) - [code]

    A set of functions to compute p-values and perform sensitivity analysis, adapting Fisher’s p-value test to case studies and process tracing following López and Bowers (2025). It uses unbiased and biased urn models to draw null distributions in the absence of randomization.

  • DrWrinch (development) - [code]

    An R package for computing fully specified Bayes factors for process tracing, following López, Bowers, and Gajardo Cooper (2026). Implements a binomial model for open-ended evidence collection and a hypergeometric urn model for bounded archives. Companion to DrBristol, which implements p-value methods for the same class of problems.

  • CMRSS (development) - [code]

    R package for conducting randomization inference for quantiles of individual treatment effects, using combined rank sum statistics, both for completely randomized and stratified randomized experiments. Companion to Kim, Su, Bowers, and Li, Randomization Tests for Distributions of Individual Treatment Effects via Combined Rank Statistics.