Research & Software

Current Projects

Sequential Sensitivity Analysis for Multiple Assumptions: A Framework for Understanding Racial Disparity in Police Use of Force - [link]
Joint work with Tom Leavitt and Luke Miratrix. Statistical inference about racial discrimination in police use of force---the average causal effect of civilian race on use of force---requires two assumptions: that officers do not discriminate in whom they would stop, and that within patrol context the probability an encounter is with a minority civilian does not vary across encounters. Existing sensitivity analyses address these assumptions one at a time. Building on Knox et al. (2020), we develop a framework that varies both sequentially---first positing a level of discrimination in stops, then assessing sensitivity to bias in encounters on the resulting data---and apply it to NYPD Stop, Question, and Frisk data (2003-2013). Under plausible levels of discrimination in stops, we find substantial racial disparity in use of force; the conclusion that this disparity reflects discrimination by officers, however, is fragile to modest departures from no-bias-in-encounters that census-based calibration suggests are demographically feasible. Revise and resubmit at the Journal of the American Statistical Association.

Co-authors: Tom Leavitt, Luke Miratrix
Detecting Where Effects Occur by Testing Hypotheses in Order - [link]
Joint work with Nuole Chen and David Kim, including the manytestsr R package. Experimental evaluations of public policies often randomize a new intervention within many sites or blocks. After an overall statistically significant result is reported, the natural question from a policy maker is: where did effects occur? Standard adjustments for multiple testing answer this question with little power because they ignore how the experiment is organized: blocks nest within cohorts, sites, and districts. We organize the hypotheses in the shape of a tree that follows this administrative structure and test them top-down. We test the overall null at the root, then within groups of blocks, and finally within individual blocks, stopping at any branch where the null is not rejected. A stopping rule and valid tests at each node suffice for weak control of the family-wise error rate (FWER). Whether the procedure also controls the FWER in the strong sense depends on a single quantity we can compute before any data are tested: an error load that summarizes how rejection probability accumulates along paths through the tree. The error load is a diagnostic that tells an analyst, in advance and from design quantities alone, whether the unadjusted top-down procedure controls the FWER or whether an adjustment is required. We apply the method to 25 block-randomized MDRC education trials. In every one, the diagnostic indicates that no adjustment is needed, so the stopping rule and valid tests alone control the FWER while each test runs at the full nominal level. The top-down procedure detects individual affected blocks that the Hommel adjustment (among the most powerful FWER procedures) misses entirely, and locates higher-level groups of blocks containing effects that bottom-up testing cannot evaluate. Building the diagnostic required deriving what the adjustment would be for designs with high error load.
We develop an adaptive $\alpha$-schedule for that regime, prove it controls the FWER on regular, irregular, and pruned trees, and confirm it in simulation. The same diagnostic identifies which designs require that adjustment: in a design calibrated to the National Job Corps Study (a wide, well-powered multisite trial of about one hundred centers), the diagnostic flags a high error load, the unadjusted procedure inflates the family-wise error rate, the adaptive $\alpha$-schedule restores control, and the top-down procedure detects more affected sites than either a bottom-up or a hierarchical correction. Under review.

Co-authors: Nuole Chen, David Kim
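The top-down stopping rule described in this entry can be illustrated with a minimal sketch. This is not the manytestsr implementation: the `Node` class, the `top_down` function, the tree, and the p-values are all hypothetical, and the sketch covers only the unadjusted procedure (every node tested at the full nominal level), not the error-load diagnostic or the adaptive $\alpha$-schedule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One hypothesis in the tree (hypothetical structure for illustration)."""
    name: str
    p_value: float  # assumed to come from a valid test of this node's null
    children: List["Node"] = field(default_factory=list)


def top_down(node: Node, alpha: float = 0.05) -> List[str]:
    """Test a node; descend to its children only if its null is rejected."""
    if node.p_value > alpha:
        return []  # stop here: nothing below this branch is tested
    rejected = [node.name]
    for child in node.children:
        rejected.extend(top_down(child, alpha))
    return rejected


# Made-up example: blocks nested within sites, sites nested within the overall null.
tree = Node("overall", 0.001, [
    Node("site A", 0.01, [Node("block A1", 0.03), Node("block A2", 0.40)]),
    Node("site B", 0.20, [Node("block B1", 0.001)]),
])

rejected = top_down(tree)
print(rejected)
```

In this made-up tree, block B1 is never tested even though its p-value is small, because testing stopped at site B. That is the sense in which the procedure stops at any branch where the null is not rejected.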
Randomization Tests for Distributions of Individual Treatment Effects via Combined Rank Statistics - [link]
Joint work with David Kim, Yongchang Su, and Xinran Li. What proportion of treated units actually benefited from an experimental intervention? What is the median or the largest individual treatment effect? This paper develops methods for answering such questions about the distribution of individual causal effects in randomized experiments. Existing approaches require the analyst to select a rank-based test statistic before observing the data. A poor choice can substantially reduce power, while searching over multiple test statistics and adjusting for multiplicity using Bonferroni correction also incurs power loss. We propose inference procedures that adaptively combine multiple rank-based statistics while maintaining finite-sample validity. For stratified experiments, we further develop weighting schemes that effectively aggregate evidence across strata of heterogeneous sizes. The resulting combined test achieves power comparable to, or exceeding, that of the best individual test, without requiring prior knowledge of the optimal statistic. When applied to a randomized experiment evaluating a teacher training program, the combined test suggests that roughly half of treated teachers benefited, whereas a single rank-based test may indicate only a small minority. Thus, the choice of test determined whether the program appears broadly successful or narrowly effective. Under review.

Co-authors: Xinran Li, David Kim, Yongchang Su
Fully Specified Bayes Factors for Hypothesis Testing in Qualitative Research - [link]
Joint work with Matias López and Daniel Gajardo Cooper. Process tracing rests on the evaluation of observations about a single case in light of competing hypotheses; however, different scholars may read the same observations differently. Fairfield and Charman (2022) propose summarizing within-case evidence as a Bayes factor, but their method requires subjective assessments of the probability and weight of evidence, which has drawn sharp criticism (Zaks 2021). In this paper, we propose a solution by deriving such probabilities directly from two fully specified generative models of observation tailored to process tracing research designs. Each model can substitute for the researcher's per-observation judgments, but a researcher can also incorporate weights of evidence. Within each model, we derive the version most favorable to the rival, so the reported Bayes factor is a conservative lower bound on evidence favoring the working theory. Most importantly, we enable researchers to report how much (a) coding error, (b) observation bias, (c) weighting, and (d) rival-tilted prior a positive conclusion can absorb before flipping in favor of the rival. In practice, this means that final conclusions are driven by sensitivity tests more than by Bayes factors themselves. To show the usefulness of our approach, we apply the framework to six recent process-tracing studies in top political science journals. Under review.

Co-authors: Matias López, Daniel Gajardo Cooper
Experimental Reasoning in Process Tracing: A Method for Calculating P-Values for Qualitative Causal Inference - [link]
Joint work with Matias López. We introduce a method of statistical inference for calculating p-values to test causal hypotheses in qualitative research a la process tracing. As in an experiment, our p-value tells us how often one would make the same or more compelling observations while entertaining a rival theory. We adapt Fisher's (1935) randomization-based urn model to the reality of qualitative researchers, who cannot randomize history, but can make observations about historical processes. Our test includes a method of sensitivity analysis, as well as a framework for representing the varying strength of individual pieces of evidence, altogether informing the robustness of qualitative causal inference. We provide simulations and replications to illustrate how to calculate p-values using any type of qualitative data about one case. This approach fosters plurality in the uses of probability theory in theory-testing process tracing by offering a simple model of statistical inference with provable conservatism, while relying on few assumptions which we address directly.

Co-authors: Matias López
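The idea of an urn-model p-value described in this entry can be sketched with a plain hypergeometric tail probability. This is an illustration only, not the paper's exact model or the DrBristol API: the function name `urn_p_value`, its arguments, and the urn composition in the example are assumptions, and only an unbiased urn (every observation equally likely to be drawn) is covered.

```python
from math import comb


def urn_p_value(k: int, n: int, working: int, rival: int) -> float:
    """Probability of drawing k or more working-theory observations in n draws.

    Hypothetical setup: the urn posited under the rival theory holds
    `working` observations that support the working theory and `rival`
    observations that support the rival; n are drawn without replacement.
    """
    total = working + rival
    upper = min(n, working)
    # math.comb returns 0 when asked to choose more items than are available,
    # so impossible draws contribute nothing to the sum.
    tail = sum(comb(working, x) * comb(rival, n - x) for x in range(k, upper + 1))
    return tail / comb(total, n)


# Made-up example: all 4 observations made support the working theory,
# while the rival-theory urn is assumed to hold 4 such observations and 5 rival ones.
p = urn_p_value(k=4, n=4, working=4, rival=5)
print(round(p, 4))
```

A small value says that observations this favorable to the working theory would rarely arise if the rival theory's urn described the evidence. Sensitivity analysis would then vary the assumed urn composition or introduce unequal drawing odds, which this sketch does not attempt.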
The Causal Inference for Social Impact (CISIL) Data Challenge - [link]
Joint work with Carrie Cihak, Betsy Rajala, Quinn Waeiss, Ryan Moore, Laura Stoker, Laura Feeney, Ben Hansen, Crystal Hall, and Anjali Chainani. To study how analytic decisions affect results and policy conclusions, we recruited 30 teams to evaluate transportation policy impacts in King County, WA. The challenge launched in February 2022 with 31 teams from 10 countries. In preparation.

Co-authors: Carrie Cihak, Betsy Rajala, Quinn Waeiss, Ryan Moore, Laura Stoker, Laura Feeney, Ben Hansen, Crystal Hall, Anjali Chainani
Barriers to Engaging the State: Evidence from Six Randomized Experiments on Transaction Costs, Public Services, and Taxation in the Global South
We present six harmonized RCTs to assess whether removing bureaucratic hurdles can encourage individuals to engage formally with the state in six countries in the Global South. Recent work has argued that the trade-offs inherent to formalization are more acceptable to individuals when formalization is tied to publicly-derived benefits, such as access to legal recourse in disputes, public services, or public utilities. Yet, even in these cases, bureaucratic barriers to formalization might be insurmountable. So far, disconnected research on different bureaucratic procedures has led to contrasting conclusions about how entry costs affect formalization. Our interventions involved in-person assistance to reduce the upfront transaction costs of dealing with the bureaucracy in three types of policy domain: titling property, registration of small businesses, and access to public utilities (i.e., municipal water) and public services (i.e., garbage collection). Our meta-analysis shows that the average effect of these interventions on individuals' formalization, tax payment, and access to services is indistinguishable from zero. We also find substantial heterogeneity in individuals' intent to undertake and complete the bureaucratic process. Across policy domains, individuals are more willing to bear formalization's downstream costs when the benefits are individual. Our results also suggest that local bureaucratic incentives must be aligned for demand-side interventions to work. Under review.

Co-authors: Ana De La O, Donald Green, Peter John, Rafael Goldszmidt, Anna-Katerina Lenz, Martín Valdivia, Cesar Zucco, Darin Christensen, Francisco Garfias, Pablo Balán, Augustin Bergeron, Gabriel Tourek, Jonathan Weigel, Jessica Gottlieb, Adrienne LeBas, Janica Magat, Nonso Obikili, Nuole Chen, Christopher Grady, Matthew Winters, Nikhar Gaikwad, Gareth Nellis, Anjali Thomas, Susan Hyde

Selected Older Work

How to Increase the Precision of Causal Inferences in Experiments Using Machine Learning (but without Data Snooping) - [link]
With Mark Fredrickson and Ben Hansen. On using covariates and machine learning to improve precision in randomization-based causal inferences without data snooping.
Ethnicity and Electoral Fraud in New Democracies: Modelling Political Party Agents in Ghana - [link]
With Nahomi Ichino and Mark Fredrickson. Randomization inference with agent-based models of network propagation of voter registration fraud for theories of party competition and ethnicity in Ghana.
Regression without Regrets: A Modular Approach to Linear Models in (Quasi-)Experiments - [link]
With Mark Fredrickson and Costas Panagopoulos. A modular use of penalized linear machine-learning models in (quasi-)experiments, framed by randomization inference.
Fisher's Randomization Mode of Statistical Inference, Then and Now - [link]
With Costas Panagopoulos. On Fisher's randomization-based mode of statistical inference, its history and its current uses.
A General Representation of Potential Outcomes for Graphs/Networks - [link]
Sole-authored draft, 2012. A potential-outcomes representation for outcomes that depend on graph or network structure.
Fixing Broken Experiments: How to Bolster the Case for Ignorability with Full Matching - [link]
With Ben Hansen, August 2007. On using full matching to bolster ignorability assumptions in compromised experiments.
Cycling Involvements: Frequency Domain Time Series Analysis and Political Participation in the USA - [link]
Sole-authored. A frequency-domain time-series decomposition of aggregate political participation in the United States.
A Proposal for a Political Science Registry - [link]
Committee report drafted with Macartan Humphreys, John Gerring, Don Green, Alan Jacobs, and Jonathan Nagler for the Society for Political Methodology, in consultation with the APSA Experimental Methods and Qualitative & Mixed Methods subsections. The committee did not use author order.
Political Participation as a Dynamic Sporadic Process in the Lives of Ordinary Americans (later 'The Shape of Political Participation') - [link]
With Paul Testa, June 2012. A theory and description of how political participation varies within and across individual lives in the United States.
Dissertation Overview - [link]
Sole-authored overview of my doctoral dissertation.

Software

RItools - [code]
This R package implements randomization inference methods for assessing balance in matched or stratified observational studies and in randomized studies, and for assessing causal effects in completely randomized, blocked/stratified, and/or cluster-randomized studies. It implements the $d^2$ omnibus test of the null hypothesis of no relationship between any covariate and treatment (for balance tests) and of the null hypothesis that a single treatment has no effect on any of multiple outcomes.
manytestsr (development) - [code]
An R package that runs many hypothesis tests to localize causal effects in (sets of) experimental blocks.
DrBristol (development) - [code]
A set of functions to compute p-values and perform sensitivity analysis, adapting Fisher’s p-value test to case studies and process tracing following López and Bowers (2025). It uses unbiased and biased urn models to draw null distributions in the absence of randomization.
DrWrinch (development) - [code]
An R package for computing fully specified Bayes factors for process tracing, following López, Bowers, and Gajardo Cooper (2026). Implements a binomial model for open-ended evidence collection and a hypergeometric urn model for bounded archives. Companion to DrBristol, which implements p-value methods for the same class of problems.
CMRSS (development) - [code]
R package for conducting randomization inference for quantiles of individual treatment effects, using combined rank sum statistics, both for completely randomized and stratified randomized experiments. Companion to Kim, Su, Bowers, and Li, Randomization Tests for Distributions of Individual Treatment Effects via Combined Rank Statistics.