Research & Software
Current Projects
-
Joint Sensitivity Analysis for Multiple Assumptions: Unpacking Racial Disparity in Police Use of Force
Joint work with Tom Leavitt and Luke Miratrix. We develop a joint sensitivity analysis for bias in police encounters and stops, applied to NYPD Stop, Question, and Frisk data (2003-2013). We show how dependence between encounter bias and stop bias changes inference about racial disparities in police use of force, and we assess robustness as violations of the identifying assumptions increase. Revise and resubmit at the Journal of the American Statistical Association.
-
Detecting Where Effects Occur by Testing Hypotheses in Order
- [link]
Joint work with Nuole Chen and David Kim. Experimental evaluations of public policies often randomize a new intervention within many sites or blocks. After a report of an overall result---statistically significant or not---the natural question from a policymaker is: where did any effects occur? Standard adjustments for multiple testing provide little power to answer this question. In simulations modeled after a 44-block education trial, the Hommel adjustment---among the most powerful procedures controlling the family-wise error rate (FWER)---detects effects in only 11% of truly non-null blocks. We develop a procedure that tests hypotheses top-down through a tree: test the overall null at the root, then groups of blocks, then individual blocks, stopping any branch where the null is not rejected. In the same 44-block design, this approach detects effects in 44% of non-null blocks---roughly four times the detection rate. A stopping rule and valid tests at each node suffice for weak FWER control. We show that strong-sense FWER control depends on how rejection probabilities accumulate along paths through the tree. This yields a diagnostic: when power decays fast enough relative to branching, no adjustment is needed; otherwise, an adaptive alpha-adjustment restores control. We apply the method to 25 MDRC education trials and implement the procedure in the manytestsr R package.
-
Randomization Tests for Distributions of Individual Treatment Effects Combining Multiple Rank Statistics
Joint work with Xinran Li, David Kim, and Yongchang Su. What proportion of treated units actually benefited from an experimental intervention? What was the largest treatment effect? This paper develops methods to answer questions about distributions of individual causal effects in randomized experiments. Existing rank-based approaches require the analyst to choose a tuning parameter before seeing the data. Choosing wrong costs power. Searching across multiple tests for a tuning parameter and then correcting for multiplicity also sacrifices power. This paper presents inference procedures that adaptively combine multiple rank statistics while preserving finite-sample validity. For stratified experiments, we develop weighting strategies that aggregate evidence across strata of varying sizes. The combined test matches or beats the best individual test---without requiring the analyst to know the best tuning parameter in advance. When applied to a randomized experiment evaluating teacher training, the combined test suggests that roughly half of treated teachers benefited, while a single rank test could suggest only a small minority did. The choice of test determined whether the program appeared broadly successful or narrowly effective.
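The combining idea can be illustrated with a small Python sketch: compute several rank statistics, take the minimum of their p-values, and recalibrate that minimum against its own permutation distribution so finite-sample validity is preserved despite the data-driven choice. (This is an illustration of the general adaptive-combination idea only; the paper's procedure, implemented in the CMRSS R package, targets quantiles of individual effects and handles stratification.)

```python
import numpy as np

def rank_stats(y, z):
    # Two candidate rank statistics for the treated group: sums of ranks
    # and of squared ranks (two "tuning parameter" choices).
    r = np.argsort(np.argsort(y)) + 1.0
    return np.array([r[z == 1].sum(), (r[z == 1] ** 2).sum()])

def combined_test(y, z, n_perm=2000, seed=0):
    # Combine the statistics through their minimum p-value, recalibrated
    # against the permutation distribution of that minimum.
    rng = np.random.default_rng(seed)
    obs = rank_stats(y, z)
    stats = np.vstack([obs] + [rank_stats(y, rng.permutation(z))
                               for _ in range(n_perm)])
    # p_each[i] holds the one-sided p-values of row i's two statistics,
    # computed against the same permutation draws (row 0 is the observed data)
    p_each = np.array([(stats >= s).mean(axis=0) for s in stats])
    return (p_each.min(axis=1) <= p_each[0].min()).mean()

z = np.repeat([1, 0], 20)
r = np.random.default_rng(1)
y = np.concatenate([r.normal(3, 1, 20), r.normal(0, 1, 20)])
print(combined_test(y, z))  # small p-value: strong evidence of effects
```

Because the minimum p-value is itself recalibrated, the analyst pays no multiplicity penalty beyond what the permutation distribution of the minimum already reflects.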
-
Fully Specified Bayes Factors for Hypothesis Testing in Qualitative Research
Joint work with Matias Lopez and Daniel Gajardo Cooper. We enhance the use of Bayes factors for hypothesis testing in case studies and process tracing by proposing two generative models of how the evidence was produced. The researcher classifies each observation as favoring one theory or the other---a directional judgment, not a magnitude judgment---and a probability model supplies the rest. We provide two models: a binomial model for open-ended research designs in which the evidence base could always grow, and a hypergeometric urn model for bounded archives in which the researcher has examined a substantial fraction of what exists. We develop a decision threshold grounded in the e-values literature and a sensitivity analysis that answers a concrete question: how many observations would the researcher need to have miscoded to overturn the finding? We illustrate with competing explanations for Weimar Germany's democratic collapse and provide a companion R package, DrBristol. In preparation.
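For intuition, a generic binomial Bayes factor and the miscoding sensitivity check can be sketched in a few lines of Python. The probabilities `p1` and `p0` and the threshold of 10 below are illustrative placeholders, not the paper's calibrated values (which live in the DrBristol R package):

```python
from math import comb

def binomial_bayes_factor(k, n, p1=0.8, p0=0.5):
    """Bayes factor for theory 1 vs theory 2 when k of n observations
    favor theory 1. `p1` is the assumed probability an observation favors
    theory 1 if it is true; `p0` the same probability under the rival
    theory (0.5 = evidence uninformative). Both values are illustrative."""
    like = lambda p: comb(n, k) * p**k * (1 - p)**(n - k)
    return like(p1) / like(p0)

def miscodings_to_overturn(k, n, threshold=10.0, **kw):
    """How many favorable observations must be recoded against theory 1
    before the Bayes factor drops below `threshold`?"""
    for m in range(k + 1):
        if binomial_bayes_factor(k - m, n, **kw) < threshold:
            return m
    return None
```

With 12 of 15 observations favoring theory 1 under these placeholder values, the Bayes factor exceeds 10, but recoding a single observation drops it below that threshold, so the finding is fragile to even one miscoding.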
-
Experimental Reasoning in Process Tracing: A Method for Calculating P-Values for Qualitative Causal Inference
- [link]
Joint work with Matias Lopez. We adapt Fisher's urn model to qualitative process tracing to compute p-values for evidence supporting one theory over another. The method includes sensitivity analysis for observation bias and a framework for weighing evidence strength, illustrated with simulations and replications. Under review.
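The tail-probability computation at the heart of an urn-model p-value is simple to state; here is a hedged Python sketch (the paper's urn model, evidence weighting, and sensitivity analysis are richer than this, and the null urn composition below is an analyst's assumption, not a given):

```python
from math import comb

def urn_tail_p(N, K, n, k):
    """P(X >= k) when n observations are drawn without replacement from
    an urn of N pieces of evidence, K of which favor theory 1: a
    hypergeometric tail probability. The null composition (K of N) is an
    assumption the analyst must defend."""
    return sum(comb(K, j) * comb(N - K, n - j)
               for j in range(k, min(n, K) + 1)) / comb(N, n)

# If half of a 20-piece evidence base favors theory 1 under the null,
# observing 8 or more favorable pieces in 10 draws is surprising:
print(urn_tail_p(N=20, K=10, n=10, k=8))  # → about 0.0115
```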
-
The Causal Inference for Social Impact (CISIL) Data Challenge
- [link]
Joint work with Carrie Cihak, Betsy Rajala, Quinn Waeiss, Ryan Moore, Laura Stoker, Laura Feeney, Ben Hansen, Crystal Hall, and Anjali Chainani. We recruited teams of researchers to evaluate transportation policy impacts in King County, WA, in order to study how analytic decisions affect results and policy conclusions. The challenge launched in February 2022 with 31 teams from 10 countries. In preparation.
Software
-
RItools
- [code]
This R package implements randomization inference methods for (1) assessing covariate balance in matched or stratified observational studies and in randomized studies, and (2) testing hypotheses about causal effects in completely, block/stratified, and/or cluster-randomized studies. It implements the $d^2$ test, an omnibus test of the null hypothesis of no relationship between treatment and any covariate (a balance test) or of no effect of a single treatment on any of multiple outcomes.
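The flavor of an omnibus permutation balance test can be conveyed in a short Python sketch: a Mahalanobis-type distance between treated and control covariate means, referred to its permutation distribution. (An illustration only; RItools' actual $d^2$ test is implemented in R with its own formulation and reference distribution.)

```python
import numpy as np

def omnibus_balance_test(X, z, n_perm=1000, seed=0):
    # X: n-by-p covariate matrix; z: 0/1 treatment indicator.
    # One statistic summarizes imbalance across all covariates at once,
    # under the null of no treatment/covariate relationship.
    rng = np.random.default_rng(seed)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))

    def d2(zz):
        diff = X[zz == 1].mean(axis=0) - X[zz == 0].mean(axis=0)
        return diff @ cov_inv @ diff

    obs = d2(z)
    perm = np.array([d2(rng.permutation(z)) for _ in range(n_perm)])
    # p-value: how often a random reassignment shows at least as much
    # imbalance as the observed one (observed included for validity)
    return (np.append(perm, obs) >= obs).mean()
```

A single omnibus p-value avoids the multiplicity problem of running one balance test per covariate.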
-
manytestsr (development)
- [code]
An R package for running many tests to localize causal effects within (sets of) experimental blocks.
-
DrBristol (development)
- [code]
A set of functions to compute p-values and perform sensitivity analyses, adapting Fisher's p-value test to case studies and process tracing, following López and Bowers (2025). It uses unbiased and biased urn models to generate null distributions in the absence of randomization.
-
CMRSS (development)
- [code]
An R package for conducting randomization inference for quantiles of individual treatment effects using combined rank sum statistics, for both completely randomized and stratified randomized experiments.