## Summary

The *surrogate index* method allows policymakers to estimate long-run treatment effects before long-run outcomes are observable. We meta-analyse this approach over nine long-run RCTs in development economics, comparing surrogate estimates to estimates from actual long-run RCT outcomes. We introduce the *M-lasso* algorithm for constructing the surrogate approach’s first-stage predictive model and compare its performance with other surrogate estimation methods. Across methods, we find a negative bias in surrogate estimates. For the M-lasso method, in particular, we investigate reasons for this bias and quantify significant precision gains. This provides evidence that the surrogate index method incurs a bias-variance trade-off.

## Introduction

The long-term effects of treatments and policies are important in many different fields. In medicine, one may want to estimate the effect of a surgery on life expectancy; in economics, the effect of a conditional cash transfer during childhood on adult income. One way to measure these effects would be to run a randomised controlled trial (RCT) and then wait to observe the long-run outcomes. However, the results would be observed too late to inform policy decisions made today.

A prominent solution to this issue is the *surrogate index*, a method for estimating long-run effects without long-run outcome data, which was originally proposed by Athey, Chetty, Imbens, and Kang (2019). Our paper contributes to the evolving literature on this method by examining its empirical performance in a wide range of RCT contexts. We also build on the literature initiated by LaLonde (1986) on the bias of non-experimental methods by extending the set of estimators studied to those focused on long-term effects. Our findings and recommendations aim to guide practitioners intending to use the surrogate index method, thereby aiding in the development of effective long-term treatment strategies.

We test the surrogate approach on data from nine RCTs in development economics. These RCTs are selected on the basis of being long-running and having a sufficiently large sample size.

In each RCT, we first produce an unbiased estimate of the standard experimental average treatment effect by regressing long-term outcomes on treatment status. Next, we reanalyse the data using the surrogate index approach. If the surrogate estimate is close to the unbiased estimate from the experimental approach, then the surrogate index method is working well. We run meta-analyses on the difference between these estimates to understand how well the surrogate index method performs under different conditions.
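To make the comparison above concrete, here is a minimal sketch on simulated data. It is not the paper's implementation: the data-generating process, the plain OLS first stage, and all function names (`simulate`, `ols_fit`, `ols_predict`, `diff_in_means`) are assumptions chosen for illustration, and the simulation satisfies the surrogacy assumption by construction, so the two estimates should roughly agree.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients for a regression of y on [1, X]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def ols_predict(beta, X):
    """Fitted values from coefficients returned by ols_fit."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ beta

def diff_in_means(y, w):
    """Treated-minus-control difference in means."""
    return y[w == 1].mean() - y[w == 0].mean()

def simulate(n, rng, treated):
    """Hypothetical data: treatment shifts two surrogates, which drive Y.

    Treatment affects Y only through the surrogates, so surrogacy holds.
    """
    w = rng.integers(0, 2, size=n) if treated else np.zeros(n, dtype=int)
    s = rng.normal(size=(n, 2)) + 0.5 * w[:, None]
    y = s @ np.array([1.0, 0.5]) + rng.normal(size=n)
    return w, s, y

rng = np.random.default_rng(0)
_, s_obs, y_obs = simulate(5000, rng, treated=False)   # observational sample
w_exp, s_exp, y_exp = simulate(5000, rng, treated=True)  # experimental sample

# Benchmark: experimental estimate using the actual long-run outcome.
experimental_estimate = diff_in_means(y_exp, w_exp)

# First stage: predict the long-run outcome from surrogates (observational data).
beta = ols_fit(s_obs, y_obs)

# Surrogate index: predicted long-run outcome in the experimental sample.
index = ols_predict(beta, s_exp)
surrogate_estimate = diff_in_means(index, w_exp)

# The quantity one would meta-analyse across studies.
difference = surrogate_estimate - experimental_estimate
print(experimental_estimate, surrogate_estimate, difference)
```

In this toy setup the true effect is 0.75 (0.5 × 1.0 + 0.5 × 0.5). Dropping one of the two surrogate columns before the first stage is a simple way to see how a missing surrogate can pull the surrogate estimate below the experimental one.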

We test many different implementations of the surrogate index estimator, varying (1) the set of surrogates used, (2) the first-stage prediction method used, and (3) the observational dataset used to construct the surrogate index. Notably, we introduce a new estimator called the *M-lasso*, which is specifically designed for use with the surrogate method.

When meta-analysing our results, we find that the surrogate index method is consistently negatively biased and underestimates positive long-term treatment effects by 0.05 standard deviations on average. This is the case regardless of which estimation method we use. We suggest that this is due to missing surrogates, as well as bias in the first-stage predictive model of the surrogate procedure.

While it is important to understand this negative bias as a potential shortcoming of the surrogate approach, we would not necessarily take it to dissuade researchers from this method altogether. Instead, one could interpret surrogate estimates as a reasonable lower bound on the true long-term treatment effect. Furthermore, there is often no better alternative for estimating the true effect.

We also study potential determinants of the surrogate bias for the M-lasso estimator. In particular, we find suggestive evidence that M-lasso bias is smaller for simpler interventions. However, we do not find that this bias depends on the predictive accuracy of the first-stage model in the observational dataset. Our evidence is also inconclusive about how bias is affected by longer time horizons between the surrogates and the outcomes.

We further show that, despite the potential bias from using the surrogate index method, it results in significant precision gains, with standard errors that are on average 52% of the size of those from the long-term RCT estimates. Hence, even if researchers had access to long-term outcomes, they might still choose to use the surrogate index, depending on their willingness to trade off bias and variance.
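One way to read this trade-off is through mean squared error (bias squared plus variance). The sketch below is only illustrative. It assumes MSE is the criterion a researcher cares about, it applies the two average figures quoted above (a bias of 0.05 standard deviations and a standard-error ratio of 0.52) to a single hypothetical study, and it treats the long-term RCT estimate as unbiased. The function names are hypothetical.

```python
import math

SURROGATE_BIAS = -0.05  # average bias in standard deviations, from the text
SE_RATIO = 0.52         # surrogate SE relative to the long-term RCT SE, from the text

def mse(bias, se):
    """Mean squared error of an estimator with the given bias and standard error."""
    return bias ** 2 + se ** 2

def breakeven_rct_se(bias=SURROGATE_BIAS, ratio=SE_RATIO):
    """RCT standard error above which the surrogate estimate has lower MSE.

    Solves bias**2 + (ratio * se)**2 = se**2 for se.
    """
    return abs(bias) / math.sqrt(1 - ratio ** 2)

threshold = breakeven_rct_se()
print(round(threshold, 4))

# Hypothetical RCT standard errors (in SD units) on either side of the threshold.
for rct_se in (0.03, 0.10):
    print(rct_se, mse(0.0, rct_se), mse(SURROGATE_BIAS, SE_RATIO * rct_se))
```

Under these assumptions the surrogate estimate has the lower MSE only when the long-term RCT's own standard error is fairly large relative to the bias. For a very precisely estimated RCT, the bias dominates.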

The rest of this paper proceeds as follows. Section 2 discusses related literature. Section 3 summarises the econometric theory behind the surrogate index approach, and section 4 describes the data we use in more detail. Section 5 explains the methods we use to produce comparable long-term RCT and surrogate index estimates. Section 6 presents results of the meta-analysis over nine RCTs for different implementations of the surrogate index. In it, we empirically characterise the bias and standard errors for the surrogate method, and examine which surrogates are selected by the M-lasso. Finally, section 7 concludes.

I liked this a lot. For context, I work as an RA on an impact evaluation project. I have light interests / familiarity with meta-analysis + machine learning, but I did not know what surrogate indices were going into the paper. Some comments below, roughly in order of importance:

1. **Unclear contribution.** I feel there's 3 contributions here: (1) an application of surrogate method to long-term development RCTs, (2) a graduate-level intro to the surrogate method, and (3) a new M-lasso method which I mostly ignored. I read the paper mostly for the first 2 contributions, so I was surprised to find out that the novel contribution was actually M-lasso.
2. **Missing relevance for "Very Long-Run" Outcomes.** Given the mission of Global Priorities Institute, I was thinking throughout how the surrogate method would work when predicting outcomes on a 100-year horizon or 1000-year horizon. Long-run RCTs will get you around the 10-year mark. But presumably, one could apply this technique to some historical econ studies with (I would assume) shaky foundations.
3. **Intuition and layout is good.** I followed a lot of this pretty well despite not knowing the fiddly mechanics of many methods. And I had a good idea on what insight I would gain if I dived into the details in each section. It's also great that the paper led with a graph diagram and progressed from simple kitchen sink regression before going into the black box ML methods.
4. **Estimator properties could use more clarity.**
   - **(a) Unsure what "negative bias" is.** I don't know if the "negative bias" in surrogate index is an empirical result arising from this application, or a theoretical result where the estimator is biased in a negative direction. I'm also unsure if this is attenuation (biasing towards 0) or an honest-to-god negative bias. The paper sometimes mentions attenuation and other times negative bias but as far as I can tell, there's one surrogacy technique used.
   - **(b) Is surrogate index biased and inconsistent?** Maybe machine learning sees this differently, but I think of estimators as ideally being unbiased and consistent (i.e. consistent meaning more probability mass around the true value as sample size tends to infinity). I get that the surrogate index has a bias of some kind, but I'm unclear on if there's also the asymptotic property of consistency. And at some point, a limit is mentioned but not what it's a limit with respect to (larger sample size within each trial is my guess, but I'm not sure).
   - **(c) How would null effects perform?** I might be wrong about this but I think normalization of standard errors wouldn't work if treatment effects are 0...
5. **Got confused on relation between Prentice criterion and regular unconfoundedness.** Maybe this is something I just have to sit down and learn one day, but I initially read Prentice criterion as a standard econometric assumption of exogeneity. But then the theory section mentions Prentice criterion (Assumption 3) as distinct from unconfoundedness (Assumption 1). It is good the assumptions are spelt out since that pointed out a bad assumption I was working with but perhaps this can be clarified.
6. **Analogy to Instrumental Variables / mediators could use a bit more emphasis.** The econometric section (lit review?) buries this analogy towards the end. I'm glad it's mentioned since it clarifies the first-stage vibes I was getting through the theory section, but I feel it's (1) possibly a good hook to lead the theory section and (2) something worth discussing a bit more.
7. **Could expand Table 1 with summary counts on outcomes per treatment.** 9 RCTs sounds tiny, until I remember that these have giant sample sizes, multiple outcomes, and multiple possible surrogates. A summary table of sample size, outcomes, and surrogates used might give a bit more heft to what's forming the estimates.

Other stuff I really liked:

- **The "selection bias" in long-term RCTs is cool.** I like the paragraph discussing how these results are biased by what gets a long-term RCT. Perhaps it's good emphasizing this as a limitation in the intro or perhaps it's a good follow-on paper. Another idea is how surrogates would perform in dynamic effects that grow over time. Urban investments, for example, might have no effect until agglomeration kicks in.
- **The surprising result of surrogates being more precise than actual RCT outcomes.** This was a pretty good hook for me but I could have easily passed over it in the intro. I also think the result here captures the core intuition of bias-variance tradeoff + surrogate assumption in the paper quite strongly.

Hi Geoffrey, thanks for these comments, they are really helpful as we move to submitting this to journals. Some miscellaneous responses:

4a. The negative bias is purely an empirical result, but one that we expect to arise in many applications. We can't say for sure whether it's always negative or attenuation bias, but the hypothesis we suggest to explain it is compatible with attenuation bias of the treatment effects to 0 and treatment effects generally being positive. However, when we talk about attenuation in the paper, we're typically talking about attenuation in the prediction of long-run outcomes, not attenuation in the treatment effects.

4b. The surrogate index is unbiased and consistent if the assumptions behind it are satisfied. This is the case for most econometric estimators. What we do in the paper is show that the key surrogacy assumption is empirically not perfectly satisfied in a variety of contexts. Since this assumption is not satisfied, the estimator is empirically biased and inconsistent in our applications. However, this is not what people typically mean when they say an estimator is theoretically biased and inconsistent. Personally, I think econometrics focuses too heavily on unbiasedness, and cares too much about asymptotic properties of estimators and too little about how well they perform in these empirical LaLonde-style tests. I am sympathetic to the ML willingness to trade off bias and variance.

4c. The normalisation depends on the standard deviation of the control group, not the standard error, so we should be fine to do that regardless of what the actual treatment effect is. We would be in trouble if there was no variation in the control group outcome, but this seems to occur very rarely (or never).
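As a rough illustration of the point in 4c, the sketch below standardises a treatment effect by the control-group standard deviation. It is not the paper's code: the function name `standardised_effect`, the toy data, and the choice of the sample standard deviation (`ddof=1`) are assumptions made for the example.

```python
import numpy as np

def standardised_effect(y, w):
    """Difference in means divided by the control-group standard deviation.

    The denominator depends only on control-group variation, so a zero
    treatment effect poses no problem. Only a constant control outcome does.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w)
    control_sd = y[w == 0].std(ddof=1)  # assumption: sample SD of controls
    if control_sd == 0:
        raise ValueError("control-group outcome has no variation")
    return (y[w == 1].mean() - y[w == 0].mean()) / control_sd

# Toy data with an exactly zero treatment effect.
y = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
w = np.array([0, 0, 0, 1, 1, 1])
null_effect = standardised_effect(y, w)
print(null_effect)
```

With a null effect the normalised estimate is simply zero. The function only fails in the degenerate case where every control unit has the same outcome.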