Introduction
When I worked as a doctor, we had a lecture by a paediatric haematologist, on a condition called Acute Lymphoblastic Leukaemia. I remember being impressed that very large proportions of patients were being offered trials randomising them between different treatment regimens, currently in clinical equipoise, to establish which had the edge. At the time, one of the areas of interest was, given the disease tended to have a good prognosis, whether one could reduce treatment intensity to reduce the long term side-effects of the treatment whilst not adversely affecting survival.
On a later rotation I worked in adult medicine, and one of the patients admitted to my team had an extremely rare cancer, with a (recognised) incidence of a handful of cases worldwide per year. It happened the world authority on this condition worked as a professor of medicine in London, and she came down to see them. She explained to me that treatment for this disease was almost entirely based on first principles, informed by a smattering of case reports. The disease unfortunately had a bleak prognosis, although she was uncertain whether this was because it was an aggressive cancer to which current medical science has no answer, or whether there was an effective treatment out there if only it could be found.
I aver that many problems EA concerns itself with are closer to the second story than the first: in many cases, sufficient data is not only absent in practice but impossible to obtain in principle. Reality is often underpowered for us to wring from it the answers we desire.
Big units of analysis, small samples
The main driver of this problem for ‘EA topics’ is that the outcomes of interest have units of analysis for which the whole population (let alone any sample from it) is small-n: e.g. outcomes at the level of a whole company, or a whole state, or whole populations. For these big unit of analysis/small sample problems, RCTs face formidable in-principle challenges:
- Even if by magic you could get (e.g.) all countries on earth to agree to randomly allocate themselves to policy X or Y, this is merely a sample size of ~200. If you’re looking at companies relevant to cage-free campaigns, or administrative regions within a given state, this can easily fall another order of magnitude.
- These units of analysis tend to be highly heterogeneous, almost certainly in ways that affect the outcome of interest. Although the key ‘selling point’ of the RCT is that it implicitly controls for all confounders (even ones you don’t know about), this statistical control is a (convex) function of sample size, and isn’t hugely impressive at ~100 per arm: it is well within the realms of possibility for the randomisation to happen to give arms with an unbalanced allocation of any given confounding factor.
- ‘Roughly’ (in expectation) balanced intervention arms are unlikely to be good enough in cases where the intervention is expected to have much less effect on the outcome than other factors (e.g. wealth, education, size, whatever), thus an effect size that favours one arm or the other can be alternatively attributed to one of these.
- Supplementing this raw randomisation by explicitly controlling for confounders you suspect (cf. block randomisation, propensity matching, etc.) has limited value when you don’t know all the factors which plausibly ‘swamp’ the likely intervention effect (i.e. you don’t have a good predictive model for the outcome but-for the intervention tested). In any case, these methods tend to trade off against the already scarce resource of sample size.
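To put a rough number on the balance worry in the bullets above, here is a minimal simulation sketch using only the Python standard library. It assumes a single standard-normal confounder and simple 50/50 randomisation of 200 units; the function name `imbalance_rate` and the 0.2 SD threshold are illustrative assumptions, not taken from any real trial.

```python
import random
import statistics


def imbalance_rate(n_units=200, n_trials=2000, threshold=0.2, seed=0):
    """Fraction of simulated randomisations in which the two arms differ
    by more than `threshold` standard deviations on one confounder."""
    rng = random.Random(seed)
    half = n_units // 2
    unbalanced = 0
    for _ in range(n_trials):
        # One standardised confounder value per unit (e.g. wealth, hypothetical).
        covariate = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
        # Random allocation: after shuffling, first half is arm A, the rest arm B.
        rng.shuffle(covariate)
        gap = statistics.fmean(covariate[:half]) - statistics.fmean(covariate[half:])
        if abs(gap) > threshold:
            unbalanced += 1
    return unbalanced / n_trials


if __name__ == "__main__":
    print(f"Share of randomisations with a gap > 0.2 SD: {imbalance_rate():.1%}")
```

Under these assumptions the between-arm gap has a standard deviation of sqrt(2/100), about 0.14, so a gap above 0.2 SD should turn up in roughly one randomisation in six for any single confounder. With several plausible confounders, the chance that at least one is noticeably unbalanced is higher still.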
These ‘small sample’ problems aren’t peculiar to RCTs, but endemic to all other empirical approaches. The wealth of econometric and quasi-experimental methods (e.g. IVs, regression discontinuity analysis) still run up against hard data limits, as well as those owed to whatever respects in which they fall short of the ‘ideal’ RCT set-up (e.g. imperfect instrumentation, omitted variable bias, nagging concerns about reverse causation). Qualitative work (case studies, etc.) has the same problems even if other ones (e.g. selection) loom larger.
None of this means such work has zero value - big enough effect sizes can still be reliably detected, and even underpowered studies still give us information. But we may learn very little on the margin of common sense. Suppose we are interested in ‘what makes social movements succeed or fail?’ and we retrospectively assess a (somehow) representative sample of social movements. It seems plausible that the result of this investigation would be that the big (and plausibly generalisable) hits prove commonsensical (e.g. “Social movements are more likely to grow if members talk to other people about the social movement”), whilst the ‘new lessons’ remain equivocal and uncertain.
We should expect to see this if we believe the distribution of relevant effect sizes is heavy-tailed, with most of the variance in (say) social movement success owed to a small number of factors, with the rest comprised of a large multitude of smaller effects. In such a case, modest increases in information (e.g. from small sample data) may bring even more modest increases in either explaining the outcome or identifying what contributes to it:

Toy example, where we propose a roughly Pareto distribution of effect size among contributory factors. The largest factors (which nonetheless explain a minority of the variance) may prove to be obvious to the naked eye (blue). Adding in the accessible data may only slightly lower the detection threshold, with modest impacts on identifying further factors (green) and overall accuracy. The great bulk of the variance remains in virtue of a large ensemble of small factors which cannot be identified (red). Note that the detection threshold tends to show diminishing returns to sample size.
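The toy picture can also be sketched numerically. The snippet below is a minimal illustration under stated assumptions, not a reconstruction of the figure: factor effects follow a hypothetical power-law rank profile (the k-th largest effect is k**-0.5 SD), and the ‘detection threshold’ is the standard normal-approximation minimal detectable effect for a two-arm comparison at 5% significance and 80% power. All function names and parameter values are illustrative.

```python
from statistics import NormalDist


def minimal_detectable_effect(n_per_arm, alpha=0.05, power=0.8):
    """Smallest standardised effect (in SD units) detectable with the given
    power in a two-arm comparison of means (normal approximation)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * (2 / n_per_arm) ** 0.5


def toy_effects(n_factors=1000, decay=0.5):
    """Hypothetical heavy-tailed profile: the k-th largest factor has an
    effect of k**(-decay) SD, so a few factors are large and most are tiny."""
    return [k ** (-decay) for k in range(1, n_factors + 1)]


def summarise(n_per_arm, effects):
    """Return (threshold, number of factors above it, their variance share)."""
    threshold = minimal_detectable_effect(n_per_arm)
    detected = [e for e in effects if e >= threshold]
    total_variance = sum(e * e for e in effects)
    share = sum(e * e for e in detected) / total_variance
    return threshold, len(detected), share


effects = toy_effects()
for n in (25, 100, 400, 1600):
    threshold, count, share = summarise(n, effects)
    print(f"n per arm={n:5d}  threshold={threshold:.2f} SD  "
          f"factors detected={count:3d}  variance share={share:.0%}")
```

Because the threshold scales with 1/sqrt(n), quadrupling the sample only halves it, which is the diminishing-returns point in the caption. How much extra variance each halving captures depends entirely on the assumed `decay`, so the printed shares should be read as a property of this toy profile rather than of any real domain.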
The scientific revolution for doing good?
The foregoing should not be read as general scepticism about using data. The triumphs of evidence-based medicine, although not unalloyed, have been substantial, and considerable gains remain on the table (e.g. leveraging routine clinical practice). The ‘randomista’ trend in international development is generally one to celebrate, especially as (I understand) it increasingly aims to isolate factors that have credible external validity. The people who run cluster-randomised, stepped-wedge, and other study designs with big units of analysis are not ignorant of their limitations, and can deploy these judiciously.
But it should temper our enthusiasm about how many insights we can glean by getting some data and doing something sciency to it. The early successes of EA in global health owe a lot to this being one of the easier areas to get crisp, intersubjective and legible answers from a wealth of available data. For many to most other issues, data-driven demonstration of ‘what really works’ will never be possible.
We see that people do better than chance (or better than others) in terms of prediction and strategic judgement. Yet, at least judging by the superforecasters (this writeup by AI Impacts is an excellent overview), how they do so is much more indirectly data-driven: one may have to weigh between several facially-relevant ‘base rates’, adjusting these rates by factors where the coefficient may be estimated by their role in loosely analogous cases, and so forth. Although this process may be informed by statistical and numerical literacy (e.g. decomposition, ‘fermi-ization’), it seems to me the main action going on ‘under the hood’ is developing a large (and implicit, and mostly illegible) set of gestalts and impressions to determine how to ‘weigh’ relevant data that is nonetheless fairly remote to the question at issue.
Three final EA takeaways:
- Most who (e.g.) write up a case study or a small-sample analysis tend to be well aware of the limitations of their work. Nonetheless I think it is worth paying more attention to how these bear on the overall value of information before one embarks on these pieces of work. Small nuggets of information may not be worth the time to excavate even when the audience are ideal reasoners. As they aren’t, one risks them (or oneself) over-weighing their value when considering problems which should demand tricky aggregation of a multitude of data sources.
- There can be good reasons why expert communities in some areas haven’t tried to use data explicitly to answer problems in their field. In these cases, the ‘calling card’ of EA-style analysis of doing this anyway can be less of a disruptive breakthrough and more a stigma of intellectual naivete.
- In areas where ‘being driven by the data’ isn’t a huge advantage, it can be hard to identify an ‘edge’ that the EA community has. There are other candidates: investigating topics neglected by existing work, better aligned incentives, etc. We should be sceptical of stories which boil down to a generalised ‘EA exceptionalism’.
Thanks Greg - I really enjoyed this post.
I don't think that this is what you're saying, but I think if someone drew the lesson from your post that, when reality is underpowered, there's no point in doing research into the question, that would be a mistake.
When I look at tiny-n sample sizes for important questions (e.g.: "How have new ideas made major changes to the focus of academic economics?" or "Why have social movements collapsed in the past?"), I generally don't feel at all like I'm trying to get a p<0.05; it feels more like hypothesis generation. So when I find out that Kahneman and Tversky spent 5 years honing the article Prospect Theory into a form that could be published in an economics journal, I think "wow, ok, maybe that's the sort of time investment that we should be thinking of". Or when I see social movements collapse because of in-fighting (e.g. pre-Copenhagen UK climate movement), or romantic disputes between leaders (e.g. Objectivism), then - insofar as we just want to take all the easy wins to mitigate catastrophic risks to the EA community - I know that this risk is something to think about and focus on for EA.
For these sorts of areas, the right approach seems to be granular qualitative research - trying to really understand in depth what happened in some other circumstance, and then think through what lessons that entails for the circumstance you're interested in. I think that, as a matter of fact, EA does this quite a lot when relevant. (E.g. Grace on Szilard, or existing EA discussion of previous social movements). So I think this gives us extra reason to push against the idea that "EA-style analysis" = "quant-y RCT-esque analysis" rather than "whatever research methods are most appropriate to the field at hand". But even on qualitative research I think the "EA mindset" can be quite distinctive - certainly I think, for example, that a Bayesian-heavy approach to historical questions, often addressing counterfactual questions, and looking at those issues that are most interesting from an EA perspective (e.g. how modern-day values would be different if Christianity had never taken off), would be really quite different from almost all existing historical research.
Thanks, Will!
I definitely agree we can look at qualitative data for hypothesis generation (after all, n=1 is still an existence proof). But I'd generally recommend breadth-first rather than depth-first if we're trying to adduce considerations.
For many/most sorts of policy decisions, although we may find a case of X (some factor) --> Y (some desirable outcome), we can probably also find cases of ¬X --> Y and X --> ¬Y. E.g., contrasting with what happened with prospect theory, there are also cases where someone happened on an important breakthrough with much less time/effort, or where people over-committed to an intellectual dead-end (naturally, partisans of X or ¬X tend to be good at cultivating sets of case-studies which facially support the claim it leads to Y).
I generally see getting a steer of the correlation of X and Y (so the relative abundance of (¬/)X --> (¬/)Y across a broad reference class) as more valuable than determining whether in a given case (even one which seems nearby to the problem we're interested in) X really was playing a causal role in driving Y. Problems of selection are formidable, but I take the problems of external validity to tend to be even worse (and worse enough to make the former have a better ratio of insight:resources).
Thus I'd be much more interested to see (e.g.) a wide survey of cases which suggests movements prone to in-fighting tend to be less successful than an in-depth look at how in-fighting caused the destruction of a nearby analogue to the EA community. Ditto the 'macro' in macrohistory being at least partly about trying to adduce takeaways across history, as well as trying to divine its big contours.
And although I think work like this is worthwhile to attempt, I think in some instances we may come to learn that reality is so underpowered that there's essentially no point doing research (e.g. maybe large bits of history are just ultra-chaotic, so all we can ever see is noise).
I agree with your points, but from my perspective they somewhat miss the mark.
Specifically, your discussion seems to assume that we have a fixed, exogenously given set of propositions or factors X, Y, ..., and that our sole task is to establish relations of correlation and causation between them. In this context, I agree on preferring "wide surveys" etc.
However, in fact, doing research also requires the following tasks:
I think that depth can help with these three tasks in ways in which breadth can't.
For instance, in Will's example, my guess is that the main value of considering the history of Objectivism does not come from moving my estimate for the strength of the hypothesis "X = romantic involvement between movement leaders -> Y = movement collapses". Rather, the source of value is including "romantic involvement between movement leaders" into the set of factors I'm considering in the first place. Only then am I able to investigate its relation to outcomes of interest, whether by a "wide survey of cases" or otherwise. Moreover, I might only have learned about the potential relevance of "romantic involvement between movement leaders" by looking at some depth into the history of Objectivism. (I know very little about Objectivism, and so don't know if this is true in this instance; it's certainly possible that the issue of romantic involvement between Objectivist leaders is so well known that it would be mentioned in any 5-sentence summary one would encounter during a breadth-first process. But it also seems possible that it's not, and I'm sure I could come up with examples where the interesting factor was buried deeply.)
My model here squares well with your observation that a "common feature among superforecasters is they read a lot", and in fact makes a more specific prediction: I expect that we'd find that superforecasters read a fair amount (say, >10% of their total reading) of deep, small-n case studies - for example, historical accounts of a single war, economic policy, or biographies.
[My guess is that my comment is largely just restating Will's points from his above comment in other words.]
(FWIW, I think some generators of my overall model here are: