(Content warning: this post mentions a question from the 2024 EA Survey. If you haven't answered it yet and plan to do so, please do that first before reading on)
The 2024 EA Survey asks people which of the following interventions they prefer:
1. An intervention that averts 1,000 DALYs with 100% probability
2. An intervention that averts 100,000 DALYs with 1.5% probability
In theory, this is a simple question: intervention (2) has 50% more expected value.
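For concreteness, a quick check of the arithmetic, taking the survey's numbers at face value (the variable names are just for illustration):

```python
# Expected DALYs averted, taking the survey's numbers at face value
ev_certain  = 1_000 * 100 / 100      # intervention (1): 1,000
ev_longshot = 100_000 * 1.5 / 100    # intervention (2): 1,500, i.e. 50% more
```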
In practice, I believe the premise is absurd, the kind that never occurs in real life. How would you ever know that an intervention's probability of working is exactly 1.5%?
My rule of thumb is that most real-world probabilities could be off by a percentage point or so. Note that this is an absolute error, not a relative one: not that the estimate is 1% too high or too low, but that it is off by an entire percentage point. For the survey question, it might well be that intervention (1)'s success rate is only 99%, and intervention (2)'s success rate could lie anywhere from roughly 0.5% to 2.5%.
I don't have a good justification for this rule of thumb[1]. Part of it is probably psychological: humans are most familiar with coarse concepts like "rare". We occasionally use percentages but rarely (no pun intended) use permilles or smaller units. Part of it is technical: small probabilities are hard to measure directly, so they are usually derived from a model. The model is imperfect, and the model inputs are likely to be imprecise.
For intervention (1), my rule of thumb does not have a large effect on the overall impact. For intervention (2), the effect is very large[2]. This is what makes the survey question so hard to answer, and the answers so hard to interpret.
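To make the numbers behind footnote [2] concrete, here is a small sketch of the rule of thumb applied to both interventions; the ±1 percentage point margin is my own assumption from above, not something the survey states:

```python
# Expected DALYs averted if the stated success probability (in percent)
# could be off by a full percentage point in either direction.
def ev_range(dalys, pct, err_pct=1.0):
    lo_pct = max(pct - err_pct, 0.0)
    hi_pct = min(pct + err_pct, 100.0)
    return dalys * lo_pct / 100, dalys * hi_pct / 100

print(ev_range(1_000, 100.0))   # intervention (1): (990.0, 1000.0)
print(ev_range(100_000, 1.5))   # intervention (2): (500.0, 2500.0)
```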
There are, of course, established ways to deal with this mathematically. For example, one could use a portfolio approach that allocates some fraction of resources to intervention (2). Such strategies are valuable, even necessary, for dealing with this type of question. As a survey respondent, I felt frustrated by having just two options. I feel that the survey question creates a false sense of "all you need is expected value"; it asks for a black-and-white answer where the reality has many shades.[3]
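As a rough sketch of the portfolio idea (my own illustration: the 50/50 split is arbitrary, and I assume impact scales linearly with the share of resources):

```python
# Split a fixed budget between the two interventions and see how the
# uncertainty about intervention (2)'s success probability plays out.
def portfolio_ev(share_2, p2_pct):
    ev_1 = (1 - share_2) * 1_000                 # certain intervention, scaled by its share
    ev_2 = share_2 * 100_000 * p2_pct / 100      # long-shot intervention, scaled by its share
    return ev_1 + ev_2

for p2_pct in (0.5, 1.5, 2.5):                   # pessimistic / face value / optimistic
    print(p2_pct, portfolio_ev(0.5, p2_pct))
# A 50/50 split ranges from 750 to 1,750 expected DALYs,
# versus 500 to 2,500 for going all in on intervention (2).
```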
My recommendation and plea: Please communicate humbly, especially when using very low probabilities. Consider that all your numbers, and low probabilities especially, might be inaccurate. When designing thought experiments, keep them as realistic as possible so that they elicit better answers. This reduces misunderstandings, pitfalls, and potentially compounding errors. It produces better communication overall.
1. I welcome pointers to research about this!
2. The effect is large, in the sense that the expected intervention value could be 500 DALYs or 2,500 DALYs. However, the expected expected intervention value does not change if we just add symmetric error margins.
3. Caveat: I don't know what the survey question was intended to measure. It might well be a good question, given its goal.
Thanks for the reply, and sorry for the wall of text I'm posting now (no need to reply further, this is probably too much text for this sort of discussion)...
I agree that uncertainty is in someone's mind rather than out there in the world. Still, granting the accuracy of probability estimates feels no different from granting the accuracy of factual assumptions. Say I'm interested in eliciting people's welfare tradeoffs between chicken sentience and cow sentience in the context of eating meat (how that translates into suffering caused per calorie of meat). Even if we lived in a world where false labelling of meat was super common (such that, say, when you buy things labelled as 'cow', you might half the time get tuna, and when you buy chicken, you might half the time get ostrich), if I'm asking specifically for people's estimates of the moral disvalue of chicken calories vs cow calories, it would be strange if survey respondents factored in information about tunas and ostriches. Surely, if I were also interested in how people thought about calories from tunas and ostriches, I'd be asking about those animals too!
Also, circumstances about the labelling of meat products can change over time, so that previously elicited estimates on "chicken/cow-labelled things" would now be off. Survey results will be more timeless if we don't contaminate straightforward thought experiments with confounding empirical considerations that weren't part of the question.
A respondent might mention Kant and how all our knowledge about the world is indirect, how there's trust involved in taking assumptions for granted. That's accurate, but let's just take the assumptions for granted anyway and move on?
On whether "1.5%" is too precise of an estimate for contexts where we don't have extensive data: If we grant that thought experiments can be arbitrarily outlandish, then it doesn't really matter.
Still, I could imagine that you'd change your mind about never using such estimates if you thought more about situations where they might become relevant. For instance, I've used estimates in that range (roughly a 1.5% chance of something happening) several times within the last two years:
My wife developed lupus a few years ago, which is the illness that often makes it onto the whiteboard in the show Dr House because it can throw up symptoms that mimic tons of other diseases, sometimes serious ones. We had a bunch of health scares where we were thinking "this is most likely just some weird lupus-related symptom that isn't actually dangerous, but it also resembles that other thing (which is also a common secondary complication of lupus or its medications), which would be a true emergency." In these situations, should we go to the ER for a check-up or not? With a 4-5h average A&E waiting time and the chance of catching viral illnesses while there (which are extra bad when you already have lupus), it probably doesn't make sense to go in if we think the chance of a true emergency is only <0.5%. However, at 2% or higher, we'd for sure want to go in. (In between those two, we'd probably continue to feel stressed and undecided, and maybe go in primarily for peace of mind, lol.)

Narrowing things down from "most likely it's nothing, but there's some small chance that it's bad!" to either "I'm confident this is <0.5%" or "I'm confident this is at least 2%" is not easy, but it worked in some instances. This suggests there is some usefulness (as a matter of practical necessity, given long A&E waiting times) to making medical decisions based on a fairly narrowed-down low-probability estimate. Sure, the process I described is still a bit fuzzier than just pulling a 1.5% point estimate from somewhere, but I feel it approaches the level of precision needed to narrow things down that much, and I think many other people would have similar decision thresholds in a situation like ours.
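One way to read those thresholds is through a simple expected-cost lens. This is my framing after the fact, and the cost ratios are only what the thresholds imply, not numbers we ever wrote down:

```python
# Simple expected-cost reading of the decision: go to the ER when
#   p_emergency * cost_of_missing_an_emergency > cost_of_the_visit,
# so the break-even probability equals cost_of_visit / cost_of_missing.
# The ratios below are implied by the thresholds in the story, nothing more.
for threshold in (0.005, 0.02):
    implied_ratio = 1 / threshold    # how much worse a missed emergency is implicitly judged to be
    print(f"threshold {threshold:.1%} -> missed emergency ~{implied_ratio:.0f}x worse than an ER visit")
# threshold 0.5% -> ~200x; threshold 2.0% -> ~50x
```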
Admittedly, medical contexts are better studied than charity contexts, and especially influencing-the-distant-future charity contexts. So it makes sense if you're especially skeptical of that level of precision in charitable contexts. (And I indeed agree with this; I'm not defending that level of precision in practice for EA charities!) Still, as habryka pointed out in another comment, I don't think there's a red line where fundamental changes happen as probabilities get lower and lower. The world isn't inherently frequentist, but we can often find plausibly relevant base rates. Admittedly, there's always some subjectivity, some art, in choosing relevant base rates, assessing additional risk factors, and making judgment calls about "how much is this symptom a match?" But if you find the right context for it (meaning: a context where you're justifiably anchoring to some very low-probability base rate), you can get well below the 0.5% level for practically relevant decisions (and maybe make proportional upward or downward adjustments from there). For these reasons, it doesn't strike me as totally outlandish that some group will at some point come up with a ranged very-low-probability estimate of averting some risk (like asteroid risk or whatever) while being well-calibrated. I'm not saying I have a concrete example in mind, but I wouldn't rule it out.