(Content warning: this post mentions a question from the 2024 EA Survey. If you haven't answered it yet and plan to do so, please do that first before reading on)


The 2024 EA Survey asks people which of the following interventions they prefer:

  1. An intervention that averts 1,000 DALYs with 100% probability
  2. An intervention that averts 100,000 DALYs with 1.5% probability

This is a simple question in theory: (2) has 50% more expected value.
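In code, the comparison looks like this (a minimal sketch that simply takes the survey's numbers at face value):

```python
# Expected DALYs averted, taking the survey's numbers at face value.
ev_certain = 1.00 * 1_000      # intervention (1): 1,000 DALYs with 100% probability
ev_longshot = 0.015 * 100_000  # intervention (2): 100,000 DALYs with 1.5% probability

print(ev_certain)                    # 1000.0
print(ev_longshot)                   # 1500.0
print(ev_longshot / ev_certain - 1)  # 0.5, i.e. 50% more expected value
```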

In practice, I believe this is an absurd premise, the kind that never happens in real life. How would you know that the probability that an intervention works is 1.5%?

My rule of thumb is that most real-world probabilities could be off by a percentage point or so. Note that this is an absolute error of a full percentage point, not a relative error of 1%. For the survey question, it might well be that intervention (1)'s success rate is only 99%, while intervention (2)'s could lie anywhere in the low single-digit percentages.

I don't have a good justification for this rule of thumb[1]. Part of it is probably psychological: humans are most familiar with concepts like "rare". We occasionally use percentages but rarely (no pun intended) use permilles or smaller units. Part of it is technical: small probabilities are harder to measure directly, so they are usually derived from a model. The model is imperfect, and its inputs are likely to be imprecise.

For intervention (1), my rule of thumb does not have a large effect on the overall impact. For intervention (2), the effect is very large[2]. This is what makes the survey question so hard to answer, and the answers so hard to interpret.
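To make this concrete, here is a small sketch of what the rule of thumb does to each intervention (the one-percentage-point shift is my assumption, not part of the survey):

```python
# Shift each success probability by one percentage point and see how the
# expected DALYs averted move. The shift itself is my rule of thumb, not data.
def expected_dalys(p, dalys):
    return p * dalys

# Intervention (1): a one-point error barely matters.
print(expected_dalys(1.00, 1_000))  # 1000.0
print(expected_dalys(0.99, 1_000))  # 990.0 (a 1% change)

# Intervention (2): the same absolute error swings the answer completely.
print(expected_dalys(0.005, 100_000))  # 500.0
print(expected_dalys(0.015, 100_000))  # 1500.0
print(expected_dalys(0.025, 100_000))  # 2500.0
```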

There are, of course, established ways to deal with this mathematically. For example, one could use a portfolio approach that allocates some fraction of resources to intervention (2). Such strategies are valuable, even necessary, to deal with this type of question. As a survey respondent, I felt frustrated with having just two options. I feel that the survey question creates a false sense of "all you need is expected value"; it asks for a black-and-white answer where the reality has lots of shades.[3]
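To illustrate the portfolio idea, here is a rough sketch. It assumes, unlike the survey question, that both interventions scale linearly with the fraction of the budget they receive, so the numbers are only indicative:

```python
# Put a fraction f of the budget into intervention (2) and the rest into (1),
# assuming (hypothetically) that both scale linearly with funding.
def portfolio(f, p2=0.015):
    ev = (1 - f) * 1_000 + f * p2 * 100_000
    sd = f * (p2 * (1 - p2)) ** 0.5 * 100_000  # intervention (1) adds no variance
    return ev, sd

for f in (0.0, 0.25, 0.5, 1.0):
    ev, sd = portfolio(f)
    print(f"f = {f:.2f}: expected DALYs averted {ev:6.0f}, std. dev. {sd:7.0f}")
```

The expected value interpolates between 1,000 and 1,500 averted DALYs, while the spread of outcomes grows with the fraction allocated to the long shot.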

My recommendation and plea: Please communicate humbly, especially when using very low probabilities. Consider that all your numbers, and low probabilities especially, might be inaccurate. When designing thought experiments, keep them as realistic as possible, so that they elicit better answers. This reduces misunderstandings, pitfalls, and potentially compounding errors. It produces better communication overall.


  1. I welcome pointers to research about this! ↩︎
  2. The effect is large, in the sense that the expected intervention value could be 500 DALYs or 2500 DALYs. However, the expected expected intervention value does not change if we just add symmetric error margins. ↩︎
  3. Caveat: I don't know what the survey question was intended to measure. It might well be a good question, given its goal. ↩︎

Comments (18)

Clearly you believe that probabilities can reliably be less than 1%. Your probability of being struck by lightning today is not "0% or maybe 1%", it's on the order of 0.001%. Your probability of winning the lottery is not "0% or 1%", it's ~0.0000001%. I am confident you deal with probabilities that have much less than 1% error all the time, and feel comfortable using them.

It doesn't make sense to think of humility as something absolute like "don't give highly specific probabilities". You frequently have justified belief in a very specific probability (the probability that random.org's random number generator will return "2" when asked for a random number between 1 and 10 is exactly 10%, not 11%, not 9%, exactly 10%, with very little uncertainty about that number).

This is a great point.

Clearly you are right. That said, the examples that you give are the kind of frequentist probabilities for which one can actually measure rates. This is quite different from the probability given in the survey, which presumably comes from an imperfect Bayesian model with imprecise inputs.

I also don't want to belabor the point... but I'm pretty sure my probability of being struck by lightning today is far from 0.001%. Given where I live and today's weather, it could be a few orders of magnitude lower. If I use your unadjusted probability (10 micromorts) and am willing to spend $25 to avert a micromort, I would conclude that I should invest $250 in lightning protection today... that seems like the kind of wrong conclusion that my post warns about.

I think humility is useful in cases like the present survey question, when a specific low probability, derived from an imperfect model, can change the entire conclusion. There are many computations where the outcome is fairly robust to small absolute estimation errors (e.g., intervention (1) in the question). On the other hand, for computations that depend on a low probability with high sensitivity, we should be extra careful about that probability.

Richard Chappell writes something similar here, better than I could. Thanks Lizka for linking to that post!

Pascalian probabilities are instead (I propose) ones that lack robust epistemic support. They're more or less made up, and could easily be "off" by many, many orders of magnitude. Per Holden Karnofsky's argument in 'Why we can't take explicit expected value estimates literally', Bayesian adjustments would plausibly mandate massively discounting these non-robust initial estimates (roughly in proportion to their claims to massive impact), leading to low adjusted expected value after all.

Maybe I should have titled this post differently, for example "Beware of non-robust probability estimates multiplied by large numbers".

When I answered this question, I answered it with an implied premise that an EA org is making these claims about the possibilities, and went for number 1, because I don't trust EA orgs to be accurate in their "1.5%" probability estimates, and I expect these to be more likely overestimates than underestimates.

As a datapoint: despite (already) agreeing to a large extent with this post,[1] IIRC I answered the question assuming that I do trust the premise. 


Despite my agreement, I do think there are certain kinds of situations in which we can reasonably use small probabilities. (Related post: Most* small probabilities aren't pascalian, and maybe also related.) 


More generally: I remember appreciating some discussion on the kinds of thought experiments that are useful, when, etc. I can't find it quickly, but possible starting points could be this LW post, Least Convenient Possible World, maybe this post from Richard, and stuff about fictional evidence

Writing quickly based on a skim, sorry for lack of clarity/misinterpretations! 

  1. ^

    My view is roughly something like: 

    at least in the most obviously analogous situations, it's very rare that we can properly tell the difference between 1.5% and 0.15% (and so the premise is somewhat absurd)

My intuitive reaction to this is "Way to screw up a survey." 

Considering that three people agree-voted your post, I realize I should probably come away from this with a very different takeaway, more like "oops, survey designers need to put in extra effort if they want to get accurate results, and I would've totally fallen for this pitfall myself."

Still, I struggle with understanding your and the OP's point of view. My reaction to the original post was something like:

Why would this matter? If the estimate could be off by 1 percentage point, it could be down to 0.5% or up to 2.5%, which is still 1.5% in expectation. Also, if this question's intention were about the likelihood of EA orgs being biased, surely they would've asked much more directly about how much respondents trust an estimate of some example EA org.

We seem to disagree on use of thought experiments. The OP writes:

When designing thought experiments, keep them as realistic as possible, so that they elicit better answers. This reduces misunderstandings, pitfalls, and potentially compounding errors. It produces better communication overall.

I don't think this is necessary and I could even see it backfiring. If someone goes out of their way to make a thought experiment particularly realistic, respondents might get the impression that it is asking about a real-world situation where they are invited to bring in all kinds of potentially confounding considerations. But that would defeat the point of the thought experiment (e.g., people might answer based on how much they trust the modesty of EA orgs, as opposed to giving you their personal tolerance for the risk of having had no effect or wasted money in hindsight). The way I see it, the whole point of thought experiments is to get ourselves to think very carefully and cleanly about the principles we find most important. We do this by getting rid of all the potentially confounding variables. See here for a longer explanation of this view.

Maybe future surveys should have a test to figure out how people understand the use of thought experiments. Then, we could split responses between people who were trying to play the thought experiment game the intended way, and people who were refusing to play (i.e., questioning premises and adding further assumptions).

*On some occasions, it makes sense to question the applicability of a thought experiment. For instance, in the classical "what if you're a doctor who has the opportunity to kill a healthy patient during a routine check-up so that you could save the lives of 4 people needing urgent organ transplants," it makes little sense to just go "all else is equal! Let's abstract away all other societal considerations or the effect on the doctor's moral character."
So, if I were to write a post on thought experiments today, I would add something about the importance of re-contextualizing lessons learned within a thought experiment to the nuances of real-world situations. In short, I think my formula would be something like, "decouple within thought experiments, but make sure to add an extra thinking step from 'answers inside a thought experiment' to 'what can we draw from this in terms of real-life applications.'" (Credit to Kaj Sotala, who once articulated a similar point, probably in a better way.)

I agree that our different reactions come partly from having different intuitions about the boundaries of a thought experiment. Which factors should one include vs exclude when evaluating answers?

For me, I assumed that the question can't be just about expected values. This seemed too trivial. For simple questions like that, it would be clearer to ask the question directly (e.g., "Are you in favor of high-risk interventions with large expected rewards?") than to use a thought experiment. So I concluded that the thought experiment probably goes a bit further.

If it goes further, there are many factors that might come into play:

  • How certain are we of the numbers?
  • Are there any negative effects if the intervention fails? These could be direct negative outcomes, but also indirect ones like difficulty raising funds in the future, loss of reputation...
  • Are we allocating a small part of a budget, or our total money? Is this a repeated decision or a one-off?

I had no good answers, and no good guesses about the question's intent. Maybe this is clearer for you, given that you mention "the way EA culture has handled thought experiments thus far" in a comment below. I, for one, decided to skip the question :/

Feels like taking into account the likelihood that the "1.5% probability of 100,000 DALYs averted" estimate is a credence based on some marginally-relevant base rate[1] that might have been chosen with a significant bias towards optimism is very much in keeping with the spirit of the question (which presumably is about gauging attitudes towards uncertainty, not testing basic EV calculation skills)[2]

A very low percentage chance of averting a lot of DALYs feels a lot more like "1.5% of clinical trials of therapies for X succeeded; this untested idea might also have a 1.5% chance" optimism attached to a proposal offering little reason to believe it's above average, rather than an estimate based on somewhat robust statistics (we inferred that 1.5% of people who receive this drug will be cured from the 1.5% of people who had that outcome in trials). So it seems quite reasonable to assume that the estimated 1.5% chance of a positive binary outcome might be biased upwards. Even more so in the context of "we acknowledge this is a long shot and high-certainty solutions to other pressing problems exist, but if the chance of this making an impact were as high as 0.0x%..." style fundraising appeals to EAs' determination to avoid scope insensitivity.

  1. ^

    either that or someone's been remarkably precise in their subjective estimates or collected some unusual type of empirical data. I certainly can't imagine reaching the conclusion that an option has exactly a 1.5% chance of averting 100k DALYs myself

  2. ^

    if you want to show off you understand EV and risk estimation you'd answer (C) "here's how I'd construct my portfolio" anyway :-) 

If we're considering realistic scenarios instead of staying with the spirit of the thought experiment, then I agree that an advertised 1.5% chance of having a huge impact could be more likely upwards-biased than the other way around. (But it depends on who's doing the estimate – some people are actually well-calibrated or prone to be extra modest.) That said, I think we should not consider realistic scenarios here, partly precisely because it introduces lots of possible ambiguities in how people interpret the question, and partly because this probably isn't what the surveyors intended, given the way EA culture has handled thought experiments thus far – see for instance the links in Lizka's answer, or the way EA draws heavily from analytic philosophy, where straightforwardly engaging with unrealistic thought experiments is a standard component of the toolkit.

[...] is very much in keeping with the spirit of the question (which presumably is about gauging attitudes towards uncertainty, not testing basic EV calculation skills

(1) What you described seems to me best characterized as being about trust. Trust in others' risk estimates. That would be separate from attitudes about uncertainty (and if that's what the surveyors wanted to elicit, they'd probably have asked the question very differently).

(Or maybe what you're thinking about could be someone having radical doubts about the entire epistemology behind "low probabilities"? I'm picturing a position that goes something like, "it's philosophically impossible to reason sanely about low probabilities; besides, when we make mistakes, we'll almost always overestimate rather than underestimate our ability to have effects on the world." Maybe that's what you think people are thinking – but as an absolute, this would seem weirdly detailed and radical to me, and I feel like there's a prudential wager against believing that our reasoning is doomed from the start in a way that would prohibit everyone from pursuing ambitious plans.)

(2) What I meant wasn't about basic EV calculation skills (obviously) – I didn't mean to suggest that just because the EV of the low-probability intervention is greater than the EV of the certain intervention, it's a no-brainer that it should be taken. I was just saying that the OP's point about probabilities maybe being off by one percentage point, by itself, without some allegation of systematic bias in the measurement, doesn't change the nature of the question. There's still the further question of whether we want to bring in other considerations besides EV. (I think "attitudes towards uncertainty" fits well here as a title, but again, I would reserve it for the thing I'm describing, which is clearly different from "do you think other people/orgs within EA are going to be optimistically biased?".)

(Note that it's one question whether people would go by EV for cases that are well within the bounds of numbers of people that exist currently on earth. I think it becomes a separate question when you go further to extremes, like whether people would continue gambling in the St Petersburg paradox or how they relate to claims about vastly larger realms than anything we understand to be in current physics, the way Pascal's mugging postulates.)

Finally, I realize that maybe the other people here in the thread have so little trust in the survey designers that they're worried that, if they answer with the low-probability, higher-EV option, the survey designers will write takeaways like "more EAs are in favor of donating to speculative AI risk interventions." I agree that, if you think survey designers will make too strong of an update from your answers to a thought experiment, you should point out all the ways that you're not automatically endorsing their preferred option. But I feel like the EA survey already has lots of practical questions along the lines of "Where do you actually donate to?" So, it feels unlikely that this question is trying to trick respondents or that the survey designers will just generally draw takeaways from this that aren't warranted?

I realize that maybe the other people here in the thread have so little trust in the survey designers that they're worried that, if they answer with the low-probability, higher-EV option, the survey designers will write takeaways like "more EAs are in favor of donating to speculative AI risk."

I'm one of the people who agreed with @titotal's comment, and it was because of something like this.

It's not that I'm worried per se that the survey designers will write a takeaway that puts a spin on this question (last time they just reported it neutrally). It's more that I expect this question[1] to be taken by other orgs/people as a proxy metric for the EA community's support for hits-based interventions. And because of the practicalities of how information is acted on, the subtlety of the wording of the question might be lost in the process (e.g. in an organisation someone might raise the issue at some point, but it would eventually end up as a number in a spreadsheet or BOTEC, and there is no principled way to adjust for the issue that titotal describes).

  1. ^

    And one other about supporting low-probability/high-impact interventions

That makes sense; I understand that concern.

I wonder if, next time, the survey makers could write something to reassure us that they're not going to be using any results out of context or with an unwarranted spin (esp. in cases like the one here, where the question is related to a big 'divide' within EA, but worded as an abstract thought experiment.)

Thanks for the thoughtful response.

On (1), I'm not really sure the uncertainty and the trust in the estimate are separable. A probability estimate of a nonrecurring event[1] fundamentally is a label someone[2] applies to how confident they are something will happen. A corollary of this is that you should probably take into account how probability estimates could have actually been reached, your trust in that reasoning, and the likelihood of bias when deciding how to act. [3]

On (2), I agree with your comments about the OP's point; if the probabilities are ±1 percentage point with the error symmetrically distributed, they're still on average 1.5%[4], though in some circumstances introducing error bars might affect how you handle risk. But as I've said, I don't think the distribution of errors looks like this when it comes to assessing whether long shots are worth pursuing or not (not even under the assumption of good faith). I'd be pretty worried if hits-based grant-makers didn't, frankly, and this question puts me in their shoes.

Your point about analytic philosophy often expecting literal answers to slightly weird hypotheticals is a good one. But EA isn't just analytic philosophy and St Petersburg paradoxes; it's also people literally coming up with best guesses of probabilities of things they think might work and multiplying them (and a whole subculture based on that, and guesstimating just how impactful "crazy train" long-shot ideas they're curious about might be). So I think it's pretty reasonable to treat it not as a slightly daft hypothetical where a 1.5% probability is an empirical reality,[5] but as a real-world grant-award decision scenario where the "1.5% probability" is a suspiciously precise credence, and you've got to decide whether to trust it enough to fund it over something that definitely works. In that situation, I think I'm discounting the estimated chance of success of the long shot by more than 50%.
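To make that concrete, here's a rough sketch with made-up discount factors (just to show where the comparison flips, not anyone's actual estimates):

```python
# How shaving down the claimed 1.5% changes the comparison with the certain
# 1,000-DALY option. The discount factors below are purely illustrative.
claimed_p = 0.015
for discount in (0.0, 1 / 3, 0.5, 0.75):
    p = claimed_p * (1 - discount)
    ev = p * 100_000
    print(f"discount {discount:.0%}: adjusted p = {p:.3%}, EV = {ev:.0f} DALYs averted")
# Break-even with the certain option (1,000 DALYs) is at a one-third discount;
# anything beyond 50% clearly favours the sure thing.
```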

FWIW, I don't take the question as evidence that the survey designers are biased in any way.

  1. ^

    "this will either avert 100,000 DALYs or have no effect" doesn't feel like a proposition based on well-evidenced statistical regularities...

  2. ^

    not me. Or at least a "1.5%" chance of working for thousands of people and implicitly a 98.5% chance of having no effect on anyone certainly doesn't feel like the sort of degree of precision I'd estimate to...

  3. ^

    Whilst it's an unintended consequence of how the question was framed, this example feels particularly fishy. We're asked to contemplate trading off something that certainly will work against something potentially higher-yielding that is highly unlikely to work, and yet the thing that is highly unlikely to work turns out to have the higher EV because someone has speculated on its likelihood to a very high degree of precision, and those extra 5 thousandths made all the difference. What's the chance the latter estimate is completely bogus or finessed to favour the latter option? I'd say in real-world scenarios (and certainly not just EA scenarios) it's quite a bit more than 5 in 1000....

  4. ^

    that one's a math test too ;-)

  5. ^

    maybe a universe where physics is a god with an RNG...

Thanks for the reply, and sorry for the wall of text I'm posting now (no need to reply further, this is probably too much text for this sort of discussion)...

I agree that uncertainty is in someone's mind rather than out there in the world. Still, granting the accuracy of probability estimates feels no different from granting the accuracy of factual assumptions. Say I was interested in eliciting people's welfare tradeoffs between chicken sentience and cow sentience in the context of eating meat (how that translates into suffering caused per calorie of meat). Even if we lived in a world where false-labelling of meat was super common (such that, say, when you buy things labelled as 'cow', you might half the time get tuna, and when you buy chicken, you might half the time get ostrich), if I'm asking specifically for people's estimates on the moral disvalue from chicken calories vs cow calories, it would be strange if survey respondents factored in information about tunas and ostriches. Surely, if I was also interested in how people thought about calories from tunas and ostriches, I'd be asking about those animals too!

Also, circumstances about the labelling of meat products can change over time, so that previously elicited estimates on "chicken/cow-labelled things" would now be off. Survey results will be more timeless if we don't contaminate straightforward thought experiments with confounding empirical considerations that weren't part of the question.

A respondent might mention Kant and how all our knowledge about the world is indirect, how there's trust involved in taking assumptions for granted. That's accurate, but let's just take them for granted anyway and move on?

On whether "1.5%" is too precise of an estimate for contexts where we don't have extensive data: If we grant that thought experiments can be arbitrarily outlandish, then it doesn't really matter.

Still, I could imagine that you'd change your mind about never using these estimates if you thought more about situations where they might become relevant. For instance, I used estimates in that area (roughly around 1.5% chance of something happening) several times within the last two years:

My wife developed lupus a few years ago, which is the illness that often makes it onto the whiteboard in the show Dr House because it can throw up symptoms that mimic tons of other diseases, sometimes serious ones. We had a bunch of health scares where we were thinking "this is most likely just some weird lupus-related symptom that isn't actually dangerous, but it also resembles that other thing (which is also a common secondary complication from lupus or its medications), which would be a true emergency."

In these situations, should we go to the ER for a check-up or not? With a 4-5h average A&E waiting time and the chance of catching viral illnesses while there (which are extra bad when you already have lupus), it probably doesn't make sense to go in if we think the chance of a true emergency is only <0.5%. However, at 2% or higher, we'd for sure want to go in. (In between those two, we'd probably continue to feel stressed and undecided, and maybe go in primarily for peace of mind, lol.)

Narrowing things down from "most likely it's nothing, but some small chance that it's bad!" to either "I'm confident this is <0.5%" or "I'm confident this is at least 2%" is not easy, but it worked in some instances. This suggests some usefulness (as a matter of the practical necessity of making medical decisions in a context of long A&E waiting times) to making decisions based on a fairly narrowed-down low-probability estimate. Sure, the process I described is still a bit more fuzzy than just pulling a 1.5% point estimate from somewhere, but I feel like it approaches the level of precision needed to narrow things down that much, and I think many other people would have similar decision thresholds in a situation like ours.

Admittedly, medical contexts are better studied than charity contexts, and especially influencing-the-distant-future charity contexts. So, it makes sense if you're especially skeptical of that level of precision in charitable contexts. (And I indeed agree with this; I'm not defending that level of precision in practice for EA charities!) Still, like habryka pointed out in another comment, I don't think there's a red line where fundamental changes happen as probabilities get lower and lower. The world isn't inherently frequentist, but we can often find plausibly-relevant base rates. Admittedly, there's always some subjectivity, some art, in choosing relevant base rates, assessing additional risk factors, making judgment calls about "how much is this symptom a match?". But if you find the right context for it (meaning: a context where you're justifiably anchoring to some very low-probability base rate), you can get well below the 0.5% level for practically-relevant decisions (and maybe make proportional upwards or downwards adjustments from there). For these reasons, it doesn't strike me as totally outlandish that some group will at some point come up with a ranged very-low-probability estimate of averting some risk (like asteroid risk or whatever), while being well-calibrated. I'm not saying I have a concrete example in mind, but I wouldn't rule it out.

OP here :) Thanks for the interesting discussion that the two of you have had!

Lukas_Gloor, I think we agree on most points. Your example of estimating a low probability of medical emergency is great! And I reckon that you are communicating appropriately about it. You're probably telling your doctor something like "we came because we couldn't rule out complication X" and not "we came because X has a probability of 2%" ;-)

You also seem to be well aware of the uncertainty. Your situation does not feel like one where you went to the ER 50 times, were sent home 49 times, and have from this developed a good calibration. It looks more like a situation where you know about danger signs which could be caused by emergencies, and have some rules like "if we see A and B and not C, we need to go to the ER".[1]

Your situation and my post both involve low probabilities in high-stakes situations. That said, the goal of my post is to remind people that this type of probability is often uncertain, and that they should communicate this with the appropriate humility.


  1. That's how I would think about it, at least... it might well be that you're more rational than I, and use probabilities more explicitly. ↩︎

I'm more concerned that the actual survey language is "avert" not "save" - and obviously, we shouldn't do any projects which save DALYs.

DALYs, unlike QALYs, are a negative measure. You don't want to increase the number of DALYs.

Sorry for having been imprecise in my post -- I wrote the question from memory after having already submitted the survey. I'll change it to "avert".

Executive summary: When dealing with interventions that have very low probability but high impact, we should be cautious about precise probability estimates since they could easily be off by a percentage point, significantly affecting expected value calculations.

Key points:

  1. Real-world probability estimates, especially small ones, are likely to be imprecise by about one percentage point, making expected value calculations less reliable
  2. For high-impact, low-probability interventions, this uncertainty can dramatically affect the expected value (e.g., 1.5% ± 1 percentage point could mean anywhere from 500 to 2,500 expected DALYs averted)
  3. Binary choices between interventions with very different probability profiles (like in the EA Survey) may oversimplify decision-making
  4. Practical recommendation: Use portfolio approaches to handle uncertainty, and communicate probability estimates with appropriate humility
  5. When designing surveys or thought experiments, maintain realism to elicit more meaningful responses

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
