This is a linkpost for https://markxu.com/strong-evidence

Portions of this are taken directly from Three Things I've Learned About Bayes' Rule.

One time, someone asked me what my name was. I said, “Mark Xu.” Afterward, they probably believed my name was “Mark Xu.” I’m guessing they would have happily accepted a bet at 20:1 odds that my driver’s license would say “Mark Xu” on it.

The prior odds that someone’s name is “Mark Xu” are generously 1:1,000,000. Posterior odds of 20:1 implies that the odds ratio of me saying “Mark Xu” is 20,000,000:1, or roughly 24 bits of evidence. That’s a lot of evidence.
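
As a sanity check, the conversion from prior and posterior odds to a likelihood ratio and bits of evidence is mechanical. Here is a minimal sketch (the `evidence_bits` helper is just for exposition; the odds are the post's illustrative guesses):

```python
import math

def evidence_bits(prior_odds, posterior_odds):
    """Likelihood ratio and bits of evidence implied by moving from
    prior_odds to posterior_odds (both expressed as for:against ratios)."""
    likelihood_ratio = posterior_odds / prior_odds
    return likelihood_ratio, math.log2(likelihood_ratio)

# Prior 1:1,000,000 that someone's name is "Mark Xu";
# posterior 20:1 after they say "Mark Xu".
lr, bits = evidence_bits(prior_odds=1 / 1_000_000, posterior_odds=20 / 1)
print(f"{lr:,.0f}:1 likelihood ratio, {bits:.1f} bits")  # 20,000,000:1, 24.3 bits
```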

Seeing a Wikipedia page say “X is the capital of Y” is tremendous evidence that X is the capital of Y. Someone telling you “I can juggle” is massive evidence that they can juggle. Putting an expression into Mathematica and getting Z is enormous evidence that the expression evaluates to Z. Vast odds ratios lurk behind many encounters.

One implication of the Efficient Market Hypothesis (EMH) is that it is difficult to make money on the stock market. Generously, maybe only the top 1% of traders will be profitable. How difficult is it to get into the top 1% of traders? To be 50% sure you're in the top 1%, you only need about 100:1 evidence (prior odds of 1:99 moved to even odds). This seemingly large odds ratio might be easy to get.

On average, people are overconfident, but 12% aren't. It only takes 50:1 evidence to conclude you are much less overconfident than average (prior odds of 12:88 combined with a 50:1 likelihood ratio give posterior odds of about 7:1, i.e. roughly 87%). An hour or so of calibration training and the resulting calibration plots might be enough.

Running through Bayes’ Rule explicitly might produce a bias towards middling values. Extraordinary claims require extraordinary evidence, but extraordinary evidence might be more common than you think.

Comments



I think in the real world there are many situations where (if we were to put explicit Bayesian probabilities on such beliefs, which we almost never do) beliefs with ex ante ~0 credence quickly get extraordinary updates. My favorite example is sense perception. If I woke up after sleeping on a bus and were to put explicit Bayesian probabilities on anticipating what I will see the next time I open my eyes, then the credence I'd assign to the true outcome (ignoring practical constraints like computation and my near inability to have any visual imagery) would be ~0. Yet it's easy to get strong Bayesian updates: I just open my eyes. In most cases, this should be a large enough update, and I go on my merry way.

But suppose I open my eyes and instead see people who are approximate lookalikes of dead US presidents sitting around the bus. Then at that point (even though the ex ante probability of this outcome and that of a specific other thing I saw isn't much different), I will correctly be surprised, and have some reason to doubt my sense perception.

Likewise, if instead of saying your name is Mark Xu, you instead said "Lee Kuan Yew", I at least would be pretty suspicious that your actual name is Lee Kuan Yew.

I think a lot of this confusion in intuitions can be resolved by looking at what MacAskill calls the difference between unlikelihood and fishiness:

Lots of things are a priori extremely unlikely yet we should have high credence in them: for example, the chance that you just dealt this particular (random-seeming) sequence of cards from a well-shuffled deck of 52 cards is 1 in 52! ≈ 1 in 10^68, yet you should often have high credence in claims of that form.  But the claim that we’re at an extremely special time is also fishy. That is, it’s more like the claim that you just dealt a deck of cards in perfect order (2 to Ace of clubs, then 2 to Ace of diamonds, etc) from a well-shuffled deck of cards. 

Being fishy is different than just being unlikely. The difference between unlikelihood and fishiness is the availability of alternative, not wildly improbable, hypotheses on which the outcome or evidence is reasonably likely. If I deal the random-seeming sequence of cards, I don’t have reason to question my assumption that the deck was shuffled, because there’s no alternative background assumption on which the random-seeming sequence is a likely occurrence. If, however, I deal the deck of cards in perfect order, I do have reason to significantly update that the deck was not in fact shuffled, because the probability of getting cards in perfect order if the cards were not shuffled is reasonably high. That is: P(cards not shuffled) · P(cards in perfect order | cards not shuffled) >> P(cards shuffled) · P(cards in perfect order | cards shuffled), even if my prior credence was that P(cards shuffled) > P(cards not shuffled), so I should update towards the cards having not been shuffled.

Put another way, we can dissolve this by looking explicitly at Bayes' theorem:

P(H | E) = P(E | H) P(H) / P(E)

and in turn,

P(E) = P(E | H) P(H) + P(E | ¬H) P(¬H).

P(E | H) is high in both the "fishy" and "non-fishy" regimes. However, P(E | ¬H) is much higher for fishy hypotheses than for non-fishy hypotheses, even if the surface-level evidence looks similar!
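
To put illustrative numbers on this (the probabilities below are made up purely to show the shape of the effect, not estimates of anything): P(E | H) is comparable in both cases, but the fishy case has alternative explanations, like joking or lying, that make P(E | ¬H) far larger, which drags the posterior down.

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """P(H | E) for a binary hypothesis via Bayes' theorem."""
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    return p_e_given_h * prior_h / p_e

# Non-fishy: a stranger says their name is "Mark Xu".
# If that's false, almost nobody would utter that particular name.
print(posterior(prior_h=1e-6, p_e_given_h=0.9, p_e_given_not_h=1e-8))  # ~0.99

# Fishy: a stranger says their name is "Lee Kuan Yew".
# If that's false, joking or lying makes the same utterance far more likely.
print(posterior(prior_h=1e-6, p_e_given_h=0.9, p_e_given_not_h=1e-4))  # ~0.009
```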

More Facebook discussion of this post:

___________________________

Satvik Beri:  I think Bayes' Theorem is extremely hard to apply usefully, to the point that I rarely use it at all despite working in data science.

A major problem that leads people to be underconfident is the temptation to round strong evidence down to more reasonable-seeming odds, as the post mentions. A major problem that leads people to be overconfident is applying lots of small pieces of information while discounting the correlations between them.

A comment [on LessWrong] mentions that if you have excellent returns for a year, that's strong evidence you're a top 1% trader. That's not really true: the market tends to move in regimes for long periods of time, so a strategy that works well for a year is pretty likely to have average performance the next year. Studies of hedge fund managers have found it is extremely difficult to find consistent outperformers; e.g., 5-year performance on pretty much any metric is uncorrelated with performance on that metric in the following year.

I didn’t say anything about what size/duration of returns would make you a top 1% trader.

Facebook discussion of this post:

___________________________

Duncan Sabien:  This is ... not a clean argument. Haven't read the full post, but I feel the feeling of someone trying to do sleight-of-hand on me.

[Added by Duncan: "my apologies for not being able to devote more time to clarity and constructivity.  Mark Xu is good people in my experience."]

Rob Bensinger:  Isn't 'my prior odds were x, my posterior odds were y, therefore my evidence strength must be z' already good enough?

Are you worried that the person might not actually have a posterior that extreme? Like, if they actually took 21 bets like that they'd get more than 1 of them wrong?

Guy Srinivasan:  I feel like "fight! fight!" except with the word "unpack!"

Duncan Sabien:  > The prior odds that someone’s name is 'Mark Xu' are generously 1:1,000,000. Posterior odds of 20:1 implies that the odds ratio of me saying 'Mark Xu' is 20,000,000:1, or roughly 24 bits of evidence. That’s a lot of evidence.

This is beyond "spherical frictionless cows" and into disingenuous adversarial levels of oversimplification. I'm having a hard time clarifying what's sending up red flags here, except to say "the claim that his mere assertion provided 24 bits of evidence is false, and saying it in this oddly specific and confident way will cow less literate reasoners into just believing him, and I feel gross."

Guy Srinivasan:  Could it be that there's a smuggled intuition here that we're trying to distinguish between names in a good faith world, and that the bad faith hypothesis is important in ways that "the name might be John" isn't, and that just rounding it off to bits of evidence makes it seem like the extra 0.1 bits "maybe this exchange is bad faith" are small in comparison when actually they are the most important bits to gain?

(the above is not math)

Marcello Herreshoff:  I share Duncan's intuition that there's a sleight of hand happening here. Here's my candidate for where the sleight of hand might live:

Vast odds ratios do lurk behind many encounters, but specifically, they show up much more often in situations that raise an improbable hypothesis to consideration-worthiness (as in Mark Xu's first set of examples) than in situations where they raise consideration-worthy hypotheses to very high levels of certainty (as in Mark Xu's second set of examples).

Put another way, how correlated your available observations are to some variable puts a ceiling on how certain you're ever allowed to get about that variable. So we should often expect the last mile of updates in favor of a hypothesis to be much harder to obtain than the first mile.

Ronny Fernandez:  @Duncan Sabien So is the prior higher or is the posterior lower?

Chana Messinger:  I wonder if this is similar to my confusion at whether expected conservation of evidence is violated if you have a really good experiment that would give you strong evidence for A if it comes out one way and strong evidence for B if it comes out the other way.
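
For what it's worth, a highly informative experiment doesn't violate conservation of expected evidence: each outcome moves you a lot, but the expected posterior (weighted by how likely you currently think each outcome is) still equals the prior. A quick sketch with made-up numbers:

```python
# Conservation of expected evidence: E[P(A | result)] = P(A), even when
# each individual result would constitute strong evidence.
prior_a = 0.5
p_pos_given_a, p_pos_given_b = 0.99, 0.01  # a very informative experiment

p_pos = p_pos_given_a * prior_a + p_pos_given_b * (1 - prior_a)
post_if_pos = p_pos_given_a * prior_a / p_pos              # ~0.99
post_if_neg = (1 - p_pos_given_a) * prior_a / (1 - p_pos)  # ~0.01

print(p_pos * post_if_pos + (1 - p_pos) * post_if_neg)  # 0.5, equal to the prior
```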

Ronny Fernandez:  @Marcello Mathias Herreshoff I don’t think I actually understand the last paragraph in your explanation. Feel like elaborating?

Marcello Herreshoff:  Consider the driver's license example. If we suppose 1/1000 of people are identity thieves carrying perfect driver's license forgeries (of randomly selected victims), then there is absolutely nothing you can do (using drivers licenses alone) to get your level of certainty that the person you're talking to is Mark Xu above 99.9%, because the evidence you can access can't separate the real Mark Xu from a potential impersonator. That's the flavor of effect the first sentence of the last paragraph was trying to point at.
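
A back-of-the-envelope version of that ceiling, using Marcello's assumed 1-in-1,000 rate of perfect forgers (numbers purely illustrative): once the forgery is perfect, the license looks the same under both hypotheses, so it carries a likelihood ratio of 1:1, i.e. zero bits, and the posterior can't rise above the 999:1 prior.

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """P(H | E) for a binary hypothesis via Bayes' theorem."""
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    return p_e_given_h * prior_h / p_e

# H: the person showing the license really is Mark Xu.
# not-H: an identity thief impersonating him with a perfect forged license
# (following Marcello's simplified 1-in-1,000 rate). A perfect forgery means
# the matching license is equally likely under both hypotheses.
print(posterior(prior_h=0.999, p_e_given_h=1.0, p_e_given_not_h=1.0))  # 0.999
```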

I’m guessing they would have happily accepted a bet at 20:1 odds that my driver’s license would say “Mark Xu” on it

Pretty minor point, but personally there are many situations where I'd be happy to accept the other side of that bet for many (most?) people named Mark Xu, if the only information I and the other person had was someone saying "Hi, I'm Mark Xu."

More Facebook discussion of this post:

___________________________

Ronny Fernandez:  I think maybe what’s actually going on here is that extraordinary claims usually have much lower prior prob than 10^-6

Genuinely extraordinary claims, not claims that seem weird
