
Crossposted to LessWrong

TL;DR Quantified Intuitions helps users practice assigning credences to outcomes with a quick feedback loop. Please leave feedback in the comments, join our Discord, or send thoughts to aaron@sage-future.org.

Quantified Intuitions

Quantified Intuitions currently consists of two apps:

  1. Calibration game: Assigning confidence intervals to EA-related trivia questions.
    1. Question sources vary, but many are from the Anki deck "Some key numbers that (almost) every EA should know"
    2. Compared to Open Philanthropy’s calibration app, it currently contains less diversity of questions (though hopefully ones more interesting to EAF/LW readers), but the app is more modern and nicer to use in some ways
  2. Pastcasting: Forecasting on already resolved questions that you don’t have prior knowledge about.
    1. Questions are pulled from Metaculus and Good Judgment Open
    2. More info on the motivation and how it works is in the LessWrong announcement post

Please leave feedback in the comments, join our Discord, or send it to aaron@sage-future.org.

Motivation

There are huge benefits to using numbers when discussing disagreements: see “3.3.1 Expressing degrees of confidence” in Reasoning Transparency by OpenPhil. But anecdotally, many EAs still feel uncomfortable quantifying their intuitions and continue to prefer using words like “likely” and “plausible” which could be interpreted in many ways.

This issue is likely to get worse as the EA movement attempts to grow quickly, with many new members joining who are coming in with various backgrounds and perspectives on the value of subjective credences. We hope that Quantified Intuitions can help both new and longtime EAs be more comfortable turning their intuitions into numbers.

More background on motivation can be found in Eli’s forum comments here and here.

Who built this?

Sage is an organization founded earlier this year by Eli Lifland, Aaron Ho, and Misha Yagudin (in a part-time advising capacity). We’re funded by the FTX Future Fund.

As stated in the grant summary, our initial plan was to “create a pilot version of a forecasting platform, and a paid forecasting team, to make predictions about questions relevant to high-impact research”. While we built a decent beta forecasting platform (which we plan to open source at some point), the pilot for forecasting on questions relevant to high-impact research didn’t go that well, due to (a) difficulties in creating resolvable questions relevant to cruxes in AI governance and (b) time constraints of talented forecasters. Nonetheless, we are still growing Samotsvety’s capacity and taking occasional high-impact forecasting gigs.

Eli was also struggling somewhat personally around this time, and was updating toward AI alignment being super important but crowd forecasting not being that promising for attacking it. He stepped down and is now advising Sage part-time.

Meanwhile, we pivoted to building the apps contained in Quantified Intuitions to improve and maintain epistemics in EA. Aaron wrote most of the software for both apps within the past few months; Alejandro Ortega helped with the calibration game questions, and Alina Timoshkina helped with a wide variety of tasks.

If you’d like to contact Sage you can message us on EAF/LW or email aaron@sage-future.org. If you’re interested in helping build apps similar to the ones on Quantified Intuitions or improving the current apps, fill out this expression of interest. It’s possible that we’ll hire a software engineer, product manager, and/or generalist, but we don’t have concrete plans.

Comments

I might be biased because I had an idea for something very similar, but I think this is amazing and that you've hit on something very, very interesting. I found the calibration training game very addictive (in a good way) and actually played it for a few hours.

I think it might be because I play it in a particular way, though:

  • I always set it to 90%.
  • Then, I only put in orders of magnitude, even when the prompt and mask don't force the user to do this. So for instance, for 'What percent of the world's population was killed by the 1918 flu pandemic?' I put in: 90% Confidence Interval, Lower Bound: 1%, Upper Bound: 10%. This has two advantages:
  1. I can play the game very quickly - I can do a rough BOTEC in my head.
  2. I'm almost always accurate but not very precise, and when I'm not, I'm literally orders of magnitude off and I get this huge prediction error signal - which is very memorable (and I feel a bit dumb! :D). This might also guide people towards those parts of their model of the world where they have the biggest gaps in their knowledge (certain scientific subjects). 'It's better to be roughly right than precisely wrong'. I think you could implement a spaced repetition feature based on how many orders of magnitude you're off, where the more OOMs you're off, the earlier it prompts you with the same question again (so if you're, say, >3 orders of magnitude off it prompts you within the same session; if you're 2 orders of magnitude off, within 24 hours; 1 order of magnitude off, within 3 days - intervals borrowed from RemNote); a minimal sketch of this scheduling rule is below. You could preferentially prioritize displaying questions that people often get wrong, perhaps even personalize it using ML.
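
As a rough illustration of the scheduling rule suggested above, here is a minimal sketch in Python. The function names, thresholds, and interval values are illustrative assumptions, not part of the actual app:

```python
import math
from datetime import timedelta

def oom_error(estimate: float, truth: float) -> float:
    """Absolute error between estimate and true value, in orders of magnitude."""
    return abs(math.log10(estimate) - math.log10(truth))

def next_review_delay(estimate: float, truth: float) -> timedelta:
    """Schedule the next repetition sooner the further off the estimate was.

    Thresholds follow the intervals suggested in the comment above
    (>=3 OOM off: same session, ~2 OOM: within a day, ~1 OOM: within 3 days);
    the exact values are illustrative.
    """
    error = oom_error(estimate, truth)
    if error >= 3:
        return timedelta(minutes=5)   # re-ask later in the same session
    if error >= 2:
        return timedelta(days=1)
    if error >= 1:
        return timedelta(days=3)
    return timedelta(days=7)          # roughly right: push the question out further

# Example: guessing 0.1% for the 1918 flu question when the true share is taken
# to be ~2.5% (illustrative) is ~1.4 orders of magnitude off, so the question
# would come back in 3 days.
print(next_review_delay(0.001, 0.025))
```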

With that in mind, here are some feature suggestions:

  1. You're already pretty good at getting people to make rough order-of-magnitude estimates by often using scientific notation, but you could zero in on this aspect of the game.
  • Add even higher confidence settings like 95% and 99%, and perhaps make one of those the default. This will get users to answer questions faster.
  • Restrict the input to orders of magnitude, or make that the default. It might also be good to let people select 1 million, 10 million, 100 million, etc. from a drop-down menu, so that answering gets faster and is more reinforcing.
  • While I appreciate that I got more of an intuitive grasp of scientific notation playing the game (how many 0s does a trillion have again?), consider displaying the corresponding word (e.g. 'trillion' when putting in 10^12).
  • Where possible, try to contextualize the numbers (I do this in this post on trillion-dollar figures: 'So how can you conceptualize $1 trillion? 1 trillion is 1,000 billion. 1 billion is 1,000 million. Houses often cost ~1 million. So 1 trillion ≈ 1 million houses—a whole city.')
  • I like the timer feature, but perhaps consider either reducing the time per question even further or giving more points if one answers faster.

If you gamify this properly, I think this could be the next Sporcle (but much more useful).

I think you could implement a spaced repetition feature based on how many orders of magnitude you’re off, where the more OOMs you're off, the earlier it prompts you with the same question again

 

This is a great idea, so we made Anki with Uncertainty to do exactly this!

Thank you Hauke for the suggestion :D

I think we'll keep the calibration app as a pure calibration training game, where you see each question only once. Anki is already the king of spaced repetition, so adding calibration features to it seemed like a natural fit.

This is awesome, I am glad that someone built this!

This seems cool!

When I saw the word "app" I assumed 'oh cool I can download this on my phone and maybe I'll be tempted to fiddle with it in spare moments similarly to how I get tempted to scroll social media.' Seems it's just on a website for now? I'm less optimistic that I'll remember / get tempted to use it in this format.

(Not a criticism, just a reflection.)

Thanks for this! I am excited to try it!

But anecdotally, many EAs still feel uncomfortable quantifying their intuitions and continue to prefer using words like “likely” and “plausible” which could be interpreted in many ways.

This issue is likely to get worse as the EA movement attempts to grow quickly, with many new members joining who are coming in with various backgrounds and perspectives on the value of subjective credences

 

Don't take this as a serious criticism; I just found it funny.

Yeah I realized this when proofreading and left it as I thought it drove home my point well :p

We've added a new deck of questions to the calibration training app - The World, then and now.

What was the world like 200 years ago, and how has it changed? Featuring charts from Our World in Data.

Thanks to Johanna Einsiedler and Jakob Graabak for helping build this deck!

We've also split the existing questions into decks, so you can focus on the topics you're most interested in.
