This is a special post for quick takes by Ben Stewart. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

I was excited by ForecastBench and FutureEval both projecting that LLMs would reach superforecaster parity by June 2027. But I didn't realise access to human crowd forecasts might be driving a lot of performance. If it is, that is massively disappointing. 

The top LLM performers in ForecastBench have access to the crowd forecast (and it's not clear to me whether FutureEval hides crowd forecasts - Metaculus did for the Quarterly Cup in 2025, but I couldn't find info about FutureEval). Skimming the literature with Claude, it seems like most studies either deliberately provide crowd forecasts or don't prevent models from searching for them, and those that hide them tend to report significantly worse results (still interesting, but less exciting).
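(For anyone unfamiliar with how these benchmarks score "parity": as I understand it, ForecastBench compares humans and models using the Brier score, a proper scoring rule. A minimal sketch, with made-up numbers, just to make the comparison concrete:)

```typescript
// Brier score: mean squared error between probability forecasts and binary
// outcomes. Lower is better; 0.25 is what always guessing 50% would score.
function brierScore(forecasts: number[], outcomes: (0 | 1)[]): number {
  if (forecasts.length === 0 || forecasts.length !== outcomes.length) {
    throw new Error("forecasts and outcomes must be equal-length and non-empty");
  }
  const total = forecasts.reduce(
    (sum, p, i) => sum + (p - outcomes[i]) ** 2,
    0,
  );
  return total / forecasts.length;
}

// Made-up numbers, purely illustrative: "parity" would mean the model's
// average Brier score matching the superforecaster crowd's.
const crowd = brierScore([0.9, 0.2, 0.6], [1, 0, 1]); // ≈ 0.07
const model = brierScore([0.7, 0.4, 0.5], [1, 0, 1]); // ≈ 0.17
console.log({ crowd, model });
```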

To me, the potential wonder of LLM superforecasting is being able to get excellent guesses at any question I might come up with. If I need to already have a human crowd or market forecast for the guess to be any good, then the kind of LLM superforecasting being projected is about 10% as useful to me. I still expect 'true' parity eventually, but it becomes a story of general timelines rather than an empirical projection.

I don't know the field well, and I'm probably misunderstanding something. I'm posting this to find out I'm wrong. If I'm right, then it's worth dampening the expectations of anyone else who was imagining having an instant team of supers at their beck and call in ~14 months' time.

On the recent post on Manifest, there's been another instance of a large voting group (30-40ish [edit to clarify: 30-40ish karma, not 30-40ish individuals]) arriving and downvoting any progressive-valenced comments (there were upvotes and downvotes prior to this, but in a more stochastic pattern). This is similar to what occurred with the eugenics-related posts last year. Wanted to flag it to give a picture to later readers on the dynamics at play.

Manifold openly offered to fund voting rings in their Discord:

Just noting for anyone else reading the parent comment but not the screenshot, that said discussion was about Hacker News, not the EA Forum.

Also it was clearly not about Manifest. (Though it is nonetheless very cringe).

I would be surprised if it's 30-40 people. My guess is it's more like 5-6 people with reasonably high vote-strengths. Also, I highly doubt that the overall bias of the conversation here leans towards progressive-valenced comments being suppressed. EA is overwhelmingly progressive and has a pretty obvious anti-right bias (which, like, I am a bit sympathetic to, but I feel like a warning in the opposite direction would be more appropriate).

My wording was imprecise - I meant 30-40ish in terms of karma. I agree the number of people is more likely to be 5-12. And my point is less about overall bias than about a particular voting dynamic: at first, upvotes and downvotes occurring in a fairly typical pattern, then a large and sudden influx of downvotes on everything from a particular camp.

There really should be a limit on the quantity of strong upvotes/downvotes one can deploy on comments to a particular post -- perhaps both "within a specific amount of time" and "in total." A voting group of ~half a dozen users should not be able to exert that much control over the karma distribution on a post. To be clear, I view (at least strong) targeted "downvoting [of] any progressive-valenced comments" as inconsistent with Forum voting norms.

At present, the only semi-practical fix would be for users on the other side of the debate to go back through the comments, guess which ones had been the targets of the voting group, and apply strong upvotes hoping to roughly neutralize the norm-breaking voting behavior of the voting group. Both the universe in which karma counts are corrupted by small voting groups and the universe in which karma counts are significantly determined by a clash between voting groups and self-appointed defenders seem really undesirable.

We implemented this on LessWrong! (indeed based on some of my own bad experiences with threads like this on the EA Forum)

The EA Forum decided to forum-gate the relevant changes, but on LW people would indeed be prevented from voting in the way I think voting is happening here: https://github.com/ForumMagnum/ForumMagnum/commit/07e0754042f88e1bd002d68f5f2ab12f1f4d4908
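For anyone curious what that kind of restriction looks like mechanically, here's a rough sketch - not the actual code from the commit above; the interface, names, and thresholds are all invented for illustration - of a total per-post cap on strong votes combined with a short-window rate check:

```typescript
// Hypothetical sketch only - names, interface, and thresholds are invented
// for illustration, not taken from the ForumMagnum commit linked above.
interface VoteRecord {
  userId: string;
  postId: string;
  isStrong: boolean;
  votedAt: Date;
}

const MAX_STRONG_VOTES_PER_POST = 5; // total cap per user per post
const MAX_STRONG_VOTES_PER_WINDOW = 3; // cap within the rolling window
const WINDOW_MS = 60 * 60 * 1000; // one hour

function canCastStrongVote(
  priorVotes: VoteRecord[],
  userId: string,
  postId: string,
  now: Date = new Date(),
): boolean {
  // All strong votes this user has already cast on this post's comments.
  const strongVotesHere = priorVotes.filter(
    (v) => v.userId === userId && v.postId === postId && v.isStrong,
  );
  if (strongVotesHere.length >= MAX_STRONG_VOTES_PER_POST) return false;

  // Short-window rate check: blocks a sudden burst of strong votes.
  const recent = strongVotesHere.filter(
    (v) => now.getTime() - v.votedAt.getTime() < WINDOW_MS,
  );
  return recent.length < MAX_STRONG_VOTES_PER_WINDOW;
}
```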

Thanks for the suggestion Jason! @JP Addison says that he forum-gated it at the time because he wanted to “see how it went over, whether they endorsed it on reflection. They previously wouldn’t have liked users treating votes as a scarce resource.” LW seems happy with how it’s gone, so we’ll go ahead and remove the forum-gating.

What can I read to understand the current and near-term state of drone warfare, especially (semi-)autonomous systems? 

I'm looking for an overview of the developments in recent years, and what near-term systems are looking like. I've been enjoying Paul Scharre's 'Army of None', but given it was published in 2018 it's well behind the curve. Thanks!

I really enjoyed this 2022 paper by Rose Cao ("Multiple realizability and the spirit of functionalism"). A common intuition is that the brain is basically a big network of neurons with input on one side and all-or-nothing output on the other, and the rest of it (glia, metabolism, blood) is mainly keeping that network running. 
The paper's helpful for articulating how impoverished that model is, and it argues that the right level for explaining brain activity (and resulting psychological states) might rely on the messy, complex, biological details, such that non-biological substrates for consciousness are implausible. (Some of those details: spatial and temporal determinants of activity, chemical transducers and signals beyond excitation/inhibition, self-modification, plasticity, glia, functional meshing with the physical body, multiplexed functions, generative entrenchment.)
The argument doesn't necessarily oppose functionalism, but I think it's a healthy challenge to my previous confidence in multiple realisability within plausible limits of size, speed, and substrate. It's also useful for pointing out just how different artificial neural networks are from biological brains. This strengthens my feeling of the alien-ness of AI models, and updates me towards greater scepticism of digital sentience.
I think the paper's a wonderful example of marrying deeply engaged philosophy with empirical reality.
