'small or even zero' refers to two different conclusions reached using two different accounting methods.
'small': from the method which spreads 'lives saved' across all contributors in the chain of causality.
'zero': from the method which attributes 'lives saved' only to the final actor in the chain of causality.
Leif provides both accounts, which is why he provides 'small or even zero' as his description of the impact.
I agree that it is a little unclear. I think Leif's argument would be clearer if he omitted the 'zero' accounting method; I don't think he places much credence in it, but he included it to illustrate the potential range of accounts of attribution.
Overall, I think it is accurate for Leif to characterise the impact as 'small' if his claim that we ought to discount impact by multiple orders of magnitude is correct.
+1 to the comments about the paucity of detail and checks. There are a number of issues that I can see.
Am I understanding the technical report correctly? It says "For each question, we sample 5 forecasts. All metrics are averaged across these forecasts." It is difficult to interpret this precisely, but the most likely meaning I take from it is that you calculated accuracy metrics for 5 human forecasts per question, then averaged those accuracy metrics. That is not measuring the accuracy of "the wisdom of the crowd"; it is a (very high-variance) estimate of the accuracy of "the average forecaster on Metaculus". If that interpretation is correct, all you've achieved is a bot that does better than an average Metaculus forecaster.
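To make the distinction concrete, here is a toy sketch (entirely simulated data and my own variable names, nothing from the report) of how 'average the scores of individual forecasts' and 'score the aggregated crowd forecast' come apart:

```python
import numpy as np

# Toy simulation: 200 hypothetical binary questions, 5 sampled forecasters each.
rng = np.random.default_rng(0)
n_questions, n_forecasters = 200, 5
p_true = rng.uniform(0.05, 0.95, n_questions)      # latent probabilities
outcomes = rng.binomial(1, p_true)                  # resolved outcomes
forecasts = np.clip(p_true[:, None]                 # noisy individual forecasts
                    + rng.normal(0, 0.15, (n_questions, n_forecasters)), 0.01, 0.99)

# Reading A of the report: score each sampled forecast, then average the scores.
brier_per_forecast = ((forecasts - outcomes[:, None]) ** 2).mean()

# "Wisdom of the crowd": aggregate first (e.g. take the median), then score the aggregate.
crowd = np.median(forecasts, axis=1)
brier_crowd = ((crowd - outcomes) ** 2).mean()

print(f"mean Brier of individual forecasts: {brier_per_forecast:.4f}")
print(f"Brier of crowd-median forecast:     {brier_crowd:.4f}")  # typically lower
```

The first number is a baseline for 'the average sampled forecaster', the second for the crowd aggregate; beating the former is a much weaker claim than beating the latter.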
I think it is likely that searches for historical articles will be biased by Google's current search rankings. For example, if Israel actually did end up invading Lebanon, then you might expect historical articles speculating about a possible invasion to be linked to more often by present-day articles, and therefore to rank higher in search results even when restricting to articles written before the cutoff date. This would bias the model's data collection, and would partially explain good performance when predicting historical events.
Assuming that you have not made the mistake I described in my first point above, it would be useful to dig into the result data a bit more to check how performance varies. Where does the bot tend to beat the wisdom of the crowd? For example, are there particular topics it performs better on? Does it tend to be more conservative, or more confident, than a crowd of human forecasters? How does its calibration curve compare to that of the humans? These are all questions I would expect to be answered in a technical report claiming to prove superhuman forecasting ability.
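The calibration comparison in particular is cheap to run once you have the per-question forecasts and resolutions (all variable names below are placeholders, not from the report):

```python
import numpy as np

def calibration_table(probs, outcomes, n_bins=10):
    """For each probability bin: (mean forecast, observed frequency, question count)."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    return [
        (probs[mask].mean(), outcomes[mask].mean(), int(mask.sum()))
        for b in range(n_bins)
        if (mask := (idx == b)).any()
    ]

# Compare e.g. calibration_table(bot_probs, resolutions) against
# calibration_table(crowd_probs, resolutions) on the same set of questions.
```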
It might be worth validating that the knowledge cutoff for the LLM is actually the one you expect from the documentation. I do not trust public docs to stay up to date, and that seems like an easy failure mode for this evaluation.
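A crude spot check along these lines (the probe list and the `ask_model` wrapper are placeholders for whatever events and client code you would actually use):

```python
# Ask about a few well-known events dated shortly after the *documented* cutoff.
# If the model confidently knows their outcomes, the documented cutoff is stale.
PROBES = [
    # ("<date just after the documented cutoff>", "<question resolved on that date>"),
]

def check_knowledge_cutoff(ask_model, probes=PROBES):
    """`ask_model(question: str) -> str` is whatever model client you already have."""
    for event_date, question in probes:
        print(event_date, "->", ask_model(question))  # review the answers manually
```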
I think that the real proof will be in its ability to forecast future events: give 539 a Metaculus account and see how it performs.
Honestly, at a higher level, your approach is very unscientific. You have a demo and UI mockups illustrating how your tool could be used, and grandiose messaging across different forums, yet your technical report has no details whatsoever. Even the section on Platt scoring gives no motivation for why I should care about those metrics. This is a hype-driven approach to research that I am (not) surprised to see come out of 'the centre for AI safety'.
Hello! I am the aforementioned friend. I guess part of the problem is the deliberate narrowing-in of scope that the book proposes (see the Diagnosis summary above). To some degree, this narrowing of scope is a necessary and valuable part of creating a plan of action to achieve a limited objective.
But I think that this Desert Storm example in the book is an entertainingly good example of 'win the battle, lose the war' as Huw mentioned.
There are many examples throughout history of bureaucrats and leaders narrowly striving to achieve the objective most clearly in front of them, at long-term cost to the organisation or society they are ostensibly acting on behalf of (even putting aside the question of wider wellbeing).
Given this history, I think that any book attempting to discuss 'good strategy' shouldn't shy away from this issue. I don't think it's valid for the author (or reviewer) to just deem that topic as out of scope.
It's been a long time since I read the book, so apologies if my recollection is mistaken, but I don't recall it engaging with this topic. At the very least, it definitely ignores it in the Desert Storm example.
To be clear, I found other parts of the book valuable, and I think that the calling out of different types of 'bad strategy' in particular is useful.
[Framed a different way: GSBS criticises 'bad strategy' for failing to grapple with the largest challenges that exist in a given situation. In Operation Desert Storm, it is fairly obvious that the biggest challenge was not 'how do we defeat this military force, given our overwhelming technological & air superiority?' but rather 'how do we achieve our political goals via this military operation, given the obvious potential causes of civilian unrest and unhappiness?'. Looked at this way, Desert Storm might have been a great example of the importance of choosing a single coherent plan of attack, but also a great example of failing to identify the actual largest issue faced by the American military, and therefore a good example of 'bad strategy'.]
Leif - thanks for sharing this! I appreciate your explanation of the Wired article, which frankly did convey the impression that, for example, you were arguing for a certain method of aid delivery.
In the letter, you provide some 'false objections' to watch out for.
Could you please share what you think the strongest objections to your arguments would be, and what your responses to them would be?