Wisdom of the Crowd vs. "the Best of the Best of the Best"

nikos

14 min read · Apr 4, 2023

Comments 11

Charles Dillon 🔸

Can you quantify how much work recency weighting is doing here? I could imagine it explaining all (or even more than all) of the effect (e.g. if many "best" forecasters have stale predictions relative to the community prediction often).

nikos

Not sure how to quantify that (open to ideas). But intuitively I agree with you and would suspect it's at least a sizable part.

Charles Dillon 🔸

Suggestion: pre-commit to a ranking method for forecasters. Chuck out questions which go to <5%/>95% within a week. Take the pairs (question, time) with 10n+ updates within the last m days for some n,m, and no overlap (for questions with overlap pick the time which maximises number of predictions). Take the n best forecasters per your ranking method in the sample and compare them to the full sample and the "without them" sample.
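A minimal sketch of the comparison step in this suggestion, for a single (question, time) pair. The input format, the function names, and the use of a plain median as the aggregate are all assumptions made for illustration; the forecasts are assumed to be already filtered to the last m days, and the median is a stand-in, not Metaculus' actual Community Prediction.

```python
from statistics import median


def brier(p, outcome):
    """Brier score for one binary forecast p against outcome (0 or 1)."""
    return (p - outcome) ** 2


def compare_groups(forecasts, ranking, outcome, n):
    """Score three aggregates at one (question, time) pair.

    forecasts: {user: latest probability} (hypothetical input format)
    ranking:   {user: pre-committed skill score, higher is better}
    """
    if len(forecasts) < 10 * n:
        raise ValueError("need at least 10n forecasters at this time point")
    ranked = sorted(forecasts, key=lambda user: ranking[user], reverse=True)
    groups = {
        "top_n": ranked[:n],
        "full": ranked,
        "without_top": ranked[n:],
    }
    # Plain median as a stand-in aggregate; not Metaculus' actual algorithm.
    return {
        name: brier(median([forecasts[u] for u in users]), outcome)
        for name, users in groups.items()
    }


# Toy data: nine forecasters at 0.5 and one top-ranked forecaster at 0.9.
forecasts = {f"user{i}": 0.5 for i in range(9)}
forecasts["ace"] = 0.9
ranking = {user: 0.0 for user in forecasts}
ranking["ace"] = 1.0
scores = compare_groups(forecasts, ranking, outcome=1, n=1)
```

Averaging these three scores over many non-overlapping (question, time) pairs would give the comparison described: top n versus the full sample versus the sample without them.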

isabel🔸

Quick question about reputation scores: "Every time a question resolves, the reputation is updated depending on how many Metaculus points a user got relative to other users (with a mean of zero and a standard deviation of 10)" -- does this mean that predicting on questions late in the life of a question is harmful for one's reputation? Because predicting late means that you'll typically get fewer points than an early predictor.

nikos

In principle yes. In practice also usually yes, but the specifics depend on whether the average user who predicted on a question gets a positive amount of points. So if you predicted very late and your points are close to zero, but the mean number of points forecasters on that question received is positive, then you will end up with a negative update to your reputation score.
Completely agree that a lot hinges on that reputation score. It seems to work decently for the Metaculus Prediction, but it would be good to see what results look like for a different metric of past performance.
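To make the mechanism concrete, here is a rough sketch under one possible reading of "a mean of zero and a standard deviation of 10": points on a resolved question are standardized across the users who predicted on it. The function `reputation_updates` and this exact formula are assumptions for illustration, not Metaculus' actual implementation.

```python
from statistics import fmean, pstdev


def reputation_updates(points):
    """Rescale per-question points to mean 0 and standard deviation 10.

    points: {user: Metaculus points earned on one resolved question}
    This standardization is an assumed reading of the description,
    not the real Metaculus formula.
    """
    values = list(points.values())
    mean = fmean(values)
    spread = pstdev(values)
    if spread == 0:
        # Everyone earned the same points, so nobody moves.
        return {user: 0.0 for user in points}
    return {user: 10 * (p - mean) / spread for user, p in points.items()}


# A late predictor with zero points, on a question where the mean is positive.
points = {"early": 40, "middle": 20, "late": 0}
updates = reputation_updates(points)
```

Under this reading, the late predictor's update is negative even though they did not lose any points, because the mean on the question is positive.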

Vasco Grilo🔸

Nice analysis!

Presumably, when choosing your X, there is a trade-off between "having better forecasters" and "having more forecasters" (see this and this analysis on why more forecasters might be good).

FWIW, here, I found a correlation of -0.0776 between number of forecasters and Brier score. So more forecasters does seem to help, but not that much.

If you have the choice to ask a large crowd OR a small group of accomplished forecasters, you should maybe consider the crowd. This is especially true if you have access to past performance and can do something more sophisticated than Metaculus' Community Prediction.

Mannes 2014 found a select crowd to be better, although not by much, looking into 90 data sets:

Note they scored performance in terms of the mean absolute error, which is not proper, but I guess they would get qualitatively similar results if they had used a proper rule.
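A small numerical check of why absolute error is not proper for a binary event, while the Brier score is. The true probability of 0.6 and the grid of candidate reports are arbitrary choices for illustration.

```python
def expected_brier(report, true_p):
    """Expected Brier score of a report when the event occurs with true_p."""
    return true_p * (1 - report) ** 2 + (1 - true_p) * report ** 2


def expected_abs_error(report, true_p):
    """Expected absolute error of a report when the event occurs with true_p."""
    return true_p * (1 - report) + (1 - true_p) * report


true_p = 0.6
grid = [i / 100 for i in range(101)]  # candidate reports 0.00, 0.01, ..., 1.00

# The report that minimizes each expected score (lower is better for both).
best_brier_report = min(grid, key=lambda q: expected_brier(q, true_p))
best_abs_report = min(grid, key=lambda q: expected_abs_error(q, true_p))
```

Here the Brier score is minimized by reporting the true probability, whereas the absolute error is minimized by the extreme report 1.0, so it rewards overconfidence rather than honest probabilities.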

I used the Metaculus reputation scores for my analysis to select the top forecasters. Reputation scores are used internally to compute the Metaculus Prediction and track performance relative to other forecasters. Using average Brier scores or log scores might yield very different results. Really: this entire analysis hinges on whether or not you think the reputation score is a good proxy for past performance. And it may be, but it might also be flawed.

I think it makes more sense to measure reputation according to the metric being used for performance, i.e. with the Brier/log score, as Mannes 2014 did (but using mean absolute error). You could also try measuring reputation based on performance on questions of the same category, such that you get the best of each category.

nikos

Interesting, thanks for sharing the paper. Yeah agree that using the Brier score / log score might change results and it would definitely be good to check that as well.

NunoSempere

I don't know man, Metaculus forecasters are generally unpaid. Maybe "the best" would be the best monetary prediction market forecasters, or the best hedge-fundies?

Javier Prieto🔸

Glad you brought up real money markets because the real choice here isn't "5 unpaid superforecasters" vs "200 unpaid average forecasters" but "5 really good people who charge $200/h" vs "200 internet anons that'll do it for peanuts". Once you notice the difference in unit labor costs, the question becomes: for a fixed budget, what's the optimal trade-off between crowd size and skill? I'm really uncertain about that myself and have never seen good data on it.

Joel Becker

Agree.

Really glad this work is being done; grateful to Nikos for it! The "yes, and" is that we're nowhere near the frontier of what's possible.

nikos

Yeah, definitely. The title was a bit tongue-in-cheek (it's a movie quote).

Imagine a scenario in which the true probability for an event was 0.6. All forecasters are somewhat terrible, but they manage to be terrible in a way that the average of all forecasts is exactly 0.6 (e.g. some predict 0.3, others 0.9, etc.). If you now add a bunch of very good forecasters that predict between 0.58 and 0.62, that still doesn't improve your average, even though those forecasters individually were clearly better.
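The scenario can be checked with a few lines of arithmetic. The crowd size and the exact expert forecasts below are made up for illustration; only the 0.3 / 0.9 / 0.6 figures come from the text.

```python
from statistics import fmean


def expected_brier(report, true_p):
    """Expected Brier score of a report when the event occurs with true_p."""
    return true_p * (1 - report) ** 2 + (1 - true_p) * report ** 2


true_p = 0.6
crowd = [0.3, 0.9] * 50        # individually poor, but the mean is 0.6
experts = [0.58, 0.60, 0.62]   # individually good, also centered on 0.6

crowd_mean = fmean(crowd)
combined_mean = fmean(crowd + experts)

# Expected individual scores: lower is better.
crowd_member_score = expected_brier(0.3, true_p)
expert_score = expected_brier(0.58, true_p)
```

Both means sit at 0.6 (up to floating-point rounding), so adding the experts leaves the aggregate unchanged, even though each expert's expected Brier score is clearly lower than a crowd member's.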

Min. num. forecasters	50	100	150	200	250	300	350	400	450	500	550	600
Num. of available questions	1033	507	276	175	127	90	72	59	39	35	30	27
