
TLDR

  • I analysed a set of 64 (non-randomly selected) binary forecasting questions that exist both on Metaculus and on Manifold Markets. 
  • The mean Brier score was 0.084 for Metaculus and 0.107 for Manifold. This difference was significant using a paired test. Metaculus was ahead of Manifold on 75% of the questions (48 out of 64). 
  • Metaculus on average had a much higher number of forecasters than Manifold.
  • All code used for this analysis can be found here.

Conflict of interest note
I am an employee of Metaculus. I think this didn't influence my analysis, but then of course I'd think that and there may be things I haven't thought about. 

Introduction

Everyone likes forecasts, especially if they are accurate (well, there may be some exceptions). As a forecast consumer, the central question is: where should you go to get the best forecasts? If two competing forecasts slightly disagree, which one should you trust more? 

There are a multitude of websites that collect predictions from users and provide aggregate forecasts to the public. Unfortunately, comparing different platforms is difficult: questions are usually not completely identical across sites, which makes it cumbersome to compare them fairly. Luckily, we have at least some data to compare two platforms, Metaculus and Manifold Markets. Some time ago, David Glidden created a bot on Manifold Markets, the MetaculusBot, which copied some questions from the prediction platform Metaculus over to Manifold Markets. 

Methods

  • Manifold has a few markets that were copied from Metaculus through MetaculusBot. I downloaded these using the Manifold API and filtered for resolved binary questions. There are likely more corresponding questions/markets, but I skipped those as I didn't find an easy way to match them automatically. 
  • I merged the Manifold markets with forecasts on corresponding Metaculus questions. I restricted the analysis to the same time frame to avoid issues caused by a question opening earlier or remaining open longer on one of the two platforms. 
  • I compared the Manifold forecasts with the community prediction on Metaculus and calculated a time-averaged Brier score to score forecasts over time. That means forecasts were evaluated using the following score: $\frac{1}{T}\sum_{t=1}^{T}(f_t - o)^2$, with resolution $o$ and forecast $f_t$ at time $t$. I also did the same for log scores, but will focus on Brier scores for simplicity. 
  • I tested for a statistically significant tendency towards higher or lower scores on one platform compared to the other using a Wilcoxon signed-rank test (the paired counterpart of the Mann-Whitney U test). (A paired t-test and a bootstrap analysis yield the same result.) 
  • I visualised results using a bootstrap analysis. For that, I iteratively (100k times) drew 64 samples with replacement from the existing questions and calculated a mean score for Manifold and Metaculus based on the bootstrapped questions, as well as the difference between those means. The precise algorithm (see the code sketch after this list) is: 
    • draw 64 questions with replacement from all questions
    • compute an overall Brier score for Metaculus and one for Manifold
    • take the difference between the two
    • repeat 100k times
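
A minimal Python sketch of the scoring and bootstrap steps (illustrative only: the repository linked above contains the actual analysis code, and the function and column names here are my own assumptions):

```python
import numpy as np
import pandas as pd


def time_averaged_brier(forecasts, resolution):
    """Time-averaged Brier score (1/T) * sum_t (f_t - o)^2 for one question.

    `forecasts` are assumed to be probability snapshots taken at regular
    intervals (e.g. daily) between the question's open and close times.
    """
    f = np.asarray(forecasts, dtype=float)
    return float(np.mean((f - resolution) ** 2))


def bootstrap_mean_difference(scores: pd.DataFrame, n_boot: int = 100_000, seed: int = 1):
    """Bootstrap distribution of the mean score difference (Manifold - Metaculus).

    `scores` is assumed to hold one row per question, with columns
    'metaculus' and 'manifold' containing time-averaged Brier scores.
    """
    rng = np.random.default_rng(seed)
    n = len(scores)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)      # draw n questions with replacement
        resample = scores.iloc[idx]
        # overall (mean) Brier score per platform, then the difference
        diffs[i] = resample["manifold"].mean() - resample["metaculus"].mean()
    return diffs
```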

Results

The time-averaged Brier score on the questions I analysed was 0.084 for Metaculus and 0.107 for Manifold. The difference in means was significantly different from zero using various tests (Wilcoxon signed-rank test: p-value < 0.00001, paired t-test: p-value = 0.000132, bootstrap test: all 100k samples showed a mean difference > 0). Results for the log score look basically the same (log scores were 0.274 for Metaculus and 0.343 for Manifold, with similarly significant differences). 
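
For reference, here is roughly how such paired tests can be run with scipy (a sketch, not the linked analysis code; the two arrays are assumed to hold per-question time-averaged Brier scores, aligned so that element i of each refers to the same question):

```python
import numpy as np
from scipy import stats


def paired_tests(metaculus_brier: np.ndarray, manifold_brier: np.ndarray):
    """Test whether per-question Brier scores differ systematically between platforms."""
    # Wilcoxon signed-rank test: the paired, rank-based counterpart of the Mann-Whitney U test.
    signed_rank = stats.wilcoxon(metaculus_brier, manifold_brier)
    # Paired t-test on the same per-question differences.
    t_test = stats.ttest_rel(metaculus_brier, manifold_brier)
    return signed_rank.pvalue, t_test.pvalue
```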

Here is a plot with the observed differences in time-averaged Brier scores for every question: 

Usually, it's not possible to make any meaningful statements about which of two forecasters is more accurate based on a single question. What we care about, therefore, is average or expected performance across many questions. To get a clearer picture of the average performance difference, I conducted a bootstrap analysis. This plot shows the bootstrapped distribution of the average difference between Manifold and Metaculus across sets of 64 questions:

Under a lot of strong assumptions, this plot would give us an answer to the question "If I look at a set of 64 random questions, available both on Metaculus and Manifold, what should I expect the average difference in Brier score on these 64 questions to be?" Of course, those assumptions don't hold entirely. The bootstrap analysis kind of assumes that questions are independent, which they probably are not (many of the questions are about the Ukraine conflict). The interpretation I've given also assumes that the 64 questions are representative of all questions on Manifold/Metaculus, which they probably are not. 

As another interesting observation, Metaculus on average had a much higher number of forecasters on the markets I looked at than Manifold. (For a discussion on how that might affect accuracy see here and more recently here.) Here is a plot of the number of forecasts for each question (y-axis on the log scale, with red marks indicating when that platform has a better Brier Score). 

I find this interesting, but I also find it hard to identify any meaningful patterns. For example, one could expect red points to be clustered at the top for Manifold, indicating that more forecasts equal better performance. But we don't see that here. The comparison may be somewhat limited anyway: in the eyes of the Metaculus community prediction, all forecasts are created equal. On Manifold, however, users can invest different amounts of money. A single user can therefore in principle have an outsized influence on the overall market price if they are willing to spend enough. I'd be interested to see more on how accuracy on Manifold changes with the number of traders and overall trading volume. Who knows, maybe Manifold would be ahead if they had a similar number of forecasters to Metaculus? 

Let's have another look at the actual forecasts. Here is a gigantic plot that shows the Manifold forecasts and the corresponding Metaculus community predictions (as well as time-averaged scores) for all questions that I looked at. 

We can notice a few interesting things. The curves for Metaculus and Manifold usually look roughly similar. That's good. If independent people using different methods arrive at similar conclusions, that should give us more confidence that the overall conclusions are reasonable. Of course, it could just mean that one platform copies the other. But even that would be a good sign, as it means  you couldn't trivially do better than just copying the other platform. 

The curve for Manifold looks more spiky and less smooth. I expect this to be largely a function of the number of forecasters and the trading volume. To me, the spikes mostly look like noise. But large movements could also reflect a tendency for Manifold to update more quickly or strongly to new information. Sometimes markets on Manifold have gone stale, which seems to be less of an issue on Metaculus in this small data set. 

Discussion

Statistical significance aside, from a pure forecast-consumer perspective the forecasts on the 64 questions I investigated feel more consistent, and therefore slightly more informative, on Metaculus. However, terms and conditions, and of course a million limitations, apply. 

Firstly, the set of questions I looked at is very limited. Results might completely change if you look at different markets/questions. For simplicity, I only looked at markets created on Manifold by the MetaculusBot. I'm not entirely sure how the MetaculusBot picked questions to replicate, but to me it doesn't necessarily look like a random sample. Copying questions from Metaculus to Manifold (rather than the other way round) of course means that the questions are skewed towards the kind of questions that would appear on Metaculus and are of interest to the Metaculus community. If you want to (help) rerun the analysis with more questions, feel free to adapt my code or get in touch. 

Secondly, this analysis doesn't necessarily provide any guidance for the future. Once you point out a potentially profitable trading strategy, it tends to quickly disappear. If I were an ambitious user on Manifold and had a free weekend to spend, I would sure as hell start coding up a bot that just trades the Metaculus community prediction on Manifold. 
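
To illustrate how little such a bot would need, here is a purely hypothetical sketch (the bet endpoint and payload fields are assumptions based on Manifold's public API, and fetching the Metaculus community prediction is left abstract):

```python
import requests

MANIFOLD_BET_URL = "https://api.manifold.markets/v0/bet"  # assumed endpoint


def decide_side(metaculus_cp, manifold_prob, threshold=0.05):
    """Trade towards the Metaculus community prediction whenever the two
    platforms disagree by more than `threshold` (probabilities in [0, 1])."""
    if metaculus_cp - manifold_prob > threshold:
        return "YES"
    if manifold_prob - metaculus_cp > threshold:
        return "NO"
    return None


def place_bet(api_key, contract_id, side, amount=10):
    """Place a small bet on a Manifold market (payload fields assumed)."""
    resp = requests.post(
        MANIFOLD_BET_URL,
        json={"contractId": contract_id, "outcome": side, "amount": amount},
        headers={"Authorization": f"Key {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
```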

Thirdly, this analysis doesn't directly allow general statements about which platform provides more original value, even though it looks like on the set of questions I analysed Metaculus forecasts tended to update faster. It remains a challenge to disentangle how forecasts on both platforms may be influencing each other and how the existence of one platform affects the quality of forecasts on the other.  


Thanks to Lawrence Phillips and Tom Liptay for providing valuable feedback on this post!



2 Answers

"I'd be interested to see more on how accuracy on Manifold changes with the number of traders and overall trading volume. Who knows, maybe Manifold would be ahead if they had a similar number of forecasters to Metaculus?"

Does this mean that if you controlled for the number of forecasters, you still think Metaculus would beat Manifold? If not, do you have any opinion on this question? (Sorry if I missed it.)

I slightly tend towards yes, but that's mere intuition. As someone on Twitter put it, "Metaculus has a more hardcore user base, because it's less fun" - I find it plausible that the Metaculus user base and the Manifold user base differ. But I think higher trading volume would have helped. 

For this particular analysis I'm not sure correcting for the number of forecasters would really be possible in a sound way. It would be great to get the MetaculusBot more active again to collect more data. 

“I'm not entirely sure how the MetaculusBot picked questions to replicate, but to me it doesn't necessarily look like a random sample.”

Correct! I (err…MetaculusBot) chose markets to replicate on Manifold based on personal preference, heavily anchored toward questions in the Metaculus Effective Altruism category, as well as Ukraine, as it seemed to be of most interest to the community at the time.

MetaculusBot has been dormant the last few months, at least in terms of market creation, but open to requests via tagged comment on Manifold, direct message to me here on the EA Forum, or via Twitter DM @dglid.

Credit to the Manifold team for the idea and letting me manage that account.

It should be possible to fully automate the bot and just run a CRON job that regularly checks the Metaculus API for new questions, right? 

And is the code to the MetaculusBot public somewhere? :) 
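
For what it's worth, a rough sketch of what such a job could look like (hypothetical: the endpoint and query parameters are assumptions and would need to be checked against the current Metaculus and Manifold API docs):

```python
"""Poll Metaculus for recently published questions so a bot could mirror them
on Manifold. Run e.g. hourly via cron:  0 * * * * python poll_metaculus.py"""
import requests

METACULUS_API = "https://www.metaculus.com/api2/questions/"  # assumed endpoint


def fetch_recent_questions(limit=20):
    # Query parameters are assumptions; see the Metaculus API docs.
    resp = requests.get(
        METACULUS_API,
        params={"order_by": "-publish_time", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])


if __name__ == "__main__":
    for question in fetch_recent_questions():
        # A full bot would check whether the question already exists on Manifold
        # and, if not, create it via Manifold's market-creation endpoint
        # (see the Manifold API docs).
        print(question.get("id"), question.get("title"))
```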

3 Comments

Thanks for the post, great analysis!

"I'd be interested to see more on how accuracy on Manifold changes with the number of traders and overall trading volume."

In my anecdotal experience, Manifold accuracy improves a lot with more trading, just as you'd expect. Some of the markets in this dataset probably only ever got a single-digit number of trades and were obviously mispriced (with nobody seeing them to correct the mispricing). I occasionally see this happen on Metaculus too, but much less often in my experience - I would guess this is in large part because the question set on Metaculus is highly curated and much smaller.

I'd be interested in what the analysis looks like if restricted to questions with at least a certain number of forecasts or forecasters.

"The curve for Manifold looks more spiky and less smooth. I expect this to be largely a function of the number of forecasters and the trading volume. To me, the spikes mostly look like noise."

Yeah, the narrow spikes on Manifold are mostly noise due to inexperienced traders not understanding liquidity and price slippage, and they are quickly corrected. The platform has changed a lot over the last year and liquidity has generally improved, so I think the number of those noise spikes has decreased. I'd be curious whether, if you ran the comparison on earlier vs later questions, we'd see a significant difference in relative performance (although it would be hard to distinguish that from random noise from looking at different questions in a different time period).

"it looks like on the set of questions I analyzed Metaculus forecasts tended to update faster."

This is a very interesting observation, and I think it's largely coming from Metaculus having more predictors update their predictions over time than Manifold, on this particular question set. Prediction markets that are well-traded should update much faster than Metaculus because there is a large profit incentive for being the first to update the market price with new information, which typically happens within minutes on the most popular markets, whereas Metaculus only rewards an update to the extent that it affects your time-averaged score. Metaculus's recency weighting works fairly well at updating quickly, but we're usually talking about days, not minutes.

Great analysis, I really enjoyed reading this. I’m also excited to see how Metaculus and Manifold compare on the 2023 ACX predictions! I think that’ll be a great set of identical questions on both platforms that will avoid any selection effect issue of which platform the questions started on.

Is it possible to get rid of the question mode for this post?
