🔮 IT’S OVER
For the first time, an AI has placed in the top 1% of a major forecasting tournament. Here is how it works.
Toby Shevlane is CEO and co-founder of Mantic, a London-based startup working on applying AI to forecasting.
Mantic’s automated system has just placed fourth (out of roughly 500 entrants) in the 2025 Metaculus Fall Cup. This puts it well into the top 1% of human forecasters, and makes it by far the best-performing AI forecasting system that we are aware of.
The Oracle spoke to Shevlane about why most AI forecasters fail to extract value from news, and three places where Mantic thinks Polymarket traders are off base.
This interview has been edited for length. All answers are his own.
Mantic dominated the Metaculus Fall Cup. What separates you from other AI forecasting systems?
We have an edge on scaffolding and data. Separately, we’re also working on reinforcement learning to improve the models.
On data, we take very seriously the idea that to make a good prediction, you have to be well informed. We have someone on staff whose whole job is adding more data sources. We have tens of different sources: Wikipedia, news, country-level economic data, population and migration data, company-level financial data, earnings calls.
We don’t use Google Search or Perplexity. That’s actually a disadvantage, and we sometimes miss things because of it. But we need to be able to backtest, to run experiments where we ask questions from the perspective of six months ago and see how we would have done. You can’t do that with Google Search because you can’t see what it would have told you six months ago.
In finance they call this “point in time data.” You need data sources you can roll back without any revision. Google and Perplexity don’t have that property.
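The point-in-time idea can be sketched in a few lines. This is a minimal illustration, not Mantic’s pipeline: the `Document` type, the `visible_as_of` helper and the sample records are all hypothetical, and it assumes each stored item keeps its original publication date and its original, unrevised text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    published: date  # date the item first became available
    text: str        # text as originally published, never revised


def visible_as_of(docs, as_of):
    """Return only the documents a forecaster could have seen on `as_of`."""
    return [d for d in docs if d.published <= as_of]


# Hypothetical archive used for a backtest.
docs = [
    Document(date(2025, 3, 1), "early report"),
    Document(date(2025, 9, 1), "later report"),
]

# Ask the question "from the perspective of" June 1, 2025.
snapshot = visible_as_of(docs, date(2025, 6, 1))
print([d.text for d in snapshot])
```

A live search engine cannot offer this kind of `as_of` cutoff, which is the property the backtest depends on.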
What do you mean by “scaffolding”? Can you break down how Mantic approaches a new forecasting question?
Mantic doesn’t work as a single call to a language model. We make many, many different calls during the workflow. You can imagine a factory line with lots of different workers doing different jobs: breaking down the question, doing research, pursuing different lines of inquiry, then bringing it all together into a clean, well-informed prediction.
Take the Greenland acquisition market. The classic base rates approach would be to look at how often the US has acquired Greenland before. That’s never happened. So you might use Laplace’s Rule, which is a fancy way of saying, “It’s been a long time and this has never happened, so it probably won’t happen soon.”
But we go further. We look into the history of US-Greenland relations, hunt for analogous cases where one country acquired another, and try to learn lessons. I like to start specific, then zoom out. If you just use “the US acquiring any territory” as your reference class, you’d get an overestimate.
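For readers unfamiliar with Laplace’s Rule, the baseline mentioned above can be written out. Under the rule of succession, after `trials` observations with `successes` occurrences, the estimated chance of an occurrence on the next trial is (successes + 1) / (trials + 2). The sketch below treats each year as one trial; the 100-year window is an illustrative assumption, not a figure from the interview.

```python
from fractions import Fraction


def laplace_next(successes, trials):
    """Rule of succession: estimated chance of the event on the next trial."""
    return Fraction(successes + 1, trials + 2)


def laplace_within(trials_without_event, horizon):
    """Chance of at least one event in the next `horizon` trials,
    given none occurred in `trials_without_event` trials (uniform prior)."""
    no_event = Fraction(trials_without_event + 1,
                        trials_without_event + horizon + 1)
    return 1 - no_event


# Hypothetical: 100 years observed, zero acquisitions.
years_observed = 100
print(laplace_next(0, years_observed))       # chance in the next year
print(laplace_within(years_observed, 5))     # chance within the next five years
```

The longer the event-free record, the smaller the estimate, which is the “it probably won’t happen soon” intuition in numerical form.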
You mentioned other AI systems add news but it doesn’t help them. What are they doing wrong?
Sometimes people say, “I accidentally didn’t include news in my pipeline, but my score didn’t go down.” That is crazy, because news is often key information. The fact it’s not helping is a really bad sign.
In the early days, we found similar results. But that’s completely not true now. You need to figure out how to best process the information. It’s not obvious how to use a news article to make a better prediction.
There’s this whole political theme of “the news media is biased, so don’t trust it.” I think that’s a different trap from freaking out about headlines constantly. You shouldn’t fall into either. Just because news has bias doesn’t mean there’s no information to be found.
One insight is that it helps to just hoover up as much information as possible. Humans want to be efficient with their time, so they only look at trusted sources. But AI doesn’t have that limitation. It can read a lot of stuff and extract anything useful.
What about the “wisdom of crowds” approach of ensembling different models?
There’s a paper called “The Wisdom of the Silicon Crowd” that uses tens of different language models and takes an average. I’m slightly skeptical that’s the best approach. Most models aren’t very good at forecasting, so you’re bringing down the average.
We try to work out the perfect recipe. Is it the same model with different prompts? Different models together? What’s the right number? Part of the lesson is that using frontier models is definitely a good idea.
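The concern about averaging can be shown with a toy example. The model names and probabilities below are invented for illustration, and a single question proves nothing on its own; the sketch only shows how uninformative members pull a simple mean toward 50%.

```python
from statistics import mean

# Hypothetical probabilities from five models for an event that did occur.
forecasts = {
    "frontier_a": 0.80,
    "frontier_b": 0.75,
    "weak_a": 0.50,
    "weak_b": 0.50,
    "weak_c": 0.50,
}


def ensemble(forecasts, members=None):
    """Average the selected members' probabilities (all members by default)."""
    names = members if members is not None else list(forecasts)
    return mean(forecasts[name] for name in names)


everyone = ensemble(forecasts)
frontier_only = ensemble(forecasts, ["frontier_a", "frontier_b"])
print(f"all models: {everyone:.3f}, frontier only: {frontier_only:.3f}")
```

Choosing which members to include, and how many, is the “recipe” question described above.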
Are there question types where Mantic performs better or worse?
We probably aren’t very good at sports. We haven’t invested in that because it’s not our commercial focus.
The Pope selection is a classic hard problem. All the politics happens behind the scenes. Mantic isn’t magic. If there’s no information coming out of the Vatican, we can’t somehow know who’s in the ascendancy.
But one thing we’re quite good at is not giving too confident an answer when we don’t know. In the Dutch election recently, we picked up a lot of points not because we predicted the winner confidently, but because we were better calibrated than the human community. We didn’t take an overconfident position.
Can you give an example where that caution paid off?
The Japanese leadership election at the end of last year. Polymarket and the options markets had a lot of weight on one particular candidate. But Takaichi won, and she hadn’t been the front-runner.
When I looked back at what Mantic would have said, it was giving her much more weight than the markets were. Just seeing everyone else run off a cliff and knowing to step back and be more cautious can be very helpful.
We pick up points for being confidently right on some questions, but we also pick up a decent amount for being cautious where caution is needed.
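Why caution can earn points is easiest to see with a scoring rule. The sketch below uses a plain logarithmic score as an assumption; the tournament’s actual scoring formula is not described here, and the candidates and probabilities are hypothetical.

```python
import math


def log_score(forecast, winner):
    """Natural log of the probability given to the actual winner (higher is better)."""
    return math.log(forecast[winner])


# Hypothetical three-candidate race with "A" as the front-runner.
overconfident = {"A": 0.80, "B": 0.10, "C": 0.10}
cautious = {"A": 0.50, "B": 0.30, "C": 0.20}

for winner in ("A", "B"):
    print(winner,
          round(log_score(overconfident, winner), 3),
          round(log_score(cautious, winner), 3))
```

With these numbers, the cautious forecast gives up about 0.47 if the front-runner wins but gains about 1.10 if candidate B pulls off the upset, so a forecaster who hedges loses a little when the crowd is right and gains a lot when it is wrong.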
Have you used Mantic to trade on Polymarket or other prediction markets?
We haven’t set a bot loose on Polymarket. It would be fun. But to Polymarket’s credit, the reason AIs don’t do really well is that the markets already do a really good job of finding accurate probabilities. It’s a very hard benchmark.
The highest-value use case for us right now is working with traders on traditional financial markets, helping them predict events that are upstream of price changes. Like the Japanese leadership election: different leaders might have different fiscal policy, which affects bond yields. If you can have an edge predicting these key events, it helps a lot. But it’s mediated through the skill of human traders rather than plugged directly in.
Mantic’s Forecasts
The following are some of Mantic’s forecasts, along with its reasoning. The “US strikes Iran” forecasts are current as of January 21; the others as of January 13.
Jan 23rd: <1%
Jan 31st: 5%
Feb 28th: 39%
March 31st: 52%
July 1st: 63%
The high probability is driven by the ongoing crisis in Iran, where a violent state crackdown on nationwide protests has led the US to threaten military intervention on humanitarian grounds. The US is actively moving military assets, including an aircraft carrier strike group and strategic bombers, into the region after briefly pausing an attack in mid-January 2026.
The direct US airstrikes on Iranian nuclear facilities in June 2025 (“Operation Midnight Hammer”) established a recent precedent for large-scale kinetic action, lowering the threshold for a future strike. Key restraining factors are pressure from regional allies, including Israel and Gulf states, to delay or avoid military action due to fears of a wider conflict and significant Iranian retaliation.
Jerome Powell federally charged by June 30, 2026
🔮 Mantic: 26% | Polymarket: 12%
The Department of Justice is actively investigating Powell, having served the Federal Reserve with grand jury subpoenas on January 9, 2026, concerning his June 2025 congressional testimony.
An indictment could provide the “for cause” justification required by the Federal Reserve Act to remove Powell before his term as Chair expires in May 2026. Charging a sitting Federal Reserve Chair is unprecedented and faces major obstacles, including bipartisan political opposition and the high legal bar for proving perjury. Recent administration-led prosecutions of other political figures have faced significant legal challenges, suggesting a difficult path for any indictment.
The US acquires part of Greenland before 2027
🔮 Mantic: 8% | Polymarket: 25%
Risk is elevated by a recent, aggressive US push. The Trump administration appointed a special envoy in December 2025 and has publicly threatened force.
But a successful acquisition faces strong, unified opposition. The Greenlandic and Danish governments have repeatedly stated Greenland is not for sale, and a 2025 poll shows 85% of Greenlanders oppose joining the US. Key European NATO allies have issued joint statements supporting Danish sovereignty. Any unilateral action would risk significant diplomatic damage.
The most plausible path is a treaty creating a “Guantanamo-style” enclave with exclusive US jurisdiction. This is constrained by the current lack of Danish and Greenlandic consent.
Khamenei out as Supreme Leader of Iran by January 31, 2026
🔮 Mantic: 17% | Polymarket: 20%
Risk is driven by intense, nationwide protests that began in late December 2025 following a major currency collapse. The IRGC has been placed on its highest alert level.
Khamenei’s advanced age of 86 creates a persistent underlying risk of death or incapacitation. Media reports citing intelligence sources indicate the existence of a “Plan B” for him to flee Iran.
The regime’s survival currently depends on security force loyalty. As of now, there are no confirmed reports of significant defections from the IRGC or regular military. A plausible pathway to removal in this short timeframe involves a sudden health event or a “soft coup” by security elites.
The Supreme Court rules in favor of Trump’s tariffs
🔮 Mantic: 26% | Polymarket: 27%
During November 2025 oral arguments, justices across the ideological spectrum expressed skepticism that IEEPA provides explicit authority for the executive to unilaterally impose tariffs, a power the Constitution reserves for Congress.
The case involves an estimated $150 billion in collected duties, likely triggering the Major Questions Doctrine, which requires clear congressional approval for executive actions with vast economic and political significance. The US Court of Appeals for the Federal Circuit has already ruled the tariffs unlawful.
Disclaimer
Nothing in The Oracle is financial, investment, legal or any other type of professional advice. All odds and forecasts are time sensitive and subject to change. Anything provided in any newsletter is for informational purposes only and is not meant to be an endorsement of any type of activity or any particular market or product. Terms of Service on polymarket.com prohibit US persons and persons from certain other jurisdictions from using Polymarket to trade, although data and information is viewable globally.





