I've been hearing about the need for useful LLM benchmarks, and it seems to me that setting up an LLM to trade on a prediction market (e.g. Polymarket, the platform I'll talk about here) could be a particularly strong benchmark with a number of desirable properties.

How I envision this working

All you really need for an LLM to trade on Polymarket is the ability to search the internet for relevant information, the ability to keep notes, and the ability to run Python code (which enables trading on Polymarket through an API). My imagined setup is one where you have multiple LLM instances (e.g. GPT-4o) working in tandem in order to function as a successful trader. For example, one instance would be responsible for overall strategy, another for researching markets and events on the internet, and another for actually executing the trades through the API. During runtime, these LLM instances would also be able to contribute to a corpus of information that they can refer back to, whether market-specific info or info pertaining to general trading strategy (e.g. avoid markets of type X as these are really toxic).
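Concretely, the division of labor above could be sketched as follows. This is only a structural sketch: the role functions, the market data and the helper names are all invented for illustration, and real versions would wrap LLM calls, a web-search tool and the Polymarket trading API rather than the stub logic shown here.

```python
# Sketch of the proposed multi-instance setup. The three roles
# (strategist, researcher, executor) are plain functions here; in a real
# system each would be a separate LLM instance with its own prompt, and
# execute steps would call search tools and the Polymarket API.

notes = []  # shared corpus all instances can read and append to

def researcher(market):
    """Gather evidence for a market; a real version would search the web."""
    # Stubbed probability estimate, for illustration only.
    return {"our_prob": 0.6}

def strategist(market, research):
    """Turn research into a decision: (side, estimated probability), or None."""
    # Placeholder logic: buy YES if our estimate exceeds the market price.
    if research["our_prob"] > market["price"]:
        return ("YES", research["our_prob"])
    return None

def executor(decision, market):
    """Record the trade; a real version would submit an order via the API."""
    notes.append(f"traded {decision[0]} on {market['id']} at {market['price']}")
    return True

# One pass of the loop on a single (hypothetical) market.
market = {"id": "french-election-2024", "price": 0.45}
research = researcher(market)
decision = strategist(market, research)
if decision:
    executor(decision, market)
```

The shared `notes` list stands in for the corpus of market-specific info and trading heuristics that the instances would build up over time.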

It's not clear to me whether GPT-4o is capable of reliably devising and implementing a trading strategy (let alone making a profit), but I certainly think it's capable of doing all the sub-tasks that make up this bigger task. For example, I just asked it about a specific Polymarket contract to do with the upcoming French election and it gave me some decent analysis and what is IMO a pretty reasonable estimate for the probability associated with that contract. I do think that we should easily be able to have an LLM operate as a prediction market trader within one year, whether through advances in scaffolding and chaining LLM instances together, or through 'regular' capability advances.

Desirable properties of a prediction market benchmark

  1. Prediction markets require general intelligence. Prediction market contracts vary significantly in their domain and the skills required to win: on Polymarket alone, you can trade on celebrity / pop culture events, elections, sports, geopolitics, scientific advances and economic data. Not only does profitably trading these contracts require understanding of the specific domains and all the background knowledge associated with them, but it also requires the ability to do probabilistic reasoning, the ability to critically assess and weigh up different pieces of evidence, as well as successful trading execution. Based on this, I think such a benchmark can be said to be assessing a kind of 'general intelligence'.
  2. Despite 1), the barriers to entry are low. As I mentioned above, an LLM really doesn't need much more than access to Python and Google search to get started on a prediction market.
  3. Total profit-and-loss (PnL) presents a clear, comparative and intuitive performance metric. One issue I see with some benchmarks is the somewhat arbitrary or necessity-driven choice of a scale from 0% to 100%, which often makes comparing performances on these benchmarks unintuitive. What I like about PnL is that it's essentially unbounded on the upside, and it comes with a natural zero-point which roughly equates to "performing at the average level of all traders on this platform". My background in prediction markets may be biasing me here, but I also find "this LLM made a 30% return on Polymarket in five months" a lot clearer and more visceral, in terms of what it tells us about AI capabilities, than "this LLM scored 93% on a Chemistry assessment benchmark where the Chemistry PhD student average score is 76%". To expand on this, I think that a lot of people have experience with the task "try to make money in some vaguely speculative way", whether that's buying stocks, owning crypto or betting on stuff, whereas almost no one has experience with many of the benchmarks that we're using to assess AI today. Insofar as benchmarks are useful for making clear to the public and policymakers what AI today is capable of, I think people are likely to grasp the significance of an AI making a profit more easily than performance on other benchmarks.
  4. This benchmark won't saturate for a long time. Given that PnL has unbounded upside, we won't see the pitfall where LLMs approach 'maximum' performance on a benchmark and it ceases to be a useful metric. I can imagine a well-designed prediction market benchmark continuing to be useful through multiple leaps in capabilities. One downside is that having two or more LLMs trade on the same platform at once 'poisons' the benchmark in some sense, since they will be competing with each other. However, I think this is definitely a surmountable difficulty, particularly through the use of more complex metrics than just PnL.
  5. A prediction market benchmark enables assessment of how model capabilities grow with scaffolding and/or the ability to learn over time. With an appropriate setup, we may see a scenario where a frontier LLM is able to successfully execute trades on Polymarket, it starts out unprofitable but is able to build up a repository of knowledge and trading heuristics that it eventually uses to consistently turn a profit. I think this kind of self-improvement is an important thing to assess on, and there's value in doing this assessment in a live continuous environment like a prediction market that runs 24/7, rather than at a series of discrete assessment points.
  6. Prediction markets have enough slack for this benchmark to be useful early on. One question might be: why not do this for financial markets? They're much bigger, far more liquid, far more important in many senses. My response would be that financial markets are too efficient. The vast majority of humans today can't consistently turn a profit on the financial markets with just access to the Internet, so I don't expect LLMs to be able to do this at anything like the current level of capabilities. A basic requirement for a benchmark like this is that the most profitable strategy that the LLM is capable of executing has to be something other than "don't ever trade". My claim is that prediction markets are often extremely inefficient in pretty obvious ways, and that even GPT-4o should be able to identify at least one trade that is positive-EV. I also, less confidently, claim that ~GPT-5 should be able to make some good trades, some bad trades, and learn over time which types of markets it's more profitable in, potentially even carving out a niche for itself in a specific area of Polymarket.
  7. There's no way to game a prediction market benchmark. The nature of prediction markets means that this benchmark would not be vulnerable to criticisms such as the assessment mechanism being part of the training data. More generally, since the benchmark is about making real money, if an LLM finds a way to do better than it 'should', you have an amazing money-making opportunity on your hands, and even inefficient markets like Polymarket shouldn't let such opportunities persist for long.
  8. It's pretty different to other benchmarks. My (pretty uninformed) sense is that a lot of benchmarks are correlated with each other in terms of model performance, and I think there's value in trying to design benchmarks that are distinct from what's already out there. It's definitely possible that a prediction market benchmark ends up being highly correlated as well, but I don't think it's a foregone conclusion.

Conclusion

I think that a prediction market benchmark is highly achievable (I don't have the time or relevant skills to start making one myself, but would be happy to help someone who does) and would carry a number of benefits relative to the benchmarks that are currently being designed. 

Comments

An important drawback is that the difficulty of beating the market fluctuates depending on factors such as who the other traders are, and what kind of news is moving the markets.