The market is the ultimate test for AI.
Written by: Juan Galt
Translated by: AididiaoJP, Foresight News
Can AI trade cryptocurrencies? Jay Azhang, a computer engineer and finance professional from New York, is testing this question through Alpha Arena. The project pits the most powerful large language models against each other, each with $10,000 in capital, to see which can make more money trading cryptocurrencies. The contestants are Grok 4, Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-5, DeepSeek V3.1, and Qwen3 Max.
You might be thinking, "Wow, what a brilliant idea!" And you may be surprised to learn that, at the time of writing, most of the six AIs are in a loss position, while Qwen3 and DeepSeek, two open-source models from China, are leading.

That's right, the most powerful, closed-source, proprietary AIs operated by Western giants like Google and OpenAI have lost over $8,000—80% of their crypto trading capital—in just over a week, while their open-source counterparts from the East are in profit.
The most successful trade so far? Qwen3 has remained profitable and continues to make gains simply by holding a 20x long position on bitcoin. Grok 4, unsurprisingly, spent most of the competition going 10x long on dogecoin, at one point sharing the top spot with Deepseek, but is now close to a 20% loss. Maybe Elon Musk should post a dogecoin meme or something to help Grok out of trouble.
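To see why 20x leverage makes these positions so fragile, here is a back-of-the-envelope sketch. The numbers in the example (a 3% move, the liquidation approximation) are illustrative assumptions, not figures from the competition, and fees and maintenance margin are ignored.

```python
# Illustrative leverage arithmetic (assumed numbers, not from the article):
# a 20x position multiplies every price move by 20.
def leveraged_pnl(capital, leverage, price_change):
    """Profit or loss on a leveraged position for a fractional price change."""
    return capital * leverage * price_change

capital = 10_000   # each model's starting stake
leverage = 20      # Qwen3's reported bitcoin long

print(leveraged_pnl(capital, leverage, 0.03))   # a +3% BTC move
print(leveraged_pnl(capital, leverage, -0.03))  # a -3% BTC move

# Roughly, an adverse move of 1/leverage (5% here) wipes the whole stake,
# which is why one bad day can undo a week of gains.
print(1 / leverage)
```

This is why the article later stresses risk-adjusted returns: at 20x, a single 5% drawdown in bitcoin is enough to zero the account.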

Meanwhile, Google’s Gemini has been relentlessly bearish, shorting all tradable crypto assets—a stance that echoes their overall crypto policy over the past 15 years.
In the end, it made every possible wrong trade for a whole week straight. It takes skill to do that badly, especially when Qwen3 is simply going long on bitcoin. If this is the best closed-source AI can offer, maybe OpenAI should stay closed-source to spare us the losses.
The idea of pitting AI models against each other in a crypto trading arena offers some genuinely profound insights. First, a model cannot absorb the answers to this test during pre-training, because the market is unpredictable; this is precisely the contamination problem that plagues other benchmarks. Many AI models effectively see some of the test answers during training, so they naturally perform well at evaluation time, and research has shown that slight changes to these tests can produce huge swings in benchmark results.
This controversy raises a question: What is the ultimate test of intelligence? According to Grok 4’s creator and Iron Man enthusiast Elon Musk, predicting the future is the ultimate measure of intelligence.

And we have to admit, there’s nothing more uncertain than the short-term price of cryptocurrencies. In Azhang’s words, “Our goal with Alpha Arena is to make benchmarking closer to the real world, and the market is perfect for this. Markets are dynamic, adversarial, open-ended, and always unpredictable. They challenge AI in ways that static benchmarks cannot. The market is the ultimate test for AI.”
This insight about markets is deeply rooted in the libertarian principles that gave birth to bitcoin. Economists like Murray Rothbard and Milton Friedman argued decades ago that markets are fundamentally unpredictable by central planners, and that only individuals who must bear their own losses can make rational economic decisions.
In other words, the market is the hardest thing to predict because it depends on the personal views and decisions of intelligent individuals around the world, making it the best test of intelligence.
Azhang mentions in his project description that instructing AI to trade is not just about profit, but also about risk-adjusted returns. This risk dimension is crucial, because a single bad trade can wipe out all previous gains, as seen in Grok 4’s portfolio collapse.
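The standard way to quantify risk-adjusted return is the Sharpe ratio: mean excess return divided by the volatility of returns. The sketch below uses made-up daily return series, not any model's actual trades, to show how two portfolios with the same average gain can have wildly different risk profiles.

```python
import statistics

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the standard deviation of returns."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

# Hypothetical daily returns: both series average +1% per day.
steady = [0.01, 0.012, 0.008, 0.011, 0.009]   # small, consistent gains
wild   = [0.15, -0.12, 0.20, -0.18, 0.00]     # same mean, huge swings

print(sharpe_ratio(steady))  # high ratio: little risk taken per unit of return
print(sharpe_ratio(wild))    # low ratio: the gains came with violent drawdowns
```

By this measure, a leveraged dogecoin bet that happens to be up can still score worse than a smaller, steadier position.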
There's another issue: whether these models learn from their trading experience. Technically, this is hard to achieve, because the initial pre-training of AI models is extremely costly. They can be fine-tuned on their own trading history or on others', and can even keep recent trades in short-term memory or the context window, but that only goes so far. Ultimately, the right AI trading model may have to truly learn from its own experience, a capability recently announced in academia but still a long way from becoming a product. MIT calls these self-adapting language models.
Another way to analyze the project and its results so far is that they may be indistinguishable from a "random walk." A random walk is like rolling dice for every decision. What would that look like on a chart? There are simulators you can use to answer this question, and in fact it wouldn't look much different.

The role of luck in markets has also been described in detail by intellectuals like Nassim Taleb, notably in "Fooled by Randomness." He argues that, statistically, it is completely normal for a trader, say Qwen3, to be lucky for a whole week straight, making it appear to have extraordinary reasoning ability. Taleb's point goes further: with enough traders on Wall Street, one of them can easily get lucky for 20 years straight, building a godlike reputation, with everyone around convinced this trader is a genius, until the luck runs out.
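Taleb's survivorship argument can be made concrete with basic probability. The numbers below are invented for illustration: assume each trader independently has a 50% chance of beating the market in any given year, and ask how likely it is that at least one of N traders strings together a perfect 20-year run.

```python
# Back-of-the-envelope survivorship math with made-up parameters.
def prob_some_streak(n_traders, years, p_win=0.5):
    """P(at least one of n_traders wins `years` periods in a row)."""
    p_streak = p_win ** years                  # one trader's perfect run
    return 1 - (1 - p_streak) ** n_traders     # at least one such run

print(prob_some_streak(10_000, 20))     # a large firm: small but real odds
print(prob_some_streak(1_000_000, 20))  # a whole industry: better than even
```

With a million coin-flipping traders, a 20-year "genius" is more likely than not to exist, which is why a single week of Alpha Arena results proves very little.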
Therefore, for Alpha Arena to produce valuable data, it needs to run for a long time, and its patterns and results need to be independently replicated, with real capital at risk, before they can be considered distinguishable from a random walk.
So far, at least, it has been fascinating to watch open-source, cost-effective models like DeepSeek outperform their closed-source counterparts. Alpha Arena has been a great source of entertainment, going viral on X.com last week. No one can predict where it goes next; we'll have to see whether the gamble its creator took, giving six chatbots $60,000 to speculate on crypto, ultimately pays off.