Bianca Botes, Managing Director at Citadel Global
The financial industry has spent the better part of a decade automating everything it can. AI systems built on algorithms and natural language processing ingest, analyse and interpret large volumes of data; they can draft memos, flag fraud, and process earnings transcripts at a speed no team of analysts can match. Yet while AI has embedded itself across the financial industry faster than most people expected, there is one place it keeps coming unstuck, and it is the one that matters most – actually trading.
Failing the test
Live experiments around how AI performs as a trader are now providing data on the issue rather than opinion. In one of the more watched competitions, eight frontier models, including Anthropic’s Claude, Google’s Gemini, OpenAI’s ChatGPT, and xAI’s Grok, were each given $10,000 and let loose to trade in US tech stocks for two weeks. The combined portfolio lost roughly a third of its capital, and out of 32 individual result sets, a model finished in profit only six times. A separate exercise tracking 11 markets-related competitions found the median model profitable in only two of them. These results suggest that most AI trading systems will lose money when tested in real conditions.
Understanding AI limitations
How AI fails matters. The models size their positions badly, and their entry and exit timing is off. When different models receive identical instructions, they produce meaningfully different decisions: one model, for example, placed 1,418 trades using a prompt that saw another model place only 158. Where one model may be stubbornly long, another will be comfortably short, and a third drawn to leverage. None of these actions reflects a reading of the market. They reflect whatever tendency got baked into each model during training, which is a different thing entirely, and an expensive one when real money is involved.
What must be remembered is that the models do not lack data or processing power; they lack the ability to answer the forward-looking questions that markets demand. A large language model learns from what has already happened. It finds patterns in historical text and produces outputs consistent with those patterns. That works for tasks with a defined structure, like summarising a document, flagging an anomaly, or predicting whether an earnings estimate will be revised up or down. It breaks down, however, when the question is open-ended and the answer depends on a context that has never existed in quite this form before.
Trading dynamics
Markets produce that kind of question constantly. While there are mechanical elements like pricing relationships, flow dynamics, and rate arithmetic, where systematic tools perform well and always have, what actually moves prices is usually not mechanical. Markets are about positioning, about whether a trade is crowded, about how participants respond when something they believed turns out to be wrong. Reading those nuances requires reasoning in an environment of genuine uncertainty, where information is incomplete and the variables keep shifting depending on what else is happening. A model trained on historical data cannot reliably do this, and scaling the model does not fix it.
There is also a problem with how AI trading performance gets evaluated. Any model tested today on how it would have handled March 2020 already knows what March 2020 looked like. The outcome is in the training data. Real markets do not provide that luxury. The only valid test is a live one, and live tests have not been encouraging. Firms that have found something workable are using these tools for contained, specific jobs – earnings direction prediction, data processing, and research synthesis – where historical patterns remain a reliable guide. That is genuinely useful work. It is, however, not what most people mean when they talk about AI replacing the fund manager.
Trading needs a human touch
Markets are not fully reducible to algorithms. The science behind them is real and rewards good tools. But underneath the models and the data is a system driven by human decisions – made under stress, with incomplete information, shaped by incentives and feedback loops that behave differently every time. Working through that does not come from a formula. The machines handle the structured part well. The rest still needs a person in the room.
ENDS