Why AI failed CFOs in finance
The 7 failures every CFO is quietly living with. And the 5-principle architecture the top 7% use instead.
Every CFO has tried Claude on a real finance task.
Almost every one of them is disappointed.
The tool that was supposed to save three hours added a fourth.
Not because it didn’t work. Because it shifted the work. The prep time moved into verification time. The model produced a beautiful paragraph. You spent the next hour checking every number in it. By the time you were done, you might as well have written it yourself.
This is the quiet truth about AI in finance right now.
Not failure but shifted work.
And the reason is simple enough to name.
Finance is deterministic. LLMs are probabilistic. Every hour you spend in that mismatch is an hour the tool has not actually saved you.
Today I’m going through the seven specific places this mismatch shows up. Every one of them is backed by data and named sources.
Then the five principles that separate the CFOs who moved past this from the ones who are still pasting P&Ls into chat windows at 9pm on a Tuesday.
Let’s dive in.
1. Your AI just made up a number. And sounded great doing it.
October 2025. Deloitte Australia delivered a 237-page welfare compliance report to the Australian government. AU$440,000.

A Sydney University researcher found more than 20 fabricated references. A book that does not exist, attributed to a real constitutional law professor. A quote from a federal court judgment that appears nowhere in it.
Deloitte admitted to using Azure OpenAI GPT-4o. Issued a revised report. Refunded part of the money.
Six weeks later, a second Deloitte scandal broke. A 526-page healthcare workforce report for Newfoundland and Labrador. CA$1.6 million. Four fabricated academic citations. Real researchers attributed to papers they never wrote.
Two Deloitte reports. Two hallucinated disasters. Same quarter.
This is Deloitte. Every resource, every reviewer, every process money can buy. They still shipped the mistakes.
The benchmark data says it was always going to happen.
Patronus AI’s FinanceBench tested GPT-4 on questions taken directly from public company filings. Best result with retrieval: 19% correct. An 81% failure rate.
A newer benchmark called FinanceQA tested real investment banking tasks. Best model: 40%.
You are not checking every number. Nobody is. You check the ones that look off. The ones that look right sail through.
And then you are defending a figure to your board that you did not generate, from a tool that cannot tell you where it came from.
In most industries, 95% accuracy is a win.
In finance, 95% accuracy is a resignation letter.
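Run the arithmetic on that claim. A back-of-the-envelope sketch; the 95% per-figure accuracy and the independence of errors are assumptions, purely for illustration:

```python
# Back-of-the-envelope, not benchmark data: assume each AI-produced
# figure is independently correct 95% of the time.
p = 0.95
for figures in (10, 40, 100):
    expected_errors = figures * (1 - p)   # average number of wrong figures
    p_any_error = 1 - p ** figures        # chance of at least one wrong figure
    print(f"{figures:>3} figures: {expected_errors:.1f} expected errors, "
          f"{p_any_error:.0%} chance of at least one")

# 10 figures: 0.5 expected errors, 40% chance of at least one
# 40 figures: 2.0 expected errors, 87% chance of at least one
# 100 figures: 5.0 expected errors, 99% chance of at least one
```

Independence is a simplifying assumption. The direction is not. Per-item accuracy that sounds high compounds into near-certain error at board-deck scale.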
2. Run the same prompt twice. Get two different answers.
Every CFO has now learned this the hard way.
You run a variance on Monday. You run the same prompt on Wednesday. The numbers shift.
And the usual fix, setting temperature to zero, does not work.
A NeurIPS 2025 paper proved it. Even at temperature zero with greedy decoding, the same model on different hardware produces up to 9% variation in accuracy. The culprit is floating-point arithmetic: the same operations, run in a different order on different chips, round differently. No setting you flip gets you out of it.
OpenAI’s own documentation only promises “mostly identical” outputs. One developer tested 10 identical prompts with seed 42 and temperature zero. Result: 50% variability in long-form output.
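You can rerun that experiment yourself. A minimal sketch, assuming the OpenAI Python client and an API key in your environment; the model name and prompt are placeholders, not a recommendation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "Explain the main drivers of a 4% gross margin decline."  # placeholder

outputs = set()
for _ in range(10):
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,   # greedy decoding
        seed=42,         # the "reproducibility" knob
    )
    outputs.add(resp.choices[0].message.content)

# True determinism would print 1. In practice it usually does not.
print(f"{len(outputs)} distinct answers from 10 identical requests")
```

If it prints anything above 1, you have reproduced the problem with every determinism setting the API offers turned on.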
A controller ran an AI agent to categorize 200 transactions against the chart of accounts. First pass: 94% accurate. Second pass, same data, same prompt: 97%. Third pass, the model miscategorized an entire vendor because inference resolved a near-tie between tokens differently. Nothing changed in the input, but the output shifted anyway.
Jason Child, CFO of Arm Holdings, put it cleanly at the MIT Sloan CFO Summit:
LLMs are probabilistic. Finance is deterministic. There’s an answer. An LLM is going to give you the highest probability of what might be the right number. It’s not going to be exactly the right number.
Your board does not want a natural-sounding answer. They want the answer. The same one, every time, by every person who runs the query.
Generic AI cannot give you that. Not at temperature zero. Not with the same seed. Not ever.
3. “Claude told me” is not an audit defense.
June 2025. The UK Financial Reporting Council reviewed AI use across Big Four auditors.

Every firm was using AI tools in audits. None had formally assessed the impact on audit quality.
Nine months later, in March 2026, the FRC issued what it calls the world’s first regulatory guidance on AI in auditing. Audit partners are now personally accountable for AI-assisted work. Including hallucinations.
Your AI makes up a number. Your audit partner signs the opinion. That partner’s career is on the line.
KPMG’s April 2025 survey of 48,000 workers found 53% of employees present AI-generated content as their own work. 46% have uploaded sensitive company data to public AI platforms.
This is why most AI work in finance is happening in private. CFOs use the tool. Then they rebuild the output manually from their source system before anyone senior sees it.
Doing the work twice. Calling it innovation.
4. Your AI can’t keep its own secrets.
On April 1, 2026, Anthropic shipped Claude Code with a source map file attached by mistake.
512,000 lines of internal code. 1,900 files. The full architecture of their flagship coding agent.
A researcher posted the link on X. Tens of millions of views inside a day. Anthropic filed DMCA takedowns on 8,000+ GitHub copies. A Korean developer rewrote the core architecture in Python within hours.
Days before that leak, a separate misconfiguration had exposed 3,000 unpublished Anthropic internal assets. Including draft announcements for an unreleased model Anthropic itself said poses “unprecedented cybersecurity risks.”
If Anthropic cannot protect their own crown jewels, what makes you think their “enterprise-grade security” is protecting your consolidated P&L? Or your M&A model? Or the board memo you pasted in at 9pm on a Tuesday?

The shadow AI data is worse than CFOs admit. 46% of workers have uploaded sensitive company data to public AI platforms (KPMG). 1 in 5 organizations had a shadow AI data breach (IBM). Those breaches cost $670,000 more on average than breaches without AI involvement.
Every CFO knows the small cold moment right after you hit paste.
The one where you think, I probably should not have done that.
5. Your AI has never seen your actual numbers.
Your CEO wants tariff impact by Friday.
Your board wants the last two weeks, not the last quarter.
Half your week is ad hoc questions that did not exist on Monday.
Generic AI cannot help you here. It has no connection to your live data. Every answer is based on whatever you pasted in. Already a snapshot. Already stale.
A typical SAP instance has 50,000+ tables. Even a 200,000-token context window cannot hold a real financial dataset without splitting it. Each split multiplies the error rate.
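The sizing problem is easy to make concrete. A rough sketch; every number in it is an assumption chosen for illustration, not a measurement:

```python
# Rough sizing, all figures assumed for illustration: how many chunks
# does a general-ledger extract need at a 200,000-token context window?
rows = 1_000_000          # assumed GL line items
tokens_per_row = 60       # assumed: ~15 fields at ~4 tokens each
context_window = 200_000
budget_per_call = context_window // 2   # leave room for prompt and answer

total_tokens = rows * tokens_per_row            # 60,000,000 tokens
chunks = -(-total_tokens // budget_per_call)    # ceiling division: 600 calls

per_chunk_accuracy = 0.99  # assumed, and generous
p_all_correct = per_chunk_accuracy ** chunks
print(f"{chunks} chunks, {p_all_correct:.1%} chance every chunk is handled correctly")
# 600 chunks, 0.2% chance every chunk is handled correctly
```

Change the assumptions however you like. Anything below 100% per chunk decays the same way as the chunk count climbs.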
So you rebuild the same dashboard in Excel. Again. For a question that matters for a week and then disappears.
Gartner’s November 2025 data is damning. Finance AI adoption moved from 58% to 59% over 12 months. Essentially flat, after climbing from 37% in 2023.
The hype curve broke.
91% of finance teams report low or moderate impact from AI. Only 7% of CFOs report strong impact from their AI spend.
6. The $30 per-seat tool nobody uses. And the 95% pilot graveyard.
Microsoft Copilot is the poster child.
15 million paid seats out of 450 million M365 users. A 3.3% conversion rate.
Among those with access, only 36% actively use it. 56% of organizations report no significant financial benefit.
Marc Benioff said it on his earnings call in August 2024:
So many customers are so disappointed in what they bought from Microsoft Copilot because they are not getting the accuracy and the response they want.
He later called it Clippy 2.0.
This is not a Microsoft problem. It is systemic.
MIT NANDA’s August 2025 study of 150 executives, 350 employees, and 300 public deployments found 95% of enterprise generative AI pilots deliver no measurable P&L impact.
S&P Global reported 42% of companies scrapped most of their AI projects in 2025. Up from 17% the year before.
Gartner’s October 2025 survey of 506 CIOs found 72% are breaking even or losing money on AI investments.
The promised revolution flatlined.
7. The regulatory bill nobody budgeted for.
August 2, 2026. The EU AI Act’s high-risk provisions take effect.
AI systems for creditworthiness, credit scoring, and insurance risk pricing are specifically classified as high-risk. Compliance obligations cover data governance, record-keeping, transparency, human oversight, quality management, and conformity assessment.
Penalties reach €35 million or 7% of global annual turnover. Whichever is greater.
The SEC is already policing AI washing. Its first Trump-era case, against Nate Inc. in April 2025, brought securities fraud charges over $42 million raised on fabricated AI claims.
The US Treasury published a Financial Services AI Risk Management Framework with 230 control objectives.
Meanwhile, only 40% of S&P 500 companies disclose their AI use at all.
The compliance cost of AI in finance is about to show up on the balance sheet. Most CFOs have not budgeted for it.
The light at the end
Most contrarian pieces stop at the pain.
That is lazy.
AI has not failed in finance. Generic AI has failed in finance. Those are different statements, and the gap between them is wide. The 7% of CFOs Gartner identified are already pulling ahead. They are not using better tools than the other 93%. They are using the same tools inside a different architecture.
Five principles. I have tested each one in real finance workflows.
That piece goes out Sunday.
If this edition hit home, you’ll want to read the next one.
Whenever you’re ready, there are 2 ways I can help you:
1. If you’re building an AI-powered CFO tech startup, I’d love to hear more and explore if it’s a fit for our investment portfolio.
2. Find me on LinkedIn.
I’m Wouter Born. A CFOTech investor, advisor, and founder of finstory.ai