I built an AI financial analyst to replace me. Here's where it broke.
For over ten years of my career, I built financial models, analyzed SaaS businesses, and presented insights to executives. I also led teams to do the same. There's a lot of noise right now about AI replacing people doing knowledge work like this. I wanted to find out for myself what AI can do today, and what it means for the rest of us.
Last weekend, I spent four hours training an AI to do my job. I gave it everything it needed to run a full financial analysis and a board-ready dashboard for a hypothetical SaaS company.
The TLDR is this: the AI can do a lot. It fell short in some areas, but it surprised me in others. Human analysts still have a role, but it won't look the way it used to.
The experiment
For this experiment, I created a fictitious SaaS transactions database with bookings details, start and end dates, and customer names. (MangoDB and LinkedOut made the list!)

To give the AI the best chance of success, I spent the first hour writing detailed instructions as if I was onboarding package for a new analyst. I defined the scope: bookings analysis, ARR waterfall, net retention rate, customer analysis, sales productivity, forecasting, and a polished dashboard for a board review. I gave the AI a persona: Alex Chen, FP&A manager at AlphaTech. I included materials from my past training workshops on SaaS financial analysis so it had domain context before we started.


Then we began. I opened with the simplest request: create an Excel workbook with quarterly trended bookings by product, by region, and by top customer. It took five minutes to run.

Claude's first pass wasn't bad. It gave me a reasonable executive summary and tables formatted the way I like. But I hit a problem right away. A human analyst would have built a pivot table, giving me a file I could follow. Claude doesn't work that way. It ran a Python script and produced the output as hard-coded numbers, with no way for me to trace any figure back to the source database.

This pointed to a bigger issue. With a human analyst, you assume some level of accountability and integrity. But with an AI, I have no clue whether it's just making things up. I ended up in a broader conversation about what responsible behavior means here, and we agreed on a few critical ground rules: flag data issues, state assumptions, never invent numbers.

I then asked Claude what it could do to help me validate the hard-coded output. We agreed on two new tabs: one with the raw data, one with formula-driven checks using SUMIFS so I could confirm every total tied back to the source.

The second version looked better. I could trace every figure back to the source data and felt comfortable.

We worked through a few minor iterations—tightening the executive summary, corrected a circular reference error, fixed formatting. Before moving on, I asked it to codify everything it had learned into CLAUDE.md so I wouldn't have to repeat any of it going forward. This Cowork capability to retain memory makes it 10x more useful than chat. I'll write a separate post on how it works. Subscribe here if you want to an update on that.

More problems as the work becomes complex
The ARR waterfall analysis was next. It required deriving new fields not in the original dataset. Claude confirmed the metric definitions with me before running anything, which was helpful.

A few minutes later it produced the correct output and ran the verification checks we'd established earlier.

Then I hit another problem: Claude struggled with Excel charts. Even with a precise template as a reference, the output was messy, with dark font on dark bars and inconsistent formatting that I wouldn't put in front of an executive.

I asked Claude to delete all the charts. But now the tables were worse than before: formula-driven values had reverted to hard-coded, and validation cells changed from green to yellow for no good reason.

This became a persistent issue for the rest of my experiment: whenever I iterate on the same Excel file more than a few rounds, the output degrades. Starting a fresh version after round three or four is often faster than sticking to the same session.
The net retention rate analysis had the same problem. Complex cohort calculations made Claude to revert to hard-coded values again, even with an explicit rule against it in my instructions.

The customer analysis and sales productivity analyses were straightforward so they ran without issues. Then something interesting happened when I got to the ARR forecast. Claude noticed that recent trends were pointing down and proactively recommended building a bear case alongside the base case, without my asking.

The instinct was helpful, but the execution wasn't great. It built the scenario without confirming the approach with me first, and the bear case came out hard-coded (yes, again!) with no stated assumptions. I asked Claude to remove it and give me a simple input-driven model instead. It did.

The last task was the one I was most curious about how the AI would do: an executive dashboard. I gave Claude screenshots of dashboards I had built in the past as a reference and asked it to match the style.
Claude took almost ten minutes to run the dashboard in an HTML file. The output was better than I expected. The charts were clean and the forecast periods were formatted in a lighter shade. It was visually pleasing.

I worked through a few iterations: trimmed the bullet points, rearranged some widgets, added an executive summary at the bottom with three buckets: Highlights, Watch, and Lowlights. The final version looked polished enough to put in front of a board.

I added Claude to add data labels to each bar so I didn't have to hover over each one to view the value. Claude replied that that would simple and it had executed. But when I opened the file, nothing had changed.


I asked it to try again. This time, the bars disappeared completely!

I told Claude that if something wasn't possible, it should say so. It admitted it might not be possible due to tool limitations. I was disappointed, but the final dashboard was good enough, so I moved on.

As a final impromptu request, I asked for a PowerPoint deck at the same level of polish. The results were just as good as the HTML dashboard. It even gave me labeled bar charts this time.



What I learned
A few learnings are clear by the end of the experiment.
What goes in determines what comes out. The AI's output was solid because I provided specific resources upfront: detailed instructions, metric definitions, and training materials from years of workshop work. When I skipped that and shifted to impromptu prompting, I paid for it in more time and extra iterations. The same was true for examples. I gave the AI dashboards I'd built before and asked it to match the style. That was faster than trying to describe what I wanted. It's the same as working with humans: the better the examples, the better the output.
The work shifts toward judgment. The AI handled execution in a fraction of the time: cleaning data, formatting documents, iterating on comments. Work that is purely data manipulation is most vulnerable to AI replacement. What I spent more time on was deciding what to build, evaluating whether the output was ready to show, and pushing back when something looked off. Answer those questions became the work. Visualization is another area where AI still struggles: it could create a correct ARR chart, but it couldn't tell when a chart wasn't good enough for an executive presentation, or where the CFO may challenge me. That kind of taste comes from experience. It doesn't transfer well through prompting or text instructions.
The AI sounds confident even when it's wrong. Incorrect or made-up numbers are unacceptable, so I set non-negotiable ground rules early like formula-driven validation and no hard-coded figures. The model still broke the rules when the calculations got complex. It told me the data labels were fixed, but when I opened the file, nothing had changed. Ground rules help, but they don't replace careful human review.
AI saves time, but the net savings is smaller than it looks. The model handled data processing much faster than I could on my own. But I still had to verify what might have gone wrong, correct it when something broke, and rewrite instructions so it didn't break the same way again. Each iteration took five to ten minutes to run. Too much iterations could make the output worse. And some tasks are still best for a human to complete.
Memory makes the system compound. I started this experiment from scratch. By the end, every lesson from the iterations was captured in CLAUDE.md. After round one, I deleted all the output files and ran it again with simple prompts like: "Run the bookings analysis for the quarter." The second run was much cleaner than the first.