I build AI products that can't lie about being done

Your AI-built product says it works. The demo runs, the checklist is green. But under that sits a quieter question no one answered for you: do you actually know it works, or do you just know that something told you it does? Most AI code is trusted because it looks finished. I build on the opposite assumption.

That assumption isn't a pose. It's the lens I was trained to see through. Two disciplines shaped how I think. Public administration, where I hold a master's and trained inside government, is the study of getting an apparatus to produce a reliable outcome without doing every task yourself. And auditing, which I did in consulting, is the opposite reflex: trust the evidence, never the assurance. Orchestrating AI to make software turns out to be exactly those two, pointed at a new kind of workforce: govern the agents that do the work, then prove what they produced is real. Directing the build is the easy half. Proving that "done" is true, not reported and not assumed, is the whole job, and the half almost everyone skips.

It says it's done. Is it?

I found something uncomfortable while building: you can't ask an AI whether it finished. Its confidence that the work is done lines up with reality less often than a coin flip. So I stopped trusting the report, its and my own. On my work, "done" is not a claim anyone makes. It is computed from things that can't be talked into existence:

the real commit, passing tests, the actual action completing through the actual screen

And the thing that grades the work is never the thing that did the work. You get a product whose "finished" is a fact, not a feeling.

The fear this lifts: I can stop re-checking whether they're lying to me about "done."

Perfect in the demo. Dead on real users

The first 80% of an AI build comes together in an evening and looks flawless. Then real people arrive and the last 20% collapses. The cause is almost always the same: it works if you jump straight to the right screen, and a real user never does that. So I treat two things as the real finish line, both of which a demo skips.

the main action has to work through the real, rendered interface, not in isolation, and a brand-new user has to reach the feature by the path people actually take

A shortcut straight to the screen does not count as proof. I break the product the way your users will, before they show up to do it for me.

The fear this lifts: it won't fall apart the moment it meets real people.

Read: Why AI code breaks in production

Fix it, rebuild it, or leave it alone. The truth first

When a codebase is in trouble, the most expensive sentence you can hear is "rewrite everything" from someone who never read it. I read the real code first, then tell you the truth: what to fix, what to rebuild, what to leave alone, including "this is fine, you don't need me," when that is the honest answer. The verdict comes as a written plan you can hand to any developer. That detachment is the whole point: if I recommended a rebuild and only I could do it, you'd have no way to tell me apart from the people who burned you. You can. And under the mess I don't patch the symptom. I find the clean architecture the broken version was blindly reaching for, and name it.

The fear this lifts: being scammed by the person I hired to help.

The invoice that quietly kills the project

These days the fear isn't only bad code. It's the surprise bill. AI tooling can burn a budget silently and end a project in a single month. I measured this on my own work: the "cheap temporary version, just to test it" cost three to four times more than building the real thing once, and left nothing to keep. So I build the right thing the first time, price it fixed for a scoped result instead of by the hour (hourly punishes speed: you'd pay more for me being fast), and keep spend capped and visible, not discovered at month's end.

The fear this lifts: the AI bill that arrives too late to do anything about.

Three months later the model updated and everything broke

AI code has a short shelf life: a new model or tool version ships, the product rots, and you pay to rebuild what already worked. I split every product on purpose:

a stable core that holds your real logic and rules, and a thin outer layer that touches the fast-moving model

When the tooling changes, only the outer layer is rewritten. The core stands. You stop paying for the same thing every quarter.

The fear this lifts: paying again, on every model release, for work I already paid for.

What if they're confidently wrong?

The most expensive mistakes don't look like mistakes. They look like a confident, reasonable answer that happens to be wrong. AI does this constantly; so do people. The instinct to "just fix the obvious cause" is exactly how a wrong guess becomes broken data and lost days. So I built a stop into my own process:

nothing gets changed on a story about what's wrong, only on the actual evidence of what's wrong

That stop catches me too. More than once I've started down a fix that looked completely right and killed it within minutes, the moment I checked it against what had actually happened. The one worth trusting isn't the person who's never wrong. It's the one whose process catches it fast, before it reaches your product.

The fear this lifts: they won't confidently break my thing on a wrong guess, and they'll catch their own mistakes before I pay for them.

Why most AI builds come out shallow

Everyone knows you get more from an AI by asking better. That's why prompt engineering caught on. But the prompt is the small lever. Behind every answer is a large amount of hidden work the model does on itself: the questions it asks itself, the angles it weighs, how hard it actually thinks before it replies. What I found, working this way, is that putting the model into the right state for the task ahead matters more than how you phrase the request. Most tools run it as an order-taker: it does what you said, at the shallowest reading that satisfies the words. That is where demo-deep, falls-apart-later output comes from. For real work I run it as an investigator: thinking at depth, questioning its own assumptions, going after the real problem instead of the stated one.

The two halves fit together. I take everything I can from how the model thinks, and trust nothing of what it claims.

depth on the way in, evidence on the way out

That combination is the craft, and it's most of why what I build holds up where a quick build doesn't.

Whether we're a fit

A fit if you have a real problem and want a finished result (a project or a fixed engagement) and you value speed and quality over hourly bureaucracy. Also a fit if your thing is genuinely messy and broken: I'd rather have a real, unpredictable problem than a tidy invented one.

Send me the real broken thing

If your build is breaking and you need to know what's actually wrong, show me the real thing: the broken repo, the stuck deploy, the product that demos clean and dies on real users. I'll tell you the truth, what's really going on, and whether the right move is a fix, a rebuild, or walking away. That honest diagnosis is where the work starts.