Case 01 · TaxDown · 2025 → ongoing

Agentic TaxDown

An agentic experience that tailors the tax return to each person filing it. The AI does the reach. A human advisor keeps the accountability.

00 · The premise

No two tax returns are alike

The heterogeneity isn't only fiscal. It's idiosyncratic: your age, your relationship with technology, the kind of service you're used to, what you believe, how much you want to delegate. A single flow for everyone, however much micro-personalisation we layer on top, still gives everyone the same thing. That isn't adapting to the user. It's asking the user to adapt to us.

Complex returns are the sharp end of that heterogeneity. Fiscal complexity multiplies the number of situations a person can be in, almost exponentially. So we made a bet. If we could learn to serve the most heterogeneous cases, adapting to everyone else would get easier.

01 · The opportunity

The filer worth the most is the one we lose the most

Complex returns are the highest-value filers. They pay the most when they subscribe, because they correlate with the priciest plans and their tier pushes the price up further. And when they finish the tax flow, they convert to payment at a higher rate. But they drop off the most before the results screen, where the main conversion lever lives.

It was, frankly, predictable. Complex returns carry three compounding burdens:

Quantitative friction

The more complex the return, the more data to provide: more screens, more forms, more fields. Sheer volume.

Uncertainty

Complex data doesn't live in your head: rental dates, sale values, expenses, crypto trades. People often don't know what's being asked, and even when they do, they don't know where to find it. That also lands as a heavy support load on the company.

Accompaniment gap

These users are used to delegating to an advisor who does the hard part with precision. A human-only team can't scale that, and gets low-quality data back, which means endless back-and-forth during the flow and at review.

02 · The bet

Three levers, one first iteration

Ask less

Cut the quantitative friction. Only ask what a case needs, never what we already know.

Remove uncertainty

Read each user and how they are behaving, then give them the context their case needs: which data matters and where to find it.

Accompany, proactively

Step in before the user knows there is a problem, heading off needless escalations to human advisors and curating the information those advisors receive.

03 · The MVP

A conversational agent with generative UI

To validate the hypotheses we started small. Some property sections went through the classic form, others through a new experience guided by a conversational agent, so the test ran right beside the control.

Ask less. The agent asks only what we need. No pre-filled forms to read, no re-entering what we already hold. It draws on prior-year data and blends deterministic rules with heuristics that only an LLM can combine. Depending on the nature of the data and how the user is behaving, it chooses different graphic and textual resources, all to cut the data required for a correct return.

Remove uncertainty. The agent reads how each person is behaving and gives them the context their case needs: what to provide and where to find it, before they get stuck. The help adapts to each user's gaps instead of pointing them to a generic FAQ.

Better data for the advisor. The agent can push back, asking for something more specific when an answer looks off. And users can upload documents instead of digging for the figure inside them: the agent extracts what matters, checks internal coherence, and only then is anything sent to the advisor.
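A minimal sketch of the "ask less" idea, under stated assumptions. It only covers the deterministic part: sorting the fields a case requires into ones to skip, ones to confirm from last year, and ones to ask. The `Field` type, the `stable` flag and the field names are hypothetical, invented for illustration, and the LLM heuristics described above are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str     # hypothetical field identifier
    stable: bool  # assumption: True when the value rarely changes year to year


def questions_to_ask(required, prior_year, current):
    """Sort required fields into 'ask', 'confirm' and 'skip' buckets."""
    plan = {"ask": [], "confirm": [], "skip": []}
    for field in required:
        if field.name in current:
            # Already held for this year: never ask again.
            plan["skip"].append(field.name)
        elif field.stable and field.name in prior_year:
            # Known from last year and unlikely to change: confirm, don't re-enter.
            plan["confirm"].append(field.name)
        else:
            # Genuinely unknown, or likely to have changed: ask.
            plan["ask"].append(field.name)
    return plan
```

With a stable purchase date known from last year, a purchase price already on file, and a rental income that changes every year, this would confirm the first, skip the second, and ask only for the third.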

Some screens

04 · Outcomes

What the numbers said

+9.2% conversion to payment against the form, 29.0% versus 26.6%

−2.1 pts expert corrections against the form (10.6% vs 12.7%)

Once inside, the agent performs. Conversion to payment reaches 29.0% against 26.6% on the form, a +9.2% relative lift, and data quality holds at least on par with the form, with 10.6% expert corrections against 12.7%. The result it produces is the result the form would have produced.

Reading the numbers

Self-selection. Users pick the agent at the selector, so the cohorts aren't randomised. The +9.2% is directional, and a clean A/B test is designed and pending.
We retired our own headline. An earlier read claimed the agent was 34% faster. A later audit found the form's timing skewed by single-page re-entries, so we pulled the claim and rebuilt the method, tagging modality on every event from day one.
Adoption is the real bottleneck. Two-thirds still choose the form, and most who pick the agent use it like a guided form rather than a conversation. That finding is exactly what powers the next iteration.
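A minimal sketch of what tagging modality on every event buys, assuming a simple event shape. The `Event` fields and the event names `section_started` and `payment_completed` are hypothetical, not the product's real schema. With the modality stamped on each event, conversion per modality and the relative lift between them fall out of a single pass.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: str
    name: str      # hypothetical event name
    modality: str  # "agent" or "form", stamped on every event


def conversion_by_modality(events, start="section_started", goal="payment_completed"):
    """Share of users who reached `goal` among those who hit `start`, per modality."""
    started = defaultdict(set)
    converted = defaultdict(set)
    for event in events:
        if event.name == start:
            started[event.modality].add(event.user_id)
        elif event.name == goal:
            converted[event.modality].add(event.user_id)
    return {
        modality: len(converted[modality] & users) / len(users)
        for modality, users in started.items()
    }


def relative_lift(test_rate, control_rate):
    """Relative change of the test rate over the control rate."""
    return (test_rate - control_rate) / control_rate
```

Because the cohorts are self-selected, a lift computed this way stays directional. The sketch only makes the two modalities comparable on the same definition of start and goal.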

05 · What we learned

The things that didn't go to plan

Complex filers skew older, and warier of AI. They get nervous when they sense the machine is replacing the human treatment an advisor would give.
One property at a time isn't enough compounding. Because each property is a separate conversation, we can't synthesise across them. Collapsing four fields into "the buy and sell dates and prices of your Lavapiés flat" saves a little. The real prize, "last year you lived with your son Jaime as a dependent in the Montera flat and rented out the other three, still true?", needs the whole return in one conversation.
We over-constrained the agent. Fear of fiscal errors made it more deterministic than we wanted, which, paradoxically, produced more hallucination and rigidity, not less.
For confident users, the conversation added friction. Reading and writing more than a form requires makes no sense when someone already knows their data cold. The help matters most exactly when the user is unsure.
We were fighting a technology that moves at dizzying speed, hand to hand, week to week.

Now · ongoing · a one-person squad

The next iteration: an end-to-end, generative experience

A discovery I'm running solo. A fully functional agentic product, built from scratch, that replays a property owner's whole return from start to finish.

Agentic technology is improving not just the product, but how the team works. Because I can design, prototype and test almost at the same time, I can put our open hypotheses under real pressure. And each one maps straight back to a learning from the MVP.

Learning

One property at a time isn't enough compounding.

Response

Go end-to-end. Apply friction-removal tactics over the whole return, so "ask once, save many" compounds across every property at once.

Learning

Over-constrained, so it hallucinated and felt rigid.

Response

Iterate the infrastructure faster. A system of specialised agents orchestrated across several harnesses (Paperclip, Codex, Claude Code, LangSmith), generating unit tests, QA cases and datasets in record time.

Learning

Conversation added friction for confident users.

Response

Generative, graphic-first. The core isn't conversational, it's generative. The flow builds itself per case, and conversation becomes the layer of care, not the container.

Learning

Older users distrust AI replacing the human.

Response

The agent as the advisor's emissary. It doesn't replace the advisor. It hands them better data, and gives the user freedom to share information however and whenever they like, asking questions around the clock. We make it visible that the human advisor is present, fully informed, and ultimately responsible. The AI is the reach, the human is the accountability.

How we'll validate it

Less friction

User interventions, data points required, and time to a result, measured against the classic flow.

The correctness gate

Non-negotiable. At equal data, the agent's result must be identical to TaxDown's. The engine is the judge.
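A minimal sketch of such a gate, assuming the engine can be called as a function from collected data to a result dictionary. `toy_engine` is a stand-in invented for illustration, as are the field names. The real calculation engine is not shown here.

```python
from decimal import Decimal


def toy_engine(data):
    # Stand-in for the real calculation engine (an assumption, not TaxDown's code).
    net = Decimal(data["rental_income"]) - Decimal(data["expenses"])
    return {"net_rental_income": net}


def correctness_gate(engine, reference_data, agent_data):
    """Return (passed, diffs). The agent path must reproduce the reference result exactly."""
    reference = engine(reference_data)  # result from the classic flow's data
    candidate = engine(agent_data)      # result from what the agent collected
    diffs = {
        key: (reference.get(key), candidate.get(key))
        for key in reference.keys() | candidate.keys()
        if reference.get(key) != candidate.get(key)
    }
    return (not diffs, diffs)
```

Exact equality on `Decimal` values is deliberate: the gate is meant to fail on any difference, and the returned `diffs` name the fields that diverged.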

More accompaniment

How often users ask open questions or for a human, whether they correct or defer data, and the richness of what they volunteer in natural language.

Some screens

How it's wired

The experience runs as an isolated module embedded in the main product, with its own backend and with authentication inherited from the app. It reads the user's fiscal data through an existing service and runs a deterministic calculation engine alongside LLM agents, with a router that switches model tiers by task. Users can upload documents in bulk for extraction and validation. Crucially, it validates offline against synthetic profiles. It hands a structured draft to a human advisor and never writes to production tax systems during the experiment.
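A minimal sketch of a router that switches model tiers by task, under stated assumptions. The tier names, the task names and the rule that unknown tasks fall back to the larger tier are all hypothetical. The only thing taken from the description above is the split between a deterministic engine and tiered LLM agents.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"  # hypothetical cheaper, faster model tier
    LARGE = "large"  # hypothetical more capable model tier


# Hypothetical task-to-tier table.
TASK_TIERS = {
    "classify_intent": Tier.SMALL,
    "explain_field": Tier.SMALL,
    "extract_document": Tier.LARGE,
}

# Tasks that never go to an LLM: the deterministic engine handles them.
DETERMINISTIC_TASKS = {"compute_result"}


def route(task):
    """Return 'engine' for deterministic tasks, else the model tier for the task."""
    if task in DETERMINISTIC_TASKS:
        return "engine"
    # Assumption: unknown tasks fall back to the more capable tier.
    return TASK_TIERS.get(task, Tier.LARGE).value
```

Keeping the calculation on the deterministic side of the router means no model output ever stands in for the engine's result.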

Where it stands

The prototype is where all of this was discovered, but it isn't reliable yet. It fails often and doesn't calculate with guarantees.
Making it truly testable needs a careful refactor and a minimal fiscal audit. That's scope, not a footnote.
The discipline is deliberate. A lab, not a launch. Validate first, expose later.

The thesis

A return built for one beats a form built for everyone. And the same agent that builds it is the one that has your back.

Credits

Team & my role

The MVP shipped with a squad of seven: a product manager, tech and tax leads, backend and frontend engineers, an AI scientist, and me as the product designer.

What I owned: the product strategy and the three-lever framing, the conversation principles the whole team designed against, the agent's identity and narrative, the full interface, the instrumentation that made the agent and the form comparable, and an agentic QA framework: an AI that runs the full conversation against fiscal personas and files each bug against the principle it breaks. The next iteration I'm building as a one-person squad.
