Migrating from Claude to DeepSeek

Written by Bruno Škvorc, Staff Software Engineer
Last updated: June 24, 2026

Bruno enjoys working on cutting-edge tech and putting himself out of a job with agents. He also likes SCUBA diving. He lives in Croatia in a family of 4.5.

Lindy's pricing only works if inference keeps getting cheaper.

That is the business constraint. Lindy is a generalist AI assistant. It writes emails, manages calendars, prepares meetings, follows up, and does a long list of small jobs that need to feel reliable every day. The model is the engine. It is also one of the biggest costs in the business.

So we designed pricing from day one around a bet: models would get cheaper fast enough that we could move users down the cost curve without making the product worse. We did not want to spend the same money forever and merely give users more intelligence they did not always need.

You do not need God to write your emails.

For months, headlines kept coming out of China claiming frontier-level performance at a fraction of the price. Most of them were noise. Then, in May and June, the claim stopped being noise for us. We moved most managed-agent model traffic, including Claude/Sonnet-backed paths and the remaining Gemini/Google paths, to DeepSeek v4 Flash on Atlas Cloud. Sonnet can still run when a user explicitly selects it or a higher-intelligence path needs it. On the migrated traffic, inference costs fell by about 90%.

Changing a model name was easy. Proving that users would still trust the assistant took the work.

That is why the question of what cheaper models mean for AI labs is not academic. AI labs have grown from zero to billions of dollars in revenue at a ridiculous speed. That money has to come from somewhere, and a lot of it comes straight out of application companies' infrastructure bills.

The mistake is treating this like a dropdown change

The naive version of this migration takes five minutes: pick a cheaper model, run a few prompts, declare victory.

That version is how you ship a worse assistant.

A single prompt tells you almost nothing. The useful test is whether the model survives thousands of tiny product moments. Can it write the email in the user's voice? Can it decide when not to send? Can it keep a meeting brief crisp? Can it recover when the thread is messy? Can it do this without making the user feel like their assistant had brain surgery overnight?

That last question is not theoretical. We tried Kimi K2.5 during this search. It did well in offline evals. Then we rolled it out to a small slice of real usage and it failed the vibe test almost immediately. One user reported that it felt like their Lindy had had brain surgery overnight.

Offline evals are necessary. They are not enough.

How we picked DeepSeek

We started by building a lot of offline evals. This was not new for us. If you run an agent in production, you need a way to replay real tasks and compare models without turning users into guinea pigs.

Then we ran those evals across several candidates, including GLM5.1, Kimi K2.6, and DeepSeek v4 Flash. We also tested the same model on different inference providers. Annoyingly, the provider mattered. The same nominal model could score differently depending on who served it. Our best guess is that some providers were serving quantized versions or had issues in their inference stack.

That is a useful lesson: "we tested DeepSeek" is not precise enough. You tested a model, a provider, an inference stack, and your own prompts as one system.
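To make that concrete, here is a minimal sketch of an offline eval grid that treats the model and the provider as one candidate. It is an illustration, not Lindy's actual harness: `Candidate`, `EvalCase`, `score_candidates`, and the `complete` callable are hypothetical names, and the grading function is whatever check fits your replayed tasks.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """One system under test: the same model on two providers is two candidates."""
    model: str
    provider: str


@dataclass(frozen=True)
class EvalCase:
    """A replayed task plus a grader that scores an output between 0.0 and 1.0."""
    task_input: str
    grade: Callable[[str], float]


# Hypothetical signature: sends the task to the given model/provider, returns text.
CompleteFn = Callable[[Candidate, str], str]


def score_candidates(
    candidates: List[Candidate],
    cases: List[EvalCase],
    complete: CompleteFn,
) -> Dict[Candidate, float]:
    """Run every case against every (model, provider) pair and average the grades."""
    return {
        candidate: mean(
            case.grade(complete(candidate, case.task_input)) for case in cases
        )
        for candidate in candidates
    }


def rank(scores: Dict[Candidate, float]) -> List[Tuple[Candidate, float]]:
    """Best-scoring candidate first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Keying the results on the pair rather than on the model name is the point: a provider that serves a degraded version of the model shows up as its own row with its own score, instead of being averaged into a single number for "the model".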

DeepSeek v4 Flash won for the workloads we cared about. Then we tuned prompts until the offline scores roughly matched the old setup. This is where our GEPA prompt-optimization loop helped: it gave us a way to improve prompts against evals instead of hand-editing in the dark.
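The acceptance gate behind "roughly matched the old setup" can be sketched in a few lines. This is not GEPA, which evolves prompts over many rounds; it only shows the final step of scoring candidate prompts against the eval set and accepting the best one if it lands within a tolerance of the old setup's score. The function name, the `evaluate` callable, and the default tolerance are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional


def pick_prompt(
    variants: Iterable[str],
    evaluate: Callable[[str], float],
    baseline_score: float,
    tolerance: float = 0.02,  # assumed margin for "roughly matched"
) -> Optional[str]:
    """Return the best-scoring prompt if it is within tolerance of the baseline.

    `evaluate` is a hypothetical callable that runs the offline eval set with
    the given prompt on the new model and returns an aggregate score.
    Returns None when no variant gets close enough to the old setup.
    """
    best_prompt: Optional[str] = None
    best_score = float("-inf")
    for prompt in variants:
        score = evaluate(prompt)
        if score > best_score:
            best_prompt, best_score = prompt, score

    if best_score >= baseline_score - tolerance:
        return best_prompt
    return None
```

Returning `None` instead of the least-bad prompt keeps the decision explicit: if nothing clears the bar, the rollout does not start.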

Only then did we start the rollout.

The rollout that actually matters

We began with a small percentage of users, including internal users. Internal traffic matters because employees complain quickly and with useful detail. If the assistant suddenly feels dumber, you hear about it before the graph has enough data to be polite.

From there, we watched two things.

First, online evals. These catch cases that the offline set misses, because production always invents weird inputs faster than you can write tests.

Second, retention. This is slower and more honest. If a cheaper model makes the product subtly worse, users may not file a support ticket. They may just use it less. That means you need at least a few weeks of data before you get too confident.

As long as those signals held, we kept ramping. Eventually DeepSeek reached 100% of the target traffic.
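For readers who want the plumbing anyway, a gated ramp can be sketched like this. Everything here is an assumption for illustration: the step percentages, the hash-based bucketing, the choice to put internal users on the new model first, and the choice to hold rather than roll back when a signal fails.

```python
import hashlib
from typing import FrozenSet

# Hypothetical ramp schedule, in percent of target traffic.
RAMP_STEPS = [1, 5, 25, 50, 100]


def bucket(user_id: str, salt: str = "model-migration") -> int:
    """Map a user to a stable bucket in [0, 100) so they do not flip between models."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def uses_new_model(
    user_id: str,
    rollout_percent: int,
    internal_users: FrozenSet[str] = frozenset(),
) -> bool:
    """Internal users join as soon as the rollout starts; everyone else by bucket."""
    if user_id in internal_users:
        return rollout_percent > 0
    return bucket(user_id) < rollout_percent


def next_rollout_percent(current: int, online_evals_ok: bool, retention_ok: bool) -> int:
    """Advance one step only while both signals hold; otherwise stay put."""
    if not (online_evals_ok and retention_ok):
        return current  # assumed policy: hold; rollback is out of scope here
    for step in RAMP_STEPS:
        if step > current:
            return step
    return current  # already at 100%
```

Because a user's bucket never changes, anyone included at 5% is still included at 25%, so a user's assistant does not switch back and forth as the ramp advances. The retention signal needs weeks of data, so in practice each step would sit for a while before `next_rollout_percent` is asked again.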

This is the part that is easy to misunderstand from the outside. The interesting work was not "we gradually moved traffic from old model to new model." That is plumbing. The interesting work was earning the right to move the traffic at all.

The rule we learned

Do not ask whether a cheaper model is "as good" in the abstract. Ask where it is good enough.

For us, the answer came from offline evals, provider testing, prompt optimization, internal rollout, online evals, retention, and finally a full ramp. That process cut inference costs on the migrated routes by about 90% without asking users to think about model routing.

That is the job for application companies now. Keep the product reliable. Move work to cheaper intelligence when the evidence says you can. Use the savings to make the business work.

The user should not have to care which model wrote the email. They should only notice that Lindy still feels like Lindy.

Come build this with us

Lindy is hiring engineers who want evals, prompt optimization, provider selection, rollout judgment, and cost discipline to be product work.

Picking a cheaper model is the easy headline. Proving that users can keep trusting the assistant is the actual job.

What this means for labs

The obvious question is whether this makes model labs commodities.

We think the honest answer is: partly, and that is not the same thing as doomed.

Airlines are famously brutal businesses. They are capital-intensive, competitive, operationally painful, and hard to differentiate. Delta is still worth tens of billions of dollars. A commodity business can still be enormous when the market is enormous.

If AI becomes much larger than airlines, even commodity-like margins can support huge companies. If a lab owns the frontier at the moment it matters, there is still a huge premium. The race to the frontier still matters.

The pressure is on margins, not relevance. The tier just below the frontier keeps getting cheaper. Every time that tier becomes good enough for another class of work, application companies have a strong incentive to move that work down the cost curve. We just did that for a large slice of Lindy.

That does not mean labs stop mattering. It means they have to keep earning the premium. The frontier can be extremely valuable while the workloads behind it commoditize one by one.
