We Doubled Engineering Throughput With AI. Here's What We Actually Changed.

The organizational decisions, workflow redesigns, and honest challenges behind eight weeks of sustained productivity gains across a 21-person R&D team.

March 4, 2026

A quick glossary (for the non-engineers in the room)

Term	What it means
Pull request (PR)	A single, self-contained change to our codebase. Think of it as one logical unit of new product. More PRs per week means more things shipped.
Merge time	How long a PR sits between being ready for review and going live. Shorter is faster.
Story points	How we measure the size and complexity of work completed. A higher number means more substantial work shipped, not just more tasks checked off.
Code churn	How often code gets rewritten within two weeks of being shipped. A proxy for rework and wasted effort.
Code review	A check where another engineer reads through the author's code before it goes live, to catch problems before they reach customers.

Why velocity matters for our customers

When we started Tern, most people thought we were a little crazy. Building one system that could genuinely serve travelers, travel advisors, agencies, and host agencies had never really been done before. These groups have different workflows, different business models, and very different ideas about what great software looks like. The conventional wisdom was that you had to pick one.

We didn't. And the engineering engine described in this post is a big part of how.

The travel industry is full of what we call invisible work: the gap between what advisors and agencies can do today and what a truly end-to-end experience would unlock for them and their clients. We currently have over 2,000 feature requests sitting in our feedback tool. That backlog isn't a sign of failure. It's a map of the opportunity in front of us.

Every week of engineering speed is a week we close that gap faster. When we're not operating at full potential, advisors are stitching together workarounds for problems we haven't solved yet. That's the context for what follows.

Starting point: we were already moving fast

We weren't a slow team trying to get faster. Our 21-person R&D team runs one-week shipping cycles. Every squad ships a feature every week.

Through October and November, velocity had plateaued at about 104 pull requests per week. Process improvements and team growth had gotten us from 81 in July to that plateau, and we'd leveled off. The jump in January wasn't a gradual trend. It was a step change.

The central challenge wasn't just going faster. It was going faster without making a mess.

Getting aligned before January

Through 2025, engineers had been experimenting independently: different editors, different AI assistants, different workflows. Lots of individual exploration, no shared approach. The problem with that isn't that any one person made a bad choice. It's that when everybody is using different tooling and approaches, it's really hard to invest intentionally in giving the AI context.

At our December offsite, we made some decisions together:

Which tools to standardize on
How to structure the work going forward
What we were actually optimizing for

We centralized on Claude Code and agreed on a set of practices. Then everyone had the holiday break to rest and let it settle. January wasn't a cold start. In hindsight, the specific tool probably mattered less than the act of aligning and investing in one place as a team.

Restructuring into pairs

One week into the rollout, we reorganized squads of three and four engineers into dedicated pairs.

Everyone needed someone to unblock them, and nobody should navigate a major workflow change alone. Changing the way you work is disorienting. Doing it in isolation is a recipe for people getting quietly stuck.

Every pair looked different:

Some worked together most of the day
Some paired for an hour and split off
Some did a morning check-in and an end-of-day review
Some were effectively solo, with a partner who reviewed output periodically

The format was up to each pair. The accountability was the point.

What the data showed

Week one, PRs per week jumped from ~104 to 150. We expected it to settle. By week two it was 178.

Weeks four and five dipped to 154 and 164. People were adapting, review queues were adjusting, the initial energy was normalizing. We watched closely to see if the gains would hold.

Weeks seven and eight came in at 222 and 218. As of mid-February, we're averaging 198 PRs per week, up from a plateau of about 104 in the fall and a baseline of 81 in July. Same team.
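To make the arithmetic behind those figures explicit, here is a small Python sketch that uses only the counts quoted in this post. Weeks three and six are not reported here, so they are left out rather than estimated. The variable and function names are illustrative, not part of any real tooling.

```python
# Weekly PR counts quoted in this post. Weeks three and six are not
# reported, so they are omitted rather than guessed.
reported_prs = {1: 150, 2: 178, 4: 154, 5: 164, 7: 222, 8: 218}

JULY_BASELINE = 81     # PRs per week in July
FALL_PLATEAU = 104     # approximate PRs per week, October-November
CURRENT_AVERAGE = 198  # PRs per week as of mid-February


def growth_multiple(current: float, baseline: float) -> float:
    """Return how many times larger `current` is than `baseline`."""
    return current / baseline


vs_plateau = growth_multiple(CURRENT_AVERAGE, FALL_PLATEAU)
vs_july = growth_multiple(CURRENT_AVERAGE, JULY_BASELINE)
peak_week = max(reported_prs, key=reported_prs.get)

print(f"vs. fall plateau: {vs_plateau:.2f}x")    # 1.90x
print(f"vs. July baseline: {vs_july:.2f}x")      # 2.44x
print(f"peak reported week: week {peak_week}")   # week 7
```

Measured against the fall plateau, the current average is a little under double; measured against July, it is closer to two and a half times.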

Volume without quality is noise. Here's what happened to quality:

Metric	Change vs. July baseline
Code complexity per line	Down 29%
Production errors per 1,000 lines	Down 38%
Review comments per PR	Down 45%
Review rounds to merge	Flat at 1.1 (most PRs still merge on first approval)
Story points per week	Up from 28 in December to 223 by Feb 16 (compared with December, not July)

Code churn spiked briefly in weeks four through seven, then dropped below our pre-AI baseline. The code we're shipping with AI assistance requires less rework than what we were shipping without it.
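For readers who want to see what a churn number could look like in practice, here is a minimal Python sketch of the glossary definition (code rewritten within two weeks of being shipped). It assumes churn is counted as a fraction of shipped lines, which this post does not specify. The `ShippedLine` record, the `churn_rate` function, and the sample dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

CHURN_WINDOW = timedelta(days=14)  # "within two weeks of being shipped"


@dataclass
class ShippedLine:
    """Hypothetical record for one line of shipped code."""
    shipped_on: date
    rewritten_on: Optional[date] = None  # None means never rewritten


def churn_rate(lines: list[ShippedLine]) -> float:
    """Fraction of shipped lines rewritten within the churn window."""
    if not lines:
        return 0.0
    churned = sum(
        1
        for line in lines
        if line.rewritten_on is not None
        and line.rewritten_on - line.shipped_on <= CHURN_WINDOW
    )
    return churned / len(lines)


# Made-up sample data, for illustration only.
sample = [
    ShippedLine(date(2026, 1, 5), date(2026, 1, 9)),   # rewritten after 4 days
    ShippedLine(date(2026, 1, 5), date(2026, 2, 20)),  # rewritten after 46 days
    ShippedLine(date(2026, 1, 6)),                     # never rewritten
    ShippedLine(date(2026, 1, 7), date(2026, 1, 21)),  # rewritten after 14 days
]
print(churn_rate(sample))  # 0.5
```

Only the first and last sample lines fall inside the window, so two of four lines count as churn.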

What do all these numbers actually mean? We're moving faster. We're shipping better code. And we're showing up better for our customers every week.

What we actually changed (beyond the tools)

1. How we plan

One of the more significant changes had nothing to do with AI tooling directly. We changed when we do detailed planning.

Most teams write specs well in advance, weeks before implementation begins, then hand them off. By the time an engineer picks something up, the context is stale. The spec captures what someone thought the problem was, not what they'd think after more recent conversations with users or more recent time in the codebase.

We shifted detailed planning closer to execution. The goal: when an engineer picks up a ticket, the thinking behind it is days old, not weeks.

Two examples of what this unlocked:

We needed to build Agency Supplier Terms & Conditions — letting agency admins configure structured T&Cs and cancellation policies for their suppliers, so advisors don't have to hunt them down or reuse old ones. It's a multi-model feature touching library management, booking and pricing flows, and advisor workflows — the kind of project that would have previously required an extended planning process, an architecture document, a full implementation cycle, and QA. With current-context planning and AI-assisted development, it went from kickoff to shipped in days.

Here's a different flavor: a year ago, we deferred building an admin console for IATA/ICAO airline codes because the ~1,000-line build and review wasn't worth the effort compared with shipping advisor-facing features. This time, it took 30 minutes. As one of our engineers put it: "We've gotten to the point where it's not a project, it's a PR."

The first example shows how a complex project compresses from weeks to days. The second shows how work we'd previously shelved becomes trivial — not even worth calling a project anymore.

This change also made AI significantly more useful. Current, detailed context translates directly into better outputs. Vague or stale specs produce code that gets thrown away. We've thrown away plenty, and the pattern is consistent.

2. Investing in skills and agent definitions

This was the unlock most people didn't anticipate.

Early in the rollout, the common frustration was that AI wasn't quite good enough. It would get close but miss the nuances of how we actually build: our patterns, our preferences, the tricky parts of our codebase. The team's instinct was to work around it.

The shift happened when we started teaching the AI how we work rather than adapting to how it worked out of the box:

We wrote skills and agent definitions that give Claude Code structured context about our system
We pulled documentation from Notion into the repo so it's visible to the AI alongside the code
We rewrote our architectural decision records to be more concise and more useful as reference material

The result was AI that works the way we work, not a generic approximation of how software teams work. That is a meaningfully different tool.

3. Making standups about business goals

We retooled daily standups away from task reporting. The question shifted from "what did you work on yesterday?" to "what are you doing to move the needle this week?"

That change created a new behavior: engineers started sharing workflow improvements openly. Someone might walk through how they compressed a two-hour task to twenty minutes, or how they used context tooling to get unstuck on something that previously required a long detour through legacy code. Those examples spread.

4. Rethinking code review

Doubling output creates review bottlenecks if you don't adapt the process. A few changes made the difference:

Draft stage: Nearly all PRs now go through a draft stage where the author explicitly marks them ready for review. This creates a natural pause for self-review before anyone else looks at it.
Pair as primary reviewer: The pair partner reviews most PRs. Someone who was close to the problem can evaluate the output against the intent, not just the implementation.
High-risk reviewer role: For changes that carry meaningful production risk, a designated engineer evaluates specifically whether a change could cause an incident. They're not reviewing the approach. They're looking for the class of problems that take down production.
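The three rules above can be read as a simple routing decision. The Python sketch below is one hypothetical way to express it, not a description of any actual tooling. The `PullRequest` fields, the `reviewers_for` function, and the engineer names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Hypothetical PR record; field names are illustrative."""
    author: str
    is_draft: bool
    high_risk: bool


def reviewers_for(pr: PullRequest, pair_partner: dict[str, str],
                  risk_reviewer: str) -> list[str]:
    """Return who should review `pr` under the three rules above."""
    if pr.is_draft:
        return []  # still in self-review; nobody else is asked yet
    reviewers = [pair_partner[pr.author]]  # pair partner is primary
    # Assumption: the designated reviewer is skipped if they are already
    # reviewing as the pair partner or are the author themselves.
    if pr.high_risk and risk_reviewer not in reviewers and risk_reviewer != pr.author:
        reviewers.append(risk_reviewer)  # extra incident-focused pass
    return reviewers


# Made-up names, for illustration only.
pairs = {"ana": "ben", "ben": "ana"}
print(reviewers_for(PullRequest("ana", is_draft=True, high_risk=True), pairs, "cy"))
print(reviewers_for(PullRequest("ana", is_draft=False, high_risk=False), pairs, "cy"))
print(reviewers_for(PullRequest("ana", is_draft=False, high_risk=True), pairs, "cy"))
```

A draft gets no reviewers, a ready low-risk PR goes to the pair partner only, and a ready high-risk PR goes to the pair partner plus the designated reviewer.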

What we're still figuring out

The bottleneck has shifted. As engineering capacity has grown, the harder work is figuring out what to build.

Travel technology has depth that's hard to see from the outside. Commission reconciliation alone involves fifteen or more models depending on the relationships between suppliers, host agencies, agencies, and advisors. Most of that knowledge lived in people's heads and nowhere else. AI executes well on well-specified problems. In travel technology, getting to a well-specified problem takes real domain expertise.

That means the ratio of time spent understanding a problem to time spent building the solution has to shift. We're actively working through what that looks like: what a good spec contains before anyone picks it up, how discovery gets structured, and when to build fast and learn versus invest more upfront.

That's the live question.

Eight weeks in

The gains are real and sustained, and the quality signals are moving in the right direction.

The structure around the tools is what made it work. The pairs, the planning discipline, the skills investment, the standup culture, the review process. The tools gave us capacity. The structure determined what we did with it.

198 PRs a week. 21 people. Accelerating.

Every set of changes introduces new bottlenecks. Give it another six weeks, and we'll probably have a whole other suite of changes to talk about.

Pete Jackson and Brian Reath