Test-Driven Refactoring: How to Rewrite Legacy Code Without a Big Bang Rewrite
The billing module has 4,000 lines of untested JavaScript, nobody on the team wrote it, and every deploy triggers the "please do not break billing" prayer. Here is the three-phase strategy (characterization tests, strangler fig extraction, and feature flag cutover) that rewrites legacy code incrementally without a freeze or a big bang.
The billing module has been in production for four years. Nobody on the current team wrote it. The original author left two jobs ago. The module has 4,000 lines of JavaScript, zero tests, twenty-two if statements nested seven levels deep, and a single try/catch wrapping the entire request handler that catches everything and returns { error: 'something went wrong' } with a 500 status.
Every sprint the team dreads the billing ticket. A new tax calculation requirement means touching the pricing logic. Nobody understands the pricing logic. The only “test” is deploying to staging and manually entering credit card numbers. If the total looks wrong, somebody goes back into the code and changes if conditions until staging produces correct-looking numbers.
The common response to this situation is “let’s rewrite it.” A six-week project, a new codebase in TypeScript with tests, a big bang cutover. I have seen this play out six times. It works exactly once: when the code is small (under 1,000 lines), the domain is well-understood (the team knows every edge case), and the business is willing to accept a month without feature work. Every other time, the rewrite misses edge cases, the cutover reveals bugs nobody anticipated, and the project bleeds into a quarter of stalled output.
There is a better way. It is slower in the first week, faster by the third month, and dramatically less risky in every phase. It is called test-driven refactoring, and it has three phases: characterize the existing behavior, extract code using the strangler fig pattern, and cut over behind a feature flag.
Phase 1: Characterization tests
Before you can refactor safely, you need to know what the code actually does, not what you think it should do. The legacy billing module has zero tests, but it has been running in production for four years. That means its behavior, including the bugs, is the current truth. Your characterization tests capture that truth.
A characterization test is different from a specification test. You do not write assertions based on the requirements document. You run the code against a set of inputs, capture the outputs, and assert that the outputs stay exactly the same after your refactor. The tests are a net over the existing behavior.
Here is the pattern, applied to a legacy JavaScript pricing function:
// legacy.js -- the untouchable original
function calculatePrice(items, customer, discountCode) {
  let total = 0;
  for (let i = 0; i < items.length; i++) {
    let price = items[i].price;
    // original author's comment: "special handling for promo items"
    if (items[i].category === 'promo' && items[i].quantity > 1) {
      price = price * 0.8;
    }
    total += price * items[i].quantity;
  }
  if (customer.tier === 'vip') {
    total = total * 0.9;
  }
  // TODO: apply discount code
  if (discountCode && discountCode.startsWith('SAVE')) {
    const parts = discountCode.split('-');
    if (parts.length === 2 && !isNaN(parseInt(parts[1]))) {
      total = total - parseInt(parts[1]);
    }
  }
  if (total < 0) total = 0;
  return Math.round(total * 100) / 100;
}
// Exported so the billing module and the tests can load the function
module.exports = { calculatePrice };
The characterization tests for this function look like this:
import { calculatePrice } from './legacy';
// Use a describe block per function to group all characterizations
describe('calculatePrice (characterization)', () => {
  // Helper to build test fixtures
  function makeItem(overrides = {}) {
    return {
      id: 1,
      name: 'test item',
      price: 10.00,
      quantity: 1,
      category: 'standard',
      ...overrides,
    };
  }
  function makeCustomer(overrides = {}) {
    return { tier: 'standard', ...overrides };
  }
  // Test 1: simplest happy path -- one standard item
  it('returns the item price for a single standard item with no discount', () => {
    const result = calculatePrice(
      [makeItem({ price: 10.00 })],
      makeCustomer(),
      null
    );
    expect(result).toBe(10.00);
  });
  // Test 2: quantity multiplies
  it('multiplies price by quantity for standard items', () => {
    const result = calculatePrice(
      [makeItem({ price: 5.00, quantity: 3 })],
      makeCustomer(),
      null
    );
    expect(result).toBe(15.00);
  });
  // Test 3: promo items with quantity > 1 get 20% off
  it('applies 20% discount to promo items with quantity > 1', () => {
    const result = calculatePrice(
      [makeItem({ price: 10.00, quantity: 2, category: 'promo' })],
      makeCustomer(),
      null
    );
    // 10 * 2 * 0.8 = 16
    expect(result).toBe(16.00);
  });
  // Test 4: promo items with quantity === 1 do NOT get discount
  it('does not apply promo discount when quantity is exactly 1', () => {
    const result = calculatePrice(
      [makeItem({ price: 10.00, quantity: 1, category: 'promo' })],
      makeCustomer(),
      null
    );
    expect(result).toBe(10.00);
  });
  // Test 5: VIP tier gets 10% off total
  it('applies 10% VIP discount to total before discount code', () => {
    const result = calculatePrice(
      [makeItem({ price: 100.00 })],
      makeCustomer({ tier: 'vip' }),
      null
    );
    // 100 * 0.9 = 90
    expect(result).toBe(90.00);
  });
  // Test 6: SAVE-xxx discount code subtracts xxx dollars
  it('subtracts the numeric value of SAVE-xxx codes', () => {
    const result = calculatePrice(
      [makeItem({ price: 50.00 })],
      makeCustomer(),
      'SAVE-10'
    );
    // 50 - 10 = 40
    expect(result).toBe(40.00);
  });
  // Test 7: discount code that does not start with SAVE is ignored
  it('ignores discount codes not starting with SAVE', () => {
    const result = calculatePrice(
      [makeItem({ price: 50.00 })],
      makeCustomer(),
      'WELCOME-10'
    );
    expect(result).toBe(50.00);
  });
  // Test 8: total is floored at zero
  it('returns 0 when discounts exceed the total', () => {
    const result = calculatePrice(
      [makeItem({ price: 5.00 })],
      makeCustomer(),
      'SAVE-100'
    );
    expect(result).toBe(0);
  });
});
Run these tests against the legacy function. Every test passes. Now you have a safety net. If a future refactor changes any of these outputs, the test fails. The tests do not care whether the implementation is a tangle of if statements or a clean strategy pattern. They only care that the output for each input is identical.
You do not need 100% coverage in characterization tests. You need coverage for every input pattern that currently exists in production. Parse the git history, the bug tracker, and the support tickets to find the edge cases. Every bug that was fixed is a regression risk that needs a characterization test.
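The edge cases that come out of tickets and git history are often quirks rather than features. As a minimal sketch, the table below pins down a few quirks of the discount-code parsing shown earlier. The `recorded` values are what the legacy logic returns for these inputs, not what a requirements document would say. The type names, `recordedCases`, and `findDrift` are illustrative assumptions, and the function body is a typed copy of the legacy logic so the sketch runs on its own.

```typescript
// Hypothetical fixture types; the legacy module itself is untyped JavaScript.
type Item = { price: number; quantity: number; category: string };
type Customer = { tier: string };

// Copy of the legacy logic so this sketch is self-contained.
// In a real suite you would import the function from the legacy module instead.
function legacyCalculatePrice(items: Item[], customer: Customer, discountCode: string | null): number {
  let total = 0;
  for (const item of items) {
    let price = item.price;
    if (item.category === 'promo' && item.quantity > 1) price = price * 0.8;
    total += price * item.quantity;
  }
  if (customer.tier === 'vip') total = total * 0.9;
  if (discountCode && discountCode.startsWith('SAVE')) {
    const parts = discountCode.split('-');
    const amountText = parts[1] ?? '';
    // No radix on purpose: adding one would change how the legacy code parses.
    if (parts.length === 2 && !isNaN(parseInt(amountText))) {
      total = total - parseInt(amountText);
    }
  }
  if (total < 0) total = 0;
  return Math.round(total * 100) / 100;
}

type RecordedCase = { name: string; price: number; tier: string; code: string | null; recorded: number };

// Each case is a single standard item; "recorded" is the captured legacy output.
const recordedCases: RecordedCase[] = [
  { name: 'code without a hyphen is ignored', price: 50, tier: 'standard', code: 'SAVE10', recorded: 50 },
  { name: 'code with two hyphens is ignored', price: 50, tier: 'standard', code: 'SAVE-10-EXTRA', recorded: 50 },
  { name: 'letters after the amount are dropped', price: 50, tier: 'standard', code: 'SAVE-5abc', recorded: 45 },
  { name: 'fractional amounts are truncated', price: 50, tier: 'standard', code: 'SAVE-1.5', recorded: 49 },
  { name: 'any prefix starting with SAVE is accepted', price: 50, tier: 'standard', code: 'SAVEXYZ-10', recorded: 40 },
  { name: 'VIP discount applies before the code', price: 100, tier: 'vip', code: 'SAVE-10', recorded: 80 },
];

// Returns the names of cases whose output no longer matches the recording.
function findDrift(fn: typeof legacyCalculatePrice): string[] {
  return recordedCases
    .filter((c) => fn([{ price: c.price, quantity: 1, category: 'standard' }], { tier: c.tier }, c.code) !== c.recorded)
    .map((c) => c.name);
}

console.log(findDrift(legacyCalculatePrice)); // an empty array means nothing drifted
```

Several of these rows look like bugs. At this stage that does not matter: the recording preserves them until someone makes a deliberate decision to change the behavior.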
Phase 2: The strangler fig pattern
With characterization tests in place, you can start extracting the legacy code into smaller, testable pieces without changing any behavior. The strangler fig pattern (named after the vine that slowly envelops a tree) means you build the new code alongside the old code, bit by bit, routing calls to the new implementation as each piece becomes ready.
Do not attempt to extract the entire 4,000-line module at once. Find the seams. Look for functions that have clear inputs and outputs. The calculatePrice function above is a perfect candidate: it takes three arguments and returns a number. Its test surface is already written.
Here is the extraction in three steps.
Step 1: Copy the function into a new file and add TypeScript types.
// src/pricing/calculatePrice.ts
export interface LineItem {
  id: number;
  name: string;
  price: number;
  quantity: number;
  category: 'standard' | 'promo' | 'clearance';
}
export interface CustomerProfile {
  tier: 'standard' | 'vip' | 'enterprise';
}
export type DiscountCode = string | null;
export function calculatePrice(
  items: LineItem[],
  customer: CustomerProfile,
  discountCode: DiscountCode
): number {
  let total = 0;
  for (const item of items) {
    let price = item.price;
    if (item.category === 'promo' && item.quantity > 1) {
      price = price * 0.8;
    }
    total += price * item.quantity;
  }
  if (customer.tier === 'vip') {
    total = total * 0.9;
  }
  if (discountCode && discountCode.startsWith('SAVE')) {
    const parts = discountCode.split('-');
    if (parts.length === 2 && !isNaN(parseInt(parts[1]))) {
      total = total - parseInt(parts[1]);
    }
  }
  if (total < 0) total = 0;
  return Math.round(total * 100) / 100;
}
Run the characterization tests against both the old and new implementations. The tests should pass for both. If they do not, you either copied incorrectly or the legacy function depends on something outside its parameters (a global, a module-level variable, a database call). If it depends on globals, inject them as parameters in the new version and mock them in the tests.
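Beyond the hand-written cases, one way to gain confidence that the copy is faithful is a differential check: feed both implementations the same generated inputs and report any disagreement. The sketch below assumes compact copies of the two functions so it runs on its own. The generator, its value ranges, and the names `makeRandom`, `pick`, and `findMismatches` are illustrative choices, not part of the billing module.

```typescript
// Untyped copy of the legacy function (params typed as any to mirror plain JavaScript).
function legacyCalculatePrice(items: any, customer: any, discountCode: any): number {
  let total = 0;
  for (let i = 0; i < items.length; i++) {
    let price = items[i].price;
    if (items[i].category === 'promo' && items[i].quantity > 1) price = price * 0.8;
    total += price * items[i].quantity;
  }
  if (customer.tier === 'vip') total = total * 0.9;
  if (discountCode && discountCode.startsWith('SAVE')) {
    const parts = discountCode.split('-');
    if (parts.length === 2 && !isNaN(parseInt(parts[1]))) total = total - parseInt(parts[1]);
  }
  if (total < 0) total = 0;
  return Math.round(total * 100) / 100;
}

type LineItem = { price: number; quantity: number; category: string };
type CustomerProfile = { tier: string };

// Typed copy, standing in for the new module.
function typedCalculatePrice(items: LineItem[], customer: CustomerProfile, discountCode: string | null): number {
  let total = 0;
  for (const item of items) {
    const unit = item.category === 'promo' && item.quantity > 1 ? item.price * 0.8 : item.price;
    total += unit * item.quantity;
  }
  if (customer.tier === 'vip') total = total * 0.9;
  if (discountCode && discountCode.startsWith('SAVE')) {
    const parts = discountCode.split('-');
    const amount = parseInt(parts[1] ?? '');
    if (parts.length === 2 && !isNaN(amount)) total = total - amount;
  }
  if (total < 0) total = 0;
  return Math.round(total * 100) / 100;
}

// Small deterministic generator (a linear congruential generator) so failures are reproducible.
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

function pick<T>(random: () => number, values: readonly T[]): T {
  return values[Math.floor(random() * values.length)] as T;
}

const categories = ['standard', 'promo', 'clearance'];
const tiers = ['standard', 'vip', 'enterprise'];
const codes = [null, 'SAVE-10', 'SAVE-500', 'SAVE10', 'WELCOME-10', 'SAVE-5abc'];

// Runs both implementations on the same generated orders and collects disagreements.
function findMismatches(runs: number, seed: number): string[] {
  const random = makeRandom(seed);
  const mismatches: string[] = [];
  for (let run = 0; run < runs; run++) {
    const items: LineItem[] = [];
    const itemCount = 1 + Math.floor(random() * 4);
    for (let i = 0; i < itemCount; i++) {
      items.push({
        price: Math.floor(random() * 10000) / 100,
        quantity: 1 + Math.floor(random() * 5),
        category: pick(random, categories),
      });
    }
    const customer = { tier: pick(random, tiers) };
    const code = pick(random, codes);
    const oldResult = legacyCalculatePrice(items, customer, code);
    const newResult = typedCalculatePrice(items, customer, code);
    if (oldResult !== newResult) {
      mismatches.push(JSON.stringify({ items, customer, code, oldResult, newResult }));
    }
  }
  return mismatches;
}

console.log(findMismatches(1000, 42).length); // number of disagreements found
```

Generated inputs complement the recorded cases rather than replace them. They tend to find arithmetic differences, while the recorded cases protect the specific situations production has already hit.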
Step 2: Factor the function into smaller, pure sub-functions.
Now that the logic is in a typed file with passing tests, you can safely decompose it. Each sub-function gets its own tests. This is where the real refactoring happens.
// src/pricing/calculatePrice.ts (after decomposition)
// The sub-functions return unrounded values. The legacy code rounds exactly
// once, at the end, so rounding inside a sub-function could shift a total by a cent.
export function calculateLineTotal(item: LineItem): number {
  let price = item.price;
  if (item.category === 'promo' && item.quantity > 1) {
    price = price * 0.8;
  }
  return price * item.quantity;
}
export function applyCustomerDiscount(
  total: number,
  customer: CustomerProfile
): number {
  if (customer.tier === 'vip') {
    return total * 0.9;
  }
  return total;
}
export function applyDiscountCode(
  total: number,
  discountCode: DiscountCode
): number {
  if (!discountCode || !discountCode.startsWith('SAVE')) {
    return total;
  }
  const parts = discountCode.split('-');
  if (parts.length !== 2 || isNaN(parseInt(parts[1]))) {
    return total;
  }
  const result = total - parseInt(parts[1]);
  return result < 0 ? 0 : result;
}
export function calculatePrice(
  items: LineItem[],
  customer: CustomerProfile,
  discountCode: DiscountCode
): number {
  let total = items.reduce((sum, item) => sum + calculateLineTotal(item), 0);
  total = applyCustomerDiscount(total, customer);
  total = applyDiscountCode(total, discountCode);
  // The legacy code floors at zero even when no discount code applies
  if (total < 0) total = 0;
  return Math.round(total * 100) / 100;
}
Each sub-function is pure, tested independently, and has no side effects. The calculatePrice function is now a composition of three small, readable functions instead of a single 32-line block with nested if statements.
Write unit tests for each sub-function:
describe('calculateLineTotal', () => {
  it('returns price * quantity for standard items', () => {
    expect(calculateLineTotal({ price: 10, quantity: 3, category: 'standard' } as LineItem))
      .toBe(30);
  });
  it('applies 20% discount for promo items with qty > 1', () => {
    expect(calculateLineTotal({ price: 10, quantity: 2, category: 'promo' } as LineItem))
      .toBe(16);
  });
  it('does not apply discount for promo items with qty === 1', () => {
    expect(calculateLineTotal({ price: 10, quantity: 1, category: 'promo' } as LineItem))
      .toBe(10);
  });
});
describe('applyCustomerDiscount', () => {
  it('returns total unchanged for standard customers', () => {
    expect(applyCustomerDiscount(100, { tier: 'standard' })).toBe(100);
  });
  it('applies 10% discount for VIP customers', () => {
    expect(applyCustomerDiscount(100, { tier: 'vip' })).toBe(90);
  });
});
describe('applyDiscountCode', () => {
  it('subtracts SAVE-xxx amount from total', () => {
    expect(applyDiscountCode(50, 'SAVE-10')).toBe(40);
  });
  it('returns 0 if discount exceeds total', () => {
    expect(applyDiscountCode(5, 'SAVE-100')).toBe(0);
  });
  it('ignores codes not starting with SAVE', () => {
    expect(applyDiscountCode(50, 'WELCOME-10')).toBe(50);
  });
});
The characterization tests from Phase 1 still pass, now running against the decomposed version. The unit tests for the sub-functions give you a more precise safety net at a finer granularity.
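One thing to watch during decomposition is where rounding happens. The legacy function rounds once, after everything has been summed. A sub-function that rounds its own return value looks harmless, yet it can move the final total by a cent. The sketch below illustrates this with two identical promo lines. The names `PromoLine`, `roundToCents`, and `promoLineTotal` are illustrative, not taken from the billing module.

```typescript
type PromoLine = { price: number; quantity: number };

const roundToCents = (n: number): number => Math.round(n * 100) / 100;

// Same arithmetic order as the legacy loop: discount the unit price, then multiply.
const promoLineTotal = (line: PromoLine): number => line.price * 0.8 * line.quantity;

// Two promo lines, each worth 2.424 before rounding.
const lines: PromoLine[] = [
  { price: 1.01, quantity: 3 },
  { price: 1.01, quantity: 3 },
];

// Legacy style: sum the raw line totals, round once at the end.
const roundedOnce = roundToCents(lines.reduce((sum, l) => sum + promoLineTotal(l), 0));

// Tempting refactor: round every line total, then sum.
const roundedPerLine = roundToCents(lines.reduce((sum, l) => sum + roundToCents(promoLineTotal(l)), 0));

console.log(roundedOnce, roundedPerLine); // 4.85 4.84
```

A one-cent drift like this is exactly the kind of difference a characterization test with multi-line orders is meant to catch, so it is worth including a few orders with several promo items in the Phase 1 suite.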
Step 3: Replace the old import with the new one.
In the original billing module, change the import:
// before
const { calculatePrice } = require('./legacy');
// after
const { calculatePrice } = require('./src/pricing/calculatePrice');
Run the characterization tests. Run the integration tests. Deploy. The billing module now calls the refactored, typed, tested code for its pricing calculations, but everything else in the 4,000-line module is untouched.
Repeat this process for the next function, and the next, and the next. Each extraction takes a day or two. Within a month, the 4,000-line module is a thin shell that imports functions from separate, tested, typed modules. The shell itself becomes small enough that you can extract the remaining logic or delete it entirely.
Phase 3: Feature flag cutover
Sometimes you cannot extract the old code cleanly because the logic is deeply interwoven with the framework, the database, or the routing layer. In those cases, the safest path is to build the new implementation in parallel behind a feature flag.
The feature flag approach is the slowest but safest option. You build the new module alongside the old one. You route production traffic to the new module for a small percentage of users. You compare the outputs. When the new module matches the old one for 100% of requests over a week of production data, you cut over fully and delete the legacy code.
Here is the pattern with a simple feature flag:
// src/featureFlags.ts
export function useNewBilling(userId: string): boolean {
  // Start with 1% of users, ramp up as confidence grows
  return Math.abs(hashUserId(userId)) % 100 < 1;
}
function hashUserId(id: string): number {
  let hash = 0;
  for (let i = 0; i < id.length; i++) {
    const char = id.charCodeAt(i);
    hash = ((hash << 5) - hash) + char;
    hash = hash & hash; // Convert to 32-bit integer
  }
  return hash;
}
// src/billing/router.ts (simplified)
import { useNewBilling } from '../featureFlags';
import { calculatePrice as oldCalculate } from '../legacy/billing';
import { calculatePrice as newCalculate } from '../pricing/calculatePrice';
// `logger` is assumed to be the application's existing logger (not shown here)
function handleCheckout(req, res) {
  const { items, customer, discountCode, userId } = req.body;
  // Evaluate the flag once so routing and comparison always agree
  const useNew = useNewBilling(userId);
  const price = useNew
    ? newCalculate(items, customer, discountCode)
    : oldCalculate(items, customer, discountCode);
  // Compare outputs in production for monitoring
  if (useNew) {
    const oldResult = oldCalculate(items, customer, discountCode);
    // Both results are already rounded to cents, so compare exactly.
    // A tolerance of 0.01 would hide one-cent mismatches.
    if (price !== oldResult) {
      logger.warn('Price mismatch between old and new implementations', {
        userId,
        oldResult,
        newResult: price,
        items,
        discountCode,
      });
    }
  }
  // ... complete checkout with the calculated price
}
The key insight: you compare the old and new results for every request the flag routes to the new implementation, and you keep comparing as the percentage ramps up. The comparison is the safety mechanism. If the logs show zero mismatches over a week, you can confidently delete the old code. If mismatches appear, you inspect them, fix the new implementation, and wait another week.
Never roll out a feature flag for a rewrite without a comparison logger. The comparison is the only objective measure of correctness. Without it, you are guessing.
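A more conservative variation, often called shadow mode, is worth knowing about. It is a different rollout style from the flag routing in the handler above: every request is still served by the old implementation, while the new one runs alongside purely for comparison. The sketch below is a minimal, generic wrapper under that assumption. `withShadowComparison`, `ShadowReport`, and the toy functions are hypothetical names for illustration.

```typescript
type ShadowReport<Args, T> = {
  args: Args;
  primaryResult: T;
  candidateResult?: T;
  error?: unknown;
};

// Wraps two implementations. The primary result is always returned;
// the candidate only ever influences the mismatch report.
function withShadowComparison<Args extends unknown[], T>(
  primary: (...args: Args) => T,
  candidate: (...args: Args) => T,
  onMismatch: (report: ShadowReport<Args, T>) => void,
): (...args: Args) => T {
  return (...args: Args): T => {
    const primaryResult = primary(...args);
    try {
      const candidateResult = candidate(...args);
      if (candidateResult !== primaryResult) {
        onMismatch({ args, primaryResult, candidateResult });
      }
    } catch (error) {
      // A crash in the new code is reported, never surfaced to the customer.
      onMismatch({ args, primaryResult, error });
    }
    return primaryResult;
  };
}

// Toy stand-ins for the old and new pricing functions.
const oldDouble = (n: number): number => n * 2;
const newDouble = (n: number): number => (n === 3 ? 7 : n * 2); // deliberately wrong for 3

const reports: ShadowReport<[number], number>[] = [];
const shadowedDouble = withShadowComparison(oldDouble, newDouble, (r) => reports.push(r));

shadowedDouble(1); // results agree, nothing reported
shadowedDouble(3); // still returns the old result, but records the mismatch
console.log(reports.length);
```

The trade-off is that shadow mode only suits side-effect-free functions, since both implementations run on every call. For pure pricing logic that is usually acceptable, and it means no customer ever sees a price computed by unproven code.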
Common mistakes
Writing new tests instead of characterization tests first. A team skips the characterization phase and writes specification tests based on the requirements they think the code should follow. The tests pass against the new code, but the new code handles an edge case differently from the old code. An order that was priced at $49.99 in production is now priced at $50.00. Customer support gets a ticket. The fix is to start with characterization tests that capture the actual production behavior, not the ideal behavior.
Extracting functions that have side effects. The strangler fig pattern works best with pure functions. If the legacy code writes to a database or calls an external API in the middle of a calculation, you cannot extract that calculation cleanly without extracting the side effect too. The solution is to identify the I/O boundary first. Extract the pure calculation logic. Leave the I/O in the shell. Test the I/O separately with integration tests.
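To make the I/O boundary concrete, here is a small sketch of the split, using a tax lookup as a stand-in for whatever the legacy code fetches mid-calculation. `TaxRateSource`, `totalWithTax`, `priceOrder`, and the rates are all hypothetical; the point is only the shape of the separation.

```typescript
// Hypothetical I/O dependency: where tax rates come from (database, API, config).
interface TaxRateSource {
  getRate(region: string): Promise<number>;
}

// Pure core: no I/O, same output for the same input, easy to characterize.
function totalWithTax(subtotal: number, taxRate: number): number {
  return Math.round(subtotal * (1 + taxRate) * 100) / 100;
}

// Thin shell: performs the I/O, then delegates to the pure core.
async function priceOrder(subtotal: number, region: string, rates: TaxRateSource): Promise<number> {
  const taxRate = await rates.getRate(region);
  return totalWithTax(subtotal, taxRate);
}

// In tests, the shell can be driven by an in-memory stub instead of a real data source.
const stubRates: TaxRateSource = {
  getRate: async (region) => (region === 'EU' ? 0.25 : 0),
};

priceOrder(100, 'EU', stubRates).then((total) => console.log(total));
```

The pure function gets fast unit and characterization tests. The shell stays small enough that one or two integration tests cover it.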
Big bang feature flag cutover. A team builds the entire new module behind a single flag and flips it at once. This defeats the purpose of incremental refactoring. The flag should control a single function or a single route, not the entire module. Flip flags one function at a time.
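One possible shape for per-function flags is a small table of rollout percentages keyed by flag name, reusing the same string hash idea as the feature flag shown earlier. This is a sketch under assumptions: the flag names, the `rollout` table, and the choice to mix the flag name into the hash are illustrative, and a real project might use an existing flag service instead.

```typescript
// Maps a string to a stable bucket from 0 to 99.
function bucketFor(key: string): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = ((hash << 5) - hash + key.charCodeAt(i)) | 0; // keep it a 32-bit integer
  }
  return Math.abs(hash) % 100;
}

// Hypothetical flags, one per extracted function, each with its own percentage.
const rollout: Record<string, number> = {
  'pricing.calculatePrice': 100, // fully cut over
  'pricing.calculateTax': 1, // just started
  'pricing.formatInvoice': 0, // built but not yet exposed
};

function isEnabled(flag: string, userId: string): boolean {
  const percent = rollout[flag] ?? 0; // unknown flags stay off
  // Mixing the flag name into the key gives each flag its own set of early users.
  return bucketFor(`${flag}:${userId}`) < percent;
}

console.log(isEnabled('pricing.calculatePrice', 'user-42')); // true, since the flag is at 100%
```

Because each flag ramps independently, a mismatch in one function can be rolled back to 0% without touching the functions that have already proven themselves.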
Not deleting the old code. A team extracts four functions into new modules but leaves the old functions in place “just in case.” Two years later, nobody knows which code path is active. The dead code becomes a source of confusion and bugs. Delete the old code when the new code has been running in production with zero comparison mismatches for at least a week. Git has the history. You do not need the dead code.
When to just rewrite instead
The strangler fig pattern is not always the right answer. Consider a full rewrite when:
- The module is less than 500 lines. The overhead of extraction exceeds the benefit. Rewrite it, but still write characterization tests first against the production database or a traffic capture.
- The module is not deployed to production. If the code has never run against real traffic, there is no production behavior to preserve. Write specification tests and build the new version.
- The module is being replaced by a third-party service. If you are replacing in-house billing with Stripe or a custom pricing engine with a vendor, the interface is fundamentally different. Characterization tests on the old code help you migrate the data, but the new service defines its own contract.
Everything else goes through the strangler fig. It is slower in week one, but because every step is checked against recorded production behavior, it is the approach least likely to ship a regressed billing calculation to production.
The practical takeaway
Every codebase has legacy code. The team that avoids touching it accumulates technical debt. The team that rewrites it in a big bang accumulates production incidents. The team that applies test-driven refactoring accumulates safety.
The three-phase rhythm is always the same. Characterize the existing behavior with tests that capture production truths. Extract one pure function at a time into a typed, tested module. Cut over behind a feature flag with a comparison logger. Repeat until the legacy module is an empty shell. Delete the shell.
Before your next legacy code ticket, run through this checklist:
- Characterization tests exist for every input pattern that currently reaches the function in production.
- The extraction targets one pure function at a time, not the entire module.
- The new code keeps the legacy function signature (now with TypeScript types) and passes the same characterization tests as the old code.
- Sub-functions are pure and independently unit tested.
- The feature flag (if used) starts at 1% and ramps up only after zero comparison mismatches.
- Old code is deleted once the new code has run at full cutover for at least a week with zero comparison mismatches.
The billing module is not hopeless. It is just untested. The difference is what you do next.
A note from Yojji
The discipline of refactoring legacy code without breaking production is a core engineering skill that separates teams who ship reliably from teams who ship prayerfully. Yojji’s engineering teams apply these same incremental extraction patterns when taking over existing codebases for full-cycle development engagements, ensuring that modernization happens without the downtime and revenue risk of a big bang rewrite.
Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms, and full-cycle product delivery from discovery through deployment.