Graceful Degradation: The Pattern That Turns Total Outages into Partial Success

Black Friday. The recommendation service, powered by a Python ML model, OOMs under load. The Node.js API gateway calls it to populate “You might also like” on the product page. The call throws. The gateway does not catch it. The entire product page route returns 500. The checkout flow is fine, but 70% of users discover products through recommendations. Revenue drops 40% for two hours until the ML team scales the service.

The recommendation engine was 8% of the page payload. The other 92% (product details, pricing, inventory, reviews) was healthy. But the architecture was all-or-nothing: if any dependency failed, the route failed. That is not a dependency problem. It is a composition problem.

Graceful degradation is the decision to serve a partial success instead of a total failure. It is not a circuit breaker (which stops calling the broken service) and it is not load shedding (which rejects the request). It is the code that says: “The recommendations are unavailable, so show the top-10 bestsellers from yesterday’s cache instead.” The user still sees a product page. The business still makes money. The on-call engineer fixes the ML service in the morning instead of at midnight.

This post covers the degradation pattern: the fallback hierarchy, the wrapper code that makes it automatic, the cache strategy that keeps stale data ready, and the monitoring that tells you when you are running degraded.

The fallback hierarchy

Not all failures deserve the same response. A four-level hierarchy keeps your decisions consistent:

Level 1: Stale cache. If the dependency fails, return the last cached response even if it is expired. For recommendations, yesterday’s bestsellers are better than nothing.

Level 2: Pre-computed defaults. Maintain a static fallback for critical paths. A weather app might show “conditions unavailable” but still display the 7-day forecast from the last successful sync. An e-commerce site might show “trending now” instead of personalized recommendations.

Level 3: Simplified response. Omit the failed section entirely but return HTTP 200 with the rest of the payload. The mobile app renders the product page without the “similar items” carousel. This requires frontend discipline: every optional section must handle absence.

Level 4: Empty but valid. Return an empty array, a null field, or a placeholder object that satisfies the schema. The client knows something is missing but the page does not crash.

The rule: never let a failure that has a Level 1-4 fallback surface as a 500. A 500 means “I have no idea what happened.” A 200 with degraded data means “I am operational, but this specific feature is limited.”
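One way to make the hierarchy concrete is an ordered list of fallback sources, tried in sequence until one yields a value. The sketch below is illustrative only: `FallbackSource`, `FallbackResult`, and `resolveFallback` are hypothetical names, not part of any library. Level 3 (omitting a section) is left to the caller, since it changes the response shape rather than supplying a value.

```typescript
// Hypothetical names for illustration; none of these come from a library.
type FallbackLevel = 'stale' | 'default' | 'simplified' | 'empty';

interface FallbackSource<T> {
  level: FallbackLevel;
  // Returns undefined when this level has nothing to offer.
  read: () => Promise<T | undefined>;
}

interface FallbackResult<T> {
  level: FallbackLevel;
  value: T;
}

// Walk the sources in order; the last resort is always the empty-but-valid value.
async function resolveFallback<T>(
  sources: FallbackSource<T>[],
  empty: T,
): Promise<FallbackResult<T>> {
  for (const source of sources) {
    try {
      const value = await source.read();
      if (value !== undefined) return { level: source.level, value };
    } catch {
      // A failing fallback must not turn into a 500; try the next level.
    }
  }
  return { level: 'empty', value: empty };
}
```

Returning the level alongside the value lets the caller log which rung of the hierarchy actually served the request.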

The degradation wrapper

The implementation is a wrapper around service calls. It enforces a timeout, catches errors, and routes to the fallback. Here is the TypeScript version we use in production:

// degradation.ts
import { setTimeout } from 'node:timers/promises';

type DegradationLevel = 'stale' | 'default' | 'simplified' | 'empty';

interface DegradeOptions<T> {
  name: string;
  timeoutMs: number;
  fallback: T;
  staleCache?: () => Promise<T | undefined>;
  onDegraded?: (level: DegradationLevel, err: unknown) => void;
}

export async function withDegradation<T>(
  operation: () => Promise<T>,
  options: DegradeOptions<T>,
): Promise<T> {
  const { name, timeoutMs, fallback, staleCache, onDegraded } = options;

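  // One sentinel instance: Symbol('timeout') returns a new, unequal symbol on
  // every call, so comparing against a fresh Symbol('timeout') never matches.
  const TIMEOUT = Symbol('timeout');
  // Lets us cancel the timer once the race settles, so fast responses
  // do not leave a pending timer behind.
  const timer = new AbortController();

  try {
    const result = await Promise.race([
      operation(),
      setTimeout(timeoutMs, TIMEOUT, { signal: timer.signal }),
    ]).finally(() => timer.abort());

    if (result === TIMEOUT) {
      throw new Error(`${name} timed out after ${timeoutMs}ms`);
    }

    return result as T;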
  } catch (err) {
    // Level 1: try stale cache.
    if (staleCache) {
      try {
        const cached = await staleCache();
        if (cached !== undefined) {
          onDegraded?.('stale', err);
          return cached;
        }
      } catch {
        // The cache read failed as well; fall through to the static fallback.
      }
    }

    // Level 4: empty but valid fallback.
    onDegraded?.('empty', err);
    return fallback;
  }
}

Usage in an Express route:

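import { Router } from 'express';
import { withDegradation } from './degradation.js';
import { getProduct } from './products.js'; // assumed module path; the route calls getProduct below
import { getRecommendations } from './recommendations.js';
import { redis } from './redis.js';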

const router = Router();

router.get('/api/products/:id', async (req, res) => {
  const [product, recommendations] = await Promise.all([
    getProduct(req.params.id),
    withDegradation(
      () => getRecommendations(req.params.id),
      {
        name: 'recommendations',
        timeoutMs: 200,
        fallback: [],
        staleCache: async () => {
          const cached = await redis.get(`recs:${req.params.id}`);
          return cached ? JSON.parse(cached) : undefined;
        },
        onDegraded: (level, err) => {
          console.log(JSON.stringify({
            event: 'degradation',
            service: 'recommendations',
            level,
            productId: req.params.id,
            // err is unknown, so do not assume it is an Error instance.
            error: err instanceof Error ? err.message : String(err),
          }));
        },
      },
    ),
  ]);

  res.json({ product, recommendations });
});

The timeout is aggressive: 200ms. If the ML service is healthy, it responds in 30ms. If it is slow, we stop waiting at 200ms and degrade. The user gets the product page in about 220ms instead of 30 seconds.

Three design decisions in this wrapper matter:

1. The timeout is a business decision, not a technical one. 200ms means “recommendations are nice, but they are not worth delaying the page.” For inventory data, the timeout might be 2 seconds because “out of stock” is business-critical.

2. The fallback is typed. It returns the same shape as the success case. The route handler does not need an if (recommendations === null) branch. The frontend receives an empty array and renders nothing.

3. Degradation is logged as a first-class event. Not an error. An error implies someone did something wrong. Degradation implies the system is adapting. You want separate metrics for each.

Stale-while-error caching

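The wrapper above reads its stale copy from Redis, but that only works with a specific caching strategy. Redis deletes a key when its TTL expires, so an expired value cannot be read back. Keep the value under a long hard TTL and track freshness separately, here with a short-lived marker key. On failure, the stale value is still there to read.

// cache.ts
import { redis } from './redis.js';

export async function getStaleOrFresh<T>(
  key: string,
  fetcher: () => Promise<T>,
  ttlSeconds: number,
  // Hard TTL for the stale copy; must exceed ttlSeconds. The 24h default is an assumption.
  staleTtlSeconds = 24 * 60 * 60,
): Promise<T> {
  // Redis deletes expired keys, so freshness lives in a separate marker key
  // with the short TTL while the value itself keeps the long hard TTL.
  const freshKey = `${key}:fresh`;

  const refresh = async (): Promise<T> => {
    const value = await fetcher();
    await redis.setex(key, staleTtlSeconds, JSON.stringify(value));
    await redis.setex(freshKey, ttlSeconds, '1');
    return value;
  };

  const [cached, isFresh] = await Promise.all([redis.get(key), redis.get(freshKey)]);

  if (cached) {
    const parsed = JSON.parse(cached) as T;
    // Freshness marker still present: return immediately.
    if (isFresh) return parsed;
    // Marker expired: return stale, but trigger a background refresh.
    refresh().catch(() => {});
    return parsed;
  }

  // Cold cache: fetch and store.
  return refresh();
}

This is stale-while-revalidate with a twist: because the value outlives its freshness marker, it can still be served on failure even if it is hours old. The background refresh is fire-and-forget. If it fails, the stale value remains. If it succeeds, the cache is warm again. The value key holds plain JSON, so the wrapper's staleCache callback can read it directly.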

For critical fallbacks, pre-compute defaults and store them in a separate key that never expires:

await redis.set('recommendations:default', JSON.stringify(bestsellers));

Then in the wrapper:

staleCache: async () => {
  const cached = await redis.get(`recs:${id}`);
  if (cached) return JSON.parse(cached);
  const defaultRecs = await redis.get('recommendations:default');
  return defaultRecs ? JSON.parse(defaultRecs) : undefined;
},

This gives you two fallback layers: personalized stale data, then global defaults.

Feature flags for controlled degradation

Not every route should degrade the same way. Some features are too important to fake. Use feature flags to control degradation per route or per environment:

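interface FeatureDegradation {
  enabled: boolean;
  timeoutMs?: number;
  fallback?: unknown;
}

// Typed as a Record so lookups by arbitrary feature name compile under strict mode.
const DEGRADATION_CONFIG: Record<string, FeatureDegradation> = {
  recommendations: { enabled: true, timeoutMs: 200, fallback: [] },
  inventory: { enabled: false, timeoutMs: 5000 }, // no fallback; the error surfaces
  reviews: { enabled: true, timeoutMs: 300, fallback: { count: 0, items: [] } },
  pricing: { enabled: false }, // never degrade pricing
};

export function shouldDegrade(feature: string): boolean {
  return DEGRADATION_CONFIG[feature]?.enabled ?? false;
}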

In production, drive this from an environment variable or a feature flag service. During incidents, you can disable degradation for a specific feature if the fallback is causing confusion, or tighten the timeout if the dependency is flapping.
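As a rough sketch of the environment-variable approach, the snippet below merges JSON overrides into a static config. The names `applyOverrides`, `BASE_CONFIG`, and `LOCKED`, and the idea of a `DEGRADATION_OVERRIDES` variable, are assumptions made for this example rather than an established convention.

```typescript
// Sketch only: all names here are hypothetical.
interface FeatureConfig {
  enabled: boolean;
  timeoutMs?: number;
}

type DegradationConfig = Record<string, FeatureConfig>;

const BASE_CONFIG: DegradationConfig = {
  recommendations: { enabled: true, timeoutMs: 200 },
  pricing: { enabled: false }, // never degrade pricing
};

// Features that must never degrade, whatever the override says.
const LOCKED = new Set(['pricing']);

function applyOverrides(base: DegradationConfig, raw: string | undefined): DegradationConfig {
  if (!raw) return base;
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // A malformed override must not take the service down; keep the defaults.
    return base;
  }
  if (typeof parsed !== 'object' || parsed === null) return base;

  const merged: DegradationConfig = { ...base };
  for (const [feature, patch] of Object.entries(parsed as Record<string, Partial<FeatureConfig>>)) {
    const current = base[feature];
    if (!current || LOCKED.has(feature)) continue; // ignore unknown or locked features
    merged[feature] = { ...current, ...patch };
  }
  return merged;
}
```

At startup this could be called as `applyOverrides(BASE_CONFIG, process.env.DEGRADATION_OVERRIDES)`, with a value such as `{"recommendations":{"timeoutMs":100}}` to tighten a timeout during an incident. Locking features like pricing in code means a mistyped override cannot enable degradation where it is never acceptable.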

When not to degrade

Graceful degradation is not a universal virtue. There are paths where partial success is worse than total failure:

Payments and refunds. Never return “payment probably succeeded.” The user needs certainty. If the payment gateway is down, return 503 and let the user retry.
Authentication and authorization. Never degrade to “allow all” because the auth service is slow. A 500 or 503 is correct here.
Safety-critical operations. Medical dosing, industrial control, anything where a wrong answer hurts someone. Fail closed, not open.
Data mutations with side effects. If you are charging a customer and the inventory check fails, do not default to “assume in stock.” The business rule is: no charge without confirmation.

The rule: degrade reads, not writes. Degrade optional features, not core guarantees.
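The split between degradable and non-degradable paths can be encoded rather than left to convention. The following is a minimal sketch, assuming a hypothetical `Criticality` label per dependency. The 503 status and the `Retry-After` header are standard HTTP; the function name, the response shape, and the 5-second retry value are illustrative.

```typescript
// Hypothetical helper: decide how a dependency failure surfaces to the client.
type Criticality = 'optional' | 'critical';

interface FailureResponse<T> {
  status: number;
  headers: Record<string, string>;
  body: T | { error: string };
}

function onDependencyFailure<T>(
  dependency: string,
  criticality: Criticality,
  fallback: T,
): FailureResponse<T> {
  if (criticality === 'optional') {
    // Reads and optional features: serve the degraded value with a 200.
    return { status: 200, headers: {}, body: fallback };
  }
  // Payments, auth, mutations: fail closed and tell the client when to retry.
  return {
    status: 503,
    headers: { 'Retry-After': '5' },
    body: { error: `${dependency} is unavailable` },
  };
}
```

For a critical dependency the fallback argument is ignored on purpose: there is no value that is safe to substitute for a payment result or an authorization decision.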

Monitoring degradation

You cannot manage what you do not measure. Four metrics matter:

1. Degradation rate per service.

rate(degradation_events_total[5m])

Alert when this is above zero for more than 10 minutes. Degradation is a bandage, not a cure. If you are degraded for an hour, the dependency needs fixing, not more fallback.

2. Fallback cache hit rate.

rate(fallback_cache_hits_total[5m]) / rate(fallback_cache_attempts_total[5m])

If this drops below 50%, your stale cache is empty and you are serving Level 4 (empty) fallbacks more often than Level 1. That is a data warmth problem.

3. User-facing latency during degradation.

Degradation should make responses faster, not slower. If your fallback path is slower than the primary path, you have a bug in the fallback logic (common with unoptimized default queries).

4. Revenue or engagement impact.

The ultimate metric. If the recommendation service is degraded to bestsellers, does conversion drop 2% or 20%? This tells you whether your fallback is good enough or needs better defaults.
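To feed the two PromQL queries above, the service has to export the underlying counters. A minimal in-memory sketch follows. In practice a metrics client library would do this work; the `Counter` class and `recordDegradation` function here are hypothetical stand-ins. The metric names match the queries, while the `service` and `level` labels are assumptions.

```typescript
// In-memory stand-in for a metrics client; names are illustrative only.
class Counter {
  readonly name: string;
  private values = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  inc(labels: Record<string, string> = {}): void {
    // Sort label names so the same label set always maps to the same series.
    const key = Object.entries(labels)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${k}="${v}"`)
      .join(',');
    this.values.set(key, (this.values.get(key) ?? 0) + 1);
  }

  // Render in the Prometheus text exposition format.
  render(): string {
    const lines = [`# TYPE ${this.name} counter`];
    for (const [key, value] of this.values) {
      lines.push(key ? `${this.name}{${key}} ${value}` : `${this.name} ${value}`);
    }
    return lines.join('\n');
  }
}

const degradationEvents = new Counter('degradation_events_total');
const cacheAttempts = new Counter('fallback_cache_attempts_total');
const cacheHits = new Counter('fallback_cache_hits_total');

// Intended for the wrapper's onDegraded hook. Assumes a staleCache is
// configured, so every degradation is one attempt at the stale cache and
// a 'stale' level means that attempt was a hit.
function recordDegradation(service: string, level: string): void {
  degradationEvents.inc({ service, level });
  cacheAttempts.inc({ service });
  if (level === 'stale') cacheHits.inc({ service });
}
```

Keeping these counters separate from error counters matches the earlier point that degradation is an event in its own right, with its own alert thresholds.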

The operational checklist

Before you declare degradation work done, verify:

Every optional dependency has a typed fallback that matches the success shape.
Timeouts are set per dependency based on business criticality, not engineering convenience.
Stale cache has a separate long-TTL key or a cache-aside pattern that survives primary failure.
Feature flags control which routes can degrade.
Degradation events are logged and metrics are exported.
Alerts fire when degradation rate is non-zero for more than 10 minutes.
Load tests confirm that the fallback path completes within the timeout budget and is no slower than the healthy primary path.
Frontend and mobile clients handle missing optional fields without crashing.
Write paths (payments, auth, mutations) do not degrade. They fail fast.
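For the client-side item in the checklist, one defensive approach is to normalize the payload before rendering, so that absent optional sections become empty values. This is a sketch under assumptions: the field names follow the route and config shown earlier, and `ProductPagePayload`, `ProductPageView`, and `toView` are hypothetical names.

```typescript
// Assumed payload shape; only `product` is treated as required.
interface ProductPagePayload {
  product: { id: string; name: string };
  recommendations?: string[] | null;
  reviews?: { count: number; items: string[] } | null;
}

interface ProductPageView {
  product: { id: string; name: string };
  recommendations: string[];
  reviews: { count: number; items: string[] };
  showCarousel: boolean;
}

function toView(payload: ProductPagePayload): ProductPageView {
  // Missing or null optional sections collapse to empty-but-valid values.
  const recommendations = payload.recommendations ?? [];
  return {
    product: payload.product,
    recommendations,
    reviews: payload.reviews ?? { count: 0, items: [] },
    // Hide the carousel instead of rendering an empty one.
    showCarousel: recommendations.length > 0,
  };
}
```

Rendering code then works against `ProductPageView` only, so a Level 3 or Level 4 response never reaches a component as `undefined`.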

The takeaway

A total outage is not always caused by a total failure. It is often caused by a partial failure that the architecture treats as total. One slow dependency, one missing cache key, one unhandled rejection in a non-critical service, and the entire page goes white.

Graceful degradation is the engineering decision to build fallbacks that are good enough. Not perfect. Not ideal. Good enough to keep the business running while the team fixes the root cause. It is stale recommendations instead of a 500. It is a product page without reviews instead of no product page at all. It is the difference between a blip in metrics and a revenue cliff.

Build the wrapper. Set the timeouts. Warm the fallback cache. And stop letting 8% of your page take down the other 92%.
