The Three Pillars of Observability Are A Myth: What Actually Matters In Production

The team has logs in Elasticsearch, metrics in Prometheus, and traces in Jaeger. Three different tools, three different mental models, three different places to look during an incident. When something is broken, the on-call jumps between tabs trying to correlate timestamps. The tools are all “best in class.” The team is still slow at debugging because each tool answers a different question and connecting them is on you.

The “three pillars” framing of observability — logs, metrics, traces — is everywhere and is, frankly, a category mistake. It describes implementation details, not what you’re trying to do. The thing you actually need is the ability to ask new questions about your system in real time. That capability shows up most strongly when you store high-cardinality events and let yourself slice them by anything.

This post is the case for the alternative framing, and the practical setup that gets you genuine observability.

The original framing, briefly

The three-pillars argument: a production system needs

Logs: time-stamped text describing what happened.
Metrics: numeric counters and gauges, aggregated.
Traces: linked spans across services, showing request flow.

Each pillar has its tools. Logs go to ELK/Loki. Metrics go to Prometheus. Traces go to Jaeger/Tempo.

The framing is so common it’s barely questioned. But it has a problem: the pillars don’t compose. The metric you have doesn’t help you find the trace you need. The log line doesn’t link to the metric. You glue them together with timestamps and request IDs and hope the clocks agree.

The alternative: events with cardinality

Charity Majors and the Honeycomb team’s framing: store wide events. An event is a single record describing one unit of work — usually one HTTP request, one job, one transaction. Each event has many attributes: user_id, endpoint, latency_ms, status, region, feature_flag.x, db.query_count, db.query_time, etc.

Instead of three separate stores, you have one store that holds events. From those events you can:

Compute aggregates (“p99 latency by endpoint”) — your metrics.
Look at individual records (“show me events where user_id=42 in the last hour”) — your logs.
Reconstruct flows (“show me all events with this trace_id”) — your traces.

The pillars become views over the same data, not separate systems.
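To make "views over the same data" concrete, here is a minimal sketch over an in-memory list of events. The sample records and the helper names (`p99_by_endpoint`, `events_for_user`, `events_for_trace`) are made up for illustration; a real backend would run the equivalent queries against a columnar store.

```python
from collections import defaultdict
from math import ceil

# Hypothetical wide events; field names follow the examples in this post.
EVENTS = [
    {"trace_id": "t1", "user_id": "42", "endpoint": "/api/orders", "status": 200, "latency_ms": 120},
    {"trace_id": "t1", "user_id": "42", "endpoint": "/api/pricing", "status": 200, "latency_ms": 35},
    {"trace_id": "t2", "user_id": "7", "endpoint": "/api/orders", "status": 500, "latency_ms": 910},
    {"trace_id": "t3", "user_id": "42", "endpoint": "/api/orders", "status": 200, "latency_ms": 140},
]


def percentile(values, pct):
    """Nearest-rank percentile; good enough for a sketch."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p99_by_endpoint(events):
    """Metrics view: aggregate latency per endpoint."""
    groups = defaultdict(list)
    for e in events:
        groups[e["endpoint"]].append(e["latency_ms"])
    return {endpoint: percentile(vals, 99) for endpoint, vals in groups.items()}


def events_for_user(events, user_id):
    """Logs view: individual records matching a filter."""
    return [e for e in events if e["user_id"] == user_id]


def events_for_trace(events, trace_id):
    """Traces view: every event that shares a trace_id."""
    return [e for e in events if e["trace_id"] == trace_id]
```

All three functions read the same list. Nothing is copied into a second or third system.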

The cardinality argument

The thing that distinguishes observability from monitoring is the ability to ask questions you didn’t predict. To do that, you need attributes you didn’t pre-aggregate. Cardinality — the number of distinct values an attribute can have — is what enables that.

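Pre-aggregated metrics (Prometheus-style) require you to decide cardinality up front. http_requests_total{endpoint="..."} produces one time series per distinct endpoint value. If you added user_id as a label, a million users would mean a million series for that one metric, multiplied by every other label on it. Prometheus is not built for that: every series carries its own memory and index cost, and the Prometheus documentation explicitly warns against labels with unbounded values such as user IDs.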
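The arithmetic is worth seeing once. This is a rough sketch with invented cardinalities; `series_count` is a hypothetical helper that gives the worst case, where every label combination actually occurs.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities, chosen only for illustration.
bounded = {"endpoint": 50, "status": 5, "region": 4}
with_user = {**bounded, "user_id": 1_000_000}

print(series_count(bounded))    # 1000
print(series_count(with_user))  # 1000000000
```

Real traffic rarely hits every combination, so the true number is lower, but the growth is multiplicative either way.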

Event stores (Honeycomb, Datadog APM with Trace Search, ClickHouse-backed self-hosts) let you slice by anything you stored, including high-cardinality attributes. Run a query that says “for user_id=42, show me p99 latency by endpoint” — and it works.

This is the capability that turns “we have data” into “we can debug.”

What this means in practice

A few changes from the three-pillars setup:

1. Emit one structured event per request. Not separate log lines + metric increments. One event with everything:

{
  "ts": "2024-09-13T10:32:14Z",
  "service": "api",
  "trace_id": "abc123",
  "span_id": "def456",
  "endpoint": "/api/orders",
  "user_id": "42",
  "tenant_id": "acme",
  "status": 200,
  "latency_ms": 142,
  "db_query_count": 3,
  "db_query_total_ms": 87,
  "cache_hits": 5,
  "cache_misses": 1,
  "feature_flags": {"new_pricing": true},
  "region": "us-east-1"
}

This single event answers questions logs, metrics, and traces would each give a partial answer to.

2. Use OpenTelemetry semantic conventions. Standardized attribute names mean tools can find them automatically.

3. Send events to a backend that supports high-cardinality slicing. Honeycomb, Datadog APM, New Relic, self-hosted ClickHouse-based setups (e.g., SigNoz), or Tempo + Grafana with the right config.

4. Phase out separate metric counters where possible. Aggregations from event data replace pre-defined counters. Some specific cases (CPU, memory, raw infra) still want metrics; everything app-level can be derived.
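Step 1 is the one that changes application code, so here is a minimal sketch of it using only the standard library. `wide_event` and `handle_order` are hypothetical names, the hard-coded db and flag values stand in for real instrumentation, and in practice an OpenTelemetry span would play the role of the `event` dict.

```python
import json
import sys
import time
from contextlib import contextmanager


@contextmanager
def wide_event(service, endpoint, out=sys.stdout):
    """Collect attributes during one request and emit a single JSON line at the end."""
    event = {"service": service, "endpoint": endpoint, "status": 200}
    start = time.monotonic()
    try:
        yield event  # handlers add attributes as they learn them
    except Exception as exc:
        event["status"] = 500
        event["error"] = type(exc).__name__
        raise
    finally:
        # Runs on success and on failure, so exactly one event is written.
        event["latency_ms"] = round((time.monotonic() - start) * 1000, 2)
        event["ts"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        out.write(json.dumps(event) + "\n")


def handle_order(user_id, tenant_id, out=sys.stdout):
    """Hypothetical handler; the values below are placeholders."""
    with wide_event("api", "/api/orders", out=out) as event:
        event["user_id"] = user_id
        event["tenant_id"] = tenant_id
        event["db_query_count"] = 3
        event["feature_flags"] = {"new_pricing": True}
```

The handler never calls a logger or increments a counter. Request counts and error rates are derived later by counting events, which is what step 4 relies on.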

The two real pillars

If you must talk about pillars, two are enough:

Events for application behavior. One row per logical unit of work, with rich attributes.
Metrics for infrastructure. CPU, memory, network — things that aren’t request-shaped.

Logs become “events without a request context.” Traces become “events linked by trace_id.” The framework simplifies.

What you give up

Two real costs:

Storage cost. Wide events with many attributes are bigger than aggregated metrics. Storing every event for a year is expensive. Most teams sample (keep 100% of errors, 10% of successes, all of high-value endpoints). The extra storage is the price of being able to query by anything.

Tool maturity for self-hosted setups. Prometheus + Grafana + Loki is well-trodden. Self-hosting an event-shaped observability stack is newer; SigNoz, ClickHouse-based stacks, etc. are real options but rougher.

For SaaS observability tools (Honeycomb, Datadog), the integration is polished. The cost is the bill.

A practical setup

For a Node.js / Go / Python service:

Instrument with OpenTelemetry. Auto-instrumentation for HTTP, DB, queue handlers. Manual spans at logical boundaries (see the OTel post).
Add request-level attributes you’ll want to query by. user_id, tenant_id, feature_flag.X, business categories, etc.
Send to a backend with high-cardinality support. Honeycomb (great UX), Datadog (broad), Tempo+Grafana (no license fee if you self-host, though you still pay for the infrastructure and the time to run it).
Keep Prometheus for infrastructure metrics. The two coexist; they answer different questions.

The cost is one library, ~50 lines of setup code, and a backend.

Sampling

For high-volume services:

Always sample errors at 100%. Every error is a learning opportunity.
Sample slow requests at 100%. Tail latency is where bugs live.
Sample normal requests at 1-10%. Aggregates remain close to accurate as long as each kept event records its sample rate and queries weight by it; you just don’t store every single event.

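Tail-based sampling (deciding after the request completes) is what makes these rules possible, because status and latency are only known at the end. The OTel Collector supports it through its tail sampling processor. For SaaS tools, it’s typically built in.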
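The rules above fit in a few lines. This is a sketch of the decision logic only, not of the OTel Collector's configuration; the 1000 ms threshold, the 1-in-10 rate, the treatment of 5xx as "error", and the function names are all assumptions.

```python
import random


def sample_rate_for(event, slow_ms=1000, normal_rate=10):
    """Return N meaning 'keep 1 in N'. Called after the request has finished."""
    if event["status"] >= 500:
        return 1          # keep every error
    if event["latency_ms"] >= slow_ms:
        return 1          # keep every slow request
    return normal_rate    # keep roughly 1 in normal_rate of the rest


def maybe_keep(event, rng=random):
    """Return the event tagged with its sample rate, or None if it is dropped."""
    rate = sample_rate_for(event)
    if rate > 1 and rng.random() >= 1 / rate:
        return None
    return {**event, "sample_rate": rate}


def estimated_total(kept_events):
    """Each kept event stands in for sample_rate original requests."""
    return sum(e["sample_rate"] for e in kept_events)
```

Storing `sample_rate` on the event is the important part: counts and rates computed from the kept events stay close to the true values only if every query weights by it.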

Two anti-patterns

1. Logging the same data twice — once as a log line, once as a metric increment. Doubles cost, doubles complexity, doesn’t add capability. Pick one (the event) and derive the other.

2. Pre-aggregating everything you might ever want. Prometheus shops sometimes have 200 named metrics that try to cover every conceivable query. The result: 200 places to maintain, and no flexibility for the question nobody anticipated. With events, you decide how to aggregate at query time, so that question can still be asked.

What a debugging session looks like

The difference between three-pillars and event-based observability is most visible mid-incident:

Three pillars: Dashboard shows error rate spike. Switch to logs, search by time, find error messages. Switch to traces, look up a request ID, find the slow span. Switch back to metrics, correlate manually. 30 minutes.

Event-based: Query: “show me events from the last 5 minutes where status=500, group by attribute that varies most.” Answer arrives in seconds, points at a specific feature_flag value or tenant_id. Drill in by adding filters. 5 minutes to root cause.

Once you’ve experienced the second flow, the first feels like working blindfolded.
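The "group by attribute that varies most" query in the event-based flow can be approximated by comparing how often each attribute value appears among failing events versus healthy ones. The sketch below is a simplified stand-in for what a backend does, not any vendor's actual algorithm; `share` and `most_skewed` are hypothetical names.

```python
from collections import Counter


def share(events, attr):
    """Fraction of events carrying each value of attr."""
    counts = Counter(str(e.get(attr)) for e in events)
    total = len(events) or 1
    return {value: n / total for value, n in counts.items()}


def most_skewed(events, attrs, is_bad=lambda e: e["status"] >= 500):
    """Rank (attr, value) pairs by how over-represented they are among bad events."""
    bad = [e for e in events if is_bad(e)]
    good = [e for e in events if not is_bad(e)]
    scored = []
    for attr in attrs:
        bad_share, good_share = share(bad, attr), share(good, attr)
        for value, frac in bad_share.items():
            # Positive score: this value is more common in failures than in successes.
            scored.append((frac - good_share.get(value, 0.0), attr, value))
    return sorted(scored, reverse=True)
```

Calling `most_skewed(events, ["tenant_id", "region", "endpoint"])` returns a ranked list of tuples. The first entry is the attribute value most concentrated in the failures, which is the first filter to add when drilling in.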

When metrics + logs are enough

For small services with simple traffic patterns, the three-pillars setup is fine. The event-based approach pays back when:

You have many independent dimensions to slice by (multi-tenant, multi-region, many feature flags).
Incidents are getting hard to debug; cause is “somewhere between three services.”
You’re adopting OTel anyway.

For a 5-service stack in a single region, Prometheus + structured logs is enough. The capabilities only matter when complexity does.

The takeaway

The three pillars are a description of tools, not capabilities. The capability you actually want is “ask new questions about my system, fast, in real time.” That comes from storing wide events with high-cardinality attributes — not from picking three separate tools.

Modern observability tooling (OpenTelemetry, Honeycomb, Datadog APM, ClickHouse stacks) supports this directly. The migration from three-pillars to events is incremental: start emitting wide events, and the new capability shows up the next time you have an incident that the dashboard couldn’t explain.

A note from Yojji

The kind of observability work that turns “we have logs, metrics, and traces” into “we can answer any question about the system in seconds” is the kind of long-haul backend engineering Yojji’s teams build into the platforms they ship for clients.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), and microservices — including the observability work that decides whether your incident debugging is fast or guesswork.