Observability: Beyond Monitoring
I've watched teams spend six months implementing Datadog, deploy it across 50 services, and still be unable to answer the basic question: "Why is request latency degrading?" They had metrics. They had dashboards. They had alerting. What they didn't have was observability.
There's a critical difference between monitoring and observability, and conflating them is a fast way to build expensive systems that don't actually help you understand what's happening.
Monitoring Is Not Observability
Monitoring tells you about things you already know to look for. You define metrics ahead of time - CPU, memory, request latency, error rates - and collect them. You set thresholds. When those thresholds are breached, you get an alert. This works great for known problems. It's terrible for unknown ones.
Observability is the ability to ask arbitrary questions about your system's state without knowing in advance what questions you'll need to ask. It's about unknown unknowns. You're not trying to predict failure modes; you're trying to build systems that reveal their own failure modes when you ask.
Here's the practical difference: Monitoring says "we're alerting on 95th percentile latency." Observability says "we recorded enough structured information that we can answer: what percentage of requests hit the cache? Of those that hit the database, what was the distribution of query time by table? Of the slow queries, which ones came from the auth service? What percentage of those had retry loops?"
Monitoring is a fire alarm. Observability is forensic evidence.
The Three Pillars
The observability industry settled on three pillars: logs, metrics, and traces. Each one tells you something different, and you need all three.
Logs are events that happened. "User 12345 logged in at 2024-02-24 14:32:10 UTC." "Database connection pool exhausted at 14:32:15 UTC." "Request to /api/users took 5234ms." Logs are high cardinality - they can contain arbitrary information - but they're expensive to store. Most teams can only afford to log a fraction of their events.
Metrics are measurements over time. They're aggregations: request count, request latency distribution, CPU usage, memory usage, queue depth. Metrics are low cardinality and cheap to store. You can afford to keep metrics forever. But they're aggregates - when you look at a metric, you've lost the individual data points. You know the 95th percentile of request latency went up, but you don't know which requests were slow or why.
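The loss of individual data points is easy to see in a toy sketch. The latencies and request IDs below are invented for illustration:

```python
import statistics

# Hypothetical per-request latencies (ms), keyed by request ID.
latencies = {"req-1": 40, "req-2": 45, "req-3": 52, "req-4": 48, "req-5": 950}

# The metrics pipeline keeps only the aggregate...
p95 = statistics.quantiles(latencies.values(), n=20)[18]  # 95th percentile

# ...so the aggregate alone can no longer tell you *which* request was slow.
# The 950ms outlier dominates p95, but its request ID is gone.
print(f"p95 latency: {p95:.0f}ms")
```

This is the trade: metrics are cheap precisely because they throw the individual data points away.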
Traces are the request flow through your system. A single API request hits your auth service, then your API gateway, then three backend services, then the database. A trace shows you the entire path, timing of each hop, and any errors. Traces are how you understand distributed system behavior.
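A trace is essentially a tree of timed spans sharing one trace ID. A minimal sketch, with invented service names and timings:

```python
from dataclasses import dataclass
from typing import Optional

# One span per operation; parent_id links spans into a tree.
@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    duration_ms: int

trace = [
    Span("s1", None, "gateway", "GET /api/users", 275),
    Span("s2", "s1", "auth", "verify_token", 50),
    Span("s3", "s1", "users", "fetch_profile", 200),
    Span("s4", "s3", "postgres", "SELECT * FROM users", 150),
]

# Render the request path as an indented timeline, child spans under parents.
def render(spans, parent=None, depth=0):
    lines = []
    for s in spans:
        if s.parent_id == parent:
            lines.append(f"{'  ' * depth}{s.service}:{s.operation} ({s.duration_ms}ms)")
            lines.extend(render(spans, s.span_id, depth + 1))
    return lines

print("\n".join(render(trace)))
```

The indentation is the point: you see at a glance that the 150ms database query lives under `fetch_profile`, which lives under the gateway request.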
Most teams implement metrics first (monitoring), eventually add logs (debugging), and never implement traces. This is backwards. Traces are the most underused pillar, and they're where the real understanding lives.
Why Traces Matter Most
Think about a slow request. With logs and metrics, you see: "latency was high." You dig through logs, find the request ID, and see a cascade of timestamps. You try to correlate log entries to understand what happened. It's forensic, manual, and slow.
With traces, you see the exact timeline of every operation in every service. You see: auth took 50ms, API gateway took 20ms, service A took 200ms and called service B, service B made three database queries. The second query was 150ms. You can zoom into that query. Why was it slow? Was it a full table scan? A connection timeout waiting for a free pool slot? Lock contention on a row?
Traces turn debugging from a guessing game into a scavenger hunt with a map.
Implementing Observability
Here's what you actually need to do.
Instrument everything with OpenTelemetry. OpenTelemetry is the open standard for generating traces, metrics, and logs in a vendor-agnostic way. Use the language-specific SDKs. Instrument your frameworks (Django, FastAPI, Rails, Spring Boot all have OpenTelemetry integrations). Set it up once and forget about it.
What to instrument: every external call (database queries, HTTP requests, cache hits), every business operation (user login, payment processing, report generation), and every state change in critical services. You don't need to instrument every line of code. You need to instrument the boundaries where latency actually lives.
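The pattern the OpenTelemetry SDKs automate can be sketched in plain Python: wrap each boundary in a timed span. This is an illustration of the idea only; in real code you would use the SDK's tracer rather than hand-rolling it, and the span names and attributes below are invented.

```python
import time
from contextlib import contextmanager

spans = []  # stand-in for an exporter; a real SDK ships spans to a backend

# Sketch of boundary instrumentation: one timed span per external call.
@contextmanager
def span(name, **attributes):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
            **attributes,
        })

# Instrument the boundary, not every line of code.
with span("db.query", table="users"):
    time.sleep(0.01)  # stand-in for the actual database call

print(spans[0]["name"], spans[0]["table"])
```

Notice that the attributes (`table="users"`) travel with the timing; that is what lets you later slice latency by table, by caller, by anything you recorded.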
Structure your logs. Logs should be structured data, not free-form text. That means JSON with consistent keys. Every log entry should include: timestamp, request ID (for correlation), service name, operation name, and context-specific fields. Never log "error occurred" - log "database_connection_failed" with fields like host, port, error_message. When you have structured logs, you can query them like a database.
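A sketch of what one structured entry looks like. The service name, request ID, and trace ID here are invented placeholders; in practice they come from your request context.

```python
import json
from datetime import datetime, timezone

# Every entry carries the same consistent keys plus event-specific fields.
def log_event(operation, **fields):
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": "billing",                               # hypothetical service name
        "request_id": "req-8f3a",                           # correlation ID from context
        "trace_id": "0af7651916cd43dd8448eb211c80319c",     # placeholder trace ID
        "operation": operation,
        **fields,
    }
    print(json.dumps(entry))
    return entry

# Not "error occurred" - a named event with the fields you will query on.
e = log_event("database_connection_failed", host="db-1", port=5432,
              error_message="connection pool exhausted")
```

Because every entry is JSON with stable keys, "show me all `database_connection_failed` events for `db-1` in the last hour" becomes a query, not a grep expedition.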
Connect your traces to your logs. Every log entry should include the trace ID. This lets you reconstruct the full timeline. When you're looking at a trace in Jaeger, you can click to see all log entries for that trace ID. When you're in your logs, you can click to see the full trace.
Implement distributed tracing with OpenTelemetry and a backend like Jaeger or Tempo. This is not optional. Logs and metrics alone are not enough. You need to see request flows. Use context propagation (W3C Trace Context standard) to pass trace information between services.
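The W3C Trace Context `traceparent` header can be sketched with the standard library. OpenTelemetry's propagators do this for you; the point here is only the format: version, 32-hex-char trace ID, 16-hex-char parent span ID, flags.

```python
import re
import secrets

# Caller side: mint a traceparent header for an outgoing request.
def make_traceparent(trace_id=None, span_id=None):
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 8 bytes  -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"          # version-traceid-parentid-flags

# Callee side: parse the header and continue the same trace.
def parse_traceparent(header):
    m = re.fullmatch(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        raise ValueError("malformed traceparent")
    version, trace_id, parent_span_id, flags = m.groups()
    return {"trace_id": trace_id, "parent_span_id": parent_span_id, "sampled": flags == "01"}

header = make_traceparent()
ctx = parse_traceparent(header)
print(ctx["trace_id"])  # callee's spans join the caller's trace
```

Because both services share the trace ID, their spans assemble into one tree in Jaeger or Tempo instead of two disconnected fragments.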
Create service maps from your traces. Once you have traces, you can build a map of which services call which other services. This is your actual system topology, not the architecture diagram someone drew. Real service maps reveal undocumented dependencies and unauthorized integrations. This matters for resilience planning.
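Deriving a service map from span data is mechanical. The spans below are invented; in practice you would export them from your tracing backend:

```python
from collections import defaultdict

# Invented span data: each span knows its parent and its service.
spans = [
    {"span_id": "a", "parent_id": None, "service": "gateway"},
    {"span_id": "b", "parent_id": "a", "service": "auth"},
    {"span_id": "c", "parent_id": "a", "service": "orders"},
    {"span_id": "d", "parent_id": "c", "service": "postgres"},
    {"span_id": "e", "parent_id": "c", "service": "legacy-billing"},  # undocumented dependency
]

# Edge from parent service to child service = one observed call path.
by_id = {s["span_id"]: s for s in spans}
service_map = defaultdict(set)
for s in spans:
    parent = by_id.get(s["parent_id"])
    if parent and parent["service"] != s["service"]:
        service_map[parent["service"]].add(s["service"])

for caller, callees in sorted(service_map.items()):
    print(f"{caller} -> {', '.join(sorted(callees))}")
```

The `legacy-billing` edge is the payoff: a dependency nobody drew on the architecture diagram, surfaced because the traces recorded what actually happened.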
SLOs, Not SLAs
An SLA (Service Level Agreement) is a contract with your customer about availability. "We guarantee 99.9% uptime." If you miss it, you owe them money.
An SLO (Service Level Objective) is an internal target. "We want to hit 99.95% availability." It's more aggressive than your SLA because you want cushion. When you exhaust your error budget (the 0.05% of failure you planned for), you shift from feature development to reliability work until you're back in budget.
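The error-budget arithmetic for that 99.95% example, over a 30-day window:

```python
# Error budget for a 99.95% availability SLO over a 30-day window.
slo = 0.9995
window_minutes = 30 * 24 * 60            # 43,200 minutes in the window

budget_minutes = window_minutes * (1 - slo)
print(f"allowed downtime: {budget_minutes:.1f} min/month")  # 21.6

# After an incident, compare downtime consumed against the budget.
downtime_so_far = 15.0                   # hypothetical minutes of downtime this window
remaining = budget_minutes - downtime_so_far
print(f"budget remaining: {remaining:.1f} min")
```

21.6 minutes a month is not much; a single bad deploy can consume most of it, which is exactly the signal that should pause feature work.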
Most teams set SLOs based on arbitrary numbers ("99.9% sounds good"). Real SLOs are based on observability data. You measure: "When availability dropped below X%, did we lose customers? Did we lose revenue?" You set your SLO based on the threshold where business impact actually happens, not on what sounds ambitious.
Observability makes this possible. Without good observability data, you're guessing.
The Culture Piece
Observability doesn't work as an infrastructure afterthought. You need to treat "understanding what's happening in production" as a first-class engineering concern, not a "we'll add monitoring later" problem.
This means: developers spend time thinking about what they need to observe before they write code. Code reviews include questions about observability: "What happens when this service is overloaded? How will we know? What metric will change?" On-call engineers have time during their shift to improve observability of systems they support. Teams have observability budget similar to testing budget.
When a production incident happens, the post-mortem includes: "What made this hard to debug? What observability was missing?" That becomes a ticket. The next time you hit that scenario, you see it immediately.
Connecting to Codebase Intelligence
Here's where Glue comes in. Observability tells you what's happening in production. Codebase intelligence explains why the code behaves that way.
You see a trace showing that Service A is consistently slower when making calls to Service B. Why? Is Service B always slow, or is Service A doing something inefficient? Is there a retry loop? Is the connection pooling configured poorly? Is there a database N+1 problem?
Glue lets you ask: "Show me every call from Service A to Service B in the codebase. How many HTTP clients are configured? Do they share a connection pool?" You see the pattern. Maybe there are three different HTTP client instances, each with its own pool. You've found the problem without reading hundreds of lines of code.
Observability is external (what's happening), codebase intelligence is internal (why the code is structured this way). Together they close the loop.
Observability in 60 Seconds (TL;DR)
Monitoring tells you about known problems. Observability lets you ask arbitrary questions about system state. Three pillars: logs (structured events), metrics (aggregated measurements), traces (request flows through systems). Implement OpenTelemetry for open standard instrumentation. Connect logs to traces using trace IDs. Create real service maps from traces. Set SLOs based on observability data, not guesses. Treat observability as a first-class engineering concern, not an afterthought.
Frequently Asked Questions
Q: Doesn't implementing observability require expensive tools?
A: No. OpenTelemetry is free and open source. Jaeger (for traces) is free and open source. ELK Stack (for logs) is free and open source. You can build world-class observability on open source software. Commercial vendors (Datadog, New Relic) are expensive and add polish, but they're not required. Start with free tools and add commercial tools only if you have specific needs they solve better.
Q: How much observability is enough?
A: Enough to answer questions without digging through code or doing manual testing. If you hit a production incident and it takes an on-call engineer 30 minutes to understand the root cause, you don't have enough observability. If it takes 3 minutes, you probably do. The metric is time-to-understanding, not the number of metrics collected.
Q: Can we retrofit observability into an existing system?
A: Yes. Start with OpenTelemetry SDKs in your application and frameworks (30 minutes of work). Add structured logging (1-2 days of work). Add distributed tracing (1-2 weeks of work depending on system complexity). You can do this incrementally. Trace the most critical user flows first, then expand. Don't try to boil the ocean.
Related Reading
- What Is Technical Debt Tracking?
- What Is Technical Debt Prioritization?
- Technical Debt Patterns: The 7 Types Costing You the Most
- The CTO's Guide to Product Visibility
- What Is Code Intelligence?