By Arjun Mehta
There's a moment in every engineer's life when they're debugging a production issue at 2am, looking at a dashboard that says "the system is up," while users are complaining that nothing works. The monitoring says everything is fine. The system claims to be healthy. But something is clearly broken.
That's the moment when you realize monitoring isn't enough. You need observability.
Monitoring tells you what you expect to happen. Observability tells you what actually happened. Monitoring answers "is the system up?" Observability answers "why did that user's request take 30 seconds? Where did the 30 seconds go? Which component caused it?"
With monitoring, you have dashboards showing CPU, memory, and request count. With observability, you have a system that lets you ask arbitrary questions about why something happened. It's the difference between glancing at your car's fuel gauge and being able to diagnose exactly why it won't start.
What Is Observability?
Observability, a term borrowed from control theory, is the degree to which you can understand the internal state of a system from its external outputs.
Think about it: you can't see inside your running system. You can only see what it tells you. The better it tells you what's happening, the more observable it is.
Observable systems have three pillars:
Metrics: Numbers that measure your system. Request latency. Error rate. CPU usage. Database query time. Success rate of payment processing. These are aggregate measurements over time.
Logs: Records of events. "User logged in at 2:15:32pm," "Database connection failed," "Payment processor returned timeout error." Logs are detailed but voluminous.
Traces: The path a request takes through your system. When a user clicks something, that request might go through your API gateway, then load balancer, then authentication service, then business logic service, then database. A trace shows that entire path, including how long each step took and where failures occurred.
When you have all three together, you have observability. You can answer:
"Why are requests slow?" (metrics show latency is up, traces show where time is being spent)
"Why are errors happening?" (metrics show error rate increased, logs show the underlying errors, traces show which component is failing)
"Is a specific user affected or is it system-wide?" (traces let you follow one user's journey through the system)
"What changed?" (logs of deployments, configuration changes, and errors let you correlate events)
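To make the three pillars concrete, here's a minimal Python sketch that emits all three for a single request: a metric sample, a structured log line, and a trace ID. The names (handle_request, the "checkout" logger) are illustrative, not from any particular framework:

```python
import json
import logging
import time
import uuid
from collections import defaultdict

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

# Metrics: aggregate samples, keyed by metric name (names are illustrative).
metrics = defaultdict(list)

def handle_request(user_id: int) -> str:
    trace_id = uuid.uuid4().hex  # Trace: one ID for the whole journey
    start = time.perf_counter()

    # Log: a structured event, tagged with the trace_id.
    log.info(json.dumps({"event": "request_received",
                         "user_id": user_id, "trace_id": trace_id}))

    # ... business logic would run here ...

    elapsed_ms = (time.perf_counter() - start) * 1000
    metrics["request_latency_ms"].append(elapsed_ms)  # Metric: one sample
    return trace_id

trace_id = handle_request(12345)
print(len(trace_id))  # 32-character hex trace ID
```

In a real system the metric samples would go to a time-series store and the log lines to a log aggregator, but the shape of the data is the same.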
Monitoring vs. Observability
Monitoring checks if something expected happened. You set up alerts:
- If CPU > 80%, alert
- If error rate > 5%, alert
- If response time > 500ms, alert
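These rules can be sketched as simple threshold checks. The metric names and snapshot values below are made up for illustration; real systems usually express the same idea as alerting rules in a monitoring tool:

```python
# Thresholds mirror the alert rules above.
ALERT_RULES = {
    "cpu_percent": 80,
    "error_rate_percent": 5,
    "response_time_ms": 500,
}

def check_alerts(current: dict) -> list[str]:
    """Return the names of any metrics that crossed their threshold."""
    return [name for name, limit in ALERT_RULES.items()
            if current.get(name, 0) > limit]

snapshot = {"cpu_percent": 91, "error_rate_percent": 2, "response_time_ms": 120}
print(check_alerts(snapshot))  # ['cpu_percent']
```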
These work when you know what to expect. But the world is full of novel failure modes. Your system might fail in ways you didn't anticipate.
Observability doesn't require you to predict problems. It gives you the ability to investigate problems you didn't expect.
Monitoring is for known unknowns. You know that CPU spikes are bad, so you monitor CPU. You know that errors are bad, so you monitor error rate.
Observability is for unknown unknowns. A user complains that something takes too long. You don't know what "something" is, so you can't monitor for it directly. But with observability, you can ask: "show me that user's journey through the system" and trace where time is being spent.
Example: Your payment processing is slow. With monitoring, you might see that your payment service took 10 seconds on average. But which 10 seconds? Is it the network call to the payment processor? Is it database lookup? Is it validation?
With observability, you can trace a specific payment through the system and see: network call took 8 seconds, database lookup took 1 second, validation took 0.1 seconds. Now you know where to optimize.
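One way to get that breakdown is to time each named step of the request. Here's a sketch using a context manager; the step names and sleep calls are stand-ins for the real payment work:

```python
import time
from contextlib import contextmanager

# Per-step durations for one request, in milliseconds.
timings: dict[str, float] = {}

@contextmanager
def timed(step: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = (time.perf_counter() - start) * 1000

def process_payment():
    with timed("network_call"):
        time.sleep(0.08)   # stand-in for calling the payment processor
    with timed("db_lookup"):
        time.sleep(0.01)   # stand-in for the database lookup
    with timed("validation"):
        time.sleep(0.001)  # stand-in for validation

process_payment()
slowest = max(timings, key=timings.get)
print(slowest)  # network_call
```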
Building for Observability
Observability isn't something you add at the end. It's something you build in from the start.
1. Instrument Your Code
Every component that does something important needs to emit data about what it's doing.
Metrics:
- request_latency_ms: how long did this request take?
- error_count: how many errors?
- cache_hit_rate: what percentage of database queries hit cache?
- payment_success_rate: what percentage of payments succeeded?
Logs:
- INFO: User 12345 logged in
- WARN: Payment processor response time 8000ms (slow)
- ERROR: Database connection failed, retrying
- DEBUG: Cache miss for key "user_profile_789"
Traces: When a request arrives, assign it a unique ID (trace_id). Every component that touches that request logs the trace_id. Later, you can search for that trace_id and see the entire journey.
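A minimal sketch of that pattern, assuming a shared log store that every component writes to; the component names are hypothetical:

```python
import uuid

# Every component appends an entry tagged with the request's trace_id.
# Searching the shared log by trace_id reconstructs the journey.
LOG: list[dict] = []

def record(trace_id: str, component: str, message: str):
    LOG.append({"trace_id": trace_id, "component": component, "message": message})

def handle(trace_id: str):
    record(trace_id, "api_gateway", "request received")
    record(trace_id, "auth_service", "token validated")
    record(trace_id, "orders_service", "order created")

t1, t2 = uuid.uuid4().hex, uuid.uuid4().hex
handle(t1)
handle(t2)

journey = [e["component"] for e in LOG if e["trace_id"] == t1]
print(journey)  # ['api_gateway', 'auth_service', 'orders_service']
```

The key point is that the trace_id is assigned once, at the edge, and passed along; each component only needs to include it in whatever it logs.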
2. Use Structured Logging
Unstructured logging:
"User login failed for user@example.com because password incorrect"
Structured logging:
{
  "event": "user_login_failed",
  "user_id": 12345,
  "email": "user@example.com",
  "reason": "password_incorrect",
  "timestamp": "2026-02-23T02:15:32Z",
  "trace_id": "abc123def456"
}
Structured logs are searchable and queryable. You can ask "how many login attempts failed due to password incorrect in the last hour?" Without structure, this requires parsing text.
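That query becomes a simple filter over JSON lines. A sketch, with made-up events mirroring the example above:

```python
import json
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)

# Three structured log lines, one JSON object per line (illustrative data).
lines = [
    json.dumps({"event": "user_login_failed", "reason": "password_incorrect",
                "timestamp": (now - timedelta(minutes=10)).isoformat()}),
    json.dumps({"event": "user_login_failed", "reason": "account_locked",
                "timestamp": (now - timedelta(minutes=30)).isoformat()}),
    json.dumps({"event": "user_login_failed", "reason": "password_incorrect",
                "timestamp": (now - timedelta(hours=2)).isoformat()}),
]

# "How many login attempts failed due to password incorrect in the last hour?"
cutoff = now - timedelta(hours=1)
failed = sum(
    1
    for line in lines
    for e in [json.loads(line)]
    if e["event"] == "user_login_failed"
    and e["reason"] == "password_incorrect"
    and datetime.fromisoformat(e["timestamp"]) >= cutoff
)
print(failed)  # 1
```

With unstructured text, the same question would require a regex that guesses at the message format and breaks the first time someone rewords it.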
3. Use Distributed Tracing
With a distributed tracing system (Jaeger, Datadog, New Relic), each request gets a trace_id. Every service that touches that request records:
- What it did
- How long it took
- Any errors
- The parent and child services in the call chain
Later, you can reconstruct the entire journey of a request through your system.
4. Monitor What Matters to Users
Don't just monitor technical metrics. Monitor things users care about:
- Payment success rate
- Search results relevance
- Page load time
- Checkout completion rate
- Feature availability
Technical metrics matter because they correlate with user experience, but user-facing metrics are what actually matter.
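Computing a user-facing metric like checkout completion rate can be as simple as counting events; the event names here are illustrative:

```python
# Checkout completion rate computed from user-facing events rather than
# server internals (data is made up for illustration).
events = [
    {"user": 1, "event": "checkout_started"},
    {"user": 1, "event": "checkout_completed"},
    {"user": 2, "event": "checkout_started"},
    {"user": 3, "event": "checkout_started"},
    {"user": 3, "event": "checkout_completed"},
]

started = sum(1 for e in events if e["event"] == "checkout_started")
completed = sum(1 for e in events if e["event"] == "checkout_completed")
rate = completed / started if started else 0.0
print(f"{rate:.0%}")  # 67%
```

A drop in this number is meaningful even when CPU, memory, and error rate all look fine, which is exactly the failure mode technical metrics miss.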
The Observability Culture
Observability isn't just technical. It's a cultural practice.
Developers should be familiar with production. If developers don't have access to logs and traces from production, they can't learn what actually happens. They'll build features in the dark, assuming they work correctly.
Logs should be part of incident response. When something breaks, the first step is "what happened?" You should be able to answer that by looking at logs and traces.
Use observability to find problems, not just debug them. Good observability reveals problems before they impact users. A subtle memory leak might not trigger alerts for weeks, but your metrics will show gradual degradation over time.
Document what you're observing. If you have a dashboard showing "requests that take more than 500ms per region," explain why that metric matters. What does it indicate? What action would you take if it spiked?
Tools for Observability
There are many options:
Open Source:
- Prometheus: Metrics collection and storage
- Jaeger: Distributed tracing
- ELK Stack: Elasticsearch, Logstash, and Kibana for log collection, storage, and querying
- Grafana: Visualization for metrics
Managed Services:
- Datadog: All-in-one metrics, logs, and traces
- New Relic: Similar all-in-one approach
- Honeycomb: Focus on high-cardinality data and traces
The tool matters less than the practice. A simple setup with good observability practices is better than an expensive tool used poorly.
Building Observability into Your Architecture
When evaluating tools like Glue for understanding your codebase, think about observability. Glue analyzes your actual codebase to show you architecture, coupling, and complexity hotspots. This is observability at the code level—understanding the structure without having to read every file.
Combine code-level observability (understanding your architecture) with runtime observability (understanding how it actually behaves in production). Together, they give you full visibility.
Frequently Asked Questions
Q: Aren't metrics and logs enough?
A: They help, but they're limited. Metrics tell you that something is wrong. Logs tell you what events happened. Traces tell you the path a request took and where it spent time. For complex systems, you need all three.
Q: Do I need distributed tracing?
A: For monolithic systems with few components, structured logs might be enough. For microservice architectures, distributed tracing is essential. Without it, you can't follow a request through the system.
Q: How much data should I log?
A: Enough to understand what happened, not so much that you're paying 10x for storage. Structured logs are more efficient than unstructured. Log events that matter: important state changes, errors, and timing information. Don't log every line of code execution.
Q: Doesn't observability hurt performance?
A: It can, if done poorly. Use efficient logging and tracing, and sample in high-traffic systems. Good observability tools are designed to minimize overhead.
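A common low-overhead approach is deterministic head sampling: hash the trace_id so every service independently makes the same keep-or-drop decision for a given trace. A sketch, with an illustrative 1% rate:

```python
import hashlib

SAMPLE_RATE = 0.01  # keep roughly 1% of traces (illustrative choice)

def sampled(trace_id: str) -> bool:
    """Deterministic: the same trace_id always yields the same decision,
    so all services in the call chain keep or drop the same traces."""
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < SAMPLE_RATE * 10_000

kept = sum(sampled(f"trace-{i}") for i in range(100_000))
print(kept)  # roughly 1,000 of the 100,000 traces
```

Because the decision is a pure function of the trace_id, no coordination between services is needed, and a kept trace is always complete end to end.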