
The Full Picture

The system failed on a Tuesday.
But this time, the developer did not ask "When did this begin?"
They opened three windows and found the answer themselves.
PRODUCTION ALERT: orders error rate > 5%
orders · error_rate = 12.4% · threshold = 5%
14:37 UTC
An alert fires. Something is wrong. You have three tools at your disposal.
Where do you start?

Choose your first investigation step.

There is no wrong answer. Each reveals a different piece.
metrics dashboard · orders service
http.server.errors / http.server.requests · orders
[chart: error rate climbs past the 5% threshold around 14:37, axis 14:20 – 14:55 UTC]
resource.service.name = orders
error_rate = 12.4%  ·  p99_latency = 3,200ms  ·  window = 14:30 – 14:40 UTC
You know THAT something is wrong with orders, and WHEN it started. But not WHAT is failing or WHY.
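The dashboard's headline number is just a ratio of two counters over a time window. A minimal sketch of that calculation, with made-up counter values chosen to reproduce the 12.4% figure (the counter names mirror the panel above, but the values are illustrative):

```python
# Hypothetical counter samples for the orders service over one window.
# In a real system these would come from a metrics backend
# (counters like http.server.errors and http.server.requests).
window = ("14:30", "14:40")
requests_total = 4200   # assumed value for illustration
errors_total = 521      # assumed value for illustration

error_rate = errors_total / requests_total
threshold = 0.05

print(f"error_rate = {error_rate:.1%} (threshold {threshold:.0%})")
if error_rate > threshold:
    print(f"ALERT: orders error rate above threshold in {window[0]}-{window[1]}")
```

The metric tells you the magnitude and the onset time, which is exactly as far as this signal can take you.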
trace search · orders service · errors only
Trace f7a2c9e1 · 14:37:22 UTC · status: ERROR
gateway                3180ms
  api                  3040ms
    orders             2810ms
      database         2560ms  ERR
trace_id: f7a2c9e1...3b8d · root: gateway · error at: database span
You can see WHERE the error happens: the database span. But the trace doesn't say WHY the database call failed.
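Finding the "WHERE" in a trace comes down to locating the deepest span that errored: everything above it fails only because it is waiting on that call. A small sketch over the waterfall above, using an illustrative span shape rather than any specific tracing API:

```python
# The trace as a flat list of spans, shaped like the waterfall above.
# Field names are illustrative, not a real tracing library's schema.
spans = [
    {"service": "gateway",  "duration_ms": 3180, "status": "OK",    "parent": None},
    {"service": "api",      "duration_ms": 3040, "status": "OK",    "parent": "gateway"},
    {"service": "orders",   "duration_ms": 2810, "status": "OK",    "parent": "api"},
    {"service": "database", "duration_ms": 2560, "status": "ERROR", "parent": "orders"},
]

by_service = {s["service"]: s for s in spans}

def depth(span):
    # Walk parent links up to the root to measure nesting depth.
    d = 0
    while span["parent"] is not None:
        span = by_service[span["parent"]]
        d += 1
    return d

# The deepest erroring span is the best answer to "WHERE".
deepest = max((s for s in spans if s["status"] == "ERROR"), key=depth)
print(f"error at: {deepest['service']} span ({deepest['duration_ms']}ms)")
```

Note what the span carries and what it doesn't: a status and a duration, but no error message explaining the failure. That gap is what the logs fill.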
log search · severity >= WARN · 14:30 – 14:40 UTC
Recent error logs (multiple services)
14:36:58 gateway WARN upstream timeout target=api
14:37:01 payment WARN retrying charge attempt=2
14:37:04 inventory WARN stock check slow latency=1200ms
14:37:09 api ERROR request failed path=/orders
14:37:11 orders ERROR db write failed table=orders
14:37:14 gateway ERROR 502 bad gateway route=/checkout
Errors and warnings are scattered across five services. Without knowing the request path, it is hard to tell which entry is the cause and which is a symptom.
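Grouping those lines by service makes the scatter concrete. A sketch that parses the log window above (simple whitespace splitting, since these example lines share one shape): it shows WHO logged what, but nothing links the entries into a single request path. A shared trace_id field on each line would provide that link.

```python
from collections import defaultdict

# The raw log lines from the 14:30-14:40 window, as shown above.
raw = """\
14:36:58 gateway WARN upstream timeout target=api
14:37:01 payment WARN retrying charge attempt=2
14:37:04 inventory WARN stock check slow latency=1200ms
14:37:09 api ERROR request failed path=/orders
14:37:11 orders ERROR db write failed table=orders
14:37:14 gateway ERROR 502 bad gateway route=/checkout
"""

by_service = defaultdict(list)
for line in raw.splitlines():
    ts, service, severity, message = line.split(" ", 3)
    by_service[service].append((ts, severity, message))

# Five services logged in the window; nothing here says which
# entries belong to the same request.
for service, entries in sorted(by_service.items()):
    print(service, [sev for _, sev, _ in entries])
```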
One signal gave you a piece. Not the whole answer. What do you check next?

Choose your second signal.

You need at least two signals to narrow down the problem.
Two signals, two perspectives. One showed the shape, the other added context. The last signal will complete the picture.

Check the final signal.

You know what is left.
You have seen all three signals. Errors appear in gateway, api, and orders. Logs mention payment and inventory too. But which service is the actual root cause?
Click the service where the failure originates.
Think about what the trace showed you.
gateway api orders database payment inventory
Root cause identified

The database connection pool is exhausted. Every service above it in the call chain (orders, api, gateway) fails as a consequence. The log warnings from payment and inventory are symptoms of the same bottleneck.

pool=orders · max_connections=10 · active=10 · waiting=34
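The exhaustion condition in those stats is mechanical: every connection is busy and requests are queuing behind them. A minimal check over the pool stats shown above (the dictionary shape is illustrative, not a specific database driver's API):

```python
# Connection-pool stats as reported in the root-cause panel above.
pool = {"name": "orders", "max_connections": 10, "active": 10, "waiting": 34}

# Exhaustion: all connections in use AND callers queued waiting for one.
exhausted = pool["active"] >= pool["max_connections"] and pool["waiting"] > 0
print(f"pool={pool['name']} exhausted={exhausted} queued={pool['waiting']}")
```

This is the kind of condition worth alerting on directly: a saturated pool with a growing wait queue predicts the cascade of timeouts the other signals only report after the fact.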
Your investigation path
Remember this?
[diagram: the full system (gateway, api, orders, payment, inventory, database) overlaid with the three signals: metrics, traces, logs]
You have walked the path of 19 koans.
From a silent system to one that speaks in metrics, traces, and logs.
You began with silence.
Now you can see.
← Return to the beginning