
The Full Picture

The system failed on a Tuesday.
But this time, the developer did not ask "When did this begin?"
They opened three windows and found the answer themselves.
PRODUCTION ALERT: orders error rate > 5%
orders · error_rate = 12.4% · threshold = 5%
14:37 UTC
An alert fires. Something is wrong. You have three tools at your disposal.
Where do you start?

Choose your first investigation step.

There is no wrong answer. Each reveals a different piece.
metrics dashboard · orders service
http.server.errors / http.server.requests · orders
[chart: error rate climbs past the 5% threshold around 14:37, axis 14:20 – 14:55 UTC]
resource.service.name = orders
error_rate = 12.4%  ·  p99_latency = 3,200ms  ·  window = 14:30 – 14:40 UTC
You know THAT something is wrong with orders, and WHEN it started. But not WHAT is failing or WHY.
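The dashboard's headline number is just a ratio of two counters over a time window. A minimal sketch of that calculation, with made-up counter values chosen to reproduce the 12.4% figure (the counter names mirror the panel above, but the values are illustrative):

```python
# Hypothetical counter samples for the orders service over one window.
# In a real system these would come from a metrics backend
# (counters like http.server.errors and http.server.requests).
window = ("14:30", "14:40")
requests_total = 4200   # assumed value for illustration
errors_total = 521      # assumed value for illustration

error_rate = errors_total / requests_total
threshold = 0.05

print(f"error_rate = {error_rate:.1%} (threshold {threshold:.0%})")
if error_rate > threshold:
    print(f"ALERT: orders error rate above threshold in {window[0]}-{window[1]}")
```

The metric tells you the magnitude and the onset time, which is exactly as far as this signal can take you.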
trace search · orders service · errors only
Trace f7a2c9e1 · 14:37:22 UTC · status: ERROR
gateway                3180ms
  api                  3040ms
    orders             2810ms
      database         2560ms  ERR
trace_id: f7a2c9e1...3b8d · root: gateway · error at: database span
You can see WHERE the error happens: the database span. But the trace doesn't say WHY the database call failed.
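Finding the "WHERE" in a trace comes down to locating the deepest span that errored: everything above it fails only because it is waiting on that call. A small sketch over the waterfall above, using an illustrative span shape rather than any specific tracing API:

```python
# The trace as a flat list of spans, shaped like the waterfall above.
# Field names are illustrative, not a real tracing library's schema.
spans = [
    {"service": "gateway",  "duration_ms": 3180, "status": "OK",    "parent": None},
    {"service": "api",      "duration_ms": 3040, "status": "OK",    "parent": "gateway"},
    {"service": "orders",   "duration_ms": 2810, "status": "OK",    "parent": "api"},
    {"service": "database", "duration_ms": 2560, "status": "ERROR", "parent": "orders"},
]

by_service = {s["service"]: s for s in spans}

def depth(span):
    # Walk parent links up to the root to measure nesting depth.
    d = 0
    while span["parent"] is not None:
        span = by_service[span["parent"]]
        d += 1
    return d

# The deepest erroring span is the best answer to "WHERE".
deepest = max((s for s in spans if s["status"] == "ERROR"), key=depth)
print(f"error at: {deepest['service']} span ({deepest['duration_ms']}ms)")
```

Note what the span carries and what it doesn't: a status and a duration, but no error message explaining the failure. That gap is what the logs fill.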
log search · severity >= WARN · 14:30 – 14:40 UTC
Recent error logs (multiple services)
14:36:58 gateway WARN upstream timeout target=api
14:37:01 payment WARN retrying charge attempt=2
14:37:04 inventory WARN stock check slow latency=1200ms
14:37:09 api ERROR request failed path=/orders
14:37:11 orders ERROR db write failed table=orders
14:37:14 gateway ERROR 502 bad gateway route=/checkout
Errors and warnings are scattered across five services. Without knowing the request path, it is hard to tell which entry is the cause and which is a symptom.
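Grouping those lines by service makes the scatter concrete. A sketch that parses the log window above (simple whitespace splitting, since these example lines share one shape): it shows WHO logged what, but nothing links the entries into a single request path. A shared trace_id field on each line would provide that link.

```python
from collections import defaultdict

# The raw log lines from the 14:30-14:40 window, as shown above.
raw = """\
14:36:58 gateway WARN upstream timeout target=api
14:37:01 payment WARN retrying charge attempt=2
14:37:04 inventory WARN stock check slow latency=1200ms
14:37:09 api ERROR request failed path=/orders
14:37:11 orders ERROR db write failed table=orders
14:37:14 gateway ERROR 502 bad gateway route=/checkout
"""

by_service = defaultdict(list)
for line in raw.splitlines():
    ts, service, severity, message = line.split(" ", 3)
    by_service[service].append((ts, severity, message))

# Five services logged in the window; nothing here says which
# entries belong to the same request.
for service, entries in sorted(by_service.items()):
    print(service, [sev for _, sev, _ in entries])
```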
One signal gave you a piece. Not the whole answer. What do you check next?

Choose your second signal.

You need at least two signals to narrow down the problem.
Two signals, two perspectives. One showed the shape, the other added context. The last signal will complete the picture.

Check the final signal.

You know what is left.
You have seen all three signals. Errors appear in gateway, api, and orders. Logs mention payment and inventory too. But which service is the actual root cause?
Click the service where the failure originates.
Think about what the trace showed you.
gateway api orders database payment inventory
Root cause identified

The database connection pool is exhausted. Every service above it in the call chain (orders, api, gateway) fails as a consequence. The log warnings from payment and inventory are symptoms of the same bottleneck.

pool=orders · max_connections=10 · active=10 · waiting=34
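The exhaustion condition in those stats is mechanical: every connection is busy and requests are queuing behind them. A minimal check over the pool stats shown above (the dictionary shape is illustrative, not a specific database driver's API):

```python
# Connection-pool stats as reported in the root-cause panel above.
pool = {"name": "orders", "max_connections": 10, "active": 10, "waiting": 34}

# Exhaustion: all connections in use AND callers queued waiting for one.
exhausted = pool["active"] >= pool["max_connections"] and pool["waiting"] > 0
print(f"pool={pool['name']} exhausted={exhausted} queued={pool['waiting']}")
```

This is the kind of condition worth alerting on directly: a saturated pool with a growing wait queue predicts the cascade of timeouts the other signals only report after the fact.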
Your investigation path
Remember this?
[diagram: the full system (gateway, api, orders, payment, inventory, database) overlaid with the three signals: metrics, traces, logs]
You have walked the path of 19 koans.
From a silent system to one that speaks in metrics, traces, and logs.
You began with silence.
Now you can see.
← Return to the beginning