
Correlation

One signal sees the shadow. Another finds the shape. A third reveals the cause.
Apart, they are clues. Together, they are the answer.
Alert triggered (metric)
request_duration p99: 2.1s
threshold exceeded
service: orders
window: 10:40 – 10:45 UTC
threshold: 500ms
current: 2100ms, 4.2x over limit
Something is slow. The metric tells you THAT something is wrong, but not WHY.
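
The arithmetic behind the alert can be sketched in a few lines. This is only an illustration, not the alerting system's actual rule: the function name `over_limit` is hypothetical, and the two values are the ones shown in the alert above.

```python
def over_limit(current_ms: float, threshold_ms: float) -> float:
    """Return how many times the current value exceeds the threshold."""
    return current_ms / threshold_ms


threshold_ms = 500   # alert threshold from the alert above
current_ms = 2100    # observed p99 of request_duration

ratio = over_limit(current_ms, threshold_ms)
if ratio > 1:
    print(f"threshold exceeded: {current_ms}ms, {ratio:.1f}x over limit")
```

The check reduces a whole window of requests to one number, which is why it can say that something is slow but cannot point at any single request.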

Which signal would help you find the specific slow request?

Correlated trace (same time window, same service): orders, 10:42:01
gateway: 2100ms
api: 2050ms
orders: 2020ms
database: 1980ms
trace_id: a3f8b2c1d4e5 · service: orders · 10:42:01 UTC
The trace shows WHERE the time was spent, almost all of it in the database call. But WHY is the database slow?
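
One way to read "where the time was spent" off the trace is to subtract each span's child from its own duration. The sketch below assumes the four spans form a single nested chain (gateway calls api, api calls orders, orders calls database), which the page implies but does not state. The helper name `self_times` is hypothetical.

```python
# Span durations in milliseconds, taken from the trace above,
# ordered outermost to innermost.
# Assumption: each span is the only child of the span before it.
spans = [("gateway", 2100), ("api", 2050), ("orders", 2020), ("database", 1980)]


def self_times(chain):
    """Return (name, self_ms) pairs: each span's duration minus its child's."""
    result = []
    for i, (name, total_ms) in enumerate(chain):
        child_ms = chain[i + 1][1] if i + 1 < len(chain) else 0
        result.append((name, total_ms - child_ms))
    return result


own = self_times(spans)
slowest = max(own, key=lambda pair: pair[1])
share = slowest[1] / spans[0][1]  # fraction of the whole request

print(own)
print(f"{slowest[0]} accounts for {share:.0%} of the request")
```

Under that assumption the gateway, api, and orders spans each add only tens of milliseconds of their own, while the database span holds 1980ms of the 2100ms total.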

Which signal would tell you what happened inside the database?

Correlated log (same trace_id): database, 10:42:01.980
10:42:01.980 ERROR lock timeout waiting for table 'orders'
concurrent_locks=47 wait_time=1980ms table=orders
trace_id: a3f8b2c1d4e5 · severity: ERROR
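
Jumping from the trace to this log line amounts to a lookup on the shared trace_id. The sketch below is a minimal in-memory version and not any particular logging backend's query API. The `LogRecord` type and `logs_for_trace` helper are hypothetical, and the second record is invented purely to show that unrelated lines are filtered out.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    timestamp: str
    severity: str
    service: str
    trace_id: str
    message: str


records = [
    # The log line shown above.
    LogRecord("10:42:01.980", "ERROR", "database", "a3f8b2c1d4e5",
              "lock timeout waiting for table 'orders'"),
    # Hypothetical unrelated record, included only for illustration.
    LogRecord("10:42:02.100", "INFO", "database", "000000000000",
              "checkpoint complete"),
]


def logs_for_trace(all_records, trace_id, severity="ERROR"):
    """Return records that share the trace_id and have the given severity."""
    return [r for r in all_records
            if r.trace_id == trace_id and r.severity == severity]


matches = logs_for_trace(records, "a3f8b2c1d4e5")
for r in matches:
    print(r.timestamp, r.severity, r.message)
```

The lookup only works if every service writes the trace_id into its log records, which is the practical requirement behind correlating traces with logs.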

Metric: tells you THAT. p99 = 2.1s, 10:40 – 10:45
  linked by time window
Trace: shows you WHERE. database: 1980ms, trace_id: a3f8b2
  linked by trace_id
Log: tells you WHY. lock timeout, 47 locks, trace_id: a3f8b2

Metrics tell you something is wrong. Traces show you where. Logs tell you why.

Correlation is the thread that connects them.
