Service Mesh Observability

A guide to observability patterns for Istio, Linkerd, and other service mesh deployments.

When to Use This Skill

Setting up distributed tracing across services
Implementing service mesh metrics and dashboards
Debugging latency and error issues
Defining SLOs for service communication
Visualizing service dependencies
Troubleshooting mesh connectivity

Core Concepts

1. Three Pillars of Observability

┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘

2. Golden Signals for Mesh

Signal	Description	Alert Threshold
Latency	Request duration at P50 and P99	P99 > 500 ms
Traffic	Requests per second	Anomaly detection
Errors	Share of responses with a 5xx status	> 1% of requests
Saturation	Resource utilization	> 80%
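
As a rough illustration of how the thresholds in this table could be evaluated, the sketch below checks one window of request data against the latency, error, and saturation limits. The WindowStats type and the breached_signals function are hypothetical names introduced here, not part of Istio, Linkerd, or any metrics backend, and the traffic signal is left out because anomaly detection needs a historical baseline. In practice these checks would more likely live in alerting rules in the metrics system than in application code.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List


@dataclass
class WindowStats:
    """Hypothetical sample of one evaluation window for a single service."""
    latencies_ms: List[float]  # one entry per request
    status_codes: List[int]    # one entry per request
    utilization: float         # 0.0 to 1.0, e.g. CPU or connection pool usage


def percentile(values: List[float], pct: int) -> float:
    """Return the pct-th percentile (1 to 99) with inclusive interpolation."""
    if len(values) < 2:
        return values[0] if values else 0.0
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def breached_signals(stats: WindowStats) -> List[str]:
    """List the golden signals whose table threshold is exceeded."""
    breaches = []
    # Latency: P99 > 500 ms
    if percentile(stats.latencies_ms, 99) > 500:
        breaches.append("latency")
    # Errors: 5xx rate > 1% of requests
    total = len(stats.status_codes)
    errors = sum(1 for code in stats.status_codes if 500 <= code <= 599)
    if total and errors / total > 0.01:
        breaches.append("errors")
    # Saturation: utilization > 80%
    if stats.utilization > 0.80:
        breaches.append("saturation")
    return breaches
```

Evaluating P99 rather than the mean keeps a small number of very slow requests visible, which is usually what a latency alert is meant to catch.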

Templates and detailed worked examples

The full template library and detailed worked examples live in references/details.md. Read that file when you need concrete templates.

Best Practices

Do's

Sample appropriately - 100% in dev, 1-10% in prod
Use trace context - Propagate trace headers on every outbound request
Set up alerts - Cover each of the golden signals
Correlate metrics and traces - Use exemplars to link metric samples to trace IDs
Retain strategically - Hot/cold storage tiers
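
The sampling and propagation items above can be sketched in a few lines. This is a minimal illustration under stated assumptions: SAMPLE_RATES, should_sample, and outgoing_trace_headers are hypothetical names, the 5% production rate is just one value inside the 1-10% range, and the header list combines the W3C Trace Context and B3 headers commonly forwarded in mesh deployments. Check your mesh's documentation for the exact set it expects.

```python
from typing import Dict

# Assumed per-environment sampling rates (hypothetical values).
SAMPLE_RATES = {"dev": 1.0, "prod": 0.05}

# Assumed set of headers an application forwards so spans join one trace.
PROPAGATED_HEADERS = (
    "x-request-id",
    "traceparent",
    "tracestate",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def should_sample(trace_id: str, environment: str) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    rate = SAMPLE_RATES.get(environment, 0.01)
    # Map the low 56 bits of the hex trace ID onto [0, 1).
    bucket = int(trace_id[-14:], 16) / float(1 << 56)
    return bucket < rate


def outgoing_trace_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Copy only the trace-related headers from an inbound request."""
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in PROPAGATED_HEADERS if name in lowered}
```

Deriving the decision from the trace ID, rather than calling a random number generator in each service, means every hop agrees on whether a trace is kept, so sampled traces are not missing spans.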

Don'ts

Don't over-sample - Storage costs add up
Don't ignore cardinality - Limit label values
Don't skip dashboards - Visualize dependencies
Don't forget costs - Monitor observability costs
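
One way to act on the cardinality advice is to cap the number of distinct values a label may take before a metric is recorded. The sketch below is an assumption-laden illustration: LabelLimiter is a hypothetical helper, not an API from any metrics library, and the default cap of 100 values and the "other" bucket name are arbitrary choices.

```python
from typing import Dict, Set


class LabelLimiter:
    """Hypothetical helper that caps distinct values per metric label."""

    def __init__(self, max_values: int = 100, overflow: str = "other") -> None:
        self.max_values = max_values
        self.overflow = overflow
        self._seen: Dict[str, Set[str]] = {}

    def limit(self, label: str, value: str) -> str:
        """Return value if the label is under its cap, else the overflow bucket."""
        seen = self._seen.setdefault(label, set())
        if value in seen:
            return value
        if len(seen) < self.max_values:
            seen.add(value)
            return value
        # Cap reached: fold any new value into one shared bucket.
        return self.overflow


limiter = LabelLimiter(max_values=2)
labels = [limiter.limit("path", p) for p in ("/cart", "/checkout", "/user/123", "/cart")]
```

Values seen before the cap keep their own series, while later ones share a single bucket, so unbounded inputs such as user IDs embedded in paths cannot grow the series count without limit.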