Skip to main content
AI/MLwshobson

service-mesh-observability

Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.

Stars
36,167
Source
wshobson/agents
Updated
2026-05-29
Slug
wshobson--agents--service-mesh-observability
View on GitHubRaw SKILL.md

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/wshobson/agents/HEAD/plugins/cloud-infrastructure/skills/service-mesh-observability/SKILL.md -o .claude/skills/service-mesh-observability.md

Drops the SKILL.md into .claude/skills/service-mesh-observability.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.

When to Use This Skill

  • Setting up distributed tracing across services
  • Implementing service mesh metrics and dashboards
  • Debugging latency and error issues
  • Defining SLOs for service communication
  • Visualizing service dependencies
  • Troubleshooting mesh connectivity

Core Concepts

1. Three Pillars of Observability

┌─────────────────────────────────────────────────────┐
│                  Observability                       │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘

2. Golden Signals for Mesh

Signal Description Alert Threshold
Latency Request duration P50, P99 P99 > 500ms
Traffic Requests per second Anomaly detection
Errors 5xx error rate > 1%
Saturation Resource utilization > 80%

Templates and detailed worked examples

Full template library and detailed worked examples live in references/details.md. Read that file when you need the concrete templates.

Best Practices

Do's

  • Sample appropriately - 100% in dev, 1-10% in prod
  • Use trace context - Propagate headers consistently
  • Set up alerts - For golden signals
  • Correlate metrics/traces - Use exemplars
  • Retain strategically - Hot/cold storage tiers

Don'ts

  • Don't over-sample - Storage costs add up
  • Don't ignore cardinality - Limit label values
  • Don't skip dashboards - Visualize dependencies
  • Don't forget costs - Monitor observability costs