Scaling OpenTelemetry: Skyscanner's Collector Strategy Across 24 Clusters
Skyscanner manages telemetry across 24 production clusters. The challenge is handling a large volume of data while keeping the path from each service to a collector simple. Skyscanner's architecture addresses this with a central DNS endpoint and routing via Istio, so services send telemetry to a single address regardless of their region or cluster. Per-service configuration stays uniform, and routing decisions live in the mesh rather than in each application.
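As a minimal sketch of what "a single address" can look like from a service's side, the Kubernetes Deployment fragment below sets the standard OpenTelemetry SDK environment variables. The hostname `otel.example.internal`, the service name, and the image are hypothetical placeholders, not Skyscanner's actual values, and the Istio routing behind the endpoint is not shown.

```yaml
# Hypothetical Deployment fragment: every service points at the same OTLP address.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: app
          image: example/app:latest   # placeholder image
          env:
            - name: OTEL_SERVICE_NAME
              value: example-service
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: grpc
            # Assumed central endpoint; the real DNS name is not given in the source.
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel.example.internal:4317
```

Because the endpoint is identical in every cluster, the same manifest can be deployed anywhere without cluster-specific overrides.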
The architecture employs two distinct collector patterns: the Gateway Collector, which handles bulk OTLP traffic, and the Agent Collector, which scrapes Prometheus endpoints from services that don't support OTLP natively. The Gateway Collector processes the majority of telemetry, while the Agent Collector covers the services that would otherwise be missed. Metrics are renamed to semantic convention names, such as http.client.duration and http.server.duration, and aggregated by cluster, service name, and HTTP status code. Two configuration parameters matter most for performance and resource usage: dimensions_cache_size, set to 15,000,000 in the example configuration, and metrics_flush_interval, set to 30s. These are Skyscanner's configured values, not the connector's defaults.
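The Agent Collector pattern can be sketched roughly as follows, using the Collector's `prometheus` receiver to scrape pods and forward the results over OTLP. The job name, the `prometheus.io/scrape` annotation filter, and the gateway address `otel.example.internal:4317` are assumptions for illustration; the source does not show Skyscanner's actual agent configuration.

```yaml
# Hypothetical Agent Collector: scrape Prometheus endpoints, forward as OTLP.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: annotated-pods          # placeholder job name
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only pods annotated with prometheus.io/scrape: "true".
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"

exporters:
  otlp:
    endpoint: otel.example.internal:4317    # assumed central endpoint

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```

TLS settings are omitted here and would need to match however the central endpoint is actually exposed.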
In production, the details of the connectors configuration matter. The spanmetrics connector derives metrics from spans, and its dimensions list controls which span attributes become metric labels. Each added dimension multiplies the number of possible label combinations, so the dimension list, cache size, and flush interval need to be tuned together to avoid memory and throughput bottlenecks. Skyscanner's approach shows how to scale observability in a complex environment, but it demands a solid understanding of both the tools and the underlying infrastructure.
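A connector only takes effect once it is wired into pipelines: it acts as an exporter in one pipeline and a receiver in another. The sketch below shows one plausible wiring for a Gateway Collector, with `spanmetrics` bridging the traces and metrics pipelines. The backend address `https://backend.example.internal` and the choice of the `otlphttp` exporter are assumptions, not details from Skyscanner's setup.

```yaml
# Hypothetical Gateway Collector wiring around the spanmetrics connector.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  spanmetrics:
    metrics_flush_interval: 30s

exporters:
  otlphttp:
    endpoint: https://backend.example.internal   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Spans go to the backend and are also fed into the connector.
      exporters: [spanmetrics, otlphttp]
    metrics:
      # Metrics arrive both directly over OTLP and from the connector.
      receivers: [otlp, spanmetrics]
      exporters: [otlphttp]
```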
Key takeaways
- Use a central DNS endpoint for telemetry so every service sends to one address.
- Use Gateway Collectors for bulk OTLP traffic and Agent Collectors for Prometheus scraping.
- Size `dimensions_cache_size` to your dimension cardinality; Skyscanner's example sets it to 15,000,000.
- Tune `metrics_flush_interval` to balance metric freshness against export volume; the example uses 30s.
- Rename metrics to semantic convention names for consistent aggregation and analysis.
Why it matters
Effective observability is crucial for maintaining system health and performance. Skyscanner's strategy allows for seamless telemetry management across multiple clusters, enabling faster troubleshooting and improved service reliability.
Code examples
```yaml
connectors:
  spanmetrics:
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    # Span attributes promoted to metric labels.
    dimensions:
      - name: http.status_code
      - name: grpc.status_code
      - name: rpc.service
      - name: rpc.method
      - name: prot
      - name: flag
      - name: k8s.deployment.name
      - name: k8s.replicaset.name
      - name: destination_subset
    dimensions_cache_size: 15000000
    histogram:
      exponential:
        max_size: 160
      unit: ms
    metrics_flush_interval: 30s
```
When NOT to use this
The official docs don't call out specific anti-patterns here. Use your judgment based on your scale and requirements.