Scaling OpenTelemetry: Skyscanner's Collector Strategy Across 24 Clusters
Skyscanner manages telemetry across 24 production clusters. The challenge is handling large volumes of data while keeping the sending side simple for every service. Its architecture addresses this with a central DNS endpoint and routing via Istio, so services send telemetry to a single address regardless of their region or cluster. This keeps service-side configuration uniform and makes data collection more reliable.
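From a service's point of view, the single-endpoint design might look like the following Deployment fragment. This is a minimal sketch, not Skyscanner's actual manifest: the hostname `otel.example.internal`, the service name, and the image are hypothetical, and only the `OTEL_*` variable names are standard OpenTelemetry SDK settings.

```yaml
# Hypothetical Deployment: every service in every cluster uses the same endpoint.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service            # hypothetical service name
spec:
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: app
          image: example/app:latest   # hypothetical image
          env:
            # One address for all clusters; the mesh decides where traffic lands.
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel.example.internal:4317"   # hypothetical hostname
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: "grpc"
            - name: OTEL_SERVICE_NAME
              value: "example-service"
```

Because the address never changes, moving a service between clusters would not require touching its telemetry settings under this assumption.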
The architecture employs two distinct collector patterns: the Gateway Collector, which handles bulk OTLP traffic, and the Agent Collector, which scrapes Prometheus endpoints from services that don't support OTLP natively. The Gateway Collector processes the majority of telemetry, while the Agent Collector ensures that all services are covered. Metrics are transformed into semantic convention names, such as `http.client.duration` and `http.server.duration`, and aggregated by cluster, service name, and HTTP status code. Two parameters matter most for performance and resource usage: `dimensions_cache_size`, set to 15,000,000 in Skyscanner's configuration, and `metrics_flush_interval`, set to 30s. These are tuned values for this deployment, not the connector's defaults.
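The Agent Collector pattern can be sketched as a Collector that scrapes Prometheus endpoints and forwards the result over OTLP. The source does not show this configuration, so everything below is an assumption for illustration: the job name, the pod annotation used for filtering, the gateway hostname, and the plaintext `tls` setting (which presumes the mesh handles encryption).

```yaml
# Hypothetical Agent Collector: scrape Prometheus endpoints, forward as OTLP.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: non-otlp-services        # hypothetical job name
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only pods annotated prometheus.io/scrape: "true" (assumed convention).
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"

exporters:
  otlp:
    endpoint: otel.example.internal:4317     # hypothetical gateway address
    tls:
      insecure: true                         # assumption: mTLS is handled by the mesh sidecar

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```

Services that already speak OTLP would skip this path and send directly to the gateway.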
The `connectors` configuration controls which span attributes become metric dimensions and how the resulting metrics are aggregated. Each added dimension multiplies the number of possible metric series, so the dimension list and the cache size need tuning against collector memory and backend cardinality limits. Skyscanner's approach shows that observability can scale in a complex environment, but it demands a solid understanding of both the tools and the underlying infrastructure.
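A connector only takes effect once it is wired into pipelines: it is listed as an exporter of the traces pipeline and as a receiver of the metrics pipeline. The sketch below shows that wiring under stated assumptions. The backend hostnames and the `otlp/traces` and `otlp/metrics` exporter names are hypothetical, and the source does not say how Skyscanner's own pipelines are laid out.

```yaml
# Hypothetical pipeline wiring for the spanmetrics connector.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  spanmetrics:
    metrics_flush_interval: 30s

exporters:
  otlp/traces:
    endpoint: traces-backend.example.internal:4317    # hypothetical backend
  otlp/metrics:
    endpoint: metrics-backend.example.internal:4317   # hypothetical backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Spans go to the connector and, unchanged, to the trace backend.
      exporters: [spanmetrics, otlp/traces]
    metrics:
      # Metrics derived from spans enter here.
      receivers: [spanmetrics]
      exporters: [otlp/metrics]
```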
Key takeaways
- Use a central DNS endpoint for telemetry so every service sends to the same address.
- Use Gateway Collectors for bulk OTLP traffic and Agent Collectors for Prometheus scraping.
- Size `dimensions_cache_size` for your cardinality; Skyscanner sets it to 15,000,000.
- Tune `metrics_flush_interval` to balance freshness against load; Skyscanner uses 30s.
- Transform metrics to semantic convention names for consistent aggregation and analysis.
Why it matters
Effective observability is crucial for maintaining system health and performance. Skyscanner's strategy allows for seamless telemetry management across multiple clusters, enabling faster troubleshooting and improved service reliability.
Code examples
    connectors:
      spanmetrics:
        # Emit delta (per-interval) values rather than cumulative totals.
        aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
        # Span attributes promoted to metric dimensions.
        dimensions:
          - name: http.status_code
          - name: grpc.status_code
          - name: rpc.service
          - name: rpc.method
          - name: prot
          - name: flag
          - name: k8s.deployment.name
          - name: k8s.replicaset.name
          - name: destination_subset
        dimensions_cache_size: 15000000
        histogram:
          exponential:
            max_size: 160
          unit: ms
        # How often aggregated metrics are flushed.
        metrics_flush_interval: 30s

When NOT to use this
The official docs don't call out specific anti-patterns here. Use your judgment based on your scale and requirements.