Scaling OpenTelemetry: Skyscanner's Collector Strategy Across 24 Clusters

6 min read · OpenTelemetry Blog · Apr 21, 2026

Practitioner · Hands-on experience recommended

Skyscanner has developed a robust strategy for managing telemetry across 24 production clusters. The challenge lies in handling vast amounts of data while keeping the path from service to backend simple. Skyscanner's architecture addresses this with a central DNS endpoint and intelligent routing via Istio: services send telemetry to a single address regardless of their region or cluster, which simplifies client configuration and makes data collection more reliable.
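
To make the single-endpoint pattern concrete, here is a minimal sketch of how Istio could route one telemetry hostname to an in-cluster gateway collector. The hostname, namespace, and service names are illustrative assumptions, not details from the article.

YAML

# Hypothetical Istio VirtualService: every service points its OTLP
# exporter at one DNS name (otel.internal.example, an assumed name),
# and Istio routes that traffic to the local gateway collector.
# (A ServiceEntry for the hostname may also be required; omitted here.)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: otel-gateway-routing
  namespace: observability
spec:
  hosts:
    - otel.internal.example        # assumed central telemetry hostname
  http:
    - route:
        - destination:
            host: otel-gateway-collector.observability.svc.cluster.local
            port:
              number: 4317         # OTLP gRPC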

The architecture employs two distinct collector patterns: a Gateway Collector, which handles bulk OTLP traffic, and an Agent Collector, which scrapes Prometheus endpoints from services that don't support OTLP natively. The Gateway Collector processes the majority of telemetry, while the Agent Collector ensures the remaining services are still covered; a sketch of both roles follows below. Metrics are renamed to semantic convention names, such as http.client.duration and http.server.duration, and aggregated by cluster, service name, and HTTP status code. Configuration parameters like dimensions_cache_size (set to 15,000,000) and metrics_flush_interval (set to 30s) are central to balancing performance and resource usage.
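
A minimal sketch of the two receiver configurations, assuming standard OpenTelemetry Collector components; endpoints, job names, and discovery settings are illustrative, not Skyscanner's actual values.

YAML

# Gateway collector: terminates bulk OTLP traffic from instrumented services.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
---
# Agent collector: scrapes Prometheus endpoints for services without
# native OTLP support (job name and discovery role are hypothetical).
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: legacy-services
          kubernetes_sd_configs:
            - role: pod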

In production, understanding how to configure these collectors is vital. The connectors configuration is where aggregation and dimension management are defined, and it is where most of the tuning effort lands: each additional dimension multiplies metric cardinality, and the cache and flush settings determine how much memory that costs. Skyscanner's approach shows how to scale observability in a complex environment, but it demands careful tuning to avoid performance bottlenecks and a solid understanding of both the tools and the underlying infrastructure.
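
The article notes that metrics are renamed to semantic convention names but does not show the mechanism. One common way to do this in the collector is the transform processor; the sketch below assumes a hypothetical source metric name and is not necessarily how Skyscanner performs the renaming.

YAML

# Rename a scraped metric to its semantic-convention name using OTTL.
# The source name "http_server_duration" is a placeholder.
processors:
  transform:
    metric_statements:
      - context: metric
        statements:
          - set(name, "http.server.duration") where name == "http_server_duration"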

Key takeaways

  • Leverage a central DNS endpoint for telemetry to simplify service communication.
  • Utilize Gateway Collectors for bulk OTLP traffic and Agent Collectors for Prometheus scraping.
  • Size `dimensions_cache_size` to match your dimension cardinality; this setup uses 15,000,000.
  • Tune `metrics_flush_interval` (30s here) to balance metric freshness against export load.
  • Transform metrics using semantic convention names for better aggregation and analysis.

Why it matters

Effective observability is crucial for maintaining system health and performance. Skyscanner's strategy allows for seamless telemetry management across multiple clusters, enabling faster troubleshooting and improved service reliability.

Code examples

YAML
connectors:
  spanmetrics:
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    dimensions:
      - name: http.status_code
      - name: grpc.status_code
      - name: rpc.service
      - name: rpc.method
      - name: prot
      - name: flag
      - name: k8s.deployment.name
      - name: k8s.replicaset.name
      - name: destination_subset
    dimensions_cache_size: 15000000
    histogram:
      exponential:
        max_size: 160
      unit: ms
    metrics_flush_interval: 30s
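
One detail the snippet above does not show: a connector only takes effect once it is wired into service.pipelines, acting as an exporter in one pipeline and a receiver in another. A minimal sketch with illustrative receiver and exporter names:

YAML

service:
  pipelines:
    traces:
      receivers: [otlp]           # spans in
      exporters: [spanmetrics]    # connector consumes spans here
    metrics:
      receivers: [spanmetrics]    # connector emits aggregated metrics here
      exporters: [otlphttp]       # hypothetical metrics backend exporter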

When NOT to use this

The official docs don't call out specific anti-patterns here. Use your judgment based on your scale and requirements.
