Kubernetes

Observability

20 articles from official documentation

Practitioner·20 articles

kubernetes·observability·Practitioner

Flipkart's Chaos Engineering Revolution: Insights from KubeCon + CloudNativeCon India 2026

Chaos engineering is not just a buzzword; it's a necessity for resilient systems. Flipkart's Central Reliability Engineering team showcased their innovative use of LitmusChaos, including a DaemonSet-based model for chaos injection. Dive into how they tackled real-world challenges with this approach.

→Implement a hybrid multi-tenancy architecture for chaos engineering to support diverse workloads.
→Utilize a DaemonSet-based model for high-availability chaos injection across your Kubernetes cluster.

5 min read·CNCF Blog

kubernetes·observability·Practitioner

Building a Custom Metrics Exporter for Kubernetes: A Practical Guide

Custom metrics exporters are essential for monitoring application states in Kubernetes. By exposing metrics through a simple HTTP server, you can gain insights into your application's performance. Learn how to implement this with concrete examples and avoid common pitfalls.

→Expose application metrics through a /metrics endpoint for Prometheus scraping.
→Use counters for totals, gauges for current values, and histograms for distributions.

5 min read·Kubernetes Blog
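
The takeaways of the exporter article above can be sketched with the Python standard library alone. This is a minimal illustration and not the article's implementation: the metric names (`app_requests_total`, `app_in_flight_requests`, `app_request_duration_seconds`), the bucket bounds, and the port are all hypothetical. It shows a counter, a gauge, and a histogram rendered in the Prometheus text exposition format and served at `/metrics`.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical histogram bucket upper bounds (seconds), for illustration only.
BUCKETS = [0.1, 0.5, 1.0]


class Metrics:
    def __init__(self):
        self.requests_total = 0                   # counter: only ever increases
        self.in_flight = 0                        # gauge: current value, may go up or down
        self.bucket_counts = [0] * len(BUCKETS)   # histogram: cumulative bucket counts
        self.duration_count = 0
        self.duration_sum = 0.0

    def observe(self, seconds):
        """Record one finished request and its duration."""
        self.requests_total += 1
        self.duration_count += 1
        self.duration_sum += seconds
        # Buckets are cumulative: a value counts in every bucket whose bound it fits under.
        for i, bound in enumerate(BUCKETS):
            if seconds <= bound:
                self.bucket_counts[i] += 1

    def render(self):
        """Return all metrics in the Prometheus text exposition format."""
        lines = [
            "# TYPE app_requests_total counter",
            f"app_requests_total {self.requests_total}",
            "# TYPE app_in_flight_requests gauge",
            f"app_in_flight_requests {self.in_flight}",
            "# TYPE app_request_duration_seconds histogram",
        ]
        for bound, count in zip(BUCKETS, self.bucket_counts):
            lines.append(f'app_request_duration_seconds_bucket{{le="{bound}"}} {count}')
        lines.append(f'app_request_duration_seconds_bucket{{le="+Inf"}} {self.duration_count}')
        lines.append(f"app_request_duration_seconds_sum {self.duration_sum}")
        lines.append(f"app_request_duration_seconds_count {self.duration_count}")
        return "\n".join(lines) + "\n"


METRICS = Metrics()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given (hypothetical) port."""
    HTTPServer(("", port), Handler).serve_forever()
```

Calling `serve()` starts the endpoint. In practice an established Prometheus client library is usually a better choice than hand-rendering the format; the sketch only makes the three metric types concrete.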

kubernetes·observability·Practitioner

Diagnosing Kubernetes Control Plane Performance with AWS DevOps Agent

Kubernetes control plane performance can make or break your cluster's stability. The AWS DevOps Agent autonomously identifies issues, correlating CloudWatch logs with throttling patterns to deliver actionable insights. This article dives into how to leverage this tool effectively in production environments.

→Utilize the AWS DevOps Agent to autonomously investigate performance issues in your Kubernetes control plane.
→Configure the agent with the correct EKS Access Entry Type and Access Policy so it has the access it needs to investigate the cluster.

5 min read·AWS Containers Blog

kubernetes·observability·Practitioner

Speed Up Your Volcano Workload Insights with Headlamp

Tired of slow inspections of your Volcano workloads? Headlamp integrates seamlessly with Volcano, allowing you to visualize workload states and queue behaviors in one place. Dive into the specifics of how this integration enhances your Kubernetes experience.

→Utilize Headlamp to visualize Volcano workload states and queue behaviors.
→Access dedicated views for Jobs, Queues, and PodGroups directly in Headlamp.

5 min read·Kubernetes Blog

kubernetes·observability·Practitioner

Building High-Impact Observability Pipelines in Kubernetes

In a world where every metric consumes resources, designing sustainable observability pipelines is crucial. Implementing an observability mesh can connect your metrics, traces, and logs seamlessly, enhancing your monitoring strategy.

→Implement green observability to optimize resource usage in your telemetry.
→Utilize an observability mesh to connect metrics, traces, and logs effectively.

5 min read·CNCF Blog

kubernetes·observability·Practitioner

Flipkart's Chaos Engineering Triumph: Scaling Kubernetes with Confidence

Chaos engineering is essential for building resilient systems, and Flipkart's recent success showcases its power. By executing 90% of chaos experiments in staging, they ensure stability during high-traffic events. Discover how they customized LitmusChaos for their unique needs.

→Execute chaos experiments in staging environments to identify weaknesses before production.
→Leverage LitmusChaos extensions for tailored chaos engineering solutions.

5 min read·CNCF Blog

kubernetes·observability·Practitioner

Dynamic Configuration for Cloud Native Swift Services in Kubernetes

Dynamic configuration is crucial for cloud-native applications, especially in a Kubernetes environment. By leveraging the ConfigReader and ReloadingFileProvider, you can achieve hot reloading of configuration values without restarting your services. This article dives into how to set it up effectively.

→Utilize ConfigReader to manage configuration from multiple providers effectively.
→Implement ReloadingFileProvider for hot reloading of configuration without service restarts.

5 min read·CNCF Blog

kubernetes·observability·Practitioner

Understanding the Kubernetes Integration Tax: Navigating Prometheus and Cilium in Production

Running multiple CNCF projects together in Kubernetes can lead to hidden costs, known as the integration tax. This article covers how Cluster API manages your infrastructure and why generating your monitoring configuration matters.

→Understand the integration tax when running multiple CNCF projects together.
→Utilize Cluster API for managing Kubernetes-native resources effectively.

5 min read·CNCF Blog

kubernetes·observability·Practitioner

Tracing AI Agents: Jaeger's Evolution with OpenTelemetry

Jaeger is evolving to trace AI agents, addressing the complexities of monitoring AI interactions. With the integration of OpenTelemetry, it streamlines data collection through protocols like MCP and ACP, enhancing performance and collaboration.

→Understand the Model Context Protocol (MCP) for secure data access by AI models.
→Utilize the Agent Client Protocol (ACP) for uniform communication with AI agents.

5 min read·CNCF Blog

kubernetes·observability·Practitioner

OpenTelemetry Graduation: The New Standard for Observability in Kubernetes

OpenTelemetry's graduation marks a pivotal moment in the observability landscape. This open-source framework standardizes telemetry data collection, allowing seamless transitions between analysis tools without code rewrites.

→Standardize telemetry data collection with OpenTelemetry to reduce tool fragmentation.
→Utilize a single set of APIs and SDKs to simplify observability across your systems.

5 min read·CNCF Blog

kubernetes·observability·Practitioner

The Silent Evidence Gap in kubectl debug: What You Need to Know

When debugging Kubernetes pods, the kubectl debug command can be a lifesaver. However, it leaves behind a critical gap in evidence that can hinder your troubleshooting efforts. Understanding how ephemeral container statuses work is essential to avoid losing valuable context after a debug session ends.

→Understand that ephemeral containers do not retain termination context after a debug session ends.
→Use the `--target` flag so the debug container shares the target container's process namespace.

5 min read·CNCF Blog

kubernetes·observability·Practitioner

Kubernetes v1.36: Mastering Route Sync Metrics in Cloud Controller Manager

Kubernetes v1.36 introduces a new metric for route synchronization in the Cloud Controller Manager. The alpha counter, `route_controller_route_sync_total`, tracks how often routes sync with your cloud provider, giving you visibility into how much route reconciliation your cluster performs. Dive in to understand how to use this metric.

→Monitor `route_controller_route_sync_total` to track how often routes are synced.
→Utilize the watch-based approach to minimize unnecessary API calls.

4 min read·Kubernetes Blog

kubernetes·observability·Practitioner

Centralized Observability for Multi-Account Amazon EKS: A Practical Guide

Centralized observability is essential for managing multiple Amazon EKS accounts effectively. By leveraging CloudWatch cross-account observability, you can replicate telemetry data seamlessly across your AWS accounts. This article dives into how to set this up for maximum visibility and control.

→Implement cross-account observability to replicate telemetry data into a central monitoring account.
→Utilize IAM role assumption for querying CloudWatch data across accounts and Regions.

5 min read·AWS Containers Blog

kubernetes·observability·Practitioner

Unlocking Efficiency with Kubernetes v1.36: Server-Side Sharded List and Watch

Kubernetes v1.36 introduces server-side sharded list and watch. The API server filters events at the source, so each controller replica receives only its own slice of resources. Dive in to learn how to leverage this for better performance and scalability.

→Enable the ShardedListAndWatch feature gate on your API server to access this functionality.
→Use the shardSelector field in ListOptions to filter events effectively.

5 min read·Kubernetes Blog

kubernetes·observability·Practitioner

Why Are Cloud Native Teams Stuck with Three Observability Stacks?

Despite the availability of powerful tools, many cloud native teams still juggle multiple observability stacks. OpenTelemetry provides a consistent instrumentation layer, yet teams often rely on Prometheus, Jaeger, and Fluentd for metrics, tracing, and logs respectively. This article dives into the reasons behind this fragmentation.

→Understand the role of OpenTelemetry as a consistent instrumentation layer across languages.
→Leverage Prometheus for effective metrics collection in your Kubernetes environment.

5 min read·CNCF Blog

kubernetes·observability·Practitioner

Mastering Observability in Kubernetes: Monitoring, Logging, and Debugging

In a Kubernetes environment, observability is crucial for maintaining application health and performance. Understanding how to effectively monitor, log, and debug can save you hours of troubleshooting. Dive into the key concepts that every Kubernetes operator needs to master.

→Understand debugging for both applications and clusters to quickly resolve issues.
→Set up logging to capture essential data for troubleshooting in Kubernetes.

5 min read·Kubernetes Docs

kubernetes·observability·Practitioner

Mastering Kubernetes Logging Architecture: What You Need to Know

Kubernetes logging architecture is crucial for effective observability in your clusters. Understanding how the kubelet captures and manages logs can save you from headaches down the line. Dive into the specifics of log rotation and storage to enhance your production monitoring.

→Configure `containerLogMaxSize` to control log file sizes effectively.
→Use `kubectl logs` commands to access logs easily from your Pods.

5 min read·Kubernetes Docs

kubernetes·observability·Practitioner

Mastering the Kubernetes Resource Metrics Pipeline

Unlock the power of Kubernetes autoscaling with the Resource Metrics Pipeline. This essential component uses the Metrics API to provide CPU and memory data for Horizontal and Vertical Pod Autoscalers. Learn how to leverage it effectively in your production environment.

→Deploy the metrics-server to access the Metrics API.
→Use the Metrics API for current CPU and memory usage.

5 min read·Kubernetes Docs

kubernetes·observability·Practitioner

Auto-Diagnosing Kubernetes Alerts: Harnessing HolmesGPT and CNCF Tools

Tired of sifting through Kubernetes alerts manually? Discover how HolmesGPT automates the diagnosis process by reading alerts and intelligently selecting the right tools. With its ability to pull logs and analyze metrics, it streamlines troubleshooting like never before.

→Utilize HolmesGPT to automate alert diagnosis and reduce manual troubleshooting time.
→Leverage runbooks to guide HolmesGPT in selecting the right tools and exclusion rules.

5 min read·CNCF Blog

kubernetes·observability·Practitioner

Measuring Developer Tool ROI: The DORA Metrics Approach

Understanding the ROI of your developer tools is crucial for optimizing your engineering processes. By leveraging DORA metrics, you can quantify deployment frequency, lead time for changes, and more. This article dives into how to effectively measure these metrics in your Kubernetes environment.

→Leverage DORA metrics to quantify your engineering effectiveness.
→Instrument your deployment pipeline to capture key events for accurate metric collection.

5 min read·CNCF Blog
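
Two of the DORA metrics named in the article above, deployment frequency and lead time for changes, can be computed from pipeline events roughly as follows. This is a sketch under stated assumptions: the event shape (a commit timestamp paired with a deployment timestamp), the function names, and the use of the median are illustrative choices, not taken from the article.

```python
from datetime import datetime, timedelta
from statistics import median


def deployment_frequency(deploy_times, window_days):
    """Average number of deployments per day over a window of window_days days."""
    return len(deploy_times) / window_days


def median_lead_time(events):
    """Median time from commit to deployment.

    events is a list of (commit_time, deploy_time) pairs, a hypothetical
    shape for whatever your deployment pipeline actually records.
    """
    return median([deploy - commit for commit, deploy in events])


# Hypothetical events captured by an instrumented pipeline.
start = datetime(2026, 1, 1, 12, 0)
events = [
    (start, start + timedelta(hours=2)),
    (start + timedelta(days=1), start + timedelta(days=1, hours=4)),
    (start + timedelta(days=2), start + timedelta(days=2, hours=9)),
]

frequency = deployment_frequency([deploy for _, deploy in events], window_days=3)
lead_time = median_lead_time(events)
```

The median is used here because a single slow change would skew a mean; which aggregate to report is a design choice for your team.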
