OpsCanary
kubernetesobservabilityPractitioner

Flipkart's Chaos Engineering Triumph: Scaling Kubernetes with Confidence

5 min read CNCF BlogJun 18, 2026Reviewed for accuracy
Share
PractitionerHands-on experience recommended

In today's fast-paced digital landscape, ensuring system resilience is critical, especially during high-traffic events like festive sales. Chaos engineering provides a way to test and validate the robustness of your systems under stress. Flipkart's recent win in the CNCF End User Case Study Contest highlights how they effectively leveraged chaos engineering to enhance their Kubernetes-native architecture, allowing them to withstand turbulent conditions in production.

Flipkart's chaos engineering platform executes approximately 90% of chaos experiments in staging environments before major sales events. This proactive approach helps identify potential weaknesses and mitigate risks. To tailor their chaos engineering efforts, the team developed four custom extensions to LitmusChaos: a hybrid multi-tenant architecture that optimizes resource allocation, a DaemonSet-based high-availability model for parallel injection of faults, a Script Runner fault for dynamic target selection, and an internal hybrid extension to support legacy virtual machine workloads. This level of customization allows Flipkart to effectively simulate real-world scenarios and ensure their systems are battle-ready.

In production, it's crucial to understand the implications of chaos engineering. Flipkart's approach emphasizes the importance of running experiments in staging to avoid disruptions in production environments. This strategy not only builds confidence in their systems but also prepares them for unexpected challenges during peak traffic. While chaos engineering can be a powerful tool, it requires careful planning and execution to avoid unintended consequences. Always consider the specific needs of your architecture and the potential impact of your experiments.

Key takeaways

  • Execute chaos experiments in staging environments to identify weaknesses before production.
  • Leverage LitmusChaos extensions for tailored chaos engineering solutions.
  • Implement a DaemonSet-based model for high availability during fault injection.
  • Utilize dynamic target selection with Script Runner for more effective chaos tests.
  • Support legacy workloads with hybrid extensions to ensure comprehensive testing.

Why it matters

In production, chaos engineering can significantly reduce downtime and improve system reliability, especially during critical sales periods. Flipkart's approach demonstrates how proactive testing leads to a more resilient infrastructure.

When NOT to use this

The official docs don't call out specific anti-patterns here. Use your judgment based on your scale and requirements.

Want the complete reference?

Read official docs

Test what you just learned

Quiz questions written from this article

Take the quiz →
Better StackSponsor

Unified observability — logs, uptime monitoring, and on-call in one place. Used by 50,000+ engineering teams to ship faster and sleep better.

Try Better Stack free →

Get the daily digest

One email. 5 articles. Every morning.

No spam. Unsubscribe anytime.