Why Kubernetes?
The short answer: Kubernetes checks all the boxes for managing our microservices at Kakbima. It solves several critical challenges that would otherwise take significant time and effort to handle manually. The obvious benefits are its scalability, efficient resource bin-packing, and self-healing capabilities. These features alone save us from managing individual instances or services by hand.
Beyond these, one of the main reasons Kubernetes works so well for us is that it simplifies and streamlines our deployment processes, particularly rollouts and rollbacks. This is a crucial part of our infrastructure. We’ve built some complex systems around it, and we’ll dive into those details in a future post.
How are we using Kubernetes?
Our production infrastructure spans four availability zones (AZs) and is split across four separate Kubernetes clusters. Kubernetes now supports managing this kind of topology within a single cluster, but we haven’t adopted that newer functionality because we haven’t found it necessary for our use case.
Through experience, we’ve discovered significant benefits to managing our workloads across four distinct clusters. The list of advantages continues to grow, but here are a few of the key reasons we stick with this architecture:
- Traffic shifting across availability zones
Thanks to some in-house tooling, we have the flexibility to shift traffic between AZs as needed. This feature has proven invaluable during situations where an issue in one zone, whether caused by the cloud provider or some other dependency, could otherwise cause downtime or service degradation. With this setup, we can mitigate the impact on our customers and maintain high availability.
- Gradual rollouts of infrastructure changes
When introducing changes to our infrastructure, whether it’s a new Kubernetes add-on or a configuration update, we can safely test those changes in one cluster while continuing to serve traffic through the remaining three clusters. This allows us to validate changes with minimal risk. Additionally, if staging clusters are available, we can often validate changes there before ever reaching production.
- Istio for service mesh management
We’ve chosen Istio as our service mesh. It helps us manage ingress and egress traffic, alongside a suite of in-house controllers that configure and reconcile the traffic flows from our CDN to each of our four clusters. This is a critical part of our architecture, but it deserves its own dedicated post, so we’ll leave it there for now.
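To make the traffic-shifting item above more concrete, here is a minimal sketch of the underlying arithmetic: drain one zone and spread its share over the remaining zones in proportion to their current weights. This is an illustration only, not the in-house tooling itself; the zone names, the weights, and the `shift_traffic` function are hypothetical.

```python
def shift_traffic(weights, drained_zone):
    """Return new weights with `drained_zone` at 0 and its share spread
    over the other zones in proportion to their current weights."""
    if drained_zone not in weights:
        raise KeyError(f"unknown zone: {drained_zone}")

    # Zones that can still receive traffic.
    remaining = {
        zone: weight
        for zone, weight in weights.items()
        if zone != drained_zone and weight > 0
    }
    total_remaining = sum(remaining.values())
    if total_remaining == 0:
        raise ValueError("no other zone is available to receive traffic")

    total = sum(weights.values())
    shifted = {zone: 0.0 for zone in weights}
    for zone, weight in remaining.items():
        # Keep the overall total constant while preserving relative shares.
        shifted[zone] = total * weight / total_remaining
    return shifted


# Hypothetical starting point: traffic split evenly across four zones.
current = {"az-1": 25.0, "az-2": 25.0, "az-3": 25.0, "az-4": 25.0}
after_drain = shift_traffic(current, "az-2")
print(after_drain)
```

In a real setup the resulting weights would be pushed to whatever layer routes traffic between zones; that part is deliberately left out here.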
Configuration & Management
When it comes to managing our Kubernetes configurations, we lean heavily on Terraform alongside some of our own custom tooling. When we first started working with Kubernetes, there weren’t many tools to help us manage Kubernetes configurations at scale, so we built an in-house app to help us template, render, and apply configurations across all of our clusters. This tool continues to be a cornerstone of our infrastructure management.
Having a single, unified tool to manage configuration templates, as well as to apply and test changes, has been invaluable. It ensures that we maintain a “source of truth” for our Kubernetes configurations, and allows us to rigorously test and validate changes before they go live. This process is particularly crucial given how rapidly the Kubernetes ecosystem is evolving.
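The template-and-render idea can be sketched in a few lines. The example below is an assumption-laden illustration, not our actual tool: the cluster names, parameters, and the `render_all` helper are hypothetical, and it uses Python's standard `string.Template` purely for brevity. It renders one ConfigMap per cluster from a single template, and rendering fails loudly if a cluster is missing a parameter, which is the kind of check you want before anything is applied.

```python
from string import Template

# Hypothetical cluster inventory; names and values are illustrative only.
CLUSTERS = {
    "prod-az1": {"zone": "az-1", "log_level": "info"},
    "prod-az2": {"zone": "az-2", "log_level": "info"},
    "prod-az3": {"zone": "az-3", "log_level": "info"},
    "prod-az4": {"zone": "az-4", "log_level": "debug"},
}

# One template acts as the source of truth for every cluster.
CONFIGMAP_TEMPLATE = Template("""\
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${app}-settings
data:
  CLUSTER_NAME: $cluster
  ZONE: $zone
  LOG_LEVEL: $log_level
""")


def render_all(app, clusters):
    """Render the template once per cluster.

    Template.substitute raises KeyError when a placeholder has no value,
    so an incomplete cluster definition is caught at render time.
    """
    rendered = {}
    for cluster_name, params in clusters.items():
        rendered[cluster_name] = CONFIGMAP_TEMPLATE.substitute(
            app=app, cluster=cluster_name, **params
        )
    return rendered


manifests = render_all("backend-a", CLUSTERS)
print(manifests["prod-az4"])
```

The rendered text could then be diffed against the live state or handed to an apply step; both are outside the scope of this sketch.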
We’re always on the lookout for new tools and processes that could further streamline our configuration management. If you’re using anything that’s made your Kubernetes experience easier, we’d love to hear about it in the comments!
Tuning for scaling — Expanding for bursts, contracting with requests
A significant amount of effort has gone into fine-tuning our application resource requests so they are right-sized based on real usage patterns. This has been instrumental in optimizing the performance of our Kubernetes nodes and improving our ability to handle scaling events efficiently. By right-sizing resource requests, we’ve been able to make better use of our nodes and dramatically improve the bin-packing efficiency of our workloads.
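As a rough illustration of what right-sizing can look like, the sketch below derives a CPU request from observed usage samples and shows the effect on bin-packing. The method (a nearest-rank 95th percentile plus 15% headroom), the sample values, and the node size are all assumptions made for the example, not a description of our exact process.

```python
def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list, for integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one usage sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct / 100 * n)
    return ordered[rank - 1]


def recommend_request(usage_millicores, pct=95, headroom_pct=15):
    """Suggest a CPU request in millicores: percentile of usage plus headroom."""
    base = nearest_rank_percentile(usage_millicores, pct)
    return (base * (100 + headroom_pct) + 99) // 100  # integer ceiling


def pods_per_node(allocatable_millicores, request_millicores):
    """How many pods fit on a node when CPU requests are the only constraint."""
    return allocatable_millicores // request_millicores


# Hypothetical CPU usage samples (millicores) for one pod over time.
usage = [180, 200, 210, 220, 230, 240, 250, 260, 300, 420]

right_sized = recommend_request(usage)
print("suggested request (m):", right_sized)
print("pods per node at 1000m:", pods_per_node(16000, 1000))
print("pods per node right-sized:", pods_per_node(16000, right_sized))
```

With these made-up numbers, lowering an over-generous 1000m request to the suggested value roughly doubles how many pods fit on the same node, which is the bin-packing gain described above.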
However, just right-sizing the resource requests wasn’t enough to smooth out the scaling process. We had to introduce additional tuning and custom tooling to handle burst traffic efficiently. Here’s how we manage that:
The cluster over-provisioner & pod preemption
One of the most useful tools in our Kubernetes scaling toolkit is the Cluster Over-Provisioner, which allows us to proactively ensure that additional resources are available when needed. In simple terms, we define the number of replicas our service needs to scale and the amount of resources they require. For example, let’s take a service like backend-A, which experiences frequent traffic bursts and needs a substantial amount of resources to handle that load.
If we expect backend-A to need 200 additional pods to absorb traffic spikes across all four clusters, we configure each over-provisioner pod to request more CPU and memory than a backend-A pod would typically use. We set the replica count for the over-provisioner pods to 50 per cluster (200 in total), and with Priority Preemption and Cluster Autoscaler configured correctly, this is what happens:
- The Cluster Over-Provisioner ensures there are sufficient resources (roughly 200 additional backend-A pods’ worth) on standby at all times, ready for scale-up events.
- When new backend-A pods need to be scheduled, the over-provisioner pods are evicted (preempted) to free up the necessary resources.
- The evicted over-provisioner pods go back to a pending state. If no existing node has room for them, the cluster-autoscaler sees the unschedulable pods and provisions new nodes, which restores the standby capacity for the next burst.
Used this way, the over-provisioner absorbs the delay of provisioning new nodes, so we can handle production scale events and traffic spikes without disruption, even when those events are complex or resource-intensive.
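The pattern described above is commonly built from two Kubernetes objects: a low-priority PriorityClass and a Deployment of placeholder pods that do nothing except reserve resources. The sketch below generates both as plain Python dictionaries and checks the sizing arithmetic. It is a generic illustration of the pattern, not our exact configuration: the object names, the priority value of -1, the pause image tag, and the resource figures are assumptions.

```python
import json


def replicas_per_cluster(total_extra_pods, cluster_count):
    """Split the standby pod count evenly across clusters, rounding up."""
    return -(-total_extra_pods // cluster_count)  # integer ceiling division


def priority_class(name="overprovisioning", value=-1):
    """A PriorityClass below the default priority of 0, so pods without an
    explicit priority can preempt the placeholders."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": "Placeholder pods that real workloads may preempt.",
    }


def placeholder_deployment(replicas, cpu, memory, priority_class_name):
    """A Deployment of pause containers that only reserve CPU and memory."""
    labels = {"app": "overprovisioner"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioner", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": priority_class_name,
                    "containers": [
                        {
                            "name": "pause",
                            "image": "registry.k8s.io/pause:3.9",
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory}
                            },
                        }
                    ],
                },
            },
        },
    }


# 200 standby pods across 4 clusters, as in the backend-A example.
replicas = replicas_per_cluster(200, 4)
pc = priority_class()
# Hypothetical requests, set a little above a typical backend-A pod.
deployment = placeholder_deployment(
    replicas, cpu="550m", memory="1100Mi", priority_class_name=pc["metadata"]["name"]
)
print(json.dumps(deployment, indent=2))
```

One detail worth knowing when choosing the priority value: Cluster Autoscaler ignores pods whose priority falls below its expendable-pods cutoff (-10 by default), so a placeholder priority lower than that would not trigger a scale-up.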
To conclude
Kubernetes can be incredibly complex, with an almost infinite variety of configurations and tools that can be used to meet different organizational needs. At Kakbima, we take pride in the way we’ve shaped Kubernetes to fit our specific use case. But that doesn’t mean we’re finished. We remain committed to exploring new tools, technologies, and approaches that will continue to enhance our infrastructure, improve scalability, and boost reliability. The Kubernetes landscape is constantly evolving, and so are we. Looking ahead, we’re excited about the opportunities that lie in further optimizing our infrastructure. Kubernetes may be a powerful tool, but the real value comes from continuously learning and adapting as our needs grow and change.