Kubernetes. Containers. Orchestration. It’s all the rage currently and it seems like you have to do it or be forgotten in the dust. Then one tries and hits the wall trying to understand how to get going outside of few nice cloud providers (like GKE), find it hard to move beyond minikube, and generally find oneself with a lot of problems and little to no benefits expounded by Kubernetes fans (like me!). This post is going to be first of a series where I am going to try and document common mistakes that might be not obvious to someone just starting out, to make your travels with Kubernetes into smooth sailing.

Not enough resources

A common complaint I’ve encountered is with regards to how many resources K8s consumes even before you start your own applications, and how few small VMs weren’t capable of getting it running because kube-system eats 70% of a node.

The reason behind it is that various system components of Kubernetes have empirically-derived resource requirements - for a reasonably big and active deployment. Minikube sets its own, much smaller requests on resources for internal services, and thus can reasonably work on much smaller system.

The answer is to use at least 2, preferably more, CPU per node. You can modify the settings to be workable on lesser resources, but when starting out it’s not advisable.

Using few big machines instead of many smaller ones

Didn't you just tell us to use a bigger machine?

Why, yes, I did.

There are two conflicting forces at play. Each individual node needs to be powerful enough to run cluster-supporting workloads, as well as reasonable amount of your own. However simply making few big nodes means that whenever a node fails, your available capacity drop much more - and that means that the knock-on effect on the rest of the cluster is much bigger. Even when you plan your capacity to account for this and use pod antiAffinities and multiple replicas, even if you run in cloud with autoscaler setting a new node to take over immediately - the bigger percentage a single node represents, the worse the chaos after failure.

Sounds like simple, obvious thing, yet I have to admit I have done that in the past.

A good rule of thumb is that you should try to have nodes sized so that losing one node doesn't take a significant portion of your capacity offline. Otherwise you're risking that possibly failing pods won't result in not enough capacity, as system loses slack necessary to start new resources in replacement of failing ones.

Nature of kubernetes as a much more dynamic system than rigid approaches like pacemaker and manually allocated applications, while making it easier to use every last bit of capacity you have, also makes it a bit more chaotic (good time to get that Chaos Engineering practices in play!).

Requesting too many resources unknowingly

This is something that has bitten me in the past. You should always, always specify resource requests (CPU, memory) and limits. Otherwise you might find yourself in a situation where adding new deployment, despite unloaded cluster, fails due to… missing capacity. This was caused by GKE having (at least in 2016!) a default limitRange set on default namespace that automatically set request.cpu=100m.

In my case the cause was historical, default CPU allocation on GKE. But the crucial part was not knowing it was there. If you have a resource reservation defined for the namespaces you use and you don't know about it, it's very easy to quickly find out that your deployments are locked due to "no capacity" errors.

Learn about resource limits, requests, and quotas, and remember to always apply them to your pods.

Any pod without explicit resource settings should be done by conscious decision, otherwise you might be woken by alerts at the most inopportune moment.

To be continued…

As suggested by title, there's a second part to this article incoming, which will delve into some more basic pitfalls that await a new kubernetes user.