Most teams set CPU and memory requests by guessing. The result is over-provisioning that wastes money or under-provisioning that causes evictions. Here is the practical method for picking each number, the difference between requests and limits, and why CPU limits are often a mistake.
Service mesh promises automatic mTLS, traffic shifting, and observability. The operational cost is real — Istio doubles a cluster's control-plane complexity. Here is the honest framework for whether your team needs a mesh, the lighter alternatives, and the migration that doesn't break production.
You set up rolling deploys carefully. Then a node drains during cluster upgrade and takes 80% of your pods at once. PodDisruptionBudget is the manifest that says “never evict more than N at a time.” Three lines of YAML, real production benefits.
Default HPA scales on CPU, which is wrong for most modern workloads. Memory, queue depth, request rate, and custom business metrics are what actually correlate with “need more pods.” Here is the working setup with custom metrics, the formula HPA uses, and the four mistakes that cause flapping.
Most teams configure liveness and readiness probes identically and wonder why a slow database makes Kubernetes restart their pods in a death spiral. Here is what each probe is actually for, the right endpoint shape for each, and the four-line config that turns an outage into a non-event.