Alaa Qutaish's Picture

Hey, I’m Alaa, I build awesome Cloud Native platforms & teams.

Hey, I am Alaa, I studied Computer Science at University of Greenwich & have 12+ years of experince in SRE, Cloud Systems & Distributed Systems. I Worked at EU Startups, Telecom Data Centers, German Automotive & Gaming industries. Available for Hands-on Consulting, Training & Team Building.

Scaling Kubernetes to Over 4k Nodes and 200k Pods

Unlike Apache Mesos, which can scale up to 10,000 nodes out of the box, scaling Kubernetes is challenging. Kubernetes’ scalability is not just limited to the number of nodes and pods, but several aspects like the number of resources created, the number of containers per pod, the total number of services, and the pod deployment throughput. This post describes some challenges we faced when scaling and how we solved them.

How to Work Asynchronously as a Remote-First SRE

The core practices for remote work at Netlify are prioritising asynchronous communication, being intentional about our remote community building, and encouraging colleagues to protect their work-life balance. Sustainable remote work starts with sustainable working hours, which includes making yourself "almost" unreachable with clear boundaries and protocols for out of hours contact

Introducing Karpenter Kubernetes Cluster Autoscaler

Karpenter is an open-source, flexible, high-performance Kubernetes cluster autoscaler built with AWS. It helps improve your application availability and cluster efficiency by rapidly launching right-sized compute resources in response to changing application load. Karpenter also provides just-in-time compute resources to meet your application’s needs and will soon automatically optimize a cluster’s compute resource footprint to reduce costs and improve performance.

Timeouts, retries, and backoff with jitter

Whenever one service or system calls another, failures can happen. These failures can come from a variety of factors. They include servers, networks, load balancers, software, operating systems, or even mistakes from system operators. We design our systems to reduce the probability of failure, but impossible to build systems that never fail. So in Amazon, we design our systems to tolerate and reduce the probability of failure, and avoid magnifying a small percentage of failures into a complete outage. To build resilient systems, we employ three essential tools (timeouts, retries, and backoff).

Amazon VPC CNI plugin increases pods per node limits

Amazon VPC Container Networking Interface (CNI) Plugin supports “prefix assignment mode”, enabling you to run more pods per node on AWS Nitro based EC2 instance types. To achieve higher pod density, the VPC CNI plugin leverages a new VPC capability that enables IP address prefixes to be associated with elastic network interfaces (ENIs) attached to EC2 instances. You can now assign /28 (16 IP addresses) IPv4 address prefixes, instead of assigning individual secondary IPv4 addresses to network interfaces. This significantly increases number of pods that can be run per node.