Alaa Qutaish

POD Readiness Gates

08 July 2022 – 1 min read

Your application can inject extra feedback or signals into PodStatus: Pod readiness. To use this, set readinessGates in the Pod's spec to specify a list of additional conditions that the kubelet evaluates for Pod readiness.

SRE Readiness Kubernetes LifeCycle External
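
As a rough illustration of the readiness-gate rule in the excerpt above, the sketch below models a Pod as ready only when its containers are ready and every condition listed in readinessGates is "True". The `PodSketch` class and `pod_is_ready` function are hypothetical names for this example, not part of any Kubernetes client library, and the condition type in the usage note is made up.

```python
from dataclasses import dataclass, field


@dataclass
class PodSketch:
    """Hypothetical, simplified stand-in for the parts of a Pod that matter here."""
    containers_ready: bool
    # Condition types listed under spec.readinessGates.
    readiness_gates: list = field(default_factory=list)
    # Condition type -> status string ("True", "False", "Unknown") from status.conditions.
    conditions: dict = field(default_factory=dict)


def pod_is_ready(pod: PodSketch) -> bool:
    """Ready only if containers are ready AND every gate condition is "True".

    A gate whose condition is missing from the status is treated as not satisfied.
    """
    if not pod.containers_ready:
        return False
    return all(pod.conditions.get(gate) == "True" for gate in pod.readiness_gates)
```

For example, a Pod with `readiness_gates=["example.com/load-balancer-attached"]` (an invented condition type) stays not ready until an external controller sets that condition to "True", even if all containers pass their probes.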

6 Important things you need to run Kubernetes in production

23 March 2022 – 1 min read

Setting up a Kubernetes stack according to best practices requires expertise, and is necessary to set up a stable cluster that is future-proof. Simply running a managed cluster and deploying your application is not enough. Some additional things are needed to run a production-ready Kubernetes cluster. A good Kubernetes setup makes the life of developers a lot easier and gives them time to focus on delivering business value.

SRE Production Kubernetes GitOps IaC External

Scaling Kubernetes to Over 4k Nodes and 200k Pods

13 February 2022 – 1 min read

Unlike Apache Mesos, which can scale up to 10,000 nodes out of the box, scaling Kubernetes is challenging. Kubernetes’ scalability is not limited to the number of nodes and pods; it also depends on several other aspects, such as the number of resources created, the number of containers per pod, the total number of services, and the pod deployment throughput. This post describes some challenges we faced when scaling and how we solved them.

SRE Scaling Kubernetes etcd Workloads External

How to Work Asynchronously as a Remote-First SRE

06 December 2021 – 1 min read

The core practices for remote work at Netlify are prioritising asynchronous communication, being intentional about our remote community building, and encouraging colleagues to protect their work-life balance. Sustainable remote work starts with sustainable working hours, which includes making yourself "almost" unreachable with clear boundaries and protocols for out-of-hours contact.

SRE Culture Remote Communication Teams External

Introducing Karpenter Kubernetes Cluster Autoscaler

01 December 2021 – 1 min read

Karpenter is an open-source, flexible, high-performance Kubernetes cluster autoscaler built with AWS. It helps improve your application availability and cluster efficiency by rapidly launching right-sized compute resources in response to changing application load. Karpenter also provides just-in-time compute resources to meet your application’s needs and will soon automatically optimize a cluster’s compute resource footprint to reduce costs and improve performance.

EKS Autoscaler Events Capacity Compute External

Timeouts, retries, and backoff with jitter

21 November 2021 – 1 min read

Whenever one service or system calls another, failures can happen. These failures can come from a variety of factors, including servers, networks, load balancers, software, operating systems, or even mistakes from system operators. We design our systems to reduce the probability of failure, but it is impossible to build systems that never fail. So at Amazon, we design our systems to tolerate failure and to avoid magnifying a small percentage of failures into a complete outage. To build resilient systems, we employ three essential tools: timeouts, retries, and backoff.

Failures Timeouts Retries Backoffs Distributed-Systems External
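
To make the retry and backoff idea in the excerpt above concrete, here is a minimal sketch of retries with capped exponential backoff and "full jitter" (each delay is drawn uniformly between zero and the current backoff ceiling). The function names, default values, and the blanket `except Exception` are assumptions for illustration, not the linked article's code; a real client would retry only on errors it knows to be transient and would also set a timeout on each call.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter delay in seconds: rng() scaled by min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on failure with jittered exponential backoff.

    sleep and rng are injectable so the behaviour can be checked without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

The jitter spreads retries from many clients over time, so a brief failure is less likely to be followed by a synchronized wave of retries against the recovering service.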

Amazon VPC CNI plugin increases pods per node limits

18 September 2021 – 1 min read

Amazon VPC Container Networking Interface (CNI) Plugin supports “prefix assignment mode”, enabling you to run more pods per node on AWS Nitro-based EC2 instance types. To achieve higher pod density, the VPC CNI plugin leverages a new VPC capability that enables IP address prefixes to be associated with elastic network interfaces (ENIs) attached to EC2 instances. You can now assign /28 (16 IP addresses) IPv4 address prefixes, instead of assigning individual secondary IPv4 addresses to network interfaces. This significantly increases the number of pods that can be run per node.

EKS CNI VPC ENI IPAMD CIDR PODs External
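
The arithmetic behind the density gain described above can be sketched as follows. This assumes each ENI reserves one address slot as its primary IP and that every remaining slot holds either a single secondary address or one /28 prefix. The ENI and slot counts are illustrative inputs for a hypothetical instance type, and the result is an address count, not an official max-pods value.

```python
ADDRESSES_PER_PREFIX = 16  # a /28 IPv4 prefix contains 2**(32 - 28) addresses


def assignable_pod_ips(enis, slots_per_eni, prefix_mode):
    """Count IPv4 addresses available to pods under the assumptions above."""
    # One slot per ENI is assumed to be taken by the ENI's primary address.
    usable_slots = enis * (slots_per_eni - 1)
    if prefix_mode:
        return usable_slots * ADDRESSES_PER_PREFIX
    return usable_slots


# Hypothetical instance with 3 ENIs and 10 address slots per ENI.
print(assignable_pod_ips(3, 10, prefix_mode=False))  # prints 27
print(assignable_pod_ips(3, 10, prefix_mode=True))   # prints 432
```

In practice the usable pod count is also bounded by other limits, such as CPU, memory, and the configured maximum pods per node.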

How to enable Kubernetes container Runtime Default seccomp profile for all workloads

24 August 2021 – 1 min read

Seccomp (Secure Computing) is a feature in the Linux kernel that allows a userspace program to create syscall filters. In the context of containers, these syscall filters are collated into seccomp profiles that can be used to restrict which syscalls and arguments are permitted. Applying seccomp profiles to containers reduces the chance that a Linux kernel vulnerability will be exploited.

containers security seccomp syscalls linux kernel External

Root cause of failure, root cause of success

22 August 2021 – 1 min read

Everyone likes the idea of a single root cause when a problem occurs. This post compares that to how we think about successes, to make the point about the fragility of looking for a singular root cause.

Success Failure SRE Systems External

The Hidden Dangers of Terminating K8S Namespaces

11 August 2021 – 1 min read

Controllers are one of the foundational components of Kubernetes whose job is to constantly monitor (through a control loop) the defined API resources in order to bring the cluster to the desired state. Each controller has a designed purpose that manages the entire lifecycle of a particular component. An important concept to remember with any cloud native technology is that availability is not guaranteed. If a controller was designed to take action when a resource was deleted and the controller was unavailable at that point in time, the intended action would not occur and state would no longer be in sync.

Kubernetes Namespaces Termination GC Openshift External

How to Serve 200K Samples per second with Prometheus

27 June 2021 – 1 min read

I will explain how to build a monitoring system that can retain data for long periods, which can handle up to 200K samples per second. The important point is that all of these processes are realized on one centralized Prometheus and Thanos server.

Kubernetes Prometheus Monitoring Thanos TSDB External

A multi-cluster shared services architecture with EKS using Cilium ClusterMesh

23 June 2021 – 1 min read

ClusterMesh is Cilium’s multi-cluster implementation that is built on top of Cilium CNI. It enables users to set up cross-cluster connectivity with standard Kubernetes semantics for transparent service discovery. Each cluster in the mesh participates as a peer. Cross-cluster traffic is handled by individual nodes rather than using a central gateway.

EKS ClusterMesh Cilium Networking multi-cluster CNI eBPF External

Security Practices for Multi-Tenant SaaS Applications using Amazon EKS

04 June 2021 – 1 min read

This technical guide shows you how to securely manage and operate multi-tenant software-as-a-service (SaaS) applications on Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

Kubernetes EKS SaaS Multi-Tenant PDF SELinux OPA External

11 of the Best Open-Source Kubernetes Tools

02 June 2021 – 1 min read

The incredible community around Kubernetes is constantly sharing tools that help improve the experience of being a Kubernetes developer. Here is my list of the 11 essential tools I keep in my arsenal. I break them down by category: the ones that help me run Kubernetes, test Kubernetes, and, last but not least, have fun in my IDE.

Kubernetes Podman Skaffold External