I develop well-architected cloud-native platforms & build SRE teams

Hello, my name is Alaa. I studied Computer Science at the University of Greenwich and have over 12 years of experience in Site Reliability Engineering, Cloud Systems, and Distributed Systems. I have worked with startups across Europe, the United States, and Japan, in industries including Telecom, Automotive, Energy Transmission, Gaming, AR, and Generative AI. I offer hands-on consulting, training, and team-building services.

Timeouts, retries, and backoff with jitter

21 November 2021 – 1 min read

Whenever one service or system calls another, failures can happen. These failures can come from a variety of sources: servers, networks, load balancers, software, operating systems, or even mistakes by system operators. We design our systems to reduce the probability of failure, but it is impossible to build systems that never fail. So at Amazon, we design our systems to tolerate failure and reduce its probability, and to avoid magnifying a small percentage of failures into a complete outage. To build resilient systems, we employ three essential tools: timeouts, retries, and backoff.
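A minimal Python sketch of retries with capped exponential backoff, assuming "full jitter". Each wait is drawn uniformly from zero up to an exponentially growing bound, so many clients retrying at the same moment spread out instead of retrying in lockstep. The names `backoff_delay` and `call_with_retries`, the default base, cap and attempt limit, and the choice of which exceptions count as retryable are illustrative assumptions, not details from the linked article. The timeout belongs inside the operation being retried, for example the `timeout` argument of `urllib.request.urlopen`.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Capped exponential backoff with full jitter.

    The upper bound doubles with each attempt (base, 2*base, 4*base, ...) but never
    exceeds `cap`; the actual delay is drawn uniformly from [0, bound].
    """
    bound = min(cap, base * (2 ** attempt))
    return random.uniform(0, bound)


def call_with_retries(
    op: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 10.0,
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical helper: call `op`, retrying transient failures with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return op()  # `op` should enforce its own timeout on the remote call
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt, base, cap))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter lets a test substitute a fake clock, while production code keeps the default `time.sleep`.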

Failures Timeouts Retries Backoffs Distributed-Systems External

Amazon VPC CNI plugin increases pods per node limits

18 September 2021 – 1 min read

The Amazon VPC Container Network Interface (CNI) plugin supports “prefix assignment mode”, enabling you to run more pods per node on AWS Nitro-based EC2 instance types. To achieve higher pod density, the VPC CNI plugin leverages a new VPC capability that enables IP address prefixes to be associated with elastic network interfaces (ENIs) attached to EC2 instances. You can now assign /28 (16 IP addresses) IPv4 address prefixes to network interfaces instead of individual secondary IPv4 addresses. This significantly increases the number of pods that can run on each node.
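The capacity gain can be estimated with a small Python sketch of the commonly cited VPC CNI max-pods formula. Each ENI's primary IPv4 address is not handed to pods. Every remaining slot carries either one secondary address or, in prefix mode, one /28 prefix of 16 addresses. Then 2 is added for host-network pods. The function name is hypothetical, and the example figures for an m5.large (3 ENIs, 10 IPv4 addresses per ENI) should be checked against current EC2 documentation. In practice the kubelet's `maxPods` setting (110 by default in upstream Kubernetes) caps the result well below the prefix-mode figure.

```python
PREFIX_SIZE = 2 ** (32 - 28)  # a /28 IPv4 prefix contains 16 addresses


def estimated_max_pods(enis: int, ipv4_per_eni: int, prefix_mode: bool = False) -> int:
    """Hypothetical helper: pod IP capacity per node under the VPC CNI.

    - enis: maximum network interfaces the instance type supports
    - ipv4_per_eni: IPv4 address slots per interface (including the primary)
    """
    usable_slots = enis * (ipv4_per_eni - 1)  # each ENI's primary address is not used for pods
    addresses_per_slot = PREFIX_SIZE if prefix_mode else 1
    # +2 for pods that use host networking (for example aws-node and kube-proxy)
    return usable_slots * addresses_per_slot + 2


# m5.large: 3 ENIs with 10 IPv4 addresses each (verify against EC2 documentation)
print(estimated_max_pods(3, 10))                    # 29
print(estimated_max_pods(3, 10, prefix_mode=True))  # 434
```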

EKS CNI VPC ENI IPAMD CIDR PODs External

How to enable the container runtime's default seccomp profile (RuntimeDefault) for all Kubernetes workloads

24 August 2021 – 1 min read

Seccomp (Secure Computing) is a feature in the Linux kernel that allows a userspace program to create syscall filters. In the context of containers, these syscall filters are collated into seccomp profiles that can be used to restrict which syscalls and arguments are permitted. Applying seccomp profiles to containers reduces the chance that a Linux kernel vulnerability will be exploited.
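Applying the runtime's default profile to a single workload can be shown with a Python sketch. It builds a Pod manifest whose pod-level `securityContext.seccompProfile.type` is `RuntimeDefault`. The function name, pod name, and image are hypothetical. Kubernetes accepts JSON manifests, so the output can be passed to `kubectl apply -f`. This opts in one Pod at a time. Making the profile the default for every workload is a node-level setting (the kubelet `SeccompDefault` feature, alpha in v1.22) and is not shown here.

```python
import json


def pod_with_runtime_default_seccomp(name: str, image: str) -> dict:
    """Return a Pod manifest that runs its containers under the runtime's default seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # A pod-level seccompProfile applies to all containers unless a container overrides it.
            "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # Hypothetical names; save the output as pod.json and run: kubectl apply -f pod.json
    print(json.dumps(pod_with_runtime_default_seccomp("seccomp-demo", "nginx:1.21"), indent=2))
```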

containers security seccomp syscalls linux kernel External

Root cause of failure, root cause of success

22 August 2021 – 1 min read

Everyone likes the idea of a single root cause when a problem occurs. This post compares that instinct with how we think about successes, to show how fragile the search for a single root cause is.

Success Failure SRE Systems External

The Hidden Dangers of Terminating K8s Namespaces

11 August 2021 – 1 min read

Controllers are one of the foundational components of Kubernetes. Each one runs a control loop that constantly watches the defined API resources and works to bring the cluster to the desired state. Each controller has a specific purpose: managing the entire lifecycle of a particular component. An important concept to remember with any cloud-native technology is that availability is not guaranteed. If a controller is designed to take action when a resource is deleted, and the controller is unavailable at that moment, the intended action does not occur and the cluster state falls out of sync.
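To make the failure mode concrete, here is a deliberately simplified Python model (not the Kubernetes API) of a controller that uses a finalizer to run cleanup when a resource is deleted. In Kubernetes, an object that still carries finalizers is only marked for deletion until the responsible controller removes them. The object gets a deletion timestamp, and a namespace in that state shows as Terminating. The classes, the `example.com/cleanup` finalizer name, and the `external` set are all hypothetical. The `external` set stands in for state the controller manages outside the object itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

FINALIZER = "example.com/cleanup"  # hypothetical finalizer name


@dataclass
class Obj:
    name: str
    finalizers: List[str] = field(default_factory=list)
    terminating: bool = False  # stands in for metadata.deletionTimestamp being set


class ToyStore:
    """In-memory stand-in for the API server (simplified model)."""

    def __init__(self) -> None:
        self.objects: Dict[str, Obj] = {}

    def create(self, name: str) -> None:
        self.objects[name] = Obj(name, [FINALIZER])

    def delete(self, name: str) -> None:
        obj = self.objects[name]
        obj.terminating = True  # only marks the object while finalizers remain
        self._gc(obj)

    def remove_finalizer(self, name: str) -> None:
        obj = self.objects[name]
        obj.finalizers.remove(FINALIZER)
        self._gc(obj)

    def _gc(self, obj: Obj) -> None:
        # The object disappears only once deletion was requested and no finalizers remain.
        if obj.terminating and not obj.finalizers:
            del self.objects[obj.name]


class CleanupController:
    def __init__(self, store: ToyStore) -> None:
        self.store = store
        self.external: Set[str] = set()  # state this controller owns outside the object

    def reconcile(self) -> None:
        """One pass of the control loop: drive actual state toward desired state."""
        for obj in list(self.store.objects.values()):
            if obj.terminating:
                self.external.discard(obj.name)  # clean up first ...
                self.store.remove_finalizer(obj.name)  # ... then allow deletion
            else:
                self.external.add(obj.name)
```

If `reconcile()` runs after `delete()`, the cleanup happens. If it is skipped because the controller is down, the object simply waits in a terminating state. Force-removing the finalizer at that point deletes the object but leaves `external` holding state that nothing will clean up. This is one way the state described above can fall out of sync.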

Kubernetes Namespaces Termination GC Openshift External