SRE

3 Articles

Scaling Kubernetes to Over 4k Nodes and 200k Pods

Unlike Apache Mesos, which can scale up to 10,000 nodes out of the box, Kubernetes is challenging to scale. Its scalability is limited not just by the number of nodes and pods but also by several other factors, such as the number of resources created, the number of containers per pod, the total number of services, and the pod deployment throughput. This post describes some of the challenges we faced when scaling and how we solved them.

How to Work Asynchronously as a Remote-First SRE

The core practices for remote work at Netlify are prioritising asynchronous communication, being intentional about building our remote community, and encouraging colleagues to protect their work-life balance. Sustainable remote work starts with sustainable working hours, and that includes making yourself "almost" unreachable, with clear boundaries and protocols for out-of-hours contact.

Root cause of failure, root cause of success

Everyone likes the idea of a single root cause when a problem occurs. This post compares that with how we think about successes, to show how fragile the search for a single root cause is.