The Five-Year Iteration: A Founder's Kubernetes Odyssey
An anonymous founder's five-year journey through five distinct Kubernetes cluster rebuilds reveals a relentless pursuit of robust, self-hosted infrastructure, driven by specific technical challenges and iterative problem-solving.
In a Reddit post published in May 2026, a user identified only by the handle Inevitable_Remove_67 recounted a five-year odyssey in self-hosting Kubernetes. The account detailed not merely the maintenance of a production system, but its complete reconstruction five separate times. Each rebuild was a direct response to a specific technical limitation or architectural challenge encountered in the preceding iteration, a testament to an iterative approach to infrastructure design that prioritized function over static adherence to initial choices. The narrative offers a granular look at the practicalities of operating complex systems outside the managed cloud ecosystem, highlighting the continuous learning inherent in such an endeavor.
Thesis
The founder's journey, as documented in the Reddit post, illustrates a profound commitment to understanding and mastering the foundational layers of cloud-native infrastructure. This is not a story of a single, grand architectural vision, but rather a compelling case study in iterative engineering, where each failure or inefficiency became a catalyst for a more resilient, cost-effective, or secure system. The founder's willingness to dismantle and rebuild, rather than patch, underscores a pragmatic approach to technical debt and operational excellence.
Origin / Background
The founder's personal background is not stated in the public record. The journey into self-hosted Kubernetes began approximately five years before the Reddit post, indicating a long-term engagement with the complexities of production infrastructure. The initial impetus was the need for a highly available (HA) system, a common requirement for production workloads. In this early stage the founder followed established guides, and the post specifically cites Techno Tim's k3s guide, suggesting a foundation built on community knowledge and readily available open-source tools. The choice to run bare metal nodes, rather than relying on managed services like EKS or GKE, set the stage for a hands-on, deeply technical learning curve that would define the subsequent years.
The Build
The founder's technical evolution unfolded across five distinct stages, each marked by a specific problem and a corresponding architectural overhaul.
Stage 1: The Quest for High Availability. The initial setup aimed for HA, leveraging three DigitalOcean VPS instances. These virtual private servers were configured with Nginx to act as a makeshift load balancer, positioned behind a cloud load balancer. While functional, this configuration proved inefficient, with the most expensive components performing the least critical tasks, prompting the first re-evaluation.
Stage 2: Optimizing Resource Utilization and Networking. The second iteration addressed the inefficiency of idle RAM on Hetzner nodes. The founder migrated the infrastructure to Contabo, a provider that, at the time, lacked a private networking option. This constraint led the founder to build a WireGuard mesh network, implemented with Netclient, to provide secure private communication between nodes. The Nginx VPS and cloud load balancer from Stage 1 were removed entirely and replaced by Klipper, the service load balancer bundled with k3s, which streamlined the architecture and reduced costs.
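To make the idea of a full mesh concrete, the short sketch below enumerates the tunnels such a network needs: every node peers directly with every other node, so n nodes require n(n-1)/2 links. The node names and helper functions are hypothetical and are not taken from the founder's setup; in practice a tool such as Netclient handles this bookkeeping automatically.

```python
from itertools import combinations


def mesh_links(nodes):
    """Return every unordered pair of nodes; a full mesh needs one tunnel per pair."""
    return list(combinations(sorted(nodes), 2))


def peers_for(node, nodes):
    """Return the peers a single node must be configured with (everyone but itself)."""
    return [other for other in sorted(nodes) if other != node]


# Hypothetical node names, for illustration only.
nodes = ["node-a", "node-b", "node-c", "node-d"]

links = mesh_links(nodes)
print(f"{len(nodes)} nodes need {len(links)} tunnels")  # 4 nodes need 6 tunnels
for left, right in links:
    print(f"  {left} <-> {right}")
```

The quadratic growth in links is one reason mesh networks are usually managed by a controller rather than by hand-written peer lists.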
Stage 3: Embracing Multi-Architecture and Cost Savings. The third stage saw an expansion of the cluster by integrating Oracle Cloud's ARM nodes, which were available for free. The existing WireGuard mesh was extended to incorporate these new workers. This move necessitated a multi-architecture build process for application images, which the founder managed using GoReleaser and GitHub Actions, ensuring compatibility across both amd64 master nodes and ARM64 worker nodes.
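A mixed cluster works only if each node pulls the image variant built for its own CPU architecture. The sketch below illustrates that selection step under stated assumptions: the index structure follows the general shape of an OCI image index, while the digests, function names, and mapping table are hypothetical placeholders. It does not reproduce the founder's GoReleaser or GitHub Actions configuration, which the post does not show.

```python
# Map common `uname -m` / platform.machine() values to OCI architecture names.
MACHINE_TO_OCI_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def oci_arch(machine):
    """Translate a machine string into the architecture name used in image indexes."""
    try:
        return MACHINE_TO_OCI_ARCH[machine.lower()]
    except KeyError:
        raise ValueError(f"unsupported machine type: {machine!r}")


def pick_manifest(index, machine, os_name="linux"):
    """Return the index entry whose platform matches the given node."""
    arch = oci_arch(machine)
    for entry in index["manifests"]:
        platform = entry.get("platform", {})
        if platform.get("architecture") == arch and platform.get("os") == os_name:
            return entry
    raise LookupError(f"no image variant for {os_name}/{arch}")


# Hypothetical multi-architecture index; the digests are placeholders.
example_index = {
    "manifests": [
        {"digest": "sha256:placeholder-amd64",
         "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:placeholder-arm64",
         "platform": {"architecture": "arm64", "os": "linux"}},
    ]
}

print(pick_manifest(example_index, "aarch64")["digest"])
```

If a build publishes only the amd64 variant, the lookup for an ARM64 worker fails, which is the situation a multi-architecture build pipeline is meant to prevent.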
Stage 4: Navigating Network Layers and Latency. A key challenge in Stage 4 was the desire to avoid exposing ports 80 and 443 on every node. The founder experimented with Calico BGP in conjunction with MetalLB to announce a private load balancer IP, an architecturally sound approach. However, after a month of operation, this configuration introduced noticeably high HTTP latency. The system was reverted, retaining the internode WireGuard mesh but returning to Keepalived for managing the floating IP, prioritizing performance over the initial architectural ideal.
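The post does not say how the latency regression was measured. As a hedged illustration of how such a comparison could be made, the sketch below compares the 95th-percentile latency of two sets of request timings; the sample values, the function names, and the 1.25 threshold are all assumptions chosen for the example.

```python
import statistics


def p95(samples_ms):
    """95th percentile of a list of latency samples, in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100)[94]


def latency_regressed(baseline_ms, candidate_ms, max_ratio=1.25):
    """True if the candidate's p95 exceeds the baseline's p95 by more than max_ratio."""
    return p95(candidate_ms) > max_ratio * p95(baseline_ms)


# Made-up timings: a steady baseline versus a setup with a slow tail.
baseline = [10.0] * 100
candidate = [10.0] * 80 + [100.0] * 20

print("regressed:", latency_regressed(baseline, candidate))
```

Comparing tail percentiles rather than averages matters here, because a routing problem that slows only some requests can leave the mean almost unchanged.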
Stage 5: Hardening Security with eBPF and Egress Control. The final, and current, iteration focused heavily on security and network segmentation. The founder adopted Cilium, a CNI (Container Network Interface) that leverages eBPF (extended Berkeley Packet Filter) for network policy enforcement. This allowed for a host firewall to operate at a layer below potential CNI conflicts. The entire setup was migrated to rke2 on OVH bare metal. A critical security enhancement was configuring every node for egress-only communication on its public interface, with all ingress traffic routed exclusively through Cloudflare, protected by mTLS. This stage also implemented hard tenant isolation between namespaces by default, significantly enhancing the security posture of the production system.
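The post does not specify which policy resources enforce the tenant isolation. One common way to express namespace isolation is a standard Kubernetes NetworkPolicy that admits ingress only from pods in the same namespace, which Cilium can enforce. The sketch below builds such a manifest as a Python dictionary and prints it as JSON; the policy name, the namespace, and the helper function are hypothetical.

```python
import json


def namespace_isolation_policy(namespace):
    """Build a NetworkPolicy that only admits ingress from pods in the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": "allow-same-namespace-only",  # hypothetical name
            "namespace": namespace,
        },
        "spec": {
            # An empty podSelector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A bare podSelector under "from" matches pods in this namespace only,
            # so traffic from every other namespace is denied.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


# "tenant-a" is a placeholder namespace for illustration.
print(json.dumps(namespace_isolation_policy("tenant-a"), indent=2))
```

Applying one such policy per namespace gives isolation by default, with any cross-namespace traffic requiring an explicit additional rule.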
The Break
The journey was characterized by a series of incremental improvements rather than a single