Tactics·Jun 19, 2026

Self-Hosting GitHub Actions on EKS: Uncovering Silent Infrastructure Failures

A founder reports 85% cost savings by moving CI to self-hosted EKS, but only after debugging multiple silent failures. The experience details critical lessons in cloud infrastructure and Kubernetes configuration.

A founder posting under the handle Blue_Flam3s reports cutting CI runner compute costs by 85% after migrating GitHub Actions to self-hosted runners on EKS. The savings came only after working through a series of "silent failures": configurations that looked correct and deployed cleanly but quietly drained money or stalled the pipeline. The experience highlights the hidden complexity of infrastructure optimization.

Architecture Overview

The setup runs Actions Runner Controller (ARC) in gha-runner-scale-set mode on EKS, with Karpenter provisioning nodes. Spot instances are prioritized, and minRunners: 0 lets the runner set scale to zero when idle so that Karpenter can remove the unused Spot nodes. The goal is to maximize savings by running jobs on ephemeral, cheaper compute.

Uncovering Hidden Costs: Spot Instance Role

The first major failure involved Karpenter falling back to more expensive on-demand instances despite being configured for Spot. The root cause was a missing AWSServiceRoleForEC2Spot service-linked role in the AWS account. Karpenter's role lacked permissions to create this prerequisite, leading to silent Spot CreateFleet failures and automatic fallback to on-demand pricing. The founder says this issue went unnoticed until a manual node check, leading to unexpected costs.
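The manual node check that exposed the problem can be turned into a small script. The sketch below is a minimal illustration, not the founder's tooling: it assumes the output of kubectl get nodes -o json is available as a string and relies on the karpenter.sh/capacity-type label that Karpenter puts on the nodes it launches. The sample data and function names are hypothetical.

```python
import json
from collections import Counter

# Label Karpenter sets on nodes it launches; values are "spot" or "on-demand".
CAPACITY_LABEL = "karpenter.sh/capacity-type"


def capacity_breakdown(node_list_json: str) -> Counter:
    """Count nodes per capacity type from `kubectl get nodes -o json` output."""
    nodes = json.loads(node_list_json).get("items", [])
    counts = Counter()
    for node in nodes:
        labels = node.get("metadata", {}).get("labels", {})
        counts[labels.get(CAPACITY_LABEL, "unlabeled")] += 1
    return counts


def on_demand_share(counts: Counter) -> float:
    """Fraction of Karpenter-labelled nodes that are on-demand (0.0 if none)."""
    labelled = counts["spot"] + counts["on-demand"]
    return counts["on-demand"] / labelled if labelled else 0.0


# Hypothetical sample standing in for real kubectl output: every
# Karpenter-launched node is on-demand, which is what the silent
# fallback described above would look like.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "runner-a", "labels": {CAPACITY_LABEL: "on-demand"}}},
        {"metadata": {"name": "runner-b", "labels": {CAPACITY_LABEL: "on-demand"}}},
        {"metadata": {"name": "base", "labels": {}}},
    ]
})

if __name__ == "__main__":
    counts = capacity_breakdown(SAMPLE)
    print(dict(counts), on_demand_share(counts))
```

A high on-demand share in a cluster configured to prefer Spot is a cue to look at Karpenter's logs for CreateFleet errors. The missing role itself is normally created once per account, for example with aws iam create-service-linked-role --aws-service-name spot.amazonaws.com; the source does not say which fix the founder applied.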

Helm's List Semantics

A "maddening" issue arose from how Helm merges values: maps are merged key by key, but lists are replaced wholesale. Overriding containers[0].image or .resources in Helm values therefore replaced the entire containers entry, silently dropping the chart's default command: ["/home/runner/run.sh"]. Runner pods would connect to GitHub and then exit immediately because of the missing command, leaving the pods in a backoff loop and jobs waiting indefinitely.
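The effect is easy to reproduce outside Helm. The following is a simplified Python model of the merge rule, not Helm's actual implementation: nested maps are merged recursively, while any other value, lists included, replaces the default outright. The chart defaults and the override are hypothetical and reduced to the fields relevant here.

```python
from copy import deepcopy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Deep-merge maps; replace everything else (lists included) wholesale."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults, reduced to the runner container.
CHART_DEFAULTS = {
    "template": {"spec": {"containers": [{
        "name": "runner",
        "image": "ghcr.io/actions/actions-runner:latest",
        "command": ["/home/runner/run.sh"],
    }]}}
}

# An override that only intends to add resource requests.
USER_VALUES = {
    "template": {"spec": {"containers": [{
        "name": "runner",
        "resources": {"requests": {"cpu": "2"}},
    }]}}
}

merged = merge_values(CHART_DEFAULTS, USER_VALUES)
runner = merged["template"]["spec"]["containers"][0]
print("command" in runner)  # prints False: the default command is gone
```

Under this rule the safe pattern is to restate every field of the container in the override, including command, rather than only the fields being changed.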

Pinned Images and Future Outages

For "reproducibility," the founder initially pinned the runner image to a fixed tag. GitHub, however, hard-rejects deprecated runner versions from its message bus with a 403 error. Since ARC runs runners with DisableUpdate: true, a pinned image became a guaranteed future outage dictated by GitHub's deprecation schedule. The founder notes that :latest is often the correct choice in this specific context.
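One way to keep a pinned runner image from slipping back into the values file is a small lint step. The sketch below is an assumption about how such a check might look, not something described in the source: it treats an image reference as pinned when it carries a digest or any tag other than latest. The function names and example references are illustrative.

```python
from typing import Optional


def image_tag(image: str) -> Optional[str]:
    """Return the tag portion of an image reference, or None if untagged."""
    name = image.split("@", 1)[0]    # strip a digest such as @sha256:...
    last = name.rsplit("/", 1)[-1]   # a colon before the last "/" is a registry port
    return last.split(":", 1)[1] if ":" in last else None


def is_pinned(image: str) -> bool:
    """True when the reference is fixed to a digest or to a tag other than latest."""
    if "@" in image:
        return True
    return image_tag(image) not in (None, "latest")


if __name__ == "__main__":
    for ref in ("ghcr.io/actions/actions-runner:2.311.0",
                "ghcr.io/actions/actions-runner:latest"):
        print(ref, "pinned" if is_pinned(ref) else "floating")
```

Whether to fail the build on a pinned runner image or only warn is a policy choice; the founder's point is narrower, namely that for this image a fixed tag eventually stops working.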

Taints and CoreDNS

The strategy of tainting on-demand base nodes to force runner pods onto Spot instances created a critical vulnerability. When the runners scaled to zero and Karpenter consolidated away all Spot nodes, the tainted base node became the only node left. If the CoreDNS pods did not tolerate that taint, they had nowhere to schedule, and the cluster lost DNS resolution and became inoperable.
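Whether a pod can land on a tainted node follows a fixed matching rule, which can be sketched in a few lines. The code below is a simplified model of Kubernetes taint and toleration matching for hard taints only; the taint key dedicated=base and the CoreDNS tolerations shown are hypothetical, since the source does not give the actual values.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    # A toleration with no effect matches taints of every effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # An empty key with Exists matches every taint key.
        return key == "" or key == taint.get("key")
    # Operator Equal: key and value must both match.
    return key == taint.get("key") and toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: list, taints: list) -> bool:
    """A pod fits only if every NoSchedule or NoExecute taint is tolerated."""
    blocking = [t for t in taints if t.get("effect") in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


# Hypothetical taint on the on-demand base node.
BASE_NODE_TAINTS = [{"key": "dedicated", "value": "base", "effect": "NoSchedule"}]

# Hypothetical CoreDNS tolerations before and after the fix.
COREDNS_BEFORE = [{"key": "CriticalAddonsOnly", "operator": "Exists"}]
COREDNS_AFTER = COREDNS_BEFORE + [
    {"key": "dedicated", "operator": "Equal", "value": "base", "effect": "NoSchedule"}
]

if __name__ == "__main__":
    print(schedulable(COREDNS_BEFORE, BASE_NODE_TAINTS))  # CoreDNS cannot schedule
    print(schedulable(COREDNS_AFTER, BASE_NODE_TAINTS))   # CoreDNS can schedule
```

The same check applies to any other system pod that must keep running when only the base node remains.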

Terraform Destroy Hangs

A terraform destroy operation repeatedly hung due to Karpenter-launched nodes not being tracked in Terraform state. An orphaned Spot instance retained an Elastic Network Interface (ENI), blocking the VPC teardown with a DependencyViolation. Manual deletion of nodepools and nodeclaims was required to drain nodes before a successful destroy.
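The ordering described above can be captured in a small wrapper so the manual step is not forgotten. This is a sketch under assumptions: it presumes kubectl and terraform are on the PATH and pointed at the right cluster and workspace, and that deleting Karpenter's nodepools and nodeclaims is enough to terminate the instances. The function names and the 10-minute timeout are hypothetical.

```python
import subprocess


def teardown_plan(timeout: str = "10m") -> list:
    """Commands in the order they must run: Karpenter resources first, Terraform last."""
    return [
        ["kubectl", "delete", "nodepools.karpenter.sh", "--all"],
        ["kubectl", "delete", "nodeclaims.karpenter.sh", "--all", f"--timeout={timeout}"],
        ["terraform", "destroy"],
    ]


def run_plan(plan: list, dry_run: bool = True) -> list:
    """Print each command; execute it only when dry_run is False."""
    executed = []
    for cmd in plan:
        line = " ".join(cmd)
        print(("DRY RUN: " if dry_run else "RUN: ") + line)
        if not dry_run:
            # check=True stops the sequence if any step fails.
            subprocess.run(cmd, check=True)
        executed.append(line)
    return executed


if __name__ == "__main__":
    run_plan(teardown_plan())  # dry run by default; nothing is deleted
```

Keeping the default as a dry run is deliberate, because every step is destructive; terraform destroy is also left interactive so it still asks for confirmation.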

Optimizing infrastructure costs through self-hosting often involves a trade-off between direct expense and operational complexity. The experience of Blue_Flam3s demonstrates that significant savings are attainable, but they demand a deep understanding of underlying cloud provider mechanisms, Kubernetes configuration nuances, and platform-specific behaviors. The "silent failure" modes encountered are a reminder that apparent success in deployment does not always equate to correct or efficient operation.

The investor read

The drive for 85% cost savings on CI runners reflects a broader market trend among bootstrapped and capital-efficient startups to aggressively optimize infrastructure spend. While the technical complexity detailed here is high, the payoff for intermittent workloads can be substantial, freeing up capital for product development or marketing. This signals a continued demand for tools and managed services that abstract away such complexities, or for specialized DevOps talent capable of implementing and maintaining these custom solutions. For investors, the ability to achieve such savings internally can be a strong indicator of technical depth and capital efficiency, but it also flags the potential for significant operational drag if not executed by highly skilled teams. The market for managed CI/CD or specialized cloud cost optimization platforms remains robust, as most companies cannot afford this level of custom engineering.

Sources · how we verified
  1. Self-hosted GitHub Actions runners on EKS: the failures that taught me the most

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.