A slow-read bot took down dozens of sites while the server CPU sat 84% idle
An operator faced a server-wide timeout with no obvious cause. The diagnostic path reveals a critical lesson: to find application stalls, sort logs by time spent, not request count.
Dozens of e-commerce sites on a single shared server went dark simultaneously. The alerts screamed "response timeout," but server monitoring told a different story. According to a post-mortem by the operator, the server's CPU was 84% idle, the database was quiet, and memory was stable. The sites were down, but the machine appeared healthy.
This specific type of failure, where system-level metrics look green while the application is unavailable, points to resource exhaustion at a different layer. The problem was not a lack of compute power. It was a lack of available slots to handle incoming connections.
The usual suspects were innocent
The initial investigation followed a standard playbook, clearing the most common causes of a server-wide outage. The operator reports ruling out CPU saturation (84% idle), database bottlenecks, and memory pressure. No Out-of-Memory killer events were found. The TCP connections were being established, but the server sent nothing back, leading to socket timeouts for legitimate users.
This pattern suggested the server's worker processes, which handle requests in an Apache prefork model, were all occupied. The next logical step was to check the server's global error.log for the "reached MaxRequestWorkers" message, which would confirm this hypothesis. The log was empty.
A critical log pipeline was broken
The investigation revealed a silent, critical failure in the server's observability stack. The operator had configured Apache's ErrorLog to pipe to syslog, but a regex mismatch in the rsyslog configuration caused all server-scope errors to be discarded. The very log file meant to diagnose this exact problem was not recording data.
This left the access.log as the only source of truth. Because the attacker was not generating a high volume of requests, simple counts of requests by IP address did not reveal any anomalies. The attacker was invisible to traditional analysis.
Sorting by time revealed the attacker
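The breakthrough came from changing the sorting metric. Instead of aggregating the access log by request count, the operator sorted entries by response time, using Apache's %D format specifier, which records the time taken to serve the request in microseconds. Because that duration covers the whole request, including the time spent delivering the response, a client that reads slowly shows up as a long-running entry even when it sends very few requests.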
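The difference between the two rankings can be sketched in a few lines of Python. This is a minimal illustration, not the operator's tooling: it assumes a log format in which %D is appended as the last field of each line, and the sample lines, IP addresses, and function names are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape: client IP first, %D (microseconds) as the final field.
# This LogFormat is an assumption for the sketch, not the operator's config.
LINE_RE = re.compile(r'^(?P<ip>\S+) .* (?P<micros>\d+)$')


def summarize(lines):
    """Return {ip: (request_count, total_seconds, slowest_seconds)}."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        seconds = int(match.group("micros")) / 1_000_000
        entry = stats[match.group("ip")]
        entry[0] += 1
        entry[1] += seconds
        entry[2] = max(entry[2], seconds)
    return {ip: tuple(values) for ip, values in stats.items()}


def rank_by_count(stats):
    """IPs ordered by number of requests, busiest first."""
    return sorted(stats, key=lambda ip: stats[ip][0], reverse=True)


def rank_by_time(stats):
    """IPs ordered by total time spent serving them, largest first."""
    return sorted(stats, key=lambda ip: stats[ip][1], reverse=True)


# Hypothetical sample: three fast requests from one IP, one very slow
# request (1,020 seconds, i.e. 17 minutes) from another.
SAMPLE = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 120000',
    '203.0.113.5 - - [01/Jan/2025:10:00:01 +0000] "GET /b HTTP/1.1" 200 734 80000',
    '203.0.113.5 - - [01/Jan/2025:10:00:02 +0000] "GET /c HTTP/1.1" 200 291 100000',
    '198.51.100.7 - - [01/Jan/2025:10:00:05 +0000] "GET /catalog HTTP/1.1" 200 48213 1020000000',
]

stats = summarize(SAMPLE)
print("by request count:", rank_by_count(stats))
print("by time spent:   ", rank_by_time(stats))
```

On this sample the count ranking puts the chatty but harmless client first, while the time ranking puts the single slow request first, which is the inversion the operator relied on.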
This immediately surfaced the problem. The log showed 159 requests had each taken over 15 seconds to complete. One request from a single IP address held an Apache worker process hostage for 17 minutes. To find a stall, sort by time spent, not by request count. This single IP, belonging to a commercial scraping service, was responsible for consuming all available workers by initiating connections and then reading the responses at an excruciatingly slow pace. The request rate was low, just 0.45 requests per second, which slid under existing rate-limiting rules.
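Why such a low request rate can still exhaust a worker pool follows from Little's law: the average number of busy workers equals the arrival rate multiplied by the average time each request holds a worker. The sketch below works through that arithmetic. The 0.45 requests per second comes from the post-mortem; the pool size of 150 workers is a hypothetical value chosen for illustration, since the source does not state the server's MaxRequestWorkers setting.

```python
def busy_workers(requests_per_second, avg_hold_seconds):
    """Little's law: average workers occupied at a given rate and hold time."""
    return requests_per_second * avg_hold_seconds


def hold_time_to_saturate(requests_per_second, max_workers):
    """Average hold time (seconds) at which the whole pool is occupied."""
    return max_workers / requests_per_second


RATE = 0.45        # requests per second, as reported in the post-mortem
MAX_WORKERS = 150  # hypothetical pool size, not taken from the source

# A well-behaved client whose requests finish in 0.2 s barely registers.
print(f"fast client: {busy_workers(RATE, 0.2):.2f} workers busy")

# The same rate saturates the pool once each request is held long enough.
threshold = hold_time_to_saturate(RATE, MAX_WORKERS)
print(f"pool of {MAX_WORKERS} saturates at {threshold:.0f} s average hold time")

# If every request were held for 17 minutes (1,020 s), demand would far
# exceed the hypothetical pool.
print(f"17-minute holds: {busy_workers(RATE, 1020):.0f} workers needed")
```

Under these assumptions an average hold time of about 333 seconds is enough to occupy every worker, which is why a per-IP request-rate limit alone does not catch this pattern.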
What we'd change
The core vulnerability was architectural. A shared-everything server creates a single point of failure where one tenant's problem, or one targeted attack, can cause cascading failure for all others. Modern containerized or serverless architectures are designed to mitigate this "noisy neighbor" problem by isolating resources and enabling horizontal scaling. This playbook is most relevant for operators of monolithic, resource-constrained systems.
The logging failure represents a significant operational gap. An untested logging pipeline is a non-existent one. Critical monitoring and alerting infrastructure should be subject to regular failure testing, just like any other part of the application. An operator should be able to trigger a test alert and verify it propagates through the entire system as expected.
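An end-to-end check of that kind might look like the following sketch: send a uniquely tagged error through syslog, then confirm that the tag appears in the file the pipeline is supposed to write. It assumes a Linux host where the local syslog socket is /dev/log; the function names and logger name are hypothetical.

```python
import logging
import logging.handlers
import time
import uuid


def emit_canary(address="/dev/log"):
    """Send a uniquely tagged error through syslog and return the tag.

    The socket path is an assumption (the usual Linux default).
    """
    marker = f"log-canary-{uuid.uuid4().hex}"
    handler = logging.handlers.SysLogHandler(address=address)
    logger = logging.getLogger("log_canary")
    logger.addHandler(handler)
    try:
        logger.error("pipeline check %s", marker)
    finally:
        logger.removeHandler(handler)
        handler.close()
    return marker


def marker_arrived(log_path, marker, timeout=10.0, poll_interval=0.5):
    """Poll log_path until a line contains marker or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(log_path, encoding="utf-8", errors="replace") as handle:
                if any(marker in line for line in handle):
                    return True
        except FileNotFoundError:
            pass  # the file may not exist yet; keep polling until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A scheduled job could call emit_canary(), pass the returned tag to marker_arrived() along with the path of the destination log file, and raise an alert when the result is False. A faithful check would also need to send the message with the same syslog tag and facility that Apache uses, since the failure in this incident was a filter mismatch and not a dead syslog daemon.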
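Finally, the server's defenses were insufficient for this type of attack. Rate limiting based on connection frequency is a common but incomplete solution. A more robust defense sets deadlines on the data transfer itself, not just on connection setup. Apache's mod_reqtimeout module does this for the request side: it sets deadlines and minimum data rates for how long a client can take to send headers and the request body, which addresses slow-send attacks of the Slowloris type. It does not limit how slowly a client reads the response. On the response side, Apache's core Timeout directive bounds how long each blocked write may wait, but a client that keeps accepting a trickle of data can stay within that limit. One common mitigation is to place a buffering reverse proxy in front of Apache, so that the proxy absorbs the slow client while the scarce worker processes are released quickly.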
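The idea of a transfer deadline tied to a minimum data rate can be sketched as plain arithmetic. The model below follows the behavior documented for a mod_reqtimeout setting such as header=20-40,MinRate=500, which applies to data the client sends: the client starts with an initial allowance, earns one extra second for every MinRate bytes received, and is capped at a maximum. It is an illustration of the rule, not Apache's implementation, and the function names are hypothetical.

```python
def allowed_seconds(bytes_received, initial=20.0, maximum=40.0, min_rate=500):
    """Time budget for a transfer under a minimum-rate rule.

    Starts at `initial` seconds, grows by one second per `min_rate` bytes
    received, and never exceeds `maximum` seconds.
    """
    extended = initial + bytes_received / min_rate
    return min(extended, maximum)


def should_drop(elapsed_seconds, bytes_received, **limits):
    """True when the client has used more time than its budget allows."""
    return elapsed_seconds > allowed_seconds(bytes_received, **limits)


# A client that has sent only 100 bytes after 25 seconds is over budget.
print(should_drop(25, 100))

# A client that has sent 5,000 bytes in the same time has earned enough
# extra seconds to stay connected.
print(should_drop(25, 5000))
```

The design choice worth noting is the cap: without a maximum, a client could extend its deadline indefinitely by sending data at just above the minimum rate.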
The takeaway
The incident demonstrates that the most effective attacks are not always the loudest. A high-volume DDoS attack is obvious, but a low-and-slow resource exhaustion attack can be invisible to dashboards focused on CPU, memory, and request counts. The primary lesson is a diagnostic one. When an application is stalled but system metrics are normal, the bottleneck is likely a finite resource pool like connections, workers, or threads. Analyzing logs by time spent is the fastest way to identify the outliers consuming those resources.
The investor read
This incident highlights a specific operational risk inherent in legacy or resource-constrained technical architectures, common in bootstrapped and small-business SaaS. While a VC-backed company would likely use a cloud-native architecture that mitigates this 'noisy neighbor' risk, the story is a valuable proxy for a team's operational maturity. For investors conducting due diligence, it provides a template for questions that go beyond uptime metrics. Asking a founding team to describe a non-obvious outage and the subsequent changes to their monitoring can reveal more about their engineering discipline than a simple architectural diagram. It signals whether a team is prepared for sophisticated, non-obvious threats or just the most common failure modes.