Tools·Jun 10, 2026

Claude Code's 24-hour unsupervised run shows promise, but also hallucinations


By Riley · Tools desk · Human-reviewed · Verified Jun 10, 2026

An experiment with Claude Code running unsupervised for 24 hours on a Python project reveals its capabilities in refactoring and bug fixing, alongside critical limitations like hallucination and the need for strict guardrails.

The Answer Up Front

For developers evaluating autonomous coding agents on Python projects, this experiment with Claude Code (claude-sonnet-4-5) is a useful data point. The agent refactored code cleanly and fixed a rate-limiting bug, adding unit tests for it without being asked. It is not a 'fire and forget' solution, though: in at least one reported case it hallucinated a constraint that did not exist, and 3 of the 15 tasks were reportedly done incorrectly. The bottom line: tightly scoped and reviewed, Claude Code can augment a development workflow, but it does not yet replace human oversight.

Methodology

This v0 review draws on the founder's published claims at https://dev.to/numbpill3d/i-let-claude-code-run-unsupervised-for-24-hours-heres-what-happened-179a; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior.

The review covers the performance of claude-sonnet-4-5 as an unsupervised coding agent, specifically its behavior over a 24-hour period on a Python-based recon automation tool. The founder, numbpill3d, documented the setup, the agent's successes in task completion, its points of failure (blocked tasks), and instances of incorrect output. Key technical details, such as the use of OpenClaw for session management, the CLAUDE.md instruction file, and scoped tool permissions, are included. The experiment was conducted on a headless Ubuntu VPS within a tmux session, with claude-sonnet-4-5 configured for a max of 8192 tokens per call. The agent was given 15 tasks.

This review does not cover independent performance benchmarks, the long-term workflow implications of integrating such an agent, or exhaustive testing of edge cases beyond those encountered in the founder's specific project. The focus remains on the specific behaviors observed and reported by the founder in this single, controlled experiment.

What It Does

This experiment deployed Claude Code, specifically the claude-sonnet-4-5 model, as an autonomous agent tasked with improving an existing Python codebase. The setup was designed for minimal human intervention, allowing the agent to operate unsupervised for 24 hours.

Unsupervised Agent Setup

The agent ran within a tmux session on a headless Ubuntu VPS. OpenClaw managed the persistent session, ensuring continuity despite potential connection drops. Tool permissions were strictly scoped: file read/write was limited to the project directory, Bash execution to the virtual environment, and network access only to localhost. A CLAUDE.md file at the project root served as the primary instruction set, defining task priority, off-limits directories, expected output format, and a critical rule: if a decision involved more than two plausible outcomes, the agent was to stop and write a BLOCKED.md file detailing the ambiguity.
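
The stop rule lives in CLAUDE.md as a natural-language instruction, and the source does not show its wording. Purely as an illustration, the same decision rule can be written as a few lines of Python. The Decision class, its fields, the record_if_blocked helper, and the layout of the BLOCKED.md entry are all hypothetical names and choices made for this sketch, not anything taken from the founder's setup.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

# Threshold from the rule described above: more than two plausible
# outcomes means stop and ask.
MAX_PLAUSIBLE_OUTCOMES = 2


@dataclass
class Decision:
    """Hypothetical record of a choice the agent faces while working a task."""
    task: str
    question: str
    options: List[str]


def record_if_blocked(decision: Decision, project_root: Path) -> bool:
    """Append an entry to BLOCKED.md and return True when the rule says stop."""
    if len(decision.options) <= MAX_PLAUSIBLE_OUTCOMES:
        return False  # few enough outcomes: the agent may proceed
    entry = [f"## {decision.task}", "", f"Ambiguity: {decision.question}", ""]
    entry += [f"- {option}" for option in decision.options]
    # Append so that earlier blocked tasks are preserved (an assumption
    # about how the file accumulates entries).
    with (project_root / "BLOCKED.md").open("a", encoding="utf-8") as handle:
        handle.write("\n".join(entry) + "\n\n")
    return True
```

The point of the sketch is the shape of the guardrail: the threshold is explicit, and a block always leaves a written record that a human can audit later.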

Task Execution and Outcomes

The agent was given a list of 15 issues, ranging from minor refactors to a complex bug in the rate-limiting logic. Within six hours, it had completed 9 of the 15. The founder reports that the refactors were clean, that variable naming matched the existing conventions (which the agent inferred on its own), and that an inconsistent output-formatting issue was resolved with minimal code changes. Of the remaining six tasks, the agent stopped on three and wrote BLOCKED.md entries for them, and it reportedly got three wrong; the source does not fully detail those incorrect results.
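
As a quick sanity check on those figures, the reported outcomes can be tallied in a few lines. The counts come from the source; the category labels are just names chosen for this sketch.

```python
# Outcome counts as reported in the founder's write-up.
outcomes = {"completed": 9, "blocked": 3, "incorrect": 3}
total_tasks = 15

# Every task should fall into exactly one category.
assert sum(outcomes.values()) == total_tasks

shares = {name: count / total_tasks for name, count in outcomes.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

That works out to 60% completed, 20% blocked, and 20% incorrect, so four in ten tasks still needed a human.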

What's Interesting / What's Not

The most compelling outcome is that Claude Code not only fixed a complex bug but also added regression protection without being asked. According to the founder, the agent correctly identified the root cause of the rate-limiting bug (stale timestamp recalculation), implemented a fix, and then added three targeted unit tests covering the bug's specific edge cases. The tests reportedly passed and would have caught the original issue. That goes beyond simple task completion into proactive quality assurance.
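
The source does not publish the buggy code, the fix, or the agent's tests. To make the bug class concrete, here is a hypothetical sketch of a limiter where a stale timestamp would matter, with tests in the same spirit. MinIntervalLimiter, FakeClock, and the test names are all invented for illustration and should not be read as the founder's code.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Hypothetical limiter: at most one call per `interval` seconds."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval = interval
        self.clock = clock
        self.last_call: Optional[float] = None

    def record_call(self) -> None:
        self.last_call = self.clock()

    def wait_time(self) -> float:
        """Seconds the caller must still wait before the next call."""
        if self.last_call is None:
            return 0.0
        # Re-read the clock on every check. A stale-timestamp bug would
        # reuse a "now" captured earlier, so elapsed time never grows and
        # the limiter keeps reporting the same wait.
        elapsed = self.clock() - self.last_call
        return max(0.0, self.interval - elapsed)


class FakeClock:
    """Manually advanced clock so the tests never sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def test_first_call_is_not_delayed() -> None:
    assert MinIntervalLimiter(1.0, FakeClock()).wait_time() == 0.0


def test_wait_shrinks_as_time_passes() -> None:
    clock = FakeClock()
    limiter = MinIntervalLimiter(1.0, clock)
    limiter.record_call()
    clock.advance(0.25)
    assert limiter.wait_time() == 0.75


def test_wait_is_zero_once_interval_has_elapsed() -> None:
    clock = FakeClock()
    limiter = MinIntervalLimiter(1.0, clock)
    limiter.record_call()
    clock.advance(2.0)
    assert limiter.wait_time() == 0.0
```

Injecting the clock is the design choice that makes edge cases like these testable at all; with a hard-coded time call, the second test would need a real sleep.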

The agent's ability to infer conventions was also notable. It kept variable naming consistent with the existing code, which suggests it can pick up and follow a project's style rather than impose its own. It also correctly flagged a genuinely ambiguous task involving config loading logic and produced a well-described BLOCKED.md entry, evidence that the CLAUDE.md guardrail can work as intended.

The significant concern is hallucination. In one instance, Claude Code blocked itself on a dependency update task, citing a version constraint that did not exist in requirements.txt. This points to a critical failure mode: when uncertain, the agent may manufacture a reason for its uncertainty rather than admit it lacks the information. Any agent-generated BLOCKED.md entry or decision therefore needs human review against the actual files.
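
A claim like "requirements.txt pins this package" is cheap to verify mechanically before anyone acts on it. The sketch below is one way a reviewer might do that, under stated assumptions: parse_requirements and constraint_exists are hypothetical helpers, the regex handles only simple name-plus-specifier lines rather than the full requirements syntax, and the package names in the usage example are made up.

```python
import re
from typing import Dict

# Simplified pattern: a package name, optional [extras], then the raw
# specifier text up to an environment marker (;) or a comment (#).
_REQ_LINE = re.compile(
    r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[[^\]]*\])?\s*([^;#]*)"
)


def _normalize(name: str) -> str:
    """Lowercase and collapse -, _ and . so that names compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_requirements(text: str) -> Dict[str, str]:
    """Map normalized package names to their raw specifier strings."""
    specs: Dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "-")):
            continue  # skip blanks, comments, and pip options like -r / -e
        match = _REQ_LINE.match(stripped)
        if match:
            specs[_normalize(match.group(1))] = match.group(2).replace(" ", "")
    return specs


def constraint_exists(text: str, package: str, claimed: str) -> bool:
    """True only if `claimed` (e.g. '<3') is among the package's specifiers."""
    spec = parse_requirements(text).get(_normalize(package), "")
    parts = [part for part in spec.split(",") if part]
    return claimed.replace(" ", "") in parts


# Usage with made-up file contents:
example = "requests>=2.31,<3\nflask\n"
print(constraint_exists(example, "requests", "<3"))   # constraint is present
print(constraint_exists(example, "flask", "<2.0"))    # no such constraint
```

A check like this does not explain why the agent invented the constraint, but it turns "trust the BLOCKED.md entry" into "confirm the entry against the file", which is the kind of review the experiment calls for.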

The investor read

This experiment highlights the continuing trend towards autonomous agents in software development, signaling a potential shift in tooling spend from human-centric IDEs to agent orchestration platforms. The ability of Claude Code to perform complex bug fixes and generate unit tests proactively suggests a future where AI augments, and in some cases replaces, junior development tasks. The hallucination issue, however, underscores the current limitations and the need for robust verification layers, creating opportunities for companies building agent-monitoring or output-validation tools. For an investor, a company that can reliably mitigate these failure modes or provide a 'human-in-the-loop' framework for autonomous agents would be highly attractive. This also suggests that smaller, bootstrapped plays focused on specific, well-defined coding tasks might find a niche, as the broader 'generalist' autonomous agent market still requires significant R&D to overcome reliability hurdles.

Sources · how we verified

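numbpill3d, "I Let Claude Code Run Unsupervised for 24 Hours. Here's What Happened." dev.to. https://dev.to/numbpill3d/i-let-claude-code-run-unsupervised-for-24-hours-heres-what-happened-179a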


Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.

