Tactics·Jun 10, 2026

Three-Stage Pipeline Powers Mano-P to OSWorld #1 Rank

By Maya · Tactics desk·Human-reviewed·✓ Verified Jun 10, 2026·4 min read·1 source

Mano-P, a GUI-VLA agent, achieved top performance on OSWorld using a three-stage training pipeline. This approach addresses cold start and exploration challenges for AI models interacting with user interfaces.

Mano-P, a Vision-Language-Action (VLA) agent, claims the #1 rank on the OSWorld specialized benchmark with a 58.2% success rate. The model, designed to operate on edge devices, reportedly decodes at 80 tokens per second on an M5 Pro chip. The developers attribute this performance to a sequential three-stage training pipeline of supervised fine-tuning, offline RL, and online RL, which they argue is critical for building competent GUI-VLA agents.

Supervised Fine-Tuning: Learning Basics

The initial stage, Supervised Fine-Tuning (SFT), addresses the cold start problem inherent in training models for complex decision spaces. Before a model can learn from rewards, it requires foundational competence to generate meaningful interaction trajectories. The Mano-P team used human-annotated GUI interaction traces, structured as sequences of (screenshot, thought, action) tuples. The model learns to map visual observations to grounded actions, such as where to click or what to type.
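As a rough illustration of that data format, the sketch below turns one annotated trace into supervised training pairs. The source describes only the (screenshot, thought, action) tuple; the field names, the file paths, the action syntax, and the `to_sft_examples` helper are all assumptions made for this example, not Mano-P's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One annotated step: what the agent saw, reasoned, and did."""
    screenshot: str   # path to the screenshot file (hypothetical layout)
    thought: str      # annotator's short rationale
    action: str       # grounded action as text, e.g. "click(412, 88)"


def to_sft_examples(instruction: str, trace: List[Step]) -> List[Tuple[dict, str]]:
    """Turn one trace into (input, target) pairs for supervised fine-tuning.

    Each input carries the task, the current screenshot, and the actions
    taken so far; the target is the thought followed by the action.
    """
    examples = []
    history: List[str] = []
    for step in trace:
        model_input = {
            "instruction": instruction,
            "screenshot": step.screenshot,
            "history": list(history),  # copy so later steps don't mutate it
        }
        target = f"Thought: {step.thought}\nAction: {step.action}"
        examples.append((model_input, target))
        history.append(step.action)
    return examples


# Hypothetical two-step trace for a search task.
trace = [
    Step("shots/0.png", "The search box is at the top.", "click(412, 88)"),
    Step("shots/1.png", "The box is focused, so type the query.", 'type("invoices")'),
]
examples = to_sft_examples("Search for invoices", trace)
```

Each step becomes its own training example, so a single annotated trace of N steps yields N input/target pairs.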

SFT focuses on establishing visual grounding, action vocabulary, and task decomposition. Visual grounding involves identifying UI elements from pixels without relying on underlying DOM trees. Action vocabulary covers primitive actions and their screen coordinate mappings. Task decomposition teaches the model to break high-level instructions into atomic interactions. This stage yields a model capable of executing familiar GUI patterns reliably, though it remains brittle when encountering novel interfaces or multi-step tasks with branching logic. The result is a capable but inflexible base.
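To make "action vocabulary" and "screen coordinate mappings" concrete, here is a minimal sketch of parsing a textual action and mapping normalized coordinates to pixels. The primitive names, the `name(args)` syntax, and the use of normalized [0, 1] coordinates are assumptions for illustration; the source does not list Mano-P's actual action set or coordinate convention.

```python
import re
from typing import Tuple

# Hypothetical primitive set; the source does not list Mano-P's actual actions.
PRIMITIVES = {"click", "double_click", "type", "scroll", "press"}

_ACTION_RE = re.compile(r"^(\w+)\((.*)\)$")


def parse_action(text: str) -> Tuple[str, str]:
    """Split 'name(args)' into (name, raw_args) and validate the name."""
    match = _ACTION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an action: {text!r}")
    name, raw_args = match.group(1), match.group(2)
    if name not in PRIMITIVES:
        raise ValueError(f"unknown primitive: {name}")
    return name, raw_args


def to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> Tuple[int, int]:
    """Map normalized [0, 1] coordinates onto a screen of the given size."""
    if not (0.0 <= x_norm <= 1.0 and 0.0 <= y_norm <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    # Clamp so that 1.0 maps to the last valid pixel, not one past it.
    x = min(int(x_norm * width), width - 1)
    y = min(int(y_norm * height), height - 1)
    return x, y


name, raw_args = parse_action("click(0.25, 0.5)")
x_norm, y_norm = (float(part) for part in raw_args.split(","))
pixel = to_pixels(x_norm, y_norm, width=1920, height=1080)
```

Normalized coordinates are one way to keep the same model output valid across screen resolutions; a pipeline that predicts raw pixels would skip the `to_pixels` step.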

Offline RL: Refining Decisions from History

Transitioning directly from SFT to online reinforcement learning (RL) is problematic because the initial model is not proficient enough for productive exploration. A freshly SFT'd model deployed in a live environment tends to produce catastrophic failures, leading to sparse and uninformative reward signals. Offline RL bridges this gap by training the model on a large dataset of pre-collected trajectories, encompassing both successful and unsuccessful interactions.

This dataset includes expert demonstrations, the model's own previous rollouts, and trajectories from earlier model checkpoints. The core principle of offline RL is to extract an improved policy from suboptimal data. Even failed trajectories contain useful information: 'clicking here led to a dead end' is a valuable signal. The model learns to favor actions historically associated with task completion and to avoid those linked to failure, refining its decision-making without the instability of live interaction.
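The source does not say which offline RL algorithm Mano-P uses. As one simple stand-in, the sketch below uses success-weighted imitation: each logged (state, action) pair gets a weight based on how often it appeared in successful trajectories compared with the dataset's overall success rate. The `step_weights` function, the state and action labels, and the `beta` temperature are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Each trajectory: (list of (state_id, action) pairs, success flag).
Trajectory = Tuple[List[Tuple[str, str]], bool]


def step_weights(dataset: List[Trajectory], beta: float = 1.0) -> Dict[Tuple[str, str], float]:
    """Weight each (state, action) pair by exp((success_rate - baseline) / beta).

    Pairs seen mostly in successful trajectories get weight above 1; pairs
    seen mostly in failures get weight below 1. A weighted imitation loss
    would then multiply each step's log-likelihood by this weight.
    """
    counts: Dict[Tuple[str, str], List[int]] = {}
    for steps, success in dataset:
        for pair in steps:
            wins_total = counts.setdefault(pair, [0, 0])
            wins_total[0] += int(success)
            wins_total[1] += 1

    # Baseline: overall success rate across trajectories.
    baseline = sum(int(success) for _, success in dataset) / len(dataset)
    return {
        pair: math.exp((wins / total - baseline) / beta)
        for pair, (wins, total) in counts.items()
    }


# Toy logged data: one success and two failures (hypothetical labels).
dataset = [
    ([("home", "click_search"), ("search", "type_query")], True),
    ([("home", "click_ads")], False),
    ([("home", "click_search"), ("search", "click_back")], False),
]
weights = step_weights(dataset)
```

In this toy data the dead-end action `click_ads` ends up down-weighted while `type_query` is up-weighted, which is the "failed trajectories still carry signal" idea in miniature. Offline RL methods used in practice are more involved, typically relying on learned value estimates and constraints that keep the policy close to the logged data.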

What We'd Change

The described pipeline provides a logical progression for training GUI-VLA agents, but the source text is incomplete, cutting off before detailing the crucial Online RL stage. This omission leaves a gap in understanding the full refinement process, particularly how the model handles real-world exploration and adaptation beyond historical data. Without the specifics of Online RL, the claim that all three stages are needed cannot be assessed from this account alone.

Furthermore, the reliance on "thousands of expert demonstrations" for SFT implies a significant data collection burden. For founders without access to extensive human annotation resources, this initial stage could be a substantial bottleneck. The post does not specify the cost or time investment required for this data collection, which is a critical factor for early-stage teams.

The reported performance benchmarks (OSWorld #1, 80 tok/s) are presented as claims within the blog post, lacking direct links to public dashboards or research papers for independent verification. Future iterations of such a playbook would benefit from transparent access to these metrics.

The Mano-P training pipeline illustrates a structured approach to building complex AI agents. By segmenting the learning process into distinct stages, each addressing specific challenges, the developers claim to mitigate common pitfalls like the cold start problem and unproductive exploration. This sequential refinement, moving from imitation to historical learning, offers a blueprint for developing robust models capable of navigating intricate digital environments.

The investor read

The reported performance of Mano-P, particularly its #1 ranking on the OSWorld benchmark and 80 tok/s on edge devices, signals increasing viability for GUI-VLA agents. This category, focused on AI interacting directly with user interfaces, represents a significant market opportunity for automating complex digital workflows. The emphasis on edge device deployment suggests a potential for lower operational costs and enhanced privacy compared to cloud-based solutions, which could attract enterprise adoption. Investors should note the high data requirements for the SFT stage, indicating that data acquisition and labeling remain critical bottlenecks and potential areas for investment in tooling or specialized services. The sequential training approach offers a de-risking strategy for model development, making such projects more predictable.

Sources · how we verified

SFT Offline RL Online RL: The Three-Stage Training Pipeline Behind Mano-P

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place. How we work · About · File a correction.
