# Foreword
The capability curve is steep; the deployment curve is not. In October 2024 Anthropic shipped Claude 3.5 Sonnet's computer-use beta with an OSWorld score of 14.9 percent. In April 2026 OpenAI shipped GPT-5.4 with an OSWorld-Verified score of 75.0 percent — surpassing the 72.4 percent human baseline. That is a 5x improvement in eighteen months on the canonical desktop-agent benchmark. It is also the kind of trajectory that, if it were a conventional software metric, would justify wholesale procurement of computer-use agents across enterprise IT by Q4 2026.
That is not what happened. Production deployments lag the benchmarks by twelve to eighteen months, and the gap is widening on the metrics that matter most for shipping — per-task cost economics, end-to-end reliability on multi-step workflows, latency under realistic operating conditions, and the security posture of an agent that perceives an entire rendered desktop as trusted instruction context. This paper is the operational walkthrough for the platform engineer asked "should we ship a computer-use agent in 2026?" The honest answer is "yes, narrowly, with these defenses, for these task types, on this cost curve." The rest of the paper documents what each of those qualifiers means.
This paper sits at the front of the perea.ai canon's "agent deployment" subseries, alongside the MCP Server Playbook, the Agent Observability Stack, the Indirect Prompt Injection Defense paper, and Agent Memory in Production. Computer-use is unique among the operational layers covered in those papers because it changes the threat model: every pixel on the rendered screen is a potential injection vector, every observation-to-action gap is an exploit window, and the visual layer collapses the data/instruction boundary in ways that text-based defenses cannot recover.
# Executive Summary
OSWorld went from 12 percent to 75 percent in eighteen months. Anthropic's Claude 3.5 Sonnet (October 2024) opened the category at 14.9 percent screenshot-only, or 22.0 percent with more allowed steps. OpenAI's Computer-Using Agent in Operator (January 2025) hit 38.1 percent. Agent S3 from Simular AI (2025) reached 62.6 percent in a 100-step setting in community follow-up work. GPT-5.4 (April 2026) shipped with native computer-use at 75.0 percent on OSWorld-Verified — past the 72.4 percent human reference. The benchmark trajectory is one of the steepest in agent research history.
Production reliability still hovers around 58 percent for general workflows. Independent testing of Claude 3.5 Sonnet's computer-use beta on 50 tasks found 80 percent completion on web browsing, 70 percent on form filling, 60 percent on file management, 50 percent on application interaction, and 30 percent on multi-step workflows. Eighteen months of model improvement raised the OSWorld score 5x, but the multi-step ceiling has not moved that fast — multi-application workflow tasks were the hardest category in the original OSWorld paper at 6.57 percent, and they remain the hardest today.
Cost-per-task economics still break for many use cases. Computer-use is expensive because every action requires a screenshot, image-token processing, and another model call. A simple web browse averages 4 screenshots and $0.14 per task; a complex multi-step workflow averages 12+ screenshots and $0.50+ per task. At 30 percent completion on complex tasks, the effective cost per successful complex task exceeds $1.67. The decision matrix is unforgiving: API integration delivers 99 percent+ reliability at $0.001-0.01 per task; traditional RPA delivers 90 percent+ at $0.01-0.05; computer use delivers ~58 percent at $0.14-0.50; a human virtual assistant delivers 95 percent+ at $3-10. Computer use lives in a specific niche — tasks where no API exists and per-success value clears the $1.67 hurdle.
Latency is the silent killer. OSWorld-Human, the first temporal-performance study of computer-use agents, found that LLM calls for planning and reflection account for most of the overall latency, and that each successive step in a task takes approximately three times longer than steps at the start of the task. Leading agents take 1.4 to 2.7 times more steps than necessary compared to a human reference trajectory. The best-scoring agent on OSWorld at 42.5 percent raw success drops to 17.4 percent on a strict efficiency metric. End-to-end latencies of "tens of minutes" for tasks that take humans a few minutes are typical, not exceptional.
Computer-use opens a novel attack surface. A new vulnerability class formalized in 2026 — Visual Atomicity Violation (VAV) — exploits the mean 6.51-second gap between when an agent observes the screen and when it dispatches an action. An unprivileged attacker on the same desktop session can manipulate UI state during that gap, redirecting clicks to malicious targets. Three demonstrated attack primitives reach 100 percent action-redirection on Win Focus Manipulation alone. Adversarial pop-up windows, designed to look legitimate to the agent while being immediately dismissible by a human, achieve 86 percent attack success rate across OSWorld and VisualWebArena, reducing agent task completion by 47 percent.
Architectural defenses are landing. Pre-execution UI State Verification (PUSV) — a three-layer middleware that re-verifies UI state immediately before action dispatch via SSIM at the click target, global screenshot diff, and X Window snapshot diff — achieves 100 percent Action Interception Rate against OS-level attacks with under 0.1 second overhead. PUSV exhibits near 0 percent AIR against pure DOM injection attacks, exposing a fundamental blind spot in screenshot-only defenses. CaMeLs Can Use Computers Too (Foerster et al., Cambridge + ETH Zürich, January 2026) extends the CaMeL pattern documented in the Indirect Prompt Injection Defense paper to the screen-perception case, retaining 57 percent of frontier-model task performance while providing provable control-flow integrity against instruction injection.
Production wins are narrow but real. The deployable pattern in 2026 is Stagehand (open-source SDK with 22,000 GitHub stars and 700,000+ weekly downloads, built on four primitives: `act`, `extract`, `observe`, and `agent`) plus Browserbase (managed cloud browsers with stealth, captcha solving, and session replay) plus a Vercel Function with cron scheduling. Coding agents in sandboxed environments, browser agents for read-mostly workloads, form filling, and spreadsheet automation are working today. Multi-app workflows, sensitive state-changing operations, and sustained autonomous operation past 30 minutes are not.
The rest of the paper expands these findings. Read Part I for the capability curve, Part II for why production lags, Part III for the failure modes specific to screen-perceiving agents, Part IV for the defense stack, Part V for what works today, Part VI for the 90-day shipping playbook.
# Part I: The Capability Curve, 2024-2026
The history is short and dense. In October 2024 Anthropic launched Claude 3.5 Sonnet's computer-use beta — the first frontier model with public-beta computer use, available on Anthropic API, Amazon Bedrock, and Google Vertex AI. The mechanism is simple: the agent looks at a screenshot, counts the vertical and horizontal pixels to figure out where to move the cursor, issues mouse and keyboard commands, and observes the resulting screen. On OSWorld — the canonical benchmark Tianbao Xie and colleagues introduced at NeurIPS 2024 — Claude 3.5 Sonnet scored 14.9 percent in the screenshot-only category, "almost double the 7.7 percent acquired by the next best AI system." With more allowed steps the score rose to 22.0 percent. Anthropic was explicit in the launch communication: the capability was "experimental, at times cumbersome and error-prone," and developers were encouraged to begin with low-risk tasks.
In January 2025 OpenAI introduced Computer-Using Agent (CUA), the model behind Operator. CUA combined GPT-4o vision with reinforcement-learning-based reasoning, trained to interact with graphical user interfaces — buttons, menus, text fields — through the same universal interface of screen, mouse, and keyboard. CUA delivered 38.1 percent on OSWorld, 58.1 percent on WebArena, and 87.0 percent on WebVoyager (where most tasks are relatively simple). Operator launched as a research preview at operator.chatgpt.com to Pro Tier US users.
In July 2025 OpenAI shipped ChatGPT Agent, unifying Operator's browser interaction with deep-research's synthesis ability and ChatGPT's conversational fluency. ChatGPT Agent equipped the model with a visual browser, a text-based browser for reasoning-heavy queries, a terminal, direct API access, and ChatGPT connectors (Gmail, GitHub, etc.). The benchmark numbers were strong across the board: a new state-of-the-art on Humanity's Last Exam at 41.6 pass@1 (44.4 with parallel rollout), 27.4 percent on FrontierMath with terminal access, 45.5 percent on SpreadsheetBench (against Copilot in Excel's 20.0 percent), and a new SOTA on BrowseComp at 68.9 percent — 17.4 percentage points above deep research alone. WebArena performance improved over o3-powered CUA. The Operator preview was sunset and ChatGPT Agent absorbed its capabilities.
In October 2025 Anthropic shipped Claude for Chrome (a browser extension with research-preview restrictions) and OpenAI shipped ChatGPT Atlas — a full browser product built around ChatGPT, with browser memories, agent mode, and explicit constraints (no code execution, no file downloads, no extensions, no other-app access, no filesystem access, financial-site auto-pause). Atlas brought computer-use into a consumer product surface for the first time. The same week, Anthropic published an extended writeup on browser-use prompt-injection mitigations for Claude Opus 4.5, reporting approximately 1 percent attack success rate against an internal best-of-N adaptive attacker.
In April 2026 OpenAI shipped GPT-5.4 as the first general-purpose model with native, state-of-the-art computer-use capabilities. The headline numbers reset the discussion. OSWorld-Verified: 75.0 percent, beating human performance at 72.4 percent. WebArena-Verified: 67.3 percent with combined DOM and screenshot interaction. Online-Mind2Web: 92.8 percent on screenshot-only observations, against ChatGPT Atlas Agent Mode at 70.9 percent. GPT-5.4 supports 1M tokens of context, includes tool search, and is the most token-efficient reasoning model OpenAI has shipped. Custom confirmation policies allow developers to tune safety behavior to their risk tolerance.
The community trajectory on OSWorld between the original NeurIPS 2024 baseline of 12.24 percent and GPT-5.4 in 2026 is the more interesting story. Agent S (GPT-4o backbone with hierarchical memory) reached 20.58 percent. RL-based ARPO pushed to 29.9 percent. Agent S3 from Simular AI in 2025 claimed 62.6 percent in the 100-step setting, approaching human parity through a combination of better grounding models and reinforcement-learning fine-tuning. Successor benchmarks emerged: AndroidWorld extends OSWorld to mobile devices, WindowsAgentArena (ICLR 2025) adapts the framework to Windows with 150+ tasks, Agent S2 introduces a compositional generalist-specialist architecture. The ecosystem is no longer reliant on a single benchmark or a single lab.
Two patterns matter from this history. First, the capability gains came from grounding models and RL fine-tuning, not from the base prompted LLMs OSWorld originally tested. Teams that hope to ride the curve cannot just upgrade their model; the gains live in the training pipeline and the supporting architecture. Second, the multi-application workflow ceiling has barely moved. OSWorld's original multi-app workflow category capped at 6.57 percent in 2024, and the post-paper trajectory papers continue to flag this as the hardest category. The 75.0 percent OSWorld-Verified score from GPT-5.4 averages across categories; the long tail is still where production ships.
# Part II: Why Production Lags the Benchmarks
The first reliability number worth holding in mind is 58 percent. That is the completion rate Dataku measured for Claude 3.5 Sonnet's computer-use beta across 50 tasks in the launch week of October 2024. Eighteen months later, the corresponding number for GPT-5.4 on similar real-world workloads is higher but not transformatively so for multi-step, multi-application workflows. The benchmark numbers (OSWorld 75.0 percent, OSWorld-Verified) measure constrained tasks under controlled conditions; the production number measures the messy reality of agents handling real user requests, real third-party websites, and real edge cases.
The cost arithmetic is the second binding constraint. Computer-use costs add up because every action requires a screenshot — image tokens at substantial cost — plus the model processing to decide the next action. The Dataku breakdown is representative: simple web-browse tasks averaged 4 screenshots and 8,000 tokens at $0.14 per task; medium form-filling tasks averaged 7 screenshots and 14,000 tokens at $0.25 per task; complex multi-step tasks averaged 12+ screenshots and 28,000+ tokens at $0.50 or more per task. At a 30 percent success rate on complex tasks, the effective per-success cost crosses $1.67. The trade-off matrix against alternatives is unforgiving. A custom API integration delivers 99-plus percent reliability at $0.001-0.01 per task, but takes hours-to-days of setup and is brittle to UI changes. Traditional RPA (UiPath, Automation Anywhere) delivers 90-plus percent reliability at $0.01-0.05 per task, but takes days-to-weeks of setup and breaks the moment the UI redesigns. A human virtual assistant delivers 95-plus percent reliability at $3-10 per task. Computer-use sits in the band between RPA and the human VA: more flexible than RPA, faster to set up than custom integration, but only economically viable when no API exists and the per-success value clears the cost threshold.
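The break-even arithmetic above reduces to a few lines. A minimal sketch — the figures are the representative Dataku-style numbers from the text, and the function name is ours:

```python
def cost_per_success(cost_per_attempt: float, completion_rate: float) -> float:
    """Effective cost of one success when failed attempts are paid for but discarded."""
    return cost_per_attempt / completion_rate

# Representative figures from the breakdown above.
simple = cost_per_success(0.14, 0.80)     # web browsing
complex_ = cost_per_success(0.50, 0.30)   # multi-step workflows: ~$1.67 per success
human_va = cost_per_success(3.00, 0.95)   # low end of the human-VA band
```

The comparison only closes in computer-use's favor when per-success value clears the complex-task figure and no API alternative exists.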
Latency is the third constraint, and it is the one production teams discover only after deployment. The OSWorld-Human study from arXiv 2506.16042 — the first temporal-performance investigation of computer-use agents — found that LLM calls for planning and reflection account for most of the overall latency, with each successive step taking three times longer than steps at the beginning of a task. Leading agents take 1.4 to 2.7 times more steps than the human reference trajectory captured in OSWorld-Human. End-to-end task latency stretches into "tens of minutes" for tasks that humans complete in a few minutes. The result is a user-experience cliff: the agent that completes a task in 12 minutes when a human would have completed it in 4 is not a productivity win at any cost level until the benchmark-to-production gap closes.
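To see how "tens of minutes" falls out of those two findings, a back-of-envelope model helps. This sketch assumes a linear ramp from normal speed up to the 3x late-task slowdown and a 2x step-count inflation; those parameters and the 20-second base step time are illustrative choices of ours, not figures from the study:

```python
def estimated_latency_s(human_steps: int, base_step_s: float = 20.0,
                        step_inflation: float = 2.0, late_slowdown: float = 3.0) -> float:
    """Rough end-to-end latency for an agent on a task a human finishes in human_steps."""
    agent_steps = int(human_steps * step_inflation)   # agents take 1.4-2.7x more steps
    total = 0.0
    for i in range(agent_steps):
        # Successive steps slow down, ramping toward late_slowdown x the base time.
        factor = 1.0 + (late_slowdown - 1.0) * i / max(agent_steps - 1, 1)
        total += base_step_s * factor
    return total

# A 10-step human task (a few minutes by hand) comes out around 800 s — 13+ minutes.
```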
The fourth constraint is the multi-app ceiling. OSWorld's multi-application workflow category capped at 6.57 percent in the original NeurIPS 2024 paper. Successor work has lifted single-application performance into respectable territory, but the multi-app number has barely moved. The structural reason is that multi-app workflows compound failure modes — a click target drift in app A leaves the agent in an unrecoverable state when it tries to switch to app B and finds the wrong window in focus. Without explicit cross-application planning and recovery, the failure rates multiply rather than add. Production workflows that touch more than two distinct applications are the worst-served use case for computer-use in 2026.
The deployable conclusion: computer-use lives in the niche of tasks where (a) no API exists, (b) per-success value clears $1.50-2.00, (c) latency tolerance is minutes rather than seconds, and (d) the workflow stays mostly within one application. That niche is real and growing — the GPT-5.4 release will expand it — but it is narrower than the benchmark progression suggests.
# Part III: Failure Modes Specific to Screens
Three categories of failure are unique to agents that perceive screens. Text-only agents do not have these problems; computer-use agents have them all.
Visual hallucination is the first. The HalluClear paper (arXiv 2604.17284) introduced the first comprehensive hallucination suite for GUI agents — a GUI-specific taxonomy derived from empirical failure analysis, a calibrated three-stage VLM-as-judge evaluation workflow, and a closed-loop structured-reasoning mitigation. The headline operational finding: post-training on 9,000 carefully curated samples significantly reduces hallucinations, suggesting that the path to robust GUI automation is targeted post-training rather than scale alone. The dominant failure pattern in production is misidentifying UI elements — small text, low-contrast UIs, adjacent buttons, dynamic loading states — exactly the cases where a screenshot at fixed resolution loses the structural information a DOM or accessibility tree would preserve.
Idempotent failure is the second. The TVAE paper (arXiv 2604.05477) made the operational discovery: in real mobile and desktop deployments, execution timeouts caused by repeated ineffective actions account for 72.3 percent of all failures. The mechanism is straightforward and predictable. The agent issues an action, the screen does not change (because the click missed, the network is slow, the dialog hasn't loaded yet, or the previous action silently failed), and the agent — without an explicit verification step — issues the same action again. And again. And again, until the timeout fires. This is not a model intelligence problem; it is an architecture problem. The fix is a verification loop: after every action, the agent must verify that the expected effect occurred, and if not, must reason explicitly about whether the previous action failed, whether the environment is in transition, or whether the plan needs adjustment. TVAE's two-stage training pipeline (Robust SFT on synthetic failure trajectories, then GRPO with asymmetric penalties for hallucinating success on unchanged screens) demonstrates the discipline can be trained in.
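The fix described above — verify, then reason, never blindly repeat — is a small amount of middleware. A minimal sketch with caller-supplied hooks; `do_action`, `capture_screen`, and `screens_equal` are placeholders for your executor, screenshot capture, and image comparison:

```python
import time

def act_and_verify(do_action, capture_screen, screens_equal,
                   max_attempts: int = 3, settle_s: float = 1.0) -> bool:
    """Issue an action, then confirm the screen actually changed before moving on.
    Returns False (hand control back to the planner) instead of looping to timeout."""
    before = capture_screen()
    for _ in range(max_attempts):
        do_action()
        time.sleep(settle_s)      # let the UI settle; tune per application
        after = capture_screen()
        if not screens_equal(before, after):
            return True           # observable effect: count as progress
        # Unchanged screen: the click missed, the page is slow, or the action
        # silently failed. Retry a bounded number of times, then escalate.
    return False
```

The bounded retry with an explicit escape hatch is precisely what the timeout-dominated failure trajectories lack.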
Visual Atomicity Violation (VAV) is the third — and the most dangerous. The arXiv 2604.18860 paper formalized the threat as a Time-Of-Check-Time-Of-Use vulnerability unique to GUI agents. The mean observation-to-action gap in current desktop agents is 6.51 seconds. During that window, an unprivileged attacker on the same desktop session can manipulate UI state and redirect the agent's intended action to a malicious target. The paper demonstrates three attack primitives. Primitive A — Notification Overlay Hijack — pops up a notification at the click target between observation and action; the agent clicks the notification instead of the intended button. Primitive B — Win Focus Manipulation — switches the active window between observation and action; the agent's keystrokes go to the attacker's window. Primitive B achieves 100 percent action-redirection success rate across 45 trials. Primitive C — Web DOM Injection — modifies the DOM between the screenshot capture and the dispatch event, exploiting the gap purely at the application layer with no visual footprint.
The corresponding offensive research at the visual-prompt-injection layer adds a fourth category. Cloud Security Alliance research (April 2026) documented adversarial pop-up windows designed to appear legitimate to the agent while being immediately dismissible by a human user, achieving 86 percent attack success rate across OSWorld and VisualWebArena benchmarks, reducing agent task completion by 47 percent. The attack is a direct analog of EchoLeak (covered in the Indirect Prompt Injection Defense paper), but the injection vector is the rendered screen rather than an email body. System prompts instructing the agent to ignore pop-ups are insufficient — the agent's visual reasoning is susceptible to the interface framing regardless of textual instruction. ACL 2025 work by Zhang, Yu, and Yang documented related "visual semantics manipulation" attacks where GUI elements are arranged so the agent reads the visual composition as an instruction; CVPR 2025 Workshop work showed that GroundingDINO-based grounding models are susceptible to image-perturbation attacks that redirect click targets without altering visual appearance perceptible to humans.
These failures are different in kind from the failures of text-only agents. A text-only agent can be hardened against malicious instructions by separating trusted from untrusted text, applying classifiers, or wrapping content with provenance labels. A screen-perceiving agent collapses the data/instruction boundary at the visual layer — every pixel is potential instruction context, and there is no equivalent to base64 encoding or Spotlighting that preserves the semantic content while marking it as untrusted. The defenses come from elsewhere.
# Part IV: The 2026 Defense Stack
Four layers of defense are now standard in serious computer-use deployments. Each addresses a specific failure mode from Part III, and each has well-documented limitations.
Pre-execution UI State Verification (PUSV) addresses the Visual Atomicity Violation class. PUSV is a three-layer middleware that re-verifies UI state immediately before action dispatch. Layer 1 (SSIM at the click target) computes structural similarity between the observation-time screenshot and a fresh capture immediately before the click; deviations above threshold trigger an abort. Layer 2a (global screenshot diff) compares the full screen state, catching attacks that don't modify pixels at the click target but change other relevant context. Layer 2b (X Window snapshot diff) compares the OS-level window registry — what the OS thinks the active window state is — catching window-focus manipulations that change semantics without changing pixels. PUSV achieves 100 percent Action Interception Rate against OS-level structural attacks (Primitives A and B) with negligible overhead under 0.1 seconds. Crucially, PUSV achieves near 0 percent AIR against pure DOM injection (Primitive C). Visual and OS-level defenses are inherently incapable of detecting semantic application-layer manipulations that lack a visual footprint. The lesson, as the VAV paper put it: "the security of future agentic operating systems cannot rely on screenshot analysis alone."
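The first two PUSV layers reduce to an image comparison immediately before dispatch. A schematic sketch that uses a raw pixel-difference ratio as a stand-in for SSIM; a production implementation would use a real SSIM library and add the layer-2b window-registry diff, which pixels cannot capture:

```python
import numpy as np

def _crop(img: np.ndarray, x: int, y: int, r: int = 40) -> np.ndarray:
    """r-pixel box around the intended click target."""
    return img[max(y - r, 0):y + r, max(x - r, 0):x + r]

def safe_click(capture_screen, dispatch_click, obs_img: np.ndarray, x: int, y: int,
               max_target_delta: float = 0.02, max_global_delta: float = 0.10) -> bool:
    """Re-verify UI state just before dispatch; abort if it drifted since observation."""
    fresh = capture_screen()
    target_delta = float(np.mean(_crop(obs_img, x, y) != _crop(fresh, x, y)))  # layer 1
    global_delta = float(np.mean(obs_img != fresh))                            # layer 2a
    if target_delta > max_target_delta or global_delta > max_global_delta:
        return False   # something changed in the observation-to-action window: do not click
    dispatch_click(x, y)
    return True
```

An overlay that appears at the click target between observation and dispatch trips the layer-1 check and the click is withheld — the interception behavior PUSV reports against Primitives A and B.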
CaMeLs Can Use Computers Too (Foerster et al., Cambridge + ETH Zürich, January 2026) extends the CaMeL architectural pattern documented in the Indirect Prompt Injection Defense paper to the screen-perception case. The architecture splits the agent into two components. A Privileged Planner receives only the operator's task specification and never processes untrusted environmental content; it emits a restricted action plan. A Quarantined Perception module processes screenshots and environmental data but cannot issue privileged actions directly. The Planner and Perception modules interact through a constrained interface that prevents the Perception module from injecting arbitrary instructions into the Planner's reasoning context. Early results: this architecture retains up to 57 percent of frontier-model task performance while providing what the authors call "provable control-flow integrity" against instruction injection. The 43 percent performance loss is the cost of architectural separation — a tradeoff that production teams must weigh against the unbounded blast radius of a successful injection.
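The separation is easiest to see as an interface contract. A toy sketch of the pattern — the rule-based planner and the fixed grounding result are stand-ins of ours; the real systems use models on both sides of the boundary:

```python
from dataclasses import dataclass

ALLOWED_ACTIONS = {"click", "type", "scroll"}

@dataclass(frozen=True)
class PlanStep:
    action: str        # drawn from a closed vocabulary
    target_slot: str   # symbolic slot perception may fill with coordinates, nothing more

def privileged_planner(task_spec: str) -> list:
    """Sees only the operator's task specification, never the screen."""
    if "log in" in task_spec:
        return [PlanStep("click", "login_button"), PlanStep("type", "username_field")]
    return []

def quarantined_perception(screenshot, slot: str) -> tuple:
    """Processes untrusted pixels but may only return coordinates, never new steps."""
    return (100, 200)   # placeholder grounding result

def execute(task_spec: str, screenshot, dispatch) -> None:
    for step in privileged_planner(task_spec):
        assert step.action in ALLOWED_ACTIONS       # control flow fixed by the trusted plan
        coords = quarantined_perception(screenshot, step.target_slot)
        dispatch(step.action, coords)               # perception can move a click, not add one
```

Nothing the screen contains can add, remove, or reorder plan steps — that is the control-flow-integrity claim in miniature.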
Watch Mode and confirmation prompts are the third layer, and the most directly productized. Anthropic's Claude Computer Use ships with Watch Mode patterns that pause execution when the user becomes inactive in sensitive contexts (logged into email or banking). OpenAI's Operator and ChatGPT Agent ship with confirmation prompts before consequential actions — when ChatGPT Agent prepares to take a state-changing action, the user reviews the proposed action and approves or denies. ChatGPT Atlas formalizes this as automatic pause on financial-institution sites. GPT-5.4 introduces custom confirmation policies — developers specify per-action-class risk tolerance and the model surfaces confirmations accordingly. The confirmation pattern is the operational implementation of Meta's Rule of Two for the computer-use case: an agent that can simultaneously perceive untrusted content (the screen), access sensitive systems (your logged-in accounts), and change external state must require human approval at the third property.
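Custom confirmation policies amount to a small default-closed policy table in front of the dispatcher. A sketch — the action classes and policy names here are illustrative, not GPT-5.4's actual schema:

```python
POLICY = {
    "read":        "auto",     # extract/observe: never prompt
    "navigate":    "auto",
    "form_submit": "confirm",  # human reviews before external state changes
    "payment":     "confirm",
    "delete":      "deny",     # foreclosed entirely for this deployment
}

def gate(action_class: str, ask_user) -> bool:
    """Return True iff the action may be dispatched."""
    policy = POLICY.get(action_class, "confirm")   # default-closed for unknown classes
    if policy == "auto":
        return True
    if policy == "deny":
        return False
    return ask_user(action_class)   # blocking human-in-the-loop approval
```

The default-to-confirm branch is the operational form of the Rule of Two: when the risk class is unknown, a human closes the loop.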
Atlas constraints as defense pattern is the fourth layer — the practice of designing the agent's operating environment to foreclose attack vectors structurally. ChatGPT Atlas at launch could not run code in the browser, could not download files, could not install extensions, could not access other apps on the host computer, and could not access the filesystem. Each of these prohibitions is a defense: an agent that cannot install extensions cannot be tricked into installing a malicious one; an agent that cannot run code cannot exfiltrate data via a malicious script. The lesson generalizes: every capability the agent does not need is a defense the deployer is buying for free. The OpenAI Atlas hardening playbook from December 2025, covered in detail in the Indirect Prompt Injection Defense paper, adds the LLM-based automated attacker with try-before-ship simulator as the continuous-improvement loop on top of the static-defense layers.
The convergence of these four layers is the 2026 defense stack: text-only prompt-injection defenses (covered in Indirect Prompt Injection Defense) plus visual-layer defenses (PUSV) plus architectural separation (CaMeLs Can Use Computers Too) plus environment minimization (Atlas constraints). No single layer suffices. Defense-in-depth specific to computer-use is the architecture pattern that production teams converge on by their second incident.
# Part V: Where Computer-Use Actually Works Today
The honest framing of "what computer-use can ship in 2026" is narrower than the benchmark progression suggests, but the niche is real and growing. Five categories work well today.
Coding agents in sandboxed environments are the most successful production deployment of computer-use to date. Claude Code, Cursor, GitHub Copilot Workspace, and similar products use computer-use primitives — running terminal commands, opening files, reading IDE state, applying edits — in a deterministic sandboxed environment where the failure modes are constrained. The sandbox forecloses the worst attack vectors (no production credential access, no email send, no payment authorization), and the deterministic state model makes idempotent failures easy to detect. The unit economics work because the per-success value of completed coding work is high (engineer-hours saved) and the alternative — manual coding — is more expensive than the per-task cost of computer-use.
Browser agents for read-mostly workloads are the second category. Research synthesis, data extraction, summarization, and competitive monitoring are tasks where the agent observes web content and produces structured output without taking state-changing actions. WebVoyager success rates of 87 percent reflect this — the read-mostly browser case is well within the capability frontier of GPT-5.4 and Claude. Stagehand's extract and observe primitives are designed for exactly this case.
Form filling and structured-data entry are the third. Dataku's 70 percent completion rate on form-filling tasks at $0.25 per task demonstrates the working economics: when the task is "fill this form with data from these sources," the failure modes are bounded (a wrong field can be corrected, a missed submission can be retried), and the latency is acceptable (a minute to fill a form vs. five minutes for a human, with an unattended-bot benefit). The task type generalizes — vendor onboarding, expense reporting, customer-record updates, data migration between systems without APIs.
Spreadsheet automation is the fourth, and the one ChatGPT Agent's SpreadsheetBench score (45.5 percent vs Copilot in Excel's 20.0 percent) made into an explicit selling point. Editing real-world spreadsheets is a domain where the visual presentation matters (cell formatting, column structure, conditional formatting) but the underlying state is structured enough that a hybrid agent — visual observation plus formula-level reasoning — outperforms either pure-vision or pure-API approaches. The use case is narrow but valuable: financial-model maintenance, data cleanup, report generation.
The Stagehand pattern is the fifth — not a use case but a deployment architecture. Stagehand is the open-source SDK from Browserbase, with 22,000 GitHub stars and 700,000+ weekly downloads at the time of writing. Its four primitives — `act` (single action), `extract` (structured data), `observe` (read), and `agent` (multi-step autonomous) — let teams compose deterministic primitives with autonomous primitives in the same script. Critically, Stagehand works locally in development and connects to Browserbase's cloud browsers for production with no code changes. Browserbase provides Agent Identity, action caching, session replay, prompt observability, captcha solving, and zero-infrastructure deployment via Vercel Functions. The full deployable stack for many production cases — Stagehand + Browserbase + Vercel Function with a cron schedule — fits in a single afternoon of engineering work, with environment variables and a 60-second maxDuration setting.
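The value of the four-primitive split is that deterministic and autonomous steps compose in one script. A schematic in Python with a fake browser — the primitive names mirror Stagehand's, but the signatures and `FakeBrowser` are our illustration, not the SDK's actual API:

```python
class FakeBrowser:
    """Deterministic stand-in so the composition pattern runs offline."""
    def __init__(self):
        self.log = []
    def act(self, instruction: str):                  # one deterministic step
        self.log.append(("act", instruction))
    def extract(self, schema: dict) -> dict:          # structured read
        self.log.append(("extract", schema))
        return {key: [] for key in schema}
    def observe(self) -> list:                        # enumerate candidate actions
        self.log.append(("observe", None))
        return ["open invoice #1"]
    def agent(self, goal: str, max_steps: int = 10):  # bounded autonomous run
        self.log.append(("agent", goal))

def nightly_report(browser) -> dict:
    browser.act("open the vendor dashboard")          # deterministic where possible
    rows = browser.extract({"flagged_invoices": "list"})
    browser.agent("reconcile each flagged invoice")   # autonomy only for the fuzzy part
    return rows
```

The discipline the pattern enforces: reach for the autonomous `agent` primitive last, after the deterministic primitives have carried the workflow as far as they can.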
Where computer-use does not work today is just as important as where it does. Multi-app workflows remain stuck near the original 6.57 percent OSWorld ceiling for the hardest cases. Sustained autonomous operation past 30 minutes accumulates compounding errors faster than the agent can recover. Sensitive state-changing operations — wire transfers, employment decisions, medical actions, irreversible deletions — are inappropriate for any agent without explicit human-in-the-loop confirmation regardless of the cost. Operations that require strong stealth or anti-bot circumvention skirt acceptable-use lines and create legal exposure. And any task where a clean API exists is almost certainly better served by the API.
# Part VI: A 90-Day Implementation Playbook
The pattern that has worked across multiple production deployments factors into ninety days of focused work on one task type.
Days 1-30: scope and baseline. Pick exactly one task type with no available API and a per-success value of at least $2. The "no API" requirement keeps you in the niche where computer-use is the right tool; the value floor keeps the unit economics solvent. Run a benchmark of 50 representative tasks against your chosen agent (Claude Computer Use, ChatGPT Agent, GPT-5.4 via API) and measure your specific completion rate. The number you get will be different from the public benchmark numbers because your task distribution is different — that is the point of running the test. Stand up a development sandbox with read-only mode, no production credential access, and content sanitization on incoming pages (strip zero-width Unicode, CSS-hidden text, off-screen elements per OWASP recommendations). Capture screenshots of every action for post-incident analysis. Run the sanity check that an adversarial pop-up does not derail the agent — and if it does, log the percentage of tasks affected.
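The content-sanitization step is cheap to prototype. A sketch for the text layer only — zero-width stripping plus Unicode normalization; CSS-hidden and off-screen element removal has to happen at the DOM layer, before text extraction:

```python
import re
import unicodedata

# Common zero-width and invisible code points used to smuggle hidden instructions.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def sanitize_text(page_text: str) -> str:
    """Strip invisible characters, then normalize to collapse confusable forms."""
    cleaned = ZERO_WIDTH.sub("", page_text)
    return unicodedata.normalize("NFKC", cleaned)
```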
Days 31-60: harden the loop. Adopt PUSV-style verification before consequential actions: capture a fresh screenshot immediately before each click and compare it against the observation-time screenshot using SSIM at the click target. Wire confirmation prompts for any state-changing operation — sending email, submitting a form that affects external state, modifying a record, authorizing a payment — using GPT-5.4's custom confirmation policies if available, or building the equivalent middleware for your platform. If your task type is high enough stakes to warrant the architectural investment, apply the CaMeLs Can Use Computers Too separation: a Privileged Planner that sees only the task specification and a Quarantined Perception module that processes screens but cannot dispatch privileged actions. For most production cases the lighter-weight plan-then-execute pattern (the Privileged Planner emits a fixed plan from the trusted query, and tool outputs cannot influence the plan) will get you 80 percent of the security benefit with much lower engineering cost. Set Watch Mode triggers for sensitive contexts — pause when the user becomes inactive in any context that is logged into a financial, medical, or HR system.
Days 61-90: ship narrow, monitor wide. Pick the bottom quintile of your completion-rate distribution as the deploy gate, not the median. If the median completion rate is 75 percent and the 20th percentile is 40 percent, ship for the 40 percent floor — the reputational damage of the bottom-quintile failures will outweigh the benefit of the median successes if you ship to the median. Wire observability per the Agent Observability Stack paper: every screenshot, every action, every confirmation, every recovery, every failure mode logged with full provenance. Add cost monitoring with alerts on per-success cost crossing thresholds — the unit economics are how this fails, and you want a paged alert before the finance team notices. Tabletop the screen-injection incident: who detects it, who pauses the agent, who notifies affected users, who issues refunds. Most teams discover during the tabletop that they have not assigned ownership of computer-use incidents, and the runbook lives in three different teams' heads. Write it down before the incident.
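Both gates above reduce to a few lines of arithmetic. This is a sketch under stated assumptions: the nearest-rank percentile, the $0.40 floor, and the $1.50 paging threshold are illustrative defaults, not prescribed values.

```python
def percentile(values, p):
    """Nearest-rank percentile (p in 0-100) over a list of rates."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, int(round(p / 100 * (len(s) - 1)))))
    return s[k]

def deploy_gate(completion_rates, floor=0.40):
    """Gate on the 20th percentile of the completion-rate distribution,
    not the median: ship only when the bottom-quintile rate clears the
    floor. Returns (ship: bool, p20: float)."""
    p20 = percentile(completion_rates, 20)
    return p20 >= floor, p20

def cost_per_success(total_spend, successes):
    """Per-success cost; failures still burn spend, so this diverges
    as the completion rate collapses."""
    return float("inf") if successes == 0 else total_spend / successes

def should_page(total_spend, successes, threshold=1.50):
    """Page before the finance team notices: alert when per-success
    cost crosses the threshold."""
    return cost_per_success(total_spend, successes) > threshold
```

Wiring `should_page` into the same alerting path as the observability stack means a completion-rate regression surfaces as a cost page within the same billing window, not at month-end.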
After the first agent ships, the marginal cost of the second agent drops by half — the platform investment (sandbox, observability, confirmation system, PUSV middleware, runbook, cost-monitoring) is reusable. By the third agent, computer-use deployment is the kind of work a senior engineer does in a sprint rather than a quarter. That is the trajectory worth investing in.
#Part VII: Where This Goes (2027 and beyond)
The OSWorld curve will plateau near 80-85 percent in 2027. GPT-5.4's 75.0 percent at human parity is not the asymptote; another generation of reinforcement-learning training and grounding-model improvement will push the number higher. But the plateau will arrive — multi-app workflows, contradiction resolution, and event ordering remain hard for reasons that are not just compute. Production reliability will lag the benchmark by twelve to eighteen months, and the gap will compress only as the architectural defenses (PUSV, CaMeLs Can Use Computers Too, Watch Mode) become standard primitives in agent frameworks.
The compliance layer will shape what "production-ready" means. EU AI Act classifications for high-risk AI systems, ISO/IEC 42001, and NIST AI RMF will all eventually require computer-use agents to document their failure modes, their attack-surface coverage, and their deletion architectures. The teams that build to the seven-layer defense stack documented in this paper and the Indirect Prompt Injection Defense paper will pass audits in 2027 procurement reviews; the teams that ship without them will not.
Cost will fall by 50-70 percent as inference economics improve. Token costs are still on a steep deflation curve; quantized vision models and edge inference will bring the per-action cost from $0.05-0.10 today to $0.01-0.02 by 2027. The unit-economics threshold for deployable computer-use will drop accordingly, opening more vertical use cases. By 2028 the per-success value floor for justifying computer-use will be around $0.50 rather than $1.50.
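The arithmetic behind that threshold shift can be made concrete. The parameter values below (midpoints of the per-action ranges above, plus an assumed 20 actions per task and illustrative completion rates) are assumptions for the sketch, not figures from any deployment.

```python
def per_success_cost(cost_per_action, actions_per_task, completion_rate):
    """Per-success cost: every attempt, failed or not, burns a full
    task's worth of actions, so divide per-attempt cost by the
    completion rate."""
    return cost_per_action * actions_per_task / completion_rate

# 2026 (assumed midpoint): $0.075/action, 20 actions/task, 60% completion
today = per_success_cost(0.075, 20, 0.60)   # $2.50 per success
# 2027 (projected): $0.015/action, same task, 75% completion
later = per_success_cost(0.015, 20, 0.75)   # $0.40 per success
```

Under these assumptions, the per-success cost falls roughly sixfold, which is why a $0.50 value floor becomes defensible where $1.50 was the 2026 line.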
The convergence with adjacent agent infrastructure will continue. Computer-use agents need memory (per the Agent Memory in Production paper), need observability (per Agent Observability Stack), need security defenses (per Indirect Prompt Injection Defense), need supply-chain documentation (per the AI-BOM derivative). The papers in this canon will increasingly be read together rather than separately, because the production deployments increasingly require all of them.
The honest framing for the 2027 horizon is the same as Part II's. Computer-use will not replace coherent APIs where they exist — the 99 percent reliability and $0.001 per task numbers do not move. Computer-use will replace narrow RPA workflows by 2027, will broaden into multi-app workflows by 2028, and will become a default capability of every major agent platform. The teams who deploy carefully in 2026, with the seven-layer defense stack and the 90-day playbook, will be the ones whose agents are still shipping in 2030.
#Closing
The capability curve runs from 12 percent OSWorld in 2024 to 75 percent in 2026, surpassing humans on the benchmark. The deployment curve runs more slowly, held back by production reliability, cost economics, latency, and a novel attack surface that did not exist before agents could see screens. The gap is the deployment overhang.
The single ask of the reader is the one that runs through every paper in this canon. Pick exactly one task type with no API and a per-success value north of two dollars. Spend ninety days on the scope-harden-ship loop in Part VI. Ship narrow. Monitor wide. Then pick the next task type and let the platform investment compound.
The work is not glamorous. Most of it is permission boundary plumbing, screen verification middleware, confirmation prompts, observability hooks, and runbook tabletops. That is the architectural floor that makes computer-use a deployable capability rather than a demo. Build it in 2026; the agents you will need to ship in 2027 are riding on it.