#Foreword: Your COGS Changes Without Asking You
Anysphere's Michael Truell apologized in a July 2025 blog post for a pricing change that caught Cursor's $20-a-month Pro users by surprise[1]. The proximate cause was Anthropic's Claude Opus 4 launching at $15/M input and $75/M output[1] — "new models can spend more tokens per request on longer-horizon tasks"[1], and Cursor had been eating those costs under the old plan. The new Pro plan gave subscribers $20 worth of usage at API rates[1]; one long-time user reported $10 in usage charges within the first week[2]. Replit ran the same play three weeks earlier with effort-based pricing[3], then refunded all impacted users and issued $10 credits when a billing incident hit 6% of accounts[4]. Anthropic shifted Claude Enterprise from fixed $200/seat/month to $20/user/month plus usage in April 2026[5]. OpenAI raised GPT-5.5 prices in May 2026 enough that users reported 40% bill increases even as the model used fewer tokens per task[6].
Your model provider can change your cost-of-goods overnight. The pricing infrastructure that lets you survive it — multi-provider routing, prepaid hedging, the eight cost modifiers Anthropic publishes for stacking discounts[7], per-trace observability — is the topic of this paper. Most agent companies don't have any of it. The ones that survive 2026-2028 will.
#Executive Summary
Three forces are converging to make dynamic margin engineering the highest-leverage discipline in agent-based businesses.
First — the pricing shifts are no longer hypothetical. Cursor shifted Pro from request-based to credit-based in June 2025[1], reaching $500M ARR by then[8] and $29.3B valuation by November 2025[9]. Replit launched effort-based pricing the same month[3], then a $100/month Pro plan in February 2026[10]. Anthropic moved Claude Enterprise to usage-based in April 2026[5] after inference costs grew "more than threefold" in 2025[5] and Claude Code annualized revenue jumped from $1B (Dec 2024) to $2.5B (Feb 2025)[5]. OpenAI raised GPT-5.5 prices in May 2026, producing 40% bill increases despite token-efficiency gains[6]. The pattern is identical: model providers internalize cost pressure, then push it through to platform vendors, then platform vendors push it through to end customers.
Second — the discounts stack to 80-90% off when engineered. Anthropic's eight published cost modifiers[7] compose: cache reads at 10% of base input[11][12], Batch API at 50% off both input and output[11], 1M context at standard pricing on Opus 4.7 / Sonnet 4.6 / Opus 4.6[11][13], an optional 1.1× data-residency multiplier[11], a 6× fast-mode premium on Opus 4.6[7], plus web search ($10 per 1,000 searches[7]) and tool-token overhead. Stacked correctly, "cached input on Haiku batch costs $0.05/M tokens — 20x cheaper than standard Haiku input"[14]. Multi-model routing achieves another 30-85% reduction at 95% quality parity[15][16][17]. DeepSeek V4-Pro at $0.28/$2.48[18][19] is 9-11× cheaper than Claude Opus 4.6 ($5/$25) on equivalent SWE-Bench Verified tasks[18].
Third — AI-native unit economics demand new pricing architecture. Per-seat collapsed from 21% to 15% of SaaS pricing in 12 months; hybrid (base + usage) grew from 27% to 41%[16]. AI gross margins run 50-60% vs 80-90% traditional SaaS[16][20][21]. ICONIQ data shows AI product margins climbing 41% (2024) → 52% (projected 2026)[21]. Bessemer Supernovas reach ~$100M ARR in 1.5 years but with ~25% margins[22][23], often negative; Shooting Stars take ~4 years to $100M ARR with 60% margins[22][23]. The cost-management discipline now sits alongside product and GTM as a CEO-level priority — Replit's Effort-Based pricing "lays the groundwork for a more agentic future"[3], Cursor's Ultra plan unlocks via multi-year model-provider partnerships[8], and a16z explicitly defends AI margin compression as "structural, not temporary"[24].
This paper synthesizes the 2025-2026 pricing-shift evidence, the eight Anthropic cost modifiers, the five primary LLM gateways (Vercel AI Gateway, OpenRouter, LiteLLM, Portkey, Helicone), the hedging mechanics (OpenAI Reserved Capacity, Scale Tier, committed-spend agreements), the per-trace observability schema (Helicone, Langfuse, Portkey, CloudZero), and the AI-native unit-economics framework (Bessemer Supernova benchmarks, ICONIQ State of AI, Spendline contribution-margin-per-customer) into one buildable spec. The 7-decision dynamic-margin-engineering playbook closes the paper.
#Part I: The 2025-2026 Pricing Shifts — Cursor / Replit / Anthropic / OpenAI
Four canonical pricing shifts in 2025-2026 demonstrate the dynamic-margin failure mode at scale.
Cursor (Anysphere) — June 2025 + Pro/Pro+/Ultra tiering. Cursor's $20/month Pro plan switched on June 16 2025 from "500 fast responses + unlimited slower responses" to "$20 worth of usage at API rates"[1][2]. The reason: "new models can spend more tokens per request on longer-horizon tasks" — Anthropic's Claude Opus 4 at $15/M input + $75/M output[1] was eating into Cursor's margins. Anysphere CEO Michael Truell apologized in a July 7 blog post for unclear communication[1]. Anysphere had also struck multi-year deals with OpenAI, Anthropic, Google DeepMind, and xAI to launch the $200/month Cursor Ultra plan with 20x Pro usage[8]. Cursor reached $500M ARR by June 2025[8] — one of the fastest companies ever to $100M ARR, adding $200M of ARR in two months[8] — and went on to a $29.3B valuation in November 2025 ($2.3B raised, tripling its prior value)[9]. The Pro plan was further restructured by March 2026 into compute-unit (CU) pricing — 500 CUs at $20/month with overage at $0.05/CU[8]. The Cursor case study is the canonical example of a platform vendor passing through the model provider's cost pressure to end users — and of the customer-trust cost of doing it badly.
Replit — Effort-Based Pricing. Replit moved from a flat $0.25/checkpoint to effort-based pricing on June 18 2025[3]. "Simple changes still result in a single checkpoint — typically costing less than $0.25"[3]; complex tasks bundle into one checkpoint that "may cost more than $0.25"[3]. The recap on July 13 documented simpler requests as low as $0.06; complex requests sometimes "multiple dollars"[4]. Replit shipped a "major double-digit percent efficiency win in AI model usage" within three weeks of the new pricing and passed those savings to users[4] — the architectural promise of effort-based pricing: when pricing is tied to operational cost, every efficiency win is a customer benefit. A July 11 billing incident affected 6% of users; Replit refunded fully and gave $10 credits to all impacted users[4]. The February 2026 Pro Plan launch ($100/month) added Economy Mode (1/3 cost per prompt), Power Mode (more capable models), and Turbo Mode (2x faster at up to 6x cost)[10] — three explicit cost-vs-quality dials at the user level. Replit's annualized revenue climbed from $2.8M to $150M in roughly a year[20]; the company is targeting $1B ARR by end-2026[24].
Anthropic — Claude Enterprise usage-based shift (April 2026). The Information's reporting via Longbridge[5] documented Anthropic's quiet pivot from fixed $200/user/month enterprise pricing to $20/user/month base + usage. The trigger: Claude Code annualized revenue jumped from $1B in December 2024 to $2.5B in February 2025[5]; Anthropic's inference costs "grew more than threefold" in 2025[5] with gross margins below expectations. An Anthropic spokesperson framed it: "Better reflects how customers actually use Claude, as workloads are shifting from fixed-seat productivity tools to agent applications"[5]. The April 4 2026 third-party-tool restriction[25][26] terminated 135,000+ active OpenClaw instances' subscription-based access overnight — heavy users saw 10x to 50x cost increases[26], documented at one extreme as $109/day Opus 4.6 usage = $3,286/month vs $200/month subscription = 16x cost increase[25]. The Boris Cherny quote captures the discipline: "Computing power is a resource we need to manage prudently; we prioritize customers using our own products and APIs"[5]. Anthropic's $100M Claude Partner Network commitment in March 2026[26] is the constructive flipside — first-party tools and partners get continued subsidy; arbitrary third-party tools don't.
OpenAI — GPT-5.5 May 2026 40% price increase. OpenAI raised GPT-5.5 prices in May 2026[27][6] enough that users reported total bill increases of "around 40 percent"[6] even though the model uses fewer tokens per task[6]. The structural diagnosis: "Efficiency gains in the system do not automatically translate to savings in your bill"[6] — users expected a 15-20% cost reduction from efficiency gains and got a 40% cost increase instead[6]. OpenAI is projected to lose $14B in 2026[28] (cumulative $44B by 2028)[28]; only 5.5% of ChatGPT's 900M weekly users pay[28]; the $600B total compute spend planned by 2030 requires balanced consumer + enterprise revenue[28][29]. The structural tension between model-provider profitability and platform-vendor margin is now public.
Across all four cases, the same pattern: model provider absorbs cost pressure, then internalizes it, then pushes it through. Platform vendors that don't have dynamic margin engineering get the surprise bill. The next section catalogs the discounts that exist to stack against this.
#Part II: The Eight Cost Modifiers — Stacking Discounts to 80-90% Off
Anthropic's official pricing page publishes eight cost modifiers that compose multiplicatively[11][7]. Used together, they routinely produce 80-90% reductions vs naive standard-rate usage[14].
Modifier 1 — Cache reads at 0.1× base input. Anthropic's cache-read pricing is 10% of base input[11] across all current models — Opus 4.7/4.6 cache reads at $0.50/MTok vs $5 base[11]; Sonnet 4.6 at $0.30/MTok vs $3 base[11]; Haiku 4.5 at $0.10/MTok vs $1 base[11]. Break-even arrives after one cache read on 5-minute writes (1.25× write premium)[11] and after two reads on 1-hour writes (2× premium)[11][12]. Minimum cache block: 1,024 tokens (Haiku)[30], 2,048 tokens (Sonnet/Opus)[30]. For repeated workloads — chatbots with stable system prompts[30], document Q&A[30], code review on the same codebase[30] — caching delivers "60-95% input cost reduction depending on hit rate"[30]. Anthropic's 90% cache discount[12] substantially exceeds OpenAI's 50% cache discount on GPT-4o[31].
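The break-even arithmetic above can be checked in a few lines. This is a sketch under the quoted multipliers (1.25×/2× write premiums, 0.1× reads); it ignores output tokens and assumes the same prefix would otherwise be re-sent in full at the standard 1.0× rate.

```python
# Cache break-even sketch, using the multipliers quoted in this section.

def breakeven_reads(write_multiplier: float, read_multiplier: float = 0.10) -> int:
    """Smallest number of cache reads after which one cached write plus n reads
    costs less than (n + 1) uncached sends of the same prefix."""
    n = 0
    while True:
        n += 1
        cached = write_multiplier + n * read_multiplier   # 1 write + n reads
        uncached = 1.0 + n * 1.0                          # (n + 1) standard sends
        if cached < uncached:
            return n

print(breakeven_reads(1.25))  # 5-minute writes → 1
print(breakeven_reads(2.0))   # 1-hour writes   → 2
```

The function reproduces the one-read and two-read break-even points stated above.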
Modifier 2 — Batch API at 50% off. "The Batch API allows asynchronous processing of large volumes of requests with a 50% discount on both input and output tokens"[11]. Opus 4.6 batch input drops from $5/MTok to $2.50/MTok; Sonnet 4.6 from $3 to $1.50; Haiku 4.5 from $1 to $0.50[11]. Anthropic explicitly states that Batch API and prompt-caching discounts stack[11]. The composed math: cached input on Haiku batch costs $0.05/MTok — 20× cheaper than standard Haiku input[14]. For nightly scoring jobs at 50M tokens of shared context, Anthropic batch + caching is "the cheapest frontier option in 2026"[31]. OpenAI also offers a 50% Batch discount, but its cache discount does not combine with the Batch API[31] — Anthropic's design wins on stackability.
Modifier 3 — 1M context standard pricing. As of March 13 2026, Anthropic removed the 2× input / 1.5× output surcharge that previously applied to prompts over 200K tokens[11][13]. The old surcharge applied to the ENTIRE request, not just the overflow — "Send 201K tokens and every single one... billed at the inflated rate"[13]. The new flat pricing covers Opus 4.7, Opus 4.6, Sonnet 4.6, and Claude Mythos Preview at the full 1M context window[11][13]. By contrast, GPT-5.4 has a 272K cliff that doubles input cost above the threshold; Gemini 3.1 Pro has a 200K cliff[13]. "A 900K-token request now costs exactly the same per-token rate as a 9K-token request"[11].
Modifier 4 — Data residency 1.1× multiplier (optional). US-only inference (inference_geo=us) incurs a 1.1× multiplier on Opus 4.7, Opus 4.6, Sonnet 4.6[11][7]. Global routing (the default) uses standard pricing[11]. The trade-off: compliance with data-residency requirements for regulated workloads (financial services, healthcare, government) costs 10% on top of the base rate[11].
Modifier 5 — Fast mode (6× premium for Opus 4.6). Anthropic offers a fast-mode beta on Opus 4.6 at 6× standard rates[7] for speed-critical premium tasks. The 6× multiplier exists because compute capacity that prioritizes latency over throughput is more expensive to provision[7]; speed-critical workloads pay for that.
Modifier 6 — Web search ($10 per 1,000 searches + token costs). Anthropic charges $10 per 1,000 web-search calls[7] on top of token costs[7]. Useful for agent workloads that need grounded answers, but consumes budget separately from the model-token line item[7].
Modifier 7 — Code execution. Tool execution carries token costs at the request and response level[7]; sandbox compute is billed separately[27]. For Anthropic Skills running code execution[7], the cost is the response-token cost plus container compute time[27].
Modifier 8 — Tool-token overhead. Every tool definition consumes context tokens on every call[23]. The Tianpan FinOps analysis notes "structured-output retry compute"[23] can produce 2-3× retry rates that drive gross margin from 40% to 25%[23]; tool schema bloat compounds the same dynamic[23]. Caching tool definitions[31] mitigates this — "Cache tool definitions on agent workloads; they are usually static and high volume"[31].
The stack. PE Collective's worked math: "A pipeline using all three [model routing + caching + batch] can run at $0.05/1M cached input tokens on Haiku batch, which is 100x cheaper than standard Opus input pricing"[14]. Cache breakeven hits at 1.3-1.4 reads across all models[14][30]. The discount stack is the load-bearing infrastructure for AI gross margin.
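The multiplicative stack is simple enough to verify directly. A minimal sketch using the rates quoted in this section (Haiku $1/MTok, Opus $5/MTok, 50% batch, 10% cache reads); illustrative arithmetic, not an official rate card:

```python
# Modifier stacking sketch: the discounts compose multiplicatively on input.

HAIKU_INPUT = 1.00   # $/MTok, standard Haiku 4.5 input (quoted above)
OPUS_INPUT = 5.00    # $/MTok, standard Opus 4.6 input (quoted above)
BATCH = 0.50         # Batch API: 50% off input and output
CACHE_READ = 0.10    # cache reads at 10% of base input

haiku_cached_batch = HAIKU_INPUT * BATCH * CACHE_READ

print(f"${haiku_cached_batch:.2f}/MTok")                             # $0.05/MTok
print(f"{HAIKU_INPUT / haiku_cached_batch:.0f}x vs standard Haiku")  # 20x
print(f"{OPUS_INPUT / haiku_cached_batch:.0f}x vs standard Opus")    # 100x
```

The two ratios reproduce the "20× cheaper than standard Haiku" and "100× cheaper than standard Opus" claims cited above.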
#Part III: Multi-Provider Routing — Vercel / OpenRouter / LiteLLM / Portkey / Helicone
Five LLM gateways converged on the same architectural pattern by mid-2026: unified API + provider routing + automatic failover + observability + cost arbitrage[32][33][34].
Vercel AI Gateway — zero markup, fluid compute backend. Vercel AI Gateway is "a unified API to access hundreds of models through a single endpoint" with "no markup on tokens, including with Bring Your Own Key (BYOK)"[35]. Free tier: $5/month credit on every team account[36]. Pay-as-you-go AI Gateway Credits with auto-top-up[36]. The architectural primitive is providerOptions.gateway.models for model fallback chains[37] — primary model + ordered fallback models + provider preference. Service tiers pass through OpenAI's default | priority | flex settings with adjusted pricing[38]. The Custom Reporting API costs $0.075 per 1,000 unique tag values + $5 per 1,000 queries[38] — group-by model, user, tag, provider, credential_type. Runs on Fluid compute: "Paying CPU rates for less than 8% of runtime instead of 100%"[39], with anycast PoP routing at sub-20ms latency[39].
OpenRouter — managed marketplace, 5.5% markup. OpenRouter aggregates 300+ models from 60+ providers through a single OpenAI-compatible API[40]. "Pay-per-request at provider cost + small markup (typically 5% on hosted models)"[32]; the 5.5% credit-purchase fee means $1,000 of credits costs $1,055[32]. Provider routing defaults to "load balance requests across providers, prioritizing price"[40]. The sort field exposes price | throughput | latency; the max_price filter blocks a request when no provider is available at or below that price; preferred_max_latency and preferred_min_throughput refine selection further[40]. Weakness: hosted-only with no self-host option, and the 5% markup compounds at scale ($5K/year at $100K/year provider spend)[32].
LiteLLM — open-source proxy, zero markup, MIT-licensed. LiteLLM is the open-source AI Gateway with 100+ providers behind a single OpenAI-compatible API[41]. "8ms P95 latency at 1k RPS"[41]; the proxy ships with virtual keys, spend tracking, guardrails, load balancing, and an admin dashboard[41]. Real operational cost: ~$40-60/month on Hostinger VPS + managed Postgres + monitoring per the AICraftGuide breakdown[33]. Pricing math: a 5.5% markup on $20,000/month provider spend through OpenRouter is $1,100/month; the same workload through LiteLLM costs ~$40-60/month operational — LiteLLM wins above ~$2K/month inference[33].
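The markup-vs-ops crossover can be made explicit. A sketch assuming the 5.5% fee and a ~$50/month self-host ops midpoint from the sources above; the article's ~$2K/month threshold adds headroom for engineering time on top of this raw crossover:

```python
# Break-even sketch: hosted-marketplace markup vs self-hosted proxy ops cost.
# MARKUP and SELF_HOST_OPS come from the figures cited above; the crossover
# point is derived, not quoted.

MARKUP = 0.055          # OpenRouter credit-purchase fee
SELF_HOST_OPS = 50.0    # ~$40-60/month VPS + Postgres + monitoring (midpoint)

def monthly_overhead(provider_spend: float, self_hosted: bool) -> float:
    """Gateway overhead per month: flat ops cost if self-hosted, else markup."""
    return SELF_HOST_OPS if self_hosted else provider_spend * MARKUP

crossover = SELF_HOST_OPS / MARKUP   # spend above which self-hosting is cheaper
print(f"markup at $20K/month spend: ${monthly_overhead(20_000, False):,.0f}/month")
print(f"raw crossover: ~${crossover:,.0f}/month provider spend")
```

At $20K/month the markup reproduces the $1,100/month figure above; the raw crossover lands under $1K/month, which is why the cited analyses round the practical threshold up to ~$2K.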
Portkey — open-sourced under Apache 2.0 (March 2026). Portkey open-sourced its entire gateway under Apache 2.0 in March 2026[32], becoming the most production-feature-complete open-source option. 250+ models, guardrails (PII redaction, jailbreak detection, prompt injection filters), semantic caching (fuzzy matching), prompt management with versioning[32][34]. Free self-host; hosted version $49/month at $1K provider spend[32][34]. Best for teams needing observability + governance + guardrails as one stack.
Helicone — observability-first MIT-licensed. Helicone is "purpose-built for LLM observability"[32], MIT licensed, free tier covers 100K requests/month[32]. Provider Routing automatically "Routes to the cheapest available provider"[42]; failover triggers on 429 (rate limit), 401 (auth), 400 (context), 408 (timeout), 500+ (server errors)[42]. The credit mechanism is distinctive: "Helicone's managed keys (credits) - automatic fallback at 0% markup"[42] — pay providers directly through Helicone's pooled keys. BYOK + Helicone managed keys fallback is the default[42]; manual fallback chain syntax: model: "gpt-4o-mini/azure,gpt-4o-mini/openai,gpt-4o-mini"[42].
Routing economics in practice. CallSphere's May 2026 image-understanding benchmark[15]: classify each request, route the easy 60-70% to DeepSeek V4-Flash ($0.14/$0.28) / Gemini 2.5 Flash-Lite ($0.10/M), the medium 20-30% to Sonnet 4.5 ($3/$15), and the hard 5-15% to GPT-5.5 / Opus 4.7. "Smart routing economics: a $50K/mo all-GPT-5.5 workload typically becomes $7-15K/mo when 70% of traffic is routed to DeepSeek V4-Flash or Gemini Flash-Lite, while preserving 95%+ of measured quality"[15]. Zylos reports "30-85% cost reductions" with 95% quality parity for the routing pattern[16][17][43]. The decision tree for which gateway to pick[15][33]: under $2K/month inference → OpenRouter or Portkey Free; $2-10K → any of the three is viable; over $10K/month → LiteLLM self-hosted as the clear cost winner. The hybrid pattern most teams converge on: LiteLLM for routing + Helicone for observability, OR Portkey for the all-in-one stack[34]. The DeepSeek V4-Pro arbitrage is especially compelling: $0.28/$2.48 vs Claude Opus 4.6 $5/$25 — 9-11× cheaper end-to-end while landing within one point of Opus accuracy on SWE-Bench Verified[18]. A team currently spending $38,000/month on Opus would expect the DeepSeek-equivalent workload to land between $3,400 and $4,200[18].
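The classify-then-route economics reduce to a weighted sum. A sketch of the blended-cost math, where the traffic mix follows CallSphere's bands and the per-tier relative costs are assumptions chosen for illustration, not measured prices:

```python
# Blended-cost sketch for classify-then-route. relative_cost expresses each
# tier's per-request cost relative to the frontier model (frontier = 1.0);
# the specific ratios below are assumptions for the sketch.

def blended_cost(all_frontier_monthly: float, mix: dict[str, float],
                 relative_cost: dict[str, float]) -> float:
    """Monthly cost of a routed workload vs sending everything to frontier."""
    return all_frontier_monthly * sum(mix[t] * relative_cost[t] for t in mix)

mix = {"easy": 0.70, "medium": 0.20, "hard": 0.10}
rel = {"easy": 0.03,     # Flash-class models, assumed ~1/30th of frontier
       "medium": 0.30,   # Sonnet-class mid tier, assumed ~1/3rd
       "hard": 1.00}     # GPT-5.5 / Opus class

print(f"${blended_cost(50_000, mix, rel):,.0f}/month vs $50,000 all-frontier")
```

Under these assumed ratios the $50K all-frontier workload lands around $9K/month, inside the $7-15K band CallSphere reports.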
#Part IV: Hedging Mechanics — Reserved Capacity, Committed Spend, Pre-Paid Credits
Routing alone is not enough. The model-provider price changes documented in Part I happen on a calendar the platform vendor doesn't control. Three hedging mechanics give platform vendors a forward-priced position: reserved capacity, committed spend, and pre-paid credits.
OpenAI Reserved Capacity. OpenAI's Reserved Capacity is the most explicit forward-pricing primitive published by any model provider[44]. "Static allocation of capacity dedicated to you, providing a predictable environment that you control"[44]. Pricing is per compute unit: $260/unit/month on a 3-month commit[44], $220/unit/month on a 1-year commit[44] (~15% savings)[44]. GPT-5 at 400K max context requires a minimum commitment of 3 instances — 270 units at the quoted rates — for $70,200/month (3-month)[44] or $59,400/month (1-year)[44] = $712,800 annual[44]. GPT-4o at 128K context: 500 units[44] = $130,000/month (3-month)[44] or $110,000/month (1-year)[44] = $1.32M annual[44]. SLAs included: 99.5% uptime + on-call engineering[44]. The economics: you give up model-swap flexibility in exchange for predictable cost and capacity guarantees. For high-volume production workloads where the model and the cost are stable, Reserved Capacity protects against both rate changes AND capacity shortages during model-provider squeezes.
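The quoted figures are easy to sanity-check. The 270-unit count below is the one implied by the stated $70,200/month at $260/unit; the rates and the GPT-4o unit count are as reported:

```python
# Arithmetic check on the Reserved Capacity figures quoted above
# (unit rates as reported; not an official calculator).

RATE_3MO = 260   # $/unit/month on a 3-month commit
RATE_1YR = 220   # $/unit/month on a 1-year commit

def monthly(units: int, rate: int) -> int:
    return units * rate

gpt5_units = 270    # implied by the quoted $70,200/month at $260/unit
gpt4o_units = 500   # quoted directly

print(monthly(gpt5_units, RATE_3MO))         # 70200
print(monthly(gpt5_units, RATE_1YR))         # 59400
print(monthly(gpt5_units, RATE_1YR) * 12)    # 712800 annual
print(monthly(gpt4o_units, RATE_3MO))        # 130000
print(f"{1 - RATE_1YR / RATE_3MO:.1%} savings on the 1-year commit")
```

The 1-year rate works out to roughly 15% off the 3-month rate, matching the "~15% savings" claim.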
OpenAI Scale Tier. Scale Tier is the pay-as-you-go-but-bundled alternative[45]: purchase token units upfront with 30-day minimum per unit. GPT-5.5 input bundle: 50,000 TPM = $750/unit/day[45]. GPT-5: 25,000 TPM input = $75/unit/day + 2,500 TPM output = $60/unit/day[45]. GPT-5-mini: 500,000 TPM = $275/unit/day input + 50,000 TPM = $220/unit/day output[45]. Uptime SLA 99.9%; latency SLA varies by model (99% > 50 tokens/sec for GPT-5; 99% > 100 tps for mini)[45]. With GPT-5.4 and later, Scale Tier moved to Combined Input and Output tokens/min bundles — removes the need to predict input/output ratio[45]. Priority Processing[46] is the per-request alternative: service_tier="priority" parameter, premium pricing, same 99.9% uptime SLA. Vercel AI Gateway passes Priority Processing through with adjusted pricing[38].
Anthropic Enterprise committed spend. Anthropic doesn't publish list discounts for committed spend, but the Vendor Benchmark analysis surfaced the structure: $500K+ annual API commit unlocks 18-24% discount[47] on Claude Sonnet 4.5 input/output[47]; $1M-$2.4M unlocks 24-30%[47]; $2.5M+ unlocks 28-36%[47]. Bedrock EDP overlay (Enterprise Discount Program) layers 20-30% on top for AWS-committed organizations[47] — "AWS Bedrock (Claude Sonnet) with 30% EDP overlay = $2.10 effective per 1M input tokens"[47] (vs $3.00 list)[11]. Combined Anthropic direct + Bedrock for strategic accounts: $1.95-2.20 effective = 27-35% off list[47]. The Anthropic Enterprise per-seat list is $30/user/month at small deployments[47], $17-30 with negotiation[47]; the April 2026 shift to $20/user/month base + usage[5] internalized the volatility[5].
Pre-paid credits. OpenAI's prepaid billing[48] is the on-ramp before committed spend: load credits[48], API spend them[48], auto-recharge at threshold[48]. Credits expire 12 months after purchase[48]. Enterprise / Edu / Business plans use shared credit pools at the contract level[49] with RBAC spend controls by group[49]. Anthropic ships pre-purchased usage bundles at up to 30% off during pricing-shift transitions[25][26] — the OpenClaw termination credit was the example[25][26]. Replit Pro's $100/month plan gives tiered credit options from $100 to $4,000/month with "largest discounts"[10] and 1-month rollover[10] — credits are the rental price; the discount tier is the hedge[10].
Hedging math. A startup spending $20K/month on Sonnet 4.6 at API list ($3/$15) faces three hedge instruments: (1) committed spend at the $500K+ annual tier = 18-24% off, effective $2.28-2.46/$11.40-12.30, i.e. $3,600-4,800/month in savings[47]; (2) Bedrock EDP overlay at 30% on a $2M AWS commit = $2.10/$10.50 effective[47]; (3) prepaid credits + auto-recharge for predictability without commitment. The procurement-side observation from Redress[50]: "Volume commitment minimums require paying for unused token capacity if actual consumption falls below the committed level... Price change provisions in standard terms that historically allowed rate increases with as little as 14 days' notice." The negotiation ask: a 12-month ramp from 30% to 100% of commitment + a price lock for the initial term + non-negotiable substitution rights across providers.
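The three instruments compare as follows. A sketch using the reported discount bands (18-24% committed spend, 30% EDP overlay); every dollar figure below is derived from those bands, not quoted:

```python
# Hedge-instrument comparison for the $20K/month Sonnet example above.

SPEND = 20_000                             # $/month at list ($3/$15 Sonnet 4.6)
LIST_INPUT, LIST_OUTPUT = 3.00, 15.00      # $/MTok list

committed_lo, committed_hi = 0.18, 0.24    # $500K+ annual commit band
bedrock_edp = 0.30                         # EDP overlay on a $2M AWS commit

print(f"committed spend saves ${SPEND * committed_lo:,.0f}-"
      f"{SPEND * committed_hi:,.0f}/month")
print(f"effective input: ${LIST_INPUT * (1 - committed_hi):.2f}-"
      f"{LIST_INPUT * (1 - committed_lo):.2f}/MTok")
print(f"Bedrock EDP effective: ${LIST_INPUT * (1 - bedrock_edp):.2f}/"
      f"{LIST_OUTPUT * (1 - bedrock_edp):.2f}/MTok")
```

The derived effective rates reproduce the $2.28-2.46 and $2.10/$10.50 figures cited above; prepaid credits (option 3) carry no discount to model, only predictability.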
#Part V: Pass-Through Clauses and the Effort-Based Pricing Pattern
The customer-facing question is: do you pass through cost changes, eat them, or pre-price them? The 2025-2026 pricing-shift evidence narrows the answer.
Effort-based pricing — Replit's design. Replit's effort-based model[3][4] is the architectural answer for usage-based agent products. "Cost per checkpoint is now determined based on the amount of computing resources consumed"[4]. Simple requests as low as $0.06; complex requests "sometimes multiple dollars"[4]. The customer-facing promise: when Replit ships a "major double-digit percent efficiency win in AI model usage," those savings pass through immediately to users[4]. The architectural trade-off: cost-prediction shifts from vendor to user. Replit's three explicit cost dials — Economy Mode (1/3 cost per prompt), Power Mode (more capable models), Turbo Mode (2x faster at up to 6x cost)[10] — make the trade-off explicit at the user level.
Pass-through with annual buckets — Intercom Fin's design. Intercom Fin's $0.99 per-resolution[19] uses annual buckets to solve the predictability problem: "Users can buy a set number of resolutions that they can burn down at any point during the year"[19]. CFOs get fixed cost; usage-month volatility doesn't break the budget. The architectural primitive: hybrid base + variable, with the variable component pre-purchased in advance. O'Reilly's forecast: "We'll see three big shifts: outcome-based pricing, differentiated pricing for different types of agents, and AI verification of resolutions... The companies still charging purely by Token in two years will be the outliers"[19].
Tiered pricing with credits — Cursor's design (after multiple iterations). Cursor's Pro Plus ($60/month, $70 in API agent usage)[2] and Ultra ($200/month, $400 in API agent usage)[8], plus the March 2026 compute-unit pricing on Pro[8], together form the multi-tier-with-credits answer. Heavy users self-select into the tier that matches their workload. The lesson from Cursor's June 2025 mistake[1]: customers buy on predictability, not on raw cost — the surprise charge is worse than the higher base price.
Subscription with hard caps — Claude Code's design. Anthropic's Claude Code on Pro/Max subscriptions caps by weekly limits rather than per-token billing[2] — "capping its biggest spenders" rather than passing token cost through. Anthropic doubled Claude Code 5-hour rate limits across Pro/Max/Team/Enterprise plans[51] and removed peak-hours limit reduction on Pro/Max[51] in May 2026. The trade-off: predictable cost for the user, but heavy users get throttled. Anthropic absorbed the cost; the platform/provider managed compute scarcity by capacity, not pricing.
Outcome-based with vendor risk. The Intercom Fin model (per-resolution)[19], Decagon's per-resolution AOP-driven pricing (Paper #10), Salesforce Agentforce $2.00/conversation[19] — these put vendor revenue on the resolution rate. AgentMarketCap names the risk: "Outcome-based AI changes all three assumptions [gross margin, revenue predictability, churn]. Gross margins compress to 50-65%. Revenue is proportional to task volume and success rate"[20]. The vendor accepts accuracy degradation as a revenue risk: "If an agent's resolution rate drops from 82% to 74% due to a model update, revenue drops proportionally with no corresponding cost reduction"[20].
The decision tree: pass through (effort-based) when the user can rationally choose the model; pre-bundle (annual buckets) when CFO predictability dominates; tiered credits when the workload distribution is bimodal; subscription with caps when the customer values predictability over flexibility; outcome-based when the outcome is measurable and the vendor can absorb the variance. The wrong answer is opaque pass-through — Cursor's June 2025 communication failure was the canonical case[1].
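The decision tree above can be stated as a function. The predicates below are deliberate simplifications of the qualitative criteria, not a product recommendation:

```python
# Sketch of the pass-through decision tree from this section. Each branch maps
# a workload characteristic to the pricing pattern discussed above; the
# boolean predicates are simplified stand-ins for real product judgment.

def pricing_pattern(user_picks_model: bool, cfo_predictability: bool,
                    bimodal_workload: bool, outcome_measurable: bool) -> str:
    if user_picks_model:
        return "effort-based pass-through (Replit)"
    if cfo_predictability:
        return "pre-bundled annual buckets (Intercom Fin)"
    if bimodal_workload:
        return "tiered credits (Cursor)"
    if outcome_measurable:
        return "outcome-based (per-resolution)"
    return "subscription with caps (Claude Code)"

print(pricing_pattern(False, True, False, False))
```

The fall-through default is subscription-with-caps, matching the section's framing that predictability-over-flexibility is the conservative choice; opaque pass-through never appears as a branch.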
#Part VI: Cost Observability — Per-Trace Attribution and the 7-Dimension Tagging Schema
Without per-trace cost attribution, every optimization in Parts II-V is guesswork. Three observability primitives anchor the discipline.
The per-request cost record. Every LLM call needs to capture seven dimensions: (1) input tokens, output tokens, cached tokens[52]; (2) model and tier used[52]; (3) dollar cost of the call[52]; (4) user ID, tenant ID, session ID, feature flag, prompt version[52]; (5) trace ID linking the call to its parent workflow[52]; (6) tool calls and outputs (for agents)[52]; (7) latency[52]. CloudZero's diagnostic: "Most stop at token counts. Some go further. Few integrate with your broader cloud and SaaS cost picture"[52]. Without dollar cost as a live signal tagged to user and feature, "you can't tell which customer is most expensive to serve, which feature is eating your margin, which agent tool call is responsible for the 40% jump last week"[52].
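One way to make the seven dimensions concrete is a record type. The field names below are illustrative, not a schema from CloudZero or any other cited tool:

```python
# Illustrative shape for the seven-dimension per-request cost record
# described above. Field names are assumptions for the sketch.

from dataclasses import dataclass, field

@dataclass
class LLMCostRecord:
    input_tokens: int                 # (1) token counts, including cached
    output_tokens: int
    cached_tokens: int
    model: str                        # (2) model and service tier used
    tier: str
    cost_usd: float                   # (3) dollar cost of this call
    user_id: str                      # (4) attribution dimensions
    tenant_id: str
    session_id: str
    feature_flag: str
    prompt_version: str
    trace_id: str                     # (5) link to the parent workflow
    tool_calls: list[dict] = field(default_factory=list)  # (6) agent tool I/O
    latency_ms: float = 0.0           # (7) latency
```

Emitting one such record per call is what turns the aggregate questions above (most expensive customer, margin-eating feature, runaway tool call) into group-by queries.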
The metrics that matter. Six metrics make per-trace observability actionable[43]: cost per trace/workflow run[43], cost per user[43], cost per model tier[43], cache hit rate[43], tokens per tool call[43], output token ratio[43]. The three CFO operating KPIs[22]: CPT (cost per thousand tokens)[22], CPR (cost per resolved request)[22], CPAM (cost per agent minute)[22]. The discipline shift: "Move from account-level margin to workflow-level margin as your operating KPI"[22].
Budget governance and circuit breakers. Cycles' per-conversation cap analysis[53] is the load-bearing example: a B2B SaaS company pricing an AI copilot at $15/user/month[53] with $3 estimated cost[53] actually hit 23% margin[53] (vs 80% target)[53]; top 10% of users by cost consumed 72% of spend[53]; the worst-case user cost $310[53]. A $2/conversation hard cap moved the worst-case user from $310 to $22[53] and feature gross margin from 23% to 68%[53]. A $15/user/month cap moved the worst-case to $15 with 5% of users hitting the cap[53]. Cycles' diagnostic: "Cannot control variance at the pricing layer. Must control at the execution layer"[53]. Practical controls[17]: max iterations per task (50 calls without completing = likely loop, not thoroughness)[17]; token budget per trace[17]; cost alerts at 50%, 80%, 100% of projected monthly spend[17]; per-user and per-feature quotas[17].
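The execution-layer cap is a few lines of code. A sketch of a per-conversation circuit breaker in the spirit of the Cycles example above; the class name and thresholds are illustrative:

```python
# Per-conversation hard-cap sketch: refuse further model calls once the
# conversation's cumulative cost would exceed the cap.

class ConversationBudget:
    def __init__(self, cap_usd: float = 2.00):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def try_spend(self, call_cost_usd: float) -> bool:
        """Record the call if it fits under the cap; False = break the circuit."""
        if self.spent_usd + call_cost_usd > self.cap_usd:
            return False
        self.spent_usd += call_cost_usd
        return True

budget = ConversationBudget(cap_usd=2.00)
allowed = sum(budget.try_spend(0.15) for _ in range(20))  # 20 x $0.15 attempted
print(allowed, round(budget.spent_usd, 2))                # → 13 1.95
```

This is the "control at the execution layer" move: the worst-case conversation is bounded by construction, so pricing-layer variance can no longer produce a $310 user.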
FinOps tooling. Five gateway/observability stacks have matured[43]: Portkey / Helicone gateway proxies that inject per-request cost tracking without code changes[43]; Langfuse / Traceloop open-source LLM tracing with cost attribution at trace and span level[43]; Datadog LLM Observability enterprise-grade cost monitoring[43]; Vantage dedicated FinOps platform with an MCP server enabling agents to query cost data autonomously[43]; custom dashboards exporting token usage to Grafana/Metabase[43]. The architectural sequence: gateway captures the per-request data → observability layer aggregates and attributes → FinOps platform compares against budget → alerting at variance thresholds → budget enforcement at execution layer[43].
The architectural sequence in one sentence. "Keep APM for the surrounding application; add an LLM observability layer specifically for model calls, tracing, and evaluation; make sure cost data flows through both so engineering decisions connect to financial outcomes"[52]. The three-layer stack — APM, LLM observability, FinOps — share a single trace and a single cost ledger[52].
#Part VII: AI-Native Unit Economics — The 50-60% Gross Margin Reality
Traditional SaaS gross margins run 80-90%[16][20]. AI-native SaaS runs 50-60%[16][20][21]. The 23-33 percentage point gap is structural, not temporary[16][54]. Dynamic margin engineering is the discipline that closes the trajectory toward 65-75% over 18-24 months[22][54].
The gross margin reality. ICONIQ's January 2026 State of AI[21] put precise numbers on the shift: AI product gross margins climbing from 41% (2024)[21] to 52% (projected 2026)[21]. Companies with "balanced differentiation" (model + product innovation) hit 53%; pure application-layer companies sit at 45%[21]. Bessemer's portfolio data[23][20][22]: AI Supernovas average ~25% gross margins (often negative)[22] and reach ~$100M ARR in 1.5 years; Shooting Stars reach ~$100M ARR in 4 years with 60% margins[22]. Replit went from -14% to +36% gross margin during late-2025 optimization[16]; the trajectory is achievable. Udit's feature-type breakdown[54]: RAG margin 76-84%, chatbot 65-75%, document analysis 58-70%, code generation 65-75%, autonomous agents 35-55%, reasoning analysis 30-50%. Agent products are the lowest-margin feature category because they are "multi-step, with branching and retry logic that multiplies inference calls"[54].
The ARR multiple decoupling. Traditional SaaS prices at 8-12× revenue[22]. Sierra at $15.8B / $150M ARR = 105×[55]; Harvey at $11B / $190M ARR = 58×[55]; Replit at $9B / on-track-$1B = 9× forward[24][20]; Anthropic Series G at $380B post-money on $30B run-rate = 12.7× forward[29]. The market prices the trained-agent / vertical-AI as an asset distinct from the recurring software subscription[24]. a16z's Sarah Wang defense[24]: "Lower gross margins at a moment in time are not a long term indicator of a lack of a sustainable business model"[24]. Bessemer's read[22]: AI Supernovas show ARR/FTE of $1.13M[22] — 4-5× the typical SaaS benchmark[22].
The Jevons Paradox at play. Token prices fell 50× from late 2022[16] ($20/M for GPT-4-equivalent[16]) to August 2025 ($0.40/M)[16]. Yet enterprise AI spending surged 320%[16], from $11.5B (2024)[16] to $37B (2025)[16]. The average monthly AI budget hit $85,521 in 2025[16] (+36% YoY)[16], and organizations spending >$100K/month doubled to 45% of the market[16]. The diagnostic: "Inference prices have dropped roughly 1000x in three years"[23]; "the marginal saving from negotiating a 15% discount on inference is real"[23] but "dwarfed by the marginal saving from a retrieval architecture that hits the cache 60% of the time"[23], or "a structured-output schema that drops the retry rate from 22% to 3%"[23].
The 2026 renewal cliff. Bessemer's pricing playbook names the risk[20]: "Much of the 'sexy' AI products today live in soft ROI territory, which is dangerous for monetization"[20]. "As many enter renewal cycles for the first time in 2026, pricing will need to reflect actual value, not merely potential or promise"[20]. ICONIQ data[21]: 37% of companies plan to change their AI pricing model in the next year[21]; 58% still use subscription[21], 35% usage[21], and 18% outcome-based[21] (up from 2% in Q2 2025)[21]; hybrid structures are common — 49% use annual commitments[21] and 29% use overages at tiered rates[21]. Outcome-based is the rising default for support and sales[20]; subscription survives for predictable workloads[20].
The Midjourney TPU swap — extreme cost engineering. Midjourney migrated inference from NVIDIA A100/H100 to Google TPU v6e: monthly inference spend fell from $2.1M to under $700K — $16.8M in annualized savings[23]. H100 spot instances dropped from $7/hour to $1.49/hour at the market low[23]. The lesson: "Inference cost management is now a core competency for margin survival"[23]. Most AI companies don't have Midjourney's scale to negotiate custom infrastructure deals, but the principle holds — dynamic margin engineering is engineering, not procurement.
#Part VIII: The 7-Decision Dynamic-Margin-Engineering Playbook
The seven decisions that compose into a dynamic-margin-engineering posture.
Decision 1 — Build the multi-provider gateway before you need it. Vercel AI Gateway, OpenRouter, LiteLLM, Portkey, or Helicone — pick one[35][40][41][32][42] before your first $5K/month inference bill[15], not after[15]. The decision sequence: under $2K/month → OpenRouter or Portkey Free[15]; $2-10K/month → any of the viable gateways[15]; over $10K/month → LiteLLM self-hosted[15][33]. The most common stacking pattern is LiteLLM for routing plus Helicone for observability, or Portkey as the all-in-one[34]. Avoid stacking two routing gateways[34][32]. The Anthropic-OpenAI-DeepSeek-Gemini routing space is too large to manage with provider SDKs and a hand-rolled abstraction.
Decision 2 — Pin model versions to dated identifiers; never use "latest". Paper #10's eval-first migration discipline applies in reverse for dynamic margin: when Anthropic raises Opus prices, you want to know exactly which model you are billed against — not "the latest"[7]. Anthropic keeps deprecated models available at higher prices ($15/$75 for Opus 4 and 4.1 vs $5/$25 for 4.5/4.6/4.7)[11]. The dated pin protects you from silent downgrades and silent price hikes equally.
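The pinning discipline reduces to a small registry check. The task names and dated identifiers below are hypothetical; the shape — dated IDs only, floating aliases rejected before any request is made — is the point.

```python
# Map each workload to a dated model identifier; reject floating aliases
# like "-latest" at resolve time. The IDs below are hypothetical examples.
PINNED = {
    "triage":    "vendor-small-20260115",
    "reasoning": "vendor-large-20251101",
}

def resolve(task: str) -> str:
    """Return the pinned, dated model ID for a workload, or refuse an alias."""
    model = PINNED[task]
    if model.endswith("-latest") or "-20" not in model:
        raise ValueError(f"unpinned model alias: {model!r}")
    return model
```

Running the check at startup (resolve every task once) turns an accidental `-latest` commit into a deploy failure instead of a silent billing change.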
Decision 3 — Stack the eight cost modifiers from day one. Prompt caching (cache reads at 0.1× base[11]) + Batch API (50% off for async[11]) + 1M-context flat pricing[11][13] + tool-definition caching[31] + Haiku-batch for offline pipelines[14] = up to 100× cost reduction vs naive Opus usage[14]. PE Collective's worked example: a chatbot with a 2,000-token system prompt handling 100 messages/session, $0.006/message standard → $0.0006/message cached = $5,400/month saved across 1M messages[30]. The discount stack is the load-bearing infrastructure for AI gross margin; not stacking is leaving 60-90% on the table[17][43].
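PE Collective's worked example reduces to a few lines of arithmetic, assuming a Sonnet-class $3/M input list price and the 0.1× cache-read multiplier from [11]:

```python
# Reproduce the cached-system-prompt saving: a 2,000-token prompt re-sent
# on every message, 1M messages/month, cache reads at 0.1x the base rate.
BASE_INPUT_PER_TOKEN = 3.00 / 1_000_000      # Sonnet-class list price, $/token
CACHE_READ_MULTIPLIER = 0.1                  # cache reads at 0.1x base input
PROMPT_TOKENS = 2_000
MESSAGES_PER_MONTH = 1_000_000

standard = PROMPT_TOKENS * BASE_INPUT_PER_TOKEN                        # $0.006/message
cached = PROMPT_TOKENS * BASE_INPUT_PER_TOKEN * CACHE_READ_MULTIPLIER  # $0.0006/message
monthly_saving = (standard - cached) * MESSAGES_PER_MONTH              # $5,400/month
```

The same structure prices any other modifier in the stack — swap the multiplier (0.5 for Batch, 1.25× or 2× for cache writes) and the per-message delta falls out directly.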
Decision 4 — Negotiate committed spend at the $500K annual threshold; layer Bedrock EDP if you're AWS-committed. Vendor Benchmark's discount structure: a $500K-$999K commit = 18-24% off Anthropic Sonnet[47]; $1M-$2.4M = 24-30%[47]; $2.5M+ = 28-36%[47]. A Bedrock EDP overlay adds 20-30% for AWS-committed organizations[47]; combined Anthropic-direct plus Bedrock for strategic accounts reaches 27-35% off list[47]. Insist on a 12-month ramp from 30% to 100% of commitment[50], a price lock for the initial term[50], and substitution rights across Anthropic / OpenAI / Bedrock without breaching the agreement[50].
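The published bands lend themselves to a lookup, useful for sanity-checking a quote against the Vendor Benchmark figures[47] — a quote outside its band is worth questioning. The function below is an illustrative helper, not part of any vendor tooling.

```python
# Vendor Benchmark's 2026 committed-spend bands for Anthropic Sonnet [47]:
# return the published (low%, high%) discount range for an annual commit.
BANDS = [
    (2_500_000, (28, 36)),
    (1_000_000, (24, 30)),
    (500_000,   (18, 24)),
]

def discount_band(annual_commit_usd: float) -> tuple:
    for floor, band in BANDS:
        if annual_commit_usd >= floor:
            return band
    return (0, 0)   # below the $500K threshold: list price
```

A Bedrock EDP overlay (20-30%) would stack on top of the returned band for AWS-committed accounts, per the same source.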
Decision 5 — Reserve capacity for stable workloads; prepaid credits plus auto-recharge for variable ones. OpenAI Reserved Capacity at $260/unit/month (3-month commit)[44] or $220/unit/month (1-year)[44] protects predictability and capacity simultaneously; 1-year commitments save ~15% vs 3-month[44]. A GPT-5 400K-context instance runs $712,800 in annual commit[44]. For production workloads where the model and the cost are stable, this is a forward-priced position. For variable workloads, OpenAI prepaid billing[48] with auto-recharge[48] is the predictability tool — Replit Pro's tiered credits[10] and Anthropic's transition bundles (30% off)[25] are the equivalents at the platform layer.
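The 3-month vs 1-year tradeoff is simple arithmetic on the published unit rates[44]. The unit count below is a hypothetical workload size, not a published figure:

```python
# OpenAI Reserved Capacity unit rates [44]; UNITS is a hypothetical
# deployment size used only to show the dollar delta of committing longer.
RATE_3MO_PER_UNIT_MONTH = 260   # $/unit/month on a 3-month commitment
RATE_1YR_PER_UNIT_MONTH = 220   # $/unit/month on a 1-year commitment
UNITS = 50                      # hypothetical workload

yearly_saving_pct = 1 - RATE_1YR_PER_UNIT_MONTH / RATE_3MO_PER_UNIT_MONTH  # ~15%
annual_delta_usd = (RATE_3MO_PER_UNIT_MONTH - RATE_1YR_PER_UNIT_MONTH) * UNITS * 12
```

The longer commitment only pays off if the workload really is stable for the full year — which is exactly the stable/variable split the decision text draws.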
Decision 6 — Instrument every request with the 7-dimension cost record from day one. Input/output/cached tokens[52], model, dollar cost[52], user/tenant/session IDs[52], trace ID, tool calls, and latency[52]. Three CFO KPIs (CPT/CPR/CPAM)[22]. Six operational metrics: cost per trace[43], per user[43], per model tier[43], cache hit rate[43], tokens per tool call[43], and output token ratio[43]. Without this instrumentation, every optimization in Decisions 1-5 is guesswork[52][53]. The Cycles per-conversation cap is the load-bearing example: a $2/conversation cap moves gross margin from 23% to 68%[53], with 12% of users hitting the cap[53]. Budget enforcement belongs at the execution layer, not the pricing layer[53].
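A sketch of the record and the cap check, assuming the field names below — the seven dimensions are from [52] and the $2 cap figure from the Cycles example[53]; the enforcement logic itself is illustrative, not Cycles' implementation.

```python
from dataclasses import dataclass

@dataclass
class CostRecord:
    # The seven dimensions: tokens (input/output/cached), model, dollar
    # cost, identity, trace, tool calls, latency — one record per model call.
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    model: str
    usd_cost: float
    tenant_id: str      # stands in for the user/tenant/session IDs
    trace_id: str
    tool_calls: int
    latency_ms: float

def under_cap(records: list, cap_usd: float = 2.00) -> bool:
    """Per-conversation budget check, evaluated before the next model call."""
    return sum(r.usd_cost for r in records) < cap_usd
```

Because the check runs at the execution layer, a capped conversation stops spending before the next call is made — which is what moves the margin, not a line item on the invoice.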
Decision 7 — Choose your pass-through pattern explicitly; communicate it transparently. Five patterns: effort-based (Replit), annual-buckets (Intercom Fin), tiered-credits (Cursor Pro/Pro Plus/Ultra), subscription-with-caps (Claude Code weekly limits), outcome-based (Decagon/Sierra/Salesforce)[8][1][3][19][20]. The wrong answer is opaque pass-through — Cursor's June 2025 communication failure ($10 in usage charges within the first week without warning[2]) is the canonical case[1]. The right answer for usage-based: clear cost dials (Replit Economy/Power/Turbo[10]) + pre-purchased buckets[19] + per-tier credits[8]. For outcome-based: contractually-defined resolution criteria before deployment[19][20] to prevent billing disputes.
The seven decisions compose. Skip Decision 1 and Decisions 3 and 7 become harder to operationalize. Skip Decision 6 and Decisions 3 and 5 become guesswork. The agent companies that survive the 2026 renewal cliff[20] and the model-provider pricing-pressure cycles are the ones that compose all seven into a posture, not the ones that adopt one in isolation. The architecture is dull; the discipline isn't. Dynamic margin engineering separates the agent businesses that have a P&L from the ones that have a runway.
#Glossary
Active CPU Pricing (Vercel): Pricing model where you pay CPU rates only while the CPU is actively executing; memory-only rates apply the rest of the time. For AI Gateway: CPU rates billed for <8% of runtime instead of 100%[39].
Batch API: Asynchronous processing tier offering 50% discount on input and output tokens with a 24-hour SLA. Stacks with prompt caching[11].
Bedrock EDP (Enterprise Discount Program): AWS's enterprise commitment program that allows AWS-committed organizations to apply EDP discounts to Bedrock spend, typically 20-30% off list[47].
Cache hit / cache read: A subsequent request that retrieves cached content rather than recomputing. Anthropic prices cache hits at 0.1× base input[11].
Cache write (5m / 1h): Initial cache creation, priced at 1.25× (5-minute TTL) or 2× (1-hour TTL) base input on Anthropic models[11].
CPR / CPT / CPAM: Cost per Resolution, Cost per Thousand tokens, Cost per Agent Minute — the three CFO-level KPIs for agent unit economics[22].
Effort-Based Pricing (Replit): Pricing model where cost per request is determined by actual compute resources consumed, not a flat per-checkpoint fee. Simple requests as low as $0.06; complex requests "sometimes multiple dollars"[3][4].
Fluid Compute (Vercel): Vercel's serverless runtime that bills CPU time only when active. AI Gateway runs on Fluid compute[39].
Jevons Paradox: The phenomenon where falling input costs increase total consumption enough to raise total spend. Token prices fell 50× from 2022-2025; enterprise AI spending grew 320%[16].
Per-Resolution Pricing: Outcome-based pricing where the buyer pays only when the AI successfully resolves a query end-to-end. Range $0.50-$2.00 per resolution as of 2026[19][20].
Priority Processing (OpenAI): Per-request premium tier with 99.9% uptime SLA and latency guarantees. service_tier="priority" parameter[46][38].
Reserved Capacity (OpenAI): Static dedicated compute allocation for Enterprise customers with 3-month or 1-year commitments, 99.5% uptime SLA, and on-call engineering support. GPT-5 instances start at $59,400-$70,200/month[44].
Scale Tier (OpenAI): Pay-as-you-go-but-bundled offering. Purchase token units (TPM) upfront with 30-day minimum per unit, 99.9% uptime SLA[45].
Service Tier (OpenAI): Request-level processing tier control. Values: default (standard), priority (premium speed), flex (lower cost, higher latency)[38].
#Related Research
- The Agent Operating Procedure Playbook — the AOP needs a cost envelope to be deployable at scale; dynamic margin engineering supplies it.
- The Agent Marketplace Thesis — the per-resolution pricing standard meets the gateway routing layer here.
- The Klarna AI Postmortem 2024–2026 — Klarna's cost-evaluation-dominated decision was the dynamic-margin failure.
#References
- Maxwell Zeff / TechCrunch (2025-07-07), Cursor apologizes for unclear pricing changes that upset users. https://techcrunch.com/2025/07/07/cursor-apologizes-for-unclear-pricing-changes-that-upset-users/
- Tessl (2025-08-20), Cursor's new pricing structure explained. https://tessl.io/blog/cursor-new-pricing-structure-explained/
- Replit (2025-06-18), Introducing Effort-Based Pricing for Replit Agent. https://replit.com/blog/effort-based-pricing
- Replit (2025-07-13), Effort-Based Pricing Recap. https://replit.com/blog/effort-based-pricing-recap
- Longbridge / The Information (2026-04-15), Demand Surges, Computing Power Strained! Anthropic Adjusts Claude Enterprise Pricing to Usage-Based Model. https://longbridge.com/en/news/282775400
- Rivo Raphaël Chreçant / Cambridge Analytica (2026-05-09), OpenAI's GPT-5.5 uses fewer tokens but just raised prices again. https://cambridgeanalytica.org/ai-machine-learning/openai-gpt-55-price-increase-may-2026-50949/
- TokenMix (2026-04-04), Anthropic API Pricing 2026: Cache, Batch, Data Residency Fees. https://tokenmix.ai/blog/anthropic-api-pricing
- Maxwell Zeff / TechCrunch (2025-06-17), Anysphere launches a $200-a-month Cursor AI coding subscription. https://techcrunch.com/2025/06/17/anysphere-launches-a-200-a-month-cursor-ai-coding-subscription/
- Rachel Metz / Bloomberg (2025-11-13), AI Startup Cursor Raises Funds at $29.3 Billion Valuation. https://www.bloomberg.com/news/articles/2025-11-13/ai-startup-cursor-raises-funds-at-29-3-billion-value-wsj-says
- Replit (2026-02-24), Replit Pro Is Here — and Core Now Offers Even Better Value. https://blog.replit.com/pro-plan
- Anthropic, Pricing - Claude API Docs. https://docs.anthropic.com/en/docs/about-claude/pricing
- Anthropic (2025-08-14), Prompt caching with Claude. https://anthropic.com/news/prompt-caching
- Ankit Aglawe / TokenCost (2026-03-24), Claude 1M Token Cost: Long-Context Surcharge Gone. https://tokencost.app/blog/anthropic-long-context-flat-pricing
- PE Collective (2026-04-21), Claude Pricing Guide 2026: Opus, Sonnet, Haiku Costs with Batch and Caching Discounts. https://pecollective.com/tools/claude-pricing-guide/
- CallSphere (2026-05-09), Image understanding and OCR in 2026: Smart routing across providers. https://callsphere.ai/blog/llm-comparison-image-understanding-ocr-hybrid-router-may-2026
- Zylos Research (2026-03-29), AI Agent Platform Economics: Pricing Models, Unit Economics, and Subscription Lifecycle Management. https://zylos.ai/research/2026-03-29-ai-agent-platform-economics-pricing-unit-economics
- Zylos Research (2026-04-12), AI Agent Cost Optimization: Token Budgets, Model Routing, and Production FinOps. https://zylos.ai/research/2026-04-12-ai-agent-cost-optimization-token-budget-model-routing
- Particula Tech (2026-04-28), DeepSeek V4 vs Kimi K2.6 vs GLM-5.1: Open-Weight Coding Tested. https://particula.tech/blog/deepseek-v4-vs-kimi-k2-6-vs-glm-5-1-open-weight-coding
- Stripe, Intercom on the evolution of value-based pricing for AI agents like Fin. https://stripe.com/ie/customers/intercom-pricing
- Christine Deakers / Bessemer Venture Partners (2026-02-10), The AI pricing and monetization playbook. https://www.bvp.com/atlas/the-ai-pricing-and-monetization-playbook
- Jason Lemkin / SaaStr (2026-02-07), 5 Key Takeaways from ICONIQ's State of AI Report. https://www.saastr.com/the-execution-era-of-ai-5-key-takeaways-from-iconiqs-state-of-ai-report/
- Christine Deakers / Bessemer Venture Partners (2025-08-12), The State of AI 2025. https://www.bvp.com/atlas/the-state-of-ai-2025
- Christine Deakers / Bessemer Venture Partners (2025-11-10), Scaling an AI Supernova: Lessons from Anthropic, Cursor, and fal. https://www.bvp.com/atlas/scaling-an-ai-supernova-lessons-from-anthropic-cursor-and-fal
- Sarah Wang / a16z (2025-09-12), Questioning Margins is a Boring Cliche…. https://www.a16z.news/p/questioning-margins-is-a-boring-cliche
- Ankit Aglawe / TokenCost (2026-04-08), Claude Code third-party pricing change. https://tokencost.app/blog/claude-code-third-party-pricing-change
- ThePlanetTools (2026-04-05), 50x Cost Spike: Anthropic Cuts Off 135K Claude OpenClaw Instances. https://theplanettools.ai/blog/anthropic-blocks-openclaw-third-party-tools-claude-subscriptions-april-2026
- OpenAI, API Pricing. https://openai.com/api/pricing/
- AInvest (2026-04-06), OpenAI's Inference Spend Now Outpaces GPT-4 Training Costs. https://www.ainvest.com/news/openai-inference-spend-outpaces-gpt-4-training-costs-structural-loss-fueling-capital-efficiency-race-2604/
- Alina Maria Stan / TNW (2026-04-07), Anthropic signs biggest compute deal yet with Google and Broadcom as run rate hits $30bn. https://thenextweb.com/news/anthropic-google-broadcom-compute-deal
- ClaudeGuide (2026-04-26), Claude Prompt Caching: The 90% Discount Most Claude Developers Miss. https://claudeguide.io/claude-prompt-caching-guide
- LLMversus (2026-04-15), Hidden Cost of LLM Caching: Anthropic vs OpenAI 2026. https://llmversus.com/blog/hidden-cost-llm-caching-anthropic-vs-openai
- Dmytro Klymentiev (2026-05-10), LLM Gateway 2026: OpenRouter vs LiteLLM vs Portkey vs Helicone. https://klymentiev.com/blog/llm-gateway-guide
- AICraftGuide (2026-05-04), LiteLLM vs Portkey vs OpenRouter: LLM Gateway 2026. https://aicraftguide.com/article/litellm-vs-portkey-vs-openrouter-llm-gateway-cost-control-2026
- LLMversus (2026-04-16), LLM Gateway Comparison 2026: OpenRouter vs LiteLLM vs Portkey vs Vercel AI Gateway. https://llmversus.com/blog/llm-gateway-comparison-2026
- Vercel, AI Gateway Overview. https://vercel.com/docs/ai-gateway
- Vercel (2026-02-26), AI Gateway Pricing. https://vercel.com/docs/ai-gateway/pricing
- Vercel, AI Gateway Model Fallbacks. https://vercel.com/docs/ai-gateway/models-and-providers/model-fallbacks
- Vercel, AI Gateway Service Tiers + Custom Reporting. https://vercel.com/docs/ai-gateway/capabilities/service-tiers
- Vercel, How AI Gateway runs on Fluid compute. http://vercel.com/blog/how-ai-gateway-runs-on-fluid-compute
- OpenRouter, Provider Routing. https://openrouter.helicone.ai/docs/guides/routing/provider-selection
- LiteLLM (BerriAI/jibanez-staticduo), LiteLLM GitHub. https://github.com/jibanez-staticduo/litellm
- Helicone, Provider Routing. https://docs.helicone.ai/gateway/provider-routing
- Zylos Research (2026-02-19), AI Agent Cost Optimization: Token Economics and FinOps in Production. https://zylos.ai/research/2026-02-19-ai-agent-cost-optimization-token-economics
- OpenAI, Reserved Capacity for API Customers. https://openai.com/reserved-capacity/
- OpenAI, Scale Tier for API Customers. https://openai.com/api-scale-tier/
- OpenAI, Priority Processing for API Customers. https://openai.com/api-priority-processing
- Fredrik Filipsson / Vendor Benchmark (2026-03-03), Anthropic Claude Enterprise Pricing Benchmarks 2026. https://vendorbenchmark.com/blog/anthropic-claude-enterprise-pricing-benchmark.html
- OpenAI, How can I set up prepaid billing?. https://help.openai.com/en/articles/8264644-what-is-prepaid-billing
- OpenAI, Flexible pricing for the Enterprise, Edu, and Business plans. https://help.openai.com/en/articles/11487671-flexible-pricing-for-the-enterprise-edu-and-business-plans
- Redress Compliance, Enterprise AI Licensing Guide 2026. https://redresscompliance.com/enterprise-ai-licensing-guide-2026-openai-anthropic-google-aws.html
- Anthropic (2026-05-06), Higher usage limits for Claude and a compute deal with SpaceX. https://www.anthropic.com/news/higher-limits-spacex
- Keith MacKenzie / CloudZero (2026-04-22), What Is LLM Observability? For CFOs & Engineers, Cost Is Missing. https://www.cloudzero.com/blog/llm-observability/
- Cycles (2026-03-22), AI Agent Unit Economics: Cost and Margin Analysis. https://runcycles.io/blog/ai-agent-unit-economics-cost-per-conversation-per-user-margin
- Udit, Saas gross margin ai era. https://udit.co/blog/raw/saas-gross-margin-ai-era
- Ashley Capoot / CNBC (2026-03-25), Legal AI startup Harvey raises $200 million at $11 billion valuation. https://www.cnbc.com/2026/03/25/legal-ai-startup-harvey-raises-200-million-at-11-billion-valuation.html