We Simulated a Runaway Agent and Left It Running Overnight. Here's What the Bill Looked Like.

A runaway agent in a retry loop costs more than most teams expect. Here's the math, why SDK wrappers don't stop it, and how a proxy does.


The Failure Mode

Here's a bug I've seen in three different production systems in the last six months.

An agent is orchestrating a multi-step workflow. One tool call starts returning a transient error. The agent, faithfully following its retry logic, calls the tool again. The tool errors again. The agent retries. This repeats -- not forever in theory, but in practice, "not forever" and "not before your billing dashboard sends an alert" are different things.

The agent isn't broken in any obvious way. It's doing exactly what it was told to do. The LLM is responding. The SDK isn't throwing. There's no crash to page on. The loop is silent.

Here's what the silence costs.

The Math

Model: GPT-4o. Pricing: a flat $0.01 per 1K tokens (GPT-4o's output rate, applied to every token to keep the arithmetic simple). Tokens per call: 2,000 (a typical tool-call context with system prompt and history). Loop speed: 1 call every 10 seconds (a reasonable retry backoff that still loops fast).

Per-hour cost: 360 calls x 2,000 tokens x $0.01/1K = $7.20/hour

Leave it running overnight (8 hours): $57.60

Leave it running over a long weekend (72 hours): $518.40

This is the conservative case. If your agent uses GPT-4o with a 16K context (not unusual for document processing), multiply by 8. A weekend loop at 16K tokens costs $4,147.
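The arithmetic above is easy to re-run with your own numbers. Here is a minimal sketch, assuming the same flat per-token rate used in this post; `loop_cost_usd` and its parameters are illustrative names, not part of any SDK.

```python
def loop_cost_usd(tokens_per_call, seconds_between_calls, hours, usd_per_1k_tokens=0.01):
    """Estimated spend for an agent stuck retrying at a fixed interval."""
    calls = hours * 3600 / seconds_between_calls
    return calls * tokens_per_call / 1000 * usd_per_1k_tokens


overnight = loop_cost_usd(2_000, 10, 8)        # 8 hours at 2K tokens per call
weekend = loop_cost_usd(2_000, 10, 72)         # 72 hours at 2K tokens per call
weekend_16k = loop_cost_usd(16_000, 10, 72)    # 72 hours at 16K tokens per call

print(f"overnight ${overnight:.2f}, weekend ${weekend:.2f}, weekend at 16K ${weekend_16k:.2f}")
```

Swap in your own model's rate, context size, and retry interval to see where your worst case lands.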

A figure like $47K takes a model with higher per-token pricing, a larger context, and a faster retry loop. We've seen real incidents in that range posted on social media. The math is not hard.

Why SDK Wrappers Don't Stop This

The natural first instinct is to wrap your LLM client in a budget-tracking class:

class BudgetExceededError(Exception):
    """Raised once recorded spend passes the configured limit."""


class BudgetedOpenAI:
    def __init__(self, client, max_spend_usd):
        self.client = client
        self.max_spend = max_spend_usd
        self.spent = 0.0

    def chat_completions_create(self, **kwargs):
        response = self.client.chat.completions.create(**kwargs)  # call fires HERE
        tokens = response.usage.total_tokens
        cost = tokens * 0.00001  # flat $0.01 per 1K tokens
        self.spent += cost
        if self.spent > self.max_spend:  # checked only after the call is billed
            raise BudgetExceededError(
                f"Spent ${self.spent:.2f} of ${self.max_spend:.2f} budget"
            )
        return response

This looks reasonable. But the call fires before you check the budget. By the time you count the tokens and compare against max_spend, the request has already gone to OpenAI and the response has come back. The tokens have been processed. You've already incurred the cost.

In a fast retry loop, the check might run 50 or 100 times before the wrapper actually stops anything -- each check arriving after the damage is done.
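To see the ordering problem concretely, here is a small simulation. `FakeClient` is a hypothetical stand-in that counts billed calls instead of contacting a provider, and `post_check_call` mirrors the check-after-call ordering of the wrapper; none of these names come from a real SDK.

```python
class FakeClient:
    """Hypothetical stand-in for an LLM client: every call is 'billed'."""

    def __init__(self, cost_per_call_usd):
        self.cost_per_call_usd = cost_per_call_usd
        self.billed_usd = 0.0

    def create(self):
        self.billed_usd += self.cost_per_call_usd
        return self.cost_per_call_usd


class BudgetExceeded(Exception):
    pass


def post_check_call(client, state, max_spend_usd):
    """Call first, check afterwards (the same ordering as the wrapper)."""
    state["spent"] += client.create()   # the provider has already billed this
    if state["spent"] > max_spend_usd:
        raise BudgetExceeded()


client = FakeClient(cost_per_call_usd=0.02)   # 2,000 tokens at $0.01 per 1K
state = {"spent": 0.0}
try:
    while True:                               # the runaway retry loop
        post_check_call(client, state, max_spend_usd=0.05)
except BudgetExceeded:
    pass

print(f"limit $0.05, billed ${client.billed_usd:.2f}")
```

The check trips only after the call that crossed the limit has been billed, so spend lands above the budget rather than at or below it.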

There's a subtler problem too: SDK wrappers are fragile. OpenAI shipped openai>=1.0.0 with a redesigned client interface. Anthropic changed their streaming API. If your wrapper monkeypatches a method that gets renamed in the next SDK release, your budget enforcement silently breaks and you won't know until the bill arrives.

The Proxy Approach

A proxy intercepts the HTTP request before it reaches the provider.

Your agent code
     |
[Elevation proxy]  <- budget check happens HERE (< 1ms)
     |
  OpenAI / Anthropic / Gemini

The budget check is a Redis read. If the budget is exceeded, the proxy returns 402 Payment Required immediately. The upstream call never fires. There is no network round-trip to OpenAI. There are no tokens billed. The loop stops at the proxy.

This is the one point in the call path where a request can be blocked regardless of which SDK, wrapper, or retry logic sent it.
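As a rough illustration of the check-before-forward ordering, the sketch below uses a plain dict in place of Redis and a callback in place of the upstream HTTP call. `handle_request`, `spend_by_agent`, and `forward_upstream` are hypothetical names; this is not Elevation's actual code.

```python
from http import HTTPStatus


def handle_request(agent_id, spend_by_agent, budget_usd, forward_upstream):
    """Check recorded spend first; forward only if the agent is under budget."""
    spent = spend_by_agent.get(agent_id, 0.0)     # stands in for the Redis read
    if spent >= budget_usd:
        # Rejected at the proxy: forward_upstream is never invoked.
        return HTTPStatus.PAYMENT_REQUIRED.value, None
    body, cost_usd = forward_upstream()
    spend_by_agent[agent_id] = spent + cost_usd   # record usage after the response
    return HTTPStatus.OK.value, body


upstream_calls = []


def forward_upstream():
    """Pretend provider call: returns a body and what it cost."""
    upstream_calls.append(1)
    return "completion text", 0.02


spend = {"agent-prod-workflow": 0.0}
statuses = [
    handle_request("agent-prod-workflow", spend, 0.05, forward_upstream)[0]
    for _ in range(6)
]
print(statuses, len(upstream_calls))
```

Once recorded spend reaches the limit, the remaining requests in this run are answered with 402 and the upstream callback is not invoked for them. This simplified version checks spend already recorded; rejecting a request that would exceed the budget additionally needs a cost estimate before forwarding.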

The Three-Line Change

Elevation is self-hosted today -- get it running with Docker Compose in about 5 minutes. Once it's up, routing an existing OpenAI client through it only touches the client constructor: point base_url at the proxy and add the key header.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/openai",  # route requests through the proxy
    default_headers={"X-Elevation-Key": "your-api-key"},
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

For Anthropic:

import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:3001/anthropic",
    default_headers={"X-Elevation-Key": "your-api-key"},
)

Budget Configuration

Once the proxy is running, set per-agent budgets via the API:

curl -X POST http://localhost:3002/api/budgets \
  -H "Content-Type: application/json" \
  -d '{"agentId": "agent-prod-workflow", "dailyBudgetUsd": 50.00, "hourlyBudgetUsd": 10.00, "mode": "kill"}'

With "mode": "kill", any request that would exceed the budget returns 402 immediately -- the upstream provider never sees it.
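On the agent side, it helps if the retry logic treats a 402 as terminal rather than as one more transient failure. Here is a minimal sketch, assuming the client surfaces the HTTP status on the exception it raises; `StatusError`, `call_with_retries`, and `max_attempts` are illustrative names rather than any SDK's API.

```python
import time


class StatusError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def call_with_retries(call, max_attempts=5, backoff_seconds=0.0):
    """Retry transient failures, but stop at once on 402 or after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except StatusError as err:
            if err.status_code == 402 or attempt == max_attempts:
                raise                      # budget exhausted, or out of attempts
            time.sleep(backoff_seconds * attempt)


attempts = []


def flaky_then_blocked():
    attempts.append(1)
    # Two transient 503s, then the proxy starts answering 402.
    raise StatusError(503 if len(attempts) < 3 else 402)


try:
    call_with_retries(flaky_then_blocked, max_attempts=10)
except StatusError as err:
    final_status = err.status_code

print(final_status, len(attempts))
```

The attempt cap matters as much as the 402 check: it bounds the loop even when no budget has been configured.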

Caveats

The proxy adds one network hop. In practice this is negligible (in our local testing it adds less than 2 ms at p99 on the same host), but if you're running agents with sub-10 ms latency requirements, measure first.

Budget enforcement operates at the request level and is not yet atomic. A concurrent burst of 10 requests can all check the budget before any of their usage is recorded, so all 10 can pass and spend can overshoot the limit by the cost of that burst. Hard atomic enforcement is on the roadmap.
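The race described above is a check-then-record gap. One common way to close it, shown here purely as an illustration and not as Elevation's planned design, is to reserve an estimated cost in the same atomic step as the check. The sketch uses a `threading.Lock` inside one process; `BudgetLedger` and `try_reserve` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BudgetLedger:
    """In-process ledger where the check and the reservation share one lock."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.reserved_usd = 0.0
        self._lock = threading.Lock()

    def try_reserve(self, estimated_cost_usd):
        with self._lock:
            if self.reserved_usd + estimated_cost_usd > self.limit_usd:
                return False               # would exceed the limit: reject
            self.reserved_usd += estimated_cost_usd
            return True


ledger = BudgetLedger(limit_usd=1.00)
with ThreadPoolExecutor(max_workers=10) as pool:
    # Ten concurrent requests, each estimated at $0.25, against a $1.00 limit.
    results = list(pool.map(lambda _: ledger.try_reserve(0.25), range(10)))

print(results.count(True), ledger.reserved_usd)
```

Only four of the ten reservations succeed, however the threads interleave. Across processes or hosts, the same idea needs an atomic operation in the shared store rather than a local lock.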

This does not replace rate limiting, quotas, or alerting from your LLM provider. It's an additional layer, not a replacement.

Get It

Elevation is free and open source under the MIT license.

GitHub: https://github.com/elevation-networks/elevation

git clone https://github.com/elevation-networks/elevation
cd elevation
docker compose up

If you've had a runaway agent incident (or a near-miss), I'd genuinely like to hear what happened. The failure modes are more varied than I expected. Reach out: [email protected]