We Simulated a Runaway Agent and Left It Running Overnight. Here's What the Bill Looked Like.
A runaway agent in a retry loop costs more than most teams expect. Here's the math, why SDK wrappers don't stop it, and how a proxy does.
The Failure Mode
Here's a bug I've seen in three different production systems in the last six months.
An agent is orchestrating a multi-step workflow. One tool call starts returning a transient error. The agent, faithfully following its retry logic, calls the tool again. The tool errors again. The agent retries. This repeats -- not forever in theory, but in practice, "not forever" and "not before your billing dashboard sends an alert" are different things.
The agent isn't broken in any obvious way. It's doing exactly what it was told to do. The LLM is responding. The SDK isn't throwing. There's no crash to page on. The loop is silent.
Here's what the silence costs.
The Math
Model: GPT-4o. Pricing: $0.01 per 1K tokens (the output-token rate, applied here as a flat rate to every token to keep the arithmetic simple). Tokens per call: 2,000 (a typical tool-call context with system prompt and history). Loop speed: 1 call every 10 seconds (a reasonable retry backoff that still loops fast).
Per-hour cost: 360 calls x 2,000 tokens x $0.01/1K = $7.20/hour
Leave it running overnight (8 hours): $57.60
Leave it running over a long weekend (72 hours): $518.40
This is the conservative case. If your agent uses GPT-4o with a 16K context (not unusual for document processing), multiply by 8. A weekend loop at 16K tokens costs $4,147.
Getting to a "$47K" bill takes a model with higher per-token pricing, a larger context, and a faster retry loop. We've seen real incidents in that range posted on social media. The math is not hard.
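The arithmetic above is easy to rerun with your own numbers. Here is a minimal sketch; the function name `loop_cost_usd` and its parameters are illustrative, and the flat $0.01 per 1K tokens rate is the same simplification used in the figures above, not a quote of any provider's current price list.

```python
def loop_cost_usd(hours, tokens_per_call, usd_per_1k_tokens=0.01, seconds_per_call=10):
    """Estimated spend for a retry loop that runs unattended for `hours`."""
    calls = hours * 3600 / seconds_per_call
    return calls * tokens_per_call / 1000 * usd_per_1k_tokens


# The three scenarios from the text, at 2,000 tokens per call.
for label, hours in [("1 hour", 1), ("overnight (8 h)", 8), ("long weekend (72 h)", 72)]:
    print(f"{label}: ${loop_cost_usd(hours, 2_000):,.2f}")

# The 16K-context variant over the same long weekend.
print(f"long weekend at 16K tokens: ${loop_cost_usd(72, 16_000):,.2f}")
```

Plug in your real model price, context size, and backoff interval before trusting the result; input and output tokens are usually priced differently, which this sketch ignores.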
Why SDK Wrappers Don't Stop This
The natural first instinct is to wrap your LLM client in a budget-tracking class:

    class BudgetExceededError(Exception):
        """Raised once tracked spend passes the configured limit."""

    class BudgetedOpenAI:
        def __init__(self, client, max_spend_usd):
            self.client = client
            self.max_spend = max_spend_usd
            self.spent = 0.0

        def chat_completions_create(self, **kwargs):
            response = self.client.chat.completions.create(**kwargs)  # call fires HERE
            tokens = response.usage.total_tokens
            cost = tokens * 0.00001  # flat $0.01 per 1K tokens
            self.spent += cost
            if self.spent > self.max_spend:  # checked only after the call has returned
                raise BudgetExceededError(
                    f"Spent ${self.spent:.2f}, over the ${self.max_spend:.2f} budget"
                )
            return response

This looks reasonable. But the call fires before you check the budget. By the time you count the tokens and check your max_spend, the request has already been sent to OpenAI. The tokens are already being processed. You've already incurred the cost.
In a fast retry loop, the budget check might run 50 or 100 times before the wrapper can actually stop anything -- and every one of those checks arrives after the request it was meant to guard has already been billed.
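To make the ordering problem concrete, here is a small simulation. Everything in it is a stand-in: `FakeClient` counts calls instead of contacting a provider, `PostHocBudget` mirrors the check-after-call pattern, and `naive_retry_loop` assumes retry logic that treats every exception as transient, which is one common way the budget error gets swallowed.

```python
class BudgetExceededError(Exception):
    pass


class FakeClient:
    """Hypothetical stand-in for an LLM client; counts calls that would be billed."""

    def __init__(self, tokens_per_call=2_000):
        self.calls = 0
        self.tokens_per_call = tokens_per_call

    def create(self, **kwargs):
        self.calls += 1  # with a real provider, the cost is incurred here
        return self.tokens_per_call


class PostHocBudget:
    """Budget wrapper that can only check spend after the call has been made."""

    def __init__(self, client, max_spend_usd, usd_per_token=0.00001):
        self.client = client
        self.max_spend = max_spend_usd
        self.usd_per_token = usd_per_token
        self.spent = 0.0

    def create(self, **kwargs):
        tokens = self.client.create(**kwargs)  # request goes out first
        self.spent += tokens * self.usd_per_token
        if self.spent > self.max_spend:  # too late for this request
            raise BudgetExceededError("over budget")
        return tokens


def naive_retry_loop(llm, attempts):
    """Retry logic that treats every exception, including the budget one, as transient."""
    for _ in range(attempts):
        try:
            llm.create(prompt="call the flaky tool")
        except Exception:
            continue


client = FakeClient()
budgeted = PostHocBudget(client, max_spend_usd=0.05)
naive_retry_loop(budgeted, attempts=100)
print(f"calls sent: {client.calls}, tracked spend: ${budgeted.spent:.2f}")
```

Under these assumptions the wrapper raises on every call once the limit is passed, yet every call still reaches the client, because the check sits on the wrong side of the request.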
There's a subtler problem too: SDK wrappers are fragile. OpenAI shipped openai>=1.0.0 with a redesigned client interface. Anthropic changed their streaming API. If your wrapper monkeypatches a method that gets renamed in the next SDK release, your budget enforcement silently breaks and you won't know until the bill arrives.
The Proxy Approach
A proxy intercepts the HTTP request before it reaches the provider.

    Your agent code
          |
    [Elevation proxy]  <- budget check happens HERE (< 1ms)
          |
    OpenAI / Anthropic / Gemini

The budget check is a Redis read. If the budget is exceeded, the proxy returns 402 Payment Required immediately. The upstream call never fires. There is no network round-trip to OpenAI. There are no tokens billed. The loop stops at the proxy.
This is the only place in the call path where you can actually block a request.
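The check-before-forward ordering can be sketched in a few lines. This is not Elevation's implementation: `ProxySketch`, `Budget`, and `fake_upstream` are hypothetical names, a plain dict stands in for the Redis read, and the upstream call is a local function so the example runs without a network.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0


class ProxySketch:
    def __init__(self, budgets, forward):
        self.budgets = budgets  # dict standing in for the Redis lookup
        self.forward = forward  # callable that performs the upstream request

    def handle(self, agent_id, request):
        budget = self.budgets[agent_id]
        if budget.spent_usd >= budget.limit_usd:
            # Budget check happens first: the upstream callable is never invoked.
            return 402, {"error": "budget exceeded"}
        status, body, cost_usd = self.forward(request)
        budget.spent_usd += cost_usd  # usage is recorded after the response
        return status, body


upstream_calls = []


def fake_upstream(request):
    """Hypothetical provider: records the call and reports a $0.02 cost."""
    upstream_calls.append(request)
    return 200, {"ok": True}, 0.02


proxy = ProxySketch({"agent-prod-workflow": Budget(limit_usd=0.05)}, fake_upstream)
statuses = [proxy.handle("agent-prod-workflow", {"n": i})[0] for i in range(5)]
print(statuses, len(upstream_calls))
```

In this sketch the third request is still forwarded because the budget was not yet exhausted when it arrived, so spend can overshoot the limit by up to one request. From the fourth request on, nothing reaches the upstream function.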
The Three-Line Change
Elevation is self-hosted today -- get it running with Docker Compose in about 5 minutes. Once it's up, routing an existing OpenAI client through it means pointing base_url at the proxy and adding the Elevation key header:

    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:3001/openai",  # route requests through the proxy
        default_headers={"X-Elevation-Key": "your-api-key"},  # identifies you to Elevation
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
    )

For Anthropic:

    import anthropic

    client = anthropic.Anthropic(
        base_url="http://localhost:3001/anthropic",
        default_headers={"X-Elevation-Key": "your-api-key"},
    )

Budget Configuration
Once the proxy is running, set per-agent budgets via the API:

    curl -X POST http://localhost:3002/api/budgets \
      -H "Content-Type: application/json" \
      -d '{"agentId": "agent-prod-workflow", "dailyBudgetUsd": 50.00, "hourlyBudgetUsd": 10.00, "mode": "kill"}'

With "mode": "kill", any request that would exceed the budget returns 402 immediately -- the upstream provider never sees it.
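A 402 only ends the loop if the agent's retry logic treats it as terminal instead of as one more transient failure. Here is a minimal sketch of that rule. The helper `call_with_retries`, the `BudgetExhausted` exception, and the `send` callable returning a `(status, body)` pair are all assumptions for illustration; how your SDK actually surfaces a 402 depends on the client library.

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}  # statuses this sketch treats as transient


class BudgetExhausted(Exception):
    """Raised when the proxy answers 402; retrying cannot help."""


def call_with_retries(send, max_attempts=5, backoff_s=0.0):
    """Call `send()` until it succeeds, giving up immediately on a 402."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status == 402:
            raise BudgetExhausted(f"budget exceeded after {attempt} attempt(s)")
        if status in RETRYABLE and attempt < max_attempts:
            time.sleep(backoff_s * attempt)  # linear backoff between retries
            continue
        return status, body


# Simulated responses: two transient errors, then the proxy's budget refusal.
responses = iter([(503, {}), (503, {}), (402, {"error": "budget exceeded"}), (200, {})])
attempts = []


def fake_send():
    attempts.append(1)
    return next(responses)


stopped = False
try:
    call_with_retries(fake_send, max_attempts=10)
except BudgetExhausted:
    stopped = True

print(len(attempts), stopped)
```

Without a rule like this, the agent keeps hammering the proxy. That costs no tokens, but the workflow never surfaces the failure to a human.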
Caveats
The proxy adds one network hop. In practice this is negligible (our p99 in local testing adds < 2ms on the same host), but if you're running agents with sub-10ms latency requirements, measure first.
Budget enforcement operates at the request level. A concurrent burst of 10 requests can all check the budget simultaneously before any usage is recorded. Hard atomic enforcement is on the roadmap.
This does not replace rate limiting, quotas, or alerting from your LLM provider. It's an additional layer, not a replacement.
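The concurrency caveat above comes from checking the budget and recording usage as two separate steps. One way to close that gap is to reserve an estimated cost atomically before forwarding. The sketch below shows the idea with an in-process lock. It is an illustration only, not Elevation's roadmap design; `AtomicBudget` and `try_reserve` are hypothetical names, and a Redis-backed version would need an equivalent atomic operation on the server side.

```python
import threading


class AtomicBudget:
    """Check and reserve in one critical section so concurrent bursts cannot overshoot."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.reserved_usd = 0.0
        self._lock = threading.Lock()

    def try_reserve(self, estimated_cost_usd):
        with self._lock:
            if self.reserved_usd + estimated_cost_usd > self.limit_usd:
                return False  # caller should answer 402 without forwarding
            self.reserved_usd += estimated_cost_usd
            return True


budget = AtomicBudget(limit_usd=0.05)
results = []


def worker():
    # Each simulated request tries to reserve an estimated $0.02 before being sent.
    results.append(budget.try_reserve(0.02))


threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.count(True), results.count(False))
```

The trade-off is that the reservation uses an estimate, since the real token count is only known after the response, so a production version would also need to reconcile the estimate against actual usage.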
Get It
Elevation is free and open source under the MIT license.
GitHub: https://github.com/elevation-networks/elevation

    git clone https://github.com/elevation-networks/elevation
    cd elevation
    docker compose up

If you've had a runaway agent incident (or a near-miss), I'd genuinely like to hear what happened. The failure modes are more varied than I expected. Reach out: [email protected]