LLM Proxy vs. SDK Wrappers: Why We Built a Proxy (And What the Latency Numbers Show)

SDK wrappers are the obvious first choice for LLM cost tracking. Here's why they break -- and what the latency benchmarks look like for the proxy alternative.

When we decided to build observability and cost control for LLM agents, the first question was: where do you intercept the call?

The obvious answer is "wrap the SDK." Most engineers reach for this first -- it's familiar, it's local, it doesn't require running another service. We built an SDK wrapper prototype before we built the proxy.

Here's what we learned, why we switched, and what the latency benchmarks actually show.

The Obvious Solution: SDK Wrappers

SDK wrappers are a reasonable first instinct. You create a client decorator that intercepts method calls, counts tokens, records latency, and enforces budgets:

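from openai import OpenAI

class BudgetExceededError(RuntimeError):
    """Raised when accumulated spend reaches the configured budget."""

class TrackedOpenAI:
    def __init__(self, api_key: str, budget_usd: float = 100.0):
        self._client = OpenAI(api_key=api_key)
        self._budget = budget_usd
        self.total_cost_usd = 0.0

    def chat_completions_create(self, **kwargs):
        # Pre-call check: refuse once the budget is already spent.
        if self.total_cost_usd >= self._budget:
            raise BudgetExceededError(f"Budget of ${self._budget:.2f} exceeded")
        response = self._client.chat.completions.create(**kwargs)
        # Flat illustrative per-token rate; real pricing varies by model.
        self.total_cost_usd += response.usage.total_tokens * 0.00001
        return response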

This works -- until it doesn't.

Three Ways SDK Wrappers Break

1. They fire too late to block requests.

The budget check and the cost update are separate steps, so there's a race condition: if two concurrent requests check the budget at the same time, both see total_cost_usd = $98.50 against a $100 budget and both proceed. More fundamentally, any request that gets past the pre-call check will reach the provider, and its cost is only recorded after the response comes back. You cannot cancel a request mid-flight from post-call code. By the time your exception handler runs, the tokens have been processed and the cost has been incurred.
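
To make the race concrete, here is a minimal sketch that replays the $98.50 scenario with two threads. The flat $2.00 per-call cost and the simulate helper are assumptions for illustration only, and a barrier is used to force the unlucky interleaving so the result is repeatable:

```python
import threading
from typing import Tuple

BUDGET_USD = 100.0
COST_PER_CALL_USD = 2.0  # hypothetical flat cost per request, for illustration


def simulate(concurrent: bool) -> Tuple[int, float]:
    """Run two requests against a check-then-record budget; return (sent, spent)."""
    state = {"spent": 98.50, "sent": 0}
    lock = threading.Lock()
    # The barrier forces both threads past the check before either records cost.
    barrier = threading.Barrier(2) if concurrent else None

    def request() -> None:
        if state["spent"] >= BUDGET_USD:  # pre-call budget check
            return
        if barrier is not None:
            barrier.wait()
        with lock:  # stands in for the provider call plus the post-call update
            state["sent"] += 1
            state["spent"] += COST_PER_CALL_USD

    threads = [threading.Thread(target=request) for _ in range(2)]
    if concurrent:
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    else:
        for t in threads:  # run one request to completion before the next starts
            t.start()
            t.join()
    return state["sent"], state["spent"]
```

Run sequentially, the second request is blocked and one call goes out. Run concurrently, both calls pass the check and spend ends at $102.50 against the $100 budget.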

2. SDK releases silently break your wrapper.

OpenAI shipped openai>=1.0.0 in late 2023. The entire client interface changed -- openai.ChatCompletion.create() became client.chat.completions.create(). Every wrapper that patched the old interface broke silently. Token tracking returned zeros. Budget checks stopped running. The anthropic-sdk had a similar transition.
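
The failure mode can be sketched with a stand-in SDK. FakeSDK, install_tracking, and the usage counter below are all hypothetical and are not the real openai package; the sketch only illustrates how a patch on an old entry point stops counting once callers move to a new one, without raising any error:

```python
from types import SimpleNamespace

usage = {"tokens": 0}  # what the wrapper believes has been consumed


class FakeSDK:
    """Stand-in for an SDK module; NOT the real openai package."""

    class ChatCompletion:  # old-style, module-level entry point
        @staticmethod
        def create(**kwargs):
            return {"usage": {"total_tokens": 42}}

    class Client:  # new-style client object introduced by a later release
        def __init__(self):
            self.chat = SimpleNamespace(
                completions=SimpleNamespace(create=self._create)
            )

        def _create(self, **kwargs):
            return {"usage": {"total_tokens": 42}}


def install_tracking(sdk) -> None:
    """Monkey-patch the old entry point so every call is counted."""
    original = sdk.ChatCompletion.create

    def tracked(**kwargs):
        response = original(**kwargs)
        usage["tokens"] += response["usage"]["total_tokens"]
        return response

    sdk.ChatCompletion.create = staticmethod(tracked)


install_tracking(FakeSDK)

FakeSDK.ChatCompletion.create(model="m")  # old path: counted
tokens_after_old_path = usage["tokens"]

FakeSDK.Client().chat.completions.create(model="m")  # new path: bypasses the patch
tokens_after_new_path = usage["tokens"]
```

The second call succeeds and consumes tokens, but the counter does not move, which is why this kind of breakage tends to show up on the invoice rather than in the logs.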

3. They're language-specific.

If your infrastructure uses Python agents, TypeScript lambdas, and a Go service for batch processing, you're maintaining three wrappers. When you fix a token-counting bug, you fix it in three places.

The Proxy Approach

A proxy solves all three problems by moving the interception point from the SDK layer to the HTTP layer.

One base-URL change (an environment variable or a client option) routes all LLM API traffic through the proxy -- regardless of language or SDK:

# Python (OpenAI SDK)
export OPENAI_BASE_URL="http://localhost:3001/openai"

// TypeScript (OpenAI SDK)
const client = new OpenAI({ baseURL: "http://localhost:3001/openai" });

# curl
curl http://localhost:3001/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Elevation-Key: $ELEVATION_KEY" \
  -d '{"model": "gpt-4o", "messages": [...]}'

Budget enforcement fires before the upstream call. The budget check runs in the proxy's request handler. If the budget is exceeded, the request dies at the proxy. The provider never sees it. Nothing is billed.
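
As a rough sketch of that ordering (not the proxy's actual implementation), the handler below checks the budget first and only then calls a forward function standing in for the upstream request. The BudgetGate class, the forward callback signature, and the 429 status code are assumptions chosen for illustration, and the sketch does not try to handle requests that are already in flight:

```python
import threading
from typing import Callable, Tuple


class BudgetGate:
    """Hypothetical request handler: check the budget, then forward upstream."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0
        self._lock = threading.Lock()

    def handle(
        self,
        body: dict,
        forward: Callable[[dict], Tuple[dict, float]],
    ) -> Tuple[int, dict]:
        # Reject before any upstream call is made, so nothing is billed.
        with self._lock:
            if self.spent_usd >= self.budget_usd:
                return 429, {"error": "budget exceeded"}

        # forward() is assumed to return (response_body, cost_usd).
        response, cost_usd = forward(body)

        with self._lock:
            self.spent_usd += cost_usd
        return 200, response


# Usage with a fake upstream that records every request it receives.
upstream_calls = []


def fake_upstream(body: dict) -> Tuple[dict, float]:
    upstream_calls.append(body)
    return {"ok": True}, 0.75


gate = BudgetGate(budget_usd=1.0)
first = gate.handle({"n": 1}, fake_upstream)
second = gate.handle({"n": 2}, fake_upstream)
third = gate.handle({"n": 3}, fake_upstream)
```

The third request is rejected at the gate, and fake_upstream is never invoked for it.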

The Latency Question

We benchmarked three configurations against the OpenAI API in June 2026:
