Building an Open-Source LLM Cost Control Plane

Why we chose a proxy over SDK instrumentation, how TimescaleDB makes budget checks fast, and what the three enforcement modes look like in production.

Share
Building an Open-Source LLM Cost Control Plane
Photo by Bernd 📷 Dittrich / Unsplash

Building an Open-Source LLM Cost Control Plane

When we started building Elevation Networks, the architecture decision that shaped everything else was surprisingly simple: do we intercept LLM calls at the SDK layer or at the network layer?

We chose the network layer. Here's why that decision matters, and what we built around it.

The Core Architecture Problem

Every AI-powered application makes HTTP requests to model providers. Those requests contain prompts, tool definitions, and conversation history. The responses contain completions and, critically, token usage data. If you want to track, budget, and control AI spend, you need to see every one of those exchanges.

There are two places you can intercept them:

SDK instrumentation wraps the client library. You monkey-patch the OpenAI or Anthropic SDK, capture calls before they go out, and log responses when they come back. This works for simple cases, but it breaks down quickly: you need separate integrations for every SDK and every language. In a real organization, those codebases are owned by different teams, written in different languages, and deployed on different schedules.

Proxy interception puts a routing layer in front of your providers. Clients send their API calls to your proxy, which forwards them to the actual provider, captures the exchange, and streams the response back. No SDK changes required. No per-language instrumentation. Works for any client that can make HTTP requests.

We chose the proxy approach, and it's the right call for any organization that wants centralized cost governance without requiring every team to instrument their own code.

The Proxy in Practice

Our proxy is a lightweight Rust service that handles the full OpenAI and Anthropic API surfaces, including streaming responses. When a request comes in:

  1. We parse the request to extract the model, the agent identifier (passed via a custom header or extracted from the API key mapping), and the estimated prompt token count.
  2. We check the agent's current budget state. If they're at their hard limit, we return a 429. If they're approaching a soft limit, we pass through and emit a warning event.
  3. We forward the request to the upstream provider and stream the response back to the client.
  4. On completion, we extract the actual token usage from the response and write a cost event to TimescaleDB.

The round-trip overhead is typically under 5ms on the same network, which is well within the noise floor of LLM latency.

Why TimescaleDB for Cost Events

Cost data has a natural time series shape. Every API call produces a point-in-time record: when it happened, which agent made it, which model was used, how many tokens were consumed, and what that cost in normalized dollars. You want to query this data the way you'd query any time series: give me total spend for agent X over the last 24 hours, show me the per-hour token burn rate for the support pipeline this week, alert me when any agent's 15-minute rolling spend exceeds $5.

We evaluated several storage options:

Plain PostgreSQL works, but aggregating over large event volumes gets slow. Time-based partitioning helps but requires manual management.

ClickHouse gives excellent query performance, but operational overhead and the eventual-consistency model makes it harder to enforce real-time budget limits accurately.

TimescaleDB is a PostgreSQL extension, so we stay in the Postgres ecosystem. Automatic time-based partitioning, continuous aggregates for pre-computed rollups, and native support for gap-filling in time queries. Budget queries run in milliseconds even over millions of events.

The continuous aggregate feature is particularly useful here. We pre-compute hourly and daily spend rollups per agent, which makes the real-time budget check (in the hot path of every proxied request) a sub-millisecond point lookup instead of an aggregate query.

Budget Enforcement Modes

We support three enforcement modes, configurable per agent or per team:

Hard limit: When the agent hits its budget, requests return 429 immediately. The agent's workload stops until someone intervenes -- either resetting the budget, raising the limit, or investigating the spike. Appropriate for autonomous agents where unconstrained spend is a real risk.

Soft limit with alerting: Requests continue to pass through, but we emit alerts when the agent crosses its configured threshold. Useful for human-in-the-loop workflows where you want visibility without hard stops.

Rate limiting: Instead of a total budget cap, we limit the token rate -- e.g., no more than 100K tokens per hour. This smooths spend over time and prevents bursty agents from exhausting their budget in a single spike.

Most real organizations use a combination: hard limits on fully-autonomous production agents, soft limits on internal tools, and rate limiting on batch pipelines.

Quota Hierarchy

Budget enforcement is only useful if it maps to your organizational structure. We model quotas as a hierarchy:

  • Company
  • Team
  • Project
  • Agent

A request gets checked against all four levels. If any level is at its hard limit, the request is rejected. This lets you set a company-wide budget cap that can't be exceeded even if individual agents are within their own limits -- which is the control that finance teams actually need.

What's Open Source

The core proxy, the cost event ingestion pipeline, the TimescaleDB schema, and the budget enforcement engine are all MIT-licensed. The repository includes a Docker Compose setup that gets you running locally in under five minutes.

We're currently building provider support for the full OpenAI and Anthropic APIs, with Bedrock and Vertex AI on the roadmap. Model pricing is maintained as a versioned YAML file in the repo -- PRs welcome.

If you're running AI workloads in production and don't have per-agent cost attribution today, this is the gap we're filling.

GitHub: https://github.com/elevation-networks/elevation

git clone https://github.com/elevation-networks/elevation
cd elevation
docker compose up

The proxy and cost engine are MIT-licensed. Contributions and feedback are welcome.