Why AI API costs spiral out of control
LLM API costs work differently from traditional infrastructure costs. A single request can range from $0.0001 to $0.50 depending on the model, input length, output length, and whether reasoning tokens or multimodal inputs are involved. Costs are invisible in real time, vary wildly by usage pattern, and compound silently across thousands of daily requests.
Analysis of enterprise AI deployments shows that unmonitored LLM applications typically waste 25–35% of their API budget. The sources of that waste are consistent across organisations:
| Cost leak | Avg budget impact | How monitoring detects it |
|---|---|---|
| Redundant API calls (no caching) | 15–30% | Request deduplication analysis |
| Oversized prompts (unnecessary context) | 10–20% | Token usage tracking per request |
| Wrong model for task (GPT-4 for classification) | 20–40% | Model usage breakdown by task type |
| Retry storms (aggressive failed-request retries) | 5–15% | Error rate and retry pattern tracking |
| Unconstrained output length | 10–25% | Output token distribution per endpoint |
What LLM token monitoring actually tracks
Token monitoring is not just counting tokens in a dashboard. A complete implementation captures four distinct layers of data on every API call:
1. Per-request token data
Input tokens, output tokens, cached tokens, and reasoning tokens for every individual request. This is the raw material for all cost attribution. Without per-request granularity, you can only see aggregate spend — not which specific feature, user, or prompt is responsible.
2. Cost attribution
Token counts translated into cost figures and tagged by feature, user, team, environment (dev/staging/prod), and model. This turns raw numbers into actionable data: "Feature X costs $1,200/month and accounts for 40% of total spend" is something you can act on. "We spent $3,000 this month" is not.
3. Trend and anomaly detection
Token spend tracked over time so regressions are visible. A prompt change that doubles token consumption should trigger an alert, not appear silently in the next invoice. Trend monitoring also surfaces gradual drift — token counts that creep up 5% per week are easy to miss without a baseline.
4. Rate limiting and budget enforcement
Hard or soft limits applied at the feature, user, or environment level. This prevents runaway agentic loops, batch jobs gone wrong, or a single bug from consuming an entire monthly budget in hours.
Input tokens vs output tokens — why the distinction matters
All major LLM API providers price input and output tokens separately — and output tokens are consistently more expensive, typically by a factor of 3 to 6x. Understanding this asymmetry is the foundation of effective cost control.
Input tokens
The tokens in your prompt: system instructions, user message, conversation history, retrieved context (RAG), and any examples.
- → Optimise with prompt compression
- → Cache repeated system prompts (up to 90% saving on cached tokens)
- → Trim RAG context to relevant chunks only
- → Manage conversation history window carefully
Output tokens
The tokens in the model's response. Output tokens cost 3–6x more than input tokens and are harder to predict — making them the bigger cost lever.
- → Set
max_tokenson every API call - → Instruct the model to be concise in the system prompt
- → Use structured outputs (JSON) to eliminate padding
- → Use stop sequences to end responses early
Important: Output tokens are always more expensive than input tokens across every major provider. A prompt optimised from 500 to 200 input tokens saves ~60% on input cost — but cutting output from 400 to 150 tokens saves even more because the per-token rate is higher.
How to implement LLM token monitoring
Effective token monitoring requires instrumentation at the API call level, not just aggregate billing dashboards. Here is a practical implementation path.
Step 1: Capture raw usage data per request
Every major LLM provider returns token counts in the API response. OpenAI returns usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens. Anthropic returns similar fields. Capture and log these on every call.
Step 2: Tag requests with context
Raw token counts without context are nearly useless. Tag each request with:
- → Feature or product area (e.g. chat, summarisation, search)
- → User ID or session ID
- → Model name and version
- → Environment (production, staging)
Step 3: Translate tokens to cost
Apply each provider's published pricing to convert token counts into dollar amounts in real time. Store both the raw tokens and the computed cost so you can recompute when prices change.
Step 4: Aggregate and alert
Roll up costs by feature, user cohort, and time window. Set threshold alerts for when per-request costs exceed a baseline, when daily spend trends above forecast, or when a specific feature's cost-per-call spikes unexpectedly.
Key metrics every LLM monitoring dashboard should show
Once you are capturing per-request data, build visibility around these metrics:
| Metric | Why it matters | Alert threshold |
|---|---|---|
| Cost per request (p50/p95) | Spot expensive outlier calls | >2× baseline p50 |
| Input/output token ratio | High output ratio means verbose responses | Ratio > 3× historical avg |
| Daily / monthly spend rate | Forecast overage before it hits | Spend pace > 80% of budget by mid-month |
| Cost per feature / endpoint | Find which features drive cost | Any feature >40% of total spend |
| Tokens per user session | Identify power users and abuse | Session cost >10× median |
Proven strategies to reduce LLM API costs
Monitoring tells you where money goes. These techniques help you spend less without degrading quality.
Prompt compression
Remove redundant instructions, whitespace, and examples from system prompts. Tools like LLMLingua can compress prompts by 3–5× with minimal quality loss, directly cutting input token costs.
Semantic caching
Cache responses for semantically similar queries. For applications with repetitive queries (FAQs, product descriptions), cache hit rates of 30–60% are achievable, eliminating those API calls entirely.
Model routing
Route simple queries to cheaper, smaller models and reserve frontier models for complex tasks. A routing layer that sends 70% of traffic to a smaller model can cut costs by 50% or more.
max_tokens limits
Always set max_tokens appropriate to the task. An uncapped chat response can balloon to 4,000 tokens when 400 would suffice. Per-endpoint limits prevent runaway output costs.
LLM token monitoring tools and platforms
Several categories of tooling address token monitoring:
- → LLM observability platforms — tools like Langfuse, Helicone, and Phoenix provide per-request token tracking, cost attribution, and dashboards purpose-built for LLM workloads.
- → APM tools with LLM support — Datadog, New Relic, and Dynatrace have added LLM monitoring modules that integrate token tracking into broader infrastructure observability.
- → Custom logging pipelines — teams with existing data infrastructure often build lightweight middleware that intercepts API calls, logs token counts, and pushes them to ClickHouse or BigQuery.
- → Provider dashboards — OpenAI, Anthropic, and Google provide usage dashboards, but they lack per-feature attribution and real-time alerting, making them insufficient as the sole monitoring layer.
For teams that also need endpoint uptime, latency, and error-rate monitoring alongside cost tracking, a unified monitoring platform reduces the number of tools in the stack.
Setting budgets and automated alerts
Reactive cost management — checking the bill at month-end — is too slow. Effective LLM cost governance uses proactive budget controls:
- → Set soft limits that trigger a Slack/PagerDuty alert at 70% of monthly budget
- → Set hard limits that throttle or disable non-critical features at 90% of budget
- → Alert on cost-per-request anomalies (e.g. p95 exceeds 3× the 7-day rolling average)
- → Alert when a new deployment causes a sudden increase in average token usage
- → Track cost alongside latency — a cheaper response that is also slower may indicate a degraded model or routing error
Monitor your LLM costs and API uptime in one place
MonitorGiant gives AI engineering teams per-request token tracking, cost attribution by feature, anomaly alerts, and endpoint uptime monitoring — all from a single dashboard. Stop discovering cost spikes in your cloud bill.
Start free monitoring →Conclusion
LLM token monitoring is no longer optional for teams running AI features in production. As model API spending scales with usage, the gap between teams that have per-request visibility and those relying on monthly invoices grows wider — in both cost efficiency and reliability.
Start by capturing raw token counts on every API call, tag them with feature context, translate them to cost, and build alerting around the metrics that matter most. Combined with prompt optimisation and model routing, most teams can reduce their LLM API spend by 30–50% without changing their product experience.
Written by
Dileep KK, MonitorGiant
LinkedIn21+ years in IT infrastructure management and observability. Built monitoring dashboards, custom alerting pipelines, and AI token-tracking systems across cloud platforms — AWS, GCP, and Azure — and for organisations spanning defence IT, IoT manufacturing, digital marketing, SaaS email, insurance broking, parliamentary digital services, and educational ERP. Active directory, SIEM, WAF, Cloudflare, MSSQL, Linux, Windows, Entra ID — operated at every layer of the stack.