Skip to main content
Blog AI Monitoring · May 2026 · 13 min read

LLM Token Monitoring:
How to Track and Control AI API Costs

Model API spending doubled from $3.5 billion to $8.4 billion in under a year. For engineering teams, token costs have become one of the fastest-growing line items in the budget — and the teams spending the most are often the ones with the least visibility. Here is how to fix that.

Why AI API costs spiral out of control

LLM API costs work differently from traditional infrastructure costs. A single request can range from $0.0001 to $0.50 depending on the model, input length, output length, and whether reasoning tokens or multimodal inputs are involved. Costs are invisible in real time, vary wildly by usage pattern, and compound silently across thousands of daily requests.

Analysis of enterprise AI deployments shows that unmonitored LLM applications typically waste 25–35% of their API budget. The sources of that waste are consistent across organisations:

Cost leak Avg budget impact How monitoring detects it
Redundant API calls (no caching) 15–30% Request deduplication analysis
Oversized prompts (unnecessary context) 10–20% Token usage tracking per request
Wrong model for task (GPT-4 for classification) 20–40% Model usage breakdown by task type
Retry storms (aggressive failed-request retries) 5–15% Error rate and retry pattern tracking
Unconstrained output length 10–25% Output token distribution per endpoint

What LLM token monitoring actually tracks

Token monitoring is not just counting tokens in a dashboard. A complete implementation captures four distinct layers of data on every API call:

1. Per-request token data

Input tokens, output tokens, cached tokens, and reasoning tokens for every individual request. This is the raw material for all cost attribution. Without per-request granularity, you can only see aggregate spend — not which specific feature, user, or prompt is responsible.

2. Cost attribution

Token counts translated into cost figures and tagged by feature, user, team, environment (dev/staging/prod), and model. This turns raw numbers into actionable data: "Feature X costs $1,200/month and accounts for 40% of total spend" is something you can act on. "We spent $3,000 this month" is not.

3. Trend and anomaly detection

Token spend tracked over time so regressions are visible. A prompt change that doubles token consumption should trigger an alert, not appear silently in the next invoice. Trend monitoring also surfaces gradual drift — token counts that creep up 5% per week are easy to miss without a baseline.

4. Rate limiting and budget enforcement

Hard or soft limits applied at the feature, user, or environment level. This prevents runaway agentic loops, batch jobs gone wrong, or a single bug from consuming an entire monthly budget in hours.

Input tokens vs output tokens — why the distinction matters

All major LLM API providers price input and output tokens separately — and output tokens are consistently more expensive, typically by a factor of 3 to 6x. Understanding this asymmetry is the foundation of effective cost control.

Input tokens

The tokens in your prompt: system instructions, user message, conversation history, retrieved context (RAG), and any examples.

  • Optimise with prompt compression
  • Cache repeated system prompts (up to 90% saving on cached tokens)
  • Trim RAG context to relevant chunks only
  • Manage conversation history window carefully

Output tokens

The tokens in the model's response. Output tokens cost 3–6x more than input tokens and are harder to predict — making them the bigger cost lever.

  • Set max_tokens on every API call
  • Instruct the model to be concise in the system prompt
  • Use structured outputs (JSON) to eliminate padding
  • Use stop sequences to end responses early

Important: Output tokens are always more expensive than input tokens across every major provider. A prompt optimised from 500 to 200 input tokens saves ~60% on input cost — but cutting output from 400 to 150 tokens saves even more because the per-token rate is higher.

How to implement LLM token monitoring

Effective token monitoring requires instrumentation at the API call level, not just aggregate billing dashboards. Here is a practical implementation path.

Step 1: Capture raw usage data per request

Every major LLM provider returns token counts in the API response. OpenAI returns usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens. Anthropic returns similar fields. Capture and log these on every call.

Step 2: Tag requests with context

Raw token counts without context are nearly useless. Tag each request with:

  • Feature or product area (e.g. chat, summarisation, search)
  • User ID or session ID
  • Model name and version
  • Environment (production, staging)

Step 3: Translate tokens to cost

Apply each provider's published pricing to convert token counts into dollar amounts in real time. Store both the raw tokens and the computed cost so you can recompute when prices change.

Step 4: Aggregate and alert

Roll up costs by feature, user cohort, and time window. Set threshold alerts for when per-request costs exceed a baseline, when daily spend trends above forecast, or when a specific feature's cost-per-call spikes unexpectedly.

Key metrics every LLM monitoring dashboard should show

Once you are capturing per-request data, build visibility around these metrics:

Metric Why it matters Alert threshold
Cost per request (p50/p95) Spot expensive outlier calls >2× baseline p50
Input/output token ratio High output ratio means verbose responses Ratio > 3× historical avg
Daily / monthly spend rate Forecast overage before it hits Spend pace > 80% of budget by mid-month
Cost per feature / endpoint Find which features drive cost Any feature >40% of total spend
Tokens per user session Identify power users and abuse Session cost >10× median

Proven strategies to reduce LLM API costs

Monitoring tells you where money goes. These techniques help you spend less without degrading quality.

Prompt compression

Remove redundant instructions, whitespace, and examples from system prompts. Tools like LLMLingua can compress prompts by 3–5× with minimal quality loss, directly cutting input token costs.

Semantic caching

Cache responses for semantically similar queries. For applications with repetitive queries (FAQs, product descriptions), cache hit rates of 30–60% are achievable, eliminating those API calls entirely.

Model routing

Route simple queries to cheaper, smaller models and reserve frontier models for complex tasks. A routing layer that sends 70% of traffic to a smaller model can cut costs by 50% or more.

max_tokens limits

Always set max_tokens appropriate to the task. An uncapped chat response can balloon to 4,000 tokens when 400 would suffice. Per-endpoint limits prevent runaway output costs.

LLM token monitoring tools and platforms

Several categories of tooling address token monitoring:

  • LLM observability platforms — tools like Langfuse, Helicone, and Phoenix provide per-request token tracking, cost attribution, and dashboards purpose-built for LLM workloads.
  • APM tools with LLM support — Datadog, New Relic, and Dynatrace have added LLM monitoring modules that integrate token tracking into broader infrastructure observability.
  • Custom logging pipelines — teams with existing data infrastructure often build lightweight middleware that intercepts API calls, logs token counts, and pushes them to ClickHouse or BigQuery.
  • Provider dashboards — OpenAI, Anthropic, and Google provide usage dashboards, but they lack per-feature attribution and real-time alerting, making them insufficient as the sole monitoring layer.

For teams that also need endpoint uptime, latency, and error-rate monitoring alongside cost tracking, a unified monitoring platform reduces the number of tools in the stack.

Setting budgets and automated alerts

Reactive cost management — checking the bill at month-end — is too slow. Effective LLM cost governance uses proactive budget controls:

  • Set soft limits that trigger a Slack/PagerDuty alert at 70% of monthly budget
  • Set hard limits that throttle or disable non-critical features at 90% of budget
  • Alert on cost-per-request anomalies (e.g. p95 exceeds 3× the 7-day rolling average)
  • Alert when a new deployment causes a sudden increase in average token usage
  • Track cost alongside latency — a cheaper response that is also slower may indicate a degraded model or routing error

Monitor your LLM costs and API uptime in one place

MonitorGiant gives AI engineering teams per-request token tracking, cost attribution by feature, anomaly alerts, and endpoint uptime monitoring — all from a single dashboard. Stop discovering cost spikes in your cloud bill.

Start free monitoring →

Conclusion

LLM token monitoring is no longer optional for teams running AI features in production. As model API spending scales with usage, the gap between teams that have per-request visibility and those relying on monthly invoices grows wider — in both cost efficiency and reliability.

Start by capturing raw token counts on every API call, tag them with feature context, translate them to cost, and build alerting around the metrics that matter most. Combined with prompt optimisation and model routing, most teams can reduce their LLM API spend by 30–50% without changing their product experience.

Written by

Dileep KK, MonitorGiant

LinkedIn

21+ years in IT infrastructure management and observability. Built monitoring dashboards, custom alerting pipelines, and AI token-tracking systems across cloud platforms — AWS, GCP, and Azure — and for organisations spanning defence IT, IoT manufacturing, digital marketing, SaaS email, insurance broking, parliamentary digital services, and educational ERP. Active directory, SIEM, WAF, Cloudflare, MSSQL, Linux, Windows, Entra ID — operated at every layer of the stack.

IIM Shillong Management MBA – Information Systems ITIL v4 Foundation Lean Six Sigma GB Google PMP