Monitoring & DevOps Glossary
Plain-English definitions of 42 key terms across uptime monitoring, API monitoring, SLAs, AI observability, cloud costs, and DevOps reliability.
Alert Fatigue
The phenomenon where monitoring teams begin ignoring or disabling alerts because too many false positives or low-priority notifications have eroded trust in the alerting system. Alert fatigue is one of the leading causes of missed real incidents during production outages. Good monitoring tools reduce noise through multi-region consensus checks and intelligent thresholds that only fire when a genuine problem is confirmed.
AI Observability
AI Monitoring →The practice of monitoring AI-powered systems for performance, cost, and response quality — going beyond traditional server uptime to track token usage, quality drift, and model behaviour over time. AI observability includes metrics like token burn rate, semantic similarity scores, and provider availability. MonitorGiant's AI monitoring suite covers token burn, circuit breaking, quality monitoring, and MCP health.
AI Status Feed
AI Monitoring →A live aggregation of the operational status of major AI providers — including OpenAI, Anthropic, Google AI, Cloudflare AI, and Hugging Face — surfaced within a monitoring dashboard. The AI status feed allows teams to instantly distinguish between their own service failures and provider-side incidents, preventing unnecessary debugging of problems outside their control. MonitorGiant provides an AI status feed free for all users.
AI Token
AI Token Monitoring →The unit of consumption billed by AI language model providers such as OpenAI, Anthropic, and Google Gemini. Tokens correspond roughly to word fragments: a typical English word is 1–2 tokens. Providers quote prices per fixed block of tokens (per thousand or per million) and bill for both prompts and completions, often at different rates, making token monitoring essential for teams that need to control AI infrastructure spend.
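As a rough illustration of how prompt and completion tokens turn into spend, the sketch below prices a single request. The function name and the prices used in the example are hypothetical and do not reflect any provider's actual rates.

```python
def token_cost(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_price_per_1k: float,
    completion_price_per_1k: float,
) -> float:
    """Estimate the cost of one request from its token counts.

    Prices are expressed per 1,000 tokens; prompts and completions are
    priced separately because providers often charge them differently.
    """
    prompt_cost = prompt_tokens / 1000 * prompt_price_per_1k
    completion_cost = completion_tokens / 1000 * completion_price_per_1k
    return prompt_cost + completion_cost


# Made-up prices, for illustration only.
cost = token_cost(2000, 1000, prompt_price_per_1k=0.5, completion_price_per_1k=1.5)
print(cost)  # 2.5
```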
API Monitoring
Monitor Types →The practice of continuously checking API endpoints to ensure they return correct status codes, valid response bodies, and acceptable response times. API monitoring goes beyond simple uptime checks by validating that an API is not just reachable but functionally correct — for example, confirming that a payment API returns a valid confirmation field. MonitorGiant supports HTTP, Keyword, and Journey monitors for thorough API monitoring.
Availability
SLA Calculator →A measure of how often a system or service is operational and accessible, typically expressed as a percentage over a defined period. Availability of 99.9% means the service can be down for no more than about 43.8 minutes in an average-length month (43.2 minutes over a 30-day window). It is closely related to uptime and forms the core metric of most SLAs.
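The downtime figure follows directly from the availability percentage and the length of the period. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Maximum downtime, in minutes, that still meets the availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


AVERAGE_MONTH_DAYS = 365.25 / 12  # about 30.44 days

print(round(downtime_budget_minutes(99.9, 30), 1))                  # 43.2
print(round(downtime_budget_minutes(99.9, AVERAGE_MONTH_DAYS), 1))  # 43.8
```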
Circuit Breaker (AI)
AI Circuit Breaker →An automated safeguard that fires a configured action — such as a signed webhook POST or an email alert — when AI token usage spikes beyond a defined threshold relative to a rolling average. Named after the electrical component that interrupts a circuit to prevent damage, an AI circuit breaker gives your systems the signal they need to respond automatically to runaway costs. MonitorGiant's AI Circuit Breaker is paired with the Token Burn Monitor.
Cloud Cost Spike
Cloud Monitoring →An unexpected and significant increase in cloud infrastructure spend, often caused by misconfigured auto-scaling, a runaway process, a data transfer surge, or an error in a cost-intensive service. Cloud cost monitoring detects spikes in near real time by comparing daily spend against a rolling average, allowing teams to act before the monthly bill arrives. MonitorGiant supports cloud cost monitoring for AWS, GCP, and Azure.
Cron Job Health
Platform Monitors →The operational status of scheduled background tasks (cron jobs) that run at fixed intervals on a server or within a platform. A cron job that silently fails — completing without an error but producing no expected output — is one of the hardest failures to detect without dedicated monitoring. Platform health monitors for self-hosted tools like WordPress and ERPNext check cron job status as part of their deep health assessment.
Deep Monitoring
Features →A monitoring philosophy that goes beyond peripheral checks — such as whether a server responds with a 200 OK — to verify that an application actually works end-to-end. Deep monitoring includes response body validation, AI token spend tracking, cloud cost alerts, self-hosted platform health signals, and multi-step Journey monitoring. MonitorGiant is built entirely around the deep monitoring philosophy.
Downtime
SLA Calculator →Any period during which a service, application, or API is unavailable or not functioning as expected. Downtime can be complete — total unavailability — or partial, such as degraded performance, regional failures, or silent errors returning incorrect responses. The financial impact of downtime is calculated by multiplying hourly revenue at risk by the duration of the incident.
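The financial-impact calculation described above is a single multiplication. The sketch below assumes the incident duration is tracked in minutes; the function name and the figures are illustrative only.

```python
def estimated_downtime_cost(hourly_revenue_at_risk: float, downtime_minutes: float) -> float:
    """Hourly revenue at risk multiplied by the incident duration in hours."""
    return hourly_revenue_at_risk * (downtime_minutes / 60)


# A 45-minute outage on a service with 1,200 per hour at risk.
print(estimated_downtime_cost(1200, 45))  # 900.0
```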
Error Rate
Monitor Types →The percentage of requests to a service that result in an error response, typically defined as HTTP 4xx or 5xx status codes. A rising error rate is often the first detectable signal of an emerging incident, even before full downtime occurs. Monitoring error rate alongside availability gives a more complete picture of service health than uptime alone.
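One way to compute an error rate from a batch of observed status codes is sketched below. The `error_floor` parameter is an assumption added for illustration: it defaults to 400 to match the definition above, and can be raised to 500 for teams that count only server-side errors.

```python
from collections.abc import Iterable


def error_rate(status_codes: Iterable[int], error_floor: int = 400) -> float:
    """Percentage of responses whose status code is at or above error_floor."""
    codes = list(status_codes)
    if not codes:
        return 0.0  # no traffic observed, so no errors to report
    errors = sum(1 for code in codes if code >= error_floor)
    return 100 * errors / len(codes)


observed = [200, 200, 404, 500]
print(error_rate(observed))                   # 50.0
print(error_rate(observed, error_floor=500))  # 25.0
```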
FinOps
Cloud Monitoring →A cloud financial management discipline that brings together finance, engineering, and business teams to understand and optimise cloud spending. FinOps teams use monitoring data, cost allocation tags, and rolling spend baselines to govern infrastructure costs across the organisation. MonitorGiant's cloud cost monitoring supports FinOps workflows with daily spend tracking across AWS, GCP, and Azure.
HTTP Status Code
HTTP Monitor →A three-digit numeric code returned by a web server in response to an HTTP request, indicating the outcome of that request. Common codes include 200 (OK), 301 (Moved Permanently), 404 (Not Found), and 500 (Internal Server Error). HTTP monitors use status codes as the primary signal for uptime monitoring: an unexpected code triggers an alert.
Incident Post-Mortem
Features →A structured review conducted after a service incident to understand what happened, why it happened, and how to prevent recurrence. A good post-mortem is blameless, factual, and action-oriented — capturing timeline, root cause, impact, and follow-up tasks. MonitorGiant's full incident history provides the timeline data needed to construct accurate post-mortems.
Journey Monitor
Journey Monitor →A monitor type that walks through a sequence of HTTP requests in order, simulating a real user navigating through an application. Each step can validate a URL, an expected status code, and an optional keyword in the response body. Journey monitors catch breakages in checkout flows, sign-up sequences, and API workflows that standard HTTP monitors cannot detect on their own. Available on paid MonitorGiant plans.
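The step-by-step validation can be sketched as a loop that stops at the first failing step. This is not MonitorGiant's implementation: `Step`, `run_journey`, and the injected `fetch` callable are hypothetical names, and `fetch` stands in for whatever HTTP client performs the request.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    url: str
    expected_status: int = 200
    keyword: Optional[str] = None  # optional text that must appear in the body


# fetch(url) returns (status_code, response_body)
Fetch = Callable[[str], tuple[int, str]]


def run_journey(steps: list[Step], fetch: Fetch) -> Optional[int]:
    """Run steps in order; return the index of the first failing step, or None."""
    for index, step in enumerate(steps):
        status, body = fetch(step.url)
        if status != step.expected_status:
            return index
        if step.keyword is not None and step.keyword not in body:
            return index
    return None
```

Passing `fetch` in as a parameter keeps the sequencing logic testable without network access; a real monitor would supply a function that issues the HTTP request.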
Keyword Monitor
Keyword Monitor →A monitor type that fetches the full response body of a URL and checks for the presence or absence of a specific text string. Keyword monitors catch silent failures — pages that return a 200 OK status code but serve error messages, blank content, or missing critical elements like "Add to Cart." This is one of the most valuable forms of deep monitoring for e-commerce and SaaS applications.
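The presence-or-absence check itself is small. A minimal sketch, assuming the response body has already been fetched as text (the function and parameter names are hypothetical):

```python
def keyword_check(body: str, keyword: str, must_be_present: bool = True) -> bool:
    """Return True when the body satisfies the keyword rule.

    With must_be_present=True the keyword has to appear in the body;
    with must_be_present=False the check passes only if it is absent.
    """
    found = keyword in body
    return found if must_be_present else not found


page = "<h1>Blue T-Shirt</h1><button>Add to Cart</button>"
print(keyword_check(page, "Add to Cart"))                           # True
print(keyword_check(page, "Fatal error", must_be_present=False))    # True
```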
Latency
Monitor Types →The time elapsed between a client sending a request and receiving a full response from a server, typically measured in milliseconds. Latency is distinct from downtime — a service can be fully available but so slow that it is functionally unusable. Monitoring p95 and p99 latency (the 95th and 99th percentile response times) reveals performance issues affecting a subset of requests that averages systematically conceal.
LLM (Large Language Model)
AI Monitoring →A type of AI model trained on large amounts of text data to generate, summarise, translate, classify, and reason about language. LLMs such as GPT-4, Claude, and Gemini are the foundation of modern AI-powered applications. Teams building with LLMs need to monitor token consumption, response quality, and provider availability — all of which are covered by MonitorGiant's AI monitoring suite.
MCP Server
MCP Health Monitor →A server implementing the Model Context Protocol, a standard that enables AI agents to interact with tools and external data sources in a structured, predictable way. MCP servers power tool-calling in AI agent frameworks. Monitoring an MCP server requires verifying three layers: network reachability, the initialize handshake, and tool availability — all three are checked by MonitorGiant's MCP Health Monitor.
MTTD (Mean Time to Detect)
Features →The average time elapsed between when a failure occurs and when it is first detected by a monitoring system or an engineer. Shorter check intervals and multi-region monitoring reduce MTTD. For services with revenue SLAs, every undetected minute carries a measurable cost, making MTTD one of the most business-critical reliability metrics.
MTTR (Mean Time to Recover)
Features →The average time taken to restore a service after an incident, from the moment the failure is first detected to the moment it is fully resolved. MTTR is a key reliability metric, and lower is better. Measured this way, MTTR excludes the detection gap, so total downtime is roughly MTTD plus MTTR; automated monitoring shortens that total by getting active remediation started sooner.
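Both MTTD and MTTR can be derived from three timestamps per incident. The sketch below follows the definitions in this glossary (MTTD from failure to detection, MTTR from detection to resolution); the `Incident` record and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started_at: datetime   # when the failure actually began
    detected_at: datetime  # when monitoring or an engineer noticed it
    resolved_at: datetime  # when service was fully restored


def mean_duration(durations: list[timedelta]) -> timedelta:
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time from failure start to detection."""
    return mean_duration([i.detected_at - i.started_at for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from detection to resolution."""
    return mean_duration([i.resolved_at - i.detected_at for i in incidents])
```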
Multi-Region Monitoring
Features →A monitoring approach where health checks are run from multiple geographic locations simultaneously rather than a single origin. Multi-region monitoring eliminates false positives caused by regional network issues by only alerting when all configured regions agree that the service is unreachable. MonitorGiant uses multi-region consensus as a core part of its check logic across all monitor types.
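The all-regions-agree rule described above can be sketched as follows. The function name and the region labels are hypothetical, and a real system may use a different quorum rule.

```python
def should_alert(region_results: dict[str, bool]) -> bool:
    """Alert only when every configured region reports a failed check.

    region_results maps a region name to True if its check passed.
    """
    if not region_results:
        return False  # no data is not the same as a confirmed outage
    return not any(region_results.values())


# One region failing is treated as a regional network issue, not an outage.
print(should_alert({"eu-west": False, "us-east": True, "ap-south": True}))    # False
print(should_alert({"eu-west": False, "us-east": False, "ap-south": False}))  # True
```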
On-Call Rotation
Notification Channels →A schedule that assigns responsibility for responding to incidents to specific team members on a rotating basis, ensuring 24/7 coverage without burning out any individual. Effective on-call rotations are backed by clear alert routing, defined escalation paths, and monitoring tools that deliver actionable, low-noise alerts. MonitorGiant's notification profiles support per-monitor on-call routing.
p95 / p99 Latency
Monitor Types →Statistical measures of response time representing the 95th and 99th percentile of all observed requests. A p99 latency of 2 seconds means that 1% of requests take longer than 2 seconds. These percentiles are more meaningful for SLA monitoring than averages because they capture the slow-tail experience of a real subset of users, which averages routinely hide.
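To see how an average can hide a slow tail, the sketch below computes percentiles with the nearest-rank method. That method is one common convention chosen here for simplicity; monitoring tools may interpolate differently.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [100] * 98 + [3000] * 2

print(sum(latencies_ms) / len(latencies_ms))  # 158.0
print(percentile(latencies_ms, 95))           # 100
print(percentile(latencies_ms, 99))           # 3000
```

The mean of 158 ms looks healthy, while the p99 shows that some users wait 3 seconds.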
Ping Monitor
Ping Monitor →A monitor type that checks whether a host is reachable on the network by attempting a TCP connection, confirming the server exists and is responding at the network level. Ping monitors work across all cloud environments, including those that restrict raw ICMP packets. They are the most basic form of server monitoring and are available on MonitorGiant's free plan.
Platform Health Monitoring
Platform Monitors →Deep monitoring of self-hosted software platforms — such as WordPress, WooCommerce, Ghost, Nextcloud, and Keycloak — using the platform's own built-in health APIs rather than generic uptime checks. Platform health monitors check signals like database connectivity, cron job status, checkout function, queue depth, and plugin health. MonitorGiant supports 24 platforms with no plugin installation or code changes required.
Port Monitoring
Port Monitor →A monitor type that verifies a specific TCP port is accepting connections on a given host, confirming that the service running on that port is actually listening and responsive. Port monitors go further than a ping check — a server can be reachable at the network level while a crashed database is no longer accepting connections on its port. Available on MonitorGiant's free plan for any TCP port.
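A TCP port check boils down to attempting a connection and seeing whether it is accepted. A minimal sketch using Python's standard `socket` module; the function name and default timeout are assumptions.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port is accepted within the timeout."""
    try:
        # create_connection performs the TCP handshake; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `port_is_open("db.internal.example", 5432)` would report whether a PostgreSQL server at that hypothetical hostname is accepting connections.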
Real User Monitoring (RUM)
Monitor Types →A monitoring approach that collects performance and availability data from the actual browsers and devices of real end users as they interact with an application. RUM captures metrics like page load time, JavaScript errors, and navigation timing from live traffic, complementing synthetic monitoring's scheduled checks. Unlike synthetic monitoring, RUM requires instrumentation code deployed within the application.
Revenue Loss Tracker
Features →A monitoring feature that estimates the financial impact of an ongoing incident in real time, based on the estimated value of transactions or revenue at risk during the downtime period. Displaying revenue impact on the incident dashboard creates urgency and supports business-level decision making during an outage. MonitorGiant calculates and displays estimated revenue loss for every open incident.
Rolling Average
AI Token Monitoring →A statistical baseline calculated from a sliding window of historical data — for example, the last 15 days of AI token spend or cloud cost. Rolling averages adapt automatically as usage patterns change over time, making them more reliable for anomaly detection than static fixed thresholds. MonitorGiant uses a 15-day rolling dynamic average for AI token burn and cloud cost spike detection.
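A sliding-window baseline for spike detection might look like the sketch below. The 15-day window comes from the definition above; the class name and the `spike_multiplier` of 2.0 are assumptions made for illustration, not documented MonitorGiant behaviour.

```python
from collections import deque


class RollingBaseline:
    """Flags a daily value as a spike when it exceeds a multiple of the rolling average."""

    def __init__(self, window_days: int = 15, spike_multiplier: float = 2.0):
        self.history: deque[float] = deque(maxlen=window_days)
        self.spike_multiplier = spike_multiplier

    def observe(self, value: float) -> bool:
        """Compare value against the current baseline, then add it to the window."""
        is_spike = False
        # Only judge once the window is full, so early days cannot trigger alerts.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            is_spike = value > baseline * self.spike_multiplier
        self.history.append(value)  # the oldest value drops out automatically
        return is_spike
```

Because the window slides, a sustained change in usage gradually becomes the new baseline, which is the adaptive behaviour a static threshold lacks.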
Root Cause Analysis (RCA)
Incident History →The process of identifying the underlying cause of an incident, rather than treating only its symptoms or immediate trigger. A thorough RCA traces the chain of events from the root failure through to the observable impact, enabling teams to address the source and prevent recurrence. Monitoring data, incident timelines, and response time charts are key inputs into any meaningful RCA.
Sitemap Link Monitor
Sitemap Monitor →A monitor type that reads a sitemap.xml file and checks every listed URL on a scheduled basis using lightweight HTTP HEAD requests. Sitemap link monitoring catches broken links, 4xx errors, and redirect chains across an entire site in one consolidated scan, without loading each page in full. It is particularly valuable after CMS migrations, plugin updates, or large content restructures.
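The scan has two parts: extracting URLs from the sitemap and checking each one. The sketch below parses a standard `sitemap.xml` with Python's built-in XML parser; the `head` callable is a hypothetical stand-in for whatever issues the HTTP HEAD request and returns a status code.

```python
import xml.etree.ElementTree as ET
from typing import Callable

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def broken_links(urls: list[str], head: Callable[[str], int]) -> dict[str, int]:
    """Return {url: status} for URLs whose HEAD status is 400 or above."""
    results = {}
    for url in urls:
        status = head(url)
        if status >= 400:
            results[url] = status
    return results
```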
SLA (Service Level Agreement)
SLA Calculator →A formal commitment between a service provider and a customer defining the minimum acceptable level of service, most commonly expressed as an uptime percentage such as 99.9%. SLAs typically carry financial consequences — credits or penalties — when targets are missed. Uptime monitoring provides the evidence base needed to measure and defend SLA compliance.
SSL Certificate
SSL Monitor →A digital certificate that authenticates a website's identity and enables encrypted HTTPS connections using the TLS protocol. SSL certificates expire on a fixed date — typically after 90 days for Let's Encrypt certificates or up to one year for commercial certificates. An expired certificate causes browsers to display security warnings and block user access, making SSL expiry monitoring a critical reliability practice.
SSL Certificate Monitoring
SSL Monitor →The automated practice of continuously checking SSL certificates to report how many days remain before expiry and to alert teams before certificates lapse. SSL monitoring is distinct from uptime monitoring — a service can be fully running but become inaccessible the moment its certificate expires without renewal. MonitorGiant's SSL Certificate Expiry Monitor is available on paid plans with configurable alert thresholds.
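Counting the days left on a certificate can be done with Python's standard `ssl` module, as in the sketch below. The function names and the 14-day threshold are assumptions. Note that `fetch_not_after` uses a verifying TLS context, so a certificate that has already expired fails the handshake instead of returning a date.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jun 30 12:00:00 2030 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between now (timezone-aware) and the certificate expiry."""
    expires_ts = ssl.cert_time_to_seconds(not_after)
    expires = datetime.fromtimestamp(expires_ts, tz=timezone.utc)
    return (expires - now).days


def needs_renewal(days_left: int, threshold_days: int = 14) -> bool:
    """True when the remaining lifetime is at or below the alert threshold."""
    return days_left <= threshold_days
```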
Synthetic Monitoring
Monitor Types →A monitoring approach that simulates user interactions with scheduled, scripted test requests from external locations rather than waiting for real user traffic to reveal issues. Synthetic monitors run continuously, providing coverage even during low-traffic periods and detecting problems before any real user is affected. MonitorGiant's HTTP, Keyword, Port, and Journey monitors are all forms of synthetic monitoring.
TCP Port
Port Monitor →A numbered endpoint on a server that identifies a specific service or application process. Common ports include 443 (HTTPS), 5432 (PostgreSQL), 6379 (Redis), and 587 (SMTP mail submission). Port monitoring verifies that a service is actively listening on its assigned port, providing service-level health confirmation that goes beyond what an HTTP uptime check can reveal.
Token Burn
Token Burn Monitor →The rate at which an AI application consumes tokens from a provider such as OpenAI, Anthropic, or Google Gemini — measured as daily spend or usage volume relative to an established baseline. Abnormal token burn — a sudden spike relative to the 15-day rolling average — signals a potential incident such as a leaked API key, a runaway loop, or unexpected traffic. MonitorGiant's AI Token Burn Monitor tracks this continuously.
Uptime
SLA Calculator →The proportion of time that a service, application, or system is fully operational and accessible to users, typically expressed as a percentage calculated over a rolling period such as 30 days or 90 days. A service with 99.9% uptime may be down for a maximum of about 43.8 minutes in an average-length month. Any downtime beyond that budget drops the service below a 99.9% SLA target.
Uptime Monitoring
Monitor Types →The automated practice of checking whether a website, API, or service is reachable and responding correctly, typically by sending periodic HTTP requests from external locations at configured intervals. Uptime monitoring provides continuous visibility into service availability and generates instant alerts when checks fail. MonitorGiant combines uptime monitoring with keyword validation, multi-region checks, and instant alerts on its free plan.
Webhook
Notification Channels →An HTTP callback that sends a POST request to a configured URL when a specific event occurs — such as a monitoring alert firing, an incident opening, or an AI circuit breaker triggering. Webhooks allow monitoring tools to integrate directly with incident management systems, Slack channels, and custom automation workflows without the need for polling. MonitorGiant supports webhook delivery for alert and circuit breaker events.
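Webhook receivers commonly verify that a POST really came from the sender by checking a signature over the request body. The sketch below uses HMAC-SHA256, which is a widespread convention but an assumption here: the source does not specify MonitorGiant's signing scheme, and the `X-Signature-SHA256` header name is hypothetical.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def build_request(secret: bytes, event: dict) -> tuple[bytes, dict[str, str]]:
    """Sender side: serialise the event and attach its signature as a header."""
    body = json.dumps(event, sort_keys=True).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Signature-SHA256": sign_payload(secret, body),  # hypothetical header name
    }
    return body, headers


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, signature)
```

The receiver should verify against the raw bytes it received, before parsing the JSON, since re-serialising can change the byte sequence and invalidate the signature.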