AI Agent Monitoring: How to Know If Your Agent Is Actually Working

You deployed an AI agent. It's running 24/7. It sends you occasional updates. Everything seems fine.

But here's the uncomfortable question: how do you actually know it's doing its job?

Most people who run AI agents in production have no monitoring at all. They set it up, it runs, and they assume it works until something visibly breaks. That's like hiring an employee and never checking their output.

After running 4 AI agents in production for over a year, we've built a monitoring system that catches problems before they become expensive. Here's the exact framework.

The Silent Failure Problem

AI agents fail differently than traditional software. A web server crashes and you get a 500 error. An AI agent fails silently — it keeps running but produces garbage output, skips tasks, or loops on the same action forever.

Here are the failure modes we've seen in production:

  - Garbage output: the agent keeps producing text, but it's wrong, incomplete, or irrelevant
  - Skipped tasks: work silently drops off the queue and nobody notices
  - Infinite loops: the same action repeats over and over, burning tokens
  - Broken tools: an API or integration fails and the agent carries on without it

⚠️ The most dangerous failure: Your agent appears to work fine but is actually producing subtly wrong outputs. We had an agent send "personalized" emails with the wrong company names for 3 days before we caught it.

None of these trigger a crash. None of these send an error email. Your agent keeps running, keeps burning tokens, and keeps producing bad output — unless you're monitoring.

7 Metrics That Actually Matter

Forget vanity metrics. Here are the 7 numbers that tell you if your agent is healthy:

Metric #1: Task Completion Rate

What percentage of assigned tasks does your agent actually complete? Not start — complete.

Target: >95% for simple tasks, >85% for complex tasks
Red flag: Drops below 80% or suddenly changes by >10%
How to measure: Log task-start and task-complete events. Calculate the ratio daily.
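
A minimal sketch of that event log in Python; the events.jsonl path and helper names are placeholders, not a prescribed format:

# completion_rate.py — log task-start / task-complete events, then compute the ratio
import json
import time
from pathlib import Path

LOG = Path("events.jsonl")  # placeholder location for the event log

def log_event(task_id: str, event: str) -> None:
    """Append one JSON line per event; call with "task-start" and "task-complete"."""
    with LOG.open("a") as f:
        f.write(json.dumps({"task_id": task_id, "event": event, "ts": time.time()}) + "\n")

def completion_rate() -> float:
    """Completed tasks divided by started tasks across the whole log."""
    if not LOG.exists():
        return 0.0
    started, completed = set(), set()
    for line in LOG.read_text().splitlines():
        e = json.loads(line)
        if e["event"] == "task-start":
            started.add(e["task_id"])
        elif e["event"] == "task-complete":
            completed.add(e["task_id"])
    return len(completed & started) / max(len(started), 1)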

Metric #2: Token Spend per Task

How many tokens (and dollars) does each task cost? This catches infinite loops and over-thinking before they drain your budget.

Target: Establish a baseline, then alert on 2x deviation
Red flag: Sudden spike = infinite loop. Gradual increase = prompt bloat.
How to measure: Track API usage per task. Most providers report token usage in the API response.
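
If you're on the OpenAI Python SDK, a per-task meter can be as small as this sketch; the prices are placeholders you'd set for your actual model, and the wrapper name is ours:

# cost_per_task.py — read token usage from each API response and sum it per task
from openai import OpenAI

client = OpenAI()
PRICE_IN, PRICE_OUT = 2.50 / 1_000_000, 10.00 / 1_000_000  # $/token, placeholder prices

def call_and_meter(task_id: str, messages: list, totals: dict) -> str:
    """Make one model call and add its cost to the running total for this task."""
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    cost = (resp.usage.prompt_tokens * PRICE_IN
            + resp.usage.completion_tokens * PRICE_OUT)
    totals[task_id] = totals.get(task_id, 0.0) + cost
    return resp.choices[0].message.content

# Alert when totals[task_id] climbs past roughly 2x your baseline for that task type.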

Metric #3: Error Rate

How often does your agent hit errors — tool failures, API timeouts, malformed responses?

Target: <5% of all actions
Red flag: >10% or clustered errors (3+ in a row)
How to measure: Count error events vs. total actions per hour.

Metric #4: Response Latency

How long does each task take from start to finish? Slow agents waste your time even if they eventually succeed.

Target: Set per-task-type baselines (research: 5 min, email: 30 sec, etc.)
Red flag: 3x baseline = something's wrong
How to measure: Timestamp task start and end. Track p50, p95, p99.
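
Computing those percentiles needs nothing beyond the standard library; a sketch, assuming you already collect per-task durations in seconds:

# latency.py — p50 / p95 / p99 from a list of task durations
import statistics

def latency_percentiles(durations: list) -> dict:
    """durations: per-task wall-clock times in seconds (needs at least 2 samples)."""
    cuts = statistics.quantiles(durations, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Compare p95 against the baseline for that task type; 3x baseline is the red flag.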

Metric #5: Retry Count

How many times does the agent retry before succeeding? High retries = fragile setup.

Target: Average <2 retries per task
Red flag: >5 retries on any single task
How to measure: Log retry events with task IDs. Aggregate daily.

Metric #6: Uptime / Heartbeat

Is your agent actually running? Sounds obvious, but agents crash silently more than you'd think.

Target: 99.5%+ uptime
Red flag: Missed heartbeat = agent is down or stuck
How to measure: Send a heartbeat every 5-15 minutes. Alert after 2 missed beats.

Metric #7: Output Quality Score

The hardest one — but the most important. Is the output actually good? We'll cover how to measure this in detail below.

Target: >8/10 average quality score
Red flag: Trending downward over a week
How to measure: Spot-check sampling + automated checks (see Output Quality Scoring below).

Building Your Monitoring Dashboard

You don't need Datadog or Grafana for this. A simple dashboard that shows your 7 metrics at a glance is all you need to start.

Here's what ours looks like:

AGENT HEALTH DASHBOARD — Last 24h

Task Completion: 97%
Avg Cost/Task: $2.40
Error Rate: 3.1%
Avg Latency: 4.2m
Avg Retries: 1.3
Uptime: 99.8%
Quality Score: 8.4
Tasks Today: 142

All systems nominal ✓

Implementation options (simplest to most complex):

  1. Google Sheets + Apps Script — Agent logs to a sheet, formulas calculate metrics. Free, simple, works.
  2. Notion database — Agent creates entries via API. Built-in charts. Good for non-technical users.
  3. Custom HTML dashboard — Agent writes a JSON file, a static page reads it. No backend needed.
  4. PostHog / Mixpanel — Fire events from your agent. Free tier is usually enough. Real analytics.

💡 Start with option 1. Seriously. A Google Sheet with 7 columns, updated hourly, beats a sophisticated monitoring stack that you never build. You can always upgrade later.
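
For option 3, the agent-side half is just dumping the metrics to a file. Here's a sketch using the numbers from the example dashboard above; the field names are whatever you choose:

# write_dashboard.py — dump the 7 metrics to a JSON file that a static page can fetch
import datetime
import json

def write_dashboard(metrics: dict, path: str = "dashboard.json") -> None:
    """Overwrite the dashboard file after each task or on a schedule."""
    metrics["updated_at"] = datetime.datetime.now(datetime.timezone.utc).isoformat()
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)

write_dashboard({
    "task_completion": 0.97, "avg_cost_per_task": 2.40, "error_rate": 0.031,
    "avg_latency_min": 4.2, "avg_retries": 1.3, "uptime": 0.998,
    "quality_score": 8.4, "tasks_today": 142,
})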

⚡ Quick Shortcut

Skip months of trial and error

The AI Employee Playbook gives you production-ready templates, prompts, and workflows — everything in this guide and more, ready to deploy.

Get the Playbook — €29

Alert Rules That Don't Cry Wolf

Bad alert rules are worse than no alerts. If you get 20 notifications a day, you'll ignore them all — including the ones that matter.

Here's our alert hierarchy:

🔴 Critical (immediate notification)

  - Two missed heartbeats: the agent is down or stuck
  - Cost per task above 2x baseline: likely an infinite loop
  - Three or more errors in a row: something upstream is broken

🟡 Warning (daily digest)

  - Task completion rate below 80%, or a sudden shift of more than 10%
  - Error rate above 5%
  - Latency above 3x the task-type baseline
  - Average retries above 2

🟢 Info (weekly review)

  - Gradual cost increases (usually prompt bloat)
  - Quality score trend over the week
  - Totals per agent: tasks completed, spend, top error types

The rule of thumb: You should get a critical alert at most once a week. If you're getting one daily, either fix the root cause or downgrade it to a warning.
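
In code, the whole hierarchy is a handful of threshold checks. A sketch based on the red flags listed above; the metric keys are placeholders for however you store your numbers:

# alert_rules.py — classify the latest metrics into critical / warning tiers
def classify(m: dict) -> list:
    """Return (severity, message) pairs; info-level items come from the weekly review instead."""
    alerts = []
    if m["missed_heartbeats"] >= 2:
        alerts.append(("critical", "agent down or stuck"))
    if m["cost_per_task"] > 2 * m["cost_baseline"]:
        alerts.append(("critical", "cost spike - possible infinite loop"))
    if m["consecutive_errors"] >= 3:
        alerts.append(("critical", "clustered errors"))
    if m["completion_rate"] < 0.80:
        alerts.append(("warning", "task completion below 80%"))
    if m["error_rate"] > 0.05:
        alerts.append(("warning", "error rate above 5%"))
    if m["avg_latency"] > 3 * m["latency_baseline"]:
        alerts.append(("warning", "latency above 3x baseline"))
    if m["avg_retries"] > 2:
        alerts.append(("warning", "average retries above 2"))
    return alerts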

Automated Health Checks

Don't just monitor passively. Run active health checks that verify your agent is actually capable of doing work.

The Heartbeat Pattern

Every 15 minutes, send your agent a trivial task. If it responds correctly, it's alive and functional. If it doesn't, something's wrong.

# Heartbeat check (runs every 15 min via cron)
# 1. Send agent a test message: "heartbeat check"
# 2. Agent should respond: "HEARTBEAT_OK"
# 3. If no response in 60 seconds → alert

# Implementation:
# - Set up a cron job that pings the agent
# - Log response time and status
# - Alert on 2 consecutive failures
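
Here's a rough sketch of that check in Python. It assumes the agent is reachable over HTTP at a placeholder URL; adjust send_heartbeat() to however you actually talk to your agent:

# heartbeat.py — run every 15 min via cron; alert after 2 consecutive failures
import os
import urllib.request

AGENT_URL = "http://localhost:8080/message"   # placeholder: however you reach your agent
FAIL_FILE = "heartbeat_failures.txt"

def send_heartbeat() -> bool:
    """Send the trivial test message and check for the expected reply."""
    req = urllib.request.Request(AGENT_URL, data=b'{"text": "heartbeat check"}',
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return b"HEARTBEAT_OK" in resp.read()
    except Exception:
        return False

if __name__ == "__main__":
    fails = int(open(FAIL_FILE).read() or 0) if os.path.exists(FAIL_FILE) else 0
    fails = 0 if send_heartbeat() else fails + 1
    open(FAIL_FILE, "w").write(str(fails))
    if fails >= 2:
        print("ALERT: agent missed 2 heartbeats")  # swap for your Telegram/email alert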

The Canary Task

Once per day, give your agent a task with a known correct answer. Compare the output. This catches quality degradation that metrics alone miss.

Examples:

  - Research agent: summarize a fixed sample document and check that the key facts appear
  - Email agent: draft a reply to a stored sample email and check that the right company name and all required sections are present
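
Checking a canary can be as simple as string matching. A sketch, where the task and expected facts are placeholders for a fixed input you control:

# canary.py — run a fixed daily task and check the output against known facts
CANARY_TASK = "Summarize the sample report in canary_report.txt."  # placeholder fixed input
EXPECTED_FACTS = ["Acme", "2024 revenue", "three regions"]         # placeholder known answers

def score_canary(output: str) -> float:
    """Fraction of expected facts that actually show up in the agent's output."""
    hits = sum(1 for fact in EXPECTED_FACTS if fact.lower() in output.lower())
    return hits / len(EXPECTED_FACTS)

# Anything below 1.0 on a task this simple is worth a look.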

The Dependency Check

Test every external dependency your agent uses:

  - LLM API: can the agent reach its model provider and get a response?
  - Tools and integrations: does each one (email, Telegram, spreadsheets, whatever you've wired in) accept a test call?
  - Credentials: are API keys and tokens still valid, or about to expire?

Run these checks at startup and every 6 hours. Log results.
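
A sketch of that check using only the standard library; the URLs are placeholders for whatever your agent actually calls:

# dependency_check.py — ping every external service the agent depends on
import urllib.error
import urllib.request

CHECKS = {
    "llm_api": "https://api.openai.com/v1/models",  # a 401 without auth still proves reachability
    "webhook": "https://example.com/health",        # placeholder for your own integrations
}

def check_all() -> dict:
    """Return {dependency_name: True/False}; log this and alert on any False."""
    results = {}
    for name, url in CHECKS.items():
        try:
            urllib.request.urlopen(url, timeout=10)
            results[name] = True
        except urllib.error.HTTPError as e:
            results[name] = e.code < 500  # reachable, it just refused this unauthenticated call
        except Exception:
            results[name] = False
    return results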

Output Quality Scoring

This is where most people give up. "How do I measure quality? It's subjective!" Not entirely. Here are three approaches, from simple to sophisticated:

Level 1: Spot-Check Sampling (5 min/day)

Every day, randomly pick 3 outputs and rate them 1-10. Track the scores over time. Takes 5 minutes and catches most quality problems within a few days.

Level 2: Rule-Based Checks (automated)

Write simple rules that check for common quality issues (a sketch follows this list):

  - Output length within the expected range (not suspiciously short, not rambling)
  - No unfilled template placeholders left in the text
  - Required names and fields present (a rule like this would have caught the wrong-company-name emails above)
  - No obvious model boilerplate leaking into the output
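
These rules stay trivial on purpose. A sketch, where the specific checks are examples to swap for your own:

# quality_rules.py — cheap automated checks that run on every output
import re

def rule_check(output: str, task: dict) -> list:
    """Return a list of problems; an empty list means the output passed."""
    problems = []
    if len(output) < 200:
        problems.append("suspiciously short output")
    if re.search(r"\{\w+\}", output):
        problems.append("unfilled template placeholder")
    if task.get("company") and task["company"] not in output:
        problems.append("expected company name missing")  # would have caught the email incident above
    if "as an ai" in output.lower():
        problems.append("model boilerplate leaked into the output")
    return problems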

Level 3: LLM-as-Judge (semi-automated)

Use a cheaper model to evaluate your agent's output. Send the task description + output to GPT-4o-mini and ask for a quality score.

Prompt for the judge:
"Rate this agent output 1-10 on:
- Accuracy (facts correct?)
- Completeness (all parts addressed?)
- Usefulness (would a human find this valuable?)
- Format (well-structured?)

Task: {original_task}
Output: {agent_output}

Score (1-10) and one-line explanation:"

This costs pennies per evaluation and correlates well with human judgment. Run it on 10% of outputs.
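
Wired up with the OpenAI Python SDK, the judge looks roughly like this; the 10% sampling and model choice match the numbers above, and the function name is ours:

# llm_judge.py — score a sample of agent outputs with a cheaper model
import random
from openai import OpenAI

client = OpenAI()
JUDGE_PROMPT = """Rate this agent output 1-10 on:
- Accuracy (facts correct?)
- Completeness (all parts addressed?)
- Usefulness (would a human find this valuable?)
- Format (well-structured?)

Task: {original_task}
Output: {agent_output}

Score (1-10) and one-line explanation:"""

def judge(task: str, output: str):
    """Evaluate roughly 10% of outputs; returns the judge's verdict or None if skipped."""
    if random.random() > 0.10:
        return None
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(original_task=task, agent_output=output)}],
    )
    return resp.choices[0].message.content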

Our Real Monitoring Setup

Here's exactly what we run for our 4 production agents:

  1. Heartbeat — Every agent has a cron heartbeat check every 15 minutes. Missed heartbeats alert on Telegram immediately.
  2. Status files — Each agent writes a JSON status file after every task: timestamp, task name, result, duration, token count, errors.
  3. Daily digest — A scheduled job aggregates all status files and sends a morning report: tasks completed, errors, total spend, quality scores.
  4. Cost tracking — Token usage tracked per-model, per-agent, per-day. We know exactly what each agent costs and can spot anomalies.
  5. Output sampling — Every day we spot-check 3-5 outputs from each agent. Takes 15 minutes total. Log scores in a spreadsheet.
  6. Canary tasks — Weekly canary tests with known answers to verify core capabilities haven't degraded.

Total setup time: About 4 hours. Daily maintenance: 15-20 minutes. Problems caught early: Dozens.
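
For reference, the per-task status record from item 2 boils down to a few fields. A sketch; the field names and file layout are just one way to do it:

# status_file.py — append one JSON record per completed task, one file per agent
import datetime
import json
import pathlib

def write_status(agent: str, task: str, result: str,
                 duration_s: float, tokens: int, errors: list) -> None:
    """The daily digest job later reads every status/*.jsonl file and aggregates."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent": agent, "task": task, "result": result,
        "duration_s": duration_s, "tokens": tokens, "errors": errors,
    }
    path = pathlib.Path(f"status/{agent}.jsonl")
    path.parent.mkdir(exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")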

💡 The monitoring pays for itself the first time it catches a runaway agent burning $50/hour in tokens. That happened to us in month 2. The alert fired within 20 minutes instead of letting it run all night.

The Quick-Start Checklist

Don't try to implement everything at once. Here's the order:

  1. Week 1: Add a heartbeat check and start logging task-start / task-complete events.
  2. Week 2: Build the dashboard. A Google Sheet with the 7 metrics is enough.
  3. Week 3: Add alert rules: critical alerts to your phone, warnings in a daily digest.
  4. Week 4: Add quality scoring: daily spot-checks first, then rule-based checks and a canary task.

After 4 weeks, you'll have better monitoring than 99% of people running AI agents. And you'll sleep better knowing your agents are actually doing what they're supposed to.

Want the Complete Agent Setup?

The AI Employee Playbook includes monitoring templates, health check scripts, and the exact alert configuration we use in production.

Get the Playbook — €29 →

📡 The Operator Signal

Weekly field notes on building AI agents that actually work. No hype, no spam.
