AI Agent Monitoring: How to Know If Your Agent Is Actually Working
You deployed an AI agent. It's running 24/7. It sends you occasional updates. Everything seems fine.
But here's the uncomfortable question: how do you actually know it's doing its job?
Most people who run AI agents in production have no monitoring at all. They set it up, it runs, and they assume it works until something visibly breaks. That's like hiring an employee and never checking their output.
After running 4 AI agents in production for over a year, we've built a monitoring system that catches problems before they become expensive. Here's the exact framework.
The Silent Failure Problem
AI agents fail differently than traditional software. A web server crashes and you get a 500 error. An AI agent fails silently — it keeps running but produces garbage output, skips tasks, or loops on the same action forever.
Here are the failure modes we've seen in production:
- Hallucination drift — the agent starts making up data instead of fetching it
- Task abandonment — it silently drops tasks when it hits errors
- Infinite loops — retrying the same failed action forever, burning tokens
- Quality degradation — output slowly gets worse but never completely breaks
- Context overflow — the agent loses track of what it's doing mid-task
- API dependency failures — external APIs change and the agent adapts poorly
⚠️ The most dangerous failure: Your agent appears to work fine but is actually producing subtly wrong outputs. We had an agent send "personalized" emails with the wrong company names for 3 days before we caught it.
None of these trigger a crash. None of these send an error email. Your agent keeps running, keeps burning tokens, and keeps producing bad output — unless you're monitoring.
7 Metrics That Actually Matter
Forget vanity metrics. Here are the 7 numbers that tell you if your agent is healthy:
Task Completion Rate
What percentage of assigned tasks does your agent actually complete? Not start — complete.
Target: >95% for simple tasks, >85% for complex tasks
Red flag: Drops below 80% or suddenly changes by >10%
How to measure: Log task-start and task-complete events. Calculate the ratio daily.
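A minimal sketch of that calculation, assuming your agent emits log events as dicts with a `type` field (the event names are illustrative — match whatever your agent actually logs):

```python
from collections import Counter

def completion_rate(events):
    """Ratio of completed tasks to started tasks for a batch of log events.

    Each event is a dict like {"type": "task_start", "task_id": "t1"}.
    Returns None when no tasks were started (avoids division by zero).
    """
    counts = Counter(e["type"] for e in events)
    started = counts["task_start"]
    if started == 0:
        return None
    return counts["task_complete"] / started

events = [
    {"type": "task_start", "task_id": "t1"},
    {"type": "task_complete", "task_id": "t1"},
    {"type": "task_start", "task_id": "t2"},  # started but never completed
]
print(completion_rate(events))  # 0.5
```

Run it once a day over that day's events and append the ratio to your dashboard.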
Token Spend per Task
How many tokens (and dollars) does each task cost? This catches infinite loops and over-thinking before they drain your budget.
Target: Establish a baseline, then alert on 2x deviation
Red flag: Sudden spike = infinite loop. Gradual increase = prompt bloat.
How to measure: Track API usage per task. Most providers report token counts in the response body (typically a `usage` object with prompt and completion tokens).
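A sketch of turning that usage object into dollars per task. The prices below are illustrative placeholders, not current rates — check your provider's pricing page:

```python
# Price per 1M tokens — illustrative numbers only; check your provider's pricing.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def task_cost(model, usage):
    """Dollar cost of one API call.

    `usage` mirrors the usage object most providers return in the response
    body, e.g. {"prompt_tokens": 1200, "completion_tokens": 300}.
    """
    p = PRICES[model]
    return (usage["prompt_tokens"] * p["input"]
            + usage["completion_tokens"] * p["output"]) / 1_000_000

spend = task_cost("gpt-4o-mini", {"prompt_tokens": 1200, "completion_tokens": 300})
print(f"${spend:.6f}")  # $0.000360
```

Sum these per task ID and you have the baseline you'll alert on.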
Error Rate
How often does your agent hit errors — tool failures, API timeouts, malformed responses?
Target: <5% of all actions
Red flag: >10% or clustered errors (3+ in a row)
How to measure: Count error events vs. total actions per hour.
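Both red flags — the overall rate and the "3+ in a row" cluster — can be checked in one pass, assuming you log a boolean per action in chronological order:

```python
def error_stats(actions):
    """actions: chronological list of booleans, True = the action errored."""
    rate = sum(actions) / len(actions) if actions else 0.0
    run = longest = 0
    for errored in actions:
        run = run + 1 if errored else 0   # length of the current error streak
        longest = max(longest, run)
    return {"error_rate": rate, "clustered": longest >= 3}

stats = error_stats([False, True, True, True, False, False])
print(stats)  # {'error_rate': 0.5, 'clustered': True}
```

Clustered errors usually mean one broken dependency; a high flat rate usually means a flaky setup.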
Response Latency
How long does each task take from start to finish? Slow agents waste your time even if they eventually succeed.
Target: Set per-task-type baselines (research: 5 min, email: 30 sec, etc.)
Red flag: 3x baseline = something's wrong
How to measure: Timestamp task start and end. Track p50, p95, p99.
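A simple nearest-rank percentile is plenty for a dashboard — no numpy needed:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; good enough for dashboard latency stats."""
    s = sorted(samples)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[max(0, k)]

durations = [12, 15, 14, 300, 13, 16, 15, 14, 13, 12]  # seconds per task
print(percentile(durations, 50))  # 14
print(percentile(durations, 95))  # 300
```

Note how one pathological task (300 s) barely moves the p50 but dominates the p95 — that gap is exactly why you track both.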
Retry Count
How many times does the agent retry before succeeding? High retries = fragile setup.
Target: Average <2 retries per task
Red flag: >5 retries on any single task
How to measure: Log retry events with task IDs. Aggregate daily.
Uptime / Heartbeat
Is your agent actually running? Sounds obvious, but agents crash silently more than you'd think.
Target: 99.5%+ uptime
Red flag: Missed heartbeat = agent is down or stuck
How to measure: Send a heartbeat every 5-15 minutes. Alert after 2 missed beats.
Output Quality Score
The hardest one — but the most important. Is the output actually good? We'll cover how to measure this in detail below.
Target: >8/10 average quality score
Red flag: Trending downward over a week
How to measure: Spot-check sampling + automated checks (see the Output Quality Scoring section below).
Building Your Monitoring Dashboard
You don't need Datadog or Grafana for this. A simple dashboard that shows your 7 metrics at a glance is all you need to start.
Here's what ours looks like: a single panel titled "AGENT HEALTH DASHBOARD — Last 24h" listing the 7 metrics with current values, and a green "All systems nominal ✓" banner when everything is within range.
Implementation options (simplest to most complex):
- Google Sheets + Apps Script — Agent logs to a sheet, formulas calculate metrics. Free, simple, works.
- Notion database — Agent creates entries via API. Built-in charts. Good for non-technical users.
- Custom HTML dashboard — Agent writes a JSON file, a static page reads it. No backend needed.
- PostHog / Mixpanel — Fire events from your agent. Free tier is usually enough. Real analytics.
💡 Start with option 1. Seriously. A Google Sheet with 7 columns, updated hourly, beats a sophisticated monitoring stack that you never build. You can always upgrade later.
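Option 3 is almost as simple as the spreadsheet. A sketch, assuming your agent appends one JSON object per line to a `status.jsonl` file after each task (the field names are illustrative — match whatever your agent logs):

```python
import json

def daily_summary(lines):
    """Aggregate JSONL status records into a one-glance health summary.

    Each line is one record the agent appends after every task, e.g.:
    {"task": "email", "ok": true, "tokens": 900, "seconds": 4.2}
    """
    records = [json.loads(line) for line in lines if line.strip()]
    done = sum(r["ok"] for r in records)
    return {
        "tasks": len(records),
        "completion_rate": done / len(records) if records else None,
        "total_tokens": sum(r["tokens"] for r in records),
    }

log = [
    '{"task": "email", "ok": true, "tokens": 900, "seconds": 4.2}',
    '{"task": "research", "ok": false, "tokens": 5200, "seconds": 61.0}',
]
print(daily_summary(log))
```

Write the summary dict to a second JSON file and a static HTML page can render it with a few lines of JavaScript — no backend needed.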
⚡ Quick Shortcut
Skip months of trial and error
The AI Employee Playbook gives you production-ready templates, prompts, and workflows — everything in this guide and more, ready to deploy.
Get the Playbook — €29
Alert Rules That Don't Cry Wolf
Bad alert rules are worse than no alerts. If you get 20 notifications a day, you'll ignore them all — including the ones that matter.
Here's our alert hierarchy:
🔴 Critical (immediate notification)
- Agent heartbeat missed for >30 minutes
- Token spend >5x daily average
- Task completion rate drops below 50%
- Agent sends an external communication (email, message) that contains an error
🟡 Warning (daily digest)
- Error rate >10% for 2+ hours
- Any single task costs >$10
- Retry count >5 on 3+ tasks
- Quality score drops below 7/10
🟢 Info (weekly review)
- Gradual latency increase (>20% week-over-week)
- Token spend trending up
- New error types appearing
The rule of thumb: You should get a critical alert at most once a week. If you're getting one daily, either fix the root cause or downgrade it to a warning.
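The hierarchy above is easy to encode as a single classification function; the thresholds here are the ones from our lists, and you should tune them to your own baselines:

```python
def classify(metric, value, baseline=None):
    """Map a metric reading to an alert severity.

    Thresholds mirror the hierarchy above; anything not matched
    falls through to the weekly "info" bucket.
    """
    if metric == "heartbeat_gap_min" and value > 30:
        return "critical"
    if metric == "daily_spend" and baseline and value > 5 * baseline:
        return "critical"
    if metric == "completion_rate" and value < 0.5:
        return "critical"
    if metric == "error_rate" and value > 0.10:
        return "warning"
    if metric == "task_cost" and value > 10:
        return "warning"
    if metric == "quality_score" and value < 7:
        return "warning"
    return "info"

print(classify("daily_spend", 60, baseline=10))  # critical
```

Route "critical" to an instant channel (Telegram, SMS), batch "warning" into the daily digest, and let "info" accumulate for the weekly review.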
Automated Health Checks
Don't just monitor passively. Run active health checks that verify your agent is actually capable of doing work.
The Heartbeat Pattern
Every 15 minutes, send your agent a trivial task. If it responds correctly, it's alive and functional. If it doesn't, something's wrong.
# Heartbeat check (runs every 15 min via cron)
# 1. Send agent a test message: "heartbeat check"
# 2. Agent should respond: "HEARTBEAT_OK"
# 3. If no response in 60 seconds → alert
# Implementation:
# - Set up a cron job that pings the agent
# - Log response time and status
# - Alert on 2 consecutive failures
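The steps above can be sketched as a small script you run from cron. `send_to_agent` is a hypothetical stub — replace it with however you actually reach your agent (HTTP call, message queue, CLI) and enforce the 60-second timeout there:

```python
import pathlib

STATE = pathlib.Path("heartbeat_misses.txt")  # persists between cron runs

def send_to_agent(message):
    """Stand-in for your agent transport. Replace with a real call
    that raises (or returns something else) on failure/timeout."""
    return "HEARTBEAT_OK"

def heartbeat_check(max_misses=2):
    """Run every 15 min via cron; alert after 2 consecutive failures."""
    try:
        ok = send_to_agent("heartbeat check") == "HEARTBEAT_OK"
    except Exception:
        ok = False
    misses = 0 if ok else (int(STATE.read_text()) if STATE.exists() else 0) + 1
    STATE.write_text(str(misses))
    if misses >= max_misses:
        print(f"ALERT: {misses} consecutive heartbeats missed")  # page yourself here
    return ok
```

The miss counter lives in a file because each cron invocation is a fresh process.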
The Canary Task
Once per day, give your agent a task with a known correct answer. Compare the output. This catches quality degradation that metrics alone miss.
Examples:
- "Summarize this article" (with a pre-written ideal summary to compare against)
- "Look up the weather in Amsterdam" (verify against actual weather API)
- "Calculate the TCO for this truck" (with known correct result)
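For text outputs like the summary example, a rough word-overlap score against the reference answer is enough to act as a tripwire — it is not a precise quality measure, but a score far below your normal range flags degradation worth a look:

```python
import difflib

def canary_score(output, reference):
    """Rough similarity (0-1) between agent output and the known-good answer."""
    return difflib.SequenceMatcher(
        None, output.lower().split(), reference.lower().split()
    ).ratio()

score = canary_score(
    "The article argues monitoring catches silent failures early.",
    "The article argues that monitoring catches silent failures early.",
)
print(round(score, 2))
```

For canaries with exact answers (the weather lookup, the TCO calculation), skip the similarity score and compare values directly.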
The Dependency Check
Test every external dependency your agent uses:
- Can it reach all APIs it needs?
- Are credentials still valid?
- Is the database accessible?
- Are file paths still correct?
Run these checks at startup and every 6 hours. Log results.
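A sketch of a pluggable check runner, using only the standard library. The endpoint and file names are placeholders — register the APIs, credential files, and paths your agent actually depends on:

```python
import os
import urllib.request

def check_api(url, timeout=5):
    """True if the endpoint answers with a non-5xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except Exception:
        return False

def run_dependency_checks(checks):
    """checks: {name: zero-arg callable returning bool}. Returns the failures."""
    return [name for name, passed in checks.items() if not passed()]

# Network-free demo; in production add entries like
#   "llm_api": lambda: check_api("https://api.example.com/health"),
failures = run_dependency_checks({
    "workdir_exists": lambda: os.path.exists("."),
    "creds_file": lambda: os.path.exists("credentials.json"),
})
print(failures)
```

Run the full dict at startup and every 6 hours; any non-empty failure list goes into the daily digest.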
Output Quality Scoring
This is where most people give up. "How do I measure quality? It's subjective!" Not entirely. Here are three approaches, from simple to sophisticated:
Level 1: Spot-Check Sampling (5 min/day)
Every day, randomly pick 3 outputs and rate them 1-10. Track the scores over time. Takes 5 minutes and catches most quality problems within a few days.
Level 2: Rule-Based Checks (automated)
Write simple rules that check for common quality issues:
- Is the output longer than 50 characters? (catches empty/stub responses)
- Does it contain required fields? (catches incomplete outputs)
- Is it different from the last output? (catches copy-paste loops)
- Does it contain forbidden phrases? (catches hallucination markers like "I don't have access to...")
- Is the format correct? (JSON valid? Email has subject?)
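These rules are a few lines each. A sketch that implements the list above and returns the names of the rules an output failed (tune the thresholds and required fields to your task types):

```python
import json

def quality_flags(output, last_output=None, required=(), expect_json=False):
    """Run cheap rule-based checks; returns a list of failed rule names."""
    flags = []
    if len(output) < 50:
        flags.append("too_short")            # empty/stub response
    if last_output is not None and output == last_output:
        flags.append("duplicate_of_last")    # copy-paste loop
    if "i don't have access" in output.lower():
        flags.append("hallucination_marker")
    flags += [f"missing_{f}" for f in required if f not in output.lower()]
    if expect_json:
        try:
            json.loads(output)
        except ValueError:
            flags.append("invalid_json")
    return flags

email = "Subject: Q3 report\n\nHi team, the Q3 numbers are attached..."
print(quality_flags(email, required=("subject",)))  # []
```

Run every output through this before it leaves the agent; any flag either blocks the output or lands in the daily digest.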
Level 3: LLM-as-Judge (semi-automated)
Use a cheaper model to evaluate your agent's output. Send the task description + output to GPT-4o-mini and ask for a quality score.
Prompt for the judge:
"Rate this agent output 1-10 on:
- Accuracy (facts correct?)
- Completeness (all parts addressed?)
- Usefulness (would a human find this valuable?)
- Format (well-structured?)
Task: {original_task}
Output: {agent_output}
Score (1-10) and one-line explanation:"
This costs pennies per evaluation and correlates well with human judgment. Run it on 10% of outputs.
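Wiring this up mostly means formatting the prompt and parsing the score back out. In this sketch, `call_model` is a hypothetical hook for your judge-model API (e.g., a thin wrapper around a GPT-4o-mini chat call) — the stub below stands in for it:

```python
import re

JUDGE_PROMPT = """Rate this agent output 1-10 on accuracy, completeness,
usefulness, and format.

Task: {task}
Output: {output}

Reply with: SCORE: <1-10> — <one-line explanation>"""

def parse_judge_reply(reply):
    """Pull the numeric score out of the judge model's reply."""
    m = re.search(r"SCORE:\s*(\d+)", reply)
    return int(m.group(1)) if m else None

def judge(task, output, call_model):
    """call_model: function sending a prompt string to your judge model."""
    prompt = JUDGE_PROMPT.format(task=task, output=output)
    return parse_judge_reply(call_model(prompt))

# Stubbed model call for illustration:
score = judge("Summarize the report", "The report shows...",
              lambda p: "SCORE: 8 — complete and well-formatted")
print(score)  # 8
```

Asking for a fixed `SCORE:` prefix keeps the reply machine-parseable; returning `None` on a malformed reply lets you log judge failures instead of silently recording a bad score.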
Our Real Monitoring Setup
Here's exactly what we run for our 4 production agents:
- Heartbeat — Every agent has a cron heartbeat check every 15 minutes. Missed heartbeats alert on Telegram immediately.
- Status files — Each agent writes a JSON status file after every task: timestamp, task name, result, duration, token count, errors.
- Daily digest — A scheduled job aggregates all status files and sends a morning report: tasks completed, errors, total spend, quality scores.
- Cost tracking — Token usage tracked per-model, per-agent, per-day. We know exactly what each agent costs and can spot anomalies.
- Output sampling — Every day we spot-check 3-5 outputs from each agent and log the scores in a spreadsheet. Takes 15 minutes total.
- Canary tasks — Weekly canary tests with known answers to verify core capabilities haven't degraded.
Total setup time: About 4 hours. Daily maintenance: 15-20 minutes. Problems caught early: Dozens.
💡 The monitoring pays for itself the first time it catches a runaway agent burning $50/hour in tokens. That happened to us in month 2. The alert fired within 20 minutes instead of letting it run all night.
The Quick-Start Checklist
Don't try to implement everything at once. Here's the order:
- Week 1: Add heartbeat checks (15 min to implement)
- Week 1: Start logging task completion (add timestamps to your agent logs)
- Week 2: Track token spend per task (most APIs report this)
- Week 2: Set up critical alerts (heartbeat + cost spike)
- Week 3: Start daily spot-checks (5 min/day)
- Week 3: Build your dashboard (Google Sheets is fine)
- Week 4: Add rule-based quality checks
- Week 4: Implement canary tasks
After 4 weeks, you'll have better monitoring than 99% of people running AI agents. And you'll sleep better knowing your agents are actually doing what they're supposed to.
Want the Complete Agent Setup?
The AI Employee Playbook includes monitoring templates, health check scripts, and the exact alert configuration we use in production.
Get the Playbook — €29 →