AI Agent Monitoring: How to Know If Your Agent Is Actually Working

You deployed an AI agent. It's running 24/7. It sends you occasional updates. Everything seems fine.

But here's the uncomfortable question: how do you actually know it's doing its job?

Most people who run AI agents in production have no monitoring at all. They set it up, it runs, and they assume it works until something visibly breaks. That's like hiring an employee and never checking their output.

After running 4 AI agents in production for over a year, we've built a monitoring system that catches problems before they become expensive. Here's the exact framework.

The Silent Failure Problem

AI agents fail differently than traditional software. A web server crashes and you get a 500 error. An AI agent fails silently — it keeps running but produces garbage output, skips tasks, or loops on the same action forever.

Here are the failure modes we've seen in production:

  - Garbage output: the agent keeps producing text, but it's wrong, incomplete, or irrelevant
  - Skipped tasks: work silently drops off the queue and nobody notices
  - Infinite loops: the same action repeats over and over, burning tokens
  - Broken tools: an API or integration fails and the agent carries on without it

⚠️ The most dangerous failure: Your agent appears to work fine but is actually producing subtly wrong outputs. We had an agent send "personalized" emails with the wrong company names for 3 days before we caught it.

None of these trigger a crash. None of these send an error email. Your agent keeps running, keeps burning tokens, and keeps producing bad output — unless you're monitoring.

7 Metrics That Actually Matter

Forget vanity metrics. Here are the 7 numbers that tell you if your agent is healthy:

Metric #1: Task Completion Rate

What percentage of assigned tasks does your agent actually complete? Not start — complete.

Target: >95% for simple tasks, >85% for complex tasks
Red flag: Drops below 80% or suddenly changes by >10%
How to measure: Log task-start and task-complete events. Calculate the ratio daily.
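
A minimal sketch of that event log in Python; the events.jsonl path and helper names are placeholders, not a prescribed format:

# completion_rate.py — log task-start / task-complete events, then compute the ratio
import json
import time
from pathlib import Path

LOG = Path("events.jsonl")  # placeholder location for the event log

def log_event(task_id: str, event: str) -> None:
    """Append one JSON line per event; call with "task-start" and "task-complete"."""
    with LOG.open("a") as f:
        f.write(json.dumps({"task_id": task_id, "event": event, "ts": time.time()}) + "\n")

def completion_rate() -> float:
    """Completed tasks divided by started tasks across the whole log."""
    if not LOG.exists():
        return 0.0
    started, completed = set(), set()
    for line in LOG.read_text().splitlines():
        e = json.loads(line)
        if e["event"] == "task-start":
            started.add(e["task_id"])
        elif e["event"] == "task-complete":
            completed.add(e["task_id"])
    return len(completed & started) / max(len(started), 1)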

Metric #2: Token Spend per Task

How many tokens (and dollars) does each task cost? This catches infinite loops and over-thinking before they drain your budget.

Target: Establish a baseline, then alert on 2x deviation
Red flag: Sudden spike = infinite loop. Gradual increase = prompt bloat.
How to measure: Track API usage per task. Most providers report token usage in the API response.
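
If you're on the OpenAI Python SDK, a per-task meter can be as small as this sketch; the prices are placeholders you'd set for your actual model, and the wrapper name is ours:

# cost_per_task.py — read token usage from each API response and sum it per task
from openai import OpenAI

client = OpenAI()
PRICE_IN, PRICE_OUT = 2.50 / 1_000_000, 10.00 / 1_000_000  # $/token, placeholder prices

def call_and_meter(task_id: str, messages: list, totals: dict) -> str:
    """Make one model call and add its cost to the running total for this task."""
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    cost = (resp.usage.prompt_tokens * PRICE_IN
            + resp.usage.completion_tokens * PRICE_OUT)
    totals[task_id] = totals.get(task_id, 0.0) + cost
    return resp.choices[0].message.content

# Alert when totals[task_id] climbs past roughly 2x your baseline for that task type.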

Metric #3: Error Rate

How often does your agent hit errors — tool failures, API timeouts, malformed responses?

Target: <5% of all actions
Red flag: >10% or clustered errors (3+ in a row)
How to measure: Count error events vs. total actions per hour.

Metric #4: Response Latency

How long does each task take from start to finish? Slow agents waste your time even if they eventually succeed.

Target: Set per-task-type baselines (research: 5 min, email: 30 sec, etc.)
Red flag: 3x baseline = something's wrong
How to measure: Timestamp task start and end. Track p50, p95, p99.
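
Computing those percentiles needs nothing beyond the standard library; a sketch, assuming you already collect per-task durations in seconds:

# latency.py — p50 / p95 / p99 from a list of task durations
import statistics

def latency_percentiles(durations: list) -> dict:
    """durations: per-task wall-clock times in seconds (needs at least 2 samples)."""
    cuts = statistics.quantiles(durations, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Compare p95 against the baseline for that task type; 3x baseline is the red flag.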

Metric #5: Retry Count

How many times does the agent retry before succeeding? High retries = fragile setup.

Target: Average <2 retries per task
Red flag: >5 retries on any single task
How to measure: Log retry events with task IDs. Aggregate daily.

Metric #6: Uptime / Heartbeat

Is your agent actually running? Sounds obvious, but agents crash silently more than you'd think.

Target: 99.5%+ uptime
Red flag: Missed heartbeat = agent is down or stuck
How to measure: Send a heartbeat every 5-15 minutes. Alert after 2 missed beats.

Metric #7: Output Quality Score

The hardest one — but the most important. Is the output actually good? We'll cover how to measure this in detail below.

Target: >8/10 average quality score
Red flag: Trending downward over a week
How to measure: Spot-check sampling + automated checks (see Output Quality Scoring below).

Building Your Monitoring Dashboard

You don't need Datadog or Grafana for this. A simple dashboard that shows your 7 metrics at a glance is all you need to start.

Here's what ours looks like:

AGENT HEALTH DASHBOARD — Last 24h

Task Completion: 97%
Avg Cost/Task: $2.40
Error Rate: 3.1%
Avg Latency: 4.2m
Avg Retries: 1.3
Uptime: 99.8%
Quality Score: 8.4
Tasks Today: 142

All systems nominal ✓

Implementation options (simplest to most complex):

  1. Google Sheets + Apps Script — Agent logs to a sheet, formulas calculate metrics. Free, simple, works.
  2. Notion database — Agent creates entries via API. Built-in charts. Good for non-technical users.
  3. Custom HTML dashboard — Agent writes a JSON file, a static page reads it. No backend needed.
  4. PostHog / Mixpanel — Fire events from your agent. Free tier is usually enough. Real analytics.

💡 Start with option 1. Seriously. A Google Sheet with 7 columns, updated hourly, beats a sophisticated monitoring stack that you never build. You can always upgrade later.
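
For option 3, the agent-side half is just dumping the metrics to a file. Here's a sketch using the numbers from the example dashboard above; the field names are whatever you choose:

# write_dashboard.py — dump the 7 metrics to a JSON file that a static page can fetch
import datetime
import json

def write_dashboard(metrics: dict, path: str = "dashboard.json") -> None:
    """Overwrite the dashboard file after each task or on a schedule."""
    metrics["updated_at"] = datetime.datetime.now(datetime.timezone.utc).isoformat()
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)

write_dashboard({
    "task_completion": 0.97, "avg_cost_per_task": 2.40, "error_rate": 0.031,
    "avg_latency_min": 4.2, "avg_retries": 1.3, "uptime": 0.998,
    "quality_score": 8.4, "tasks_today": 142,
})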

⚡ Quick Shortcut

Skip months of trial and error

The AI Employee Playbook gives you production-ready templates, prompts, and workflows — everything in this guide and more, ready to deploy.

Get the Playbook — €29

Alert Rules That Don't Cry Wolf

Bad alert rules are worse than no alerts. If you get 20 notifications a day, you'll ignore them all — including the ones that matter.

Here's our alert hierarchy:

🔴 Critical (immediate notification)

  - Two missed heartbeats: the agent is down or stuck
  - Cost per task above 2x baseline: likely an infinite loop
  - Three or more errors in a row: something upstream is broken

🟡 Warning (daily digest)

  - Task completion rate below 80%, or a sudden shift of more than 10%
  - Error rate above 5%
  - Latency above 3x the task-type baseline
  - Average retries above 2

🟢 Info (weekly review)

  - Gradual cost increases (usually prompt bloat)
  - Quality score trend over the week
  - Totals per agent: tasks completed, spend, top error types

The rule of thumb: You should get a critical alert at most once a week. If you're getting one daily, either fix the root cause or downgrade it to a warning.
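
In code, the whole hierarchy is a handful of threshold checks. A sketch based on the red flags listed above; the metric keys are placeholders for however you store your numbers:

# alert_rules.py — classify the latest metrics into critical / warning tiers
def classify(m: dict) -> list:
    """Return (severity, message) pairs; info-level items come from the weekly review instead."""
    alerts = []
    if m["missed_heartbeats"] >= 2:
        alerts.append(("critical", "agent down or stuck"))
    if m["cost_per_task"] > 2 * m["cost_baseline"]:
        alerts.append(("critical", "cost spike - possible infinite loop"))
    if m["consecutive_errors"] >= 3:
        alerts.append(("critical", "clustered errors"))
    if m["completion_rate"] < 0.80:
        alerts.append(("warning", "task completion below 80%"))
    if m["error_rate"] > 0.05:
        alerts.append(("warning", "error rate above 5%"))
    if m["avg_latency"] > 3 * m["latency_baseline"]:
        alerts.append(("warning", "latency above 3x baseline"))
    if m["avg_retries"] > 2:
        alerts.append(("warning", "average retries above 2"))
    return alerts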

Automated Health Checks

Don't just monitor passively. Run active health checks that verify your agent is actually capable of doing work.

The Heartbeat Pattern

Every 15 minutes, send your agent a trivial task. If it responds correctly, it's alive and functional. If it doesn't, something's wrong.

# Heartbeat check (runs every 15 min via cron)
# 1. Send agent a test message: "heartbeat check"
# 2. Agent should respond: "HEARTBEAT_OK"
# 3. If no response in 60 seconds → alert

# Implementation:
# - Set up a cron job that pings the agent
# - Log response time and status
# - Alert on 2 consecutive failures
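
Here's a rough sketch of that check in Python. It assumes the agent is reachable over HTTP at a placeholder URL; adjust send_heartbeat() to however you actually talk to your agent:

# heartbeat.py — run every 15 min via cron; alert after 2 consecutive failures
import os
import urllib.request

AGENT_URL = "http://localhost:8080/message"   # placeholder: however you reach your agent
FAIL_FILE = "heartbeat_failures.txt"

def send_heartbeat() -> bool:
    """Send the trivial test message and check for the expected reply."""
    req = urllib.request.Request(AGENT_URL, data=b'{"text": "heartbeat check"}',
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return b"HEARTBEAT_OK" in resp.read()
    except Exception:
        return False

if __name__ == "__main__":
    fails = int(open(FAIL_FILE).read() or 0) if os.path.exists(FAIL_FILE) else 0
    fails = 0 if send_heartbeat() else fails + 1
    open(FAIL_FILE, "w").write(str(fails))
    if fails >= 2:
        print("ALERT: agent missed 2 heartbeats")  # swap for your Telegram/email alert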

The Canary Task

Once per day, give your agent a task with a known correct answer. Compare the output. This catches quality degradation that metrics alone miss.

Examples:

  - Research agent: summarize a fixed sample document and check that the key facts appear
  - Email agent: draft a reply to a stored sample email and check that the right company name and all required sections are present
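
Checking a canary can be as simple as string matching. A sketch, where the task and expected facts are placeholders for a fixed input you control:

# canary.py — run a fixed daily task and check the output against known facts
CANARY_TASK = "Summarize the sample report in canary_report.txt."  # placeholder fixed input
EXPECTED_FACTS = ["Acme", "2024 revenue", "three regions"]         # placeholder known answers

def score_canary(output: str) -> float:
    """Fraction of expected facts that actually show up in the agent's output."""
    hits = sum(1 for fact in EXPECTED_FACTS if fact.lower() in output.lower())
    return hits / len(EXPECTED_FACTS)

# Anything below 1.0 on a task this simple is worth a look.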

The Dependency Check

Test every external dependency your agent uses:

  - LLM API: can the agent reach its model provider and get a response?
  - Tools and integrations: does each one (email, Telegram, spreadsheets, whatever you've wired in) accept a test call?
  - Credentials: are API keys and tokens still valid, or about to expire?

Run these checks at startup and every 6 hours. Log results.
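
A sketch of that check using only the standard library; the URLs are placeholders for whatever your agent actually calls:

# dependency_check.py — ping every external service the agent depends on
import urllib.error
import urllib.request

CHECKS = {
    "llm_api": "https://api.openai.com/v1/models",  # a 401 without auth still proves reachability
    "webhook": "https://example.com/health",        # placeholder for your own integrations
}

def check_all() -> dict:
    """Return {dependency_name: True/False}; log this and alert on any False."""
    results = {}
    for name, url in CHECKS.items():
        try:
            urllib.request.urlopen(url, timeout=10)
            results[name] = True
        except urllib.error.HTTPError as e:
            results[name] = e.code < 500  # reachable, it just refused this unauthenticated call
        except Exception:
            results[name] = False
    return results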

Output Quality Scoring

This is where most people give up. "How do I measure quality? It's subjective!" Not entirely. Here are three approaches, from simple to sophisticated:

Level 1: Spot-Check Sampling (5 min/day)

Every day, randomly pick 3 outputs and rate them 1-10. Track the scores over time. Takes 5 minutes and catches most quality problems within a few days.

Level 2: Rule-Based Checks (automated)

Write simple rules that check for common quality issues (a sketch follows this list):

  - Output length within the expected range (not suspiciously short, not rambling)
  - No unfilled template placeholders left in the text
  - Required names and fields present (a rule like this would have caught the wrong-company-name emails above)
  - No obvious model boilerplate leaking into the output
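
These rules stay trivial on purpose. A sketch, where the specific checks are examples to swap for your own:

# quality_rules.py — cheap automated checks that run on every output
import re

def rule_check(output: str, task: dict) -> list:
    """Return a list of problems; an empty list means the output passed."""
    problems = []
    if len(output) < 200:
        problems.append("suspiciously short output")
    if re.search(r"\{\w+\}", output):
        problems.append("unfilled template placeholder")
    if task.get("company") and task["company"] not in output:
        problems.append("expected company name missing")  # would have caught the email incident above
    if "as an ai" in output.lower():
        problems.append("model boilerplate leaked into the output")
    return problems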

Level 3: LLM-as-Judge (semi-automated)

Use a cheaper model to evaluate your agent's output. Send the task description + output to GPT-4o-mini and ask for a quality score.

Prompt for the judge:
"Rate this agent output 1-10 on:
- Accuracy (facts correct?)
- Completeness (all parts addressed?)
- Usefulness (would a human find this valuable?)
- Format (well-structured?)

Task: {original_task}
Output: {agent_output}

Score (1-10) and one-line explanation:"

This costs pennies per evaluation and correlates well with human judgment. Run it on 10% of outputs.
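
Wired up with the OpenAI Python SDK, the judge looks roughly like this; the 10% sampling and model choice match the numbers above, and the function name is ours:

# llm_judge.py — score a sample of agent outputs with a cheaper model
import random
from openai import OpenAI

client = OpenAI()
JUDGE_PROMPT = """Rate this agent output 1-10 on:
- Accuracy (facts correct?)
- Completeness (all parts addressed?)
- Usefulness (would a human find this valuable?)
- Format (well-structured?)

Task: {original_task}
Output: {agent_output}

Score (1-10) and one-line explanation:"""

def judge(task: str, output: str):
    """Evaluate roughly 10% of outputs; returns the judge's verdict or None if skipped."""
    if random.random() > 0.10:
        return None
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(original_task=task, agent_output=output)}],
    )
    return resp.choices[0].message.content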

Our Real Monitoring Setup

Here's exactly what we run for our 4 production agents:

  1. Heartbeat — Every agent has a cron heartbeat check every 15 minutes. Missed heartbeats alert on Telegram immediately.
  2. Status files — Each agent writes a JSON status file after every task: timestamp, task name, result, duration, token count, errors.
  3. Daily digest — A scheduled job aggregates all status files and sends a morning report: tasks completed, errors, total spend, quality scores.
  4. Cost tracking — Token usage tracked per-model, per-agent, per-day. We know exactly what each agent costs and can spot anomalies.
  5. Output sampling — Every day we spot-check 3-5 outputs from each agent. Takes 15 minutes total. Log scores in a spreadsheet.
  6. Canary tasks — Weekly canary tests with known answers to verify core capabilities haven't degraded.

Total setup time: About 4 hours. Daily maintenance: 15-20 minutes. Problems caught early: Dozens.
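
For reference, the per-task status record from item 2 boils down to a few fields. A sketch; the field names and file layout are just one way to do it:

# status_file.py — append one JSON record per completed task, one file per agent
import datetime
import json
import pathlib

def write_status(agent: str, task: str, result: str,
                 duration_s: float, tokens: int, errors: list) -> None:
    """The daily digest job later reads every status/*.jsonl file and aggregates."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent": agent, "task": task, "result": result,
        "duration_s": duration_s, "tokens": tokens, "errors": errors,
    }
    path = pathlib.Path(f"status/{agent}.jsonl")
    path.parent.mkdir(exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")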

💡 The monitoring pays for itself the first time it catches a runaway agent burning $50/hour in tokens. That happened to us in month 2. The alert fired within 20 minutes instead of letting it run all night.

The Quick-Start Checklist

Don't try to implement everything at once. Here's the order:

  1. Week 1: Add a heartbeat check and start logging task-start / task-complete events.
  2. Week 2: Build the dashboard. A Google Sheet with the 7 metrics is enough.
  3. Week 3: Add alert rules: critical alerts to your phone, warnings in a daily digest.
  4. Week 4: Add quality scoring: daily spot-checks first, then rule-based checks and a canary task.

After 4 weeks, you'll have better monitoring than 99% of people running AI agents. And you'll sleep better knowing your agents are actually doing what they're supposed to.

Want the Complete Agent Setup?

The AI Employee Playbook includes monitoring templates, health check scripts, and the exact alert configuration we use in production.

Get the Playbook — €29 →

📡 The Operator Signal

Weekly field notes on building AI agents that actually work. No hype, no spam.
