How to Build an AI Agent with Google Gemini: Complete Guide

Google Gemini is the dark horse of the AI agent world. While Claude and GPT-4 get the headlines, Gemini quietly ships features that matter for agents: native grounding with Google Search, a 2 million token context window, and the most generous free tier in the industry.

This guide shows you how to build a production agent with Gemini. We'll cover the Gemini API, function calling, grounding, multimodal input, and the patterns that work in practice. This is the third part of our "Build an Agent" trilogy — read the Claude guide and OpenAI guide for comparison.

At a glance:

- 2M token context window
- Generous free tier
- Native Google Search grounding
- Multimodal: text + image + video + audio

Why Gemini for AI Agents?

Gemini has unique advantages that Claude and GPT-4 can't match:

- A 2 million token context window (Claude tops out at 200K, GPT-4o at 128K)
- Built-in Google Search grounding, so there's no separate search integration to maintain
- Native multimodal input: text, images, video, and audio in one model
- The most generous free tier in the industry, enough to run real agents during development
- Flash pricing ($0.075 per 1M input tokens) that undercuts competitors by an order of magnitude

Honest caveat: Gemini's function calling is less reliable than Claude's tool use or GPT-4's function calling. It works, but you'll need more robust error handling. The tradeoff is worth it for the context window and grounding features.
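That error handling starts before execution: validate every function call against your own declarations before running it. A minimal sketch (validateFunctionCall is a hypothetical helper, not part of the SDK):

// Hypothetical helper: reject malformed function calls before they reach your tools.
function validateFunctionCall(name, args, declarations) {
  // Reject calls to tools we never declared.
  const decl = declarations.find(d => d.name === name);
  if (!decl) return { ok: false, error: `Unknown tool: ${name}` };

  // Reject calls that omit required parameters (a known Gemini quirk).
  const required = decl.parameters?.required ?? [];
  const missing = required.filter(key => args?.[key] === undefined);
  if (missing.length > 0) {
    return { ok: false, error: `Missing required args: ${missing.join(', ')}` };
  }
  return { ok: true };
}

On failure, feed the error string back as the functionResponse so Gemini can retry with corrected arguments instead of crashing the loop.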

Getting Started: API Setup

Option 1: Google AI Studio (simpler)

npm install @google/generative-ai

import { GoogleGenerativeAI } from '@google/generative-ai';
import fs from 'fs';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

// Load the 3-file framework
const soul = fs.readFileSync('./SOUL.md', 'utf8');
const agents = fs.readFileSync('./AGENTS.md', 'utf8');
const user = fs.readFileSync('./USER.md', 'utf8');
const systemPrompt = `${soul}\n\n${agents}\n\n${user}`;

const model = genAI.getGenerativeModel({
  model: 'gemini-2.0-flash',
  systemInstruction: systemPrompt,
});
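
A quick smoke test confirms the key and system prompt are wired up (generateContent and text() are the SDK's standard one-shot calls):

const result = await model.generateContent('Introduce yourself in one sentence.');
console.log(result.response.text());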

Option 2: Vertex AI (enterprise)

npm install @google-cloud/vertexai

import { VertexAI } from '@google-cloud/vertexai';

// Vertex AI authenticates with Application Default Credentials, not an API key.
const vertex = new VertexAI({
  project: process.env.GCP_PROJECT_ID,
  location: 'us-central1',
});

const model = vertex.getGenerativeModel({
  model: 'gemini-2.0-flash',
  systemInstruction: systemPrompt, // same 3-file prompt built in Option 1
});

The Agent Loop

Gemini's chat interface handles conversation history for you. The agent loop pattern:

async function agentLoop(userMessage) {
  const chat = model.startChat({
    tools: [{ functionDeclarations: getToolDeclarations() }],
  });

  let response = await chat.sendMessage(userMessage);

  // Safety cap: a confused model shouldn't be able to tool-call forever.
  const MAX_TURNS = 10;
  for (let turn = 0; turn < MAX_TURNS; turn++) {
    const parts = response.response.candidates[0].content.parts;

    // Check for function calls
    const functionCalls = parts.filter(p => p.functionCall);

    if (functionCalls.length === 0) {
      // No function calls — return the text response
      const textPart = parts.find(p => p.text);
      return textPart?.text || '';
    }

    // Execute each function call
    const functionResponses = [];
    for (const part of functionCalls) {
      const { name, args } = part.functionCall;
      const result = await executeTool(name, args);
      functionResponses.push({
        functionResponse: {
          name,
          response: result,
        },
      });
    }

    // Send results back to Gemini
    response = await chat.sendMessage(functionResponses);
  }

  throw new Error(`Agent exceeded ${MAX_TURNS} tool-call turns`);
}
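
Calling the loop is then a one-liner; each invocation gets a fresh chat whose history covers every tool call within that run:

const answer = await agentLoop("Find this week's AI agent news and save a summary to notes.md");
console.log(answer);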

Define Tools

function getToolDeclarations() {
  return [
    {
      name: 'web_search',
      description: 'Search the web for current information, news, and data.',
      parameters: {
        type: 'object',
        properties: {
          query: {
            type: 'string',
            description: 'Search query — be specific for better results',
          },
        },
        required: ['query'],
      },
    },
    {
      name: 'read_file',
      description: 'Read a file from the workspace.',
      parameters: {
        type: 'object',
        properties: {
          path: {
            type: 'string',
            description: 'File path relative to workspace root',
          },
        },
        required: ['path'],
      },
    },
    {
      name: 'write_file',
      description: 'Write content to a file. Creates it if it does not exist.',
      parameters: {
        type: 'object',
        properties: {
          path: { type: 'string', description: 'File path' },
          content: { type: 'string', description: 'Content to write' },
        },
        required: ['path', 'content'],
      },
    },
    {
      name: 'analyze_data',
      description: 'Analyze structured data (CSV, JSON) and return insights.',
      parameters: {
        type: 'object',
        properties: {
          data_path: { type: 'string', description: 'Path to the data file' },
          question: { type: 'string', description: 'What to analyze' },
        },
        required: ['data_path', 'question'],
      },
    },
  ];
}
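
The loop calls executeTool, which the declarations above leave to you. A minimal dispatcher sketch; searchWeb and analyzeData are hypothetical placeholders for whatever backends you wire in:

import { readFile, writeFile } from 'fs/promises';

async function executeTool(name, args) {
  try {
    switch (name) {
      case 'read_file':
        return { content: await readFile(args.path, 'utf8') };
      case 'write_file':
        await writeFile(args.path, args.content);
        return { success: true };
      case 'web_search':
        return await searchWeb(args.query); // hypothetical search backend
      case 'analyze_data':
        return await analyzeData(args.data_path, args.question); // hypothetical
      default:
        return { error: `Unknown tool: ${name}` };
    }
  } catch (error) {
    // Return errors as data so Gemini can react instead of the loop crashing.
    return { error: error.message };
  }
}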

Gemini's Killer Feature: Grounding

Grounding lets Gemini search Google before answering. No separate search API needed, no RAG pipeline, no vector database. Just enable it:

const modelWithGrounding = genAI.getGenerativeModel({
  model: 'gemini-2.0-flash',
  systemInstruction: systemPrompt,
  tools: [{
    // Note: dynamic retrieval (googleSearchRetrieval) is documented for the
    // Gemini 1.5 models; the 2.0 models use "Search as a tool" (googleSearch: {})
    // instead. Check the docs for the variant your model expects.
    googleSearchRetrieval: {
      dynamicRetrievalConfig: {
        mode: 'MODE_DYNAMIC',
        dynamicThreshold: 0.3, // Lower = more searching
      },
    },
  }],
});

// Gemini automatically searches when it needs current information
const result = await modelWithGrounding.generateContent(
  'What are the latest developments in AI agent frameworks this week?'
);

// Response includes grounding metadata
const groundingMeta = result.response.candidates[0].groundingMetadata;
if (groundingMeta?.searchEntryPoint) {
  console.log('Sources:', groundingMeta.webSearchQueries);
}

Why this matters for agents: Most agents need web search. With Claude or GPT, you build a separate search integration (Brave, Serper, Tavily). With Gemini, it's built-in. One less integration to maintain, and Google's search quality is hard to beat.

Multimodal Agents: Process Images, Video, Audio

Gemini is natively multimodal. Your agent can process images, videos, and audio without separate APIs:

// Analyze an image
const imageData = fs.readFileSync('./screenshot.png');
const result = await model.generateContent([
  'Analyze this dashboard screenshot. What metrics are trending down?',
  {
    inlineData: {
      mimeType: 'image/png',
      data: imageData.toString('base64'),
    },
  },
]);

// Analyze a video (upload to File API first).
// GoogleAIFileManager lives in the SDK's server entry point:
// import { GoogleAIFileManager } from '@google/generative-ai/server';
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY);
const uploadResult = await fileManager.uploadFile('./meeting.mp4', {
  mimeType: 'video/mp4',
});

const videoResult = await model.generateContent([
  'Summarize the key decisions from this meeting recording. List action items.',
  {
    fileData: {
      mimeType: uploadResult.file.mimeType,
      fileUri: uploadResult.file.uri,
    },
  },
]);

// Process audio
const audioResult = await model.generateContent([
  'Transcribe this voice memo and extract any task items mentioned.',
  {
    inlineData: {
      mimeType: 'audio/mp3',
      data: fs.readFileSync('./memo.mp3').toString('base64'),
    },
  },
]);
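
One practical wrinkle: the File API processes video asynchronously, so a freshly uploaded file may not be usable immediately. A polling sketch, assuming the FileState enum the SDK exports from its server entry point:

import { FileState } from '@google/generative-ai/server';

async function waitForActive(fileManager, file) {
  // Poll until the File API finishes processing the upload.
  while (file.state === FileState.PROCESSING) {
    await new Promise(resolve => setTimeout(resolve, 5000));
    file = await fileManager.getFile(file.name);
  }
  if (file.state !== FileState.ACTIVE) {
    throw new Error(`File ${file.name} failed processing`);
  }
  return file;
}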

Use cases for multimodal agents:

- Monitor dashboards and screenshots for metrics trending the wrong way
- Summarize meeting recordings and extract action items
- Transcribe voice memos and capture the tasks mentioned in them

The 2M Context Window: What It Enables

A 2 million token context window changes what's possible for agents:

- Load an entire codebase or documentation set directly instead of building a RAG pipeline
- Keep long-running conversation history without aggressive summarization
- Analyze book-length documents, transcripts, or logs in a single pass

Reality check: Just because you can stuff 2M tokens doesn't mean you should. Larger contexts mean higher costs, slower responses, and potentially less focused outputs. Use the window strategically — load what's relevant, not everything.
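
One way to use the window strategically: score workspace files against the task and load only the top matches. A naive keyword-overlap sketch, assuming a flat directory of text files (swap in embeddings for anything serious):

import fs from 'fs';
import path from 'path';

function selectRelevantFiles(dir, task, limit = 20) {
  const keywords = task.toLowerCase().split(/\W+/).filter(w => w.length > 3);
  return fs.readdirSync(dir)
    .filter(name => fs.statSync(path.join(dir, name)).isFile())
    .map(name => {
      const text = fs.readFileSync(path.join(dir, name), 'utf8');
      // Naive relevance: how many task keywords appear in the file.
      const score = keywords.filter(k => text.toLowerCase().includes(k)).length;
      return { name, text, score };
    })
    .filter(f => f.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}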

Making It Autonomous

import cron from 'node-cron';

const memory = new AgentMemory();

cron.schedule('0 * * * *', async () => {
  const recentContext = memory.getContext(2);
  const tasks = fs.readFileSync('./tasks.md', 'utf8');

  const model = genAI.getGenerativeModel({
    model: 'gemini-2.0-flash',
    systemInstruction: `${systemPrompt}\n\n## Recent Memory\n${recentContext}`,
    tools: [
      { functionDeclarations: getToolDeclarations() },
      { googleSearchRetrieval: { dynamicRetrievalConfig: { mode: 'MODE_DYNAMIC' } } },
    ],
  });

  try {
    const chat = model.startChat();
    const response = await chat.sendMessage(`
      Autonomous run. Time: ${new Date().toISOString()}
      Pending tasks:\n${tasks}
      Pick the highest priority task and execute it.
    `);

    const result = response.response.text();
    memory.log(`Auto-run: ${result.substring(0, 200)}`);
  } catch (error) {
    memory.log(`ERROR: ${error.message}`);
  }
});
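
The AgentMemory class used above isn't defined in this guide. A minimal sketch that satisfies both calls, assuming a flat append-only log file and reading getContext's argument as "last N hours":

import fs from 'fs';

class AgentMemory {
  constructor(file = './memory.log') {
    this.file = file;
  }

  // Append a timestamped entry.
  log(message) {
    fs.appendFileSync(this.file, `${new Date().toISOString()} ${message}\n`);
  }

  // Return entries from the last `hours` hours as one string.
  getContext(hours) {
    if (!fs.existsSync(this.file)) return '';
    const cutoff = Date.now() - hours * 60 * 60 * 1000;
    return fs.readFileSync(this.file, 'utf8')
      .split('\n')
      .filter(line => {
        const ts = Date.parse(line.slice(0, 24)); // ISO timestamp prefix
        return !Number.isNaN(ts) && ts >= cutoff;
      })
      .join('\n');
  }
}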

Gemini vs Claude vs GPT-4: The Full Picture

| Feature | Gemini 2.0 | Claude | GPT-4o |
|---|---|---|---|
| Context window | 2M tokens | 200K tokens | 128K tokens |
| Built-in search | Google Search grounding | No (needs separate tool) | Bing (ChatGPT only) |
| Multimodal | Text + image + video + audio | Text + image | Text + image + audio |
| Tool calling reliability | Good | Excellent | Excellent |
| System prompt adherence | Good | Excellent | Good |
| Free tier | Very generous | None (API) | Limited |
| Cost (per 1M input) | $0.075 (Flash) | $3-15 | $2.50 |
| Best for | Large context, multimodal, budget | Complex reasoning, personality | General purpose, structured output |

When to Choose Gemini

Gemini is the right choice when:

- You need a huge context window (full codebases, long transcripts, large document sets)
- Your agent processes images, video, or audio
- Built-in Google Search grounding can replace a separate search integration
- Budget is a constraint: Flash is roughly an order of magnitude cheaper than the alternatives

Gemini is not the best choice when:

- Tool-calling reliability is critical; Claude and GPT-4o are more dependable here
- You need strict system prompt adherence over long conversations
- Personality and voice are the product (Claude excels at this)

Cost Breakdown

| Model | Input (per 1M) | Output (per 1M) | Best use |
|---|---|---|---|
| Gemini 2.0 Flash | $0.075 | $0.30 | Most agent tasks — incredible value |
| Gemini 1.5 Pro | $1.25 | $5.00 | Complex reasoning, long context |
| Gemini 2.0 Flash (free tier) | Free | Free | Development, prototyping, low-volume |

Monthly cost example: an agent running hourly (720 runs/month) with Gemini Flash, processing ~50K tokens per run, consumes about 36M tokens a month. At $0.075 per 1M input tokens, that's approximately $2.70/month. That's not a typo. The same 36M tokens through GPT-4o at $2.50 per 1M input comes to roughly $90/month.

Gemini-Specific Gotchas

  1. Function call arguments can be weird. Gemini occasionally wraps arguments in unexpected structures or omits required fields. Validate aggressively.
  2. Grounding isn't always triggered. Dynamic grounding uses a threshold to decide when to search. If your agent isn't searching when it should, lower the threshold.
  3. The free tier has strict rate limits. 15 RPM is fine for development but will fail under any real load. Budget for the paid tier in production.
  4. Safety filters are aggressive. Gemini's safety filters can block legitimate requests. You can adjust them (see the sketch after this list), but you can't fully disable them.
  5. Video processing is slow. Uploading and processing video files takes time. Don't expect real-time video analysis.
  6. System instructions have less impact. Compared to Claude, Gemini tends to drift from system instructions in long conversations. Reinforce key rules.
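
For gotcha 4: the SDK exposes per-category thresholds through safetySettings. A sketch that relaxes two categories to block only high-probability harms (HarmCategory and HarmBlockThreshold are SDK exports; tune the categories to your use case):

import { HarmCategory, HarmBlockThreshold } from '@google/generative-ai';

const relaxedModel = genAI.getGenerativeModel({
  model: 'gemini-2.0-flash',
  systemInstruction: systemPrompt,
  // Block only content the model rates as high-probability harmful.
  safetySettings: [
    { category: HarmCategory.HARM_CATEGORY_HARASSMENT, threshold: HarmBlockThreshold.BLOCK_ONLY_HIGH },
    { category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold: HarmBlockThreshold.BLOCK_ONLY_HIGH },
  ],
});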

Complete Your Agent Setup

Every AI agent — Claude, OpenAI, or Gemini — needs a strong personality foundation. Generate your SOUL.md in 5 minutes.

Generate Your SOUL.md

The "Big 3" Strategy

The smartest agent builders don't pick one model — they use the right model for each task:

- Gemini for large-context analysis, multimodal input, and high-volume, budget-sensitive work
- Claude for complex reasoning and personality-driven interactions
- GPT-4o for general-purpose tasks and structured output

Build your agent framework to support multiple models, then route tasks to the most cost-effective option, as in the sketch below.
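
The routing layer can be as simple as a lookup keyed on task type. A sketch; callGemini, callClaude, and callOpenAI are hypothetical wrappers around each provider's SDK:

// Hypothetical provider wrappers: each takes (model, prompt) and returns text.
const routes = {
  'long-context': (prompt) => callGemini('gemini-2.0-flash', prompt),
  'multimodal': (prompt) => callGemini('gemini-2.0-flash', prompt),
  'reasoning': (prompt) => callClaude('claude-sonnet', prompt),
  'structured-output': (prompt) => callOpenAI('gpt-4o', prompt),
};

async function routeTask(taskType, prompt) {
  // Fall back to the cheapest option when the task type is unrecognized.
  const handler = routes[taskType] ?? routes['long-context'];
  return handler(prompt);
}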

Go deeper with the AI Employee Playbook

The complete system: 3-file framework, memory architecture, autonomy levels, and 15 production templates.

Get the Playbook — €29