How to Build an AI Agent with Google Gemini: Complete Guide
Google Gemini is the dark horse of the AI agent world. While Claude and GPT-4 get the headlines, Gemini quietly ships features that matter for agents: native grounding with Google Search, a 2 million token context window, and the most generous free tier in the industry.
This guide shows you how to build a production agent with Gemini. We'll cover the Gemini API, function calling, grounding, multimodal input, and the patterns that work in practice. This is the third part of our "Build an Agent" trilogy — read the Claude guide and OpenAI guide for comparison.
Why Gemini for AI Agents?
Gemini has unique advantages that Claude and GPT-4 can't match:
- 2 million token context window — Gemini 1.5 Pro handles 2M tokens. That's 10x Claude's 200K and 16x GPT-4's 128K. Entire codebases, full databases, long video transcripts — no chunking needed.
- Google Search grounding — Gemini can ground its responses in real-time Google Search results. No separate search API, no RAG pipeline. Just tell it to search and it does.
- True multimodal — process text, images, video, and audio natively in the same request. No separate vision API. Upload a video and ask questions about it.
- Free tier — 15 requests per minute (RPM), 1 million tokens per minute (TPM), and 1,500 requests per day (RPD) at no cost. No other major model offers this. Perfect for development and low-volume agents.
- Vertex AI integration — if you're already on Google Cloud, Gemini plugs directly into your infrastructure.
Getting Started: API Setup
Option 1: Google AI Studio (simpler)
```shell
npm install @google/generative-ai
```

```javascript
import { GoogleGenerativeAI } from '@google/generative-ai';
import fs from 'fs';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

// Load the 3-file framework
const soul = fs.readFileSync('./SOUL.md', 'utf8');
const agents = fs.readFileSync('./AGENTS.md', 'utf8');
const user = fs.readFileSync('./USER.md', 'utf8');
const systemPrompt = `${soul}\n\n${agents}\n\n${user}`;

const model = genAI.getGenerativeModel({
  model: 'gemini-2.0-flash',
  systemInstruction: systemPrompt,
});
```
Option 2: Vertex AI (enterprise)
```javascript
import { VertexAI } from '@google-cloud/vertexai';

const vertex = new VertexAI({
  project: process.env.GCP_PROJECT_ID,
  location: 'us-central1',
});

const model = vertex.getGenerativeModel({
  model: 'gemini-2.0-flash',
  systemInstruction: systemPrompt,
});
```
The Agent Loop
Gemini's chat interface handles conversation history for you. The agent loop pattern:
```javascript
async function agentLoop(userMessage) {
  const chat = model.startChat({
    tools: [{ functionDeclarations: getToolDeclarations() }],
  });

  let response = await chat.sendMessage(userMessage);

  while (true) {
    const parts = response.response.candidates[0].content.parts;

    // Check for function calls
    const functionCalls = parts.filter(p => p.functionCall);
    if (functionCalls.length === 0) {
      // No function calls — return the text response
      const textPart = parts.find(p => p.text);
      return textPart?.text || '';
    }

    // Execute each function call
    const functionResponses = [];
    for (const part of functionCalls) {
      const { name, args } = part.functionCall;
      const result = await executeTool(name, args);
      functionResponses.push({
        functionResponse: {
          name,
          response: result,
        },
      });
    }

    // Send results back to Gemini
    response = await chat.sendMessage(functionResponses);
  }
}
```
Define Tools
```javascript
function getToolDeclarations() {
  return [
    {
      name: 'web_search',
      description: 'Search the web for current information, news, and data.',
      parameters: {
        type: 'object',
        properties: {
          query: {
            type: 'string',
            description: 'Search query — be specific for better results',
          },
        },
        required: ['query'],
      },
    },
    {
      name: 'read_file',
      description: 'Read a file from the workspace.',
      parameters: {
        type: 'object',
        properties: {
          path: {
            type: 'string',
            description: 'File path relative to workspace root',
          },
        },
        required: ['path'],
      },
    },
    {
      name: 'write_file',
      description: 'Write content to a file. Creates it if it does not exist.',
      parameters: {
        type: 'object',
        properties: {
          path: { type: 'string', description: 'File path' },
          content: { type: 'string', description: 'Content to write' },
        },
        required: ['path', 'content'],
      },
    },
    {
      name: 'analyze_data',
      description: 'Analyze structured data (CSV, JSON) and return insights.',
      parameters: {
        type: 'object',
        properties: {
          data_path: { type: 'string', description: 'Path to the data file' },
          question: { type: 'string', description: 'What to analyze' },
        },
        required: ['data_path', 'question'],
      },
    },
  ];
}
```
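The agent loop calls `executeTool(name, args)`, which is yours to implement. A minimal sketch of a dispatcher is below; the handler bodies are hypothetical placeholders you would swap for real implementations. It validates required arguments before executing, because Gemini occasionally omits fields, and it returns errors as data so the model can see and recover from them:

```javascript
// Hypothetical tool handlers — swap in your real implementations.
const toolHandlers = {
  web_search: async ({ query }) => ({ results: `(search results for "${query}")` }),
  read_file: async ({ path }) => ({ content: `(contents of ${path})` }),
  write_file: async ({ path, content }) => ({ written: path, bytes: content.length }),
  analyze_data: async ({ data_path, question }) => ({
    answer: `(analysis of ${data_path}: ${question})`,
  }),
};

// Required arguments per tool, mirroring the declarations above.
const requiredArgs = {
  web_search: ['query'],
  read_file: ['path'],
  write_file: ['path', 'content'],
  analyze_data: ['data_path', 'question'],
};

async function executeTool(name, args) {
  const handler = toolHandlers[name];
  if (!handler) return { error: `Unknown tool: ${name}` };

  // Gemini occasionally omits required fields — validate before executing.
  const missing = (requiredArgs[name] || []).filter((k) => args?.[k] === undefined);
  if (missing.length > 0) {
    return { error: `Missing required arguments: ${missing.join(', ')}` };
  }

  try {
    return await handler(args);
  } catch (err) {
    // Return errors as data so the model can see them and retry.
    return { error: err.message };
  }
}
```

Returning `{ error: ... }` instead of throwing keeps the loop alive: the model reads the error in the function response and can correct its own call.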
Gemini's Killer Feature: Grounding
Grounding lets Gemini search Google before answering. No separate search API needed, no RAG pipeline, no vector database. Just enable it:
```javascript
// Gemini 2.0 models use "Search as a tool": just add googleSearch.
// (On Gemini 1.5 models, the equivalent is a googleSearchRetrieval tool
// with a dynamicRetrievalConfig and a tunable dynamicThreshold.)
const modelWithGrounding = genAI.getGenerativeModel({
  model: 'gemini-2.0-flash',
  systemInstruction: systemPrompt,
  tools: [{ googleSearch: {} }],
});

// Gemini automatically searches when it needs current information
const result = await modelWithGrounding.generateContent(
  'What are the latest developments in AI agent frameworks this week?'
);

// Response includes grounding metadata
const groundingMeta = result.response.candidates[0].groundingMetadata;
if (groundingMeta?.searchEntryPoint) {
  console.log('Sources:', groundingMeta.webSearchQueries);
}
```
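The grounding metadata also carries the source links Gemini used, which you can surface as citations. A small helper to pull them out; the field names (`webSearchQueries`, `groundingChunks[].web`) reflect the response shape in recent API versions, so verify them against the responses you actually receive:

```javascript
// Pull a readable source list out of grounding metadata.
// Field names are based on the current grounding response shape —
// check them against real responses before relying on this.
function extractSources(groundingMetadata) {
  if (!groundingMetadata) return { queries: [], sources: [] };

  const queries = groundingMetadata.webSearchQueries || [];
  const sources = (groundingMetadata.groundingChunks || [])
    .map((chunk) => chunk.web)      // each web chunk has { uri, title }
    .filter(Boolean)                // skip non-web chunks
    .map(({ uri, title }) => ({ uri, title }));

  return { queries, sources };
}
```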
Multimodal Agents: Process Images, Video, Audio
Gemini is natively multimodal. Your agent can process images, videos, and audio without separate APIs:
```javascript
// Analyze an image
const imageData = fs.readFileSync('./screenshot.png');
const result = await model.generateContent([
  'Analyze this dashboard screenshot. What metrics are trending down?',
  {
    inlineData: {
      mimeType: 'image/png',
      data: imageData.toString('base64'),
    },
  },
]);
```

```javascript
// Analyze a video (upload to the File API first)
import { GoogleAIFileManager } from '@google/generative-ai/server';

const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY);
const uploadResult = await fileManager.uploadFile('./meeting.mp4', {
  mimeType: 'video/mp4',
});
// Large videos are processed asynchronously — poll fileManager.getFile()
// until the file's state is ACTIVE before referencing it in a prompt.

const videoResult = await model.generateContent([
  'Summarize the key decisions from this meeting recording. List action items.',
  {
    fileData: {
      mimeType: uploadResult.file.mimeType,
      fileUri: uploadResult.file.uri,
    },
  },
]);
```

```javascript
// Process audio
const audioResult = await model.generateContent([
  'Transcribe this voice memo and extract any task items mentioned.',
  {
    inlineData: {
      mimeType: 'audio/mp3',
      data: fs.readFileSync('./memo.mp3').toString('base64'),
    },
  },
]);
```
Use cases for multimodal agents:
- Monitor dashboards — take periodic screenshots, have Gemini analyze trends and anomalies
- Process meeting recordings — auto-generate summaries, action items, and follow-ups
- Analyze documents — OCR + understanding in one step, no preprocessing
- Quality control — inspect product images, construction sites, manufacturing output
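When an agent handles arbitrary user files, you need the right `mimeType` for each request. A small lookup helper; the extension list here is a convenience subset of formats Gemini documents as supported, not an exhaustive mapping:

```javascript
// Map common file extensions to MIME types accepted by Gemini.
// This is a convenience subset — extend it for the formats you handle.
const MIME_TYPES = {
  png: 'image/png', jpg: 'image/jpeg', jpeg: 'image/jpeg', webp: 'image/webp',
  mp4: 'video/mp4', mov: 'video/quicktime',
  mp3: 'audio/mp3', wav: 'audio/wav', flac: 'audio/flac',
  pdf: 'application/pdf',
};

function mimeTypeFor(filePath) {
  const ext = filePath.split('.').pop().toLowerCase();
  const mime = MIME_TYPES[ext];
  if (!mime) throw new Error(`Unsupported file type: .${ext}`);
  return mime;
}
```

Failing fast on unknown extensions is deliberate: sending a wrong `mimeType` produces confusing model errors much later in the pipeline.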
The 2M Context Window: What It Enables
A 2 million token context window changes what's possible for agents:
- Entire codebases in context — feed your whole project (not just relevant files) and ask questions about architecture, bugs, or refactoring
- Full document analysis — 500-page reports, legal contracts, research papers — no chunking, no summarization, no information loss
- Extended conversation history — weeks of conversation history without summarization. The agent truly remembers everything
- Multi-source synthesis — load multiple data sources simultaneously and ask Gemini to find patterns across them
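Even with 2M tokens, it pays to check whether your sources actually fit before sending them. A rough sketch using the common heuristic of about 4 characters per English token; for exact counts, the SDK's `model.countTokens()` is the authoritative source:

```javascript
// Rough token estimate: ~4 characters per token for English text.
// This is a heuristic, not the tokenizer — use countTokens() for exact numbers.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Will these sources fit in a context window, leaving room for output?
function fitsInContext(texts, contextWindow = 2_000_000, reservedForOutput = 8192) {
  const totalTokens = texts.reduce((sum, t) => sum + estimateTokens(t), 0);
  return { totalTokens, fits: totalTokens <= contextWindow - reservedForOutput };
}
```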
Making It Autonomous
```javascript
import cron from 'node-cron';

const memory = new AgentMemory();

// Run once at the top of every hour
cron.schedule('0 * * * *', async () => {
  const recentContext = memory.getContext(2);
  const tasks = fs.readFileSync('./tasks.md', 'utf8');

  const model = genAI.getGenerativeModel({
    model: 'gemini-2.0-flash',
    systemInstruction: `${systemPrompt}\n\n## Recent Memory\n${recentContext}`,
    tools: [
      { functionDeclarations: getToolDeclarations() },
      { googleSearch: {} },
    ],
  });

  try {
    const chat = model.startChat();
    const response = await chat.sendMessage(`
      Autonomous run. Time: ${new Date().toISOString()}
      Pending tasks:\n${tasks}
      Pick the highest priority task and execute it.
    `);
    const result = response.response.text();
    memory.log(`Auto-run: ${result.substring(0, 200)}`);
  } catch (error) {
    memory.log(`ERROR: ${error.message}`);
  }
});
```
Gemini vs Claude vs GPT-4: The Full Picture
| Feature | Gemini 2.0 | Claude | GPT-4o |
|---|---|---|---|
| Context window | 2M tokens | 200K tokens | 128K tokens |
| Built-in search | Google Search grounding | No (need separate tool) | Bing (ChatGPT only) |
| Multimodal | Text+image+video+audio | Text+image | Text+image+audio |
| Tool calling reliability | Good | Excellent | Excellent |
| System prompt adherence | Good | Excellent | Good |
| Free tier | Very generous | None (API) | Limited |
| Cost (per 1M input) | $0.075 (Flash) | $3-15 | $2.50 |
| Best for | Large context, multimodal, budget | Complex reasoning, personality | General purpose, structured output |
When to Choose Gemini
Gemini is the right choice when:
- You need to process large documents or entire codebases (2M context)
- Your agent needs real-time web search without building a separate integration
- You're processing video or audio (meeting recordings, surveillance, quality control)
- You're on a tight budget (Gemini Flash at $0.075/1M tokens is 33x cheaper than GPT-4o)
- You're already on Google Cloud and want native integration
Gemini is not the best choice when:
- You need rock-solid tool calling (Claude and GPT-4 are more reliable here)
- Agent personality and tone matter a lot (Claude's system prompt adherence is superior)
- You need extended thinking/reasoning (Claude and o1/o3 are better for multi-step logic)
Cost Breakdown
| Model | Input (per 1M) | Output (per 1M) | Best use |
|---|---|---|---|
| Gemini 2.0 Flash | $0.075 | $0.30 | Most agent tasks — incredible value |
| Gemini 1.5 Pro | $1.25 | $5.00 | Complex reasoning, long context |
| Gemini 2.0 Flash (free) | Free | Free | Development, prototyping, low-volume |
Monthly cost example: An agent running hourly with Gemini Flash, processing ~50K tokens per run: approximately $2.70/month. That's not a typo. Compare to GPT-4o at ~$90/month for the same workload.
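The arithmetic behind that example, as a reusable back-of-envelope calculator. It bills every token at one flat rate, which is a simplification since output tokens cost more than input:

```javascript
// Back-of-envelope monthly cost for a scheduled agent.
// Simplification: bills all tokens at one rate, though output tokens cost more.
function monthlyCost(runsPerDay, tokensPerRun, pricePerMillionTokens) {
  const tokensPerMonth = runsPerDay * 30 * tokensPerRun;
  return (tokensPerMonth / 1_000_000) * pricePerMillionTokens;
}

monthlyCost(24, 50_000, 0.075); // Gemini Flash: ≈ $2.70/month
monthlyCost(24, 50_000, 2.5);   // GPT-4o: ≈ $90/month
```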
Gemini-Specific Gotchas
- Function call arguments can be weird. Gemini occasionally wraps arguments in unexpected structures or omits required fields. Validate aggressively.
- Grounding isn't always triggered. On Gemini 1.5 models, dynamic grounding uses a `dynamicThreshold` to decide when to search; if your agent isn't searching when it should, lower it. On Gemini 2.0's search-as-a-tool, the model decides when to search, so prompt it explicitly when freshness matters.
- The free tier has strict rate limits. 15 RPM is fine for development but will fail under any real load. Budget for the paid tier in production.
- Safety filters are aggressive. Gemini's safety filters can block legitimate requests. You can adjust them, but you can't fully disable them.
- Video processing is slow. Uploading and processing video files takes time. Don't expect real-time video analysis.
- System instructions have less impact. Compared to Claude, Gemini tends to drift from system instructions in long conversations. Reinforce key rules.
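The safety-filter gotcha above is partially addressable through `safetySettings`. A sketch that relaxes two categories to block only high-severity content; the `HarmCategory` and `HarmBlockThreshold` enums are part of the `@google/generative-ai` SDK, and note again that filters can be relaxed but not fully disabled:

```javascript
import {
  GoogleGenerativeAI,
  HarmCategory,
  HarmBlockThreshold,
} from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

const model = genAI.getGenerativeModel({
  model: 'gemini-2.0-flash',
  // Relax (not disable) filters that were blocking legitimate requests.
  safetySettings: [
    {
      category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
      threshold: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    },
    {
      category: HarmCategory.HARM_CATEGORY_HARASSMENT,
      threshold: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    },
  ],
});
```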
Complete Your Agent Setup
Every AI agent — Claude, OpenAI, or Gemini — needs a strong personality foundation. Generate your SOUL.md in 5 minutes.
The "Big 3" Strategy
The smartest agent builders don't pick one model — they use the right model for each task:
- Gemini Flash for high-volume, cost-sensitive tasks (data processing, classification, simple queries)
- GPT-4o for structured outputs, general-purpose tasks, and when you need the OpenAI ecosystem
- Claude for complex reasoning, long documents, and tasks where personality and instruction-following matter
Build your agent framework to support multiple models, then route tasks to the most cost-effective option.
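A routing layer can be as simple as a lookup table. A sketch of the idea; the task-to-model mapping below is a suggestion based on the comparison table above, not a benchmark result, and the model ids are placeholders to replace with your providers' current names:

```javascript
// Route each task type to the cheapest model that handles it well.
// The mapping is a suggestion, not a benchmark; model ids are placeholders.
const MODEL_ROUTES = {
  classification: 'gemini-2.0-flash',
  data_processing: 'gemini-2.0-flash',
  structured_output: 'gpt-4o',
  complex_reasoning: 'claude-model-id', // placeholder — use your provider's current id
  long_document: 'gemini-1.5-pro',      // 2M context
};

function routeTask(taskType) {
  // Unknown task types fall through to the cheapest option.
  return MODEL_ROUTES[taskType] || 'gemini-2.0-flash';
}
```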
Go deeper with the AI Employee Playbook
The complete system: 3-file framework, memory architecture, autonomy levels, and 15 production templates.
Get the Playbook — €29