Technical Guide 11 min read · Updated June 2026

robots.txt for AI Bots: The Complete 2025 Guide

Your robots.txt controls which AI engines can cite you. Most sites have it misconfigured — here's exactly which user agents matter, what the common mistakes look like, and how to fix them.

Why robots.txt matters more now

robots.txt is one of the oldest standards on the web. It tells crawlers what they can and can't access. For most of the web's history, the meaningful bots were Googlebot, Bingbot, and a handful of others. Webmasters who never configured robots.txt carefully were fine, because the default is permissive.

That changed in 2023 when OpenAI, Anthropic, Perplexity, and others started deploying crawlers at scale to feed their AI systems. The key issue: these aren't just training crawlers building a one-time knowledge base. They're search and retrieval crawlers that index content continuously to power real-time citations. Blocking them has an immediate cost: your content doesn't appear when AI engines answer relevant questions.

The Robots Exclusion Protocol gives you per-bot control. But that control only works in your favor if you use it intentionally.

Training crawlers vs. search/retrieval crawlers

This distinction is the single most important thing to understand about AI bots and robots.txt.

Training crawlers fetch content to build the AI model's underlying knowledge. When you block a training crawler, you're limiting what the model "knows" from pre-training. It affects the model's general knowledge, not whether it can find and cite your content in real time.

Search/retrieval crawlers index content continuously so AI engines can retrieve it when answering live queries — the same role Googlebot plays for Google Search. When you block a retrieval crawler, you disappear from that AI engine's citation pool entirely. No matter how good your content is, that AI system cannot cite you.

Critical: Blocking GPTBot (OpenAI's training crawler) is a common and defensible choice. Blocking OAI-SearchBot (OpenAI's retrieval crawler) means ChatGPT cannot cite you. These are two different bots. Many sites that meant to block only GPTBot have accidentally blocked OAI-SearchBot as well.

The complete AI bot directory

The major AI bots you need to know for robots.txt configuration, and what each one does:

User-agent	Platform	Type	Blocks citations if blocked?
OAI-SearchBot	ChatGPT	Search / Retrieval	Yes — ChatGPT cannot cite you
GPTBot	OpenAI training	Training	No direct citation impact
ChatGPT-User	ChatGPT browse	Browsing (on-demand)	Partial — affects browsing plugin
Claude-SearchBot	Claude citations	Search / Retrieval	Yes — Claude cannot cite you
ClaudeBot	Anthropic training	Training	No direct citation impact
PerplexityBot	Perplexity	Search / Retrieval	Yes — Perplexity cannot cite you
Googlebot	Gemini / Google AI	Search / Retrieval	Yes — Gemini answers use this index
Google-Extended	Gemini training (formerly Bard)	Training	No direct citation impact
Bingbot	Copilot	Search / Retrieval	Yes — Copilot cannot cite you
cohere-ai	Cohere	Training / RAG	Depends on implementation
anthropic-ai	Anthropic research	Training / Research	No direct citation impact
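
As a rough illustration of the training vs. retrieval distinction, the table can be encoded as a lookup that flags which blocked bots actually cost you citations. This is a minimal sketch: the type labels are copied from the table above, and the names BOT_TYPES and citation_losses are hypothetical, chosen only for this example.

```python
# Bot types copied from the table above; "retrieval" bots power live citations.
BOT_TYPES = {
    "OAI-SearchBot": "retrieval",
    "GPTBot": "training",
    "ChatGPT-User": "browsing",
    "Claude-SearchBot": "retrieval",
    "ClaudeBot": "training",
    "PerplexityBot": "retrieval",
    "Googlebot": "retrieval",
    "Google-Extended": "training",
    "Bingbot": "retrieval",
    "cohere-ai": "training/rag",
    "anthropic-ai": "training",
}


def citation_losses(blocked_bots):
    """Return the blocked bots whose blocking removes you from a citation pool."""
    return sorted(bot for bot in blocked_bots if BOT_TYPES.get(bot) == "retrieval")


# A site that meant to block only training crawlers but also caught OAI-SearchBot.
print(citation_losses({"GPTBot", "ClaudeBot", "OAI-SearchBot"}))
```

Only OAI-SearchBot is reported here, because GPTBot and ClaudeBot are training crawlers with no direct citation impact.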

Reading your current robots.txt

Your robots.txt lives at yourdomain.com/robots.txt. It's a plain text file. The syntax is simple:

# Example robots.txt structure

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

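User-agent specifies which bot the rules that follow apply to. Disallow: / blocks everything. Allow: / explicitly allows everything (useful when you need to override a broader block). Under the current standard (RFC 9309), the order of rules does not decide the outcome: a bot follows the group that names it, and within that group the rule with the longest matching path wins, with Allow winning a tie. Some older parsers still apply the first matching rule instead, so don't rely on ordering either way.

The wildcard User-agent: * matches all bots that don't have a group of their own. A bot that does have its own group ignores the wildcard group entirely. The wildcard is commonly used by SEO plugins to generate blanket rules, and it's the source of the most common AI visibility problem on the web.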

The most common mistake: wildcard Disallow

Here's what a dangerous robots.txt looks like:

# Auto-generated by SEO plugin v3.2
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php

# Block AI training bots (added by plugin)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /

The second wildcard group at the end (User-agent: * followed by Disallow: /) blocks every crawler that doesn't have a group of its own. Parsers that follow RFC 9309 merge it with the first wildcard group, so the blanket Disallow: / applies alongside the two WordPress paths. In this file, OAI-SearchBot has no group of its own, so it falls through to the wildcard and gets blocked. Same for Claude-SearchBot. Same for PerplexityBot. The site owner intended to block training crawlers; they ended up blocking all AI citations.
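
To make the fall-through concrete, here is a minimal sketch of an audit that parses the file above, merges repeated groups, and reports which group each bot ends up using. It assumes exact, case-insensitive user-agent matching and only detects a blanket Disallow: / with no Allow rules; the function names parse_groups and audit are hypothetical, not part of any library.

```python
DANGEROUS = """\
# Auto-generated by SEO plugin v3.2
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php

# Block AI training bots (added by plugin)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /
"""


def parse_groups(text):
    """Map each lowercased user agent to its rules, merging repeated groups."""
    groups, agents, in_rules = {}, [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule line closed the previous group
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups


def audit(text, bots):
    """Report, per bot, which group applies and whether the whole site is blocked."""
    groups = parse_groups(text)
    report = {}
    for bot in bots:
        source = "own group" if bot.lower() in groups else "wildcard fallback"
        rules = groups.get(bot.lower(), groups.get("*", []))
        # Simplification: only a blanket "Disallow: /" with no Allow rules counts.
        blocked = ("disallow", "/") in rules and not any(f == "allow" for f, _ in rules)
        report[bot] = (source, "blocked" if blocked else "allowed")
    return report


bots = ["GPTBot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
for bot, verdict in audit(DANGEROUS, bots).items():
    print(bot, verdict)
```

GPTBot is blocked by its own group, as intended. The three retrieval bots are all blocked through the wildcard fallback, which is the accidental part.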

What a correct configuration looks like

Here's a robots.txt that blocks training crawlers (a legitimate choice) while keeping all search/retrieval crawlers open:

# Allow all standard search crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Allow AI search/retrieval crawlers (these power citations)
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training-only crawlers if you prefer
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everything else: allow by default
User-agent: *
Allow: /

If you want to allow everything — training crawlers included — the simplest valid robots.txt is:

User-agent: *
Allow: /

Or just an empty file. The robots.txt protocol is permissive by default: if a bot has no matching rule, it's allowed.
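
The longer configuration above can be sanity-checked with Python's standard-library urllib.robotparser. This is a sketch rather than a definitive test: the URL is a placeholder, and the stdlib parser is simpler than RFC 9309 (to our understanding it applies the first matching rule in a group and keeps only the first User-agent: * group), so it is best trusted on files like this one, where each group has a single rule.

```python
from urllib.robotparser import RobotFileParser

RECOMMENDED = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(RECOMMENDED.splitlines())


def verdicts(bots, url="https://example.com/pricing"):
    """Return {bot: True/False} for whether each bot may fetch the placeholder URL."""
    return {bot: parser.can_fetch(bot, url) for bot in bots}


for bot, allowed in verdicts(["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
                              "GPTBot", "ClaudeBot", "Google-Extended"]).items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")

# An empty robots.txt allows everything.
empty = RobotFileParser()
empty.parse([])
print("empty file, GPTBot:", empty.can_fetch("GPTBot", "https://example.com/"))
```

The three retrieval bots come back allowed and the three training bots come back blocked, which matches the intent of the configuration.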

Blocking specific paths, not entire crawlers

You can block AI crawlers from specific sections while allowing the rest:

# Allow OAI-SearchBot everywhere except /private/ and /internal/
User-agent: OAI-SearchBot
Allow: /
Disallow: /private/
Disallow: /internal/

This is useful if you have sections with customer data, internal documentation, or premium content you don't want indexed.
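
The reason Disallow: /private/ still takes effect next to Allow: / is the longest-match rule from RFC 9309. Below is a minimal sketch of that rule for a single group, assuming plain prefix matching only (no * or $ wildcards); is_allowed is a hypothetical helper written for this example.

```python
def is_allowed(rules, path):
    """Longest matching rule wins; Allow wins a tie; no match means allowed.

    `rules` is a list of (directive, path_prefix) pairs for ONE user-agent group.
    Simplification: plain prefix matching, no `*` or `$` wildcard support.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values and non-matching prefixes are ignored
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


# The OAI-SearchBot group from the example above.
OAI_SEARCHBOT_RULES = [
    ("Allow", "/"),
    ("Disallow", "/private/"),
    ("Disallow", "/internal/"),
]

for path in ["/blog/post", "/private/report.pdf", "/internal/wiki"]:
    print(path, "allowed" if is_allowed(OAI_SEARCHBOT_RULES, path) else "blocked")
```

/blog/post matches only Allow: / and is allowed. The other two paths also match a longer Disallow prefix, so they are blocked regardless of the order the rules appear in.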

The Crawl-delay directive

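Some robots.txt files include Crawl-delay: 10 to ask crawlers to wait (conventionally 10 seconds) between requests. Crawl-delay is a non-standard extension that is not part of RFC 9309, and support varies: Bingbot honors it, Googlebot ignores it, and AI crawlers differ, so check each vendor's documentation. Where it is supported, Crawl-delay is a reasonable lever for reducing crawl load, and it's separate from access control. For bots that ignore it, rate limiting has to happen at the server or CDN.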
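
To see what delay a given bot would read from your file, Python's urllib.robotparser exposes a crawl_delay method. This is a small sketch with a made-up file; the choice of Bingbot and the 10-second value are assumptions for illustration, and whether a real crawler obeys the value is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a delay for one bot, an unrelated path rule for everyone else.
ROBOTS = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# crawl_delay returns the delay for the bot's group, or None if none is set.
print("Bingbot:", parser.crawl_delay("Bingbot"))
print("GPTBot:", parser.crawl_delay("GPTBot"))

# The delay does not restrict access: Bingbot can still fetch ordinary pages.
print("Bingbot may fetch /blog/:", parser.can_fetch("Bingbot", "/blog/"))
```

The delay is reported only for the group that declares it, and it has no effect on the allow/block verdict.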

What robots.txt doesn't control

robots.txt controls crawler access. It doesn't control:

How your content is used in training. Even if you allow a training crawler, whether it uses your content depends on the AI company's policies and filtering decisions.
Server-level blocks. WAF rules, CDN configurations, and IP-based rate limiting can block crawlers even when robots.txt permits them. This is a separate GEO signal we check at letthebots.in.
Content extraction. Allowing crawl access doesn't mean the crawler can read your content. JavaScript-rendered pages and login gates are separate problems.

How to test your robots.txt right now

Paste your domain into letthebots.in. The scan reads your robots.txt, evaluates every AI crawler user agent against it, and returns a per-bot verdict: blocked, allowed, or conditionally allowed. It's the fastest way to see what each AI engine actually sees when it knocks on your door.

For manual testing, several AI companies publish documentation with their crawler IP ranges. Cross-reference your server logs against those ranges to confirm which bots are actually reaching you.
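
A first pass over the logs can be as simple as counting requests per bot and HTTP status. The sketch below assumes the common "combined" access-log format; the sample lines, IP addresses, and user-agent strings are invented for illustration. It matches on the user-agent string only, which can be spoofed, so it complements the IP-range check rather than replacing it.

```python
import re
from collections import Counter

AI_BOTS = ["OAI-SearchBot", "GPTBot", "ChatGPT-User",
           "Claude-SearchBot", "ClaudeBot", "PerplexityBot"]
CANONICAL = {bot.lower(): bot for bot in AI_BOTS}
BOT_PATTERN = re.compile("|".join(re.escape(bot) for bot in AI_BOTS), re.IGNORECASE)
STATUS_PATTERN = re.compile(r'" (\d{3}) ')  # status code right after the quoted request


def count_bot_hits(log_lines):
    """Count requests per (bot name, HTTP status) across access-log lines."""
    hits = Counter()
    for line in log_lines:
        bot = BOT_PATTERN.search(line)
        status = STATUS_PATTERN.search(line)
        if bot and status:
            hits[(CANONICAL[bot.group(0).lower()], int(status.group(1)))] += 1
    return hits


# Invented sample lines (documentation-range IPs, made-up user-agent strings).
SAMPLE_LOG = [
    '192.0.2.10 - - [12/Jun/2026:10:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.11 - - [12/Jun/2026:10:00:05 +0000] "GET /guide HTTP/1.1" 403 310 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [12/Jun/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

for (bot, status), count in sorted(count_bot_hits(SAMPLE_LOG).items()):
    print(bot, status, count)
```

A bot that robots.txt allows but that shows up with 403 responses is a hint that a server-level block (WAF, CDN, or rate limiter) is turning it away.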
