Your robots.txt controls which AI engines can cite you. Most sites have it misconfigured — here's exactly which user agents matter, what the common mistakes look like, and how to fix them.
robots.txt is one of the oldest standards on the web. It tells crawlers what they can and can't access. For most of the web's history, the meaningful bots were Googlebot, Bingbot, and a handful of others. Webmasters who didn't configure robots.txt carefully were fine. The default is permissive.
That changed in 2023 when OpenAI, Anthropic, Perplexity, and others started deploying crawlers at scale to feed their AI systems. The key issue: these aren't just training crawlers building a one-time knowledge base. They're search and retrieval crawlers that index content continuously to power real-time citations. Blocking them has an immediate cost: your content doesn't appear when AI engines answer relevant questions.
The Robots Exclusion Protocol gives you per-bot control. But that control only works in your favor if you use it intentionally.
The distinction between training crawlers and search/retrieval crawlers is the single most important thing to understand about AI bots and robots.txt.
Training crawlers fetch content to build the AI model's underlying knowledge. When you block a training crawler, you're limiting what the model "knows" from pre-training. It affects the model's general knowledge, not whether it can find and cite your content in real time.
Search/retrieval crawlers index content continuously so AI engines can retrieve it when answering live queries — the same role Googlebot plays for Google Search. When you block a retrieval crawler, you disappear from that AI engine's citation pool entirely. No matter how good your content is, that AI system cannot cite you.
Critical: Blocking GPTBot (OpenAI's training crawler) is a common and defensible choice. Blocking OAI-SearchBot (OpenAI's retrieval crawler) means ChatGPT cannot cite you. These are two different bots. Many sites that meant to block only GPTBot have accidentally blocked OAI-SearchBot as well.
The major AI bots you need to know for robots.txt configuration, and what each one does:
| User-agent | Platform | Type | Blocks citations if blocked? |
|---|---|---|---|
| OAI-SearchBot | ChatGPT | Search / Retrieval | Yes — ChatGPT cannot cite you |
| GPTBot | OpenAI training | Training | No direct citation impact |
| ChatGPT-User | ChatGPT browse | Browsing (on-demand) | Partial — affects browsing plugin |
| Claude-SearchBot | Claude citations | Search / Retrieval | Yes — Claude cannot cite you |
| ClaudeBot | Anthropic training | Training | No direct citation impact |
| PerplexityBot | Perplexity | Search / Retrieval | Yes — Perplexity cannot cite you |
| Googlebot | Gemini / Google AI | Search / Retrieval | Yes — Gemini answers use this index |
| Google-Extended | Gemini (formerly Bard) training | Training | No direct citation impact |
| Bingbot | Copilot | Search / Retrieval | Yes — Copilot cannot cite you |
| cohere-ai | Cohere | Training / RAG | Depends on implementation |
| anthropic-ai | Anthropic research | Training / Research | No direct citation impact |
Your robots.txt lives at yourdomain.com/robots.txt. It's a plain text file. The syntax is simple:
# Example robots.txt structure
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /
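User-agent specifies which bot the group of rules applies to. Disallow: / blocks everything. Allow: / explicitly allows everything (useful when you need to override a broader block). A crawler obeys the group whose User-agent line names it and falls back to User-agent: * only when no named group matches. Within a group, order does not matter under RFC 9309: the rule with the longest matching path wins, and Allow wins a tie.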
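One way to sanity-check a file like the example above is Python's standard-library urllib.robotparser. This is a minimal sketch: example.com is a placeholder URL, and this parser's matching is simpler than what production crawlers implement, so treat its verdicts as a first approximation.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder URL used only for illustration
url = "https://example.com/blog/post"
bots = ("GPTBot", "OAI-SearchBot", "PerplexityBot")
verdicts = {bot: parser.can_fetch(bot, url) for bot in bots}

for bot, allowed in verdicts.items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

GPTBot comes back blocked and OAI-SearchBot allowed. PerplexityBot has no group of its own and there is no wildcard group in this file, so it is allowed by default.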
The wildcard User-agent: * matches all bots that don't have a specific rule. It's commonly used by SEO plugins to generate blanket rules, and it's the source of the most common AI visibility problem on the web.
Here's what a dangerous robots.txt looks like:
# Auto-generated by SEO plugin v3.2
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
# Block AI training bots (added by plugin)
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Disallow: /
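The second wildcard block at the end (User-agent: * followed by Disallow: /) blocks every crawler that doesn't already have an explicit rule. Parsers that follow RFC 9309 merge it with the first wildcard group, so the site-wide Disallow: / applies on top of the /wp-admin/ rules. In this file, OAI-SearchBot doesn't have an explicit rule, so it falls back to the wildcard and gets blocked. Same for Claude-SearchBot. Same for PerplexityBot. Googlebot and Bingbot are blocked too. The site owner intended to block training crawlers; they ended up blocking all AI citations.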
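The failure can be reproduced with a small evaluator. This is a simplified sketch of RFC 9309 behavior, not a full implementation: it assumes exact user-agent tokens and plain prefix paths (no * or $ wildcards), and the function names parse_groups and is_allowed are illustrative. It is hand-rolled because, at the time of writing, CPython's urllib.robotparser keeps only the first User-agent: * group it sees and so would not show the merged result.

```python
DANGEROUS = """\
# Auto-generated by SEO plugin v3.2
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
# Block AI training bots (added by plugin)
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Disallow: /
"""

def parse_groups(text):
    """Map each lowercased user-agent token to its rules, merging duplicate groups."""
    groups = {}
    current_agents = []
    seen_rule = False  # True once the current group has at least one rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                current_agents = []
                seen_rule = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            if value:  # an empty path matches nothing
                for agent in current_agents:
                    groups[agent].append((field == "allow", value))
    return groups

def is_allowed(groups, agent, path):
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    rules = groups.get(agent.lower(), groups.get("*", []))
    best = None
    for is_allow, prefix in rules:
        if path.startswith(prefix):
            candidate = (len(prefix), is_allow)
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

groups = parse_groups(DANGEROUS)
for bot in ("GPTBot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot"):
    verdict = "allowed" if is_allowed(groups, bot, "/blog/post") else "blocked"
    print(f"{bot}: {verdict}")
```

Every bot in the loop is reported as blocked, including the retrieval crawlers the site owner never meant to touch.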
Here's a robots.txt that blocks training crawlers (a legitimate choice) while keeping all search/retrieval crawlers open:
# Allow all standard search crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Allow AI search/retrieval crawlers (these power citations)
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Block training-only crawlers if you prefer
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Everything else: allow by default
User-agent: *
Allow: /
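A quick audit can confirm that a configuration like the one above blocks only what you intended. The sketch below uses Python's urllib.robotparser and groups the bots by the roles listed in the table earlier in this article; the audit function name and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Roles taken from the user-agent table earlier in the article
RETRIEVAL_BOTS = ["Googlebot", "Bingbot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"]

RECOMMENDED = """\
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: *
Allow: /
"""

def audit(robots_txt, url="https://example.com/"):
    """Return (blocked retrieval bots, blocked training bots) for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked_retrieval = [b for b in RETRIEVAL_BOTS if not parser.can_fetch(b, url)]
    blocked_training = [b for b in TRAINING_BOTS if not parser.can_fetch(b, url)]
    return blocked_retrieval, blocked_training

blocked_retrieval, blocked_training = audit(RECOMMENDED)
print("Retrieval bots blocked:", blocked_retrieval or "none")
print("Training bots blocked:", blocked_training or "none")
```

Any name in the first list is a citation problem; names in the second list are a policy choice.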
If you want to allow everything — training crawlers included — the simplest valid robots.txt is:
User-agent: *
Allow: /
Or just an empty file. The robots.txt protocol is permissive by default: if a bot has no matching rule, it's allowed.
You can block AI crawlers from specific sections while allowing the rest:
# Allow OAI-SearchBot everywhere except /private/ and /internal/
User-agent: OAI-SearchBot
Allow: /
Disallow: /private/
Disallow: /internal/
This is useful if you have sections with customer data, internal documentation, or premium content you don't want indexed.
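The reason Disallow: /private/ overrides Allow: / is path specificity: RFC 9309 tells crawlers to apply the rule with the longest matching path and to prefer Allow on a tie. Here is a minimal sketch of that rule, assuming plain prefix paths with no * or $ wildcards; the function name and sample paths are illustrative.

```python
RULES = [
    ("allow", "/"),
    ("disallow", "/private/"),
    ("disallow", "/internal/"),
]

def longest_match_allowed(rules, path):
    """Apply the longest matching prefix; Allow wins ties; no match means allowed."""
    best = None
    for directive, prefix in rules:
        if path.startswith(prefix):
            # Tuples compare by length first, then True (allow) beats False
            candidate = (len(prefix), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

results = {
    path: longest_match_allowed(RULES, path)
    for path in ("/blog/post", "/private/report.pdf", "/internal/wiki", "/privately-held")
}
for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Some parsers, including Python's urllib.robotparser, apply the first matching rule in file order instead. Listing the Disallow lines before Allow: / gives the same result under both interpretations.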
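Some robots.txt files include Crawl-delay: 10 to rate-limit crawlers. Crawl-delay is a non-standard extension: it is not part of RFC 9309, Googlebot ignores it, and support among AI crawlers varies, so check each vendor's documentation before relying on it. Where it is honored, it is the first lever to try when your server is under crawl load. It's separate from access control.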
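To see how a client that honors the directive would read it, here is a short sketch using Python's urllib.robotparser, which exposes the value through crawl_delay(). The file contents and the example.com URL are made up for illustration, and whether a real crawler acts on the value is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay in seconds, or None if no Crawl-delay applies
delay = parser.crawl_delay("PerplexityBot")
blocked = not parser.can_fetch("PerplexityBot", "https://example.com/blog/post")
print(f"delay={delay}s, blocked={blocked}")
```

The delay and the access verdict are read independently: this file asks for a 10-second pause but still allows /blog/post.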
robots.txt controls crawler access. It doesn't control:
Paste your domain into letthebots.in. The scan reads your robots.txt, evaluates every AI crawler user agent against it, and returns a per-bot verdict: blocked, allowed, or conditionally allowed. It's the fastest way to see what each AI engine actually sees when it knocks on your door.
For manual testing, most AI companies publish documentation with their crawler IP ranges. Cross-reference your server logs to confirm which bots are actually reaching you.
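A rough first pass over server logs can be scripted. The sketch below counts hits per AI crawler by looking for its token in the user-agent field; the log lines are hypothetical, use simplified made-up user-agent strings, and assume combined log format with the user agent as the last quoted field.

```python
import re
from collections import Counter

AI_BOT_TOKENS = [
    "OAI-SearchBot", "GPTBot", "ChatGPT-User",
    "Claude-SearchBot", "ClaudeBot", "PerplexityBot",
]

# Hypothetical log lines with simplified user-agent strings, for illustration only
SAMPLE_LOG = """\
203.0.113.7 - - [01/Jan/2025:10:01:22 +0000] "GET /robots.txt HTTP/1.1" 200 312 "-" "ExampleUA (compatible; OAI-SearchBot/1.0)"
203.0.113.7 - - [01/Jan/2025:10:01:25 +0000] "GET /blog/post HTTP/1.1" 200 8841 "-" "ExampleUA (compatible; OAI-SearchBot/1.0)"
198.51.100.4 - - [01/Jan/2025:10:02:03 +0000] "GET /blog/post HTTP/1.1" 403 153 "-" "ExampleUA (compatible; GPTBot/1.0)"
192.0.2.55 - - [01/Jan/2025:10:03:47 +0000] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"
"""

# The user agent is the last double-quoted field on the line
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

def count_bot_hits(log_text):
    """Count requests per AI bot token found in the user-agent field."""
    hits = Counter()
    for line in log_text.splitlines():
        match = UA_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits

hits = count_bot_hits(SAMPLE_LOG)
for token in AI_BOT_TOKENS:
    print(f"{token}: {hits[token]}")
```

User-agent strings can be spoofed, so a count like this only tells you who claims to be visiting. The published IP ranges remain the check that confirms it.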