Deep Dive 10 min read · Updated June 2026

How AI Citations Actually Work: ChatGPT, Perplexity, Claude, and Gemini

When ChatGPT or Perplexity answers a question and cites a website, what actually drove that decision? Here's the retrieval pipeline behind AI citations, and what your site needs to get included.

Training vs. retrieval: the distinction that changes everything

There are two completely different ways an AI system can "know" something from your site, and they work through entirely separate pipelines.

Training happens before the model is deployed. During training, crawlers collect billions of pages from the web and the model learns patterns, facts, and language from this data. When you block a training crawler or control token (GPTBot, ClaudeBot, Google-Extended), you're limiting what the model absorbs during that phase. This affects the model's general knowledge, but not whether the system cites your site in real-time answers.

Retrieval happens during inference — when a user asks a question. A separate crawler has indexed your site, and when a relevant query comes in, the AI system retrieves matching content, reads it, and cites it in its response. This is the pipeline that drives the citations you see in ChatGPT search results, Perplexity answers, and Claude's web search mode.

Most sites that want to appear in AI answers are optimizing the wrong pipeline: they focus on training when retrieval is what drives citations. A model can know your brand thoroughly from training, but if the retrieval crawler is blocked, you won't appear when users ask questions your content answers.
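
To make the split concrete, here is a minimal Python sketch that checks one robots.txt file against both groups of crawlers using the standard library's urllib.robotparser. The sample robots.txt, the example.com URL, and the two bot groupings are illustrative assumptions. Python's parser also applies simple first-match rules, which may differ from how each vendor's crawler handles edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opts out of training, stays open to retrieval.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""

# Groupings follow the article's description of each bot's role.
TRAINING_BOTS = ["GPTBot", "ClaudeBot"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot"]


def crawler_access(robots_txt: str, bots: list[str],
                   url: str = "https://example.com/") -> dict[str, bool]:
    """Return {bot: allowed} for the given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    for group, bots in (("training", TRAINING_BOTS), ("retrieval", RETRIEVAL_BOTS)):
        for bot, allowed in crawler_access(SAMPLE_ROBOTS_TXT, bots).items():
            print(f"{group:9} {bot:17} {'allowed' if allowed else 'blocked'}")
```

With this sample file, GPTBot and ClaudeBot are blocked while OAI-SearchBot, PerplexityBot, and Claude-SearchBot are allowed. That is the shape you'd want if you're opting out of training but still want to be cited.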

Retrieval-Augmented Generation (RAG)

The technical name for retrieval-based AI answers is Retrieval-Augmented Generation, or RAG. Here's how the pipeline works when a user asks a question in ChatGPT's search mode:

1. Query processing. The model interprets the user's question and generates a search query (or multiple queries).
2. Retrieval. The system searches a pre-built index of crawled web content. This is the index that OAI-SearchBot builds continuously.
3. Ranking. Retrieved documents are ranked by relevance to the query. Freshness, authority signals, and content quality all influence ranking.
4. Reading and synthesis. The AI reads the top-ranked documents, extracts relevant passages, and synthesizes an answer.
5. Citation. The AI cites the sources it drew from. Not all retrieved documents get cited, only those the AI actually used in its answer.

Steps 1–3 require that your site appears in the retrieval index at all. Steps 4–5 require that your content is readable, relevant, and quotable when the AI reads it.
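
The five steps can be sketched as a toy pipeline. This is not how any engine is actually implemented: it assumes a tiny in-memory index, bag-of-words cosine similarity in place of a real ranking model, and "quote the best-matching sentence" as a stand-in for synthesis. The Document class, the example.com URLs, and the function names are all hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    text: str


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_and_rank(query: str, index: list[Document], top_k: int = 2):
    """Steps 1-3: turn the query into terms, retrieve matches, rank by similarity."""
    query_terms = Counter(tokenize(query))
    scored = [(cosine(query_terms, Counter(tokenize(d.text))), d) for d in index]
    scored = [(s, d) for s, d in scored if s > 0]  # only pages in the index can match
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def best_sentence(query_terms: Counter, text: str) -> tuple[float, str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return max((cosine(query_terms, Counter(tokenize(s))), s) for s in sentences)


def answer_with_citations(query: str, index: list[Document], top_k: int = 2):
    """Steps 4-5: read the top documents, quote a passage, cite only what was used."""
    query_terms = Counter(tokenize(query))
    passages, citations = [], []
    for _, doc in retrieve_and_rank(query, index, top_k):
        score, sentence = best_sentence(query_terms, doc.text)
        if score > 0:
            citations.append(doc.url)
            passages.append(f"{sentence} [{len(citations)}]")
    return " ".join(passages), citations


INDEX = [
    Document("https://example.com/robots",
             "OAI-SearchBot builds the retrieval index. "
             "Blocking it in robots.txt removes a site from ChatGPT search."),
    Document("https://example.com/recipes",
             "Slow-roasted tomatoes need three hours in a low oven."),
    Document("https://example.com/training",
             "GPTBot collects pages for model training. It does not build the search index."),
]

answer, sources = answer_with_citations("Which crawler builds the ChatGPT search index?", INDEX)
print(answer)
print(sources)
```

The recipe page never gets cited because it shares no terms with the query, and a page missing from INDEX could not be cited no matter how good it is. That second point is the one that matters for access.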

Per-engine breakdown

ChatGPT (OpenAI)

ChatGPT in "Search" mode (enabled by default since late 2024) retrieves content using OAI-SearchBot. This is separate from GPTBot (training). The index is proprietary (OpenAI doesn't publish its full methodology), but the patterns are consistent with standard RAG: recency, relevance, and source authority are the primary ranking signals.

ChatGPT cites sources inline with numbered footnotes. The selection of which sources to cite is model-determined, not purely algorithmic — the AI decides which retrieved content it actually used to construct the answer. Sites with clear, quotable prose tend to appear more often than sites with dense, hedged, or marketing-heavy copy.

Blocking OAI-SearchBot in robots.txt removes you from this pipeline entirely. Blocking GPTBot does not.

Perplexity

Perplexity is the most transparent of the major AI citation engines about its retrieval behavior. It uses PerplexityBot for crawling and supplements this with the Bing index (licensed from Microsoft). Queries that hit Perplexity trigger a real-time web search, and the cited sources appear in the answer with numbered citations and a source list.

Perplexity's citation selection is especially easy to test: ask a specific question that your content answers, and if you're in their index with relevant content, you'll often see yourself cited within seconds. The most common reason sites don't appear is a robots.txt rule blocking PerplexityBot, which happens frequently on sites that adopted blanket AI-block configurations.

Perplexity also indexes llms.txt and respects it when present.

Claude (Anthropic)

Claude's web search capability (available in Claude.ai and through the API) uses Claude-SearchBot for its retrieval index. The index is separate from Anthropic's training crawlers (ClaudeBot, anthropic-ai).

Claude's citation behavior tends toward fewer, more confident citations than Perplexity. It's more selective about which sources it quotes, which means that when it does cite you, it's because your content had a strong relevance match and sufficient clarity for the AI to extract a precise answer. Entity signals — Wikidata presence, authorship attribution, organization schema — appear to influence Claude's citation confidence significantly.

Gemini (Google)

Google's AI features (AI Overviews, previously tested as Search Generative Experience, and Gemini) draw from Google's existing web index, built primarily by Googlebot. This is both an advantage and a constraint: if you're already indexed and ranking well in Google Search, you're likely already in the retrieval pool for Google's AI features. But Google also considers traditional SEO signals, such as PageRank and E-E-A-T, when selecting sources for AI Overviews, not just raw content relevance.

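Google also offers Google-Extended, a robots.txt control token for AI training use that is separate from Googlebot. It has no crawler user agent of its own: Google's existing crawlers do the fetching, and the token only governs how the fetched content may be used. Blocking Google-Extended doesn't affect your Google Search ranking or your appearance in AI Overviews. Blocking Googlebot affects both.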

Microsoft Copilot

Copilot draws on Bing's index, built by Bingbot. If you're indexed in Bing, you're in Copilot's retrieval pool. Microsoft has expanded Copilot's web access significantly since 2024, and the citation behavior mirrors Bing's existing indexing signals. Sites with strong Bing presence tend to appear in Copilot answers; sites that have historically deprioritized Bing optimization often have gaps here.

What all citation pipelines have in common

Despite the differences between engines, four factors consistently influence whether and how sites get cited:

Access: is the crawler allowed in?

Every pipeline starts with crawler access. robots.txt is the first gate. Server-level blocks (WAF rules, Cloudflare Bot Fight Mode, IP-based rate limiting) are a second gate that operates independently of robots.txt. Both need to be clear for any retrieval to happen.
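
Because the two gates are independent, it helps to record them separately. The sketch below combines a robots.txt verdict with an HTTP status you observed when requesting the page with that bot's user agent. The GateReport class, the check_gates helper, and the set of "blocking" status codes are assumptions for illustration. Many WAFs also verify crawlers by IP range, so a request you send yourself with a spoofed user agent may not reproduce what the real crawler sees.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

# Assumption: HTTP statuses this sketch treats as a server-level block.
BLOCKING_STATUSES = {401, 403, 429, 503}


@dataclass
class GateReport:
    bot: str
    robots_allowed: bool  # gate 1: robots.txt
    server_allowed: bool  # gate 2: WAF, bot protection, rate limiting

    @property
    def reachable(self) -> bool:
        # Both gates must be open for retrieval to happen.
        return self.robots_allowed and self.server_allowed


def check_gates(bot: str, robots_txt: str, http_status: int,
                url: str = "https://example.com/") -> GateReport:
    """Combine a robots.txt verdict with an HTTP status observed for that bot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return GateReport(
        bot=bot,
        robots_allowed=parser.can_fetch(bot, url),
        server_allowed=http_status not in BLOCKING_STATUSES,
    )


# robots.txt is wide open, but the server answered the bot with 403.
report = check_gates("PerplexityBot", "User-agent: *\nAllow: /\n", http_status=403)
print(report.robots_allowed, report.server_allowed, report.reachable)
```

In this example the robots.txt gate is open and the server gate is closed, so the page is still unreachable. A robots.txt audit alone would miss that.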

Readability: can the crawler extract your content?

AI crawlers mostly don't render JavaScript. A React SPA that loads content via client-side API calls will look like an empty page to most AI retrieval crawlers. Server-side rendered content, static HTML, and hybrid frameworks configured for SSR or static generation fare much better.
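
You can approximate what a non-rendering crawler sees by extracting text from the raw HTML without executing any scripts. The sketch below uses Python's built-in html.parser. The two sample pages and the VisibleTextExtractor class are hypothetical, and real crawlers use their own extraction logic, so treat the result as a rough signal rather than an exact reproduction.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text a non-rendering crawler could read from raw HTML."""

    # Tags whose contents are not body copy.
    SKIPPED = {"script", "style", "title"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    extractor = VisibleTextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


# Client-rendered shell: content only exists after JavaScript runs.
SPA_SHELL = """<html><head><title>App</title></head>
<body><div id="root"></div><script src="/bundle.js"></script></body></html>"""

# Server-rendered page: content is already in the HTML response.
SSR_PAGE = """<html><head><title>Guide</title></head>
<body><h1>How AI citations work</h1>
<p>Retrieval happens at inference time.</p></body></html>"""

print(repr(visible_text(SPA_SHELL)))
print(repr(visible_text(SSR_PAGE)))
```

The SPA shell yields an empty string, while the server-rendered page yields its heading and paragraph. Running the same check against your own page source (not the rendered DOM in dev tools) is a quick way to spot this problem.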

Relevance: does your content actually answer the query?

Retrieval systems match content to queries semantically, not just by keyword. Answering questions directly — stating the answer in the first sentence of a section, using question-format headings, organizing content in discrete logical units — significantly improves relevance matching and extractability.
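
As a rough self-audit, you can check each section for the two structural habits described above: a question-format heading and a short, direct first sentence. The heuristics below (the list of question openers, the 30-word limit, and the rule that an opening sentence shouldn't itself be a question) are assumptions chosen for illustration, not rules any engine is known to apply.

```python
import re
from dataclasses import dataclass

# Assumption: a small, non-exhaustive set of words that open a question heading.
QUESTION_OPENERS = {"how", "what", "why", "when", "where", "which", "who",
                    "can", "does", "do", "is", "are", "should"}


@dataclass
class SectionAudit:
    heading: str
    question_heading: bool
    first_sentence: str
    direct_answer: bool


def audit_section(heading: str, body: str, max_words: int = 30) -> SectionAudit:
    """Flag whether a section leads with a question heading and a direct answer."""
    words = heading.lower().split()
    question_heading = heading.strip().endswith("?") or (
        bool(words) and words[0] in QUESTION_OPENERS
    )
    # First sentence = text up to the first terminal punctuation mark.
    first_sentence = re.split(r"(?<=[.!?])\s+", body.strip(), maxsplit=1)[0]
    word_count = len(first_sentence.split())
    direct_answer = 0 < word_count <= max_words and not first_sentence.endswith("?")
    return SectionAudit(heading, question_heading, first_sentence, direct_answer)


good = audit_section(
    "Does blocking GPTBot remove my site from ChatGPT search?",
    "No. Blocking GPTBot only limits training. OAI-SearchBot feeds the search index.",
)
weak = audit_section(
    "Our philosophy",
    "Have you ever wondered what crawlers see? Many teams have.",
)
print(good.question_heading, good.direct_answer)
print(weak.question_heading, weak.direct_answer)
```

The first section passes both checks and the second fails both. A script like this won't tell you whether the answer is correct or relevant, only whether it is positioned where an extraction step is likely to find it.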

Trust: does the system consider you a reliable source?

This is where entity signals, authorship, and authority footprint matter. AI systems are more willing to cite sources they can verify — organizations with Wikidata entries, named authors with online presence, sites with clear sameAs links to authoritative directories. Anonymous sources get cited less confidently and less often, especially for factual or medical/legal/financial queries where trust matters most.

Why some sites get cited far more than others

Content quality is the obvious answer, but it's not the whole picture. Two sites with equally accurate, well-written content on the same topic will have different citation rates if one has:

A Wikidata entity and Wikipedia presence
Named author attribution on every article
Organization schema with verified sameAs links
Content structured for extraction (H2/H3 hierarchy matching user questions)
Consistent publication and update signals (datePublished, dateModified in JSON-LD)

These are the signals that tip the balance when two sources are roughly equivalent in relevance. They're also the signals most commonly absent from mid-tier content sites, which is why improving entity authority is one of the highest-leverage GEO moves a site can make.
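
Several of the signals in the list above live in structured data. Here is a minimal sketch that builds Article JSON-LD with a named author, an Organization publisher with sameAs links, and datePublished/dateModified. The property names are standard schema.org vocabulary. Every value, including the placeholder Wikidata ID, the names, and the dates, is made up for illustration, and the article_jsonld helper is hypothetical.

```python
import json


def article_jsonld(headline: str, author_name: str, org_name: str, org_url: str,
                   same_as: list[str], date_published: str, date_modified: str) -> dict:
    """Build a schema.org Article object with author, publisher, and date signals."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "publisher": {
            "@type": "Organization",
            "name": org_name,
            "url": org_url,
            "sameAs": same_as,  # links to profiles that identify the same entity
        },
        "datePublished": date_published,  # ISO 8601 dates
        "dateModified": date_modified,
    }


data = article_jsonld(
    headline="How AI Citations Actually Work",
    author_name="Jane Example",                       # placeholder author
    org_name="Example Co",                            # placeholder organization
    org_url="https://example.com",
    same_as=["https://www.wikidata.org/wiki/Q00000000",  # placeholder entity ID
             "https://www.linkedin.com/company/example"],
    date_published="2026-01-15",
    date_modified="2026-06-01",
)

# Embed the result in the page's HTML so it ships with the initial response.
print('<script type="application/ld+json">')
print(json.dumps(data, indent=2))
print("</script>")
```

Emit this from the server rather than injecting it with client-side JavaScript, so crawlers that don't render scripts still receive it.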

How to see where you stand

The fastest check is to paste your domain into letthebots.in. The scan reports, bot by bot, whether each engine's search or retrieval crawler can access your site, which is the first gate in every citation pipeline. The results page unlocks a prioritized fix list when you enter your email.

For a manual check, search Perplexity directly for a question your content answers. If you don't appear as a source, check whether PerplexityBot is blocked in your robots.txt and whether your content is rendered in static HTML.
