A

AI citations

An AI citation occurs when an AI system such as ChatGPT, Claude, or Perplexity references a specific website as the source of information in a generated answer. Citations come from retrieval pipelines rather than from the model's training data, which is why retrieval crawler access is critical.

AI crawler

A bot that automatically fetches web pages on behalf of an AI platform. AI crawlers fall into two categories: training crawlers (build the model's underlying knowledge base) and search/retrieval crawlers (index content for real-time citation). They are controlled by robots.txt directives.

AI Overviews

Google's feature that shows an AI-generated summary at the top of search results for certain queries. AI Overviews draw from Google's main web index (built by Googlebot), not a separate AI-specific index. Traditional SEO signals like PageRank and E-E-A-T influence which sources appear.

authority (entity authority)

The degree to which AI systems recognize and trust an entity — a person, organization, or brand — as a credible source on a topic. Authority is built through Wikidata presence, sameAs links to authoritative directories, named authorship, and an organization's footprint across the web. Higher authority leads to more confident AI citations.

B

Bingbot

Microsoft's primary web crawler, used to build the Bing search index. Copilot draws on this index for retrieval-augmented generation, so Bingbot is the retrieval crawler that determines Copilot citation eligibility. Blocking Bingbot removes your site from both Bing search results and Copilot answers.

C

ChatGPT-User

The user agent string for ChatGPT's on-demand browsing feature (Browse with Bing / ChatGPT browse). Unlike OAI-SearchBot, which builds a continuous index, ChatGPT-User fetches specific pages in real time when the user triggers a browse action. Blocking it affects in-session browsing but not the general citation pipeline.

citation pipeline

The end-to-end process by which a URL goes from existing on the web to being cited in an AI-generated answer. Stages: crawl → extract content → index → retrieve on query match → cite in response. A failure at any stage prevents citation.
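
As a rough illustration of why a failure at any stage prevents citation, the stages can be modeled as an ordered checklist. This is a toy sketch, not how any AI platform implements its pipeline: the stage names mirror the list above, while `first_blocked_stage` and the sample status values are hypothetical.

```python
from typing import Dict, Optional

# Toy model of the citation pipeline: stages run in order, and the
# first failing stage blocks everything after it.
STAGES = ["crawl", "extract", "index", "retrieve", "cite"]


def first_blocked_stage(status: Dict[str, bool]) -> Optional[str]:
    """Return the first stage that failed, or None if every stage passed.

    A stage missing from `status` is treated as failed.
    """
    for stage in STAGES:
        if not status.get(stage, False):
            return stage
    return None


# Hypothetical page: crawlable, but its content cannot be extracted
# (for example, it only renders with JavaScript).
page_status = {"crawl": True, "extract": False, "index": True}
blocked_at = first_blocked_stage(page_status)
print(blocked_at)
```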

Claude-SearchBot

Anthropic's search and retrieval crawler. It builds the index Claude draws on when answering queries in web search mode. Separate from ClaudeBot (training) and anthropic-ai (research). Blocking Claude-SearchBot in robots.txt prevents Claude from citing your content.

ClaudeBot

Anthropic's training crawler. It fetches web content to train Claude's underlying models. Blocking ClaudeBot limits what Claude's models learn during training, but it does not affect real-time citation eligibility (that's controlled by Claude-SearchBot).

Crawl-delay

A robots.txt directive that tells crawlers to wait a specified number of seconds between page fetches. Used to limit crawl load on servers. Support varies by crawler: Bingbot honors it, while Googlebot ignores it, so check each AI crawler's documentation before relying on it. Example: Crawl-delay: 10 instructs bots to fetch no more than one page every 10 seconds.
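
One way to see how a Crawl-delay value is read is with Python's standard-library `urllib.robotparser`. This is a minimal sketch: the `ExampleBot` and `OtherBot` names and the robots.txt content are made up for illustration, and real crawlers may interpret the directive differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay (in seconds) declared for a matching user agent.
delay = parser.crawl_delay("ExampleBot")
# An agent with no matching group and no wildcard group has no delay.
other = parser.crawl_delay("OtherBot")

print(delay, other)
```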

D

dateModified

A property in Article and other schema.org types that records when content was last updated. AI systems use this to assess freshness — an article with a stale dateModified is treated as old content even if the text itself is accurate. Keep dateModified current whenever content is meaningfully updated.
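
A minimal sketch of emitting dateModified alongside datePublished in Article markup is shown here. The `article_jsonld` helper, the headline, and the dates are hypothetical; the property names come from schema.org.

```python
import json
from datetime import date


def article_jsonld(headline: str, published: date, modified: date) -> str:
    """Build Article JSON-LD with ISO 8601 date strings."""
    if modified < published:
        raise ValueError("dateModified cannot be earlier than datePublished")
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }
    return json.dumps(data, indent=2)


# Hypothetical article that was meaningfully updated after publication.
markup = article_jsonld("GEO Glossary", date(2024, 1, 15), date(2024, 6, 1))
print(markup)
```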

Disallow

A robots.txt directive that tells a crawler not to fetch a specific path or URL pattern. Disallow: / blocks the entire site; Disallow: /admin/ blocks only the admin section. The most common GEO mistake is an unintended Disallow: / applied to AI retrieval crawlers via wildcard rules.
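
The difference between the two Disallow forms can be checked with Python's standard-library `urllib.robotparser`. This is a sketch under assumptions: `path_allowed`, `ExampleBot`, and the example.com URLs are illustrative, and individual crawlers may apply their own matching rules.

```python
from urllib.robotparser import RobotFileParser


def path_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ADMIN_ONLY = "User-agent: *\nDisallow: /admin/\n"
WHOLE_SITE = "User-agent: *\nDisallow: /\n"

# Only the admin section is blocked; other paths stay fetchable.
print(path_allowed(ADMIN_ONLY, "ExampleBot", "https://example.com/admin/users"))
print(path_allowed(ADMIN_ONLY, "ExampleBot", "https://example.com/blog/post"))
# Disallow: / blocks every path on the site.
print(path_allowed(WHOLE_SITE, "ExampleBot", "https://example.com/blog/post"))
```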

E

E-E-A-T

Experience, Expertise, Authoritativeness, and Trustworthiness — Google's framework for evaluating content quality. It overlaps significantly with GEO entity authority signals: named authorship, organizational credibility, and accuracy all factor into both. Sites with strong E-E-A-T tend to also have the entity signals AI systems favor.

entity

In the context of AI and structured data, an entity is a distinct, identifiable thing: a person, organization, place, product, or concept. Entities have unique identities across the web and can be cross-referenced between data sources. AI systems treat known entities with verified identities more favorably as citation sources.

entity footprint

The total set of authoritative web records that confirm an entity's existence and identity. For an organization, this includes its Wikidata entry, Wikipedia article, LinkedIn company page, official social profiles, and domain-verified presence. A larger entity footprint means AI systems can cross-verify the entity's existence from multiple independent sources.

extractability

One of the six GEO scoring categories. Measures how easily AI systems can pull a coherent, quotable passage from your content. High extractability = answer-first structure, clear heading hierarchy, use of lists and tables, question-format headings. Low extractability = dense flowing prose, answers buried in the middle of paragraphs, no clear structure.

F

FAQPage schema

A schema.org type for pages that contain question-and-answer pairs. FAQPage structured data is one of the most direct signals you can give AI engines: it puts your answers in a machine-readable format that maps directly to user queries. AI systems can extract and cite these answers with high confidence.
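
For illustration, FAQPage markup can be generated from a list of question-and-answer pairs. The `faq_jsonld` helper and the sample pair are hypothetical; the FAQPage, Question, and Answer types and the mainEntity and acceptedAnswer properties are from schema.org.

```python
import json
from typing import List, Tuple


def faq_jsonld(pairs: List[Tuple[str, str]]) -> str:
    """Build FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


faq_markup = faq_jsonld([
    (
        "What is robots.txt?",
        "A text file that specifies crawl permissions per user agent.",
    ),
])
print(faq_markup)
```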

freshness

One of the six GEO scoring categories. Refers to signals that indicate your content is current: dateModified in schema, sitemap lastmod accuracy, llms.txt presence, HTTPS, fast TTFB, and canonical tag hygiene. AI systems deprioritize content with stale date signals when answering queries where recency matters.

G

GEO (Generative Engine Optimization)

The practice of optimizing your web presence so that AI-powered search engines and chat assistants can find, read, and cite your content. GEO covers six signals: access (robots.txt, server blocks), readability (JS rendering, paywalls), structured data (JSON-LD coverage), authority (entity signals, authorship), extractability (content format), and freshness (date signals, hygiene).

Google-Extended

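A robots.txt product token that controls whether content crawled by Google may be used to train Google's Gemini (formerly Bard) AI models. It is not a separate crawler with its own user agent string; fetching is done by Google's existing crawlers. Blocking Google-Extended affects model training but does not affect your appearance in Google Search results or Google AI Overviews, which draw from the main Googlebot index. This distinction parallels OpenAI's GPTBot vs. OAI-SearchBot separation.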

Googlebot

Google's primary web crawler, used to build the Google Search index. Google's AI features — including AI Overviews and Gemini answers with web grounding — draw from this same index. Blocking Googlebot removes your site from both Google Search and Google AI answers. One of the highest-impact access restrictions possible.

GPTBot

OpenAI's training crawler. Fetches web content to train GPT models. Blocking GPTBot (via Disallow in robots.txt) limits what OpenAI's models learn from your site during pre-training. It does not affect whether ChatGPT cites your site in real-time answers — that's OAI-SearchBot's job.

J

JSON-LD

JavaScript Object Notation for Linked Data. The preferred format for schema.org structured data. Embedded in a <script type="application/ld+json"> tag, typically in the page <head> (it is also valid in the <body>). AI crawlers parse it independently of page content, making it the most reliable way to communicate structured metadata.
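
To show what parsing JSON-LD independently of page content can look like, here is a minimal extractor built on Python's standard-library `html.parser`. It is a sketch, not any crawler's actual implementation: the `JsonLdExtractor` class and the sample HTML are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text can arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


SAMPLE_HTML = """<html><head><title>Example</title>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization", "name": "Example Co"}
</script>
</head><body><p>Visible page text.</p></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
extractor.close()
print(extractor.items)
```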

L

llms.txt

A plain-text file placed at yourdomain.com/llms.txt that gives AI language models a structured overview of a site: what it's for, who runs it, what the key pages are, and what to ignore. Uses Markdown format. Proposed by Jeremy Howard in 2024. Increasingly checked by AI systems as part of site indexing.
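
A small generator can sketch the shape of such a file. The layout below (a title heading, a blockquote summary, then sections of Markdown links) follows the llms.txt proposal as commonly described; the `build_llms_txt` helper, the "Key pages" section name, and all site details are assumptions for illustration.

```python
from typing import List, Tuple

Link = Tuple[str, str, str]  # (title, url, note)


def build_llms_txt(name: str, summary: str,
                   key_pages: List[Link], optional_pages: List[Link]) -> str:
    """Assemble llms.txt content as Markdown text."""
    lines = [f"# {name}", "", f"> {summary}", "", "## Key pages", ""]
    lines += [f"- [{title}]({url}): {note}" for title, url, note in key_pages]
    if optional_pages:
        # Secondary material that a model can skip when context is limited.
        lines += ["", "## Optional", ""]
        lines += [f"- [{title}]({url}): {note}"
                  for title, url, note in optional_pages]
    return "\n".join(lines) + "\n"


# Hypothetical site details.
llms_txt = build_llms_txt(
    "Example Co",
    "Example Co publishes guides on generative engine optimization.",
    [("GEO Glossary", "https://example.com/glossary", "Definitions of GEO terms")],
    [("Changelog", "https://example.com/changelog", "Release history")],
)
print(llms_txt)
```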

N

noindex

A meta robots tag (<meta name="robots" content="noindex">) or HTTP response header (X-Robots-Tag: noindex) that instructs crawlers not to include a page in their index. If set on a page that should be publicly citable, it prevents AI citation engines from indexing it even if crawl access is permitted. A common source of silent AI visibility gaps.
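
A simple check for both noindex mechanisms might look like the following. This is a simplified sketch: `RobotsMetaParser` and `is_noindex` are hypothetical helpers, and the check ignores bot-specific forms such as a meta tag named after a single crawler or an X-Robots-Tag value prefixed with a user agent.

```python
from html.parser import HTMLParser
from typing import Dict


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",")
            )


def is_noindex(html: str, headers: Dict[str, str]) -> bool:
    """Return True if the page or its response headers declare noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_value = ""
    for key, value in headers.items():
        if key.lower() == "x-robots-tag":  # header names are case-insensitive
            header_value = value
    header_directives = {p.strip().lower() for p in header_value.split(",")}
    return "noindex" in parser.directives or "noindex" in header_directives


print(is_noindex('<meta name="robots" content="noindex, nofollow">', {}))
print(is_noindex("<p>Hello</p>", {"X-Robots-Tag": "noindex"}))
print(is_noindex("<p>Hello</p>", {}))
```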

O

OAI-SearchBot

OpenAI's search and retrieval crawler. Builds the real-time index that ChatGPT draws on when citing sources in web search mode. Distinct from GPTBot (training). Blocking OAI-SearchBot in robots.txt means ChatGPT cannot cite your site, regardless of how well your content matches a user's query.

Organization schema

A schema.org type that describes an organization: its name, URL, logo, description, and — critically for GEO — sameAs links to external authoritative records. Organization schema with verified sameAs links (Wikidata, Wikipedia, LinkedIn) is the highest-leverage structured data for entity authority.
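
As an illustration, the markup described above could be assembled like this. Every value is a placeholder: "Example Co", the example.com URLs, and the profile links are hypothetical, and the Wikidata URL keeps its elided Q-number, which must be replaced with the organization's real identifier.

```python
import json

# Placeholder organization details for illustration only.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "url": "https://example.com",
    "logo": "https://example.com/logo.png",
    "description": "Example Co publishes guides on generative engine optimization.",
    "sameAs": [
        "https://www.wikidata.org/wiki/Q...",  # replace with the real Q-number
        "https://en.wikipedia.org/wiki/Example_Co",  # hypothetical article
        "https://www.linkedin.com/company/example-co",  # hypothetical profile
    ],
}

# Wrap the JSON in the script tag that carries JSON-LD in a page.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(organization, indent=2)
    + "\n</script>"
)
print(script_tag)
```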

P

PerplexityBot

Perplexity's web crawler, used to build its search index. Perplexity supplements this with a licensed Bing index. Since Perplexity's entire product is citation-driven, PerplexityBot is the retrieval crawler that most directly determines whether your site appears in Perplexity answers.

R

RAG (Retrieval-Augmented Generation)

The technical architecture behind AI systems that retrieve web content before generating answers. The AI system indexes your content using a retrieval crawler, then when a user asks a question, the system retrieves relevant indexed content and uses it to generate a cited response. Most AI citation pipelines are built on RAG.
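
The retrieve-then-generate flow can be sketched with a deliberately tiny example. This is a toy under loud assumptions: real systems use search indexes or embeddings and a language model, whereas here retrieval is plain keyword overlap and "generation" just quotes the top page. The `Page`, `retrieve`, and `answer` names and the sample pages are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    url: str
    text: str


def tokenize(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, index: List[Page], k: int = 1) -> List[Page]:
    """Rank indexed pages by keyword overlap with the query."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(page.text)), page) for page in index]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [page for _, page in scored[:k]]


def answer(query: str, index: List[Page]) -> str:
    """Stand-in for generation: quote the top hit and cite its URL."""
    hits = retrieve(query, index)
    if not hits:
        return "No indexed source matched the query."
    return f"{hits[0].text} [source: {hits[0].url}]"


# Hypothetical index built earlier by a retrieval crawler.
index = [
    Page("https://example.com/crawl-delay",
         "Crawl-delay tells crawlers to wait between page fetches."),
    Page("https://example.com/sitemap",
         "A sitemap lists the URLs on your site."),
]
result = answer("What does crawl-delay tell crawlers?", index)
print(result)
```

A page that was never crawled is simply absent from `index`, so it can never be returned, however well it matches the query.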

retrieval crawler

A web crawler that builds the real-time index used for AI citations. Examples: OAI-SearchBot (ChatGPT), Claude-SearchBot (Claude), PerplexityBot (Perplexity), Googlebot (Gemini), Bingbot (Copilot). Distinct from training crawlers. Blocking a retrieval crawler removes your site from that AI engine's citation pool.
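
A quick audit of which retrieval crawlers a robots.txt file blocks could be sketched as follows, using Python's standard-library `urllib.robotparser`. The crawler names come from the list above; the robots.txt content, `blocked_retrieval_crawlers`, and the example.com URL are hypothetical, and each platform's own matching logic may differ from the parser's.

```python
from typing import List
from urllib.robotparser import RobotFileParser

RETRIEVAL_CRAWLERS = [
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot", "Bingbot",
]

# Hypothetical file: blocks two training crawlers and, perhaps by
# accident, one retrieval crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
"""


def blocked_retrieval_crawlers(robots_txt: str, url: str) -> List[str]:
    """List the retrieval crawlers that may not fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        agent for agent in RETRIEVAL_CRAWLERS
        if not parser.can_fetch(agent, url)
    ]


blocked = blocked_retrieval_crawlers(ROBOTS_TXT, "https://example.com/article")
print(blocked)
```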

robots.txt

A text file at yourdomain.com/robots.txt that specifies crawl permissions per user agent. Controls which bots can access which URLs on your site. Each AI platform has distinct user agent strings for its retrieval crawlers — allowing or blocking these determines AI citation eligibility. The default (if no file exists or a bot has no matching rule) is to allow.

S

sameAs

A schema.org property that links an entity to external, authoritative records of that entity's identity. An Organization schema with sameAs pointing to Wikidata (https://www.wikidata.org/wiki/Q...), Wikipedia, and LinkedIn tells AI systems that this organization has a verified identity across multiple independent sources. Strongest single structured data signal for entity authority.

schema.org

A collaborative vocabulary for structured data on the web, founded by Google, Microsoft, Yahoo, and Yandex and developed through an open community process. Defines hundreds of types (Organization, Article, FAQPage, Person, etc.) and properties (name, author, dateModified, sameAs, etc.) that can be embedded in web pages to give machine-readable context to content.

sitemap.xml

An XML file that lists the URLs on your site along with metadata like lastmod (last modified date) and changefreq. Helps crawlers discover your pages and prioritize crawl frequency. AI retrieval crawlers use sitemaps for discovery; accurate lastmod signals contribute to freshness scoring.
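
To check lastmod accuracy, a sitemap can be parsed with Python's standard-library `xml.etree.ElementTree`. This is a sketch: the sitemap content, the `stale_urls` helper, and the 365-day threshold are assumptions, not a documented cutoff used by any AI system.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import List

# Namespace defined by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap for illustration.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-05-20</lastmod></url>
  <url><loc>https://example.com/old</loc><lastmod>2021-01-03</lastmod></url>
  <url><loc>https://example.com/undated</loc></url>
</urlset>"""


def stale_urls(sitemap_xml: str, today: date, max_age_days: int = 365) -> List[str]:
    """Return URLs whose lastmod is older than `max_age_days`."""
    root = ET.fromstring(sitemap_xml)
    stale = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None:
            continue  # no date signal to evaluate
        # lastmod may include a time; the first 10 characters are the date.
        modified = date.fromisoformat(lastmod[:10])
        if (today - modified).days > max_age_days:
            stale.append(loc)
    return stale


stale = stale_urls(SITEMAP, today=date(2024, 6, 1))
print(stale)
```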

T

training crawler

A web crawler that fetches content to train an AI model's underlying knowledge base during pre-training. Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google), CCBot (Common Crawl). Blocking training crawlers is a legitimate choice that limits what the model learns from your site, but does not directly affect real-time citation eligibility.

TTFB (Time to First Byte)

The time between a crawler making an HTTP request and receiving the first byte of the response. A GEO freshness and hygiene signal. High TTFB can cause crawlers to time out before receiving your content, effectively making the page invisible. Also correlated with overall server health.

U

user-agent

A string that identifies the software making an HTTP request. Web crawlers set their user-agent when fetching pages, and robots.txt uses the User-agent directive to match rules to specific crawlers. For example, User-agent: OAI-SearchBot applies rules specifically to OpenAI's search crawler.

W

Wikidata

A free, structured knowledge base maintained by the Wikimedia Foundation. Wikidata assigns unique identifiers (Q-numbers) to entities and stores verified properties about them. An organization's Wikidata entry, linked via sameAs in Organization schema, is one of the strongest entity authority signals for AI citations.

wildcard user-agent (*)

In robots.txt, User-agent: * matches all crawlers that don't have a more specific rule. A Disallow: / rule under User-agent: * blocks every bot without an explicit exception. This is the most common cause of unintended AI citation blocking — SEO plugins often add wildcard rules that catch AI retrieval crawlers.
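
The effect of a wildcard block, and of adding an explicit exception, can be demonstrated with Python's standard-library `urllib.robotparser`. This is a sketch: `agent_allowed` and the example.com URL are hypothetical, and the crawler names are the ones used elsewhere in this glossary.

```python
from urllib.robotparser import RobotFileParser


def agent_allowed(robots_txt: str, agent: str,
                  url: str = "https://example.com/article") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A wildcard rule that blocks every crawler.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# The same rule plus an explicit exception for one retrieval crawler.
WITH_EXCEPTION = BLOCK_ALL + "\nUser-agent: OAI-SearchBot\nAllow: /\n"

print(agent_allowed(BLOCK_ALL, "OAI-SearchBot"))
print(agent_allowed(WITH_EXCEPTION, "OAI-SearchBot"))
print(agent_allowed(WITH_EXCEPTION, "PerplexityBot"))
```

The more specific group wins for the crawler it names, while every other bot still falls through to the wildcard rule.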

GEO Glossary