robots.txt for AI
Configure robots.txt to control AI crawler access and maximize your visibility across ChatGPT, Perplexity, Claude, and Google AI Overviews.
robots.txt is a plain text file at your website root that controls which crawlers can access your content. With the rise of AI search engines, your robots.txt configuration directly impacts whether your brand appears in AI-generated responses from ChatGPT, Perplexity, Claude, and Google AI Overviews. A 2025 study by Originality.ai found that over 35% of the top 1,000 websites block at least one AI crawler, often unintentionally reducing their AI visibility.
What Is robots.txt and Why Does It Matter for AI?
robots.txt is a standard protocol (defined in RFC 9309) that tells web crawlers which parts of a website they are allowed to access. Traditionally used for search engines like Googlebot and Bingbot, it now plays a critical role in Generative Engine Optimization (GEO) because AI platforms use dedicated crawlers to index content for their models. If your robots.txt blocks these AI crawlers, your pages will not be included in AI training data or real-time retrieval, meaning AI assistants cannot cite your content.
A common misconfiguration is a catch-all `Disallow: /` rule for unknown user agents, which silently blocks all AI crawlers. Use Prominara's AI Visibility Audit to verify your configuration.
AI Crawler User Agents in 2026
Each AI platform operates its own web crawler with a unique user agent string. Here are the active AI crawlers you need to configure:
| User Agent | Platform | Purpose | First Seen |
|---|---|---|---|
| GPTBot | OpenAI (ChatGPT) | Indexes content for ChatGPT responses and model training | 2023 |
| ChatGPT-User | OpenAI (ChatGPT) | Real-time web browsing during ChatGPT conversations | 2023 |
| ClaudeBot | Anthropic (Claude) | Indexes content for Claude AI responses | 2024 |
| PerplexityBot | Perplexity AI | Real-time search and citation for Perplexity answers | 2024 |
| Google-Extended | Google (Gemini) | Indexes content for Gemini and AI Overviews | 2023 |
| Googlebot | Google Search | Traditional search indexing plus AI Overview content | 2004 |
| Bingbot | Microsoft (Copilot) | Indexes content for Bing and Microsoft Copilot | 2012 |
| meta-externalagent | Meta AI | Indexes content for Meta AI assistant | 2024 |
The distinction between GPTBot and ChatGPT-User is important: GPTBot crawls content for training and pre-indexing, while ChatGPT-User fetches pages in real-time when a user asks ChatGPT to browse a specific URL.
Recommended robots.txt Configuration
For maximum AI visibility, we recommend explicitly allowing all AI crawlers. An explicit Allow: / is stronger than simply not mentioning the bot, because some CDNs and WAFs (like Cloudflare) may block unknown user agents by default.
```
# === AI Crawlers ===
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: meta-externalagent
Allow: /

# === Traditional Search Engines ===
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# === Default Rule ===
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/

# === Discovery Files ===
Sitemap: https://yoursite.com/sitemap.xml
```

How Should I Handle Selective Access?
Not all content should be exposed to AI crawlers. Authentication pages, admin panels, API endpoints, and user-generated private content should remain blocked. Here is a selective access configuration:
```
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Allow: /pricing/
Allow: /about/
Allow: /features/
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /settings/
Disallow: /portal/
```

What Should You Allow vs. Block?
| Action | Path Examples | Reason |
|---|---|---|
| Allow | /blog/, /docs/, /pricing/, /about/ | Public content you want AI to cite |
| Allow | /glossary/, /guides/, /case-studies/ | Educational content that builds authority |
| Block | /api/, /admin/, /dashboard/ | Internal application routes |
| Block | /portal/, /settings/, /account/ | Authenticated user areas |
| Block | /checkout/, /payment/ | Sensitive transaction pages |
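Before deploying a selective policy like this, you can sanity-check it locally with Python's standard-library `urllib.robotparser`. A sketch, with made-up rules and paths:

```python
from urllib.robotparser import RobotFileParser

# A selective-access policy in the style shown above, as raw lines.
rules = """\
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "/blog/geo-guide"))  # True
print(rp.can_fetch("GPTBot", "/api/users"))       # False
```

Keep in mind that Python's parser may resolve edge cases (such as overlapping Allow/Disallow prefixes) differently from Google's longest-match rule, so treat it as a quick local check, not an authoritative simulation.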
How Does robots.txt Affect AI Visibility Scores?
Prominara's AI visibility scoring algorithm includes Technical Readiness as one of four scoring categories (weighted at 20% of the overall score). Within Technical Readiness, AI Crawler Access is the highest-weighted factor at 25%. Blocking any of the 4 major AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Googlebot) deducts 15 points each from the Technical Readiness score. A blanket noindex meta tag costs 50 points.
In practice, a site blocking all 4 AI crawlers loses approximately 12 points from its overall AI visibility score (60 points × 0.20 weight = 12 points).
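Using the weights described above, that arithmetic can be spelled out as:

```python
# Weights as described above: each blocked AI crawler deducts 15 points from
# Technical Readiness, which contributes 20% of the overall score.
blocked_crawlers = 4
per_crawler_penalty = 15
technical_readiness_weight = 0.20

tech_deduction = blocked_crawlers * per_crawler_penalty      # 60 points
overall_loss = tech_deduction * technical_readiness_weight
print(overall_loss)  # 12.0
```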
Common Mistakes to Avoid
- Wildcard blocking unknown bots: A rule like `User-agent: *` followed by `Disallow: /` blocks every crawler, including AI bots. Instead, use a permissive wildcard rule and block only specific bad bots.
- Forgetting CDN-level bot blocking: Cloudflare, Akamai, and other CDNs have their own bot management that may block AI crawlers regardless of your robots.txt. Check your CDN settings separately.
- Using noindex instead of robots.txt: A `noindex` meta tag tells crawlers not to index the page, but the crawler still visits the page. robots.txt prevents the visit entirely, saving your server resources.
- Not including a Sitemap directive: Always include a `Sitemap:` line pointing to your XML sitemap. AI crawlers use sitemaps to discover content efficiently.
- Setting a high Crawl-delay for AI bots: Unlike many traditional crawlers, AI bots often respect `Crawl-delay` directives. Setting a high delay reduces how much of your site gets indexed.
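Two of these mistakes (a wildcard group that disallows everything, and a missing `Sitemap:` line) are easy to catch with a simple lint pass. A sketch; the function name and checks are our own, not a standard tool:

```python
def lint_robots(text: str) -> list[str]:
    """Flag two common robots.txt mistakes described above."""
    warnings = []
    lines = [l.strip() for l in text.splitlines()
             if l.strip() and not l.strip().startswith("#")]
    agents: list[str] = []   # user-agents of the current group
    in_rules = False         # True once this group has rule lines
    for line in lines:
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if in_rules:     # a new group starts after the previous rules
                agents, in_rules = [], False
            agents.append(value)
        elif key in ("allow", "disallow", "crawl-delay"):
            in_rules = True
            if key == "disallow" and value == "/" and "*" in agents:
                warnings.append("wildcard group disallows everything (blocks AI crawlers)")
    if not any(l.lower().startswith("sitemap:") for l in lines):
        warnings.append("no Sitemap: directive")
    return warnings

print(lint_robots("User-agent: *\nDisallow: /"))  # flags both problems
```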
How to Verify Your Configuration
- Run a scan on your site using Prominara's AI Visibility Audit to check crawler access status
- Check the Technical Readiness section of your scan results for AI Crawler Access scores
- Review the scan recommendations for any robots.txt-related warnings
- Use Google's robots.txt Tester in Search Console for syntax validation
- Verify that your llms.txt file is also accessible alongside robots.txt
Platform-Specific Considerations
Next.js and Vercel
In Next.js applications, create a public/robots.txt file or use the app/robots.ts route handler for dynamic generation. Vercel automatically serves files from the public/ directory at the domain root.
WordPress
WordPress generates a virtual robots.txt by default. To customize it, use a plugin like Yoast SEO or Rank Math, or create a physical robots.txt file in your WordPress root directory (this overrides the virtual file).
Shopify
Shopify auto-generates robots.txt and does not allow direct editing. Use the robots.txt.liquid template (available since 2021) to customize rules for AI crawlers.