robots.txt for AI
Configure robots.txt to control AI crawler access and maximize your visibility across ChatGPT, Perplexity, Claude, and Google AI Overviews.
robots.txt is a plain text file at your website root that controls which crawlers can access your content. With the rise of AI search engines, your robots.txt configuration directly impacts whether your brand appears in AI-generated responses from ChatGPT, Perplexity, Claude, and Google AI Overviews. A 2025 study by Originality.ai found that over 35% of the top 1,000 websites block at least one AI crawler, often unintentionally reducing their AI visibility.
What Is robots.txt and Why Does It Matter for AI?
robots.txt is a standard protocol (defined in RFC 9309) that tells web crawlers which parts of a website they are allowed to access. Traditionally used for search engines like Googlebot and Bingbot, it now plays a critical role in Generative Engine Optimization (GEO) because AI platforms use dedicated crawlers to index content for their models. If your robots.txt blocks these AI crawlers, your pages will not be included in AI training data or real-time retrieval, meaning AI assistants cannot cite your content.
A common misconfiguration is a catch-all `Disallow: /` rule for unknown user agents, which silently blocks all AI crawlers. Use Prominara's AI Visibility Audit to verify your configuration.
AI Crawler User Agents in 2026
Each AI platform operates its own web crawler with a unique user agent string. Here are the active AI crawlers you need to configure:
| User Agent | Platform | Purpose | First Seen |
|---|---|---|---|
| GPTBot | OpenAI (ChatGPT) | Indexes content for ChatGPT responses and model training | 2023 |
| ChatGPT-User | OpenAI (ChatGPT) | Real-time web browsing during ChatGPT conversations | 2023 |
| ClaudeBot | Anthropic (Claude) | Indexes content for Claude AI responses | 2024 |
| PerplexityBot | Perplexity AI | Real-time search and citation for Perplexity answers | 2024 |
| Google-Extended | Google (Gemini) | Indexes content for Gemini and AI Overviews | 2023 |
| Googlebot | Google Search | Traditional search indexing plus AI Overview content | 2004 |
| Bingbot | Microsoft (Copilot) | Indexes content for Bing and Microsoft Copilot | 2012 |
| meta-externalagent | Meta AI | Indexes content for Meta AI assistant | 2024 |
The distinction between GPTBot and ChatGPT-User is important: GPTBot crawls content for training and pre-indexing, while ChatGPT-User fetches pages in real-time when a user asks ChatGPT to browse a specific URL.
Recommended robots.txt Configuration
For maximum AI visibility, we recommend explicitly allowing all AI crawlers. An explicit Allow: / is stronger than simply not mentioning the bot, because some CDNs and WAFs (like Cloudflare) may block unknown user agents by default.
```
# === AI Crawlers ===
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: meta-externalagent
Allow: /

# === Traditional Search Engines ===
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# === Default Rule ===
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/

# === Discovery Files ===
Sitemap: https://yoursite.com/sitemap.xml
```

How Should I Handle Selective Access?
Not all content should be exposed to AI crawlers. Authentication pages, admin panels, API endpoints, and user-generated private content should remain blocked. Here is a selective access configuration:
```
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Allow: /pricing/
Allow: /about/
Allow: /features/
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /settings/
Disallow: /portal/
```

What Should You Allow vs. Block?
| Action | Path Examples | Reason |
|---|---|---|
| Allow | /blog/, /docs/, /pricing/, /about/ | Public content you want AI to cite |
| Allow | /glossary/, /guides/, /case-studies/ | Educational content that builds authority |
| Block | /api/, /admin/, /dashboard/ | Internal application routes |
| Block | /portal/, /settings/, /account/ | Authenticated user areas |
| Block | /checkout/, /payment/ | Sensitive transaction pages |
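Before deploying a selective policy like this, you can sanity-check it locally with Python's standard-library `urllib.robotparser`. A sketch, with made-up rules and paths:

```python
from urllib.robotparser import RobotFileParser

# A selective-access policy in the style shown above, as raw lines.
rules = """\
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /api/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "/blog/geo-guide"))  # True
print(rp.can_fetch("GPTBot", "/api/users"))       # False
```

Keep in mind that Python's parser may resolve edge cases (such as overlapping Allow/Disallow prefixes) differently from Google's longest-match rule, so treat it as a quick local check, not an authoritative simulation.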
How Does robots.txt Affect AI Visibility Scores?
Prominara's AI visibility scoring algorithm includes Technical Readiness as one of four scoring categories (weighted at 20% of the overall score). Within Technical Readiness, AI Crawler Access is the highest-weighted factor at 25%. Blocking any of the 4 major AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Googlebot) deducts 15 points each from the Technical Readiness score. A blanket noindex meta tag costs 50 points.
In practice, a site blocking all 4 AI crawlers loses approximately 12 points from its overall AI visibility score (60 points × 0.20 weight = 12 points).
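Using the weights described above, that arithmetic can be spelled out as:

```python
# Weights as described above: each blocked AI crawler deducts 15 points from
# Technical Readiness, which contributes 20% of the overall score.
blocked_crawlers = 4
per_crawler_penalty = 15
technical_readiness_weight = 0.20

tech_deduction = blocked_crawlers * per_crawler_penalty      # 60 points
overall_loss = tech_deduction * technical_readiness_weight
print(overall_loss)  # 12.0
```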
Common Mistakes to Avoid
- Wildcard blocking unknown bots: A rule like `User-agent: *` followed by `Disallow: /` blocks every crawler, including AI bots. Instead, use a permissive wildcard rule and block only specific bad bots.
- Forgetting CDN-level bot blocking: Cloudflare, Akamai, and other CDNs have their own bot management that may block AI crawlers regardless of your robots.txt. Check your CDN settings separately.
- Using noindex instead of robots.txt: A `noindex` meta tag tells crawlers not to index the page, but the crawler still visits the page. robots.txt prevents the visit entirely, saving your server resources.
- Not including a Sitemap directive: Always include a `Sitemap:` line pointing to your XML sitemap. AI crawlers use sitemaps to discover content efficiently.
- Setting a high Crawl-delay for AI bots: Unlike many traditional crawlers, AI bots often respect `Crawl-delay` directives. Setting a high delay reduces how much of your site gets indexed.
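Two of these mistakes (a wildcard group that disallows everything, and a missing `Sitemap:` line) are easy to catch with a simple lint pass. A sketch; the function name and checks are our own, not a standard tool:

```python
def lint_robots(text: str) -> list[str]:
    """Flag two common robots.txt mistakes described above."""
    warnings = []
    lines = [l.strip() for l in text.splitlines()
             if l.strip() and not l.strip().startswith("#")]
    agents: list[str] = []   # user-agents of the current group
    in_rules = False         # True once this group has rule lines
    for line in lines:
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if in_rules:     # a new group starts after the previous rules
                agents, in_rules = [], False
            agents.append(value)
        elif key in ("allow", "disallow", "crawl-delay"):
            in_rules = True
            if key == "disallow" and value == "/" and "*" in agents:
                warnings.append("wildcard group disallows everything (blocks AI crawlers)")
    if not any(l.lower().startswith("sitemap:") for l in lines):
        warnings.append("no Sitemap: directive")
    return warnings

print(lint_robots("User-agent: *\nDisallow: /"))  # flags both problems
```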
How to Verify Your Configuration
- Run a scan on your site using Prominara's AI Visibility Audit to check crawler access status
- Check the Technical Readiness section of your scan results for AI Crawler Access scores
- Review the scan recommendations for any robots.txt-related warnings
- Use Google's robots.txt Tester in Search Console for syntax validation
- Verify that your llms.txt file is also accessible alongside robots.txt
Platform-Specific Considerations
Next.js and Vercel
In Next.js applications, create a public/robots.txt file or use the app/robots.ts route handler for dynamic generation. Vercel automatically serves files from the public/ directory at the domain root.
WordPress
WordPress generates a virtual robots.txt by default. To customize it, use a plugin like Yoast SEO or Rank Math, or create a physical robots.txt file in your WordPress root directory (this overrides the virtual file).
Shopify
Shopify auto-generates robots.txt and does not allow direct editing. Use the robots.txt.liquid template (available since 2021) to customize rules for AI crawlers.