Here is an uncomfortable scenario we keep seeing: a business ranks on page one of Google, gets a healthy slice of classic search traffic — and is completely absent from ChatGPT, Claude, and Perplexity answers about exactly the topics it owns. Not because its content is weak. Because a single line in robots.txt tells the AI crawlers to go away.
This is the part of "AI search" nobody put in the onboarding docs. Answer engines don't read your site through Googlebot. They send their own named crawlers — and a surprising number of sites quietly block them, usually because a host, a privacy plugin, or a well-meaning "block the AI scrapers" blog post added the rule months ago.
Today we're shipping three new monitor checks that catch the failures that hide behind a healthy-looking 200 OK. The site monitor now runs 54 checks on every scan (42 SEO + 12 AEO). Here's what's new and why each one quietly costs you traffic.
1. Are you blocking the AI answer engines? (new AEO check)
AI answer engines fetch the web with named user-agents. If your robots.txt disallows them, you opt out of being read, summarized, and cited in their answers — even while Googlebot sails through. The ones that matter most right now:
- GPTBot: OpenAI's training crawler.
- OAI-SearchBot and ChatGPT-User: OpenAI's search/live-fetch agents (these are the ones that decide whether ChatGPT can cite you right now).
- ClaudeBot: Anthropic's crawler for Claude.
- PerplexityBot: Perplexity's crawler.
- Google-Extended: a robots.txt control token rather than a separate crawler. It controls whether content Google has crawled may be used to train and ground Gemini models. It does not affect Google Search or AI Overviews, which follow your Googlebot rules.
The new aeo_crawler_access check is deliberately narrow: it only flags a crawler when your site allows Googlebot for a URL but disallows the AI bot. In other words, it isolates the specific own-goal — "you welcome Google but lock out the answer engines" — instead of nagging you about a site-wide block (that's a different, louder problem our robots_txt rule already reports).
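The rule described above can be sketched in a few lines of Python. This is a minimal illustration built on the standard library's urllib.robotparser, not the monitor's actual implementation; the function name blocked_ai_agents and the AI_AGENTS list are assumptions made for this example. Keep in mind that urllib.robotparser applies the first matching rule in a group rather than the longest match, so its verdict can differ from Google's own parser on complicated files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical list for illustration, taken from the agents named in this post.
AI_AGENTS = [
    "GPTBot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
]


def blocked_ai_agents(robots_txt: str, url: str) -> list[str]:
    """Return AI agents that are disallowed for a URL Googlebot may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # A URL that Googlebot cannot fetch either is a site-wide block,
    # which is a different problem, so nothing is flagged here.
    if not parser.can_fetch("Googlebot", url):
        return []

    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]
```

Calling blocked_ai_agents with a robots.txt that allows everything under "User-agent: *" but disallows "/" for GPTBot would return only GPTBot, while a file that disallows "/" for every agent would return an empty list.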
If you find you're blocking them and didn't mean to, the fix is a few lines:
# Allow AI answer engines in robots.txt
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
One nuance worth deciding on purpose: training crawlers and tokens (GPTBot, Common Crawl's CCBot, Google-Extended) feed model knowledge, while live-fetch crawlers (OAI-SearchBot, PerplexityBot, ChatGPT-User) fetch in real time to cite you. Blocking the live-fetch bots removes you from today's answers, not just from tomorrow's training runs. Plenty of people want to opt out of training but stay citable. That's a legitimate choice; the point is to make it deliberately, not by accident.
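For anyone who does want to opt out of training while staying citable, one way to make that choice explicit is to generate the robots.txt groups from two lists and sanity-check the result before deploying. The sketch below is illustrative only: render_robots is a hypothetical helper, and the two lists simply mirror the grouping in the paragraph above.

```python
from urllib.robotparser import RobotFileParser

# Grouping taken from the paragraph above; adjust to your own policy.
TRAINING_AGENTS = ["GPTBot", "CCBot", "Google-Extended"]
LIVE_FETCH_AGENTS = ["OAI-SearchBot", "PerplexityBot", "ChatGPT-User"]


def render_robots(blocked: list[str], allowed: list[str]) -> str:
    """Build one robots.txt group per agent: Disallow for blocked, Allow for allowed."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allowed]
    return "\n\n".join(groups) + "\n"


robots_txt = render_robots(blocked=TRAINING_AGENTS, allowed=LIVE_FETCH_AGENTS)

# Parse the generated file so the policy can be checked before it ships.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
```

With this policy, parser.can_fetch returns False for the training agents and True for the live-fetch agents. Agents that are not listed at all, such as Googlebot, stay allowed because no group applies to them.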
2. Soft 404s: pages that say "200 OK" but are really dead
A soft 404 is the sneakiest entry in the silent-killer catalog. The page returns 200 OK — so every "check for broken links" tool says it's fine — but it's actually gone: a stale URL that quietly serves your homepage, a single-page app that falls back to its shell for any unknown path, or a "page not found" screen rendered with a 200 status by mistake.
Google spends crawl budget re-fetching these dead URLs that insist they're alive, and over time it can drop from its index the pages those URLs were supposed to serve. Because the status code looks healthy, an audit that only checks for broken links or error codes won't catch it.
Our new soft_404 check reuses the exact classifier we built for our indexing pre-flight probe, so "soft 404" means the same thing across the whole product. It flags a 200-OK page when its canonical disowns the URL, when it redirects to your site root, when it renders your homepage's title with no canonical of its own, or when a short page's title/heading literally reads "page not found." The fix is almost always the same: if the page is gone, return a real 404/410 and drop it from your sitemap; if it should exist, restore real content with a self-referencing canonical.
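To make the four signals concrete, here is a rough sketch of how such a classifier could look. It is not the product's actual classifier: the PageSnapshot fields, the reason strings, and the 150-word threshold for a "short page" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class PageSnapshot:
    """Hypothetical record of one fetched page (field names are assumptions)."""
    requested_url: str
    final_url: str            # URL after following redirects
    status: int
    title: str
    heading: str              # text of the main heading
    canonical: Optional[str]  # href of rel=canonical, if any
    word_count: int


def _norm(url: str) -> tuple[str, str]:
    """Compare URLs by host and path only, ignoring a trailing slash."""
    parts = urlsplit(url)
    return parts.netloc.lower(), parts.path.rstrip("/") or "/"


def soft_404_reason(page: PageSnapshot, homepage_title: str,
                    short_page_words: int = 150) -> Optional[str]:
    """Return a reason string if a 200 page looks like a soft 404, else None."""
    if page.status != 200:
        return None  # real error codes are handled by other checks

    requested, final = _norm(page.requested_url), _norm(page.final_url)
    is_homepage = requested[1] == "/"

    if not is_homepage and final[1] == "/":
        return "redirected_to_root"
    if page.canonical and _norm(page.canonical) != final:
        return "canonical_disowns_url"
    if (not page.canonical and not is_homepage
            and page.title.strip() == homepage_title.strip()):
        return "homepage_title_without_canonical"

    text = f"{page.title} {page.heading}".lower()
    if page.word_count < short_page_words and "page not found" in text:
        return "not_found_text"
    return None
```

The checks run in a fixed order and return the first match, so a page that trips several signals reports only one reason. A real classifier would likely need fuzzier title matching and more "not found" phrasings than this sketch handles.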
3. Duplicate titles and meta descriptions
The third addition, duplicate_metadata, is unglamorous and extremely common. When two or more pages share an identical <title> or meta description, usually because of an un-customized CMS template, Google may rewrite or suppress your snippet, and your own pages can cannibalize each other for the same query. Our existing duplicate-HTML check only caught pages that were byte-for-byte identical; two genuinely different pages that merely reuse the same title slipped right past it. Now they don't.
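Detecting this kind of duplication amounts to grouping pages by a metadata field and reporting any group with more than one URL. The sketch below shows one way to do that; find_duplicates and the shape of the pages mapping are hypothetical, and folding case and whitespace before comparing is an assumption that goes slightly beyond strictly "identical" values.

```python
from collections import defaultdict


def find_duplicates(pages: dict[str, dict[str, str]], field: str) -> dict[str, list[str]]:
    """Group URLs by a metadata field and keep only values shared by 2+ pages.

    `pages` maps a URL to its metadata, e.g. {"title": ..., "description": ...}.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for url, meta in pages.items():
        # Assumption: collapse whitespace and ignore case before comparing.
        value = " ".join(meta.get(field, "").split()).casefold()
        if value:  # pages missing the field are not duplicates of each other
            groups[value].append(url)
    return {value: sorted(urls) for value, urls in groups.items() if len(urls) > 1}
```

Running it once with field="title" and once with field="description" covers both halves of the check. Pages that lack the field entirely are skipped here, since a missing title is a separate problem from a duplicated one.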
Why these three, together
They're the same kind of bug wearing different clothes: a page that looks healthy on the surface — 200 OK, renders fine in a browser — while silently bleeding SEO or AEO value. They're exactly the "silent traffic killers" our free Site Health Check was built to surface, and now the continuous monitor catches all three on every scan, with email alerts when a new one appears.
Check your own site in 30 seconds
You don't need an account to see where you stand. Our free AI Visibility Checker runs the answer-engine tier — including the new AI-crawler-access check — on any URL, and the free Site Health Check runs the full 54-check scan. No signup, results saved to a permanent URL. If you're blocking ChatGPT without knowing it, better to find out from us than to never find out at all.