We Audited 8,000 Sites. Here's Why AI Isn't Citing You.
Over the past year, we ran AI-driven analyses on 8,000+ sites to see how machines actually read the web. When you strip away buzzwords and look at the markup, the same issues keep showing up—simple, structural things that make models decide, “I can’t trust this.”
This isn’t about the latest “AI SEO” hack. It’s about fundamentals that should’ve been solved years ago. Today, those basics are the ante for even being considered in AI-generated answers.
Below are the four patterns we kept seeing—what they are, why they matter, and how to fix them without rewriting your entire site.
1) The Semantic Void: 70%+ Hide Their Main Content
<main> is the clearest HTML signal for “this is the important stuff.” It lets crawlers separate your actual content from navigation, footer, and sidebar noise. Most sites still treat it like an optional extra.
The data (from our crawl):
- In 70%+ of pages, 0% of the primary body copy lived inside a \<main> tag; it was just floating in generic \<div>s.
- Nearly 65% of homepages used multiple \<h1> tags, flattening hierarchy and removing a single topical anchor.
Why this tanks trust:
Think of an AI crawler like a researcher. \<main> is the book’s core text. Without it, your page looks like a stack of footnotes—no focal point, no clear narrative, no way to separate argument from boilerplate. Models don’t waste time inferring hierarchy you didn’t declare. They move on.
What we looked for:
- Presence of a single \<main> with meaningful, non-nav content.
- One \<h1> that reflects the primary topic, with \<h2>–\<h4> used logically underneath.
Quick fix checklist:
- Wrap your core article, product description, or service copy in \<main>.
- Enforce one \<h1> per page.
- Use nested headings for scannability and topical clarity.
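To spot-check a page against this checklist, here is a minimal sketch using only Python's standard library. It is not the tooling behind our crawl; the names `StructureAudit` and `audit_structure` are made up for illustration, and the text-share number is a rough character count rather than a real content-extraction metric.

```python
from html.parser import HTMLParser


class StructureAudit(HTMLParser):
    """Counts <main> and <h1> tags and tracks how much text sits inside <main>."""

    def __init__(self):
        super().__init__()
        self.main_count = 0
        self.h1_count = 0
        self.chars_in_main = 0
        self.chars_total = 0
        self._main_depth = 0  # > 0 while the parser is inside a <main> element

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_count += 1
            self._main_depth += 1
        elif tag == "h1":
            self.h1_count += 1

    def handle_endtag(self, tag):
        if tag == "main" and self._main_depth > 0:
            self._main_depth -= 1

    def handle_data(self, data):
        # Rough measure: counts every text node, including inline script/style text.
        n = len(data.strip())
        self.chars_total += n
        if self._main_depth > 0:
            self.chars_in_main += n


def audit_structure(html: str) -> dict:
    """Return simple pass/fail flags for the <main> and <h1> checks."""
    parser = StructureAudit()
    parser.feed(html)
    parser.close()
    share = parser.chars_in_main / parser.chars_total if parser.chars_total else 0.0
    return {
        "has_single_main": parser.main_count == 1,
        "has_single_h1": parser.h1_count == 1,
        "share_of_text_in_main": share,
    }
```

Run it against the raw HTML of a few key templates (home, article, product) rather than every URL; structural problems usually live in the template, not the individual page.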
2) The Identity Crisis: 80% Send Conflicting Signals
Before a model cites you, it needs to confirm who you are. That means clean canonicalization and machine-readable links to your broader entity footprint.
The data:
- ~80% of domains had a canonical conflict (both www and non-www live; missing or mispointed rel="canonical").
- <15% used sameAs links in Organization schema to connect the site to authoritative profiles (Wikipedia, LinkedIn, official social accounts, etc.).
Why this blocks citation:
Canonical conflicts are like having two passports with slightly different names—both get downgraded. Missing sameAs creates an entity vacuum: there’s no reliable, machine-readable bridge between your site and the brand the model sees in news, profiles, and social. To the model, you’re a stranger—not a safe source to quote.
What we looked for:
- A single, consistently enforced canonical (domain + protocol) across the site.
- Organization schema with sameAs URLs that map to high-trust profiles you actually control.
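A canonical check of this kind can be approximated in a few lines. The sketch below is an assumption about how one might do it, not our audit code: `CanonicalFinder` and `canonical_issues` are hypothetical names, `preferred_origin` is whatever scheme and host you have chosen, and relative hrefs are deliberately treated as mismatches because they carry no scheme or host to compare.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_issues(html: str, preferred_origin: str) -> list[str]:
    """List canonical problems for one page; an empty list means none were found."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()

    issues = []
    if not finder.canonicals:
        issues.append("missing rel=canonical")
    elif len(finder.canonicals) > 1:
        issues.append("multiple rel=canonical tags")

    want = urlsplit(preferred_origin)
    for href in finder.canonicals:
        got = urlsplit(href)
        # Compare scheme and host only; the path is allowed to differ per page.
        if (got.scheme, got.netloc.lower()) != (want.scheme, want.netloc.lower()):
            issues.append(f"canonical points off the preferred origin: {href}")
    return issues
```

For example, `canonical_issues(page_html, "https://www.example.com")` would flag a page whose canonical points at the non-www host. This only inspects markup; it does not confirm that the other host variant actually redirects.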
Quick fix checklist:
- Pick your canonical (https://www. vs https://) and enforce it at DNS, server, and CMS levels.
- Audit rel="canonical" for every template.
- Add Organization schema with sameAs to Wikipedia (if applicable), LinkedIn, Crunchbase, YouTube, and official social handles.
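For the last item, one possible shape of the markup is sketched below. The helper name `organization_jsonld` and every value passed to it are placeholders; swap in your real name, canonical URL, and only the profiles you actually control.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Build a JSON-LD <script> tag describing the organization behind the site."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,          # should match the canonical origin you picked
        "sameAs": same_as,   # profile URLs for the same entity
    }
    # Escape "</" so a stray "</script>" inside a value cannot close the tag early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(organization_jsonld(
        "Example Co",
        "https://www.example.com/",
        [
            "https://www.linkedin.com/company/example-co",
            "https://www.youtube.com/@example-co",
        ],
    ))
```

Emit the tag once, in a sitewide template such as the homepage or the shared head, so every crawler sees the same entity description.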
3) The Ghostwriter Problem: 90% Publish “Anonymous” Content
In a world drowning in AI-written text, provenance is currency. Models weight content from named, verifiable experts more heavily—and they need to read that attribution in structured data, not just see a byline in pixels.
The data:
- 90% of articles and key service pages lacked explicit author or publication date metadata in structured data.
- Even when a byline was visible, Person and Article schema were usually missing.
Why this loses to your competitor’s post:
Without an author and date the model can parse, yesterday’s piece by a ten-year veteran looks identical to a 2017 article by an intern. Faced with a choice, models cite the source tied to a real person and a recent, verifiable publish date. It’s a risk-reduction move, not a style preference.
What we looked for:
- Article (or relevant) schema with author, datePublished, dateModified.
- Person schema for each author with resolvable profiles (e.g., sameAs to LinkedIn).
Quick fix checklist:
- Add Article schema with author, datePublished, dateModified.
- Create Person schema for each author (and link to their authoritative profiles).
- Keep dates current when you substantially update content.
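Here is a rough sketch of what that structured data could look like when generated from your CMS fields. `article_jsonld` is an illustrative helper, not a library function, and the author details are invented; the date check is our own assumption about a sensible guard, since a modified date earlier than the publish date is almost always a data-entry mistake.

```python
import json
from datetime import date


def article_jsonld(
    headline: str,
    author_name: str,
    author_profiles: list[str],
    published: date,
    modified: date,
) -> str:
    """Build Article JSON-LD with a nested Person author and ISO 8601 dates."""
    if modified < published:
        raise ValueError("dateModified cannot be earlier than datePublished")
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
        "author": {
            "@type": "Person",
            "name": author_name,
            "sameAs": author_profiles,  # e.g. the author's LinkedIn profile
        },
    }
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    # Hypothetical article and author, for illustration only.
    print(article_jsonld(
        "How We Cut Onboarding Time in Half",
        "Jane Doe",
        ["https://www.linkedin.com/in/jane-doe-example"],
        published=date(2024, 1, 5),
        modified=date(2024, 3, 1),
    ))
```

Place the output inside a script tag of type application/ld+json on the article page, and only bump the modified date when the content has genuinely changed.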
4) The Echo Chamber: 95% Are Absent Where It Counts
Many brands have the right answers—buried on their own blogs. Meanwhile, the public conversations that shape training data happen elsewhere.
The data:
- 95% of brands with genuinely citable expertise had near-zero presence on relevant Reddit, Quora, or Stack Overflow threads.
- Those same threads were often dominated by amateurs or competitors.
Why this erases you from AI outputs:
Those communities contribute heavily to what models learn as “trusted solutions.” Accepted answers and debates there become signals of authority. If your voice isn’t in that corpus, you’re less likely to appear in model outputs—no matter how good your on-site content is.
What we looked for:
- Participation in topical subreddits, category-relevant Quora spaces, or applicable Stack Overflow tags.
- Consistent, non-promotional answers that cite primary sources (including your own content where truly relevant).
Quick fix checklist:
- Identify 3–5 communities where your audience actually asks questions.
- Show up weekly, answer directly, and link sparingly (only when the link is the answer).
- Turn your best community answers into FAQs and support docs—then add schema.
The Universal Playbook (Baseline, Not “Tips”)
This isn’t a bag of tricks—it’s table stakes if you want to be cited by models.
1) Structure everything.
- Put core content in \<main>.
- Use one \<h1> and logical nested headings.
- Add FAQ schema anywhere you’re answering questions.
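As a sketch of the FAQ item, the snippet below turns question and answer pairs into schema.org FAQPage markup. The function name `faq_jsonld` and the sample question are assumptions for illustration; the answers should mirror text that is visibly on the page.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    # Placeholder question and answer.
    print(faq_jsonld([
        ("Do you offer a free trial?", "Yes. Every plan includes a 14-day trial."),
    ]))
```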
2) Make your HTML answer the question.
Lead with the answer in your meta description; skip the teaser copy.
- Bad: “Learn about our innovative AI-powered sales tools…”
- Good: “Our AI sales tools automate prospecting and lead qualification. This guide compares the top 10 platforms for 2025.”
3) Go static for what matters.
AI bots don’t reliably execute JavaScript. If key content is client-rendered, assume it’s invisible. Server-side render anything models must see. No exceptions.
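A quick way to test this is to look at the HTML your server returns before any script runs and check whether the sentences you care about are in it. The sketch below assumes you already have that raw response as a string; `VisibleText` and `missing_from_raw_html` are hypothetical names, and matching is plain case-insensitive substring search.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script> and <style>, roughly what a non-JS crawler reads."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def missing_from_raw_html(html: str, must_have: list[str]) -> list[str]:
    """Return the phrases that do NOT appear in the server-delivered text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so line breaks in the markup do not break matching.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [phrase for phrase in must_have if phrase.lower() not in text]
```

If a pricing line, product description, or key answer comes back in the missing list, that content only exists after client-side rendering and is a candidate for server-side rendering.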
4) Get off your own website.
Join the conversations your audience already trusts—Reddit, Quora, Stack Overflow—and answer questions with no strings attached. Become the reference everyone (including models) learns from.