Common Crawl: does AI have anything to remember you by?

To start:

When someone asks an LLM about you, the model has two sources of knowledge. The first: memory from training – what it absorbed before it ever reached a user. The second: fresh sources it reaches for at the moment of the query (RAG/web search). This article is about the first case, and about one of the largest foundations of LLMs’ training knowledge: Common Crawl.

What is Common Crawl?

Common Crawl is a non-profit organization founded in 2007 by Gil Elbaz, which since 2008 has done one thing: regularly crawl the internet and make the collected data available for free.

Every month CCBot visits billions of pages, downloads their content, and stores it in WARC files (Web ARChive – a page-archiving format that holds the raw HTML together with HTTP headers and metadata) on Amazon S3 servers. As of today that’s hundreds of billions of pages and over 10 petabytes of data, publicly available, with no licensing fees.

How much is 10 PB of data in Common Crawl? Roughly three times more than Netflix’s entire catalog of films and series, and over twenty times more than the ~100 million tracks available on Spotify (at standard streaming quality).

That’s a comparison of volume, not information density (text is incomparably denser than video or audio), but it gives you the scale: a lot.

Why does this matter? Because this data is one of the most important public sources on which many large language models were trained – especially in their filtered versions.

Image: Common Crawl logo (source: https://commoncrawl.org/, public domain, via Wikimedia Commons)

A report by Stefan Baack of the Mozilla Foundation (2024), presented at the ACM FAccT conference, showed that 64% of the 47 models analyzed used at least one filtered version of Common Crawl.

In the case of GPT-3, over 80% of the tokens in the training dataset came from filtered Common Crawl (sampled at a weight of about 60% during training). Meta’s LLaMA, as well as the C4 (Google Research), Dolma, and RedPajama datasets, all build on Common Crawl.

You can have a perfect entity in Wikidata, exemplary structured data, a Knowledge Panel in Google – and at the same time be completely invisible to language models in their parametric layer. Because that layer depends on what was in the training data. And the training data largely comes from Common Crawl.

Wikidata, which I’ve written about before, tells the systems “who you are.” Common Crawl gives them the material from which they learn how to talk about you.

Don’t confuse presence in Common Crawl with presence in an LLM

Common Crawl is a data source. A training dataset is a filtered version of that source. The model is yet another stage. Presence in CC increases the chance that your content makes it into training, but guarantees nothing. Between your page and the model’s answer there are several filtering layers, each of which can reject you. It’s also worth knowing that some companies – OpenAI, Anthropic, Google – also use their own crawlers, licensed sources, or other data-acquisition mechanisms independent of Common Crawl.

How to check whether your domain is in Common Crawl?

Common Crawl provides an index of every crawl (the so-called cc-index) through a simple interface. You go to index.commoncrawl.org, pick a crawl from the list (the newest is at the top), and enter your domain in the URL field. You can search a whole domain (e.g. podrez.pl/*) or a specific URL. You’ll get the same result with this query:

https://index.commoncrawl.org/CC-MAIN-2026-17-index?url=yourdomain.com/*&output=json

The crawl numbers (here CC-MAIN-2026-17) change roughly every month, and you’ll find the full list at index.commoncrawl.org/collinfo.json.
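The same lookup can be scripted. Below is a minimal Python sketch, not an official client: the function names are hypothetical, the `id` key in collinfo.json and the "newest entry first" ordering are assumptions based on how the list is published, and treating an HTTP 404 as "no captures" reflects observed behavior of the index server rather than a documented guarantee.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_url(crawl_id: str, url_pattern: str) -> str:
    """Return the cc-index query URL for one crawl and one URL pattern."""
    # safe="/*" keeps the pattern readable, e.g. yourdomain.com/*
    query = urlencode({"url": url_pattern, "output": "json"}, safe="/*")
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def parse_index_response(body: str) -> list[dict]:
    """The index answers with one JSON object per line, not a JSON array."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def latest_crawl_id(collinfo: list[dict]) -> str:
    """Assumption: collinfo.json lists crawls newest first, each with an 'id'."""
    return collinfo[0]["id"]


def fetch_captures(crawl_id: str, url_pattern: str) -> list[dict]:
    """Network call: download and parse the captures for one crawl."""
    try:
        with urlopen(build_index_url(crawl_id, url_pattern), timeout=30) as resp:
            return parse_index_response(resp.read().decode("utf-8"))
    except HTTPError as err:
        if err.code == 404:  # assumption: 404 means nothing matched
            return []
        raise
```

Calling `fetch_captures("CC-MAIN-2026-17", "yourdomain.com/*")` would return a list of dictionaries, one per captured URL, or an empty list if the domain is not in that crawl.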

the index.commoncrawl.org search interface

If your domain is in the index, you’ll get a response in JSON format. The most important fields:

  • timestamp – exactly when CCBot downloaded your page (a 14-digit value in the format YYYYMMDDhhmmss: year, month, day, hour, minute, second).
  • status – the HTTP response code. 200 means the page was available and downloaded correctly.
  • languages – the language CC categorized your page under. This matters because multilingual models may treat content in different languages with different priority.
  • filename – the path to the WARC file where your page’s raw HTML was stored. Technical, but useful if you want to see exactly what CCBot “saw.”
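To make the fields above concrete, here is a small Python sketch that turns one index record into something easier to read. The helper name and the sample record are hypothetical. The sketch assumes that every value arrives as a string (including `status`), that `languages` is a comma-separated list of codes, and that the timestamp is in UTC.

```python
from datetime import datetime, timezone


def parse_capture(record: dict) -> dict:
    """Normalize the most useful fields of one cc-index JSON record."""
    captured_at = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
    languages = record.get("languages", "")
    return {
        "url": record.get("url"),
        # assumption: index timestamps are UTC
        "captured_at": captured_at.replace(tzinfo=timezone.utc),
        # assumption: status is delivered as a string such as "200"
        "ok": record.get("status") == "200",
        "languages": languages.split(",") if languages else [],
        "warc_file": record.get("filename"),
    }


# Hypothetical record, shaped like a line of the JSON response.
sample = {
    "url": "https://yourdomain.com/",
    "timestamp": "20260415083000",
    "status": "200",
    "languages": "pol,eng",
    "filename": "crawl-data/EXAMPLE/warc/example.warc.gz",
}
capture = parse_capture(sample)
print(capture["captured_at"].isoformat(), capture["ok"], capture["languages"])
```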

If the query returns no results, your domain isn’t in that crawl. Remember too that Common Crawl doesn’t archive entire sites – it downloads a selected subset of pages. The absence of a specific URL in a given crawl doesn’t mean the page is “invisible to AI.” It may mean CC only downloaded a slice of the domain in that pass.

💡 Worth knowing

Common Crawl isn’t only fuel for AI – it’s also one of the largest internet archives. As Mateusz Godzic (Head of Growth at Whitepress and one of the few specialists in Poland regularly working with Common Crawl data) noted during our private conversation at SEO Vibes on Tour in Wrocław, people often struggle with the limitations of the archive.org API (60 requests per minute, IP blocks once you exceed the limit, ever-tighter restrictions after the 2024 attacks) while there’s a vast resource of raw HTML available in Common Crawl.

The difference: the Wayback Machine shows you what a page looked like. Common Crawl gives you raw HTML to analyze.
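For readers who want to pull that raw HTML themselves, the following Python sketch shows one way to do it. It rests on several assumptions that are not stated in this article: that each index record also carries `offset` and `length` fields, that the WARC files are served from `https://data.commoncrawl.org/`, and that each record is an independently gzipped chunk that can be requested with an HTTP Range header. The function names are hypothetical.

```python
import gzip
from urllib.request import Request, urlopen

DATA_HOST = "https://data.commoncrawl.org/"  # assumption: public download host


def build_range_request(record: dict) -> Request:
    """Request only the bytes of one record inside a large WARC file."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range end is inclusive
    return Request(DATA_HOST + record["filename"],
                   headers={"Range": f"bytes={start}-{end}"})


def split_warc_record(raw_gzip: bytes) -> tuple[str, str, str]:
    """Split one gzipped record into WARC headers, HTTP headers and body."""
    text = gzip.decompress(raw_gzip).decode("utf-8", errors="replace")
    warc_headers, http_headers, body = text.split("\r\n\r\n", 2)
    return warc_headers, http_headers, body


def fetch_html(record: dict) -> str:
    """Network call: download one record and return the stored HTML."""
    with urlopen(build_range_request(record), timeout=60) as resp:
        return split_warc_record(resp.read())[2]
```

The split into three parts works for response records, where an HTTP header block sits between the WARC headers and the page body. Other record types would need different handling.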

It’s not WHETHER you’re in Common Crawl that counts, but WHAT Common Crawl pulls from your site [podrez.pl case]

The podrez.pl domain (the one you’re visiting right now) has been registered since 2008 – roughly since the start of my SEO career. At first it held a “placeholder,” then a redirect to my agency’s site, and around 2017 it began functioning as a one-pager, eventually becoming a full-fledged website.

When I started checking my presence in Common Crawl, I assumed there wasn’t much of one. I checked a few crawls at random – many had no results for podrez.pl at all. It looked as though the domain had only appeared in CC recently. But Mateusz Godzic, after reading a draft of this article, suggested searching the cc-index more broadly (thank you!), and it turned out the story is far more interesting than “zero in Common Crawl.”

Podrez.pl has been in Common Crawl since January 2017. Uninterrupted. Across dozens of crawls. Except that what CCBot pulled from it changed a great deal over the years.

Nine years in Common Crawl in a nutshell

  • 2017–2018: podrez.pl is a one-pager – the homepage and robots.txt. CCBot pulls 2 URLs. There’s nothing more to crawl.
  • 2019: CCBot catches the portfolio subpages – conference talks from SEO events, interviews, Fox Strategy agency case studies. The site grows, but it’s still the calling card of a specialist with a track record, not a content hub.
  • 2020–2021: something at the hosting level starts returning a 406 (Not Acceptable) response to CCBot. Not just podrez.pl – the same code was returned to ms-fox.pl, my culinary blog, on the same server. Robots.txt was open, CCBot wasn’t blocked. But the server rejected it. For a year and a half, both domains were effectively invisible to Common Crawl.
  • 2022: the first genuinely deeper crawl – ~36 URLs. But what does CCBot pull? The portfolio. Talks from semKRK, Festiwal SEO, workshops. Content that says “I was at a conference,” not “I’m an SEO expert.”
  • 2023: 58 URLs. The site had grown with video sections, podcasts, educational materials, books. CCBot went deep. But when I reviewed exactly what it was pulling, I saw a problem:

What CCBot pulled from podrez.pl in 2023 vs 2026

May 2023 (58 URLs): video – 19, podcasts – 7, educational materials – 7, books – 5, expert SEO/AI articles – 0

April 2026 (62 URLs): expert SEO/AI articles – 9, events – 7, services – 4, video/podcasts – 1

I classified the URLs manually, based on the addresses and the content of the downloaded pages, so I treat this as a qualitative analysis, not a full statistical study.
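A rule-based first pass can speed up this kind of manual review. The Python sketch below is an illustration only: the path prefixes and category names are hypothetical and would have to be adapted to a real site's URL structure, and anything that lands in "other" still needs a human look.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical mapping of URL path prefixes to content categories.
CATEGORY_PREFIXES = {
    "/video/": "video",
    "/podcast/": "podcasts",
    "/blog/": "expert articles",
    "/events/": "events",
}


def classify(url: str) -> str:
    """Assign a category based on the start of the URL path."""
    path = urlparse(url).path.lower()
    for prefix, category in CATEGORY_PREFIXES.items():
        if path.startswith(prefix):
            return category
    return "other"


def category_counts(urls: list[str]) -> Counter:
    """Count how many crawled URLs fall into each category."""
    return Counter(classify(url) for url in urls)


crawled = [
    "https://example.com/video/talk-1/",
    "https://example.com/video/talk-2/",
    "https://example.com/blog/brand-entity/",
    "https://example.com/contact/",
]
print(category_counts(crawled))
```

Running the same function on the URL lists from two different crawls gives a quick before-and-after comparison of what the bot actually collected.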

See the difference? In 2023, not a single article that would build the picture of an SEO strategist. A model trained on that data would see someone who appears a lot on podcasts and at conferences, but not an expert who writes about entities, E-E-A-T, or generative search.

  • Only from late 2025 did the crawl start to reflect my actual expertise. In the April 2026 crawl, CCBot pulled articles on brand entity, E-E-A-T, GEO vs SEO, personal branding in LLMs, brand mentions, and AI and SEO.

    For the first time in its history, Common Crawl had material from podrez.pl from which a model could build the picture of “an SEO strategist who also specializes in AI visibility.”
Fragment of the podrez.pl server logs – a record of CCBot/2.0 visits in 2026. The visible sequence of requests shows how the bot moves from robots.txt to deep content subpages, exploring the domain’s structure.

To see the scale of the problem, just compare podrez.pl with my other domains:

Other domains, other stories

  • Ms-fox.pl – my culinary blog – has been in Common Crawl since 2019, with a deep crawl from day one. From the very first pass, CCBot pulled dozens of recipes, cuisine categories, lifestyle articles.
    In 2022, CC had ~90 recipe URLs from ms-fox.pl. It got a 406 too in 2020–2021 on the same server – but once unblocked, CC went straight back to deep crawling.
  • Bornholm-online.pl – a guide to Bornholm – launched in February 2024. It appeared in CC two months later, and from summer 2025 CC pulled ~20 URLs per crawl: towns, attractions, trails, accommodation, cuisine.
  • Madera-online.pl – a guide to Madeira – launched in February 2026. It appeared in CC after just a month. In the April 2026 crawl, 31 URLs – almost the entire, very young site: towns, attractions, trails, cuisine, history.

One thing connects these domains: from the start, I built them as a single ecosystem. Bornholm-online.pl and madera-online.pl were linked from podrez.pl and ms-fox.pl – domains CC had been crawling for years. They shared structured data (schema.org connecting Person with each domain through sameAs) and a shared author entity.

I don’t know whether that sped up their appearance in CC, but I do know that CCBot discovers new URLs through links. If you link to a new domain from a page CC already knows, you’re giving it a path.
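As an illustration of the structured-data link between a person and several domains, here is a minimal Python sketch that generates schema.org JSON-LD for a Person with sameAs. It is not the markup actually used on the sites described here: the name and URLs are placeholders, and the function name is hypothetical.

```python
import json


def person_jsonld(name: str, home_url: str, other_sites: list[str]) -> str:
    """Build a schema.org Person whose sameAs points at related sites."""
    data = {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "url": home_url,
        "sameAs": other_sites,
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


# Placeholder values, not the real markup of any site mentioned above.
print(person_jsonld(
    "Jane Example",
    "https://example.com/",
    ["https://recipes.example.org/", "https://guide.example.net/"],
))
```

The output would typically be placed inside a `<script type="application/ld+json">` element on each of the linked sites, so that every domain declares the same author entity.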

For those who know the context

If you’ve read the book Marka osobista w czasach AI i generatywnego wyszukiwania (currently available in Polish) or seen my talk at I Love Marketing (The personal-brand entity – what Google and AI know about you) – here’s where a symbolic “lightbulb” comes on. Yes: this is one of the sources of the problem I describe when merging my culinary and SEO entity. Two domains, one person, and for years Common Crawl “knew” the culinary one better.

What does this mean for AI models?

Models trained on data from 2023 had podcasts, video, and conference talks from podrez.pl. From ms-fox.pl – dozens of keto recipes. A model asked about me could describe me as a cookbook author, often skipping over more than 15 years of work in SEO.

This isn’t a hallucination in the classic sense. The model tells the truth, but… an incomplete one – that’s all it had in the training data. And that picture stays frozen until the next round of training.

This is exactly what I “fought” with in the context of my own entity, and that work became research material for my book (available in Polish).

Two domains, one person, two different pictures in AI

ms-fox.pl – in Common Crawl since 2019, a deep crawl of recipes from the start. For models: cookbook author, keto blog.

podrez.pl – in Common Crawl since 2017, but with expert SEO/AI articles only since 2025/2026. For models trained earlier: a far smaller chance of presence as an SEO expert.

The takeaway for you? If your site right now is a calling card with three subpages, or a portfolio of conference recordings, that portrait will stay with you for years – even if CCBot crawls it.

A few unknowns

I’d like to write that I know exactly why CCBot pulled video and podcasts from podrez.pl for years instead of articles. I don’t – but in fairness, it has to be said that there were few expert articles on the site; I only started writing more of them from 2024. Before that, the site offered CCBot what it had: a conference portfolio, recordings, and educational materials.

What I do know for certain is that:

  • In 2020–2021 the hosting returned a 406 response to CCBot – on both podrez.pl and ms-fox.pl. I don’t know the reason, and it’s hard to establish now. Robots.txt wasn’t blocking CCBot, but the server rejected it.

    This confirms that an open robots.txt alone doesn’t guarantee a page is accessible to crawlers.
  • The newer domains in my ecosystem (bornholm-online.pl, madera-online.pl) made it into CC within 1–2 months of launch – with a deep crawl right away.

    CC doesn’t need years to find you if you have content and linking.
  • The only meaningful change I can point to on my side is the appearance of expert articles about SEO and AI. CC had been crawling podrez.pl deeply for years, but only now did it have something real to pull.
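Spotting a period like the 406 episode gets easier when the status codes are tallied per crawl. The Python sketch below assumes you have already collected index records for several crawls into a dictionary keyed by crawl ID. The function names and the sample data are hypothetical, and the status values are assumed to be strings.

```python
from collections import Counter


def status_by_crawl(captures: dict[str, list[dict]]) -> dict[str, Counter]:
    """Count HTTP status codes for every crawl."""
    return {
        crawl_id: Counter(record.get("status", "unknown") for record in records)
        for crawl_id, records in captures.items()
    }


def blocked_crawls(summary: dict[str, Counter]) -> list[str]:
    """Crawls where CCBot got answers, but none of them was a 200."""
    return sorted(
        crawl_id for crawl_id, counts in summary.items()
        if counts and counts.get("200", 0) == 0
    )


# Hypothetical data, shaped like cc-index records grouped by crawl.
captures = {
    "CC-MAIN-2019-51": [{"status": "200"}, {"status": "200"}],
    "CC-MAIN-2020-29": [{"status": "406"}, {"status": "406"}],
    "CC-MAIN-2022-05": [{"status": "200"}, {"status": "301"}],
}
summary = status_by_crawl(captures)
print(blocked_crawls(summary))  # → ['CC-MAIN-2020-29']
```

A crawl that appears in the "blocked" list is worth checking against server or firewall logs from the same period.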

Common Crawl publishes web graphs and rankings based, among other things, on Harmonic Centrality, and its technical materials mention that this kind of data can be used to steer subsequent crawls. Does the same mechanism affect crawl depth? I’m not sure. But I’m observing something else: once articles forming a cluster around a single topic – personal brand × SEO × AI – appeared on the site, the crawl started to reflect that specialization.

That said, I’m not claiming that CCBot measures a page’s topical depth. But I do know that what you publish shapes what CCBot pulls from your site.

Correlation? Strong. Causation? Impossible to prove without access to Common Crawl’s internal mechanisms.

But as a practitioner, I don’t treat this case as proof – only as a signal:

Being in Common Crawl is only the beginning. What counts is what CCBot pulls from your site – that’s the material from which training datasets can be built.

Common Crawl and the memory of LLMs

As a reminder, language models have two sources of knowledge about you:

  • Parametric memory – everything the model “absorbed” during training. If your content wasn’t in Common Crawl at the moment the model was trained, the model has a far smaller chance of knowing you. Even if Google sees you. Even if you have a Knowledge Panel. Even if your Wikidata is filled in flawlessly.
  • RAG (Retrieval-Augmented Generation) – systems like Perplexity, ChatGPT with web search, or Google AI Overviews reach for current sources at the moment they generate an answer. Here what counts is the page’s indexability and visibility, not its presence in the training data.

And these two channels work independently.

⚠️ This matters: Common Crawl ≠ training data

Being in Common Crawl is only half the road. Model creators don’t use raw CC data directly – first they filter and clean it into so-called derived datasets (e.g. FineWeb from Hugging Face, Dolma from AI2, C4 from Google Research). These filters reject low-quality pages: full of ads, with a low content-to-boilerplate ratio, with large numbers of duplicates.

The scale of filtering can be brutal: in the case of Hugging Face’s FineWeb-Edu, an AI-based classifier rejected about 92% of FineWeb (itself an already filtered version of Common Crawl) as not valuable enough – out of 15 trillion tokens, 1.3 trillion remained.
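The arithmetic behind that figure is easy to check. The short Python sketch below uses only the two rounded token counts quoted above. Because both inputs are rounded, the result comes out at roughly 91%, which is consistent with the reported figure of about 92%.

```python
# Rounded token counts quoted in the text above.
TOKENS_BEFORE = 15e12   # 15 trillion
TOKENS_AFTER = 1.3e12   # 1.3 trillion

kept = TOKENS_AFTER / TOKENS_BEFORE
removed = 1 - kept

print(f"kept {kept:.1%}, removed {removed:.1%}")  # → kept 8.7%, removed 91.3%
```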

If your page passes through CCBot but doesn’t pass the quality filters, it still won’t make it into the training data. The volume of content opens the door to CC, but quality decides whether that content is included in the model’s training.

Figure: From your site to the AI model’s answer. A pipeline in four stages: your site (HTML, content, code) → Common Crawl (raw HTML dump; CCBot crawls and stores pages in WARC) → derived dataset (FineWeb, C4, Dolma; quality filters clean and select) → AI model (parametric memory; training on the selected data). At every stage your content can be rejected: robots.txt blocks, the firewall returns 403, too few links, bio in the sidebar, too many ads, messy code, duplicates, low text quality, language filtered out. Presence in Common Crawl increases the chance, but doesn’t guarantee presence in the model.

I asked about filtering Common Crawl data at the source

On a discussion panel during the SEO Vibes Summit in Zakopane (May 2026), I had the chance to ask Stephen Burns, Web Intelligence Lead at the Common Crawl Foundation, about data filtering. He works at exactly the intersection of SEO, GEO, and web-scale data. I asked whether it’s true that CC data is heavily filtered before it reaches training, and what proportion of it actually ends up there. He confirmed the fact of filtering (“it’s true”), but didn’t comment on specific statistics.

And that’s what I find most interesting: even in conversation with someone from inside the organization, the exact thresholds and selection criteria remain semi-opaque. You can’t “hack” it – because no one on the outside knows the exact gate. You can only publish so that your content passes that gate: valuable, clean, with a high content-to-boilerplate ratio.

What can you do with knowledge about Common Crawl?

Common Crawl isn’t Wikidata – you don’t create an entry there. You can, however, influence whether CCBot finds you and whether your content has a better chance of appearing in the public index. Remember, though, that this isn’t a guide to “how to get into an LLM,” but to how not to fall out at the first stage – crawlability, accessibility, and the quality of the source HTML.

  • Check robots.txt. Don’t block CCBot. If you have User-agent: CCBot with Disallow: /, that’s a deliberate decision to be absent from Common Crawl – one of the most important public data sources used in model training datasets.

    Many companies do this (the New York Times, Reddit). In most cases, if you’re building a personal brand and care about presence in public web datasets, blocking CCBot works against that goal.
  • Check your firewall. Your hosting or Cloudflare may block CCBot at the firewall level, even if robots.txt is open. Aggressive anti-bot settings (e.g. Bot Fight Mode in Cloudflare) can return a 403 error without your knowledge.

    In my case, the hosting returned a 406 response to CCBot for a year and a half – on both of my domains at once. I didn’t know about it until I checked the cc-index. A 406 code often results from ModSecurity rules on the server that reject requests without certain headers – if you see it in the cc-index, contact your hosting support.
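Both checks above can be approximated from a script. The Python sketch below uses the standard library's robots.txt parser to test whether a given robots.txt allows CCBot, and adds a simple probe that requests a page with a CCBot-style User-Agent. Two assumptions apply: the User-Agent string is the one CCBot is commonly reported to send, and the probe only reveals rules keyed on the User-Agent header, so IP-based blocking at the firewall would not show up. The function names are hypothetical.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

# Assumption: the User-Agent string commonly reported for CCBot.
CCBOT_UA = "CCBot/2.0 (https://commoncrawl.org/faq/)"


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Check a robots.txt body against the CCBot user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


def probe_as_ccbot(url: str) -> int:
    """Network call: return the status code a CCBot-like request receives."""
    request = Request(url, headers={"User-Agent": CCBOT_UA})
    try:
        with urlopen(request, timeout=30) as response:
            return response.status
    except HTTPError as err:
        return err.code  # e.g. 403 or 406 when the server rejects the bot


open_rules = "User-agent: *\nDisallow:\n"
blocking_rules = "User-agent: *\nDisallow:\n\nUser-agent: CCBot\nDisallow: /\n"

print(ccbot_allowed(open_rules, "https://example.com/article/"))      # → True
print(ccbot_allowed(blocking_rules, "https://example.com/article/"))  # → False
```

A 200 from `probe_as_ccbot` is a good sign but not proof, while a 403 or 406 is a strong hint that something between the internet and your site is turning the bot away.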

And if you don’t want to be in Common Crawl?

CCBot respects robots.txt – you can block it at any time. That makes it likely the rule will be honored going forward, but data already collected is hard to remove from existing archives. Common Crawl launched an Opt-Out Registry, but in practice it’s mainly the big players who use it. The takeaway: before you start publishing, think about what you want AI to “remember” about you.

  • Publish regularly. My case suggests that a long time can pass between CCBot’s first visit and the domain appearing in the public index, unless the site develops dynamically and is linked from pages already present in CC.
  • Take care of internal linking. CCBot discovers subpages through links, so if your most important articles are buried three clicks from the homepage, the bot may not reach them.

    A “Latest posts” section or a well-organized menu isn’t only UX – it’s the path the bot takes to your content.
  • Take care of clean HTML. Common Crawl stores your page’s raw HTML – together with everything on it: navigation, footer, sidebars, cookie bars, and all the “wrapping” that isn’t the actual content (i.e. boilerplate).

    But note: AI models don’t train directly on that raw dump. Before your page reaches the training data, it passes through cleaning scripts like trafilatura (and given my ties to the culinary world, I have to stress that I mean text extraction here, not the pasta-shaping process 😉) or resiliparse, which try to separate the actual content from the ornaments – cut out the menu, footer, ads, and leave only what’s the article.

    These filters aren’t perfect, though. If your most important information appears in a sidebar or widget, the parser may treat it as a navigation element and filter it out.

    Even though it was in the crawl, it won’t make it into the training data. And one more thing: CCBot doesn’t render JavaScript – it pulls raw HTML. If your content loads dynamically through JS or sits behind a paywall, CCBot won’t see it.
  • Expose your key content. Keep it in the main article block. Use HTML tags that clearly say “this is content” (article, section, headings) – in other words, take care of semantic HTML.

    And use BLUF – expose the most important conclusion at the start of the text, not at the end. The easier it is for the parser to find the actual content, the lower the risk that key information gets treated as boilerplate and filtered out.
  • Be linked from domains CC crawls regularly. Common Crawl discovers new URLs through links, like a typical crawler.

    If no one links to you from domains present in Common Crawl, CCBot may simply never find you. Links from industry media, guest posts, citations with a link to your site are the paths by which CCBot can reach you.
  • Monitor your presence. Check the cc-index after every new crawl (updated monthly). If your domain has started to appear, track which URLs are being crawled. If it isn’t appearing, work on the volume and quality of your content. CCBot doesn’t execute JavaScript, so you won’t see it in Google Analytics – the only reliable sources are your server access logs or the cc-index.

What Common Crawl doesn’t give you

Common Crawl is raw data – text, HTML, metadata. CCBot doesn’t know you’re an SEO expert. It doesn’t know you wrote a book. It doesn’t know your company has existed for ten years. It just… collects text.

Figure: The AI visibility ecosystem. Common Crawl (parametric memory: raw text the model “knows” from training) and Wikidata (entity identity: structured facts) feed your brand in AI (a coherent expert picture) from above; your site with structured data (currency via RAG: what the model “finds” in real time) feeds it from below. Each element strengthens the others. One without the other is an incomplete picture.

That’s why Common Crawl and Wikidata are two complementary channels:

  • Common Crawl gives models raw fuel – the text from which they learn language and context.
  • Wikidata gives algorithms structured facts – a label on that fuel that says who you are and what confirms your expertise.
  • Your site with structured data connects both worlds – it provides both content to crawl and structure to recognize.

One without the other is an incomplete picture

An expert visible in Wikidata but absent from Common Crawl may have a Knowledge Panel – but LLMs may have far weaker, fragmentary knowledge of them in the parametric layer. An expert present in Common Crawl but without structured data will simply be one of the millions of pages the model absorbed – with no distinct identity.

Visibility in 2026 isn’t one place. It’s an ecosystem: Common Crawl, Wikidata, structured data, content on your site, mentions across the web. Each element strengthens the others.

Common Crawl FAQ

How do I check whether my domain is in Common Crawl?

Go to index.commoncrawl.org, pick a crawl from the list, and enter your domain (e.g. yourdomain.com/*). You’ll get the same result with an API query: https://index.commoncrawl.org/CC-MAIN-2026-17-index?url=yourdomain.com/*&output=json. If the domain is in the index, you’ll get a JSON response with the fields timestamp, status, languages, and filename. No results means your domain isn’t in that particular crawl (which doesn’t mean it isn’t in others).

My site is in Common Crawl, but AI doesn’t know me. Why?

Because presence in Common Crawl isn’t the same as presence in a model’s training data. Between your page and an LLM’s answer there are several layers: raw CC data → a filtered dataset (FineWeb, C4, Dolma) → model training. Each of them can reject you – for example, the FineWeb-Edu quality filters rejected about 92% of the already filtered FineWeb data. On top of that, some companies use their own crawlers independent of Common Crawl.

Should I block CCBot in robots.txt?

It depends on your goal. If you’re building a personal or expert brand and care about presence in public training datasets, blocking CCBot works against that. The big players protecting their content do it (NYT, Reddit), but for most specialists presence is more beneficial. Remember too that data once collected is hard to remove from existing archives, so make the decision before you start publishing.

How does Common Crawl differ from the Wayback Machine?

The Wayback Machine (archive.org) shows what a page looked like at a given moment – it’s a visual archive. Common Crawl gives raw HTML for machine analysis and serves as a training-data source for AI. They’re two different tools for two different purposes.

I see a 406 or 403 code for my domain in the cc-index. What does it mean?

Your server or firewall is rejecting CCBot, even if robots.txt is open. A 406 code often results from ModSecurity rules rejecting requests without certain headers; a 403 may come from aggressive anti-bot settings (e.g. Bot Fight Mode in Cloudflare). In my case, the hosting returned a 406 to CCBot for a year and a half – I only found out from the cc-index. If you see this, contact your hosting support.

Written by Ewelina Podrez-Siama