Inclusion check for an AI-native company domain (neogenesis.app) not in recent CC-MAIN snapshots

허예솔

May 13, 2026, 10:06:58 PM

to Common Crawl

Hi Common Crawl maintainers and community,

I run Neo Genesis (https://neogenesis.app), an AI-native automation company that publishes 9 CC-BY-4.0 datasets on Hugging Face plus research output at /data/research/. The site has been publicly accessible since 2026-04-27.

Quick sanity check: our robots.txt explicitly allows CCBot, the site responds with HTTP 200 to a CCBot user-agent, and we have no Cloudflare WAF block. However, neogenesis.app does not appear in CC-MAIN-2026-08, CC-MAIN-2026-12, or CC-MAIN-2026-17 according to the CDX index (No Captures found).
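For anyone who wants to reproduce the index check, here is a minimal sketch of the lookup against the public CDX index server at index.commoncrawl.org. The helper names (`build_query_url`, `parse_captures`, `fetch_captures`) and the User-Agent string are made up for illustration, and treating an HTTP 404 as "No Captures found" is an assumption about how the index server reports an empty result, not documented behavior.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Crawl IDs taken from the message above.
CRAWLS = ["CC-MAIN-2026-08", "CC-MAIN-2026-12", "CC-MAIN-2026-17"]


def build_query_url(crawl_id, domain):
    """Build a CDX index query URL matching every capture under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


def parse_captures(body):
    """Parse a newline-delimited JSON response into capture records.

    Lines without a "urlkey" field (for example an error message object)
    are skipped, so an error-only response yields an empty list.
    """
    captures = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "urlkey" in record:
            captures.append(record)
    return captures


def fetch_captures(crawl_id, domain):
    """Query one crawl's index over the network (hypothetical helper)."""
    request = Request(
        build_query_url(crawl_id, domain),
        headers={"User-Agent": "cdx-check-sketch/0.1"},  # illustrative value
    )
    try:
        with urlopen(request, timeout=30) as response:
            return parse_captures(response.read().decode("utf-8"))
    except HTTPError as err:
        # Assumption: an empty result is reported as HTTP 404.
        if err.code == 404:
            return []
        raise


if __name__ == "__main__":
    for crawl in CRAWLS:
        print(crawl, len(fetch_captures(crawl, "neogenesis.app")))
```

An empty list for all three crawl IDs corresponds to the "No Captures found" result described above.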

We published a longitudinal GEO benchmark (Hugging Face: neogenesislab/ai-brand-mention-baseline-2026) showing a measured 0% canonical-URL citation rate from frontier LLMs. We suspect Common Crawl inclusion is the upstream blocker since FineWeb, Dolma, RedPajama, and similar derivative corpora all start from CC.

Two questions:

1. Is there anything in our setup we are missing that would prevent CCBot from picking up neogenesis.app? We have inbound links from huggingface.co/neogenesislab (9 datasets), github.com/Yesol-Pilot (multiple repos), and wikidata.org (P856 official website on Q139569680), all of which appear in CC.

2. Is there a way to suggest neogenesis.app as a seed for the next crawl cycle, or is the standard expectation that it will appear organically once enough inbound links accumulate?

For reference:

- Site: https://neogenesis.app

- robots.txt: https://neogenesis.app/robots.txt (25+ AI bots explicitly allowed)

- llms.txt: https://neogenesis.app/llms.txt

- llms-full.txt: https://neogenesis.app/llms-full.txt (84KB AI corpus)

- HF datasets: https://huggingface.co/neogenesislab

- Wikidata: Q139569680 (parent organization), 13 registered entities total
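The robots.txt claim can also be checked offline with Python's standard `urllib.robotparser`. This is only a sketch: `SAMPLE_ROBOTS` is hypothetical content written for illustration, not a copy of the file at the URL above, and `is_allowed` is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only; this is NOT the
# actual file served by the site.
SAMPLE_ROBOTS = """\
User-agent: CCBot
Allow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "CCBot", "https://neogenesis.app/data/research/"))
```

To test the live file instead, the text fetched from the robots.txt URL listed above can be passed in place of `SAMPLE_ROBOTS`. Passing this check only confirms that CCBot is permitted; it does not mean the crawler has discovered the site.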

Thanks for the work you do - the entire AI ecosystem depends on it, and we want to make sure our corner of the open web is reachable.

Neo Genesis Research

Greg Lindahl

May 13, 2026, 10:50:24 PM

to common...@googlegroups.com, CC Info

In the latest 3 crawls, we have seen 0 links to your website. We don't
crawl github or hugging face that deeply, so it's no surprise that we
haven't seen the links there.

So yes, it's lack of incoming links. We prefer to let that process be organic.

-- greg
