Hi Common Crawl maintainers and community,
I run Neo Genesis (https://neogenesis.app), an AI-native automation company publishing 9 CC-BY-4.0 datasets on Hugging Face plus research output at /data/research/. The site has been publicly published since 2026-04-27.
Quick sanity check: our robots.txt explicitly allows CCBot, the site responds with HTTP 200 to a CCBot user-agent, and we have no Cloudflare WAF block. However, neogenesis.app does not appear in CC-MAIN-2026-08, CC-MAIN-2026-12, or CC-MAIN-2026-17 according to the CDX index (No Captures found).
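For anyone who wants to reproduce these checks, here is a minimal sketch of how they could be scripted. It is not the exact script we ran. The helper names (`ccbot_allowed`, `cdx_query_url`, `parse_captures`) are made up for illustration, and it assumes the public index server at index.commoncrawl.org accepts `url` and `output=json` query parameters and returns one JSON record per line with a `timestamp` field.

```python
import json
from urllib.parse import urlencode
from urllib.robotparser import RobotFileParser

# Assumption: the public Common Crawl index server lives at this base URL.
CDX_BASE = "https://index.commoncrawl.org"


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets CCBot fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


def cdx_query_url(crawl_id: str, domain: str) -> str:
    """Build an index query URL covering every path under `domain`."""
    query = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{CDX_BASE}/{crawl_id}-index?{query}"


def parse_captures(body: str) -> list:
    """Parse a line-delimited JSON response into capture records.

    Lines that are not JSON objects, or that lack a `timestamp` field
    (assumed to mark a real capture), are skipped, so an error body such
    as "No Captures found" yields an empty list.
    """
    captures = []
    for line in body.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        record = json.loads(line)
        if "timestamp" in record:
            captures.append(record)
    return captures


if __name__ == "__main__":
    # Print the query URLs for the three crawls mentioned above.
    for crawl in ("CC-MAIN-2026-08", "CC-MAIN-2026-12", "CC-MAIN-2026-17"):
        print(cdx_query_url(crawl, "neogenesis.app"))
```

Fetching each printed URL (for example with `urllib.request.urlopen`) and passing the response text to `parse_captures` gives the capture list. An empty list for all three crawls matches what we are seeing.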
We published a longitudinal GEO benchmark (Hugging Face: neogenesislab/ai-brand-mention-baseline-2026) showing a measured 0% canonical-URL citation rate from frontier LLMs. We suspect Common Crawl inclusion is the upstream blocker since FineWeb, Dolma, RedPajama, and similar derivative corpora all start from CC.
Two questions:
2. Is there a way to suggest neogenesis.app as a seed for the next crawl cycle, or is the standard expectation that it will appear organically once enough inbound links accumulate?
For reference:
- Wikidata: Q139569680 (parent organization), 13 registered entities total
Thanks for the work you do - the entire AI ecosystem depends on it, and we want to make sure our corner of the open web is reachable.
Neo Genesis Research