Building a niche search engine

Lukas Bogacz

Mar 16, 2025, 2:59:30 PM
to Common Crawl
I'm considering building a niche search engine. The goal would be to index only programming-related information, starting with a particular tech stack. We could do this by starting with a curated list of domains and carefully crafting a set of crawling rules. Could the group please give me an idea of how difficult I should expect this to be, and what the largest issues are likely to be? Is it mainly a matter of paying for bandwidth, or more of a technical/engineering problem?
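To make the "curated list of domains plus crawling rules" idea concrete, here is a minimal sketch of what such a filter could look like. Everything in it is an assumption for illustration: the example hosts, the path prefixes, the blocked query keys, and the `should_crawl` helper are hypothetical, not an actual rule set.

```python
from urllib.parse import urlsplit

# Hypothetical curated allowlist: host -> path prefixes worth crawling.
ALLOWED = {
    "docs.python.org": ("/3/",),
    "developer.mozilla.org": ("/en-US/docs/",),
}

# Hypothetical rule: skip URLs carrying these query keys (tracking, sessions).
BLOCKED_QUERY_KEYS = ("sessionid", "utm_source")


def should_crawl(url: str) -> bool:
    """Return True if the URL passes the allowlist and the crawl rules."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    prefixes = ALLOWED.get(host)
    if prefixes is None:
        return False  # host is not on the curated list
    if not parts.path.startswith(prefixes):
        return False  # host is allowed, but this section of it is not
    query_keys = {pair.split("=", 1)[0] for pair in parts.query.split("&") if pair}
    return not query_keys.intersection(BLOCKED_QUERY_KEYS)
```

A crawler frontier would call `should_crawl` on every discovered link before queueing it, so the rule set, rather than the crawler code, decides the scope of the index.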
For context, this isn't intended as a solo project, but rather as a startup. We're trying to estimate how much it would cost us and how long it would take, in order to decide whether to pursue the idea. The engine would be used by LLMs to provide them with background information, so we would make design choices around that.
I would appreciate any guidance in the matter.
Lukas

Rich Skrenta

Mar 16, 2025, 3:56:40 PM
to common...@googlegroups.com, Lukas Bogacz
I have died on this hill. Happy to chat.

Rich



Daniel Halstrom

Mar 17, 2025, 2:08:52 PM
to common...@googlegroups.com, Lukas Bogacz
We got a full backlink index working a few years back, covering roughly 12 months of crawls, with the ability to add more crawls.
Server cost to keep it up was roughly $400 for 2 servers. We ran 2, updating 1 at a time and swapping them. It could index anchors, titles and backlinks only. No permutations or content indexing.
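The "run 2, update 1 at a time and swap" pattern can be sketched roughly as below. This is only an in-memory illustration under stated assumptions: the `SwappableIndex` class, its method names, and the dict-based backlink storage are hypothetical and say nothing about how the real servers were set up.

```python
class SwappableIndex:
    """Two copies of a backlink index: one serves queries, one is rebuilt."""

    def __init__(self):
        self._copies = [{}, {}]  # each maps target URL -> list of linking URLs
        self._live = 0           # position of the copy currently serving queries

    def lookup(self, target):
        """Answer a query from the live copy only."""
        return self._copies[self._live].get(target, [])

    def update_and_swap(self, new_backlinks):
        """Merge a new crawl into the standby copy, then make it live."""
        standby = 1 - self._live
        # Start from the live data so backlinks from older crawls are kept.
        merged = {t: list(srcs) for t, srcs in self._copies[self._live].items()}
        for target, sources in new_backlinks.items():
            merged.setdefault(target, []).extend(sources)
        self._copies[standby] = merged
        self._live = standby  # the swap: queries now hit the updated copy
```

Queries keep hitting the old copy until the merge has finished, so a slow or failed update never leaves readers with a half-built index.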

Sandra Borst

Mar 17, 2025, 2:40:36 PM
to common...@googlegroups.com