Hi everyone,
I'm working on solving the Spider-Trap problem in my personal project.
My goal is to analyze offline URL-MD5 data to generate regex rules that prevent our crawler from downloading distinct URLs that yield identical content.
Currently, I'm using a Trie data structure (drawing on my experience from ICPC contests) to analyze the data. Specifically, I build a Trie from the URL paths, compute statistics for each node, and mark a node as a "BadNode" if its children contain a high volume of duplicate content (identical MD5s). Finally, I extract the path prefixes of these BadNodes to generate the blocking rules.
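To make that concrete, here is a simplified Python sketch of the idea. The input format (an iterable of (url, md5) pairs) and the `min_urls` / `dup_ratio` thresholds are just illustrative placeholders, not my actual code:

```python
from collections import defaultdict
from urllib.parse import urlparse

class TrieNode:
    def __init__(self):
        self.children = {}                   # path segment -> TrieNode
        self.md5_counts = defaultdict(int)   # MD5 -> number of URLs under this node
        self.url_count = 0

def build_trie(records):
    """records: iterable of (url, md5) pairs from the offline dump."""
    root = TrieNode()
    for url, md5 in records:
        segments = urlparse(url).path.strip("/").split("/")
        node = root
        node.url_count += 1
        node.md5_counts[md5] += 1
        for seg in segments:
            node = node.children.setdefault(seg, TrieNode())
            node.url_count += 1
            node.md5_counts[md5] += 1
    return root

def find_bad_prefixes(node, prefix="", min_urls=50, dup_ratio=0.8):
    """Mark a node as 'bad' when most URLs below it collapse onto few MD5s.

    Thresholds are illustrative; tuning them per site is part of the problem.
    """
    bad = []
    if node.url_count >= min_urls:
        distinct = len(node.md5_counts)
        # Fraction of URLs that are not the first occurrence of their MD5.
        ratio = 1.0 - distinct / node.url_count
        if ratio >= dup_ratio:
            bad.append((prefix or "/", node.url_count, ratio))
            return bad  # the parent prefix already covers its children
    for seg, child in node.children.items():
        bad.extend(find_bad_prefixes(child, f"{prefix}/{seg}", min_urls, dup_ratio))
    return bad
```

The prefixes returned by `find_bad_prefixes` are then turned into blocking regexes (e.g. a prefix like `/calendar` becomes a rule matching `^/calendar/.*`).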
However, this strategy doesn't seem to be very effective in practice.
I am writing to ask for suggestions. What are the industry-standard methods for detecting these types of traps using offline data analysis?
Thanks, Bohan Mao