Hello, Common Crawlers!
Apologies for the n00b question, but I'm looking to return a list of URLs whose pages contain at least one HTML fragment with an email input field, e.g. of the form <input type="email" ...> or <input ... autocomplete="email">. It's okay if the list is incomplete; I'm just looking for a first pass. (This simplistic approach will obviously miss sites that build the DOM with JS rather than including these fragments in statically served HTML.)
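To make the per-page check concrete, here is a minimal sketch of the test I have in mind, assuming Python and its standard-library html.parser. The names EmailInputFinder and has_email_input are just illustrative, and this only sees static HTML, so it shares the limitation mentioned above.

```python
from html.parser import HTMLParser


class EmailInputFinder(HTMLParser):
    """Sets .found to True once an <input> that looks like an email field is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        # Self-closing tags (<input ... />) also end up here by default.
        if tag != "input" or self.found:
            return
        # Attribute values may be None (bare attributes), so normalize to "".
        attr_map = {name: (value or "").strip().lower() for name, value in attrs}
        is_email_type = attr_map.get("type") == "email"
        # autocomplete can hold several space-separated tokens.
        has_email_token = "email" in attr_map.get("autocomplete", "").split()
        if is_email_type or has_email_token:
            self.found = True


def has_email_input(html: str) -> bool:
    """Return True if the HTML contains at least one email-style <input>."""
    finder = EmailInputFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Using a parser rather than a plain grep avoids false positives from text that merely mentions type="email" outside of an actual input tag, at the cost of being slower per page.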
For those curious, the goal is to start learning a mapping from "places where an email can be entered" to HTTP POST destinations and ultimately to mailing lists (and then to score those mailing lists for quality).
After going through the Athena ccindex tutorials, it wasn't clear to me how to perform this task. It looked like I'd need to download, decode, and grep the full WARC content of the crawl, which seems like quite a large job. Are there recommended "shortcuts" or techniques that have served others well (e.g. querying only those web pages that have POST destinations, so that I could filter down from the full crawl set)?
Thanks so much - and apologies if this is the wrong place to post; happy to be pointed in the right direction.
Yours,
David E. Weekly