Query Task: Looking to Find Email HTML5 INPUT Fields


David E. Weekly

Nov 8, 2021, 8:11:36 PM
to Common Crawl
Hello, Common Crawlers!

Apologies for the n00b question but I'm looking to return a list of URLs that contain at least one HTML fragment with an email input entry, e.g. of the form <input type="email"... or <input ... autocomplete="email">. It's okay if it's incomplete (this simplistic approach is obviously going to miss sites that synthesize the DOM with JS vs include these fragments in statically-served HTML), I'm just looking for a first pass.

For those curious, the attempt is to start to learn a mapping of "places where an email can be entered" to HTTP POST destinations and ultimately mailing lists (and to then score such mailing lists for quality).

It wasn't clear to me after going through the Athena ccindex tutorials how I could perform this task; it looked like I'd just need to download, decode, and grep the full WARC content of the crawl, which looks like quite a large task. Are there "shortcuts" recommended here or techniques that have served others well (e.g. just querying those web pages that have POST destinations which I could filter down from the full crawl set)?

Thanks so much - and apologies if this is the wrong place to post; happy to be pointed in the right direction.

Yours,
  David E. Weekly
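
A minimal sketch of the match described in the question, assuming statically served HTML and using only Python's standard-library `html.parser`. The names `EmailInputFinder` and `has_email_input` are illustrative and not from the thread. Treating `autocomplete` as a space-separated token list is an added assumption that goes slightly beyond the literal `autocomplete="email"` form mentioned above.

```python
from html.parser import HTMLParser


class EmailInputFinder(HTMLParser):
    """Sets `found` once an <input> with type="email" or an
    autocomplete value containing the token "email" is seen."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.found = False

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "input" or self.found:
            return
        # Attribute values can be None for bare attributes such as `required`.
        attr_map = {name: (value or "").strip().lower() for name, value in attrs}
        if attr_map.get("type") == "email":
            self.found = True
        elif "email" in attr_map.get("autocomplete", "").split():
            self.found = True


def has_email_input(html: str) -> bool:
    """Return True if the HTML contains at least one email input field."""
    finder = EmailInputFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Parsing the markup rather than grepping the raw bytes avoids false hits on text that merely mentions `type="email"`, at the cost of some CPU time per page. As the question already notes, fields created by JavaScript at runtime are missed either way.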

Sebastian Nagel

Nov 15, 2021, 3:49:45 AM
to common...@googlegroups.com
Hi David,

The columnar index does not include the HTML, so you cannot use
it for your task directly. But you could use it to pick and sample URLs
which look like a contact form by matching a regular expression
on the "url_path" column. Via the WARC file path, offset and length
it's possible to fetch only the selected records.
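
A rough sketch of that two-step idea, not code from this thread: filter index rows by a regular expression on the path, then fetch each selected record with an HTTP range request. The base URL, the path pattern, and all function names are assumptions made for illustration. The sketch relies on each record in a Common Crawl WARC file being its own gzip member, so a single record can be decompressed on its own.

```python
import gzip
import re
import urllib.request

# Assumption: public HTTP endpoint for Common Crawl data; the thread does not name one.
BASE_URL = "https://data.commoncrawl.org/"

# Hypothetical pattern for "looks like a contact form"; adjust as needed.
CONTACT_PATH = re.compile(r"/(contact|subscribe|newsletter|signup|sign-up)\b", re.IGNORECASE)


def looks_like_contact_form(url_path: str) -> bool:
    """Apply the path filter to one value of the index's url_path column."""
    return CONTACT_PATH.search(url_path) is not None


def range_request(warc_filename: str, offset: int, length: int) -> urllib.request.Request:
    """Build (but do not send) a request for exactly one WARC record."""
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    return urllib.request.Request(
        BASE_URL + warc_filename,
        headers={"Range": f"bytes={offset}-{last_byte}"},
    )


def decode_record(raw: bytes) -> bytes:
    """Decompress the gzip member holding a single WARC record."""
    return gzip.decompress(raw)


def fetch_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Download and decompress one record (performs network I/O)."""
    with urllib.request.urlopen(range_request(warc_filename, offset, length)) as resp:
        return decode_record(resp.read())
```

The three arguments correspond to the `warc_filename`, `warc_record_offset` and `warc_record_length` columns of the columnar index. The decoded bytes contain the WARC headers, the HTTP response headers, and then the HTML payload.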

Best,
Sebastian
> --
> You received this message because you are subscribed to the Google
> Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to common-crawl...@googlegroups.com
> <mailto:common-crawl...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/7065d3ad-85cf-4184-b6e1-178cf855ec3bn%40googlegroups.com
> <https://groups.google.com/d/msgid/common-crawl/7065d3ad-85cf-4184-b6e1-178cf855ec3bn%40googlegroups.com?utm_medium=email&utm_source=footer>.

Greg Lindahl

Nov 15, 2021, 10:54:43 AM
to common...@googlegroups.com
Sebastian and David,

I think David's willing to pay to "grep the web" by iterating over all
of the crawl content. Sebastian, is there a modern example of this
somewhere in the CC website's list of examples? And does anyone
have a modern $$ amount for doing this? The number in my head is
$40 for a standard-sized CC crawl, but I think that was years ago.

-- greg

Colin Dellow

Nov 15, 2021, 11:59:48 AM
to Common Crawl
I have a framework I've been meaning to clean up and open source. It spins up a self-contained environment using CloudFormation which dispatches work to spot instances via an SQS queue and writes JSON lines files to S3. IIRC, grepping an entire crawl of WARCs is about $5-10 depending on how many hits you get. It can be much cheaper if you also filter on language or URL, as in those cases we can use the metadata warc entry to skip decompressing the request/response entries.

@David, I could grant you access to the repos and walk you through how to get it set up. On a sample of 3M URLs, about 17% have such an input field. Here's an example of 100 hits: https://gist.github.com/cldellow/e53399b9da10d13bbecda447230f5c00

Sebastian Nagel

Nov 15, 2021, 12:14:46 PM
to common...@googlegroups.com
Hi David, hi Greg,

Colin Dellow shared some code and benchmarks:
https://code402.com/blog/hello-warc-common-crawl-code-samples/
https://github.com/code402/warc-benchmark

Examples based on AWS Lambda

https://aws.amazon.com/blogs/apn/analyzing-performance-and-cost-of-large-scale-data-processing-with-aws-lambda/
https://github.com/candidpartners/lambda-at-scale
or
https://github.com/andresriancho/cc-lambda

And there is a quite similar task done using cc-pyspark on EMR:
https://psuter.net/2019/07/07/z-index

> And does anyone
> have a modern $$ amount for doing this? The number in my head is
> $40 for a standard-sized CC crawl, but I think that was years ago.

The $40 grep was back in 2015

https://engineeringblog.yelp.com/2015/03/analyzing-the-web-for-the-price-of-a-sandwich.html

It was done on the WET files which contain the extracted text and are
only 10% the size of the WARC files which include the raw HTML.


Colin's numbers suggest that you can grep the WARC files even more cheaply.
The AWS Lambda blog post mentions a total of $160. My experience using
Spark on Bigtop Hadoop is in the same range: about $120 (but it's more
than a simple grep).

Best,
Sebastian

Sebastian Nagel

Nov 15, 2021, 12:26:23 PM
to common...@googlegroups.com
Hi Colin,

> grepping an entire crawl of WARCs is about $5-10

That's indeed very impressive! Thanks for sharing this
number!

Sebastian