JavaScript

Olexiy Lytvynenko

Dec 6, 2016, 6:18:16 AM

to Common Crawl

Hello,

I'd like to know whether Common Crawl uses a JavaScript engine during crawling to deal with dynamic content.

Thanks in advance.

Best regards,

Alex

Sebastian Nagel

Dec 6, 2016, 9:39:29 AM

to common...@googlegroups.com

Hi Alex,

No, it does not.

It would complicate the crawling, not only because JavaScript would have to be executed,
but also because all required libraries and dependencies would have to be kept at hand
so that they are not refetched with every HTML page.

The same problem appears on the other side when a page is read from the WARC archives.
The WARC simply archives the HTTP traffic between crawler and server(s),
but a dynamic page is composed of multiple requests and responses which may
not necessarily end up in the same WARC file.
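
To illustrate the point about multiple requests, here is a minimal Python sketch (standard library only) that lists the external scripts an archived HTML page refers to. Each of these URLs is a separate request/response pair that would have to be located on its own, possibly in a different WARC file or not at all. The names ScriptCollector and external_scripts and the example.com / example.org URLs are made up for illustration; this is not part of the Common Crawl tooling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptCollector(HTMLParser):
    """Collect the URLs of external scripts referenced by an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL the page was fetched from
        self.script_urls = []

    def handle_starttag(self, tag, attrs):
        # Only <script src="..."> counts; inline scripts carry no extra request.
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the page URL.
                self.script_urls.append(urljoin(self.base_url, src))


def external_scripts(html, base_url):
    """Return absolute URLs of all external scripts in the given HTML."""
    parser = ScriptCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.script_urls


# Hypothetical page content, standing in for the payload of one WARC response record.
page = """<html><head>
<script src="/static/app.js"></script>
<script src="https://cdn.example.org/lib.min.js"></script>
<script>console.log("inline");</script>
</head><body></body></html>"""

for url in external_scripts(page, "https://www.example.com/index.html"):
    print(url)
```

For this sample page the sketch reports two URLs, one on the same host and one on a different host. Rendering the page would need both responses in addition to the HTML itself, and stylesheets, images and requests issued by the scripts at run time would add further ones.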

Of course, there are many good reasons why a crawler should take dynamic content into
account...

Best,
Sebastian
