CommonCrawl code

yu...@academia.edu

Mar 21, 2017, 12:47:41 AM
to Common Crawl
Where can I find the code for CCBot as it runs in production? Earlier today CCBot was crawling www.academia.edu and we had a weird stream of errors on our side: it looked like CCBot hit a rarely used codepath that we are still trying to track down. We blocked the bot temporarily so that the alerts would stop, but we would prefer to fix the actual bug. Since we don't control when the bot hits our site, and since we don't save the full bytestream of all the requests that hit our site, the problem is proving tricky to isolate. 

Greg Lindahl

Mar 21, 2017, 12:55:10 AM
to common...@googlegroups.com
You can use tcpdump to make a pcap file covering a few tens of
seconds, and it should contain several complete requests and
responses. It's a good capability to have even when it's not CCBot
running into the bug :-)

-- greg
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
> To post to this group, send email to common...@googlegroups.com.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.
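
For readers who want to script the capture Greg describes, here is a minimal sketch in Python. It only wraps the standard tcpdump flags -i, -s and -w; the interface name, port, output path and the helper names are assumptions chosen for illustration, not anything from this thread.

```python
import subprocess
import time


def build_tcpdump_command(interface, port, pcap_path):
    """Return the argument list for a full-packet capture on one TCP port."""
    return [
        "tcpdump",
        "-i", interface,   # interface to listen on (assumed name, e.g. eth0)
        "-s", "0",         # snapshot length 0: keep whole packets, not just headers
        "-w", pcap_path,   # write raw packets to a pcap file
        "tcp", "port", str(port),
    ]


def capture_for(seconds, interface="eth0", port=80, pcap_path="ccbot.pcap"):
    """Run tcpdump for roughly `seconds` seconds, then stop it."""
    proc = subprocess.Popen(build_tcpdump_command(interface, port, pcap_path))
    try:
        time.sleep(seconds)
    finally:
        proc.terminate()   # ask tcpdump to stop
        proc.wait()        # wait for it to exit before reading the file
    return pcap_path
```

Calling capture_for(30) would need privileges to sniff the interface. If the site terminates TLS, the capture has to happen on the plaintext side of the terminator, otherwise the requests in the pcap file are encrypted.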

Yuri Niyazov

Mar 21, 2017, 1:11:40 AM
to common...@googlegroups.com
Yes, my issue isn't that I don't know how to capture traffic; my issue is that we don't by default capture all traffic, and I don't know how to force CCBot to recrawl my site on-demand so that I can turn on the traffic capture when I am expecting the bug to reoccur. 
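
One partial workaround for not being able to trigger a recrawl is to replay suspect URLs yourself with the crawler's User-Agent while capture is switched on. The sketch below is only an approximation: the User-Agent string is an assumption and should be replaced with the exact value from your own access logs, the function names are hypothetical, and the real crawler may send other headers that this does not reproduce.

```python
import urllib.error
import urllib.request

# Assumption: copy the exact User-Agent string from your own access logs instead.
ASSUMED_CCBOT_UA = "CCBot/2.0 (http://commoncrawl.org/faq/)"


def build_replay_request(url, user_agent=ASSUMED_CCBOT_UA):
    """Build (but do not send) a GET request carrying the crawler's User-Agent."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="GET"
    )


def replay(urls, timeout=10):
    """Send each request; return (url, status) pairs, recording HTTP errors."""
    results = []
    for url in urls:
        try:
            request = build_replay_request(url)
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results.append((url, response.status))
        except urllib.error.HTTPError as err:
            # 4xx/5xx responses are what we are hunting for, so keep them.
            results.append((url, err.code))
    return results
```

Feeding replay() the URLs that CCBot requested, taken from the access logs around the time of the errors, may be enough to hit the same codepath without waiting for the next crawl.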

Sebastian Nagel

Mar 21, 2017, 3:30:24 AM
to common...@googlegroups.com
Hi Yuri,

The code base for the main crawl is
https://github.com/commoncrawl/nutch/

But for the last 8 hours a small side crawl has been running
for development and testing, and also to complete the fetch
lists of the main crawl. "Small" here means:
- starting from 9 million seeds
- restricted to a maximum of 50-100 URLs/pages per host
- 50 million pages in total
This crawl was run with the mainline master branch of
Apache Nutch 1.x, https://github.com/apache/nutch/

Note that the crawler does nothing other than follow URLs found
on the web. It fully respects exclusions by robots.txt.
I'm sorry if this caused trouble on your side.

Could you send me a specific URL and the request time?
(If so, please send it to: in...@commoncrawl.org)
We are running on AWS spot instances and keep logs and
temporary data only for a short time. But right now we
could still trace the URL back to one of the 9 million seeds.

Best,
Sebastian
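
Since the crawler respects robots.txt exclusions, a temporary block can be expressed there and checked before deploying it. This is a minimal sketch using Python's standard urllib.robotparser; the robots.txt content, the agent token "CCBot" and the helper name is_allowed are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes only the agent token "CCBot" (assumed).
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the rules above, is_allowed(ROBOTS_TXT, "CCBot", url) is False for any path on the site, while agents that match no rule stay allowed. Removing the two lines from robots.txt lifts the block again once the bug is fixed.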
