Hi Yuri,
The code base for the main crawl is
https://github.com/commoncrawl/nutch/
But for the last 8 hours a small side crawl has been running
for development and testing, and also to complete the fetch
lists of the main crawl. "Small" here means:
- starting from 9 million seeds
- restricted to max. 50-100 URLs/pages per host
- 50 million pages in total
This crawl was run with the mainline master branch of
Apache Nutch 1.x,
https://github.com/apache/nutch/
Note that the crawler does nothing other than follow URLs found
on the web, and it fully respects exclusions via robots.txt.
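For example, if you prefer to slow the bot down or keep it out of
specific sections instead of blocking it entirely, CCBot also honors
the Crawl-delay directive. A minimal robots.txt entry could look like
this (the path is only a placeholder):

  User-agent: CCBot
  # wait 5 seconds between requests
  Crawl-delay: 5
  # keep the bot out of this section
  Disallow: /private/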
I'm sorry if this caused trouble on your side.
Could you send me a specific URL and the request time?
(If so, please send it to in...@commoncrawl.org.)
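One workaround for the capture problem you describe below: instead of
recording all traffic, log the full request only when the user agent
contains "CCBot". A rough sketch, assuming a Python/WSGI stack purely
for illustration (the class and logger names are made up):

  import logging

  logger = logging.getLogger("ccbot_capture")

  class CCBotCaptureMiddleware:
      # Wraps a WSGI application and logs request details,
      # but only for requests coming from CCBot.
      def __init__(self, app):
          self.app = app

      def __call__(self, environ, start_response):
          ua = environ.get("HTTP_USER_AGENT", "")
          if "CCBot" in ua:
              # Capture method, path, query string and all headers.
              headers = {k: v for k, v in environ.items()
                         if k.startswith("HTTP_")}
              logger.info("CCBot request: %s %s?%s headers=%r",
                          environ.get("REQUEST_METHOD"),
                          environ.get("PATH_INFO"),
                          environ.get("QUERY_STRING", ""),
                          headers)
          return self.app(environ, start_response)

Request bodies need more care (wsgi.input can only be read once), but
URL and headers are often enough to narrow down a crawler-triggered
bug.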
We are running on AWS spot instances and keep logs and
temporary data only for a short time, but right now we
could still trace the URL back to one of the 9 million seeds.
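To make Greg's tcpdump suggestion below concrete, a command along
these lines should work (interface, port, and file name have to be
adjusted to your setup):

  sudo tcpdump -i eth0 -s 0 -G 30 -W 1 -w ccbot.pcap 'tcp port 80'

The combination of -G 30 and -W 1 makes tcpdump exit on its own after
writing one 30-second capture file.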
Best,
Sebastian
On 03/21/2017 06:11 AM, Yuri Niyazov wrote:
> Yes, my issue isn't that I don't know how to capture traffic; my issue is that we don't capture
> all traffic by default, and I don't know how to force CCBot to recrawl my site on demand so that I can
> turn on the traffic capture when I am expecting the bug to reoccur.
>
> On Mon, Mar 20, 2017 at 9:55 PM, Greg Lindahl <lin...@pbm.com> wrote:
>
>     You can use tcpdump to make a pcap file lasting a few tens of
>     seconds, and it should contain several complete requests and
> responses. It's a good capability to have when it's not CCBot running
> into the bug :-)
>
> -- greg
>
> > CCBot was crawling www.academia.edu and we had a weird stream of errors on
> > our side: it looked like CCBot hit a rarely used codepath that we are still
> > trying to track down. We blocked the bot temporarily so that the alerts
> > would stop, but we would prefer to fix the actual bug. Since we don't
> > control when the bot hits our site, and since we don't save the full
> > bytestream of all the requests that hit our site, the problem is proving
> > tricky to isolate.