Segment 67 of 2020-34 smaller than all the others?

Henry S. Thompson

Aug 3, 2021, 3:22:45 PM
to common...@googlegroups.com
This segment (CC-MAIN-2020-34 1596439739104.67) is significantly
smaller than the other 99 from that crawl:

18% smaller than the mean for the other 99 segments, by du:

                       mean          sd
  other 99 segments    526,103,000   18,319,400
  seg 67               421,931,797

20% smaller by number of requests in the first 10 files, compared to *.68:

                       mean      sd
  seg 67               32,883    217
  seg 68               41,168    353
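
For reference, one way to reproduce the second measurement is to count the
WARC 'request' records in the first 10 files of a segment, e.g. with the
warcio library. This is only a rough sketch: the local path is a placeholder
and this is not necessarily how the figures above were produced.

    import statistics
    from glob import glob
    from warcio.archiveiterator import ArchiveIterator

    def count_requests(path):
        # Count the 'request' records in one WARC file (one per page capture).
        with open(path, 'rb') as stream:
            return sum(1 for record in ArchiveIterator(stream)
                       if record.rec_type == 'request')

    # First 10 WARC files of the segment, downloaded locally (placeholder path).
    files = sorted(glob('1596439739104.67/warc/*.warc.gz'))[:10]
    counts = [count_requests(f) for f in files]
    print(statistics.mean(counts), statistics.stdev(counts))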

I have a student working on verifying that in general the segments are
all equally representative of the whole month, and the above came as a bit
of an unpleasant surprise. Is there a reason for this? Does the same
happen in other months, and if so, is it predictable?

Thanks,

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Sebastian Nagel

Aug 4, 2021, 9:19:36 AM
to common...@googlegroups.com
Hi Henry,

There are two factors that influence the size of a segment
(or the number of page captures in a segment):

1. The crawler runs on a Hadoop cluster built on EC2 spot instances.
This means that at any point in time nodes of the cluster can be lost,
together with the fetcher tasks running on those nodes and the
temporary data (page captures) the tasks hold. The tasks are then
restarted on other nodes of the cluster, but because fetching
is bound to a fixed time frame (3 hours), the restarted tasks are
usually able to crawl fewer pages. Not ideal, but there is a good
chance that the missing pages are fetched in the next month.
In addition, because the loss of spot instances happens at random,
you might think of it as part of the URL sampling process.
Only if a segment gets too small (<80% of the average) is the entire
segment recrawled. That's why some segments can occasionally be
smaller than the average of a monthly crawl (see the quick check
after point 2 below).

2. Because fetch lists are generated ahead of time for all segments, there
is a continuous drop in the number of successfully fetched pages over time,
and the last segment crawled is 1-2% smaller than the first one.
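
As a quick sanity check of the threshold in point 1 against the du figures
from the first message (the real recrawl decision is presumably based on
page-capture counts for the whole crawl rather than du, so this is only
approximate):

    # Segment 67 against the <80% recrawl threshold, using the du figures
    # quoted above. Approximate only: the actual decision is not based on du.
    mean_other_99 = 526_103_000    # mean du over the other 99 segments
    seg_67        = 421_931_797    # du of segment 67

    ratio = seg_67 / mean_other_99
    print(f"segment 67 is at {ratio:.1%} of the mean of the other segments")
    print("recrawled" if ratio < 0.80 else "kept (above the 80% threshold)")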


> I have a student working on verifying that in general the segments are
> all equally representative of the whole month, and the above came as a bit
> of an unpleasant surprise.

In addition to the above limitations, the URLs are not randomly distributed
over segments during fetch list generation, in order to minimize the need
for DNS look-ups and robots.txt fetching/parsing given the fully distributed
architecture of Nutch without any central caches. If there are 100 URLs
sampled from one host, these 100 URLs should end up in a single segment
rather than being spread over all 100 segments. The current configuration
keeps 180 URLs per host together in one segment before "opening" the next
segment for URLs from the same host.
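
A toy illustration of that per-host clustering (this is not Nutch's actual
generator code; the hash and the starting segment per host are assumptions
made only for the sketch):

    import hashlib

    NUM_SEGMENTS = 100
    URLS_PER_HOST_PER_SEGMENT = 180   # "keeps 180 URLs per host together"

    def segment_for(host, url_index_within_host):
        # Toy model: a host starts in some segment (here derived from a hash)
        # and every 180th URL of that host "opens" the next segment.
        start = int(hashlib.md5(host.encode('utf-8')).hexdigest(), 16) % NUM_SEGMENTS
        return (start + url_index_within_host // URLS_PER_HOST_PER_SEGMENT) % NUM_SEGMENTS

    # e.g. URLs 0..179 of one host share a segment, URLs 180..359 land in the next.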

All URLs of a host end up in a single partition (per-task fetch list)
within a segment. This makes it possible to ensure politeness and a
guaranteed crawl delay within a single Java process (one fetcher task).
The section about Nutch in Tom White's book "Hadoop: The Definitive Guide"
(1st, 2nd or 3rd edition) describes in more detail how fetch lists are
generated.
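
Schematically, the per-host partitioning inside a segment amounts to
something like the following (the hash used in Nutch differs; this is only
to show the idea that all URLs of a host map to the same fetcher task):

    import hashlib
    from urllib.parse import urlsplit

    def fetcher_partition(url, num_fetcher_tasks):
        # All URLs of a host map to the same partition, so politeness and
        # crawl delay can be enforced inside that single fetcher process.
        host = urlsplit(url).hostname or ''
        return int(hashlib.sha1(host.encode('utf-8')).hexdigest(), 16) % num_fetcher_tasks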

Also note that the page captures are shuffled using a pseudo-random hash
of the URL before they're written into the WARC files.
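
In other words, the write order inside a WARC file is roughly as below
(the concrete hash function here is an assumption):

    import hashlib

    captured_urls = ['http://example.com/a', 'http://example.org/b', 'http://example.net/c']
    # Captures are written in the order of a pseudo-random hash of their URL.
    write_order = sorted(captured_urls,
                         key=lambda u: hashlib.sha1(u.encode('utf-8')).digest())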


Best,
Sebastian