There are two factors which influence the size of a segment
(or the number of page captures in a segment):
1. the crawler runs on a Hadoop cluster built on EC2 spot instances.
This means that at any point in time nodes of the cluster can be lost
together with the fetcher tasks running on those nodes, including the
temporary data (page captures) the tasks hold. The tasks are then
restarted on other nodes of the cluster, but because the fetching
is bound to a fixed time frame (3 hours), the restarted tasks are
usually able to fetch fewer pages. Not ideal, but there is a good
chance that the missing pages are fetched in the next month.
In addition, because the loss of spot instances happens at random,
you might think of it as part of the URL sampling process.
Only if a segment gets too small (<80% of the average) is the entire
segment recrawled. That's the reason why some segments can
occasionally be smaller than the average of a monthly crawl.
2. because fetch lists are generated ahead of time for all segments,
there is a continuous drop in the number of successfully fetched pages
over time, so the last crawled segment is 1-2% smaller than the first one.
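To make the first point concrete, here is a minimal sketch of the ">= 80%
of the average" rule described above. The function name, the threshold
parameter and the sample sizes are illustrative, not taken from the actual
pipeline code:

```python
# Sketch: decide which segments must be recrawled under the assumed
# rule that a segment smaller than 80% of the average is redone.

def segments_to_recrawl(segment_sizes, threshold=0.8):
    """Return indices of segments smaller than threshold * average size."""
    avg = sum(segment_sizes) / len(segment_sizes)
    return [i for i, n in enumerate(segment_sizes) if n < threshold * avg]

sizes = [50_000, 52_000, 30_000, 51_000]  # page captures per segment
print(segments_to_recrawl(sizes))  # → [2], the only segment below 80% of avg
```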
> I have a student working on verifying that in general the segments are
> all equally representative of the whole month, and the above came as a bit
> of an unpleasant surprise.
In addition to the above limitations, the URLs are not randomly distributed
over segments during fetch list generation, in order to minimize the need
for DNS look-ups and robots.txt fetching/parsing given the fully distributed
architecture of Nutch without any central caches. If there are 100 URLs
sampled from one host, these 100 URLs should end up in a single segment
and not be spread over all 100 segments. The current configuration keeps
180 URLs per host together in one segment before "opening" the next segment
for URLs from the same host.
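The per-host grouping can be sketched roughly like this. This is not
Nutch's actual generator code; the function and its parameter are
illustrative, assuming only the "180 URLs per host per segment" behavior
described above:

```python
# Sketch of per-host grouping during fetch list generation: chunks of
# 180 URLs from the same host stay together in one segment before the
# next segment is "opened" for that host. Illustrative names only.
from collections import defaultdict
from urllib.parse import urlparse

def assign_segments(urls, urls_per_host_per_segment=180):
    """Map each URL to a segment index, keeping runs of URLs from the
    same host together instead of spreading them over all segments."""
    seen_per_host = defaultdict(int)
    assignment = {}
    for url in urls:
        host = urlparse(url).netloc
        assignment[url] = seen_per_host[host] // urls_per_host_per_segment
        seen_per_host[host] += 1
    return assignment
```

With this scheme the first 180 URLs of a host land in segment 0, the next
180 in segment 1, and so on, so DNS and robots.txt state for a host is
needed by as few tasks as possible.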
All URLs of a host end up in a single partition (per-task fetch list)
within a segment. This makes it possible to ensure politeness and a
guaranteed crawl delay within a single Java process (one fetcher task).
The section about Nutch in Tom White's book "Hadoop: The Definitive Guide"
(1st, 2nd or 3rd edition) describes fetch list generation in more detail.
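Why one process per host makes politeness easy can be seen in a few lines:
a single process can track the last fetch time per host and sleep until
the crawl delay has passed, with no cross-process coordination. This is
only a sketch of the idea, not Nutch's fetcher:

```python
# Minimal per-host politeness sketch for a single process: remember the
# time of the last request to each host and wait out the crawl delay.
import time

class PoliteFetcher:
    def __init__(self, crawl_delay=1.0):
        self.crawl_delay = crawl_delay
        self.last_fetch = {}  # host -> monotonic timestamp of last request

    def wait_turn(self, host):
        """Block until at least crawl_delay seconds since the last fetch
        from this host, then record the new fetch time."""
        elapsed = time.monotonic() - self.last_fetch.get(host, float("-inf"))
        if elapsed < self.crawl_delay:
            time.sleep(self.crawl_delay - elapsed)
        self.last_fetch[host] = time.monotonic()
```

If the same host were split across many tasks, each task would need a
shared view of these timestamps, which is exactly the central state the
fully distributed setup avoids.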
Also note that the page captures are shuffled using a pseudo-random hash
of the URL before they're written into the WARC files.
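The effect of that shuffle can be sketched as follows. The concrete hash
function here (MD5) is an assumption, picked only as a stable stand-in
for whatever hash the actual pipeline uses:

```python
# Sketch: ordering records by a hash of the URL shuffles them
# pseudo-randomly but deterministically before the WARC files are written.
import hashlib

def shuffle_by_url_hash(captures):
    """Sort (url, payload) records by an MD5 hash of the URL; the result
    looks random with respect to host/path order but is reproducible."""
    return sorted(captures,
                  key=lambda rec: hashlib.md5(rec[0].encode()).hexdigest())
```

The practical consequence is that consecutive records in a WARC file are
not grouped by host, even though fetching itself was host-grouped.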