Incomplete crawl of a specific website?

Tim Allison

Jan 29, 2020, 11:37:56 AM
to Common Crawl
First, I LOVE CommonCrawl!  This is NOT a complaint!  Many, many thanks for all that you do!


I'm trying to figure out why there are so few documents for a specific site, and I'd like to make sure I'm understanding correctly what might be going on.

First, I'm looking in the December 2019 index for the pages on the site: https://jpl.nasa.gov.  This is a Drupal site that relies heavily on JavaScript.  We had to use headless Chrome to get close to a useful crawl of this site on a different project.

Google estimates 1.2 million files with this query: site:jpl.nasa.gov
Bing estimates 1.8 million files with the same query.

There are only 14,505 files returned by this query: https://index.commoncrawl.org/CC-MAIN-2019-51-index?url=*.jpl.nasa.gov%2F*&output=json

Is the JavaScript likely the problem?  Does CommonCrawl limit the depth?  Are there other limitations in a crawl?

Thank you, again!

Cheers,

      Tim

Tom Morris

Jan 29, 2020, 12:52:23 PM
to common...@googlegroups.com
On Wed, Jan 29, 2020 at 11:37 AM Tim Allison <talliso...@gmail.com> wrote:

Is the javascript likely the problem? 

It may contribute, but it's unlikely to be the main cause.
 
Does CommonCrawl limit the depth?  Are there other limitations in a crawl?

The crawl is limited by time and cost. If you check the archives, you'll find a number of explanations from Sebastian.

Tom 

Sebastian Nagel

Jan 29, 2020, 1:03:03 PM
to common...@googlegroups.com
Hi Tim,

larger result sets are delivered in pages, so there are actually many more page captures than the single result page you looked at.

https://index.commoncrawl.org/CC-MAIN-2019-51-index?url=jpl.nasa.gov&matchType=domain&showNumPages=true

returns

{"pageSize": 5, "blocks": 43, "pages": 9}

which means 9 result pages of up to 5 blocks each, 43 blocks in total. One block contains 3,000
items (43 × 3,000 ≈ 129,000, which matches the capture count below) - but the first and last block may also include URLs from other domains.

Please have a look at the CDX server API documentation
https://github.com/webrecorder/pywb/wiki/CDX-Server-API
for how to iterate over pages. The CDX format is described at
https://github.com/webrecorder/pywb/wiki/CDX-Index-Format
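
For illustration, here is a minimal Python sketch of that pagination loop (my own sketch, not part of the official tooling; it assumes the requests package and uses the page/showNumPages parameters described in the CDX server API doc above):

import json
import requests

API = "https://index.commoncrawl.org/CC-MAIN-2019-51-index"
query = {"url": "jpl.nasa.gov", "matchType": "domain", "output": "json"}

# Ask the server how many result pages this query has.
num_pages = requests.get(API, params={**query, "showNumPages": "true"}).json()["pages"]

captures = []
for page in range(num_pages):
    # Each result page is newline-delimited JSON, one object per capture.
    resp = requests.get(API, params={**query, "page": page})
    for line in resp.text.splitlines():
        captures.append(json.loads(line))

print(len(captures), "captures")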

Alternatively, you could use Greg's CDX toolkit
https://pypi.org/project/cdx-toolkit/
or Ilya's CDX client
https://github.com/ikreymer/cdx-index-client
to get everything in one go, e.g.:

$> cdxt --cc --from 20191201 --to 20191231 iter '*.jpl.nasa.gov/*'

This will give you all 128,000 URLs from jpl.nasa.gov crawled in December 2019.
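
If you prefer to stay in Python, cdx-toolkit also has a library interface; a rough sketch based on my reading of the project README (treat the parameter names as an assumption and double-check them on the project page above):

import cdx_toolkit

# 'cc' selects the Common Crawl indexes; from_ts/to restrict to December 2019.
cdx = cdx_toolkit.CDXFetcher(source='cc')

count = 0
for obj in cdx.iter('*.jpl.nasa.gov/*', from_ts='20191201', to='20191231'):
    count += 1
print(count, 'captures')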

That said, this number is still smaller by a factor of ten than the estimates from Bing and Google.
The main reason is that we have to sample: there are limits on the number of pages crawled per
domain and per host.

But I will have a look at the coverage over time, combining multiple
monthly crawls. I recently did this for the .edu top-level domain,
and it's easy to repeat for "nasa.gov" or "jpl.nasa.gov". I'll come back to this thread tomorrow.

Best,
Sebastian

Tim Allison

Jan 29, 2020, 1:30:30 PM
to Common Crawl
And that's exactly why I checked...to make sure I wasn't profoundly misunderstanding the index results...which I was!  THANK YOU!

To confirm I'm on the right track...

./cdx-index-client.py -c CC-MAIN-2019-51 '*.jpl.nasa.gov'


cat domain-jpl* | wc -l


correctly yields 128,406


If we focus on PDFs, though...


grep "application/pdf" domain-jpl* | wc -l


That yields 7 PDFs, but Google claims 51,000 and Bing claims 64,000 ("site:jpl.nasa.gov filetype:pdf").
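
(Aside: the same check can likely be done server-side with the CDX API's filter parameter instead of grep; a minimal sketch for a single result page, with the filter syntax taken as an assumption from the pywb doc linked earlier, and the pagination loop from Sebastian's reply still needed to cover all pages:)

import requests

API = "https://index.commoncrawl.org/CC-MAIN-2019-51-index"
params = {
    "url": "*.jpl.nasa.gov/*",
    "output": "json",
    # Match on the 'mime' field; exact-match and regex variants of the
    # filter syntax are described in the pywb CDX server API doc.
    "filter": "mime:application/pdf",
    "page": 0,
}
resp = requests.get(API, params=params)
pdf_captures = [line for line in resp.text.splitlines() if line]
print(len(pdf_captures), "PDF captures on result page 0")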


I'm totally ok with Common Crawl missing this content.  I just want to make sure that I'm understanding this correctly.


Thank you!

Sebastian Nagel

Jan 31, 2020, 11:36:05 AM
to common...@googlegroups.com
Hi Tim,

> That yields 7 PDFs, ...
> I'm totally ok with Common Crawl missing this content.

And we still have dedicated datasets planned to cover PDFs and similar formats. Please see
https://groups.google.com/d/topic/common-crawl/cWIgP8yswzs/discussion

Comments are welcome! Meanwhile, the URL indexes indicate whether the payload of a capture is
truncated or not (about 25% of PDFs are truncated). That will help in planning these crawls.

> correctly yields 128,406

Here are the numbers over all 2019 crawls for "jpl.nasa.gov":
- about 536,000 successful page captures
- covering about 361,000 distinct URLs (HyperLogLog estimate)
- from 271 hosts

The top 10 hosts:

n_captures  uniq_urls_estim  url_host_name
    125470            78038  mars.jpl.nasa.gov
     81746            69361  podaac.jpl.nasa.gov
     76613            33078  www.jpl.nasa.gov
     73906            73956  ssd.jpl.nasa.gov
     54341            32440  photojournal.jpl.nasa.gov
     33279            27026  trs.jpl.nasa.gov
     18304            15076  pds-imaging.jpl.nasa.gov
     13143             6696  edrn.jpl.nasa.gov
      5834             4130  marsprogram.jpl.nasa.gov
      5468             1959  sealevel.jpl.nasa.gov


Ping me if you need the entire list or more metrics.
The listing above was calculated by querying the columnar index with Athena:

SELECT url_host_name,
       COUNT(*) AS n_captures,
       cardinality(approx_set(url)) AS uniq_urls_estim
FROM "ccindex"."ccindex"
WHERE subset = 'warc'
  AND crawl LIKE 'CC-MAIN-2019-%'
  AND url_host_registered_domain = 'nasa.gov'
  AND url_host_3rd_last_part = 'jpl'
GROUP BY url_host_name
ORDER BY COUNT(*) DESC;

Best,
Sebastian


