Hi Tim,
> That yields 7 PDFs, ...
> I'm totally ok with Common Crawl missing this content.
And we still have dedicated datasets planned to cover PDFs and similar. Please see
https://groups.google.com/d/topic/common-crawl/cWIgP8yswzs/discussion
Comments are welcome! Meanwhile, the URL indexes indicate whether the payload of a capture is
truncated (about 25% of PDFs are truncated). That'll help with planning these crawls.
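For PDFs specifically, something along the following lines should work as a starting point. It's an untested sketch that assumes the columnar index columns content_mime_detected and content_truncated (the truncation column may not be populated for older crawls):

-- count jpl.nasa.gov PDF captures by truncation status;
-- content_truncated is empty/NULL for complete payloads,
-- otherwise it names the truncation reason (e.g. 'length')
SELECT content_truncated,
       COUNT(*) AS n_captures
FROM "ccindex"."ccindex"
WHERE subset = 'warc'
  AND crawl = 'CC-MAIN-2019-51'
  AND url_host_registered_domain = 'nasa.gov'
  AND url_host_3rd_last_part = 'jpl'
  AND content_mime_detected = 'application/pdf'
GROUP BY content_truncated;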
> correctly yields 128,406
Here are the numbers over all 2019 crawls for "jpl.nasa.gov" (a query sketch to reproduce them follows the list):
- 536 thousand successful page captures
- covering 361 thousand distinct URLs (HyperLogLog estimate)
- from 271 hosts
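These aggregates can be reproduced with a query along these lines (untested sketch, same host filter as the per-host query further down):

SELECT COUNT(*)                      AS n_captures,
       cardinality(approx_set(url))  AS uniq_urls_estim,
       COUNT(DISTINCT url_host_name) AS n_hosts
FROM "ccindex"."ccindex"
WHERE subset = 'warc'
  AND crawl LIKE 'CC-MAIN-2019-%'
  AND url_host_registered_domain = 'nasa.gov'
  AND url_host_3rd_last_part = 'jpl';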
The top 10 hosts:
n_captures  uniq_urls_estim  url_host_name
    125470            78038  mars.jpl.nasa.gov
     81746            69361  podaac.jpl.nasa.gov
     76613            33078  www.jpl.nasa.gov
     73906            73956  ssd.jpl.nasa.gov
     54341            32440  photojournal.jpl.nasa.gov
     33279            27026  trs.jpl.nasa.gov
     18304            15076  pds-imaging.jpl.nasa.gov
     13143             6696  edrn.jpl.nasa.gov
      5834             4130  marsprogram.jpl.nasa.gov
      5468             1959  sealevel.jpl.nasa.gov
Ping me if you need the entire list or more metrics.
The listing above was calculated by querying the columnar index with Athena, using the following query:
SELECT url_host_name,
       COUNT(*) AS n_captures,
       cardinality(approx_set(url)) AS uniq_urls_estim
FROM "ccindex"."ccindex"
WHERE subset = 'warc'
  AND crawl LIKE 'CC-MAIN-2019-%'
  AND url_host_registered_domain = 'nasa.gov'
  AND url_host_3rd_last_part = 'jpl'
GROUP BY url_host_name
ORDER BY COUNT(*) DESC;
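And since the thread is about PDFs: a variant of the same query grouped by detected MIME type would show how PDFs compare to the other content types on these hosts (again an untested sketch, assuming the content_mime_detected column):

-- top 20 detected MIME types for jpl.nasa.gov captures
SELECT content_mime_detected,
       COUNT(*) AS n_captures
FROM "ccindex"."ccindex"
WHERE subset = 'warc'
  AND crawl LIKE 'CC-MAIN-2019-%'
  AND url_host_registered_domain = 'nasa.gov'
  AND url_host_3rd_last_part = 'jpl'
GROUP BY content_mime_detected
ORDER BY COUNT(*) DESC
LIMIT 20;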
Best,
Sebastian
On 1/29/20 7:30 PM, Tim Allison wrote:
> And that's exactly why I checked...to make sure I wasn't profoundly misunderstanding the index
> results...which I was! THANK YOU!
>
> To confirm I'm on the right track...
>
> ./cdx-index-client.py -c CC-MAIN-2019-51 *.jpl.nasa.gov
>
>
> cat domain-jpl* | wc -l
>
>
> correctly yields 128,406
>
>
> If we focus on PDFs, though...
>
>
> grep "application/pdf" domain-jpl* | wc -l
>
>
> That yields 7 PDFs, but Google claims 51,000 and Bing claims 64,000 ("site:jpl.nasa.gov filetype:pdf").
>
>
> I'm totally ok with Common Crawl missing this content. I just want to make sure that I'm
> understanding this correctly.
>
>
> Thank you!
>
>
> On Wednesday, January 29, 2020 at 11:37:56 AM UTC-5, Tim Allison wrote:
>
> First, I LOVE CommonCrawl! This is NOT a complaint! Many, many thanks for all that you do!
>
>
> I'm trying to figure out why there are so few documents for a specific site, and I'd like to
> make sure I'm understanding correctly what might be going on.
>
> First, I'm looking in the December 2019 index for the pages on the site: https://jpl.nasa.gov
> This is a Drupal site that relies heavily on javascript. We had to use headless chrome to
> approach a useful crawl of this site on a different project.
>
> Google estimates 1.2 million files with this query: site:jpl.nasa.gov
> <https://index.commoncrawl.org/CC-MAIN-2019-51-index?url=*.jpl.nasa.gov%5C%2F*+&output=json>
>
> Is the javascript likely the problem? Does CommonCrawl limit the depth? Are there other
> limitations in a crawl?
>
> Thank you, again!
>
> Cheers,
>
> Tim
>
>