Having problems with the index

Hyler Tasman

Feb 22, 2024, 10:39:02 AM

to Common Crawl

I'm trying to look up crawled pages in the index by running a command such as:

curl 'https://index.commoncrawl.org/CC-MAIN-2023-50-index?url=https%3A%2F%2Fnews.ycombinator.com%2F*&output=json&pageSize=5&page=0' \ > test

This sometimes results in the error:

curl: (18) transfer closed with outstanding read data remaining

curl: (3) URL using bad/illegal format or missing URL

I tried using a variety of machines to access it and got other similar errors:

curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 0
curl: (3) URL rejected: Malformed input to a URL function

curl: (56) OpenSSL SSL_read: error:0A000126:SSL routines::unexpected eof while reading, errno 0
curl: (3) URL using bad/illegal format or missing URL

The downloaded file usually contains either an incomplete result or a 504 Gateway Timeout error page.

If I open the URL directly in a browser, it sometimes works and other times returns a 504 error. It behaved this way a few weeks ago too, but back then the curl download would eventually succeed after a few retries; now it doesn't work at all.

My understanding is that this is the correct way to access the index, but it seems to not be working. I also tried wget but it's the same. Do you have any suggestions on the best way to retrieve the index?

Thanks

Greg Lindahl

Feb 22, 2024, 3:33:10 PM

to common...@googlegroups.com

Hyler,

Unfortunately our CDX index server has been ill for quite a while, and
we haven't had time to look at it. Apparently I need some better rate
limiting for the users pounding it.

We have a second index, the columnar index, which you can either
download as Parquet files or query directly via a hosted service such as
Amazon Athena. It rarely has traffic problems.
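
For anyone wanting a starting point, here is a rough sketch of what the equivalent lookup might look like against the columnar index. It only builds the SQL text; it does not talk to Athena. The table name `ccindex.ccindex` and the column names (`url_host_name`, `warc_filename`, `warc_record_offset`, `warc_record_length`, `crawl`, `subset`) are assumptions based on the commonly documented setup, so check them against your own table definition. The helper name `build_athena_query` is made up for illustration.

```python
import re


def build_athena_query(crawl: str, host: str, limit: int = 5) -> str:
    """Build an SQL string listing captures of one host in one crawl.

    Assumes the columnar index is registered as ccindex.ccindex and
    uses the column names from the usual setup (not verified here).
    """
    # The values are spliced into SQL text, so validate them first.
    if not re.fullmatch(r"CC-MAIN-\d{4}-\d{2}", crawl):
        raise ValueError(f"unexpected crawl id: {crawl!r}")
    if not re.fullmatch(r"[A-Za-z0-9.-]+", host):
        raise ValueError(f"unexpected host name: {host!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")

    # Filtering on crawl and subset restricts the scan to one partition,
    # which keeps the amount of data read (and billed) small.
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        "FROM ccindex.ccindex\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND url_host_name = '{host}'\n"
        f"LIMIT {limit}"
    )


print(build_athena_query("CC-MAIN-2023-50", "news.ycombinator.com"))
```

The resulting text can be pasted into the Athena query editor, or adapted for any other engine that reads the Parquet files directly.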


Hyler Tasman

Feb 25, 2024, 10:07:54 PM

to Common Crawl

This is very useful, thank you!

Greg Lindahl

Feb 25, 2024, 10:18:30 PM

to common...@googlegroups.com

I took a whack at the CDX index server and while its webpages are
still ill, the CDX index kind of works.
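
If you do keep using the CDX endpoint, spacing out retries is kinder to the server than hammering it. Below is a minimal sketch using only the Python standard library: it rebuilds the query URL from the original post and retries with exponential backoff on timeouts, dropped connections, and 5xx responses. The function names, the retry count, and the delays are illustrative assumptions, not recommended settings.

```python
import http.client
import time
import urllib.error
import urllib.parse
import urllib.request


def build_cdx_url(crawl: str, url_pattern: str, page_size: int = 5, page: int = 0) -> str:
    """Build a CDX index query URL like the one in the original post."""
    query = urllib.parse.urlencode(
        {"url": url_pattern, "output": "json", "pageSize": page_size, "page": page},
        safe="*",  # keep the trailing wildcard readable
    )
    return f"https://index.commoncrawl.org/{crawl}-index?{query}"


def fetch_with_retry(url, attempts=5, base_delay=2.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, waiting base_delay * 2**n seconds after the n-th failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url, timeout=60) as response:
                return response.read()
        except urllib.error.HTTPError as exc:
            if exc.code < 500:
                raise  # a 4xx will not get better by retrying
            last_error = exc
        except (OSError, http.client.HTTPException) as exc:
            # Covers timeouts, resets, and truncated bodies (IncompleteRead).
            last_error = exc
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Example use (performs real network requests, so it is left commented out):
# url = build_cdx_url("CC-MAIN-2023-50", "https://news.ycombinator.com/*")
# with open("test", "wb") as out:
#     out.write(fetch_with_retry(url))
```

The `opener` and `sleep` parameters exist only so the retry logic can be exercised without touching the network.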

Still, we recommend downloading the Parquet files or using Athena.
