Having problems with the index


Hyler Tasman

Feb 22, 2024, 10:39:02 AM
to Common Crawl
I'm trying to look up crawled pages through the index by running a command such as:

This sometimes results in the error:

curl: (18) transfer closed with outstanding read data remaining
curl: (3) URL using bad/illegal format or missing URL

I tried using a variety of machines to access it and got other similar errors:

curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 0
curl: (3) URL rejected: Malformed input to a URL function

curl: (56) OpenSSL SSL_read: error:0A000126:SSL routines::unexpected eof while reading, errno 0
curl: (3) URL using bad/illegal format or missing URL

The downloaded file usually contains either an incomplete result or a 504 Gateway Timeout error page.

If I visit the URL directly in the browser, it sometimes works and other times returns a 504 error. It behaved this way before as well, but a few weeks ago the curl download would eventually succeed after a few retries. Now it doesn't work at all.

My understanding is that this is the correct way to access the index, but it doesn't seem to be working. I also tried wget, with the same result. Do you have any suggestions on the best way to retrieve the index?

Thanks

Greg Lindahl

Feb 22, 2024, 3:33:10 PM
to common...@googlegroups.com
Hyler,

Unfortunately our CDX index server has been ill for quite a while, and
we haven't had time to look at it. Apparently I need some better rate
limiting for the users pounding it.
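
For anyone who still needs the CDX server, a client that backs off between
attempts puts less load on it than an immediate retry loop. The following is
only a minimal sketch, not an official client: the function names, the set of
retryable status codes, and the delay values are all assumptions made for
illustration.

```python
import http.client
import random
import time
import urllib.error
import urllib.request

# Status codes treated as transient. This set is an assumption; adjust as needed.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def backoff_delays(retries, base=2.0, cap=60.0):
    """Yield one delay per retry: base, 2*base, 4*base, ... capped at `cap` seconds."""
    for attempt in range(retries):
        yield min(cap, base * (2 ** attempt))


def _default_fetch(url):
    """Download the whole response body with the standard library."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def fetch_with_backoff(url, retries=5, fetch=_default_fetch, sleep=time.sleep):
    """Try once, then retry up to `retries` times, sleeping longer each time."""
    last_error = None
    for delay in [0.0, *backoff_delays(retries)]:
        if delay:
            # Up to one second of random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0.0, 1.0))
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Errors such as 404 will not go away on retry, so re-raise them.
            if err.code not in RETRYABLE_STATUS:
                raise
            last_error = err
        except (OSError, http.client.HTTPException) as err:
            # Covers dropped connections, TLS read failures, and truncated bodies.
            last_error = err
    raise last_error
```

The `fetch` and `sleep` parameters exist only so the retry logic can be
exercised without touching the network.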

We have a second index, the columnar index, which you can either
download as Parquet files or query directly via a hosted service such as
Amazon Athena. It rarely has traffic problems.
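
As a rough illustration of the Athena route, the sketch below only builds the
SQL text for a lookup by registered domain. The table name "ccindex"."ccindex",
the partition columns (crawl, subset), and the other column names are
assumptions based on the commonly published columnar index schema and are not
stated in this thread, so check them against the current documentation. The
helper name and the example crawl id are hypothetical.

```python
import re

# Crawl ids are assumed to look like CC-MAIN-2024-10.
_CRAWL_ID = re.compile(r"CC-MAIN-\d{4}-\d{2}")


def build_columnar_query(domain, crawl, limit=100):
    """Return SQL that lists WARC locations for pages under `domain` in one crawl."""
    if not _CRAWL_ID.fullmatch(crawl):
        raise ValueError(f"unexpected crawl id: {crawl!r}")
    # Double any single quotes so the domain cannot break out of the SQL literal.
    safe_domain = domain.replace("'", "''")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        'FROM "ccindex"."ccindex"\n'          # assumed table name
        f"WHERE crawl = '{crawl}'\n"          # assumed partition column
        "  AND subset = 'warc'\n"             # assumed partition column
        f"  AND url_host_registered_domain = '{safe_domain}'\n"
        f"LIMIT {int(limit)}"
    )


print(build_columnar_query("example.com", "CC-MAIN-2024-10", limit=10))
```

Filtering on the partition columns first keeps the scan limited to one crawl,
which matters because Athena bills by the amount of data scanned.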

Hyler Tasman

Feb 25, 2024, 10:07:54 PM
to Common Crawl
This is very useful, thank you!

Greg Lindahl

Feb 25, 2024, 10:18:30 PM
to common...@googlegroups.com
I took a whack at the CDX index server and while its webpages are
still ill, the CDX index kind of works.
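
Since responses from the CDX server can still arrive truncated or as a 504
page, it may help to parse them defensively. The sketch below assumes the
response was requested as one JSON object per line, which is an assumption
here because the original command is not shown in this thread. The function
name is hypothetical.

```python
import json


def parse_cdx_json_lines(body):
    """Split a CDX response into parsed records and a count of unusable lines.

    A non-zero count suggests a truncated transfer or an HTML error page,
    so the caller can decide whether to retry.
    """
    records = []
    bad_lines = 0
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A cut-off final line or an error page ends up here.
            bad_lines += 1
            continue
        if isinstance(record, dict):
            records.append(record)
        else:
            bad_lines += 1
    return records, bad_lines
```

A caller could treat any result with `bad_lines > 0` as incomplete and fetch
it again rather than silently working with partial data.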

Still, we recommend downloading the Parquet files or using Athena.