[Feature] Getting latest index

mar...@sweepatic.com

Jul 4, 2017, 1:33:43 PM
to Common Crawl
Hi guys,
I'm currently working on an automation framework that uses the CC CDX server to pull crawled data. Since new indexes are published frequently, I'm trying to avoid hardcoding the name of the current latest index in the code. I was wondering if there is an API call that could return the name of the latest fresh index on CDX server?
I looked into the CC CLI client, which parses the HTML page to find the indexes. IMHO that is not a very elegant solution, so it looks like there is no actual API to fetch the list. If so, I have two suggestions. One is an API call in JSON, XML, or whatever format that returns the list of indexes with timestamps, which could be used either to pick out the latest one or to get the whole list. Another solution might be to use URL alias/redirection like http://index.commoncrawl.org/CC-MAIN-latest that would proxy the request to the latest index. Just my two cents on the parts I found not very developer friendly.

That would remove a totally unnecessary dependency from the official CC CDX client (an HTML parser used only to get the list of indexes) and would also make the client more developer friendly for others.
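
To make the first suggestion concrete, here is a minimal sketch of how a client could pick the latest index from a JSON listing without any HTML parsing. The response shape (a list of objects with an `id` field), the sample ids, and the function name `latest_index` are assumptions for illustration only; no such endpoint is confirmed in this thread.

```python
import re

# Hypothetical listing, shaped like the JSON response proposed above.
# The ids are illustrative; a real endpoint might use other field names.
SAMPLE_LISTING = [
    {"id": "CC-MAIN-2017-17"},
    {"id": "CC-MAIN-2017-26"},
    {"id": "CC-MAIN-2016-50"},
]

# Assumed naming scheme: CC-MAIN-<year>-<week>.
_INDEX_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def latest_index(listing):
    """Return the id with the highest (year, week); skip ids that do not match."""
    dated = []
    for entry in listing:
        match = _INDEX_ID.match(entry.get("id", ""))
        if match:
            year, week = int(match.group(1)), int(match.group(2))
            dated.append(((year, week), entry["id"]))
    if not dated:
        raise ValueError("no CC-MAIN-YYYY-WW index ids in listing")
    return max(dated)[1]
```

Sorting on the parsed (year, week) pair rather than on a timestamp field keeps the sketch independent of whatever extra metadata such a listing might carry.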
Anyway, thank you guys for running this awesome project that is really helpful when conducting research!

PS: Should this be posted on Github as feature request instead? github.com/ikreymer/cc-index-server looks a little outdated

Sebastian Nagel

Jul 4, 2017, 4:23:52 PM
to common...@googlegroups.com
Hi Martin,

> I was wondering if there is an API call that could return
> the name of the latest fresh index on CDX server?

I don't know about an API call. But yes, of course, that's a good idea. I plan to add a similar
mechanism to find the latest crawl archive on the public data set bucket s3://commoncrawl/.

> another solution might be to use URL alias/redirection
> like http://index.commoncrawl.org/CC-MAIN-latest that would proxy the request to the latest index.

Since the client is stateful (e.g. for paging) and the state is held on the client side,
this might cause trouble if the index is switched while a paging query is being processed.
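
One way a client could sidestep this is to resolve the index name once and pin it for every page of a query. The sketch below only builds the request URLs and does no network I/O; the function name `page_urls`, the example index name, and the use of `url`, `output`, and `page` as query parameters are assumptions to be checked against the CDX server documentation.

```python
from urllib.parse import urlencode


def page_urls(index_name, url_pattern, num_pages,
              base="http://index.commoncrawl.org"):
    """Build one request URL per page, all pinned to the same index name."""
    urls = []
    for page in range(num_pages):
        # The index name is fixed before the loop starts, so a switch of
        # "latest" on the server side cannot affect later pages.
        query = urlencode({"url": url_pattern, "output": "json", "page": page})
        urls.append(f"{base}/{index_name}?{query}")
    return urls
```

For example, `page_urls("CC-MAIN-2017-26-index", "example.com/*", 3)` returns three URLs that all point at the same index, whatever a `CC-MAIN-latest` alias would resolve to by the time the last page is fetched.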

> PS: Should this be posted on Github as feature request instead? github.com/ikreymer/cc-index-server
> looks a little outdated

Please report it at
https://github.com/commoncrawl/cc-index-server
I'll then push any changes upstream to Ilya's original version.

Thanks,
Sebastian