Thanks to Scott Robertson of triv.io, Common Crawl now has a URL index! Read all about it in Scott's guest blog post: http://commoncrawl.org/common-crawl-url-index/. This is a very valuable tool, and we are very grateful to Scott for donating his time and skill to create it!
That's a great job, thank you Scott!
Just a quick question about the index size: the current file is 217 GB, but a back-of-the-envelope calculation says it should be around 437 GB.
Does the file format use compression of any kind? (It is not mentioned explicitly in the docs.) If not, what is the reason the index file is half the expected size?
Since the records are already variable size and you have them collated, it seems like you could easily implement a simple common-prefix compression scheme, i.e., store the count of characters shared with the previous URL, followed by the remaining characters. That could potentially save a ton of space.
Also, the segment date is redundant and doesn't really need to be saved for each URL.
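A rough sketch of that front-coding idea (hypothetical, not the actual index format; the function names and sample URLs are made up for illustration): each entry stores the number of leading characters shared with the previous URL in sorted order, followed by the differing suffix.

def encode(urls):
    """Front-code a sorted list of URLs into (shared, suffix) pairs."""
    encoded = []
    prev = ""
    for url in urls:
        # Count the leading characters shared with the previous URL.
        shared = 0
        for a, b in zip(prev, url):
            if a != b:
                break
            shared += 1
        encoded.append((shared, url[shared:]))
        prev = url
    return encoded

def decode(pairs):
    """Reconstruct the original URLs from (shared, suffix) pairs."""
    urls = []
    prev = ""
    for shared, suffix in pairs:
        prev = prev[:shared] + suffix
        urls.append(prev)
    return urls

urls = [
    "com.example/",
    "com.example/about",
    "com.example/about/team",
]
pairs = encode(urls)
assert decode(pairs) == urls
# pairs == [(0, 'com.example/'), (12, 'about'), (17, '/team')]

On a sorted URL list, neighbors often differ only in the last path segment, so the shared-prefix counts stay large and the stored suffixes stay short.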
I am interested in URL data only. In other words, I probably won't need anything else but the URL index. I would like to identify URLs containing certain patterns, like /string/string2/ or /string?=string2, regardless of the domain. I would also like a list of domains featuring such a string in any of their URLs. Any suggestions would be appreciated on what might be a good approach, or whether the data schema lends itself nicely to this.
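One possible approach, sketched under the assumption that you can stream the index's URLs as plain text lines (the function name and sample URLs below are made up for illustration): match the patterns with a regular expression and collect the distinct domains as you go.

import re
from urllib.parse import urlparse

# The two example patterns from the question, escaped for regex use.
PATTERN = re.compile(r"/string/string2/|/string\?=string2")

def scan(url_lines):
    """Return the matching URLs and the set of domains they occur on."""
    matches, domains = [], set()
    for line in url_lines:
        url = line.strip()
        if PATTERN.search(url):
            matches.append(url)
            domains.add(urlparse(url).netloc)
    return matches, domains

# Hypothetical sample input:
sample = [
    "http://a.example.com/string/string2/page",
    "http://b.example.org/other/path",
    "http://c.example.net/string?=string2",
]
matches, domains = scan(sample)
# matches -> the first and third URLs
# domains -> {'a.example.com', 'c.example.net'}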
Hi Scott, amazing job! When do you think the index will contain the full corpus instead of half?
Hi,
I need a dataset for a web crawler. How do I get it from Common Crawl?