<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0">
  <channel>
  <title>Common Crawl Google Group</title>
  <link>http://groups.google.com/group/common-crawl</link>
  <description>This group is for supporters and users of Common Crawl to share ideas and information.</description>
  <language>en</language>
  <item>
  <title>Valid Segments</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/2dd97fdf0f35a8e3/524d791987125ccb?show_docid=524d791987125ccb</link>
  <description>
  Hi, &lt;br&gt; I know that this topic has been announced somewhere below, but &lt;br&gt; since Q4 last year, when there were 56 valid segment IDs in this file, the &lt;br&gt; number of IDs has increased to 177. As we have worked with the &amp;quot;original&amp;quot; 56 &lt;br&gt; segments and want to continue working with them, some questions came to mind:
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/2dd97fdf0f35a8e3/524d791987125ccb?show_docid=524d791987125ccb</guid>
  <author>
  robert.meu...@gmail.com
  (Robert Meusel)
  </author>
  <pubDate>Wed, 22 May 2013 06:19:08 UT
</pubDate>
  </item>
  <item>
  <title>Re: Digest for common-crawl@googlegroups.com - 5 Messages in 1 Topic</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/da397f9a7a48c6b4/e44707200dcc2f29?show_docid=e44707200dcc2f29</link>
  <description>
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/da397f9a7a48c6b4/e44707200dcc2f29?show_docid=e44707200dcc2f29</guid>
  <author>
  t...@tomanthony.co.uk
  (Tom Anthony)
  </author>
  <pubDate>Thu, 16 May 2013 08:02:38 UT
</pubDate>
  </item>
  <item>
  <title>Re: Crawl Depth of Common Crawl</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/b3dfc1f9b464ff18/1e03f4ebe1a1cb0b?show_docid=1e03f4ebe1a1cb0b</link>
  <description>
  Tina, &lt;br&gt; &lt;p&gt;That&#39;s very kind and, of course, I can confirm it will not be distributed &lt;br&gt; onwards. &lt;br&gt; &lt;p&gt;regards &lt;br&gt; &lt;p&gt;Jason &lt;br&gt; &lt;p&gt;--- &lt;br&gt; Jason Duke &lt;br&gt; &lt;p&gt;Email: ja...@strangelogic.com &lt;br&gt; Mob: +44 (0)7595 924 934 &lt;br&gt; Twitter: @JasonD &lt;br&gt; LinkedIn: &lt;a target=&quot;_blank&quot; rel=nofollow href=&quot;http://uk.linkedin.com/in/jasonduke1&quot;&gt;[link]&lt;/a&gt; &lt;br&gt; &lt;p&gt;The information contained within this email along with any attachments are
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/b3dfc1f9b464ff18/1e03f4ebe1a1cb0b?show_docid=1e03f4ebe1a1cb0b</guid>
  <author>
  ja...@strangelogic.com
  (Jason Duke)
  </author>
  <pubDate>Wed, 15 May 2013 17:11:45 UT
</pubDate>
  </item>
  <item>
  <title>Re: Crawl Depth of Common Crawl</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/b3dfc1f9b464ff18/4779d4213f40771b?show_docid=4779d4213f40771b</link>
  <description>
  Jason - We have a rough draft of a paper summarizing these stats, but it &lt;br&gt; is not quite ready to publish. I will send you the draft by email and ask &lt;br&gt; you not to distribute it until we have the final version. &lt;br&gt; &lt;p&gt;Everyone else - stay tuned, the paper will be posted soon! &lt;br&gt; &lt;p&gt;Lisa
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/b3dfc1f9b464ff18/4779d4213f40771b?show_docid=4779d4213f40771b</guid>
  <author>
  l...@commoncrawl.org
  (Lisa Green)
  </author>
  <pubDate>Wed, 15 May 2013 17:08:14 UT
</pubDate>
  </item>
  <item>
  <title>Re: Crawl Depth of Common Crawl</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/b3dfc1f9b464ff18/5ca6044fba4d1684?show_docid=5ca6044fba4d1684</link>
  <description>
  It would be of interest to me :) &lt;br&gt; &lt;p&gt;Thanks &lt;br&gt; &lt;p&gt;Jason &lt;br&gt; &lt;p&gt;--- &lt;br&gt; Jason Duke &lt;br&gt; &lt;p&gt;Email: ja...@strangelogic.com &lt;br&gt; Mob: +44 (0)7595 924 934 &lt;br&gt; Twitter: @JasonD &lt;br&gt; LinkedIn: &lt;a target=&quot;_blank&quot; rel=nofollow href=&quot;http://uk.linkedin.com/in/jasonduke1&quot;&gt;[link]&lt;/a&gt; &lt;br&gt; &lt;p&gt;The information contained within this email along with any attachments are &lt;br&gt; confidential, may be legally privileged and/or protected by copyright. If
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/b3dfc1f9b464ff18/5ca6044fba4d1684?show_docid=5ca6044fba4d1684</guid>
  <author>
  ja...@strangelogic.com
  (Jason Duke)
  </author>
  <pubDate>Wed, 15 May 2013 16:28:33 UT
</pubDate>
  </item>
  <item>
  <title>Re: Crawl Depth of Common Crawl</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/b3dfc1f9b464ff18/f740c924a4278c84?show_docid=f740c924a4278c84</link>
  <description>
  Hi &lt;br&gt; I am not sure what metrics you would like to know. Sometimes people ask &lt;br&gt; questions about what percentage of the web we crawl. We crawl billions of &lt;br&gt; pages, but it is not really reasonable to talk about the percentage of the &lt;br&gt; web that we crawl because there is no consensus on just how big the web is.
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/b3dfc1f9b464ff18/f740c924a4278c84?show_docid=f740c924a4278c84</guid>
  <author>
  l...@commoncrawl.org
  (Lisa Green)
  </author>
  <pubDate>Wed, 15 May 2013 16:26:40 UT
</pubDate>
  </item>
  <item>
  <title>Crawl Depth of Common Crawl</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/b3dfc1f9b464ff18/ef29e87792bd4d27?show_docid=ef29e87792bd4d27</link>
  <description>
  I have read that Common Crawl says it is not a comprehensive crawl. &lt;br&gt; I would like to know how deep their crawl goes. Thank you for your &lt;br&gt; input.
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/b3dfc1f9b464ff18/ef29e87792bd4d27?show_docid=ef29e87792bd4d27</guid>
  <author>
  jhn.wood...@gmail.com
  </author>
  <pubDate>Wed, 15 May 2013 12:57:37 UT
</pubDate>
  </item>
  <item>
  <title>Re: 403 Forbidden when attempting to list s3://aws-publicdatasets/common-crawl/</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/b0a88610d6c193a8/eeab717737ce077d?show_docid=eeab717737ce077d</link>
  <description>
  Hi Andrew, &lt;br&gt; &lt;p&gt;A lot of stuff in the s3://aws-publicdatasets/common-crawl root is &lt;br&gt; temporary in nature and not world-readable. In the future, we can hopefully &lt;br&gt; move all of this stuff into a separate directory under the common-crawl &lt;br&gt; root. Unfortunately, renames are not really feasible due to the size of
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/b0a88610d6c193a8/eeab717737ce077d?show_docid=eeab717737ce077d</guid>
  <author>
  ahadr...@gmail.com
  (Ahad Rana)
  </author>
  <pubDate>Tue, 14 May 2013 21:21:30 UT
</pubDate>
  </item>
  <item>
  <title>403 Forbidden when attempting to list s3://aws-publicdatasets/common-crawl/</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/b0a88610d6c193a8/1a35acd7294f2a8f?show_docid=1a35acd7294f2a8f</link>
  <description>
  I&#39;m just getting started w/ the common crawl data set, and I can&#39;t seem to &lt;br&gt; list the common-crawl/ or common-crawl/parse-output directories from an &lt;br&gt; interactive Pig session on a stock EMR launch. &lt;br&gt; grunt&amp;gt; ls s3://aws-publicdatasets/common-crawl/ &lt;br&gt; &lt;p&gt;I have no problem listing s3://aws-publicdatasets/common-crawl/crawl-002
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/b0a88610d6c193a8/1a35acd7294f2a8f?show_docid=1a35acd7294f2a8f</guid>
  <author>
  amat...@gmail.com
  (Andrew Mattie)
  </author>
  <pubDate>Tue, 14 May 2013 20:07:04 UT
</pubDate>
  </item>
  <item>
  <title>Re: question around Terms of Use</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/58bbce23ba372aa4/be9981c6ac68e7a2?show_docid=be9981c6ac68e7a2</link>
  <description>
  I&#39;m not a lawyer, but &lt;br&gt; &lt;p&gt; that sounds like a permitted use to me. My interpretation is that the &lt;br&gt; non-sublicensable clause and the requirement to respect the sites&#39; ToUs and copyrights are intended &lt;br&gt; to prevent things like: &lt;br&gt; &lt;p&gt;- publishing a &amp;quot;Common Crawl for Movies&amp;quot; subset of the crawl (which would &lt;br&gt; also violate their Common Crawl trademark)
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/58bbce23ba372aa4/be9981c6ac68e7a2?show_docid=be9981c6ac68e7a2</guid>
  <author>
  tfmor...@gmail.com
  (Tom Morris)
  </author>
  <pubDate>Sun, 12 May 2013 14:20:41 UT
</pubDate>
  </item>
  <item>
  <title>question around Terms of Use</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/58bbce23ba372aa4/d504b233bda87c09?show_docid=d504b233bda87c09</link>
  <description>
  Hi, &lt;br&gt; I am a bit confused about the terms of use for Common Crawl data: &lt;br&gt; *Terms of Use* Please refer to the Common Crawl Terms of Use &amp;lt;&lt;a target=&quot;_blank&quot; rel=nofollow href=&quot;http://www.commoncrawl.org/about/terms-of-use/&quot;&gt;[link]&lt;/a&gt;&amp;gt; document &lt;br&gt; for a detailed, authoritative description of our Terms of Use guidelines, &lt;br&gt; but, in general, you cannot republish the data retrieved from the crawl
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/58bbce23ba372aa4/d504b233bda87c09?show_docid=d504b233bda87c09</guid>
  <author>
  dhananjaybak...@gmail.com
  </author>
  <pubDate>Sun, 12 May 2013 05:27:00 UT
</pubDate>
  </item>
  <item>
  <title>Re: textdata and streaming hadoop jobs</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/c6dfe5e14dcb0b82/d1fa50e9cf336e39?show_docid=d1fa50e9cf336e39</link>
  <description>
  Hi Mat, Ahad, thanks for the quick reply! &lt;br&gt; Yes! That looks like just what I need. &lt;br&gt; &lt;p&gt;Mat, sorry, I&#39;m not clear on how to build it. But once it&#39;s built, I believe &lt;br&gt; I can just upload it to the EMR job and use it in place of &lt;br&gt; SequenceFileAsTextInput
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/c6dfe5e14dcb0b82/d1fa50e9cf336e39?show_docid=d1fa50e9cf336e39</guid>
  <author>
  kur...@gmail.com
  (Kurt Jx)
  </author>
  <pubDate>Mon, 06 May 2013 02:14:38 UT
</pubDate>
  </item>
  <item>
  <title>Re: textdata and streaming hadoop jobs</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/c6dfe5e14dcb0b82/63e0fa15d1f69b4e?show_docid=63e0fa15d1f69b4e</link>
  <description>
  Ah, cool! Thanks Ahad!
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/c6dfe5e14dcb0b82/63e0fa15d1f69b4e?show_docid=63e0fa15d1f69b4e</guid>
  <author>
  matthew.kel...@gmail.com
  (Mat Kelcey)
  </author>
  <pubDate>Sun, 05 May 2013 16:07:49 UT
</pubDate>
  </item>
  <item>
  <title>Re: textdata and streaming hadoop jobs</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/c6dfe5e14dcb0b82/49de26bcf150b009?show_docid=49de26bcf150b009</link>
  <description>
  Hi Mat, &lt;br&gt; &lt;p&gt;NP. I pushed it into the commoncrawl repo for you a while back: &lt;br&gt; &lt;p&gt;&lt;a target=&quot;_blank&quot; rel=nofollow href=&quot;https://github.com/commoncrawl/commoncrawl/blob/master/src/main/java/org/commoncrawl/hadoop/io/mapred/EscapedNewLineSequenceFileInputFormat.java&quot;&gt;[link]&lt;/a&gt; &lt;br&gt; &lt;p&gt;Ahad.
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/c6dfe5e14dcb0b82/49de26bcf150b009?show_docid=49de26bcf150b009</guid>
  <author>
  ahadr...@gmail.com
  (Ahad Rana)
  </author>
  <pubDate>Sun, 05 May 2013 15:58:43 UT
</pubDate>
  </item>
  <item>
  <title>Re: textdata and streaming hadoop jobs</title>
  <link>http://groups.google.com/group/common-crawl/browse_thread/thread/c6dfe5e14dcb0b82/42f6465edb337037?show_docid=42f6465edb337037</link>
  <description>
  yeah, this is a known pain in the butt when it comes to sequence files and &lt;br&gt; streaming, they just don&#39;t play well together :/ &lt;br&gt; &lt;p&gt;i wrote a custom input format to handle this, but i never followed through &lt;br&gt; with the PR with ahad to get this into the common crawl code (sorry ahad! i &lt;br&gt; didn&#39;t notice your comment!)
  </description>
  <guid isPermaLink="true">http://groups.google.com/group/common-crawl/browse_thread/thread/c6dfe5e14dcb0b82/42f6465edb337037?show_docid=42f6465edb337037</guid>
  <author>
  matthew.kel...@gmail.com
  (Mat Kelcey)
  </author>
  <pubDate>Sun, 05 May 2013 15:15:42 UT
</pubDate>
  </item>
  </channel>
</rss>
