<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <id>http://groups.google.com/group/gsitecrawler</id>
  <title type="text">SOFTplus GSiteCrawler Google Group</title>
  <subtitle type="text">
  Discussion group for the GSiteCrawler, a Windows tool used to crawl websites and automatically create Google Sitemap files (and much more).
  </subtitle>
  <link href="/group/gsitecrawler/feed/atom_v1_0_msgs.xml" rel="self" title="SOFTplus GSiteCrawler feed"/>
  <updated>2009-11-20T13:30:35Z</updated>
  <generator uri="http://groups.google.com" version="1.99">Google Groups</generator>
  <entry>
  <author>
  <name>Christina S</name>
  <email>web...@gmail.com</email>
  </author>
  <updated>2009-11-20T13:30:35Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/99f5983bbb0238cd?show_docid=99f5983bbb0238cd</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/99f5983bbb0238cd?show_docid=99f5983bbb0238cd"/>
  <title type="text">Re: [GSiteCrawler] Re: Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  Hi again, &lt;br&gt; &lt;p&gt;The developers may eventually answer this. &lt;br&gt; &lt;p&gt;It doesn&#39;t detract from the good practices. &lt;br&gt; &lt;p&gt;As for the method of avoiding duplicate content URLs in the first place, what I said stands. &lt;br&gt; &lt;p&gt;If those URLs have already been indexed, then you need to: &lt;br&gt; 1) not have them in navigation where Google et al. get at them - thus modify software to not create them at all
  </summary>
  </entry>
  <entry>
  <author>
  <name>Joe Germann</name>
  <email>j...@motorheadextraordinaire.com</email>
  </author>
  <updated>2009-11-20T11:36:02Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/b0c3f7516e177fe8?show_docid=b0c3f7516e177fe8</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/b0c3f7516e177fe8?show_docid=b0c3f7516e177fe8"/>
  <title type="text">Re: [GSiteCrawler] Re: Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  Hi KKarl, &lt;br&gt; &lt;p&gt;Very interesting. I&#39;ll have to dig deeper into Google regarding &lt;br&gt; blocking access with robots.txt files. &lt;br&gt; &lt;p&gt;The issue on my web site is that the OS Commerce software kicks out a &lt;br&gt; lot of different links, some of which are to URLs that I don&#39;t &lt;br&gt; want people landing on, such as a sorted list of some random
  </summary>
  </entry>
  <entry>
  <author>
  <name>kkarl</name>
  <email>casaforte...@gmail.com</email>
  </author>
  <updated>2009-11-20T09:06:41Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/8a2ab3158afa7214?show_docid=8a2ab3158afa7214</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/8a2ab3158afa7214?show_docid=8a2ab3158afa7214"/>
  <title type="text">Re: Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  Webado &lt;br&gt; &lt;p&gt;The developers of GSC can certainly answer my question. &lt;br&gt; &lt;p&gt;Joe &lt;br&gt; &lt;p&gt;One example can show the complex issues: &lt;br&gt; &lt;p&gt;We discovered by chance that, for six websites (differing only in language), &lt;br&gt; there are a lot of duplicate URLs in Google&#39;s index: &lt;br&gt; &lt;p&gt;1. We know how this DC is generated but not why!
  </summary>
  </entry>
  <entry>
  <author>
  <name>webado</name>
  <email>web...@gmail.com</email>
  </author>
  <updated>2009-11-20T06:14:39Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/0576c5904f5064df?show_docid=0576c5904f5064df</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/0576c5904f5064df?show_docid=0576c5904f5064df"/>
  <title type="text">Re: Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  kkarl, I can&#39;t answer that. As I said I don&#39;t have any sites with this &lt;br&gt; issue that need special treatment. At most I dealt with a Zen Cart &lt;br&gt; site and since I had already noticed some useless urls (different sort &lt;br&gt; orders for instance), I had already handled them by modifying the &lt;br&gt; code to add rel=&amp;quot;nofollow&amp;quot; to those links. I didn&#39;t have to rely on
  </summary>
  </entry>
  <entry>
  <author>
  <name>Joe Germann</name>
  <email>motorheadextraordina...@gmail.com</email>
  </author>
  <updated>2009-11-19T16:44:23Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/b643492cc8e9f8fe?show_docid=b643492cc8e9f8fe</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/b643492cc8e9f8fe?show_docid=b643492cc8e9f8fe"/>
  <title type="text">Re: [GSiteCrawler] Re: Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  Kkarl, &lt;br&gt; &lt;p&gt;I am by no means an expert on GSC, but I was able to recognize URL &lt;br&gt; patterns and multiple entries that were pointing to the same &lt;br&gt; effective URL. My eCommerce PHP site code spits out quite a few &lt;br&gt; different URL link patterns. Based upon what pops up in a browser&#39;s &lt;br&gt; URL window, I chose the pattern that was consistent with the user
  </summary>
  </entry>
  <entry>
  <author>
  <name>kkarl</name>
  <email>casaforte...@gmail.com</email>
  </author>
  <updated>2009-11-19T15:59:53Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/a3c101a65c47da9c?show_docid=a3c101a65c47da9c</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/a3c101a65c47da9c?show_docid=a3c101a65c47da9c"/>
  <title type="text">Re: Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  Webado: I agree totally with your comments and hints. &lt;br&gt; &lt;p&gt;Webado &amp;amp; Joe: How do you know that the &amp;quot;identical pages listed&amp;quot; by &lt;br&gt; GSC are all of them, i.e. that GSC has recognized every duplicate content page? &lt;br&gt; That&#39;s why I am interested in knowing how GSC discovers DC and what &lt;br&gt; the criteria are! &lt;br&gt; &lt;p&gt;Thx &lt;br&gt; Kkarl
  </summary>
  </entry>
  <entry>
  <author>
  <name>Joe Germann</name>
  <email>j...@motorheadextraordinaire.com</email>
  </author>
  <updated>2009-11-19T14:27:16Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/8881dfc87f28e275?show_docid=8881dfc87f28e275</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/8881dfc87f28e275?show_docid=8881dfc87f28e275"/>
  <title type="text">Re: [GSiteCrawler] Re: Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  I let GSC crawl my entire PHP eCommerce site and then built a robots &lt;br&gt; file to filter out the duplicate castings that were generated &lt;br&gt; automagically by the eCommerce site. I now use the robots.txt file &lt;br&gt; to tell both the crawlers and GSC what to crawl. Every once in a &lt;br&gt; while I will run GSC without the robots directives to make sure that
  </summary>
  </entry>
  <entry>
  <author>
  <name>webado</name>
  <email>web...@gmail.com</email>
  </author>
  <updated>2009-11-19T14:14:05Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/6521595986f2e66f?show_docid=6521595986f2e66f</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/6521595986f2e66f?show_docid=6521595986f2e66f"/>
  <title type="text">Re: Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  That&#39;s why you have to fix your site to avoid generating pages with &lt;br&gt; largely the same content (e.g. sorted in different ways or with a &lt;br&gt; larger view of a product image). &lt;br&gt; Proper use of a robots noindex meta tag on pages you don&#39;t want &lt;br&gt; indexed, and/or rel=&amp;quot;nofollow&amp;quot; on links to alternate displays of the
  </summary>
  </entry>
  <entry>
  <author>
  <name>kkarl</name>
  <email>casaforte...@gmail.com</email>
  </author>
  <updated>2009-11-19T13:59:43Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/3e7327667ad62732?show_docid=3e7327667ad62732</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/3e7327667ad62732?show_docid=3e7327667ad62732"/>
  <title type="text">Re: Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  I believe that an answer is important because, with thousands of &lt;br&gt; duplicate pages (e.g. for shops), disabling the duplicate content greatly &lt;br&gt; reduces the size of the sitemap, and the URLs submitted to the SEs are &lt;br&gt; then essentially correct. &lt;br&gt; &lt;p&gt;To compare the content (body...) the task would be very time
  </summary>
  </entry>
  <entry>
  <author>
  <name>webado</name>
  <email>web...@gmail.com</email>
  </author>
  <updated>2009-11-19T13:25:45Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/252732558a742a2d?show_docid=252732558a742a2d</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/252732558a742a2d?show_docid=252732558a742a2d"/>
  <title type="text">Re: Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  Same text content perhaps? &lt;br&gt; &lt;p&gt;I don&#39;t know exactly because I don&#39;t have such a situation anywhere to &lt;br&gt; test it.
  </summary>
  </entry>
  <entry>
  <author>
  <name>kkarl</name>
  <email>casaforte...@gmail.com</email>
  </author>
  <updated>2009-11-19T12:33:33Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/343ea63a3f2d9308?show_docid=343ea63a3f2d9308</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/d43aefe5f5b2b084/343ea63a3f2d9308?show_docid=343ea63a3f2d9308"/>
  <title type="text">Criteria for identifying identical pages (Duplicate content URLs)</title>
  <summary type="html" xml:space="preserve">
  Hi &lt;br&gt; &lt;p&gt;What criteria does GSC use to identify URLs as pointing to the same &lt;br&gt; content? &lt;br&gt; &lt;p&gt;Thx &lt;br&gt; Kkarl
  </summary>
  </entry>
  <entry>
  <author>
  <name>Marian</name>
  <email>marian.ga...@gmail.com</email>
  </author>
  <updated>2009-11-18T17:53:57Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/73dc44cee785ffa2/52372b36fd760d5a?show_docid=52372b36fd760d5a</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/73dc44cee785ffa2/52372b36fd760d5a?show_docid=52372b36fd760d5a"/>
  <title type="text">news in GSC</title>
  <summary type="html" xml:space="preserve">
  This message is primarily for the GSC team. &lt;br&gt; &lt;p&gt;I have to say that GSC is a very good and helpful program. Thank you for &lt;br&gt; that. I have a question for you: are you preparing a new version (update) &lt;br&gt; of this program, and are you going to add new features such as SFTP? &lt;br&gt; &lt;p&gt;I think that SFTP is very important these days for the transfer of ftp
  </summary>
  </entry>
  <entry>
  <author>
  <name>MrMyckster</name>
  <email>2snap...@gmail.com</email>
  </author>
  <updated>2009-11-11T10:22:59Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/2096f03e38bfa53a/4ce2fc2e9b30c888?show_docid=4ce2fc2e9b30c888</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/2096f03e38bfa53a/4ce2fc2e9b30c888?show_docid=4ce2fc2e9b30c888"/>
  <title type="text">Sitemaps for folders</title>
  <summary type="html" xml:space="preserve">
  If you decide to separate your site into multiple sub-sites (maybe &lt;br&gt; based on sub-folders), then you can build a site-map for each such &lt;br&gt; sub-site, according to your own schedule. Again the same deal, each &lt;br&gt; site-map can be a site-map index with individual site-maps gzipped. &lt;br&gt; &lt;p&gt;Can you give me more information on how to do this?
  </summary>
  </entry>
  <entry>
  <author>
  <name>webado</name>
  <email>web...@gmail.com</email>
  </author>
  <updated>2009-11-10T13:35:24Z</updated>
  <id>http://groups.google.com/group/gsitecrawler/browse_thread/thread/e74b49b9f1f9ad64/416428a043ec806e?show_docid=416428a043ec806e</id>
  <link href="http://groups.google.com/group/gsitecrawler/browse_thread/thread/e74b49b9f1f9ad64/416428a043ec806e?show_docid=416428a043ec806e"/>
  <title type="text">Re: Heavy Site Map Issues</title>
  <summary type="html" xml:space="preserve">
  Hi Karthick, &lt;br&gt; &lt;p&gt;First of all, I have to say I have no first-hand knowledge of how to &lt;br&gt; manage such very large sites. The largest site I have has about 12000 &lt;br&gt; URLs, easily managed by GSC, though it takes 3 hours or so to recrawl. &lt;br&gt; &lt;p&gt;GSiteCrawler has the option of making sitemap indexes for multiple &lt;br&gt; sitemaps.
  </summary>
  </entry>
</feed>
