http://mysite.com/items/1234.do
But there are links to this product that may have a querystring to
denote categories, or source codes, etc. Like this:
http://mysite.com/items/1234.do?source=yoursite&category=book
I've created some exclusions for my collections that will keep these
URLs with querystrings from appearing in my search results, and it
works fine. The problem is that I need to crawl all these bad URLs
(with querystring) to find the good URLs (without querystring). So for
my 5,000 product pages, my index contains 522,000 unique product URLs
with querystrings. This pushes me over my license limit of 500k. Is
there a way I can crawl everything, but only save the URLs that have
no querystring--in the same way my collection only returns URLs
without the querystring?
That would allow me to crawl the 522,000 unique pages to find all
5,000 URLs I want to keep. Make sense? Any suggestions?
If the item pages always have a query string, maybe you can exclude
some patterns in the "don't crawl"/exclusion box.
For instance, if
http://mysite.com/items/1234.do?source=yoursite&category=book
and
http://mysite.com/items/1234.do?source=yoursite&category=children
lead to the same page/content, you could exclude all the categories
apart from one.
Otherwise, maybe you can create a seeding page which links to the item
pages without the querystring, and only index through that page.
If this is not possible, then maybe using a feed (you feeding the GSA
with the list of URLs to crawl) could solve your issue.
S.
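A minimal sketch of the seeding-page idea, assuming you can export the
list of crawled URLs (for example from a crawl log) into a Python list.
The names strip_query, canonical_urls and seed_page are made up for
illustration and are not part of the GSA. The sketch strips the
querystring from each URL, removes duplicates, and builds a plain HTML
page of links that could serve as a start URL:

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its querystring or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_urls(urls):
    """Strip querystrings and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        clean = strip_query(url)
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result


def seed_page(urls):
    """Build a bare HTML page with one link per URL."""
    links = "\n".join(
        '<a href="{0}">{0}</a><br>'.format(escape(u)) for u in urls
    )
    return "<html><body>\n" + links + "\n</body></html>"


# Hypothetical sample input, taken from the URLs in this thread.
crawled = [
    "http://mysite.com/items/1234.do?source=yoursite&category=book",
    "http://mysite.com/items/1234.do?source=yoursite&category=children",
    "http://mysite.com/items/1234.do",
]

print(seed_page(canonical_urls(crawled)))
```

The same cleaned list could also be the input for a feed instead of a
seed page.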
Go to Serving > Front Ends, select the Remove URLs tab, and enter the
appropriate regular expression(s) to match the parameters you wish to
exclude.
The one caveat to this is that the documents you exclude in this
manner still count against your document count.
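Before entering a pattern, it can help to check the regular expression
against sample URLs offline. A rough sketch in Python, assuming item
URLs look like the examples in this thread. The pattern and the
has_querystring helper are illustrative only, and the exact pattern
syntax the appliance expects should be checked in its documentation:

```python
import re

# Matches item URLs that carry a non-empty querystring.
QUERY_PATTERN = re.compile(r"^http://mysite\.com/items/[^?#]+\?.+")


def has_querystring(url):
    """Return True if the URL would be caught by the pattern."""
    return QUERY_PATTERN.match(url) is not None


samples = [
    "http://mysite.com/items/1234.do",
    "http://mysite.com/items/1234.do?source=yoursite&category=book",
]

for url in samples:
    print(url, "->", "exclude" if has_querystring(url) else "keep")
```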
To some degree my hands are tied because these pages are hosted by a
third-party. The ones I want to index do not have querystrings, but I
don't have a page that lists all those non-querystring pages. I'm
working with the third-party to get a sitemap added, but it's been a
struggle.
On Mar 6, 11:20 pm, "Seb" <sol...@gmail.com> wrote:
> If the same item pages are sometimes linked without a query string,
> you can use a "contains:source=" in the exclusions or similar.
>
> If the item pages always have a query string, maybe you can exclude
> some patterns in the "don't crawl"/exclusion box.
>
> For instance, if
>
> http://mysite.com/items/1234.do?source=yoursite&category=book
> and
>
> http://mysite.com/items/1234.do?source=yoursite&category=children