Crawl URLs but do not include in index


Ty C.

Mar 6, 2007, 7:08:21 PM
to Google Search Appliance
I'm crawling a shopping website with a mix of URLs. Each product has
its own URL, like this:

http://mysite.com/items/1234.do

But there are links to this product that may have a querystring to
denote categories, or source codes, etc. Like this:

http://mysite.com/items/1234.do?source=yoursite&category=book

I've created some exclusions for my collections that will keep these
URLs with querystrings from appearing in my search results, and it
works fine. The problem is that I need to crawl all these bad URLs
(with querystrings) to find the good URLs (without querystrings). So
instead of my 5,000 product pages, my index contains 522,000 unique
product URLs with querystrings, which pushes me over my 500k license.
Is there a way I can crawl everything but only keep the URLs that have
no querystring--the same way my collection only returns URLs without
the querystring?

That would allow me to crawl the 522,000 unique pages to find all
5,000 URLs I want to keep. Make sense? Any suggestions?

Seb

Mar 7, 2007, 1:20:07 AM
to Google Search Appliance
If the same item pages are sometimes linked to without a query string,
you can add something like "contains:source=" to the exclusions.

If the item pages always have a query string, maybe you can exclude
some patterns in the "don't crawl"/exclusion box.

For instance, if

http://mysite.com/items/1234.do?source=yoursite&category=book
and
http://mysite.com/items/1234.do?source=yoursite&category=children

lead to the same page/content, you could exclude all the categories
apart from one.
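As a sketch of that idea, the "Do Not Crawl URLs with the Following Patterns" box could carry one `contains:` pattern per duplicate category, keeping a single category (here `category=book`) as the indexed variant. The pattern below uses the category name from this thread; a real site would need one line per extra category:

```
contains:category=children
```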

Otherwise, maybe you can create a seeding page which links to the item
pages without the querystring, and only index through that page.

If this is not possible, then maybe a feed (you feed the GSA the list
of URLs to crawl) could solve your issue.
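A feed of that kind is an XML document pushed to the appliance's feed port (usually 19900). A minimal metadata-and-url feed might look like the sketch below; the datasource name and the product URL are placeholders based on this thread:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd">
<gsafeed>
  <header>
    <datasource>products</datasource>
    <feedtype>metadata-and-url</feedtype>
  </header>
  <group>
    <!-- one record per clean, querystring-free product URL -->
    <record url="http://mysite.com/items/1234.do"
            action="add" mimetype="text/html"/>
  </group>
</gsafeed>
```

With a metadata-and-url feed the GSA fetches each listed URL itself, so only the clean URLs you feed it would need to be crawled.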

S.

melissa

Mar 7, 2007, 3:43:21 PM
to Google Search Appliance
Suggest you include it in the crawl, but remove the URLs from search
results through Front End configuration.

Go to Serving > Front Ends, select the Remove URLs tab, and enter the
appropriate regular expression(s) to match the parameters you wish to
exclude.
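For example, a single pattern along these lines might match every product URL that carries a querystring (a sketch only; the path layout is taken from this thread and the expression would need adjusting to the real site):

```
regexp:items/[0-9]+\.do\?
```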

The one caveat to this is that the documents you exclude in this
manner still count against your document count.

Ty C.

Mar 7, 2007, 3:44:46 PM
to Google Search Appliance
Thanks for your feedback!

To some degree my hands are tied because these pages are hosted by a
third party. The ones I want to index do not have querystrings, but I
don't have a page that lists all those non-querystring pages. I'm
working with the third party to get a sitemap added, but it's been a
struggle.


On Mar 6, 11:20 pm, "Seb" <sol...@gmail.com> wrote:
> If the same item pages are sometimes linked without a query string,
> you can use a "contains:source=" in the exclusions or similar.
>
> If the item pages always have a query string, maybe you can exclude
> some patterns in the "don't crawl"/exclusion box.
>
> For instance, if
>
> http://mysite.com/items/1234.do?source=yoursite&category=book
> and
> http://mysite.com/items/1234.do?source=yoursite&category=children

Mirac

Feb 26, 2015, 4:52:13 PM
to Google-Search-...@googlegroups.com, Google-Sear...@googlegroups.com, tyca...@gmail.com
An alternative to excluding those URLs in the Front End configuration is to exclude them in the Collection configuration (Do Not Include Content Matching the Following Patterns).
This way those pages are still crawled, but they will not be listed in the search results for that collection.
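Under the URL layout from this thread, that "Do Not Include" box could hold a pattern such as the one below, which matches any `.do` URL that carries a querystring (a sketch; adjust to the actual paths):

```
contains:.do?
```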