Hi Justin,
You won't like this, but I'm afraid you have one or two important
issues here:
a. your appliance is not working properly and needs an upgrade to
5.0.4.G.22 plus VM Patch #1
b. you don't fully understand how to add and remove documents to and
from the index
I can't tell yet whether a. or b. or both are actually happening, but
from your last message I'm afraid of both.
Let's start with a. When the appliance works properly, any URL that
returns 404 should be removed from the index without retrying. This
takes about 30 minutes if the appliance is not busy, longer if it is.
Also, any URL that matches a pattern in the Do Not Crawl box will be
removed, not just hidden, from the index. That can take up to 6 hours
in old versions like 4.6.4, but it must happen; otherwise the
appliance is definitely in a bad state and you should not use it for
production. Resetting the index will fix it, but that's the
sledgehammer, last-resort approach, and it won't prevent the issue
from happening again.
Yet it's not clear to me that your appliance is in a bad state. When
you say "they only mask things (in results and the admin tool)", do
you mean that those URLs disappear from both search results and Crawl
Diagnostics? If so, those URLs are gone from the index for good, not
just masked or hidden. Even if you still see the URLs in Crawl
Diagnostics but they never show up in search results, they are no
longer in the index; Crawl Diagnostics just isn't aware of it yet.
You can only be certain URLs have not been removed from the index when
they do show up in search results. Is that happening to your URLs
matching Do Not Crawl patterns? If it is, are you really sure they
match the patterns? Sometimes they don't and it's not obvious; try the
pattern test box on them.
If your appliance is in a bad state, it won't crawl. So just add
something to the Start URLs and Crawl and Follow patterns and see if
it shows up in the index. If it does, the appliance may be suitable
for production; if you don't get new sites into the index, it is not.
As for b., I only see that you are using the term "hidden" for URLs
that match patterns in the Do Not Crawl box. URLs matching those
patterns should be *removed* from the index, gone, bye bye forever.
You can also hide URLs on each front end if you want, which can help
when you need to keep URLs from showing up in search results
*urgently*. However, this approach is not good for the long term, as
it degrades performance on the serving side (slower results), but it
will do for emergencies:
Serving > Front Ends > Remove URLs
http://code.google.com/intl/es/apis/searchappliance/documentation/46/help_gsa/serve_remove.html
Then there's the launch, which you want to be seamless; interesting
stuff. Right now you have http://beta.current.site.com/ crawled and
showing up in search results, and there is also http://current.site.com/
in the index with the old content. At some point, you want to switch
the DNS so that current.site.com points to where beta.current.site.com
points now. And at that point you want the appliance to have already
crawled the whole new site as http://current.site.com/ but, how could
that happen before the DNS switchover?
Easy option: use a different DNS server for the appliance and switch
it over to the new site a few days before the main, public-facing DNS,
so you have plenty of time. That would make the appliance see the new
content at http://current.site.com/ before anybody else. Then reset
the index for a fresh start crawling the new site, ensuring nothing
from the old site gets crawled. By the time the main, public-facing
DNS is switched over, the index should be good.
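In case it helps, here is one way to sketch that dedicated-resolver
trick. This is just an illustration with dnsmasq (my choice; any DNS
server that can override a single name will do), and the 10.0.0.42
address is a made-up example for wherever the new site lives:

```
# /etc/dnsmasq.conf on a small resolver used ONLY by the appliance
# (point the appliance's DNS setting at this resolver)

# Answer current.site.com with the NEW site's address, ahead of the
# public DNS switchover. 10.0.0.42 is a made-up example address.
address=/current.site.com/10.0.0.42

# Forward everything else to your regular upstream resolver
# (192.168.0.1 here is again just an example).
server=192.168.0.1
```

Everybody else keeps resolving current.site.com through the public
DNS and still sees the old site until you flip it for real.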
Better option: use different front ends for beta.current.site.com and
current.site.com, and rewrite URLs in the former. Let's say you have
beta_frontend and current_frontend respectively. beta_frontend will
be serving URLs from both beta.current.site.com and current.site.com,
but will rewrite beta.current.site.com into current.site.com, so that
beta content shows up under current.site.com URLs and the fresh new
content comes from current.site.com itself as the appliance re-crawls
it. Old URLs returning 404 will gradually disappear.
Good complement: use incremental feeds with action="delete" to get rid
of old URLs fast. Get a list of all the old URLs from current.site.com
and feed them with action="delete"; they should go away quickly.
Removing Feed Content From the Index
http://code.google.com/intl/es/apis/searchappliance/documentation/46/feedsguide.html#removing_url
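If it saves you some time, here is a rough Python sketch of building
such a delete feed. The XML shape is the one from the feeds guide
above; the datasource name, the URLs, and the appliance hostname below
are made-up examples:

```python
from xml.sax.saxutils import quoteattr

def build_delete_feed(datasource, urls):
    """Build an incremental GSA feed that deletes the given URLs."""
    records = "\n".join(
        '  <record url=%s action="delete" mimetype="text/plain"/>'
        % quoteattr(u)
        for u in urls
    )
    return """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
  <datasource>%s</datasource>
  <feedtype>incremental</feedtype>
</header>
<group>
%s
</group>
</gsafeed>
""" % (datasource, records)

# Example: delete two old URLs (made-up datasource and URLs).
feed = build_delete_feed("old_site_cleanup", [
    "http://current.site.com/old-page-1.html",
    "http://current.site.com/old-page-2.html",
])
print(feed)
```

Save the output to, say, delete.xml and push it to the feedergate on
port 19900, per the feeds guide, with something like (hostname is an
example):

curl http://your-appliance:19900/xmlfeed --form datasource=old_site_cleanup --form feedtype=incremental --form data=@delete.xml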
All of this is valid for all versions from 4.6.4 onwards, which are
all the versions I've ever worked with :-)
On Feb 4, 11:24 pm, justin blecher <
justin.blec...@gmail.com> wrote:
> On Feb 4, 12:26 pm, miguev <
mig...@gmail.com> wrote:
>
> > Two things you may be glad about:
> > 1. if the Do Not Crawl patterns did eventually remove URLs from the
> > index, then the appliance is not affected by the issues I was afraid
> > it would be, no need to upgrade
> > 2. all URLs that match any pattern in the Do Not Crawl box will be
> > removed from the index, it may take a while depending on the
> > appliance's work load
>
> > So if you want to delete the whole web site from the index, that's
> > easy although it can take a while (up to a few days at the worst), you
> > just need to add the root URL http://www.website.com/ in the Do Not