removing garbage URLs en masse from index


justin blecher

Feb 3, 2009, 9:18:51 PM
to Google Search Appliance/Google Mini
greetings all,

question: i have a load of URLs (tens of thousands, actually), that i
need to remove from a single site's index/collection on a GSA. how can
i do that? i've read that it's possible to reset the entire GSA's
index, but 1) we don't have admin/superuser-level access to the GSA
(we only have a lowly user account to modify the frontend, create
reports, etc.) and 2) the GSA is being used for other purposes
(outside of this site).

let me 'splain a bit more...

the site we're working on is in development and is being crawled by a
GSA (v4.6.4). during our site's template integration, we put in "TEST"
links (href="TEST") so that we can easily find yet-to-be-integrated
links/functionality and clean them up as we go. the GSA started
crawling our site and hit our thousands of articles, finding TEST
links on every single page. they obviously return 404s, but still
appear in the index, cluttering up the view and making it difficult to
find the real errors. how can i remove them? i've added "TEST$" as a
URL pattern to ignore, but the TEST urls still appear in the list of
URLs crawled.

additionally, since the site was being crawled in mid-development,
some generated URLs that appeared on the site in the past month were
incorrect -- they either were ill-formed and/or pointed to content
that shouldn't have been indexed for various reasons. those generated
URLs have been fixed, but some are still returning HTTP 200. there is
no pattern to these URLs. we're working on returning a more
accurate status code (404), but how can i get these out of the system,
too?

lastly, some generated URLs were really messed up and created
hostnames that were just plain invalid. but those got picked up by the
GSA, too. yes, i'd like to remove them, too.

it seems as though once the GSA finds URLs, there is no way to get
them out of the system, short of resetting the entire GSA's index...
is that correct? (again, i'd like to do it for just one site/
collection.) if this GSA wasn't being used for other purposes, we
could probably reset it, but what do people do in the case i've
described? just live with a GSA index that is littered with errors
and exceptions? it sounds rather crufty and something i'd rather not
have to do.

i've temporarily added a robots.txt file to the site that disallows
"/" to all user agents (we're not live yet) in hopes of "resetting"
the index. so far, it seems to be increasing the "excluded URLs" in
chunks of a few thousand at a time, but the tens of thousands of
retrieval errors are still there. i was planning on revising the
robots.txt rules once all the "bad stuff" was out of the index.
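for reference, the blanket disallow i dropped in place is just the
standard robots exclusion syntax, served at /robots.txt:

```
User-agent: *
Disallow: /
```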

*sigh*... google should really put something in the docs warning you
about setting the GSA loose during site development. i can't be the
only one that's run into this problem.

so does anyone have any thoughts or strategies for this mess^H^H^H^H
situation?


thanks in advance,

-justin

www.google-mini.net

Feb 4, 2009, 4:00:32 AM
to Google Search Appliance/Google Mini
I just want to make sure I understand this correctly.

You want to remove any of the pages that have href="TEST" inside the
page content, correct?

As a workaround, you can temporarily add a meta tag (<meta
name="gsa-code" content="live" />) to your pages and then use the
Front End filters to only show pages that contain this meta tag:
Google Mini > Serving > Front Ends > Filters > Meta Tag Filter

In the Meta Tag Filter area, enter:
Meta Tag Name: gsa-code
Value Type: Exact
Meta Tag Value: live
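For example, the tag would go in the <head> of every live page (note
that the standard HTML attribute name is content, not value):

```html
<head>
  <meta name="gsa-code" content="live" />
</head>
```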

That is a workaround you can use to get your search running live
without actually removing the pages from the index. From there, what
you should do is cause all your test pages to return 404 errors as
part of the header response code. This will cause the Google Mini to
drop those pages from the index.

Finally, once you do have your search up and running: if you are using
ASP.NET, you should strongly consider the Google Mini ASP.NET component
from http://www.google-mini.net/

The Google Mini ASP.NET component allows you to perform a Google Mini
search from any web page and get the search results back as an ASP.NET
object. The object includes all the search result item properties,
including title, description, URL, meta tags, crawl dates, file type,
etc. The best part is because the results come back as an ASP.NET
object, you don't have to worry about using XSLT technology, you can
just use CSS and XHTML (or even your existing MasterPage) to customize
the search result layout.

I hope this helps!

Jason Clark

Google Mini ASP.NET Integration and Customization
http://www.google-mini.net/
Toll Free: (877) 835-2801
Email: c...@google-mini.net

miguev

Feb 4, 2009, 4:17:37 AM
to Google Search Appliance/Google Mini

Hi Justin,

A development appliance running a two-versions-old release doesn't
sound good. Are you going live with that software version?
The oldest supported version is 5.0.4.G.22 (see
https://support.google.com/enterprise/doc/gsa/00/update_index_page.html
for details), and I guess you don't want to go live with an unsupported
version, so you'd probably be better off upgrading to that version
before launch.

Among a bunch of good reasons for upgrading, there is a famous issue
that seems to be hitting your appliance: it seems to be crawling, but
neither the index nor Crawl Diagnostics are reflecting changes. When
trying to see whether a particular URL is in the index or not, search
for info:http://... instead of looking at Crawl Diagnostics, especially
for URLs that previously returned 404. Theoretically, any URL
returning 404 would be removed from the index within 30 minutes under
lab conditions:
http://code.google.com/intl/es/apis/searchappliance/documentation/52/admin_crawl/Introduction.html#remidx1hd
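If you want to script those checks, here is a small sketch of building
an info: query with the XML search protocol. The appliance hostname,
collection and front end names below are placeholders; adjust them to
your setup:

```python
# Build a GSA/Mini search-protocol URL that asks whether a page is in
# the index (the info: operator). Hostname and defaults are placeholders.
from urllib.parse import urlencode

def info_query_url(appliance, page_url,
                   collection="default_collection",
                   frontend="default_frontend"):
    params = urlencode({
        "q": "info:" + page_url,  # info: returns a result only if indexed
        "site": collection,
        "client": frontend,
        "output": "xml_no_dtd",   # machine-readable results
    })
    return "http://%s/search?%s" % (appliance, params)

url = info_query_url("gsa.example.com", "http://current.site.com/page.html")
```

An empty result block in the XML response means the URL is no longer
served from the index.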

So you found a pattern for the TEST$ URLs, that's good. Search for
them and see if they are gone from the index.

For the invalid hostnames, can't you just match them the same way?
e.g. bahotsmane.moc/ in Do Not Crawl patterns?

Anyway, for those URLs you can't match with simple patterns, you can
use an incremental feed with action="delete". The feed will take
ownership of those URLs and kick them out of the index. Mind you, I
haven't tried this on 4.6.4 but it works fine from 5.0.0 onwards:
http://code.google.com/intl/es/apis/searchappliance/documentation/52/feedsguide.html#removing_url
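A minimal sketch of building such a delete feed in Python (the
datasource name "cleanup" and the URLs are made up for illustration;
the generated XML is normally POSTed to the appliance's feed port,
19900, as multipart form data):

```python
# Sketch: generate a GSA incremental feed that deletes URLs from the index.
from xml.sax.saxutils import quoteattr

def build_delete_feed(datasource, urls):
    """Return gsafeed XML asking the appliance to drop each URL."""
    records = "\n".join(
        '    <record url=%s mimetype="text/html" action="delete"/>'
        % quoteattr(u) for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
        "<gsafeed>\n"
        "  <header>\n"
        "    <datasource>%s</datasource>\n"
        "    <feedtype>incremental</feedtype>\n"
        "  </header>\n"
        "  <group>\n%s\n  </group>\n"
        "</gsafeed>\n" % (datasource, records)
    )

feed = build_delete_feed("cleanup", [
    "http://beta.current.site.com/path/a/TEST",
    "http://beta.current.site.com/TEST",
])
```

You would then POST the feed to http://<appliance>:19900/xmlfeed with
the form fields feedtype, datasource and data.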

I hope this helps! :)

justin blecher

Feb 4, 2009, 10:50:40 AM
to Google Search Appliance/Google Mini

On Feb 4, 4:00 am, "www.google-mini.net"
<goo...@bellwetherentertainment.com> wrote:
> I just want to make sure I understand this correctly.
>
> You want to remove any of the pages that have href="TEST" inside the
> page content, correct?
[...]

not quite. all pages have (or more accurately, had at some point)
href="TEST" in the page content. i want to remove the TEST links at
various URL paths from the index. things like:

/path/a/TEST
/TEST
/path/b/TEST
/path/c/something-else/TEST

the URLs are returning 404s right now, but still appear in the crawl
queue and diagnostics. there are tens of thousands of them and they're
cluttering up the GSA's UI, making it hard to tell actual errors from
TEST errors.

hope that makes sense.

-justin

justin blecher

Feb 4, 2009, 12:18:35 PM
to Google Search Appliance/Google Mini
miguev, thanks so much for the detailed info! see below...

On Feb 4, 4:17 am, miguev <mig...@gmail.com> wrote:
>
> A development appliance running a two-versions-old release doesn't
> sound good. Are you going live with that software version?
> The oldest supported version is 5.0.4.G.22 (see
> https://support.google.com/enterprise/doc/gsa/00/update_index_page.html
> for details), and I guess you don't want to go live with an unsupported
> version, so you'd probably be better off upgrading to that version
> before launch.

unfortunately, my team doesn't have control over the GSA (it's our
client's). i'll see if i can convince them to upgrade, but i fear that
it's going to be a lot more work for everyone (including myself). i'm
not familiar enough with GSA licensing, but i suspect that their
support contract has run out and they aren't willing/interested in
paying for the upgrade (you can only get the upgrade if you pay for a
contract, right?).

> Among a bunch of good reasons for upgrading, there is a famous issue
> that seems to be hitting your appliance: it seems to be crawling, but
> neither the index nor Crawl Diagnostics are reflecting changes. When
> trying to see whether a particular URL is in the index or not, search
> for info:http://... instead of looking at Crawl Diagnostics, especially
> for URLs that previously returned 404. Theoretically, any URL
> returning 404 would be removed from the index within 30 minutes under
> lab conditions:
> http://code.google.com/intl/es/apis/searchappliance/documentation/52/...

oooh... a "famous" issue, eh? is this truly a known bug that is
documented somewhere? if so, then i can point it out to our client and
further reinforce the need to upgrade.

> So you found a pattern for the TEST$ URLs, that's good. Search for
> them and see if they are gone from the index.
>
> For the invalid hostnames, can't you just match them the same way?
> e.g. bahotsmane.moc/ in Do Not Crawl patterns?

ok, so it seems that it just took a long time (overnight) for my
additions to the "Do Not Include Content Matching the Following
Patterns" list to get processed. the bad hostnames and the TEST URLs
are now not showing up in the list. whew!

i have a bit more investigation to do in this area. due to what i
learned about how the content-matching patterns work, i'm doing some
experimenting right now with trying to leverage that to remove the
whole site from the index. we'll see how it works. i figure i can't do
any more damage than i've already done. ;-)

sidenote: it's not entirely clear that the "Do Not Include Content
Matching the Following Patterns" list actually *removes* the URL from
the index if it already exists. since everything i had read and
observed thus far about the GSA's index pointed to the fact that
things got added but not removed, i interpreted the patterns as an
"ignore" filter, not as both an ignore filter and list of things to
remove (at some undetermined point in the near future). i think the
mental disconnect is the latency in the GSA applying the new settings.
i guess i just have to get used to the batch/queued nature of the GSA
and the fact that one can't get realtime feedback. it's unfortunate
that the interface doesn't reinforce this concept more. grrrr...

anyway...


> Anyway, for those URLs you can't match with simple patterns, you can
> use an incremental feed with action="delete". The feed will take
> ownership of those URLs and kick them out of the index. Mind you, I
> haven't tried this on 4.6.4 but it works fine from 5.0.0 onwards:
> http://code.google.com/intl/es/apis/searchappliance/documentation/52/...

right, i was reading about the feed technique. i had hoped to avoid
needing to use it. we'll see.

thanks again for all the info!

-justin

miguev

Feb 4, 2009, 12:26:08 PM
to Google Search Appliance/Google Mini

Hi Justin,

Two things you may be glad about:
1. if the Do Not Crawl patterns did eventually remove URLs from the
index, then the appliance is not affected by the issue I was afraid
it would be, so there's no need to upgrade
2. all URLs that match any pattern in the Do Not Crawl box will be
removed from the index; it may take a while depending on the
appliance's workload

So if you want to delete the whole web site from the index, that's
easy, although it can take a while (up to a few days at the worst): you
just need to add the root URL http://www.website.com/ to the Do Not
Crawl patterns and comment out all Start URLs that match it. Save the
patterns and leave it alone; the whole website should vanish from the
index without changing any other settings.

justin blecher

Feb 4, 2009, 6:24:58 PM
to Google Search Appliance/Google Mini

On Feb 4, 12:26 pm, miguev <mig...@gmail.com> wrote:
> Two things you may be glad about:
> 1. if the Do Not Crawl patterns did eventually remove URLs from the
> index, then the appliance is not affected by the issue I was afraid
> it would be, so there's no need to upgrade
> 2. all URLs that match any pattern in the Do Not Crawl box will be
> removed from the index; it may take a while depending on the
> appliance's workload
>
> So if you want to delete the whole web site from the index, that's
> easy, although it can take a while (up to a few days at the worst): you
> just need to add the root URL http://www.website.com/ to the Do Not
> Crawl patterns and comment out all Start URLs that match it. Save the
> patterns and leave it alone; the whole website should vanish from the
> index without changing any other settings.

*sigh*... this sucks.

my initial hunch/understanding of the GSA appears to be right. once
content gets in, it doesn't seem possible to remove it. ever. the best
that can be done is that it can be 'excluded' (ignored/hidden) from
search results. in other words, the "Include Content..." and "Do Not
Include Content..." lists are purely for *hiding/ignoring* the indexed
pages from the search results. i've been able to verify this a few
times today. is this the "famous bug" you're talking about? after
reading the docs again, it seems like it's expected/normal behavior.
did 5.x change this at all?

in other words, i can't remove the contents of the index -- the GSA's
knowledge that certain URLs exist on the site -- using the URL pattern
matches. they only mask things (in results and the admin tool).

ok, so i've belabored the point. i get it. i finally understand how
the GSA works (until someone can clarify things a bit more with me).
let's move on to something more interesting: actually launching the
new site!

here's my situation... there are two sites:

existing live site @ current.site.com
- GSA index: a few hundred bad URLs with a handful of existing pages

new site in development @ beta.current.site.com
- GSA index: tens of thousands of pages (more bad URLs than good, but
i digress...)

(note that beta site is a subdomain of the current site. yes, you read
that right: beta.current.site.com)

ok, so we want to launch the new site, doing the standard DNS
switcheroo. once DNS is switched, the content on current.site.com
effectively goes away and the URLs will generate 404s on the new
server. we also have no index of our new content at the "real" (non-
beta) URL at launch time, because the GSA has yet to index our new
site, a process that may take hours/days. since it has to reindex the
same content at the new URL, that means that there will be two copies
of the content in the index. i assume the "correct" way to handle this
is to add the beta URL to the "do not crawl" list, effectively hiding
that content from the search results.

the question remains, though: how can we pre-populate the GSA's index
with all of the new beta content at the correct URL (allowing for a
seamless launch)?

the only way i can think of is to post-process the search results
markup (which i'm already doing for other reasons), replacing
beta.current.site.com with current.site.com during the time that the
GSA doesn't have an index of the beta content at the final URL. that
also means that at the point the site goes live, i have to exclude the
current.site.com from the index so the old content doesn't show up in
the search results. THEN, once i have confirmed the content is indexed
at the real URL, i add the beta URL to the ignore list and remove the
search results post-processing. whew.
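a sketch of what i mean by post-processing (hostnames are the ones
from above; the real thing runs on the GSA's results markup before it
hits the browser):

```python
# Rewrite beta hostnames in the serialized search-results markup so the
# links point at the launch URL. Plain string replacement, as a sketch;
# it also rewrites the hostname anywhere it appears in visible text.
def rewrite_results(html,
                    beta_host="beta.current.site.com",
                    live_host="current.site.com"):
    return html.replace(beta_host, live_host)

out = rewrite_results('<a href="http://beta.current.site.com/article/1">x</a>')
# out is '<a href="http://current.site.com/article/1">x</a>'
```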

again, moving forward, two copies of the content will remain in the
index, but i will be ignoring one copy of it.

and through all of this i have to deal with the awesome ~30 minute
latency of all changes, not really knowing *when* my change took
effect, and if the change actually had any effect. how do people work
like this? seriously?

my goal for all of this is simple: to clean it up and make it sane. it
seems like using the GSA, this is just not possible to do. once the
GSA's index is polluted with bad URLs, all you can do is attempt to
construct patterns that will match the undesirable URLs.

(sorry for the rant-like tone of the post... this GSA integration is
taking a bit longer and is causing a bit more frustration than we had
originally planned.)


thanks again,

-justin

miguev

Feb 5, 2009, 5:43:56 AM
to Google Search Appliance/Google Mini

Hi Justin,

You won't like this, but I'm afraid you have one or two important
issues here:
a. your appliance is not working properly and needs an upgrade to
5.0.4.G.22 plus VM Patch #1
b. you don't fully understand how to add documents to and remove them
from the index

I can't tell yet whether a. or b. or both are actually happening, but
from your last message I'm afraid of both.

Let's start with a. When the appliance works properly, any URL that
returns 404 should get removed from the index without retrying. This
would take about 30 minutes if the appliance is not busy, longer if it
is. Also, any URL that matches patterns in the Do Not Crawl box will
be removed, not just hidden, from the index. That can take up to 6
hours in old versions like 4.6.4, but it must happen; otherwise the
appliance is definitely in a bad state and you should not use it for
production. Resetting the index will fix it, but that's the
sledgehammer, last-resort approach, and it won't prevent the issue
from happening again.

Yet it's not clear to me that your appliance is in a bad state. When
you say "they only mask things (in results and the admin tool)", do
you mean that those URLs disappear from both search results and Crawl
Diagnostics? If that is the case, those URLs are gone forever from the
index, not just masked or hidden. Even if you see the URLs in Crawl
Diagnostics but they never show up in search results, they are no
longer in the index; Crawl Diagnostics is just not aware of it yet.
You can only be certain URLs were not removed from the index when they
do show up in search results. Is that happening to your URLs matching
Do Not Crawl patterns? If it is, are you really sure they match the
patterns? Sometimes they don't and it's not obvious; try the pattern
test box for them.

If your appliance is in a bad state, it won't crawl. So just add
something to the Start URLs and the Crawl and Follow patterns and see
if it shows up in the index. If it does, the appliance may be suitable
for production, but if you don't get new sites into the index, it is
not.

As for b., I only see that you are using the term "hidden" for URLs
that match patterns in the Do Not Crawl box. URLs matching those
patterns should be *removed* from the index: gone, bye bye forever. You
can also hide URLs on each front end if you want, and this can help
you out when you need to stop URLs from showing up in search results
*urgently*. However, this approach is not good for the long term, as it
degrades performance on the serving side (slower results), but it will
do for emergencies:

Serving > Front Ends > Remove URLs
http://code.google.com/intl/es/apis/searchappliance/documentation/46/help_gsa/serve_remove.html

Then you have the launch and want it to be seamless, interesting
stuff.

So you have now http://beta.current.site.com/ crawled and showing up
in search results, there is also http://current.site.com/ in the index
with the old content. At some point, you want to switch the DNS so
that current.site.com points to where beta.current.site.com points
now. And at that point you want the appliance to have already crawled
the whole new site as http://current.site.com/ but, how could this be
before the DNS switchover?

Easy option: use a different DNS server for the appliance and switch
it over to the new site a few days (to have plenty of time) before the
main, public-facing DNS. That would make the appliance see the new
content at http://current.site.com/ before anybody else. Then reset
the index to have fresh start crawling the new site, being sure no
stuff from the old site will be crawled. By the time the main, public-
facing DNS is switched over, the index should be good.

Better option: use different front ends for beta.current.site.com
and current.site.com and rewrite URLs in the former. Let's say you
have beta_frontend and current_frontend respectively. The
beta_frontend will serve URLs from both beta.current.site.com and
current.site.com but will rewrite beta.current.site.com into
current.site.com, so that beta content is served from
beta.current.site.com and fresh new content from current.site.com as
it re-crawls it. Old URLs returning 404 will gradually disappear.

Good complement: use incremental feeds with action="delete" to get rid
of old URLs fast. Get a list of all old URLs from current.site.com and
feed them with action="delete", they should go away fast.

Removing Feed Content From the Index
http://code.google.com/intl/es/apis/searchappliance/documentation/46/feedsguide.html#removing_url

All this is valid for all versions from 4.6.4 onwards, which is all
the versions I've ever worked with :-)

On Feb 4, 11:24 pm, justin blecher <justin.blec...@gmail.com> wrote:
> On Feb 4, 12:26 pm, miguev <mig...@gmail.com> wrote:
>
> > Two things you may be glad about:
> > 1. if the Do Not Crawl patterns did eventually remove URLs from the
> > index, then the appliance is not affected by the issue I was afraid
> > it would be, so there's no need to upgrade
> > 2. all URLs that match any pattern in the Do Not Crawl box will be
> > removed from the index; it may take a while depending on the
> > appliance's workload
>
> > So if you want to delete the whole web site from the index, that's
> > easy, although it can take a while (up to a few days at the worst), you
> > just need to add the root URL http://www.website.com/ to the Do Not