
htdig: htdig URL duplication


Danny Birchall

Jan 25, 1999, 3:00:00 AM

For historical reasons, our server configuration includes a number of
server aliases that cut out part of the URL to make it shorter, e.g.
www.sussex.ac.uk/Units/foo/
becomes
www.sussex.ac.uk/foo/.
This causes a problem when running htdig, because inevitably somewhere
within our document tree, pages will be referenced both as /Units/foo/
and as /foo/. Result: ht://Dig indexes the same pages twice, with
different URLs, and when htsearch is run, each page is returned
twice, once with each URL.

At first we managed to get round this problem by using the local_urls
attribute in the config file, together with a patch that modified htdig
to note the inode of each file being indexed and so prevent the same
physical file from being indexed a second time. This eliminated
duplicates by ensuring that each file on the filesystem was indexed
only once.
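
To illustrate the inode idea, here is a minimal Python sketch. It is not the actual htdig patch (which is not shown in this thread); the function name dedupe_by_inode and its list-of-paths input are assumptions made for the example. It treats two paths as the same document when they resolve to the same device and inode pair.

```python
import os

def dedupe_by_inode(paths):
    """Return paths in order, dropping any that point at an already-seen file.

    Hypothetical helper for illustration only; not part of ht://Dig.
    """
    seen = set()      # (device, inode) pairs encountered so far
    unique = []
    for path in paths:
        st = os.stat(path)              # os.stat follows symlinks
        key = (st.st_dev, st.st_ino)    # identifies the physical file
        if key in seen:
            continue                    # same file reached via another path
        seen.add(key)
        unique.append(path)
    return unique
```

Note that the seen set in this sketch lives only for the duration of one call, so a later call starts with no memory of files handled earlier unless that set is stored somewhere between runs.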

The problem is that this only works for each complete new run of htdig.
When we run an update, all the duplications reappear.

Can anybody think of a workaround, however elaborate? This problem is
all that's stopping us from using htdig on our site.

Thanks


--------------------------------------------------
Danny Birchall
Editor
University of Sussex Information Service
http://www.sussex.ac.uk/

D.P.Bi...@sussex.ac.uk
Tel: (0)1273 678745
Fax: (0)1273 678441
---------------------------------------------------

----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-...@sdsu.edu containing the single word "unsubscribe" in
the body of the message.

Knut A. Syed

Jan 25, 1999, 3:00:00 AM
D.P.Bi...@sussex.ac.uk (Danny Birchall) writes:

> For historical reasons we have in our server configuration a number of
> server aliases which serve to cut out part of the URL to make it
> shorter: eg www.sussex.ac.uk/Units/foo/ becomes www.sussex.ac.uk/foo/.
>
> This causes a problem when running htdig, because inevitably somewhere
> within our document tree, pages will be referenced both as /Units/foo/
> and as /foo/. Result: ht://Dig indexes the same pages twice, with
> different URLs, and when htsearch is run, each page is returned
> twice, once with each URL.

Configure your Web-server to make http://www.sussex.ac.uk/Units/foo/ a
redirect to http://www.sussex.ac.uk/foo/ instead of an alias.

Documentation for Apache redirect:
<URL:http://www.apache.org/docs/mod/mod_alias.html#redirect>
(HEAD at www.sussex.ac.uk told me you were running Apache.)
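
As a rough sketch, assuming the mod_alias Redirect directive and the example paths quoted above (your real configuration will differ, and this line is untested against your server), the entry might look something like:

```apache
# Assumption: /foo/ is the URL you want indexed; adjust paths to suit.
# Requests for the long form are answered with a 301 pointing at the short form,
# so a crawler following either link ends up with a single URL.
Redirect permanent /Units/foo/ http://www.sussex.ac.uk/foo/
```

With a redirect in place, a client asking for the /Units/foo/ form is sent to the /foo/ URL instead of being served the same content under a second name.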

~kas
