Removing expired contents

germain

Sep 7, 2010, 4:11:42 PM
to hounder
Bonjour,

How do we configure hounder to remove expired content, for example
pages containing "is not available"?

The ImplementationDetails document talks about the
"pass.through.on.tags".

Cheers,

Germain

jhandl

Sep 7, 2010, 4:15:28 PM
to hounder
Bonjour Germain,

Do you mean a page that returns a 404 error code or a page that has
the words "not available" in its contents?

germain

Sep 7, 2010, 5:01:55 PM
to hounder
Bonjour jhandl,

Pages that have "not available" in their contents.

Thanks,

Germain

Jorge Handl

Sep 7, 2010, 5:07:45 PM
to hou...@googlegroups.com
Germain, you can use the WordFilterModule to remove the hotspot tag when it encounters a page that contains the "Not Available" phrase. Not sure if that would cover your use case, though.


--
You received this message because you are subscribed to the Google Groups "hounder" group.
To post to this group, send email to hou...@googlegroups.com.
To unsubscribe from this group, send email to hounder+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/hounder?hl=en.


germain

Sep 7, 2010, 5:24:44 PM
to hounder
Thanks Jorge,

It's not quite what I had in mind. I am trying to index content that
expires over time (status changes). Some servers return a 404 error
and other servers return a message instead of the 404 error, for
example: content not available... page expired... not available.

Can we expand the 404 error page definition to include pages with the
above messages, so that when these pages are crawled a second time,
their contents are removed from the index?

Thanks,

Germain

Jorge Handl

Sep 7, 2010, 5:41:04 PM
to hou...@googlegroups.com
A page that says "page not available" with a success return code (200) is in every way a normal, valid web page; it's only your interpretation of the rendered contents that conveys the meaning "not found". The only way to catch it is to have a filter for every possible return message you expect to get. The feasibility of that depends on the breadth of the crawl you're trying to do. If you are crawling the web at large you will have to accept that some percentage of unwanted pages will filter through the process.

If you are crawling a known set of sites and want to keep those pages out of the index, I recommend you use the WordFilterModule (or a more capable filter that considers the html structure to reduce false positives) to try to catch as many cases as possible, and configure the module to unset the "emitdoc" tag when a match is found, so it will not be sent to the index.

Jorge
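
To illustrate the "more capable filter that considers the html structure" idea from the message above, here is a minimal Python sketch. It is not Hounder code and does not use Hounder's module API; the phrase list, the class `_ProminentText`, and the function `looks_expired` are all hypothetical names. The assumption is that an expired page announces itself in its title or main headings, so only those elements are checked and a phrase buried in body text does not trigger a match.

```python
from html.parser import HTMLParser

# Hypothetical phrase list; it would need to mirror the messages your sites return.
EXPIRED_PHRASES = ("not available", "page expired", "content expired")


class _ProminentText(HTMLParser):
    """Collects text found inside <title>, <h1> and <h2> elements only."""

    PROMINENT = {"title", "h1", "h2"}

    def __init__(self):
        super().__init__()
        self._depth = 0   # greater than 0 while inside a prominent element
        self.chunks = []  # text fragments seen inside prominent elements

    def handle_starttag(self, tag, attrs):
        if tag in self.PROMINENT:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.PROMINENT and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.chunks.append(data)


def looks_expired(html, phrases=EXPIRED_PHRASES):
    """Return True if any phrase occurs in the page title or headings."""
    parser = _ProminentText()
    parser.feed(html)
    parser.close()
    # Lowercase and collapse whitespace so "Not   Available" still matches.
    text = " ".join(" ".join(parser.chunks).lower().split())
    return any(phrase in text for phrase in phrases)
```

A page whose title is "Page Not Available" would be flagged, while a product page that merely says "this colour is not available in stores" in a paragraph would not. The trade-off is that sites which put the message only in body text are missed, so the set of inspected elements would have to be tuned per site.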

germain

Sep 8, 2010, 10:57:49 PM
to hounder
Bonjour Jorge,

Thank you for your explanation and for pointing me in the right
direction. I've been trying to configure the WordFilterModule and
want to double check my settings.

1. Add the WordFilterModule to the list of enabled modules in the
crawler.properties file.
2. Create a file called words.txt that includes the regex
expression .*smurf.*
3. Update the wordFilterModule.properties file with the following
values:

# Inherited properties from ATrueFalseModule
on.true.set.tags =
on.true.unset.tags = emitdoc,hotspot
on.false.set.tags = hotspot,emitdoc
on.false.unset.tags =

I am trying to skip over / remove html files that contain the word
"smurf" on my internal test server.

Cheers,

Germain
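
The effect of the four tag properties in the message above can be sketched in Python as follows. This is only an illustration of the apparent semantics, not Hounder's implementation: it assumes that the "true" branch applies when the page matches an entry in words.txt, and that tags are set before they are unset (the order does not matter for these values, since no tag appears in both lists). The names `parse_tags` and `apply_outcome` are hypothetical.

```python
# Values copied from the wordFilterModule.properties excerpt in the message.
CONFIG = {
    "on.true.set.tags": "",
    "on.true.unset.tags": "emitdoc,hotspot",
    "on.false.set.tags": "hotspot,emitdoc",
    "on.false.unset.tags": "",
}


def parse_tags(value):
    """Split a comma-separated property value into a set of tag names."""
    return {tag.strip() for tag in value.split(",") if tag.strip()}


def apply_outcome(tags, matched, config=CONFIG):
    """Return the document's tags after the filter outcome is applied.

    Assumption: matched=True selects the on.true.* properties,
    matched=False selects the on.false.* properties.
    """
    branch = "true" if matched else "false"
    to_set = parse_tags(config[f"on.{branch}.set.tags"])
    to_unset = parse_tags(config[f"on.{branch}.unset.tags"])
    return (set(tags) | to_set) - to_unset
```

Under these assumptions, a matching page loses both `emitdoc` and `hotspot`, so it would not be sent to the index, while a non-matching page ends up with both tags set.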

Jorge Handl

Sep 9, 2010, 12:03:27 AM
to hou...@googlegroups.com
Germain, your setup is mostly correct, except that you don't need to use regex. Each line in the words.txt file constitutes a phrase, so if you want to tag pages that contain "page not found" or "content expired", you would write those two lines, without the quotes, to the words.txt file.

Jorge
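
The difference between a regex entry and a plain phrase entry can be sketched in Python as below. This is not Hounder's matching code: the helper names `load_phrases` and `contains_phrase` are hypothetical, and the case-insensitive, whitespace-normalized substring match is an assumption made for the illustration. The only behavior taken from the thread is that each line of words.txt is one literal phrase.

```python
def load_phrases(lines):
    """Treat each non-empty line as one literal phrase, as in words.txt."""
    return [line.strip().lower() for line in lines if line.strip()]


def contains_phrase(text, phrases):
    """Literal substring match; regex metacharacters have no special meaning."""
    # Assumption: matching ignores case and runs of whitespace.
    normalized = " ".join(text.lower().split())
    return any(phrase in normalized for phrase in phrases)


# The two example phrases from the message, one per line, without quotes.
phrases = load_phrases(["page not found\n", "content expired\n"])
page = "Sorry,  Page Not Found."
matched = contains_phrase(page, phrases)

# A regex-style entry is taken literally, so it does not match the word "smurfs".
regex_entry_matches = contains_phrase("a village of smurfs", load_phrases([".*smurf.*"]))
plain_entry_matches = contains_phrase("a village of smurfs", load_phrases(["smurf"]))
```

In this sketch the entry `.*smurf.*` only matches a page that literally contains those characters, which is why the plain entry `smurf` is the one to write for the test server described earlier in the thread.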
