How to configure Dezi to "forget"?


Paul Stubbe

Mar 10, 2015, 4:25:25 PM
to dezi-...@googlegroups.com
Hi,

I used to index all my documents daily with swish.
With Dezi I decided to go for incremental indexing, and this works fine (thanks for the tips on how to do that).
I have a process in place to reindex files that recently changed.
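
Roughly, that process is a sketch like the one below (the paths are hypothetical, and I'm assuming SWISH::Prog's "fs" aggregator with the Lucy indexer; adjust for your own setup):

#!/usr/bin/env perl
use strict;
use warnings;

use File::Find;
use SWISH::Prog;

my $docs  = '/path/to/docs';    # hypothetical corpus root
my $index = 'dezi.index';
my $since = time() - 86400;     # anything modified in the last day

# collect recently changed files
my @changed;
find(
    sub {
        push @changed, $File::Find::name
            if -f $_ && ( stat(_) )[9] >= $since;
    },
    $docs
);
exit 0 unless @changed;

# re-index just those files with the filesystem aggregator and Lucy indexer
my $prog = SWISH::Prog->new(
    invindex   => $index,
    aggregator => 'fs',
    indexer    => 'lucy',
);
$prog->run(@changed);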

But now I have a new "problem":
       "How best to configure Dezi to forget after X days?" (Because stuff gets renamed, deleted, ...)
             (I could do this with an external DB and check everything, but that does not feel right.)

Is there a good practice to do this with Dezi?

Thanks,

Paul

Peter Karman

Mar 10, 2015, 5:35:51 PM
to dezi-...@googlegroups.com
Keeping an index in sync with the original corpus is always a challenge, and IME, usually specific
to the kind of corpus you're starting from.

In general, I take a belt-and-suspenders approach: regular incremental updates and periodic full
re-indexing. The exact timing depends on your use case(s). For my projects, it's often
save-and-index (for db-based origins) or write-and-index (for file-based origins). That handles the
incremental updates. Then I do a full rebuild weekly, often early on a Sunday morning. That assumes
that your corpus is small enough to rebuild in a matter of hours, not days.
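
As a rough sketch, the schedule might look like this in cron (the two script names are just placeholders for your own incremental and full-rebuild jobs):

# hourly incremental pass, full rebuild early on Sunday morning
0 * * * *   /usr/local/bin/index-changed-docs
15 3 * * 0  /usr/local/bin/rebuild-full-index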

In addition, I have written several "prune" type scripts, like this one:

https://github.com/publicinsightnetwork/audience-insight-repository/blob/master/bin/prune-deleted#L159

which basically crawls through the original data source (in this case a database and filesystem) and
figures out what needs to be deleted from the index.
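
A minimal sketch of the same idea for a filesystem corpus, assuming you can enumerate the indexed paths (here read from a file, one per line) and that "swishdocpath" stores the original location; delete_by_term() and commit() are the underlying Lucy calls:

#!/usr/bin/env perl
use strict;
use warnings;

use SWISH::Prog::Lucy::Indexer;

# hypothetical input: one indexed file path per line
my $list = shift(@ARGV) or die "usage: $0 indexed-paths.txt\n";

my $indexer = SWISH::Prog::Lucy::Indexer->new( invindex => 'dezi.index' );
my $lucy    = $indexer->get_lucy;    # underlying Lucy::Index::Indexer

open my $fh, '<', $list or die "$list: $!";
my $deleted = 0;
while ( my $path = <$fh> ) {
    chomp $path;
    next if -e $path;    # source still exists; keep it

    # gone from the source, so forget it in the index
    $lucy->delete_by_term( field => 'swishdocpath', term => $path );
    $deleted++;
}
close $fh;

$lucy->commit();
print "pruned $deleted documents\n";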

Using an external db to track the "state" of your documents doesn't sound insane to me. What about
it feels wrong?

Paul Stubbe

Mar 11, 2015, 8:23:13 AM
to dezi-...@googlegroups.com, pe...@peknet.com
Peter,
It feels wrong to me because I would like to keep my overall system as simple as possible.

What I would like is something like the following scenario:

- I would like to work only with the incremental system and (almost) never rebuild everything.
- I can run a re-indexing process that visits all "source-docs" once a week.
- If a page did not get re-indexed after, say, three weeks, I would like Dezi to forget about it.
       (I can probably add an "indexed date" as meta information and run a process after three weeks to delete those links that Dezi finds.)

I hope this helps you further develop this beautiful tool.

Greetings,

Paul

Peter Karman

Mar 13, 2015, 10:50:35 PM
to dezi-...@googlegroups.com
On 3/11/15 7:23 AM, Paul Stubbe wrote:

> - If a page did not get re-indexed after, say, three weeks, I would like Dezi to forget about it.
> (I can probably add an "indexed date" as meta information and run a process after three
> weeks to delete those links that Dezi finds.)
>


I think you could achieve that pretty well with the Lucy lib directly:

https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Index/Indexer.pod#delete_by_query-query

The "swishlastmodified" field stores the indexing time as an epoch value.

You might try something like this:

#!/usr/bin/env perl
use strict;
use warnings;

use SWISH::Prog::Lucy::Indexer;
use SWISH::Prog::Lucy::Searcher;

my $index   = 'dezi.index';
my $indexer = SWISH::Prog::Lucy::Indexer->new( invindex => $index );

# underlying Lucy::Index::Indexer, so we can call delete_by_query()
my $lucy_indexer = $indexer->get_lucy;

# cutoff epoch: optional first argument, default three weeks ago
my $three_weeks_ago = shift(@ARGV) || time() - ( 86400 * 21 );

# the field is stored as a string, so zero-pad the lower bound to the
# same width to make the lexical range comparison behave numerically
my $padded_zeros = '0' x length($three_weeks_ago);
my $query = qq/swishlastmodified=($padded_zeros .. $three_weeks_ago)/;
print "$query\n";

# find everything last indexed before the cutoff ...
my $searcher = SWISH::Prog::Lucy::Searcher->new( invindex => $index );
my $results  = $searcher->search($query);
printf( "will delete %s documents\n", $results->hits );

# ... then delete it, re-using the parsed query against the Lucy index
$lucy_indexer->delete_by_query( $results->query->as_lucy_query );
$lucy_indexer->commit();


--
Peter Karman . www.peknet.com . @peterkarman