On 3/6/13 6:03 AM, Paul Stubbe wrote:
>
>
> On Friday, 12 October 2012 at 16:52:02 UTC+2, Peter Karman wrote:
>
> What I did was add this to my swish.config:
>
> MetaNames origin
>
> and then, since I was spidering a website, I added a doc-filter like
> this:
>
> % cat doc-filter.pl
> sub {
>     my $doc = shift;
>
>     my $buf = $doc->content;
>
>     # add origin meta value
>     $buf =~ s,</head>,<meta name="origin" value="mysite"/></head>,;
>
>     # reset the content
>     $doc->content($buf);
> }
>
> and then invoked that filter from swish3 cmd line:
>
> % swish3 -S spider -F lucy \
>     -f dezi.index -i http://mysite.foo \
>     -c swish.conf --doc_filter doc-filter.pl
>
>
> Peter,
>
> Can / Should I use your new "dezibot" to do the same thing? (Add
> meta info.)
>
> What do you propose?
>
Paul,

Thanks for the question.

The dezibot implementation is just a wrapper around the swish3 spider,
providing persistent caching and storage using DBI to allow for scaling
to multiple simultaneous crawls.

For your purposes, I still suggest the swish3 spider with doc_filter.
That keeps all your collections in a single index. You could also create
multiple indexes, one for each data store, which is similar to the
technique I'm using now at $work here:

https://www.publicinsightnetwork.org/?s=test

In those results, the Post results are indexed with the
dezi-for-wordpress plugin and the Query results are indexed outside of
WordPress. Dezi then serves up two indexes (one for each type, each with
the same schema) and provides integrated results via the WordPress
plugin on the site.

My Dezi config looks like:
{
    engine_config => {
        index         => [qw( pin.org.index queries.index )],
        parser_config => { query_dialect => 'Lucy', },
        facets        => {
            names => [qw( categories tags author type )]
        },
        do_not_hilite => {
            map { $_ => 1 }
                qw( permalink type categories tags author )
        },
        cache_ttl => 60,    # only cache facets a short time

        # result attributes in response
        fields => [
            qw( id permalink numcomments categories categoriessrch
                tags tagssrch author author_s type date modified
                displaydate displaymodified )
        ],
    },
}
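
If you go the single-index route with doc_filter and crawl more than one site, the filter could pick the origin value from each doc's URL rather than hard-coding it. This is only a rough sketch: the %origin_for map and its hostnames are made up for illustration, and it assumes the doc object passed to the filter has a url() accessor alongside content(). The small My::MockDoc class is there only so the snippet runs on its own, outside the spider.

```perl
use strict;
use warnings;

# Hypothetical host-to-origin map; these hostnames and labels are
# placeholders for illustration only.
my %origin_for = (
    'mysite.foo'    => 'mysite',
    'othersite.foo' => 'othersite',
);

# Same shape as the quoted doc-filter: take the doc, rewrite its content.
# Assumes the doc object provides url() and content() accessors.
my $filter = sub {
    my $doc = shift;

    # pull the hostname out of the doc URL
    my ($host) = ( $doc->url || '' ) =~ m{^https?://([^/:]+)}i;

    # fall back to 'unknown' for hosts not in the map
    my $origin = 'unknown';
    if ( defined $host && exists $origin_for{ lc $host } ) {
        $origin = $origin_for{ lc $host };
    }

    # add origin meta value and reset the content
    my $buf = $doc->content;
    $buf =~ s,</head>,<meta name="origin" value="$origin"/></head>,i;
    $doc->content($buf);
};

# Minimal stand-in for the spider's doc object, for local testing only.
{
    package My::MockDoc;
    sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub url { return $_[0]{url} }

    sub content {
        my $self = shift;
        $self->{content} = shift if @_;
        return $self->{content};
    }
}

my $doc = My::MockDoc->new(
    url     => 'http://mysite.foo/index.html',
    content => '<html><head><title>t</title></head><body/></html>',
);
$filter->($doc);
print $doc->content, "\n";
```

In a real doc-filter.pl you would keep only the map and the anonymous sub, with the sub as the last expression in the file, as in the quoted example.
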
HTH,
pek