arjan wrote on 8/4/15, 8:36 PM:
> Dear Peter,
>
> Tried to post this e-mail to dezi-...@googlegroups.com, but although I'm
> subscribed to this group, I don't have permission to post to it. (see attachment)
>
Hi Arjan,
I just sent you an invite from the google list. I didn't see your email address
on the subscriber list.
I'm going to reply inline and cc: the list, to save you the trouble of
re-posting. I like to make sure my replies are on-list when possible so that
google will remember things when I immediately forget them. :)
> In response to my bug-report on rt.cpan.org on the documentation you not only
> reported it fixed - thanks for that -, but you also said that Dezi::Lucy is a
> more modern version of the SWISH::Prog::Lucy module.
It's more modern in the sense that Dezi::App is a rewrite using Moose. As of
now, both Dezi::Lucy and SWISH::Prog::Lucy read/write identical indexes, and can
be used interchangeably.
>
> I used to work with Lucy for years, although not the last two. I know Solr. And I
> understand the value of a search engine that for most tasks only needs
> configuration, but can be extended later. So I can see the rationale for Dezi.
> I'm diving into it now.
cool. thanks in advance for the quality feedback below.
>
> From the documentation I distilled:
> - Dezi::Lucy is from 2014. The world of Dezi::Lucy with Dezi::Lucy::Indexer,
> Lucy::Indexer and Lucy::Index::Indexer replaces the world of SWISH::Prog::Lucy
> from 2009.
> - Dezi::Lucy is based on SWISH in the sense that it always uses SWISH::3 in the
> aggregation for the XML and HTML parsing. It does not have to use SWISH-e as
> its engine. It uses Lucy by default, where Dezi::Lucy::Indexer creates the
> schema with Lucy::Plan::Schema. Correct?
Perfectly correct.
>
> There are a few things that I find hard to distill from the documentation:
> - Are MetaNames stripped of special characters? (I used to store a separate
> searchstring where special characters were translated to ASCII, so I would find
> results both with and without special characters.)
That is a good technique and I have used it myself. See
Search::Tools::Transliterate e.g.
There is no such feature built-in to Dezi. You would approach it much the same
way as you have in the past: create separate fields, one with full UTF-8 and one
with ASCII only. You could search both fields by default by configuring your
Search::Query::Parser with an array of 'default_field' values. See
https://metacpan.org/pod/Search::Query::Parser
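A minimal sketch of building the ASCII-only companion value, using only core
Perl. Note the assumptions: ascii_fold() is a hypothetical helper name, and it
swaps in Unicode::Normalize decomposition rather than
Search::Tools::Transliterate, so characters with no decomposition (e.g. 'ø')
are simply dropped instead of being mapped to a lookalike.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper (not part of Dezi): fold a Unicode string to ASCII
# by decomposing accented characters and discarding what is left over.
sub ascii_fold {
    my ($text) = @_;
    my $folded = NFKD($text);        # e.g. U+00E9 becomes "e" + U+0301
    $folded =~ s/\p{Mn}+//g;         # strip the combining marks
    $folded =~ s/[^\x00-\x7F]//g;    # drop anything still outside ASCII
    return $folded;
}

# One document, two values: the original text and its ASCII-only twin.
my %doc = ( title => "Caf\N{U+E9} Z\N{U+FC}rich" );
$doc{searchstring} = ascii_fold( $doc{title} );
```

Each value would then go into its own field, with both field names listed as
'default_field' values as described above.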
> - Is there a simple way to write out the Schema that Dezi creates? (I used to
> document this like below)
You can view the Lucy Schema files in the index itself, which are written out as
JSON.
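A rough sketch of dumping those files for documentation purposes. It assumes
the schema files inside the index directory are named schema_*.json, which you
should verify against your own index; read_schema_files() is a hypothetical
helper, not part of Dezi or Lucy.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical helper: decode every schema_*.json file in an index
# directory and return a hashref keyed by file name.
sub read_schema_files {
    my ($index_dir) = @_;
    opendir my $dh, $index_dir or die "Cannot open $index_dir: $!";
    my @names = sort grep { /\Aschema_.*\.json\z/ } readdir $dh;
    closedir $dh;

    my %schemas;
    for my $name (@names) {
        open my $fh, '<:raw', "$index_dir/$name"
            or die "Cannot read $index_dir/$name: $!";
        my $json = do { local $/; <$fh> };    # slurp the whole file
        close $fh;
        $schemas{$name} = JSON::PP::decode_json($json);
    }
    return \%schemas;
}
```

Pretty-printing one entry with JSON::PP->new->pretty->canonical->encode(...)
gives a stable text form that can be pasted into documentation.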
If you want to create the Schema yourself, in Perl rather than via a Dezi config
file, then you may already have feature requirements that are beyond what Dezi
does for you. Dezi's bias is configuration-over-code.
Here are some general notes:
https://metacpan.org/pod/Dezi::Lucy::Indexer#MetaNames-and-PropertyNames
Here's the code that turns a Dezi::Indexer::Config into a Lucy Schema:
https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Lucy/Indexer.pm#L100
which calls this method:
https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Lucy/Indexer.pm#L233
> - How do I instruct Dezi to create a new index?
Dezi will always create a new index if one does not already exist. You set the
index path via a Dezi::Lucy::InvIndex object (if you're writing your own code),
or with the deziapp -i option (if you're just using the cli) or via a config
file (if you're using the Dezi server):
https://metacpan.org/pod/Dezi::Config (see the engine_config section).
> - What is the relationship between PropertyNames and MetaNames on the one hand
> and facets and fields as defined in Dezi::Config on the other? I do understand
> that both PropertyNames and MetaNames are searchable and only PropertyNames are
> stored. Both are used to create the schema in Dezi::Lucy::Indexer. But how do
> the fields and facets that are defined in the engine_config in Dezi::Config
> relate to them and to the Schema?
Great questions.
A Lucy 'field' is defined by a MetaName and/or PropertyName. See the link above
for how defining one or both affects the attributes of the Lucy field.
The MetaName and PropertyName terms are legacy from the Swish-e application. I
think of them this way: a MetaName defines what you can search, and a
PropertyName defines what is returned from results. Often a field definition is
both a MetaName and a PropertyName but it is reasonable to define a field as
just one or the other, depending on your needs.
E.g.:
MetaNames foo
would mean I could search for:
query => 'foo:bar'
but the raw field value of 'foo' is not stored in the index and cannot be
returned as a result value, because 'foo' is not also defined as a PropertyName.
Likewise, if I defined my config with:
PropertyNames foo
then I could not search for 'foo:bar' but I could return the value of the 'foo'
field in my results.
So in practice, I pair MetaNames and PropertyNames together about 99% of the time.
MetaNames foo
PropertyNames foo
The 'fields' defined in an engine_config are PropertyNames. They are the values
that are returned in results.
The 'facets' defined in an engine_config are also PropertyNames: they are
aggregated and counted at search time based on the results from a query.
> - What is the regex used by the tokenizer? Is it different from
> '\w+(?:[\x{2019}\x{0026}\']\w+)*'? That's what I used to use.
>
Dezi avoids tokenizing when used with Lucy.
https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Indexer.pm#L46
SWISH::3 can tokenize, but in the interest of consistency and speed, the
SWISH::3 tokenizers are turned off in favor of Lucy's native features.
The default values of the Lucy::Analysis::RegexTokenizer are used:
https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Lucy/Indexer.pm#L109
According to the Lucy docs, that is '\w+(?:[\x{2019}']\w+)*'
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/RegexTokenizer.pod
If you needed to override the defaults for Dezi::Lucy::Indexer, I would gladly
accept a patch that added a 'token_pattern' modifier to that class. A PR via
github is always nicest for me.
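To see what the extra \x{0026} (ampersand) in your pattern changes, here is a
small sketch that applies both patterns with plain Perl regex matching. It does
not go through Lucy::Analysis::RegexTokenizer at all, and tokenize() is a
hypothetical helper, so treat it as an approximation of the token boundaries
only.

```perl
use strict;
use warnings;

# The default quoted above from the Lucy docs, and the variant with '&'.
my $lucy_default = qr/\w+(?:[\x{2019}']\w+)*/;
my $with_amp     = qr/\w+(?:[\x{2019}\x{0026}']\w+)*/;

# Hypothetical helper: return every non-overlapping match as a token.
sub tokenize {
    my ( $pattern, $text ) = @_;
    return [ $text =~ /$pattern/g ];
}

my $text = "AT&T isn't O\x{2019}Reilly";

my $default_tokens = tokenize( $lucy_default, $text );
my $amp_tokens     = tokenize( $with_amp,     $text );
```

With the default pattern "AT&T" splits into "AT" and "T"; with the ampersand
added to the character class it stays a single token. Apostrophes are handled
the same way by both.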
> Great work by the way. Perhaps make more clear in the documentation that
> everything is easy to experiment with using bin/deziapp from the command line
> and bin/dezi and bin/dezi-client from the browser. And that from these, the path
> to the different modules is easy to follow.
>
That's a great point. I'm definitely too close to the code to see the gaps in
the documentation. Any/all improvements welcome.
HTH.
cheers,
pek
>
> sub _build_msg_schema {
>     my $self   = shift;
>     my $schema = Lucy::Plan::Schema->new;
>
>     # Analysers:
>     my $case_folder = Lucy::Analysis::CaseFolder->new;
>     my $tokenizer   = Lucy::Analysis::RegexTokenizer->new(
>         pattern => '\w+(?:[\x{2019}\x{0026}\']\w+)*'
>     );
>     my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
>         analyzers => [ $case_folder, $tokenizer ],
>     );
>
>     # Field Types:
>     my $type_id = Lucy::Plan::StringType->new(
>         indexed  => 1,
>         stored   => 1,
>         sortable => 1
>     );    # message_id/source_id/vectororder
>     my $type_id_sortable = Lucy::Plan::StringType->new(
>         indexed  => 1,
>         stored   => 0,
>         sortable => 1
>     );    # issued
>     my $type_text = Lucy::Plan::FullTextType->new(
>         analyzer => $polyanalyzer,
>         indexed  => 1,
>         stored   => 0,
>         sortable => 0
>     );    # summary/description/searchstring
>     my $type_text_stored = Lucy::Plan::FullTextType->new(
>         analyzer => $polyanalyzer,
>         indexed  => 1,
>         stored   => 1,
>         sortable => 0
>     );    # title
>     my $type_text_si = Lucy::Plan::FullTextType->new(
>         analyzer => $polyanalyzer,
>         indexed  => 1,
>         stored   => 1,
>         sortable => 0
>     );    # url
>
>     # Schema                                  String                     Analyze Fulltext
>     # implicit: document-id from Lucy itself  index store sortable none  tokenize case stemming lang/stop
>     $schema->spec_field( name => 'message_id',   type => $type_id );          # Y Y Y Y
>     $schema->spec_field( name => 'vectororder',  type => $type_id );          # Y Y Y Y
>     $schema->spec_field( name => 'issued',       type => $type_id );          # Y Y Y Y
>     $schema->spec_field( name => 'title',        type => $type_text_stored ); # Y Y N N  Y Y N N
>     $schema->spec_field( name => 'summary',      type => $type_text );        # Y N N N  Y Y N N
>     $schema->spec_field( name => 'description',  type => $type_text );        # Y N N N  Y Y N N
>     $schema->spec_field( name => 'searchstring', type => $type_text );        # Y N N N  Y Y N N
>     $schema->spec_field( name => 'url',          type => $type_text_si );     # Y Y N Y
>
>     return $schema;
> };
>
--
Peter Karman . http://peknet.com/ . pe...@peknet.com