Python client 0.1.0 available


Peter Karman

Aug 24, 2012, 2:52:50 PM
to dezi-...@googlegroups.com
Finally got back to this project:

https://github.com/karpet/dezi-client-python

Peter Karman

Aug 4, 2015, 10:27:19 PM
to arjan, dezi-...@googlegroups.com
arjan wrote on 8/4/15, 8:36 PM:
> Dear Peter,
>
> Tried to post this e-mail to dezi-...@googlegroups.com, but although I'm
> subscribed to this group, I don't have permission to post to it. (see attachment)
>


Hi Arjan,

I just sent you an invite from the google list. I didn't see your email address
on the subscriber list.

I'm going to reply inline and cc: the list, to save you the trouble of
re-posting. I like to make sure my replies are on-list when possible so that
google will remember things when I immediately forget them. :)


> In response to my bug-report on rt.cpan.org on the documentation you not only
> reported it fixed - thanks for that - but you also said that Dezi::Lucy is a
> more modern version of the SWISH::Prog::Lucy module.


It's more modern in the sense that Dezi::App is a rewrite using Moose. As of
now, both Dezi::Lucy and SWISH::Prog::Lucy read/write identical indexes, and can
be used interchangeably.


>
> I used to work with Lucy for years, although not the last two. I know Solr. And I
> understand the value of a search engine that for most tasks only needs
> configuration, but can be extended later. So I can see the rationale for Dezi.
> I'm diving into it now.


cool. thanks in advance for the quality feedback below.

>
> From the documentation I distilled:
> - Dezi::Lucy is from 2014. The world of Dezi::Lucy with Dezi::Lucy::Indexer,
> Lucy::Indexer and Lucy::Index::Indexer replaces the world of SWISH::Prog::Lucy
> from 2009.
> - Dezi::Lucy is based on SWISH in the sense that it always uses SWISH::3 in the
> aggregation for the XML and HTML parsing. It does not have to use SWISH-e as
> its engine. It uses Lucy by default, where Dezi::Lucy::Indexer creates the
> schema with Lucy::Plan::Schema. Correct?


Perfectly correct.


>
> There are a few things that I find hard to distill from the documentation:
> - Are MetaNames stripped of special characters? (I used to store a separate
> searchstring where special characters were translated to ASCII, so I would find
> results both with and without special characters.)

That is a good technique and I have used it myself. See
Search::Tools::Transliterate e.g.

There is no such feature built-in to Dezi. You would approach it much the same
way as you have in the past: create separate fields, one with full UTF-8 and one
with ASCII only. You could search both fields by default by configuring your
Search::Query::Parser with an array of 'default_field' values. See
https://metacpan.org/pod/Search::Query::Parser


> - Is there a simple way to write out the Schema that Dezi creates? (I used to
> document this like below)

You can view the Lucy Schema files in the index itself, which are written out as
JSON.

If you want to create the Schema yourself, in Perl rather than via a Dezi config
file, then you may already have feature requirements that are beyond what Dezi
does for you. Dezi's bias is configuration-over-code.

Here are some general notes:
https://metacpan.org/pod/Dezi::Lucy::Indexer#MetaNames-and-PropertyNames

Here's the code that turns a Dezi::Indexer::Config into a Lucy Schema:

https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Lucy/Indexer.pm#L100

which calls this method:

https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Lucy/Indexer.pm#L233

> - How do I instruct Dezi to create a new index?

Dezi will always create a new index if one does not already exist. You set the
index path via a Dezi::Lucy::InvIndex object (if you're writing your own code),
or with the deziapp -i option (if you're just using the cli) or via a config
file (if you're using the Dezi server):

https://metacpan.org/pod/Dezi::Config (see the engine_config section).



> - What is the relationship between PropertyNames and MetaNames on the one hand
> and facets and fields as defined in Dezi::Config on the other? I do understand
> that both PropertyNames and MetaNames are searchable and only PropertyNames are
> stored. Both are used to create the schema in Dezi::Lucy::Indexer. But how do
> the fields and facets that are defined in the engine_config in Dezi::Config
> relate to them and to the Schema?

Great questions.

A Lucy 'field' is defined by a MetaName and/or PropertyName. See the link above
for how defining one or both affects the attributes of the Lucy field.

The MetaName and PropertyName terms are legacy from the Swish-e application. I
think of them this way: a MetaName defines what you can search, and a
PropertyName defines what is returned from results. Often a field definition is
both a MetaName and a PropertyName but it is reasonable to define a field as
just one or the other, depending on your needs.

E.g.:

MetaNames foo

would mean I could search for:

query => 'foo:bar'

but the raw field value of 'foo' is not stored in the index and cannot be
returned as a result value, because 'foo' is not also defined as a PropertyName.

Likewise, if I defined my config with:

PropertyNames foo

then I could not search for 'foo:bar' but I could return the value of the 'foo'
field in my results.

So in practice, I pair MetaNames and PropertyNames together about 99% of the time.

MetaNames foo
PropertyNames foo


The 'fields' defined in an engine_config are PropertyNames. They are the values
that are returned in results.

The 'facets' defined in an engine_config are also PropertyNames: they are
aggregated and counted at search time based on the results from a query.
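To make that split concrete, here is a sketch of a config that ties the pieces together. The field names 'title', 'author' and 'topic' are made-up examples, and the exact shape of the facets entry may differ from what Dezi::Config actually expects; treat this as an illustration, not a verified config:

```perl
# Hypothetical Dezi config sketch -- field names are invented examples.
{
    engine_config => {
        # search-time: which stored values come back with each result
        fields => [qw( title author topic )],

        # search-time: aggregated and counted per query; must be a
        # subset of the PropertyNames below (exact shape of this entry
        # is an assumption -- check Dezi::Config)
        facets => [qw( topic )],

        indexer_config => {
            # index-time: MetaNames define what is searchable,
            # PropertyNames define what is stored and returnable
            MetaNames     => 'title author topic',
            PropertyNames => 'title author topic',
        },
    },
}
```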



> - What is the regex used by the tokenizer? Is it different from
> '\w+(?:[\x{2019}\x{0026}\']\w+)*'? That's what I used to use.
>

Dezi avoids tokenizing when used with Lucy.
https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Indexer.pm#L46

SWISH::3 can tokenize, but in the interest of consistency and speed, the
SWISH::3 tokenizers are turned off in favor of Lucy's native tokenizing.

The default values of the Lucy::Analysis::RegexTokenizer are used:
https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Lucy/Indexer.pm#L109

According to the Lucy docs, that is '\w+(?:[\x{2019}']\w+)*'
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/RegexTokenizer.pod
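For anyone curious what that default pattern actually does, here is a tiny pure-Perl check (no Lucy required) that applies the same regex by hand; the `tokenize` helper is just for illustration:

```perl
use strict;
use warnings;

# The default Lucy::Analysis::RegexTokenizer pattern, per the Lucy docs:
# runs of word characters, optionally joined by an apostrophe
# (ASCII ' or the U+2019 right single quote).
my $token_re = qr/\w+(?:[\x{2019}']\w+)*/;

sub tokenize {
    my ($text) = @_;
    return $text =~ /$token_re/g;    # list of matched tokens
}

my @tokens = tokenize("don't over-tokenize");
# "don't" stays one token; "over-tokenize" splits on the hyphen.
print join( "|", @tokens ), "\n";    # don't|over|tokenize
```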

If you needed to override the defaults for Dezi::Lucy::Indexer, I would gladly
accept a patch that added a 'token_pattern' modifier to that class. A PR via
github is always nicest for me.


> Great work by the way. Perhaps make more clear in the documentation that
> everything is easy to experiment with using bin/deziapp from the command line
> and bin/dezi and bin/dezi-client from the browser. And that from these, the path
> to the different modules is easy to follow.
>

That's a great point. I'm definitely too close to the code to see the gaps in
the documentation. Any/all improvements welcome.

HTH.

cheers,
pek



>
> sub _build_msg_schema {
>     my $self   = shift;
>     my $schema = Lucy::Plan::Schema->new;
>
>     # Analyzers:
>     my $case_folder = Lucy::Analysis::CaseFolder->new;
>     my $tokenizer   = Lucy::Analysis::RegexTokenizer->new(
>         pattern => '\w+(?:[\x{2019}\x{0026}\']\w+)*'
>     );
>     my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
>         analyzers => [ $case_folder, $tokenizer ],
>     );
>
>     # Field Types:
>     # message_id/source_id/vectororder
>     my $type_id = Lucy::Plan::StringType->new(
>         indexed  => 1,
>         stored   => 1,
>         sortable => 1,
>     );
>     # issued
>     my $type_id_sortable = Lucy::Plan::StringType->new(
>         indexed  => 1,
>         stored   => 0,
>         sortable => 1,
>     );
>     # summary/description/searchstring
>     my $type_text = Lucy::Plan::FullTextType->new(
>         analyzer => $polyanalyzer,
>         indexed  => 1,
>         stored   => 0,
>         sortable => 0,
>     );
>     # title
>     my $type_text_stored = Lucy::Plan::FullTextType->new(
>         analyzer => $polyanalyzer,
>         indexed  => 1,
>         stored   => 1,
>         sortable => 0,
>     );
>     # url
>     my $type_text_si = Lucy::Plan::FullTextType->new(
>         analyzer => $polyanalyzer,
>         indexed  => 1,
>         stored   => 1,
>         sortable => 0,
>     );
>
>     # Schema (implicit: document-id from Lucy itself)
>     # flags -- String:   index / store / sortable / none
>     #          FullText: tokenize / case / stemming / lang-stop
>     $schema->spec_field( name => 'message_id',   type => $type_id );          # Y Y Y Y
>     $schema->spec_field( name => 'vectororder',  type => $type_id );          # Y Y Y Y
>     $schema->spec_field( name => 'issued',       type => $type_id );          # Y Y Y Y
>     $schema->spec_field( name => 'title',        type => $type_text_stored ); # Y Y N N / Y Y N N
>     $schema->spec_field( name => 'summary',      type => $type_text );        # Y N N N / Y Y N N
>     $schema->spec_field( name => 'description',  type => $type_text );        # Y N N N / Y Y N N
>     $schema->spec_field( name => 'searchstring', type => $type_text );        # Y N N N / Y Y N N
>     $schema->spec_field( name => 'url',          type => $type_text_si );     # Y Y N Y
>
>     return $schema;
> }
>


--
Peter Karman . http://peknet.com/ . pe...@peknet.com

arjan

Aug 7, 2015, 7:53:24 PM
to dezi-...@googlegroups.com
Dear Peter,

Thank you very much for your reply.

To start a bit off-topic, if I'm allowed - although I believe understanding Unicode is quite key to good searching.
I did not know Search::Tools::Transliterate. In general, the goal is to be able to find a word containing special characters whether it is searched for with or without those special characters. A special character, such as an "n" with a tilde, can be written in Unicode either as a single code point, U+00F1, or as a combination of code points, U+006E (n) plus U+0303 (the combining tilde). The first is called a precomposed character, the latter a decomposed one. Normalization Form D rewrites all precomposed diacritic characters into their decomposed form; for this, the NFD function from the core module Unicode::Normalize is available. After that, the general category Mark can be removed with the simple regex s/\p{M}//g. Or in code:

sub _normalize_form_d {
    my $searchstring    = shift;
    # normalize form d using Unicode::Normalize
    # rewrites single character diacritic characters into composed diacritic
    # character: character + symbol as mark:
    $searchstring = Unicode::Normalize::NFD( $searchstring );
    # remove general class mark:
    $searchstring =~ s/\p{M}//g;

    return $searchstring;
}

If both the original string and its NFD-normalized, mark-stripped form are indexed and searched, a search with or without special characters will result in the same hit.
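As a quick sanity check of that decompose-then-strip step, it can be exercised with core modules only; the `strip_marks` name below is just for this example:

```perl
use strict;
use warnings;
use Unicode::Normalize ();

# Same technique as the _normalize_form_d sub above: NFD decomposes
# precomposed characters (U+00F1) into base character + combining mark
# (U+006E U+0303), then the \p{M} substitution strips the marks.
sub strip_marks {
    my ($s) = @_;
    $s = Unicode::Normalize::NFD($s);
    $s =~ s/\p{M}//g;
    return $s;
}

print strip_marks("se\x{00F1}or"), "\n";    # senor
```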

In Search::Tools::Transliterate, you use an explicit map written by Markus Kuhn. I suppose you do more or less the same with this map. Am I correct? Unicode::Normalize uses Perl's core Unicode database in /lib/Unicode. Do you have a special reason to use Markus Kuhn's map? Or is it that you want to do some UTF-8 corrections at the same time?

MetaNames are not stripped of special characters. Thanks! Good to know.
Lucy Schema is written as JSON in the InvIndex folder. Thanks!
As for how the schema is written, I found my answers in more detail in sub _get_lucy_field_type in Dezi::Lucy::Indexer.

There are still some things however, that I don't understand or am not sure of, despite your very clear explanations.

1. If I define both PropertyNames and fields in the config, defining the fields is of no use. The definition of PropertyNames takes precedence over the definition of fields. Correct? Are fields a synonym for PropertyNames?
2. Facets, on the other hand, have their values aggregated at search time. Defining facets is never redundant to PropertyNames. Defining PropertyNames is only redundant to facets in so far as there are PropertyNames that do not need to be used as a search filter. Correct?
3. Is there a simple way to create a Dezi::Indexer::Doc object? Or more specifically how to add the swish3_handler? I thought I found this:

my $aggregator = Dezi::Aggregator->new( set_parser_from_type => 1 );
$aggregator->swish_filter($doc);

But I still get: Dezi::Indexer=HASH(0x41a76f0) must implement swish3_handler in this test script:

my $invindex = Dezi::InvIndex->new( path => '<path>' );
my $config   = Config::Any->load_files( {
    files   => ['<path>'],
    use_ext => 1,
} )->[0]->{'<path>'};

my $indexer = Dezi::Indexer->new(
         invindex    => $invindex,
         config      => Dezi::Indexer::Config->new(%$config),
         count       => 0,
         clobber     => 1,
         flush       => 10000,
         started     => time()         
);


my $xml     = '<some xml>';
my $length  = length $xml;

my $doc = Dezi::Indexer::Doc->new(
    version    => 3,
    url        => '<some url>',
    content    => $xml,
    size       => $length,
    type       => 'application/xml',
    modtime    => time()       
);

my $aggregator = Dezi::Aggregator->new( set_parser_from_type => 1 );
$aggregator->swish_filter($doc);

$indexer->start();
$indexer->process( $doc );
$indexer->finish();

Kind regards,
Arjan.
-- 
Met vriendelijke groet,
Arjan Widlak

Bezoek onze site op:
http://www.unitedknowledge.nl

De rijkshuisstijl, ook voor tablet en iPhone:
http://www.rijkshuisstijl.unitedknowledge.nl/

United Knowledge, inhoud en techniek 
Bilderdijkstraat 79N
1053 KM Amsterdam
T +31 (0)20 737 1851
F +31 (0)84 877 0399
bur...@unitedknowledge.nl
http://www.unitedknowledge.nl

M +31 (0)6 2427 1444
E ar...@unitedknowledge.nl

We use WebGUI, the Open Source CMS
http://www.webgui.org/

Peter Karman

Aug 11, 2015, 11:31:55 AM
to dezi-...@googlegroups.com
arjan wrote on 8/7/15, 6:53 PM:

> If both the original string and its NFD-normalized, mark-stripped form are
> indexed and searched, a search with or without special characters will result
> in the same hit.
>
> In Search::Tools::Transliterate, you use an explicit map written by Markus Kuhn.
> I suppose you do more or less the same with this map. Am I correct?
> Unicode::Normalize uses Perl's core Unicode database in /lib/Unicode. Do you
> have a special reason to use Markus Kuhn's map? Or is it that you want to do
> some UTF-8 corrections at the same time?


Search::Tools::Transliterate substitutes ASCII for non-ASCII characters. It does
not remove the non-ASCII characters, unless configured to do so by the mapping.

So \xf1 becomes 'n'.

Unicode::Normalize could certainly be used in cooperation with
S::T::Transliterate. Here's your example code, amended:

my $transliterator = Search::Tools::Transliterate->new();

sub _normalize_form_d {
    my $searchstring = shift;

    # normalize form D using Unicode::Normalize:
    # rewrites precomposed diacritic characters into
    # base character + combining mark:
    $searchstring = Unicode::Normalize::NFD( $searchstring );

    return $transliterator->convert( $searchstring );
}


> 1. If I define both PropertyNames and fields in the config, defining the fields
> is of no use. The definition of PropertyNames takes precedence over the
> definition of fields. Correct? Are fields a synonym for PropertyNames?

Incorrect. PropertyNames are an indexer config option. fields are a searcher
config option. Your 'fields' definition may be a subset of the PropertyNames.

e.g.

engine_config => {
    fields => [qw( foo bar )],    # search-time

    indexer_config => {           # index-time
        PropertyNames => 'foo bar baz',
    },
}


> 2. Facets on the other hand, their values are aggregated at search time.
> Definition of facets is never redundant to PropertyNames. Definition of
> PropertyNames is only redundant to facets in so far there are PropertyNames that
> do not need to be used as a search filter. Correct?

Correct. The facet names should be a (subset) list of PropertyName values.

> 3. Is there a simple way to create a Dezi::Indexer::Doc object? Or more
> specifically how to add the swish3_handler? I thought I found this:
>

> my $indexer = Dezi::Indexer->new(
>     invindex => $invindex,
>     config   => Dezi::Indexer::Config->new(%$config),
>     count    => 0,
>     clobber  => 1,
>     flush    => 10000,
>     started  => time(),
> );


You want to use Dezi::Lucy::Indexer instead.

Dezi::Indexer is an abstract base class.