Dezi search


raja raja

Dec 4, 2017, 1:09:18 PM
to dezi-...@googlegroups.com
Hi,

I am starting to explore Dezi (https://metacpan.org/pod/Dezi::Client ; https://metacpan.org/pod/distribution/Dezi/bin/dezi) for my full-text search requirements. The tool looks pretty cool with no external dependencies. My dataset (with millions of rows), in a simplified form, has a format that looks something like this (tab delimited):

ShopID ProdType ProdID ProdName
1 Shirt S1 Shirt1
1 Shirt S2 Shirt2
1 Jeans J1 Jeans1
2 Shirt S1 Shirt1
2 Jeans J1 Jeans1
3 Shirt S2 Shirt2
3 Belts B1 Belt1

My task is to count the number of unique shops (along with their IDs) that satisfy queries, such as:

i) (ProdType:Shirt AND ProdID:S1 AND ProdName:Shirt1) AND (ProdType:Shirt AND ProdID:S2 AND ProdName:Shirt2)
ii) (ProdType:Shirt AND ProdID:S1 AND ProdName:Shirt1) AND (ProdType:Belts AND ProdID:B1 AND ProdName:Belt1)
iii) (ProdType:Shirt AND ProdID:S1 AND ProdName:Shirt1) AND (ProdType:Belts AND ProdID:B1 AND ProdName:Belt1) AND (ProdType:Jeans AND ProdID:J1 AND ProdName:Jeans1)

This task can probably be done in other ways, but Dezi looks appealing to me from a broader perspective, so I would like to explore and learn it first.

I understand that I have to first index the data and then run queries on the indexed data.

I tried the code below, but it has two problems: i) it is a little slow if you have millions of records; ii) it considers each row as one document, while I guess I probably want all data associated with one shop (with unique ShopID) be considered as one doc (this I guess will help me in getting the right counts on shops when I do the search on the index later).

Any suggestions, how to go about it?

Thank you,
Raja



#!/perl/bin/perl
use strict;
use warnings;
use Dezi::Client;
use Dezi::Doc;

# Expect exactly one argument: the tab-delimited input file.
if (@ARGV != 1) {
    die "usage: $0 inputfile\n";
}

my $client    = Dezi::Client->new(server => 'http://localhost:5000');
my $inputfile = $ARGV[0];
my $count     = 0;

open(my $fh, '<', $inputfile) or die "Can't open < $inputfile: $!";

my $header = <$fh>;    # read and discard the header line

while (my $line = <$fh>) {
    chomp $line;
    $count++;
    my @array = split(/\t/, $line);

    # One document per input row, with the row number as its URI.
    my $doc = Dezi::Doc->new(
        uri => "$count",
    );
    $doc->set_field('Col1' => "$array[0]");
    $doc->set_field('Col2' => "$array[1]");
    $doc->set_field('Col3' => "$array[2]");
    $doc->set_field('Col4' => "$array[3]");

    # Each call is a separate HTTP request to the Dezi server.
    $client->index( $doc );
}

close($fh);

Peter Karman

Dec 9, 2017, 11:39:07 PM
to dezi-...@googlegroups.com
raja raja wrote on 12/4/17 12:09 PM:
> Hi,
>
> I am starting to explore Dezi (https://metacpan.org/pod/Dezi::Client ; https://metacpan.org/pod/distribution/Dezi/bin/dezi) for my full-text search requirements. The tool looks pretty cool with no external dependencies. My dataset (with millions of rows), in a simplified form, has a format that looks something like this (tab delimited):
>
> ShopID ProdType ProdID ProdName
> 1 Shirt S1 Shirt1
> 1 Shirt S2 Shirt2
> 1 Jeans J1 Jeans1
> 2 Shirt S1 Shirt1
> 2 Jeans J1 Jeans1
> 3 Shirt S2 Shirt2
> 3 Belts B1 Belt1
>
> My task is to count the number of unique shops (along with their IDs) that satisfy queries, such as:
>
> i) (ProdType:Shirt AND ProdID:S1 AND ProdName:Shirt1) AND (ProdType:Shirt AND ProdID:S2 AND ProdName:Shirt2)
> ii) (ProdType:Shirt AND ProdID:S1 AND ProdName:Shirt1) AND (ProdType:Belts AND ProdID:B1 AND ProdName:Belt1)
> iii) (ProdType:Shirt AND ProdID:S1 AND ProdName:Shirt1) AND (ProdType:Belts AND ProdID:B1 AND ProdName:Belt1) AND (ProdType:Jeans AND ProdID:J1 AND ProdName:Jeans1)
>
> This task can probably be done in other ways, but Dezi looks appealing to me from a broader perspective, so I would like to explore and learn it first.


If your queries really look like that, it doesn't seem like you really need
full-text search. That problem seems better suited to a RDBMS like MySQL or
PostgreSQL. You could even prototype it with SQLite.
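
As a rough illustration of the SQLite prototype idea, here is a minimal sketch using DBI with DBD::SQLite (both need to be installed from CPAN). The table name "products", the lowercase column names, and the helper shops_with_all() are my own assumptions for the example, not anything Dezi or your data dictates. It loads the sample rows into an in-memory database and returns the shops that stock every requested product.

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database; table and column names are illustrative only.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE products (shopid INTEGER, prodtype TEXT, prodid TEXT, prodname TEXT)');

my $ins = $dbh->prepare('INSERT INTO products VALUES (?, ?, ?, ?)');
$ins->execute(@$_) for (
    [1, 'Shirt', 'S1', 'Shirt1'],
    [1, 'Shirt', 'S2', 'Shirt2'],
    [1, 'Jeans', 'J1', 'Jeans1'],
    [2, 'Shirt', 'S1', 'Shirt1'],
    [2, 'Jeans', 'J1', 'Jeans1'],
    [3, 'Shirt', 'S2', 'Shirt2'],
    [3, 'Belts', 'B1', 'Belt1'],
);

# Hypothetical helper: return the IDs of shops that stock every product in
# @wanted, where each product is given as [prodtype, prodid, prodname].
sub shops_with_all {
    my ($dbh, @wanted) = @_;
    return () unless @wanted;

    # Match rows for any wanted product, then keep only the shops that
    # matched all of them.
    my $where = join ' OR ',
        ('(prodtype = ? AND prodid = ? AND prodname = ?)') x @wanted;

    # The count is written into the SQL as an integer literal because
    # DBD::SQLite binds placeholder values as text by default, which does
    # not compare equal to the integer result of COUNT().
    my $sql = sprintf(
        "SELECT shopid FROM products WHERE $where "
      . "GROUP BY shopid "
      . "HAVING COUNT(DISTINCT prodtype || '/' || prodid || '/' || prodname) = %d "
      . "ORDER BY shopid",
        scalar @wanted,
    );
    return @{ $dbh->selectcol_arrayref($sql, undef, map { @$_ } @wanted) };
}

# Query (i) from the original post: shops selling both Shirt1 and Shirt2.
my @shops = shops_with_all($dbh, ['Shirt', 'S1', 'Shirt1'], ['Shirt', 'S2', 'Shirt2']);
print scalar(@shops), " shop(s): @shops\n";
```

The GROUP BY / HAVING pair is what turns "rows that match any of the products" into "shops that match all of them", which is the part a row-per-document full-text index does not give you directly.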

Still if you want to use Dezi, I have comments below.




>
> I understand that I have to first index the data and then run queries on the indexed data.
>
> I tried the code below, but it has two problems: i) it is a little slow if you have millions of records;


Using Dezi and Dezi::Client is the slowest way to index. That mechanism is the
easiest to use without writing much code (as you've discovered) but the HTTP
overhead of server/client communication is significant.

There are a couple of alternative ways to speed things up.

(a) Turn off auto_commit.

When you run the dezi server, pass it the --no-ac option (or the equivalent in
your dezi server config file) which means a new indexer is not spawned for each
POST to the server. That saves a lot of overhead, but it means each doc is not
committed atomically.

Instead, you must call the "commit" command manually. I've tweaked your example
code here:

https://gist.github.com/karpet/a4a0dae27fb5360fc24b16f6de0e9e85

Pay attention to the $BATCH_SIZE which will commit() periodically to flush the
buffer. If you really have millions of documents, that is important. You can
tweak the value as you need to.
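
The batching idea can be sketched roughly as below. This is not a copy of the linked gist. The helper name index_in_batches() is hypothetical, and it assumes the $client object passed in provides index() and commit() methods, as a Dezi::Client talking to a server with auto-commit turned off is expected to.

```perl
use strict;
use warnings;

# Hypothetical helper: send every document to the server and commit after
# each full batch, so the server's buffer is flushed periodically.
# $client is assumed to provide index() and commit() methods.
sub index_in_batches {
    my ($client, $docs, $batch_size) = @_;
    my $pending = 0;    # docs sent since the last commit

    for my $doc (@$docs) {
        $client->index($doc);
        if (++$pending >= $batch_size) {
            $client->commit;
            $pending = 0;
        }
    }

    # Flush whatever is left over from the final, partial batch.
    $client->commit if $pending;
    return;
}
```

With a real client this would be called along the lines of index_in_batches($client, \@docs, 10_000), where the batch size is just a starting guess to tune. For millions of rows you would stream the file and keep the same counter logic inside the read loop rather than building @docs in memory.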

(b) Use a custom aggregator class with Dezi::App.

This is the fastest possible way, because there is no server/client HTTP
communication. Everything runs in the local process.

I've included an example here:

https://gist.github.com/karpet/e332b9e1f273aadea2d9be4657f3c451

which is much the same code as you had, but as an Aggregator subclass. There's a
comment on the gist that explains how to run it.

Once you've created the index, you can make it available for searching using the
dezi server just as you would with the server/client setup.


A couple things to note:

* You need to use the field name you intend to search under as the set_field()
key value. You were using "Col1" etc instead of the column field name itself.

* Dezi will lowercase all field names, so use "shopid" instead of "ShopID" for
consistency.
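
One way to follow both notes without hard-coding the names is to derive the field names from the header line. A minimal sketch in core Perl (the helper name row_to_fields() is made up for this example):

```perl
use strict;
use warnings;

# Hypothetical helper: pair each value in a tab-delimited row with the
# lowercased column name from the header line.
sub row_to_fields {
    my ($header, $line) = @_;
    chomp($header, $line);
    my @names  = map { lc } split /\t/, $header;
    my @values = split /\t/, $line;

    my %fields;
    @fields{@names} = @values;    # hash slice: name => value
    return \%fields;
}

my $fields = row_to_fields("ShopID\tProdType\tProdID\tProdName\n",
                           "1\tShirt\tS1\tShirt1\n");
print join(', ', map { "$_=$fields->{$_}" } sort keys %$fields), "\n";
```

Each key/value pair in the returned hash can then be passed to set_field() in place of the 'Col1' .. 'Col4' calls.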


> ii) it considers each row as one document, while I guess I probably want all
> data associated with one shop (with unique ShopID) be considered as one doc
> (this I guess will help me in getting the right counts on shops when I do the
> search on the index later).
>
> Any suggestions, how to go about it?
>

You have a couple options.

If you only ever want to return a "ShopID" as your "document", then you should
wrangle your data into that shape before you push it into the index. That is
part of my clue that you really want a SQL solution since that gives you that
control without creating a virtual "document" like Dezi requires.
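
For the first option, the reshaping step might look something like the sketch below, in core Perl only. The helper name group_by_shop() and the "type/id/name" string used to represent each product are assumptions made for the example. How the grouped values are then stored in a single document per shop is left open here.

```perl
use strict;
use warnings;

# Hypothetical helper: read tab-delimited rows (header line first) from a
# filehandle and collect them into one record per shop.
# Returns a hash ref: shopid => [ "prodtype/prodid/prodname", ... ]
sub group_by_shop {
    my ($fh) = @_;
    my $header = <$fh>;    # read and discard the header line
    my %shops;

    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;    # skip blank lines
        my ($shopid, $prodtype, $prodid, $prodname) = split /\t/, $line;
        push @{ $shops{$shopid} }, join '/', $prodtype, $prodid, $prodname;
    }
    return \%shops;
}
```

Each key of the returned hash would then become one document, with the shop ID as its URI, so that a search hit corresponds to a shop rather than to a single row. Note that this holds every shop in memory at once. If the input is already sorted by ShopID, the same loop can emit a document each time the ID changes instead.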

Or, if you really want your "document" in the format you are currently using,
you can use the "facets" feature of the Dezi server to return unique ShopID
values for any given search resultset.

See https://metacpan.org/pod/Dezi::Config and search for "facets".

HTH,
pek


--
Peter Karman . he/him/his . https://karpet.github.io/

raja raja

Dec 12, 2017, 2:12:25 PM
to dezi-...@googlegroups.com
Thank you much, Peter for your detailed reply. Will absorb n work on it now. One reason for me to explore Dezi is that I also have unstructured text (e.g. reviews) alongside the data with format shown below, so I thought of using one common platform for this. Also, Dezi looks pretty light-weight with no-frills approach.

Best,
Raja



Peter Karman

Dec 12, 2017, 2:28:39 PM
to dezi-...@googlegroups.com
raja raja wrote on 12/12/17 1:12 PM:
> Thank you much, Peter for your detailed reply. Will absorb n work on it now. One reason for me to explore Dezi is that I also have unstructured text (e.g. reviews) alongside the data with format shown below, so I thought of using one common platform for this. Also, Dezi looks pretty light-weight with no-frills approach.

That's very reasonable.