The Return of SenseiBA?

Otis Gospodnetic

Jul 8, 2013, 5:07:40 PM

to sensei...@googlegroups.com

Hi,

I was looking for some info about int/float/etc... compression and
remembered SenseiBA was all about compression, so I wanted to look at
the docs/code.

Are there any signs, any hope of SenseiBA resurrecting from the ashes?

Thanks,
Otis
--
SENSEI Performance Monitoring -- http://sematext.com/spm

Volodymyr Zhabiuk

Jul 8, 2013, 6:20:37 PM

to sensei...@googlegroups.com

Hi Otis

The compression in SenseiBA was quite trivial and optimized for random
reads. I will ask the LinkedIn folks to reply on this thread about
when the LinkedIn open source approval process for SenseiBA will be
done.
Also, we could chat about your use case and see if I can provide any help.
Thanks,
Volodymyr

Lerchmo

Aug 5, 2013, 3:44:23 PM

to sensei...@googlegroups.com

Is it possible to post how Sensei stores its data?

Do you keep reverse indexes (bitmaps) for each dictionary value?

Or do you scan all of the columns in question for each query?

Also, what does the forward index achieve?

Thanks!

Volodymyr Zhabiuk

Aug 7, 2013, 2:06:26 AM

to sensei...@googlegroups.com

Hi Derek

Sure. Are you interested in Sensei or Sensei-BA?

Bitmaps for the inverted index take a lot of space if there are many distinct values in the column. Sensei leverages Lucene to store the inverted index: http://lucene.apache.org/core/3_5_0/fileformats.html#Inverted Indexing. It also maintains a forward index. You can think of it as a big int array where the index is the docid and the value is a reference to an element in the dictionary.
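To make the layout concrete, here is a minimal sketch of a dictionary-encoded column with a forward index. It is only an illustration, not Sensei's code: the function name `build_column` and the sample data are made up, and a plain dict of docid lists stands in for the Lucene postings.

```python
from collections import defaultdict


def build_column(values):
    """Dictionary-encode one column; values[i] is the value for docid i.

    Hypothetical helper for illustration only, not part of Sensei.
    """
    # Distinct values, kept sorted so dictionary ids follow value order.
    dictionary = sorted(set(values))
    value_to_id = {v: i for i, v in enumerate(dictionary)}

    # Forward index: position = docid, entry = dictionary id.
    forward = [value_to_id[v] for v in values]

    # Stand-in for the inverted index: dictionary id -> ascending docids.
    inverted = defaultdict(list)
    for docid, dict_id in enumerate(forward):
        inverted[dict_id].append(docid)

    return dictionary, forward, dict(inverted)


colors = ["red", "blue", "red", "green", "blue"]
dictionary, forward, inverted = build_column(colors)

# Value of docid 3, read through the forward index.
value_of_doc_3 = dictionary[forward[3]]
```

The forward index answers "what is the value of this docid?", while the inverted index answers "which docids have this value?".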

The forward index comes in handy when we need to run range queries or quickly retrieve a column's value by docid. The latter is needed to execute aggregation functions against the dataset.
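As a rough sketch of both uses, assuming the dictionary is kept sorted (an assumption for this example, as are the names `range_query` and `sum_over`): a value range then maps to a contiguous range of dictionary ids, and the forward index can be scanned with cheap integer comparisons.

```python
from bisect import bisect_left, bisect_right


def range_query(dictionary, forward, low, high):
    """Return docids whose value v satisfies low <= v <= high."""
    lo_id = bisect_left(dictionary, low)    # first dictionary id >= low
    hi_id = bisect_right(dictionary, high)  # one past the last id <= high
    return [docid for docid, dict_id in enumerate(forward)
            if lo_id <= dict_id < hi_id]


def sum_over(dictionary, forward, docids):
    """Aggregate by resolving each docid to its value via the forward index."""
    return sum(dictionary[forward[docid]] for docid in docids)


# Hypothetical price column: the values per docid are 10, 40, 5, 20, 10.
price_dictionary = [5, 10, 20, 40]
price_forward = [1, 3, 0, 2, 1]

hits = range_query(price_dictionary, price_forward, 10, 20)
total = sum_over(price_dictionary, price_forward, hits)
```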

In Sensei-BA we had three kinds of columns. If documents are sorted by some column, that column can be represented as a set of unique values (the dictionary) along with the start and end docid for each value. This data structure serves as the inverted index to speed up queries, and we can also quickly look up the value for a docid by doing a binary search over those ranges.

If the column is not sorted, we just create the forward index, but instead of an int array we use a more compact data structure. Let's say we have 1000 distinct values in the column; in this case it takes 10 bits to reference a dictionary value from the forward index.

By default we did not create an inverted index for a non-sorted column, so a query that contains only non-sorted columns would need a full scan. It was possible to specify in the config that one should be created, though. To store the inverted index in memory we leveraged a modified P4Delta algorithm similar to https://github.com/linkedin-sna/sna-page/blob/master/kamikaze/index.php.
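The sorted-column and bit-packed layouts described above can be sketched roughly as follows. This is not the Sensei-BA code: the class names are invented for the example, and a single Python integer stands in for the packed bit buffer, where a real implementation would more likely use an array of longs.

```python
from bisect import bisect_left, bisect_right


class SortedColumn:
    """Column of a dataset whose documents are sorted by this column."""

    def __init__(self, sorted_values):
        self.dictionary = []  # unique values, in document order (sorted)
        self.starts = []      # first docid of the run for each value
        self.num_docs = len(sorted_values)
        for docid, value in enumerate(sorted_values):
            if not self.dictionary or self.dictionary[-1] != value:
                self.dictionary.append(value)
                self.starts.append(docid)

    def docid_range(self, value):
        """Inverted lookup: inclusive (start, end) docids, or None."""
        i = bisect_left(self.dictionary, value)
        if i == len(self.dictionary) or self.dictionary[i] != value:
            return None
        is_last = i + 1 == len(self.starts)
        end = self.num_docs - 1 if is_last else self.starts[i + 1] - 1
        return (self.starts[i], end)

    def value_at(self, docid):
        """Forward lookup: binary search over the run start positions."""
        return self.dictionary[bisect_right(self.starts, docid) - 1]


class PackedForwardIndex:
    """Forward index that stores each dictionary id in a fixed bit width."""

    def __init__(self, dict_ids, num_distinct):
        # Bits needed for ids 0..num_distinct-1 (10 bits for 1000 values).
        self.bits = max(1, (num_distinct - 1).bit_length())
        self.packed = 0  # one big integer used as the bit buffer
        for docid, dict_id in enumerate(dict_ids):
            self.packed |= dict_id << (docid * self.bits)

    def get(self, docid):
        mask = (1 << self.bits) - 1
        return (self.packed >> (docid * self.bits)) & mask


column = SortedColumn(["a", "a", "b", "c", "c", "c"])
packed = PackedForwardIndex([999, 0, 512, 7], num_distinct=1000)
```

With 10 bits per entry instead of 32, the forward index for such a column takes a little under a third of the space of a plain int array, at the cost of a shift and a mask per read.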

The third column format was used for secondary sorted columns. It was an extension of the sorted column format.

Here is some more info: http://www.slideshare.net/VolodymyrZhabiuk/index-types

With many thanks,

Volodymyr

Lerchmo

Sep 6, 2013, 1:04:21 PM

to sensei...@googlegroups.com

Ah, P4Delta is very cool, thanks!
