Index update performance

Picky / Florian Hanke

Nov 13, 2011, 5:43:42 AM

to picky...@googlegroups.com

Hi Pickynauts!

Today we did some fun testing. Basically, Florian threw some objects at me and I tried to eat (index) them as fast as I could. The results are here:

http://florianhanke.com/blog/2011/11/13/picky-update-performance.html

It was a real good workout. I'm quite exhausted though (was only able to use one processor), but totally happy. You should check it out. It's some nice numbers for a change!

I'm sure we'll ten-tackle searching speed real soon :D Ahoi landlubberrrrrs!

Picky

Michael Below

Nov 13, 2011, 11:51:01 AM

to picky...@googlegroups.com

Hi,
Regarding performance: I have wondered if there is a more efficient way to feed my tags to Picky. They come in an array, and just feeding that array to Picky led to Picky words like: ["word1",

Now I am using .join(" ") to turn the array into a string, and then picky takes that string apart again.

Since I am indexing off-line it probably doesn't matter, but I have been wondering if there is a more efficient way to do this...
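For anyone reading along, here is a small sketch of where words like ["word1", probably come from. It assumes the tokenizer turns its input into a string with to_s and then splits on whitespace; that is a guess, not something taken from Picky's source. naive_words is a made-up helper, only for illustration.

```ruby
# Hypothetical stand-in for a tokenizer that stringifies its input and
# splits on whitespace. This is an assumption, not Picky's actual code.
def naive_words(input)
  input.to_s.split(/\s+/)
end

tags = ["word1", "word2"]

# Array#to_s returns the inspect form, so brackets, quotes and commas
# end up inside the resulting words.
p naive_words(tags)

# Joining first yields a plain string that splits back into the tags.
p naive_words(tags.join(" "))
```

So the join workaround costs one string build plus one split per object, which should be negligible for off-line indexing.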

Cheers

Michael

Picky / Florian Hanke

Nov 13, 2011, 7:22:12 PM

to picky...@googlegroups.com, Michael Below

Hi Michael,

You can, for example, write your own tokenizer and pass it to the indexing method (for the whole index) or to the tokenizer option (for a single category):

https://github.com/floere/picky/blob/master/server/test_project_sinatra/app.rb#L302-319

A simple test for it is here:

https://github.com/floere/picky/blob/master/server/test_project_sinatra/spec/integration_spec.rb#L293

If you don't want a whole new tokenizer like in the test, just create a new tokenizer instance:

tokenizer = Picky::Tokenizer.new stopwords: /.../ # etc.

More options are described under the indexing options: http://florianhanke.com/picky/documentation.html#indexes-indexing

If you'd like the tokenizer to be able to handle Arrays, override preprocess, which is called in the tokenize method (https://github.com/floere/picky/blob/master/server/lib/picky/tokenizer.rb#L174-181), and change the split method so that it can handle arrays as well:

class << tokenizer
  # Does essentially nothing anymore. This also skips character
  # substitution, illegal character removal and stopword removal.
  def preprocess array
    array
  end

  # Already split.
  def split array
    array
  end
end

Then, pass this into the index:

index = Picky::Index.new ... do
  indexing tokenizer # Use your custom tokenizer that handles arrays.
  category :text, tokenizer: tokenizer # Use your custom tokenizer just on a single category.
end
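To try the singleton override without a Picky installation, here is a self-contained sketch. SketchTokenizer and array_tokenizer are made-up names; the class is only a stand-in that mimics a preprocess and split pipeline, not Picky's tokenizer, so treat it as an illustration of the class << object technique and nothing more.

```ruby
# Hypothetical stand-in for a tokenizer; NOT Picky's class. It only mimics
# a preprocess -> split pipeline for illustration.
class SketchTokenizer
  def preprocess(text)
    text.downcase.gsub(/[^a-z0-9\s]/, "")
  end

  def split(text)
    text.split(/\s+/)
  end

  def tokenize(input)
    split(preprocess(input))
  end
end

# Returns a tokenizer whose singleton class passes arrays straight through.
def array_tokenizer
  tokenizer = SketchTokenizer.new
  class << tokenizer
    def preprocess(array)
      array # Skip all string cleanup.
    end

    def split(array)
      array # Already split.
    end
  end
  tokenizer
end

p SketchTokenizer.new.tokenize("Ruby, Search!")
p array_tokenizer.tokenize(["Ruby", "Search"])
```

Because the methods are defined on the singleton class, only that one instance changes; other tokenizers keep the normal string handling.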

Or rewrite the preprocess step to be able to handle the Array:

def preprocess array
  array.collect! do |element|
    element = substitute_characters element
    remove_illegals element
    remove_non_single_stopwords element
    element # Return the element itself, whatever the helpers return.
  end
end

(Please note that I wrote this from memory.)
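One thing to watch when preprocessing element by element: the value of the block's last expression becomes the new array element. The sketch below uses made-up helpers (strip_illegals!, strip_stopwords!, preprocess_each) that modify strings in place with gsub!, which returns nil when nothing was replaced. Whether Picky's own helpers behave like this is an assumption, so check the tokenizer source before relying on it.

```ruby
# Hypothetical stand-ins for cleanup helpers that modify the string in
# place. Whether Picky's helpers behave this way is an assumption.
def strip_illegals!(text)
  text.gsub!(/[^\w\s]/, "") # Returns nil when nothing was replaced.
end

def strip_stopwords!(text)
  text.gsub!(/\b(and|the)\b/i, "")
end

def preprocess_each(array)
  array.collect do |element|
    element = element.dup # Leave the caller's strings untouched.
    strip_illegals! element
    strip_stopwords! element
    element.strip # The last expression becomes the new element.
  end
end

p preprocess_each(["ruby!", "the search", "clean"])
```

If the block ended with one of the in-place helpers instead, unchanged elements would turn into nil.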

Does this help?

Cheers,

Florian
