Index update performance


Picky / Florian Hanke

Nov 13, 2011, 5:43:42 AM
to picky...@googlegroups.com
Hi Pickynauts!

Today we did some fun testing. Basically, Florian threw some objects at me and I tried to eat (index) them as fast as I could. The results are here:

http://florianhanke.com/blog/2011/11/13/picky-update-performance.html

It was a real good workout. I'm quite exhausted though (was only able to use one processor), but totally happy. You should check it out. It's some nice numbers for a change!

I'm sure we'll ten-tackle searching speed real soon :D Ahoi landlubberrrrrs!
   Picky

Michael Below

Nov 13, 2011, 11:51:01 AM
to picky...@googlegroups.com
Hi,
Regarding performance: I have wondered if there is a more efficient way to feed my tags to Picky. They come in an array, and just feeding that to Picky led to Picky words like: ["word1",

Now I am using .join(" ") to turn the array into a string, and then picky takes that string apart again.

Since I am indexing off-line it probably doesn't matter, but I have been wondering if there is a more efficient way to do this...
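
To make the round trip concrete, here is a minimal sketch in plain Ruby. It does not use Picky at all; the tag values are made up, and the whitespace split only stands in for what a tokenizer that splits on whitespace would do with the joined string.

```ruby
# Made-up example tags; the second one contains a space.
tags = ["ruby", "search engine", "picky"]

# The workaround: join into one string, which a whitespace-splitting
# tokenizer (assumed here) then takes apart again.
round_trip = tags.join(" ").split(/\s+/)

# The round trip costs an extra join and split per object, and it also
# breaks up tags that contain spaces.
puts round_trip.inspect # "search engine" is now two separate tokens
puts tags.inspect       # the original three tags
```

So besides the extra work, the join can change which tokens end up in the index if a tag contains whitespace.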

Cheers

Michael

Picky / Florian Hanke

Nov 13, 2011, 7:22:12 PM
to picky...@googlegroups.com, Michael Below
Hi Michael,

You can write your own tokenizer, for example, and pass it to the indexing method (for the whole index) or to the tokenizer option (for a single category).
A simple test for it is here:

If you don't want a whole new tokenizer class like in the test, just create a new tokenizer instance:
tokenizer = Picky::Tokenizer.new stopwords: /.../ # etc., more options (see indexing options, http://florianhanke.com/picky/documentation.html#indexes-indexing)

If you'd like the tokenizer to be able to handle Arrays, override preprocess, which is called from the tokenize method (https://github.com/floere/picky/blob/master/server/lib/picky/tokenizer.rb#L174-181), and change the split method so that it can handle arrays as well:

class << tokenizer
  def preprocess array
    array # Does essentially nothing anymore. This will also jump over character substitution, illegal character removal and stopword removal.
  end
  def split array
    array # Already split. 
  end
end
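
The override above relies on Ruby's `class << object` syntax, which redefines methods on that one object only and leaves other tokenizers untouched. A minimal sketch of the mechanism, using a hypothetical `MiniTokenizer` stand-in rather than Picky::Tokenizer (the real tokenize pipeline has more steps than the two shown here):

```ruby
# Hypothetical stand-in for a tokenizer; not Picky's actual class.
class MiniTokenizer
  def tokenize text
    split preprocess(text)
  end

  def preprocess text
    text.downcase
  end

  def split text
    text.split(/\s+/)
  end
end

string_tokenizer = MiniTokenizer.new
array_tokenizer  = MiniTokenizer.new

# Override both steps on this one instance only.
class << array_tokenizer
  def preprocess array
    array # No cleanup: the tags are used as given.
  end

  def split array
    array # Already split.
  end
end

p string_tokenizer.tokenize("Ruby Search")             # => ["ruby", "search"]
p array_tokenizer.tokenize(["Ruby", "Search Engine"])  # => ["Ruby", "Search Engine"]
```

Note that the array version skips the cleanup entirely, so "Ruby" keeps its capital letter here, just as the real override would skip character substitution and stopword removal.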

Then pass the tokenizer into the index:

index = Picky::Index.new ... do
  indexing tokenizer # Use your custom tokenizer that handles arrays.
  category :text, tokenizer: tokenizer # Use your custom tokenizer just on a single category.
end

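Or rewrite the preprocess step so that it handles the Array element by element:
    def preprocess array
      # Chain the steps so each one works on the previous result (assumes each helper returns the processed text).
      array.collect! { |element| remove_non_single_stopwords remove_illegals(substitute_characters(element)) }
    end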
(Please note that I wrote this by heart)
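
For trying out the per-element idea without Picky, here is a rough sketch. The class `ArrayCleaningTokenizer` and the stopword list are hypothetical, and the helper bodies are simplified assumptions that only borrow names from the snippet above; Picky's own helpers are configured through the tokenizer options and behave differently in detail.

```ruby
# Hypothetical stand-in; helper bodies are simplified assumptions,
# not Picky's implementation.
class ArrayCleaningTokenizer
  STOPWORDS = /\b(and|the|of)\b/i

  # Replace a few umlauts with plain ASCII letters.
  def substitute_characters text
    text.tr("\u00e4\u00f6\u00fc", "aou")
  end

  # Drop everything that is not a word character or whitespace.
  def remove_illegals text
    text.gsub(/[^\w\s]/, "")
  end

  # Remove stopwords and tidy up the leftover spaces.
  def remove_stopwords text
    text.gsub(STOPWORDS, "").squeeze(" ").strip
  end

  # Run every array element through the full chain.
  def preprocess array
    array.collect do |element|
      remove_stopwords remove_illegals(substitute_characters(element))
    end
  end
end

tokenizer = ArrayCleaningTokenizer.new
p tokenizer.preprocess(["Z\u00fcrich!", "Lord of the Rings"])
# => ["Zurich", "Lord Rings"]
```

Each element is still treated as one token afterwards, so multi-word tags stay intact while getting the same cleanup as ordinary text.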

Does this help?

Cheers,
   Florian