Indexing options


Michael Below

Nov 18, 2011, 8:18:32 AM
to picky...@googlegroups.com
Hi,

today I tried to set some indexing options:

PagesIndex = Picky::Index.new(:pages) do
  indexing stopwords: /\b(and|or|in|on|is|has|und|oder|auf|ist|hat|wird|der|die|das|dem|sich|dann|des|den|werden|auf|ich|er|sie|es|für|gegen|ein|eine|einen|eines|ob|zu|zur|zum|für|sein|ihr|dass|von|vom|vor|bei|also|nur|um|nicht|nein|ja|wir|am|an|haben)\b/i,
           removes_characters: /[^\w\s\"\~\*\:\,]/

The first line results in an indexing error I don't understand, see
below. Any ideas what I am doing wrong this time :-) ?

Cheers

Michael

$ rake index
Loaded picky with environment 'development' in /home/mbelow/html/judiz
on Ruby 1.9.3.
:public is no longer used to avoid overloading Module#public,
use :public_folder instead
from /home/mbelow/html/judiz/app.rb:80:in `<class:CommentSearch>'
Application loaded.
14:12:33: Indexing using 4 processors, in random order.
14:12:33: "development:pages": Starting parallel data preparation.
/home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/tokenizer.rb:70:in `gsub!': can't modify frozen String (RuntimeError)
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/tokenizer.rb:70:in `remove_illegals'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/tokenizer.rb:192:in `preprocess'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/tokenizer.rb:175:in `tokenize'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/indexers/parallel.rb:48:in `block (2 levels) in process'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/indexers/parallel.rb:47:in `each'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/indexers/parallel.rb:47:in `block in process'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/indexers/parallel.rb:40:in `each'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/indexers/parallel.rb:40:in `process'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/indexers/base.rb:23:in `index'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/index_indexing.rb:78:in `index_in_parallel'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/index_indexing.rb:27:in `index'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/cores.rb:53:in `call'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/cores.rb:53:in `block (2 levels) in forked'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/cores.rb:51:in `fork'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/cores.rb:51:in `block in forked'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/cores.rb:41:in `loop'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/cores.rb:41:in `forked'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/picky/indexes_indexing.rb:30:in `index'
from (__DELEGATION__):2:in `index'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/gems/picky-3.4.3/lib/tasks/index.rake:10:in `block in <top (required)>'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/task.rb:205:in `call'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/task.rb:205:in `block in execute'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/task.rb:200:in `each'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/task.rb:200:in `execute'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/task.rb:158:in `block in invoke_with_call_chain'
from /home/mbelow/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/monitor.rb:211:in `mon_synchronize'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/task.rb:151:in `invoke_with_call_chain'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/task.rb:144:in `invoke'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/application.rb:116:in `invoke_task'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/application.rb:94:in `block (2 levels) in top_level'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/application.rb:94:in `each'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/application.rb:94:in `block in top_level'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/application.rb:133:in `standard_exception_handling'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/application.rb:88:in `top_level'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/application.rb:66:in `block in run'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/application.rb:133:in `standard_exception_handling'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/lib/rake/application.rb:63:in `run'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0@global/gems/rake-0.9.2.2/bin/rake:33:in `<top (required)>'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/bin/rake:19:in `load'
from /home/mbelow/.rvm/gems/ruby-1.9.3-p0/bin/rake:19:in `<main>'
14:12:33: Indexing finished.


--
Michael Below <be...@judiz.de>

Michael Below

Nov 18, 2011, 8:47:03 AM
to picky...@googlegroups.com
Just saw that there have been a couple new versions: the same happens
with 3.5.4 and a minimal list:

PagesIndex = Picky::Index.new(:pages) do
indexing stopwords: /\b(and|or|in|on|is|has)\b/i

Cheers

Michael

--
Michael Below <be...@judiz.de>

Picky / Florian Hanke

Nov 18, 2011, 10:07:49 AM
to picky...@googlegroups.com
Hi Michael,

I might have some ideas. Picky is pretty destructive with the data you give it. That is, it changes the strings you pass into it.

This, or a very similar problem, is noted as an issue on GitHub: https://github.com/floere/picky/issues/39

Now, since Picky does not freeze Strings (well, some it does, but not the ones passed in), this probably means that Nanoc calls freeze on its strings, which is perhaps a good idea in this case.
Why it only occurs now is a bit strange. Could it be that you only started using stopwords etc. now? Probably not, right?
Could it be that you're using a new version of Nanoc?

In any case, it's probably a good idea to call #dup on any data before Picky uses it, since dup does not copy the frozen state of a String: http://railsblogger.blogspot.com/2009/03/ruby-dup-vs-clone.html

Is that an ok solution for you? I'm afraid the resolution of issue 39 will take some time and thought.
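
To illustrate the dup/clone difference mentioned above, here is a minimal plain-Ruby sketch. It does not use Picky or Nanoc; the sample string and variable names are made up for illustration, and the in-place gsub! only stands in for the kind of destructive call seen in the stack trace.

```ruby
# A frozen string, standing in for data handed over by another library.
original = "Suche in Seiten".freeze

copy  = original.dup    # dup does not carry over the frozen state
clone = original.clone  # clone keeps the frozen state

# In-place modification works on the dup ...
copy.gsub!(/\bin\b/i, "")

# ... but raises on the clone (RuntimeError on Ruby 1.9,
# FrozenError, a RuntimeError subclass, on newer Rubies).
begin
  clone.gsub!(/\bin\b/i, "")
  clone_error = nil
rescue RuntimeError => e
  clone_error = e
end

puts copy.frozen?        # false
puts clone.frozen?       # true
puts clone_error.class
```

The original string is left untouched either way, so duplicating before indexing also protects the source data from destructive processing.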

Cheers,
   Florian

Michael Below

Nov 18, 2011, 10:34:54 AM
to picky...@googlegroups.com
On Friday, 18.11.2011, at 14:47 +0100, Michael Below wrote:
> Just saw that there have been a couple new versions: the same happens
> with 3.5.4 and a minimal list:
> PagesIndex = Picky::Index.new(:pages) do
> indexing stopwords: /\b(and|or|in|on|is|has)\b/i

I found the problem: The nanoc objects I was passing to picky are
frozen. Adding a .dup to every returned nanoc object helped.
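
A rough sketch of what such a workaround might look like, in plain Ruby. The Page struct and the thawed helper are hypothetical stand-ins, not nanoc or Picky API. One thing to keep in mind: dup is shallow, so duplicating a container object does not unfreeze the strings inside it; the string values themselves need the dup.

```ruby
# Hypothetical stand-in for a frozen item coming from the site generator.
Page = Struct.new(:id, :title, :body)

frozen_page = Page.new(1, "Impressum".freeze, "Angaben in Kurzform".freeze).freeze

# dup on the container is shallow: the copy is unfrozen, its strings are not.
shallow = frozen_page.dup
puts shallow.frozen?       # false
puts shallow.body.frozen?  # true

# Hypothetical helper: build a copy whose string fields are duplicated too.
def thawed(page)
  Page.new(page.id, page.title.dup, page.body.dup)
end

safe = thawed(frozen_page)
safe.body.gsub!(/\bin\b/i, "")  # in-place edits no longer raise
puts safe.body.frozen?          # false
```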

Now I am wondering why the stopwords don't seem to help. E.g. I have
defined "in" as a stopword, but there is a huge entry for it in the
index. What could cause that? Is there something wrong in my stopword
list (as posted three mails ago)?

Best

Michael Below

Nov 18, 2011, 10:39:59 AM
to picky...@googlegroups.com
Hi Florian,

On Friday, 18.11.2011, at 07:07 -0800, Picky / Florian Hanke wrote:

> Why it only occurs now is a bit strange. Could it be that you only started
> using stopwords etc. now? Probably not, right?

Yes, that's it, I started using stopwords only some days ago...

> In any case, it's probably a good idea to call #dup on any data before
> Picky uses it, since dup does not copy the frozen state of a String:
> http://railsblogger.blogspot.com/2009/03/ruby-dup-vs-clone.html

Yes, that works perfectly.

Thanks

Picky / Florian Hanke

Nov 18, 2011, 10:46:22 AM
to picky...@googlegroups.com
Hi Michael,

Good to hear!
It seems your stopword list is fine. Picky has a bit of a peculiar behavior regarding stopwords. If the data only contains the stopword, it will not remove it.
Could it be that you have category data where there is only "in"? (And if yes… I am unsure if the Picky behavior is a good one for indexing)
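
The behavior described above can be sketched in plain Ruby roughly as follows. This is not Picky's actual tokenizer code; strip_stopwords and the STOPWORDS constant are hypothetical names, and the sketch only mirrors the rule "remove stopwords unless nothing else would be left".

```ruby
STOPWORDS = /\b(and|or|in|on|is|has)\b/i

# Hypothetical helper: removes stopwords, but keeps the text as-is
# when it consists of nothing but stopwords.
def strip_stopwords(text, stopwords = STOPWORDS)
  stripped = text.gsub(stopwords, "").squeeze(" ").strip
  stripped.empty? ? text : stripped
end

puts strip_stopwords("Berlin in Germany")
puts strip_stopwords("in")
```

Under this rule, a category value that is just "in" would still end up in the index even though "in" is listed as a stopword.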

In any case, if in your Search object you define your searching as follows
  Picky::Search.new your_index do
    searching stopwords: /\b(in)\b/i
  end
then "in" will be removed from queries (unless you entered only "in", in which case it will survive).
This solves the problem, at the cost of keeping superfluous "in"s in your index.

Cheers,
   Florian