issue 1 & 3

netawater

unread,

Jun 23, 2009, 3:09:24 AM6/23/09

to montezuma-dev

I have taken the time to try to fix them. here is my achievement,
please give your comment.

issue 1, standard tokenizer hangs on some input

As Edi Weitz pointed out, the culprit is the complex regular
expression(method, token-regexp) in standard-tokenizer.lisp, and I
have reduced the problem into a simple case:

CL-USER> (cl-ppcre:scan
(cl-ppcre:create-scanner
"(_\\w+)*\\@\\w+")
"_______________________________________"
:start 0)
; Evaluation aborted.

I speculate that '\w' includes underscore in regular expression would
account for this bug. and replace with other character of '\w' cause
it too.

CL-USER> (cl-ppcre:scan (cl-ppcre:create-scanner
"(a\\w+)*\\@\\w+")
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
:start 0)
; Evaluation aborted.

cl-ppcre is a perl-compatible regular expressions library, I should
check it in Perl. Maybe perl is more efficient in regular expression
operation, I raise the number of underscores, but it is OK.

$str = "john._______________________________________
__________________________________";

if ($str =~ m/(_*\w+)*\@\w+/)
{
print "ok\n";
}

To conclude it isn't montezuma's bug but cl-ppcre.

issue 3, broken :must-not-occur or phrase query

I found query "html-template !\"edi weitz\"" is OK in my test corpus,
but if I tried query "html-template !edi !test", it tell me:
"Invalid initialization argument: SCORER in call for class #<STANDARD-
CLASS DISJUNCTION-SUM-SCORER>.".

It is obvious that there is not slot named scorer in disjunction-sum-
scorer class, it maybe a typo, scorer should be substituted by sub-
scorers. I modified it at line 199 of boolean-scorer.lisp, It is OK
and all unit test are passed too.

Leslie P. Polzer

unread,

Jun 23, 2009, 3:42:27 AM6/23/09

to montez...@googlegroups.com

netawater wrote:

> To conclude it isn't montezuma's bug but cl-ppcre.

We should report it to the cl-ppcre guys then.

> issue 3, broken :must-not-occur or phrase query
>
> I found query "html-template !\"edi weitz\"" is OK in my test corpus,
> but if I tried query "html-template !edi !test", it tell me:
> "Invalid initialization argument: SCORER in call for class #<STANDARD-
> CLASS DISJUNCTION-SUM-SCORER>.".
>
> It is obvious that there is not slot named scorer in disjunction-sum-
> scorer class, it maybe a typo, scorer should be substituted by sub-
> scorers. I modified it at line 199 of boolean-scorer.lisp, It is OK
> and all unit test are passed too.

Excellent, can you send a patch in diff format?

Leslie

netawater

unread,

Jun 24, 2009, 4:18:01 AM6/24/09

to montezuma-dev

I have uploaded it: http://groups.google.com/group/montezuma-dev/web/issue1.diff.
It is just a simple substitute, :).

I also report the first one into cl-ppcre Google group.

On Jun 23, 3:42 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

Leslie P. Polzer

unread,

Jun 24, 2009, 4:55:35 AM6/24/09

to montezuma-dev

On Jun 24, 10:18 am, netawater <netawa...@gmail.com> wrote:
> I have uploaded it:http://groups.google.com/group/montezuma-dev/web/issue1.diff.

Thanks! Could you add a test case for this bug as well?

Also, please don't put parens on a single line by themselves.

> It is just a simple substitute, :).

Patches are more convenient and prevent typos, that's why I asked for
it.

Leslie

netawater

unread,

Jun 25, 2009, 2:00:30 AM6/25/09

to montezuma-dev

I have upload the test case http://groups.google.com/group/montezuma-dev/web/typo.lisp
and amend the patch http://groups.google.com/group/montezuma-dev/web/typo.diff.

It is a shame that I found there is a test case related issue 1, and
issue 1 is still NG, :(. I'm studying it now.

It seems I can not new post in cl-ppcre google group and I post my
report into cl-ppcre mail list and looking forward to they response.

On Jun 24, 4:55 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

Leslie P. Polzer

unread,

Jun 26, 2009, 8:00:05 AM6/26/09

to montez...@googlegroups.com

netawater wrote:
>
> I have upload the test case http://groups.google.com/group/montezuma-dev/web/typo.lisp
> and amend the patch http://groups.google.com/group/montezuma-dev/web/typo.diff.

Thanks, but there are still minor problems:

* you have named the test case and files "typo". Why?
That does not seem to be an accurate description of what it does.

* the new file needs to be added to the list of test files
in the ASD.

* please send me a full Git-style patch so I can just say
patch < fix-subscorer.diff
I think "svn diff" will generate a proper patch.

> It seems I can not new post in cl-ppcre google group and I post my
> report into cl-ppcre mail list and looking forward to they response.

I have also answered there.

Independently from that, how about adding a custom tokenizer
that uses a state machine to do the work?

Leslie

--
http://www.linkedin.com/in/polzer

netawater

unread,

Jun 27, 2009, 8:42:03 AM6/27/09

to montezuma-dev

On Jun 26, 8:00 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

> netawater wrote:
>
> > I have upload the test casehttp://groups.google.com/group/montezuma-dev/web/typo.lisp

> > and amend the patchhttp://groups.google.com/group/montezuma-dev/web/typo.diff.

>
> Thanks, but there are still minor problems:
>
> * you have named the test case and files "typo". Why?
> That does not seem to be an accurate description of what it does.

I use typo for I suppose I fix a typo problem. I thought I fix issue
1, but it is not, so I rename it. Which name do you recommend?

>
> * the new file needs to be added to the list of test files
> in the ASD.
>
> * please send me a full Git-style patch so I can just say
> patch < fix-subscorer.diff
> I think "svn diff" will generate a proper patch.

but I truly use svn diff to produce the diff file. I have used svn
checkout a read only version for I'm not a developer of it.

Leslie P. Polzer

unread,

Jun 28, 2009, 3:09:39 AM6/28/09

to montez...@googlegroups.com

netawater wrote:

> I use typo for I suppose I fix a typo problem.

A typo problem would be a missing character or two,
most frequently a wrong or missing paren.

How about boolean-subscorer.diff for the patch name?

As for the test name I suppose it should be integrated
into some other existing part of the test suite, but
I don't have time to look for this part right now.

Maybe you will find one. If not decide on a good name
that takes into accout the larger structure (e.g.
scorer.lisp)

> but I truly use svn diff to produce the diff file. I have used svn
> checkout a read only version for I'm not a developer of it.

Ah, perhaps you forgot to "svn add" the new file before generating
the diff?

>> Independently from that, how about adding a custom tokenizer
>> that uses a state machine to do the work?

I think you forgot to answer that question...

Leslie

netawater

unread,

Jun 28, 2009, 11:41:24 PM6/28/09

to montezuma-dev

On Jun 28, 3:09 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

> netawater wrote:
> > I use typo for I suppose I fix a typo problem.
>
> A typo problem would be a missing character or two,
> most frequently a wrong or missing paren.
>
> How about boolean-subscorer.diff for the patch name?
>
> As for the test name I suppose it should be integrated
> into some other existing part of the test suite, but
> I don't have time to look for this part right now.
>
> Maybe you will find one. If not decide on a good name
> that takes into accout the larger structure (e.g.
> scorer.lisp)
>
> > but I truly use svn diff to produce the diff file. I have used svn
> > checkout a read only version for I'm not a developer of it.
>
> Ah, perhaps you forgot to "svn add" the new file before generating
> the diff?

I think I'm beginning to understand you mean.
1. modify boolean-scorer.lisp.
2. add a test case named tc-boolean-subscorer.lisp into tests/unit/
search/
3. add tc-boolean-subscorer.lisp into montezuma.asd
4. svn diff > boolean-subscorer.diff
Is it correct? :)

>
> >> Independently from that, how about adding a custom tokenizer
> >> that uses a state machine to do the work?
>
> I think you forgot to answer that question...

Do you mean we should divide the complex regular expression into many
parts, and one part is a state, then conceive a state machine
algorithm to do tokenizer work?

The _xxx@xxx regular expression should be reserved?

Thanks for your guide? :)
>
> Leslie

Leslie P. Polzer

unread,

Jul 1, 2009, 2:20:11 PM7/1/09

to montez...@googlegroups.com

netawater wrote:

> I think I'm beginning to understand you mean.
> 1. modify boolean-scorer.lisp.
> 2. add a test case named tc-boolean-subscorer.lisp into tests/unit/ search/
> 3. add tc-boolean-subscorer.lisp into montezuma.asd
> 4. svn diff > boolean-subscorer.diff
> Is it correct? :)

Yes, very good! But don't forget "svn add" before diffing.

> Do you mean we should divide the complex regular expression
> into many parts, and one part is a state, then conceive a
> state machine algorithm to do tokenizer work?

Yes. But maybe it will be enough to rewrite the tokenizer
regexp from scratch with proper documentation.

> The _xxx@xxx regular expression should be reserved?

I think we should get rid of it entirely.

netawater

unread,

Jul 3, 2009, 3:29:35 AM7/3/09

to montezuma-dev

On Jul 2, 2:20 am, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

> netawater wrote:
> > I think I'm beginning to understand you mean.
> > 1. modify boolean-scorer.lisp.
> > 2. add a test case named tc-boolean-subscorer.lisp into tests/unit/ search/
> > 3. add tc-boolean-subscorer.lisp into montezuma.asd
> > 4. svn diff > boolean-subscorer.diff
> > Is it correct? :)
>
> Yes, very good! But don't forget "svn add" before diffing.

I have uploaded the diff:
http://montezuma-dev.googlegroups.com/web/boolean-subscorer.diff?gda=-BIZKUgAAAAHcebzgL8O7LT9-wIq0tDqRi_4IpZrQFloDlBWJ2R4QVHkLHU_ZGJOFr8BwKYhqN1ldlWJbCLdb1KzwG_kFsLMGjVgdwNi-BwrUzBGT2hOzg

Please check it.

>
> > Do you mean we should divide the complex regular expression
> > into many parts, and one part is a state, then conceive a
> > state machine algorithm to do tokenizer work?
>
> Yes. But maybe it will be enough to rewrite the tokenizer
> regexp from scratch with proper documentation.
>
> > The _xxx@xxx regular expression should be reserved?
>
> I think we should get rid of it entirely.

I got it and I will try to do this work.

Besides, I have found the cause of issue1. the bug occur when scorer
score result documents.To go further, it is the phrase scorer respond
to phrase term of second boolean clause. Its weight is not normalized.
We regard the second boolean clause is prohibited so not to normalize
it.

I don't full understand prohibit, Please give your comment. Thank you!

Leslie P. Polzer

unread,

Jul 4, 2009, 3:52:40 AM7/4/09

to montez...@googlegroups.com

netawater wrote:

> I have uploaded the diff:
> http://montezuma-dev.googlegroups.com/web/boolean-subscorer.diff?gda=-BIZKUgAAAAHcebzgL8O7LT9-wIq0tDqRi_4IpZrQFloDlBWJ2R4QVHkLHU_ZGJOFr8BwKYhqN1ldlWJbCLdb1KzwG_kFsLMGjVgdwNi-BwrUzBGT2hOzg
>
> Please check it.

Almost: the fixture name still says 'typo'.

+(deftestfixture typo

> Besides, I have found the cause of issue1. the bug occur when scorer
> score result documents.To go further, it is the phrase scorer respond
> to phrase term of second boolean clause. Its weight is not normalized.
> We regard the second boolean clause is prohibited so not to normalize
> it.

So why exactly do we get a NIL instead of a proper weight (which
should be an integer I guess)? 1 seems to be a sensible default
assuming the weights are multiplied with each other.

In any case we either need to assign a default weight or make
the phrase scorer handle NIL weights properly.

netawater

unread,

Jul 5, 2009, 3:23:36 AM7/5/09

to montezuma-dev

On Jul 4, 3:52 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

> netawater wrote:
> > I have uploaded the diff:

> >http://montezuma-dev.googlegroups.com/web/boolean-subscorer.diff?gda=...

>
> > Please check it.
>
> Almost: the fixture name still says 'typo'.
>
> +(deftestfixture typo

I have fixed it.

>
> > Besides, I have found the cause of issue1. the bug occur when scorer
> > score result documents.To go further, it is the phrase scorer respond
> > to phrase term of second boolean clause. Its weight is not normalized.
> > We regard the second boolean clause is prohibited so not to normalize
> > it.
>
> So why exactly do we get a NIL instead of a proper weight (which
> should be an integer I guess)? 1 seems to be a sensible default
> assuming the weights are multiplied with each other.
>
> In any case we either need to assign a default weight or make
> the phrase scorer handle NIL weights properly.

Yes, it is because query-norm is not bounded but I suppose the weight
should calculated by normalized-weight, if we just give it a default
value, is it tumble the whole algorithm?

Thank you very much!

Leslie P. Polzer

unread,

Jul 5, 2009, 4:37:47 AM7/5/09

to montezuma-dev

On Jul 5, 9:23 am, netawater <netawa...@gmail.com> wrote:

> I have fixed it.

Applied to latest trunk and issue closed. Thank you very much!

> Yes, it is because query-norm is not bounded but I suppose the weight
> should calculated by normalized-weight, if we just give it a default
> value, is it tumble the whole algorithm?

I'm afraid I must return that question to you.

You need to check how the algorithm uses the weights. If it just
multiplies
them then the multiplicative identity (i.e. 1) is appropriate. If not
then we'll
need to find another solution.

Leslie

netawater

unread,

Jul 8, 2009, 5:27:16 AM7/8/09

to montezuma-dev

On Jul 5, 4:37 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

> On Jul 5, 9:23 am, netawater <netawa...@gmail.com> wrote:
>
> > I have fixed it.
>
> Applied to latest trunk and issue closed. Thank you very much!

But I have mentioned before it is not a fix for issue1, although at
the beginning I suppose it was.

>
> > Yes, it is because query-norm is not bounded but I suppose the weight
> > should calculated by normalized-weight, if we just give it a default
> > value, is it tumble the whole algorithm?
>
> I'm afraid I must return that question to you.
>
> You need to check how the algorithm uses the weights. If it just
> multiplies
> them then the multiplicative identity (i.e. 1) is appropriate. If not
> then we'll
> need to find another solution.
>
> Leslie

I have studied the scorer algorithm and think there are one solution
for this issue:

It regards the second clause as prohibited for it is must-not-occur,
so it does not
compute its weight for efficient, however later it try to score result
doc with it for
interim predicate. We can prevent this issue by compute its weight
anyway.

How do you think about it?

Leslie P. Polzer

unread,

Jul 8, 2009, 5:46:51 AM7/8/09

to montez...@googlegroups.com

netawater wrote:

>> Applied to latest trunk and issue closed. Thank you very much!
> But I have mentioned before it is not a fix for issue1, although at
> the beginning I suppose it was.

Sorry, I got that wrong. Issue re-opened.

> I have studied the scorer algorithm and think there are one solution
> for this issue:
>
> It regards the second clause as prohibited for it is must-not-occur,
> so it does not
> compute its weight for efficient, however later it try to score result
> doc with it for
> interim predicate. We can prevent this issue by compute its weight
> anyway.
>
> How do you think about it?

It seems to me that prohibited clauses shouldn't be weightless
as they're just a negation of a weighted clause.

Is there a good way to get a sensible weight for a prohibited
clause?

Leslie

netawater

unread,

Jul 8, 2009, 9:35:20 AM7/8/09

to montezuma-dev

On Jul 8, 5:46 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

(defmethod normalize-weight ((self boolean-weight) norm)
(let ((query (query self)))
(setf norm (* norm (boost query)))
(dosequence (weight (weights self) :index i)
(let ((clause (elt (clauses query) i)))
(unless (prohibited? clause)
(normalize-weight weight norm))))))

It doesn't compute the weight when it find the clause is prohibited,
so we can invalid this condition.

Leslie P. Polzer

unread,

Jul 8, 2009, 11:18:31 AM7/8/09

to montez...@googlegroups.com

On Wed, Jul 08, 2009 at 06:35:20AM -0700, netawater wrote:

> (defmethod normalize-weight ((self boolean-weight) norm)
> (let ((query (query self)))
> (setf norm (* norm (boost query)))
> (dosequence (weight (weights self) :index i)
> (let ((clause (elt (clauses query) i)))
> (unless (prohibited? clause)
> (normalize-weight weight norm))))))
>
> It doesn't compute the weight when it find the clause is prohibited,
> so we can invalid this condition.

Do the tests still pass then? Does the weight make sense just by
removing that conditional?

netawater

unread,

Jul 8, 2009, 10:35:02 PM7/8/09

to montezuma-dev

On Jul 8, 11:18 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

Yes, none of 2358 tests failed.

The weight must be used, it is why we have this issue and I believe
this trick is only used for efficient.

Leslie P. Polzer

unread,

Jul 9, 2009, 2:52:10 AM7/9/09

to montez...@googlegroups.com

netawater wrote:

>> Do the tests still pass then? Does the weight make sense just by
>> removing that conditional?
>
> Yes, none of 2358 tests failed.
>
> The weight must be used, it is why we have this issue and I believe
> this trick is only used for efficient.

Okay great, thanks a lot for checking this out. :)

Let's add a proper test case for this then and commit it.

Leslie

netawater

unread,

Jul 9, 2009, 11:25:17 PM7/9/09

to montezuma-dev

On Jul 9, 2:52 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

I have upload it and for I can not check in code so this patch
contains the previous one.

I only open a remark test case for issue1, but modify it expectation
result.
word1 should occur and word must not occur is definitely produce null
result.

Please check it.

Leslie P. Polzer

unread,

Jul 10, 2009, 3:57:08 AM7/10/09

to montez...@googlegroups.com

On Thu, Jul 09, 2009 at 08:25:17PM -0700, netawater wrote:

> I have upload it and for I can not check in code so this patch
> contains the previous one.

You need to 'svn revert' your old working directory changes, then
do 'svn update' to get them from the repository. After that you
can work from a clean slate.

> I only open a remark test case for issue1, but modify it expectation
> result.

Can you rephrase that sentence?

netawater

unread,

Jul 10, 2009, 8:42:22 AM7/10/09

to montezuma-dev

On Jul 10, 3:57 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

> On Thu, Jul 09, 2009 at 08:25:17PM -0700, netawater wrote:
> > I have upload it and for I can not check in code so this patch
> > contains the previous one.
>
> You need to 'svn revert' your old working directory changes, then
> do 'svn update' to get them from the repository. After that you
> can work from a clean slate.

Sorry, I forgot you have check in the previous patch. I re-upload the
patch file.

>
> > I only open a remark test case for issue1, but modify it expectation
> > result.
>
> Can you rephrase that sentence?

There is a test case for issue1 already but it was disable by someone.
however I suppose its expectation result is wrong, for its query
require word should occur and must not occur, so the result should be
nil.

Leslie P. Polzer

unread,

Jul 10, 2009, 9:01:22 AM7/10/09

to montez...@googlegroups.com

netawater wrote:

> There is a test case for issue1 already but it was disable by someone.
> however I suppose its expectation result is wrong, for its query
> require word should occur and must not occur, so the result should be
> nil.

Ah. Do you want to include that in your patch, or do you want me
to take a look at it? In the latter case you need to give
me the name and file of the test.

Thanks a lot!

Leslie

netawater

unread,

Jul 10, 2009, 10:06:28 AM7/10/09

to montezuma-dev

On Jul 10, 9:01 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

It is at tc-index-searcher.lisp and my patch have enable it.

Thank you for take a look at it.

Leslie P. Polzer

unread,

Jul 11, 2009, 4:44:30 AM7/11/09

to montezuma-dev

On Jul 10, 4:06 pm, netawater <netawa...@gmail.com> wrote:

> It is at tc-index-searcher.lisp and my patch have enable it.
>
> Thank you for take a look at it.

Verified and committed; issue closed again. Thanks!

netawater

unread,

Jul 16, 2009, 1:44:09 AM7/16/09

to montezuma-dev

I have rearrange the structure of complex regular expression in order
to make it legible, and fix issue3 based on Dean's idea.

Please check it,
http://montezuma-dev.googlegroups.com/web/fix-issue3.diff?gda=3HMpyEEAAAAHcebzgL8O7LT9-wIq0tDquKvYFJ3iXRH9yS7L3p4rTaA6e0SxrKxyI5Z4YwFZ-lBTCT_pCLcFTwcI3Sro5jAzlXFeCn-cdYleF-vtiGpWAA&gsc=chceZgsAAABGQNwde6vBKNhyGx_D02WX

I have added a test case for it.

On Jul 11, 4:44 pm, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

Leslie P. Polzer

unread,

Jul 21, 2009, 12:50:36 PM7/21/09

to montezuma-dev

On Jul 16, 7:44 am, netawater <netawa...@gmail.com> wrote:
> I have rearrange the structure of complex regular expression in order
> to make it legible, and fix issue3 based on Dean's idea.
>

> Please check it,http://montezuma-dev.googlegroups.com/web/fix-issue3.diff?gda=3HMpyEE...

>
> I have added a test case for it.

This patch yields "the variable MONTEZUMA::*INDEX* is unbound."
I suppose you need to setup the index somewhere before...

netawater

unread,

Jul 22, 2009, 9:56:40 PM7/22/09

to montezuma-dev

Very sorry, I have fixed it, please check.

On Jul 22, 12:50 am, "Leslie P. Polzer" <s...@viridian-project.de>
wrote:

Leslie P. Polzer

unread,

Jul 23, 2009, 3:06:07 AM7/23/09

to montezuma-dev

Committed as r414 and issue #3 closed.

I've added TRIVIAL-TIMEOUT to the dependencies of the test system and
wrapped the test case in a WITH-TIMEOUT to ensure we don't clobber the
CPU forever if the test fails.

We have resolved all tickets in the tracker now. Perhaps we should
celebrate that with another release.

Reply all

Reply to author

Forward