AND query on full text is really NEAR query!?


Jasper Bedaux

Mar 1, 2012, 7:06:45 AM
to xtf-...@googlegroups.com

Hello everyone,

After some complaints from our customers, my colleague and I ran a few quick trials and were surprised to find that XTF does not appear to support true AND searches on full text. If you search for two terms that must both appear, it only returns results where the two terms are within the maximum proximity range, which by default is only 20 words apart (roughly one line).

From http://xtf.cdlib.org/documentation/under-the-hood/#QueryOperations, in the paragraphs "AND Query on Text" and "NOT Clause on Text", it appears that this behavior is by design:

"... Thus XTF interprets AND queries on the full text as NEAR queries instead, with the slop factor set to the maximum for that index. ..."

The problem is that users (and probably even developers) do not know about this, so they do not find documents they expect to find or should be finding. For example, if "Amsterdam" appears on one page and "Africa" a few sentences later, XTF will NOT find this document if you search for "amsterdam" and "africa". The document is found only when both terms are within the maximum proximity range.
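
To make this concrete, here is a rough sketch of the two query forms involved, for the "amsterdam" and "africa" example. The `<term>` children, the `<near>` element with its `slop` attribute, and the value 20 (the default range mentioned above) are assumptions for illustration; the exact attributes XTF generates may differ.

```xml
<!-- What the results page suggests is being run: an AND on full text -->
<and field="text">
  <term>amsterdam</term>
  <term>africa</term>
</and>

<!-- What the documentation says it is interpreted as: a NEAR query
     with the slop set to the index maximum (assumed to be 20 here) -->
<near field="text" slop="20">
  <term>amsterdam</term>
  <term>africa</term>
</near>
```

A document with the two terms further apart than the slop matches the first form as written, but not the second.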

In the metadata fields, there appears to be no such proximity limit on AND searches; it applies only to full text.

Before digging deeper, I would like to know whether there is a simple way to change this behavior. I did not see one when looking at the XSLT, but I did not look very closely, so I may have missed something. Or do I have to change the Java code? Also, would changing this mechanism have any side effects (e.g. on ranking)?

If there is no easy way to change this, I think it would be better to change the user interface so that it is clear that two required terms cannot be matched when they are separated by more than the maximum proximity range. That way users will at least understand what they are really searching for, and hopefully why they are not finding documents they know contain both terms.

Also, on the results page, XTF suggests (Search: "amsterdam" and "africa" in ...) that an AND search is performed. I think this is a bit misleading and it would be better to change this to e.g. Search: "amsterdam" near "africa" in ...

I understand that for large collections NEAR queries may give more accurate results than AND queries, and I think using proximity to improve ranking is useful. But I do not think it is good to simply throw away results where the terms are not near each other, especially when XTF is used in scientific research. Ideally, using NEAR instead of AND would be just a suggestion, or maybe even the default, but I think it should also be possible to perform real AND searches.

Regards,
Jasper

Martin Haye

Mar 1, 2012, 11:54:30 AM
to xtf-...@googlegroups.com
Hi Jasper,

We've found it fairly unusual for users to want the behavior you specify, but if it makes sense for your collection and users, by all means go for it.

You should be able to achieve it by changing this line in queryParserCommon.xsl:

<and field="text" maxSnippets="3" maxContext="60">

to this:

<and field="text" maxSnippets="3" useProximity="no">

You could perhaps do something fancier by wrapping both versions in an <or> query, and de-boosting the no-proximity 'and' query.
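
A minimal sketch of that fancier variant might look like the following. It assumes that a `boost` attribute is accepted on the `<and>` element and that the parsed terms appear as `<term>` children; the value 0.5 is arbitrary and only meant to rank no-proximity matches below proximity matches.

```xml
<!-- Sketch only: the boost value and term children are illustrative assumptions -->
<or>
  <!-- Proximity-based match (the existing behavior): keeps its normal score -->
  <and field="text" maxSnippets="3" maxContext="60">
    <term>amsterdam</term>
    <term>africa</term>
  </and>
  <!-- True AND without proximity, de-boosted so these hits rank lower -->
  <and field="text" maxSnippets="3" useProximity="no" boost="0.5">
    <term>amsterdam</term>
    <term>africa</term>
  </and>
</or>
```

Documents where the terms are close together match both branches, while documents where they are far apart match only the de-boosted branch.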

--Martin




Jasper Bedaux

Mar 2, 2012, 3:10:23 AM
to xtf-...@googlegroups.com
Hi Martin,

Thanks for your fast answer!
I took a few minutes to try your suggestion. My first observation is that it works for simple search (search everything = keywords), but not for advanced search (on the full text field). Now that I know what to look for, I will probably be able to make it work for advanced search too. I understand that ranking no longer takes proximity into account unless I implement that myself by following your fancier suggestion, which I will try.

Something else that stopped working: the search results that appear only with the real AND no longer show up in the DynaXML Bookreader viewer.

And from the documentation I found:

"Note that if proximity processing is turned off, individual text hits within document text and meta-data fields will not be highlighted, and scores for matching documents will be somewhat different."

So it appears there are a few drawbacks/problems to solve.

I will look into this further and report back, but it might take a while before I find some time...

Maybe, to avoid breaking existing functionality, I will leave the simple search method unchanged and implement your suggestion as a separate third method in advanced search (alongside the existing OR and NEAR), letting users choose between three options instead of two:

Search for documents containing:

- all of these words close together (previously "All of these words")
- all of these words anywhere (the "new" AND method)
- any of these words (same as old OR method)

And maybe I will just change the interface text "Exclude" to "Not close to" to keep it simple.

I think something like this will describe better how XTF is searching...

You may want to consider changing "all of these words" to "all of these words close together" in XTF's default advanced search form, without changing the underlying functionality, and perhaps also "Exclude" to "Exclude matches close to".
Or maybe I am just nitpicking about this ;-) because of my background in physics/mathematics...

Jasper

Jasper Bedaux

Mar 2, 2012, 7:28:55 AM
to xtf-...@googlegroups.com
Just a small update in case someone else wants to make this modification:

To apply the change to advanced search and to the DynaXML viewer as well (it uses the same template), I just changed the "maxContext" attributes in resultFormatterCommon.xsl, in the Single-field parameter template for the field names 'text' and 'query', into "useProximity" attributes with the value "no". This seems to work fine, so the only remaining drawbacks are those mentioned in the documentation: no highlighting and less advanced scoring (proximity is not taken into account).

Jasper

Jasper Bedaux

Mar 2, 2012, 8:53:56 AM
to xtf-...@googlegroups.com
My apologies, that was a bit too fast: it fails when a proximity is specified, and it gets applied to elements it should not be applied to.
I think it is better to add the following lines to queryParserCommon.xsl:

<!-- Do not use proximity for and-queries on full-text -->
<xsl:if test="$op='and' and (@name='query' or @name='text')">
  <xsl:attribute name="useProximity" select="'no'"/>
</xsl:if>

just before

<!-- Process all the phrases and/or terms -->

Actually, I am quite happy with this solution, because it is now possible to search both ways. With the proximity dropdown set to the maximum value, the search works as before. With it set to the empty value, it works like a true AND search, and the exclude also works as I would expect!

Maybe the ranking could be improved, but I am not sure if this is necessary.

Jasper