kern-2107 proposed solr change

Walsh, Mark

Aug 30, 2011, 3:40:59 AM
to sakai-...@googlegroups.com

Hi,

 

Just bringing to the list's attention that the proposed changes associated with JIRA KERN-2107 could affect all Solr queries. Those more experienced with Solr may like to run their eye over the proposed changes, as this may not be the correct approach.

 

The aim of the change was to move the "reader restrictions" from the Solr "q" parameter to the Solr "fq" parameter, to prevent the "reader restrictions" from influencing each Solr document's score. I'm not sure what the unintended consequences of this proposal might be.
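To make the proposed move concrete, here is a minimal sketch of the request parameters before and after the change. It is not the Nakamura code: the helper names and the `title:solr` user query are hypothetical, the principals are borrowed from the example later in this thread, and escaping of principal names is omitted for brevity.

```python
from urllib.parse import urlencode


def readers_clause(principals):
    """Build a clause such as readers:(mwalsh OR mst). No escaping is done here."""
    return "readers:(" + " OR ".join(principals) + ")"


def params_with_readers_in_q(user_query, principals):
    # Before: the restriction is ANDed onto q, so it takes part in scoring.
    return {"q": f"({user_query}) AND {readers_clause(principals)}"}


def params_with_readers_in_fq(user_query, principals):
    # After: q carries only the user's query; fq restricts the result set
    # without contributing to the score.
    return {"q": user_query, "fq": readers_clause(principals)}


principals = ["mwalsh", "mst"]  # hypothetical principals for the current user
before = params_with_readers_in_q("title:solr", principals)
after = params_with_readers_in_fq("title:solr", principals)

print(urlencode(before))
print(urlencode(after))
```

Both parameter sets should match the same documents; only the second keeps the access restriction out of relevance scoring.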

 

The proposed code changes can be found here [1], and the pull request can be found here [2].

 

Regards,

Mark Walsh.

 

[1]
https://github.com/mawalsh/nakamura/tree/kern-2107

[2] pull request
https://github.com/sakaiproject/nakamura/pull/284

Charles Sturt University

|   ALBURY-WODONGA   |   BATHURST   |   CANBERRA   |   DUBBO   |   GOULBURN   |   ONTARIO   |   ORANGE    |   SYDNEY   |   WAGGA WAGGA   |

Give Generously - Support Young Australians
You can help young Australians to go to University and succeed in their studies by giving generously to the Charles Sturt University Foundation. To find out more or to make a donation go to the Foundation web site. Australian donations are tax deductible.
LEGAL NOTICE
This email (and any attachment) is confidential and is intended for the use of the addressee(s) only. If you are not the intended recipient of this email, you must not copy, distribute, take any action in reliance on it or disclose it to anyone. Any confidentiality is not waived or lost by reason of mistaken delivery. Email should be checked for viruses and defects before opening. Charles Sturt University (CSU) does not accept liability for viruses or any consequence which arise as a result of this email transmission. Email communications with CSU may be subject to automated email filtering, which could result in the delay or deletion of a legitimate email before it is read at CSU. The views expressed in this email are not necessarily those of CSU.

Charles Sturt University in Australia The Chancellery, Panorama Avenue, Bathurst NSW Australia 2795 (ABN: 83 878 708 551; CRICOS Provider Numbers: 00005F (NSW), 01947G (VIC), 02960B (ACT)).
Charles Sturt University in Ontario 860 Harrington Court, Burlington Ontario Canada L7N 3N4 Registration: www.peqab.ca

Consider the environment before printing this email.

John Norman

Aug 30, 2011, 5:53:23 AM
to sakai-...@googlegroups.com
So I went through the material and had some worries, but not about whether this is a safe change or not.

I could not readily understand what the likely impact was on search scenarios other than the scenario reported in the JIRA - expecting a random search result. Is the change only scoped to a random search request, or all search requests?

It also raises another question about how we manage search tuning and design. I have to imagine our algorithms are very hard to test with sample data, and I am also conscious that Google search development involves a lot of A/B testing. Do we have a plan in this area? Is there somewhere I can look that has a plain English explanation of what our user-facing search behaviour is by design? I'm not sure if this is a Nakamura question or a UX question.

John

John Norman
Director - CARET
University of Cambridge
jo...@caret.cam.ac.uk
+44-1223-765367

Carl Hall

Aug 30, 2011, 5:05:23 PM
to sakai-...@googlegroups.com
As far as the change goes, functionally I agree with it. I think this will also take some of the unexpected randomness out of some of our searches. Thanks for digging into this. 
My only point of reserve is I would like to hear Mark Triggs weigh in on how filter queries that aren't very similar will affect memory usage and caching.

To John's points:
# This will affect all Solr queries in the system as this is how we limit what searched content is shown to the authenticated user. All Solr queries that go through our search framework pass through this processor to have the appropriate "readers" added to the query. Whenever content is indexed, the indexing processor adds "readers" to the index document.

# We don't currently have a good testing arrangement for changes in search tuning and design. Most of our tuning so far has been a result of finding slow queries either through the logs or by user feedback. I think having some description of the search behavior in different areas could be good for deploying institutions and those extending the system.

Mark Triggs

Aug 30, 2011, 5:31:30 PM
to sakai-...@googlegroups.com
Hi Mark,

I assume the issue you're seeing with scoring is that the current
practice of sticking an extra implicit clause onto each query like:

AND readers:(mwalsh OR mst OR ...)

is causing documents that have more readers in common with the current
user's list to end up being scored more highly? If so, I think filters
sound like what you want, since you really just want to know if any term
in that field matched or not.
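The scoring effect described above can be illustrated with a toy model. This is only a sketch under loud assumptions: the scorer awards one point per matching clause, which is not Lucene's actual formula, and the documents and principals are made up for the example.

```python
def toy_score(doc, terms, reader_terms=()):
    """Toy score: one point per matching clause (NOT Lucene's real formula)."""
    words = doc["text"].split()
    text_hits = sum(1 for t in terms if t in words)
    reader_hits = sum(1 for r in reader_terms if r in doc["readers"])
    return text_hits + reader_hits


def can_read(doc, principals):
    # A document is visible if it shares at least one reader with the user.
    return bool(set(doc["readers"]) & set(principals))


def search_readers_in_query(docs, terms, principals):
    # Readers are part of the scored query, so every shared reader adds score.
    hits = [(d["id"], toy_score(d, terms, principals))
            for d in docs
            if toy_score(d, terms) > 0 and can_read(d, principals)]
    return sorted(hits, key=lambda hit: -hit[1])


def search_readers_as_filter(docs, terms, principals):
    # Readers only decide visibility; the score comes from the text terms alone.
    hits = [(d["id"], toy_score(d, terms))
            for d in docs
            if can_read(d, principals) and toy_score(d, terms) > 0]
    return sorted(hits, key=lambda hit: -hit[1])


# Hypothetical documents with identical text but different reader lists.
docs = [
    {"id": "a", "text": "solr tuning notes", "readers": ["mwalsh"]},
    {"id": "b", "text": "solr tuning notes", "readers": ["mwalsh", "mst"]},
    {"id": "c", "text": "solr tuning notes", "readers": ["someone-else"]},
]
principals = ["mwalsh", "mst"]

in_query = search_readers_in_query(docs, ["solr"], principals)
as_filter = search_readers_as_filter(docs, ["solr"], principals)
print(in_query)
print(as_filter)
```

In this toy model, document "b" outranks "a" only because it shares more readers with the user; with the filter approach the two identical documents score the same and "c" stays hidden either way.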

Your pull request looks good to me, but if it's not too hairy I'd
consider passing through the "readers:" filter and any pre-existing
filters as separate strings if possible. Passing multiple filters to
Solr is semantically the same as ANDing them together (as you're doing),
but should allow Solr to construct and cache the filters separately.
Since a user's list of "readers" isn't likely to change much between
queries, it's a good candidate for being cached and reused, and Solr
should do that if you keep it as a separate filter.
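As a rough sketch of the difference between one combined filter and separate filters, the two helpers below build the request parameters both ways. The helper names and the `type:page` filter are hypothetical and stand in for whatever pre-existing filters a query may carry; this is not the code from the pull request.

```python
from urllib.parse import parse_qs, urlencode


def combined_filter_params(user_query, readers_filter, other_filters):
    # One fq value: every filter is ANDed into a single string, so Solr
    # sees (and caches) it as one filter.
    combined = " AND ".join(f"({f})" for f in [readers_filter, *other_filters])
    return [("q", user_query), ("fq", combined)]


def separate_filter_params(user_query, readers_filter, other_filters):
    # One fq parameter per filter: the same documents match, but each
    # filter can be cached and reused on its own.
    return ([("q", user_query), ("fq", readers_filter)]
            + [("fq", f) for f in other_filters])


readers = "readers:(mwalsh OR mst)"
others = ["type:page"]  # hypothetical pre-existing filter

combined = urlencode(combined_filter_params("title:solr", readers, others))
separate = urlencode(separate_filter_params("title:solr", readers, others))

print(combined)
print(separate)
print(parse_qs(separate)["fq"])
```

With separate parameters, the "readers" filter string stays identical from one request to the next for the same user, which is what makes it a good candidate for reuse from the filter cache.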

Give me a yell if I can help with anything,

Mark

--
Mark Triggs
<ma...@dishevelled.net>

John Norman

Aug 30, 2011, 5:44:43 PM
to sakai-...@googlegroups.com
On 30 Aug 2011, at 22:05, Carl Hall wrote:

> As far as the change goes, functionally I agree with it. I think this will also take some of the unexpected randomness out of some of our searches.

??? I thought we were putting randomness in? Are you suggesting that taking readers into account can produce individual results that are surprising to the external observer? This could be a good thing.

> Thanks for digging into this.
> My only point of reserve is I would like to hear Mark Triggs weigh in on how filter queries that aren't very similar will affect memory usage and caching.
>
> To John's points:
> # This will affect all Solr queries in the system as this is how we limit what searched content is shown to the authenticated user. All Solr queries that go through our search framework pass through this processor to have the appropriate "readers" added to the query. Whenever content is indexed, the indexing processor adds "readers" to the index document.

So if this is correct, are we not throwing away some potentially useful information with the other readers that might help when for example, I am looking for stuff related to my courses?

> # We don't currently have a good testing arrangement for changes in search tuning and design. Most of our tuning so far has been a result of finding slow queries either through the logs or by user feedback. I think having some description of the search behavior in different areas could be good for deploying institutions and those extending the system.

I wasn't talking about the technical performance of search code so much as the effectiveness of the search in finding what I am looking for. That's why I tentatively thought it might be a UX issue.

John

Carl Hall

Aug 30, 2011, 6:17:52 PM
to sakai-...@googlegroups.com
On Tue, Aug 30, 2011 at 5:44 PM, John Norman <jo...@caret.cam.ac.uk> wrote:
> On 30 Aug 2011, at 22:05, Carl Hall wrote:
>> As far as the change goes, functionally I agree with it. I think this will also take some of the unexpected randomness out of some of our searches.
>
> ??? I thought we were putting randomness in? Are you suggesting that taking readers into account can produce individual results that are surprising to the external observer? This could be a good thing.

Sorry, there are two points of randomness I talked about and I wasn't very clear.

Topic of this thread: Mark is adding randomness to a feed that is expected to return random content.

What I introduced: Testing at NYU has noted that some of our other search feeds, like searching for users, can return unpredictable results or order things in a non-obvious way (when not sorted explicitly). Moving the "readers" to a filter query will help remove some of this unexpected ordering of results, since the "readers" won't affect scoring in this setup.

>> To John's points:
>> # This will affect all Solr queries in the system as this is how we limit what searched content is shown to the authenticated user. All Solr queries that go through our search framework pass through this processor to have the appropriate "readers" added to the query. Whenever content is indexed, the indexing processor adds "readers" to the index document.
>
> So if this is correct, are we not throwing away some potentially useful information with the other readers that might help when for example, I am looking for stuff related to my courses?

This gets a bit into how the index is designed, but the readers that we add to the search are just a way of enforcing our ACLs during search, since the index is disconnected from the content.

>> # We don't currently have a good testing arrangement for changes in search tuning and design. Most of our tuning so far has been a result of finding slow queries either through the logs or by user feedback. I think having some description of the search behavior in different areas could be good for deploying institutions and those extending the system.
>
> I wasn't talking about the technical performance of search code so much as the effectiveness of the search in finding what I am looking for. That's why I tentatively thought it might be a UX issue.

Looks like we may have a server and UX issue then :)