1. Do not use everything-matches wildcard queries
Getting "the first N of something" can be pretty fast. But issuing a
functionally equivalent text-search query with a universal-wildcard
("the first N of something whose content matches '*'") is slower. And
doing a text-search query with _multiple_ wildcards ("... whose content
matches '*~' or '**'") is slowest of all. If we're not actually
searching for text, we shouldn't ask for a text-search.
Instead, the default query (with no user input) should fetch
"somethings" without any "contains" clause. To help with that switch,
Ian has implemented a slightly tricky way for a single search template
file to hold multiple flavors of query. Currently, the most efficient
way to toggle the efficient query flavor is probably to have it take
effect for the request parameter "q=*". And so in such cases, the client
needs to do what I just said clients shouldn't do. :) (At KERN-1303
I've inquired about adding a less ambiguous trigger.)
Currently, the following search template files support efficient "q=*"
handling:
/var/search/content.json
/var/search/files/allfiles.json
/var/search/files/mybookmarks.json
/var/search/files/mycontacts.json
/var/search/files/myfiles.json
/var/search/groupmembers.json
/var/search/groups.json
/var/search/pool/all.json
/var/search/pool/managing.json
/var/search/pool/me/manager.json
/var/search/pool/me/viewer.json
/var/search/sitecontent.json
/var/search/sites.json
/var/search/users.json
/var/search/usersgroups.json
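
As a rough client-side illustration of item 1, the sketch below sends the bare "q=*" trigger whenever the user has typed nothing, or nothing but wildcard operators. The function names, the "items" parameter, and the choice to collapse inputs like "**" or "*~" into "*" are assumptions made for this example, not part of Nakamura.

```python
from urllib.parse import urlencode

# Characters treated as "wildcard only" input (an assumption for this sketch).
WILDCARD_CHARS = "*~ "


def default_query_term(user_text):
    """Return the user's text, or the bare "*" trigger when there is none."""
    text = (user_text or "").strip()
    # Input made up solely of wildcard operators carries no real search text.
    if not text.strip(WILDCARD_CHARS):
        return "*"
    return text


def build_search_url(template_path, user_text, items=25):
    """Build a request URL; the "items" parameter name is hypothetical."""
    params = {"q": default_query_term(user_text), "items": items}
    return template_path + "?" + urlencode(params)
```

For example, build_search_url("/var/search/users.json", "") requests the template with a URL-encoded "q=*" rather than a multi-wildcard text search.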
2. Do not include multiple wildcards in a single query
A single query that combines a fuzzy word-based search ("test~") and a
wildcard substring-based search ("test*") will take about as long as
doing each separately. It's better to decide which approach [1] applies
and then apply it.
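
One way a client might enforce "pick one approach" is sketched below. The helper name and the "fuzzy"/"prefix" mode labels are invented for illustration; only the trailing "~" and "*" operators come from the Lucene query syntax [1].

```python
def single_operator_query(term, mode):
    """Return a query string that uses exactly one Lucene operator."""
    # Drop any operators the caller already appended, so they never stack up.
    word = term.strip().rstrip("*~")
    if not word:
        raise ValueError("a search word is required")
    if mode == "fuzzy":
        return word + "~"  # fuzzy word-based search
    if mode == "prefix":
        return word + "*"  # wildcard substring-based search
    raise ValueError("mode must be 'fuzzy' or 'prefix'")
```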
3. Avoid broad and vague wildcard searches
A search like "*e*" or "t*" is better than a search like "*~" but it's
still much slower than a more precise search. Lucene even recommends [2]
setting a minimum prefix length for fuzzy searches. If possible, it
would be good to keep very short wildcard searches focused fairly
tightly (e.g., on Basic Profile data rather than on all content under a
User's home folder).
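
A client could screen out the vaguest searches before sending them. The following is a minimal sketch, assuming a minimum of three literal characters before the first wildcard; the threshold and the function names are hypothetical and would need tuning.

```python
import re

# Hypothetical threshold: literal characters required before any wildcard.
MIN_LITERAL_PREFIX = 3


def literal_prefix_length(query):
    """Count the literal characters that precede the first wildcard operator."""
    return len(re.split(r"[*?~]", query.strip(), maxsplit=1)[0])


def is_focused_enough(query, minimum=MIN_LITERAL_PREFIX):
    """True when the query starts with enough literal text to be selective."""
    return literal_prefix_length(query) >= minimum
```

Under this rule "*e*", "t*" and "*~" are rejected, while "mark*" passes.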
4. Use the simplest and most direct query that makes sense
Queries such as the ones we currently use for User and Group entities
are inefficient so far as the server is concerned and often also
inefficient for the user experience: if I'm trying to find a user whose
last name is Marks, I won't be happy to get the Profiles of every entity
who mentioned the English word "mark" in their Pages.
If no one objects, I'd like to add some more focused search template
files to Nakamura to support client-side experimentation; for example,
"/var/search/users-profile.json" (focusing on Profile data) and
"/var/search/users-basic.json" (focusing on Basic Profile data).
5. The big access control issue
(No recommendation for this yet, just a warning.)
Existing text-search technologies assume that anyone who is allowed to
query a set of resources will also be allowed to see the search results.
This leads to some difficult performance issues when our search results
have more finely grained access controls (as when a non-logged-in
session queries all files in Pooled Content). The particular sort of
broad and complex search done in "users.json" and "groups.json" (where
content is searched but not directly returned to the client) also leads
to a tricky security issue (KERN-1323) which can probably only be fixed
by making performance even worse.
[1] http://lucene.apache.org/java/2_4_1/queryparsersyntax.html
[2] http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
Best,
Ray
On Wed, Oct 27, 2010 at 7:09 PM, Ray Davis <r...@media.berkeley.edu> wrote:
> Existing text-search technologies assume that anyone who is allowed to query
> a set of resources will also be allowed to see the search results.
... is there not some way to assume the above as far as the search
operation is concerned, but then have the UI mask the returned
resources for which viewing is not permitted (i.e. a second request
which asks, from the given list, which can I view)? I guess what I'm
wondering about is whether we're pushing an extra burden onto returned
results from the server that might be more efficiently handled as a
matter of presentation.
Told you it was naive.
~Clay
The server can't respond with anything about items you can't view; that would be a security breach and an information leak.
For example, if I search for "Ian Boston" and get back a list of documents that I then find I can't read, I might deduce that the contents of those documents are sensitive. The existence of something in a set of search results is often enough to leak the information.
It doesn't matter that the operation is performed behind the scenes by the browser; if it's over the network then it's leaked, so we have to filter it in the server while it's still secure, before it gets out.
Ian
Well, we sort of do that already, which is why we have the
performance and security problems. :)
Performance problem - Lucene is wonderfully fast at collecting a small
number of matching results. But by interceding in a post-result stage,
we ruin its assumptions:
1. We ask for 100 results, and Jackrabbit asks Lucene.
2. Lucene hands Jackrabbit 100 results.
3. In post-result filtering Jackrabbit throws away 90 results that the
current user can't see.
4. Jackrabbit asks Lucene for 200 results.
5. ... and throws them away. So now Jackrabbit and Lucene get serious,
and fetch 400 results.
6. And so on until every possible result (the equivalent of paging
through possibly tens of thousands of content nodes) has been retrieved.
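
The escalation in the steps above can be mimicked with a toy model. This is not Jackrabbit's code: the function name is invented, and the doubling rule is simply taken from the 100 / 200 / 400 sequence in the list.

```python
def post_filtered_fetch(matches, can_view, wanted):
    """Toy model of post-result filtering.

    Returns the visible results plus the size of each batch that had to be
    fetched from the index before enough visible results were found.
    """
    fetch_size = wanted
    batch_sizes = []
    while True:
        batch = matches[:fetch_size]  # ask the index for this many matches
        batch_sizes.append(len(batch))
        visible = [m for m in batch if can_view(m)]  # discard hidden matches
        if len(visible) >= wanted or len(batch) == len(matches):
            return visible[:wanted], batch_sizes
        fetch_size *= 2  # not enough survivors: ask again for twice as many


# 10,000 matches, of which the current user may see only every 1,000th.
matches = list(range(10_000))
visible, batch_sizes = post_filtered_fetch(
    matches, lambda m: m % 1000 == 0, wanted=100
)
print(len(visible), batch_sizes, sum(batch_sizes))
```

In this run the request for 100 results ends up retrieving every match, and 22,700 rows are pulled in total to return 10 visible ones.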
Security issue - Let's say we _don't_ have Jackrabbit check that the
current user is allowed to see the particular content node that matched
the search. (Which is what we inadvertently made happen in our current
User/Group search.) Well, then Jackrabbit also doesn't know why the
search result is there, and therefore it doesn't have enough information
to know that it should be weeded out of the results.
The easy solution to both problems is to do what text-search engines
usually do: search what the user has access to and report the matches
directly. That's the situation with publicly accessible data or for
community-scoped data, which is where we usually use search engines.
Best,
Ray
Yes, and that's what we do.
Search Templates may be configured with an implementation of the SearchResultPostProcessor interface that is looked up from OSGi and used to post-process the result set.
It inserts a filter iterator into the stack of iterators from which the result sets are pulled. At the lower levels of the stack are paging filters and counting filters, as well as, in some cases, merging filters for merging from 2 iterator streams.
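
The general shape of such an iterator stack can be sketched with Python generators. This only illustrates the pattern; it is not the Nakamura implementation, the function names are invented, and the ordering shown (access filter before paging) is an assumption.

```python
from itertools import islice


def access_filter(rows, can_view):
    """Filter iterator: pass along only the rows the current user may see."""
    for row in rows:
        if can_view(row):
            yield row


def paging(rows, page, items):
    """Paging iterator: yield one page of rows from the upstream iterator."""
    start = page * items
    return islice(rows, start, start + items)


def search_page(raw_results, can_view, page, items):
    """Stack the iterators; rows are pulled lazily, one at a time."""
    return list(paging(access_filter(iter(raw_results), can_view), page, items))
```

Because each layer pulls from the one beneath it on demand, search_page(range(20), lambda r: r % 2 == 0, page=1, items=3) reads only as far into the raw results as it needs to fill the second page.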
>
> sort of like a results rendering pipeline?
Yes, absolutely.
Ian