Null transiency in the indexes


Ayende Rahien

Dec 6, 2010, 11:50:31 PM
to ravendb
Assume the following index:

from doc in docs
select new { doc.User.Name }


What happens if the document doesn't have User.Name?

Right now we throw a NullReferenceException (NRE), which aborts the indexing of that particular document, and if enough indexed documents throw, we disable the index.

We can handle it in a few other ways:
1. Trying to index a missing value results in not indexing the document at all, but does not generate an error that might disable the index.
2. Trying to index missing values works, but skips any missing values (if all the values are missing, the index entry is skipped completely).
3. Index the value as NULL_VALUE (as if it were null).
4. Index the value as MISSING_VALUE.
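
To make the four options concrete, here is a minimal sketch in plain C#. It is not RavenDB's implementation: documents are modeled as nested dictionaries, and MissingValuePolicy, TryResolve, and BuildEntry are hypothetical names invented for the illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical policy names mirroring options 1-4; none of these exist in RavenDB.
public enum MissingValuePolicy { SkipDocument, SkipField, IndexAsNull, IndexAsMissing }

public static class MissingValueSketch
{
    public const string NullValue = "NULL_VALUE";
    public const string MissingValue = "MISSING_VALUE";

    // Walks a dotted path such as "User.Name" through nested dictionaries.
    // Returns false when a segment is absent or an intermediate value is null,
    // which is the case that currently surfaces as a NullReferenceException.
    public static bool TryResolve(IDictionary<string, object> doc, string path, out object value)
    {
        object current = doc;
        foreach (var segment in path.Split('.'))
        {
            var map = current as IDictionary<string, object>;
            if (map == null || !map.TryGetValue(segment, out current))
            {
                value = null;
                return false;
            }
        }
        value = current;
        return true;
    }

    // Builds the index entry for one document, or returns null when the
    // document should not be indexed at all.
    public static Dictionary<string, object> BuildEntry(
        IDictionary<string, object> doc, IEnumerable<string> paths, MissingValuePolicy policy)
    {
        var entry = new Dictionary<string, object>();
        foreach (var path in paths)
        {
            if (TryResolve(doc, path, out var value))
            {
                entry[path] = value ?? NullValue; // an explicit null is a real value
                continue;
            }
            switch (policy)
            {
                case MissingValuePolicy.SkipDocument:   // option 1
                    return null;
                case MissingValuePolicy.SkipField:      // option 2
                    break;
                case MissingValuePolicy.IndexAsNull:    // option 3
                    entry[path] = NullValue;
                    break;
                case MissingValuePolicy.IndexAsMissing: // option 4
                    entry[path] = MissingValue;
                    break;
            }
        }
        // Option 2: if every value was missing, nothing is left to index.
        return entry.Count == 0 ? null : entry;
    }
}
```

For a document with a Title but a null User, indexed on Title and User.Name, option 1 produces no entry, option 2 produces an entry with Title only, and options 3 and 4 produce Title plus the respective sentinel for User.Name.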

Thoughts?

Tobias Grimm

Dec 7, 2010, 3:08:16 AM
to rav...@googlegroups.com
On Tuesday, 07.12.2010, at 14:50 +1000, Ayende Rahien wrote:


> from doc in docs
> select new { doc.User.Name }
> What happens if the document doesn't have User.Name?

IMHO, if Name is null this should be indexed as a NULL value, so that I
can query for users without a Name. If User is null, then the doc
shouldn't be in the index at all. When there's another field like:

select new { Title= Title, doc.User.Name }

...then if User=null it should probably be indexed as (Title,
MISSING_VALUE)-> doc.

For the above index I would e.g. like to be able to do:

Query<Doc>(doc => doc.User.Name == "Bob")
// -> docs with User=null are just ignored

Query<Doc>(doc => doc.User.Name == null)
// -> docs with User=null are just ignored

Query<Doc>(doc => doc.Title == "something")
// -> Includes docs with User == null

Query<Doc>(doc => doc.Title == "something" && doc.User.Name == "Bob")
// -> docs with User=null are just ignored

Tobias

Tobias Grimm

Dec 7, 2010, 3:11:09 AM
to rav...@googlegroups.com
> select new { Title= Title, doc.User.Name }

Typo, should be: select new { doc.Title, doc.User.Name }


Ayende Rahien

Dec 7, 2010, 3:52:48 AM
to ravendb
What is the meaning of MISSING_VALUE vs. not indexing it in the first place?
We aren't going to allow querying for missing values.

SHSE

Dec 7, 2010, 5:06:29 AM
to ravendb
I think that option 2 will be the best. The "select" keyword should tell
the database WHAT to index from the MATCHED document. If we can't find
what to index, we just index nothing.

And for the others:
1. There is no reason not to index the entire document, because the
document matches the "where" condition. Skipping it would be confusing.
3. Null is actually a VALUE. Indexing it would mean that we found a
value and indexed it, which is not true.
4. What is the use case for this option?
If we want to find, say, all documents that contain a "User" property,
why wouldn't we do something like this:

from doc in docs
where doc.User != null
select doc.User.Name

Or:

from doc in docs
where doc.HasProperty("User")
select doc.User.Name

Ok, suppose we choose option 4. How do we query this index with
MISSING_VALUE?
session.Query<Foo>(foo => foo.Name == QueryUtil.Missing) ?
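
As a rough illustration of the first "where" variant above, the same query expression can be run over in-memory objects. This is only a LINQ-to-Objects sketch, not a RavenDB index definition; the Doc and User classes and the IndexedNames helper are assumptions made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical document shapes for the illustration.
public sealed class User { public string Name; }
public sealed class Doc { public string Id; public User User; }

public static class WhereFilterSketch
{
    // Same shape as "from doc in docs where doc.User != null select doc.User.Name",
    // evaluated in memory. Documents without a User are filtered out before
    // doc.User.Name is ever dereferenced, so no NullReferenceException occurs.
    public static List<string> IndexedNames(IEnumerable<Doc> docs) =>
        (from doc in docs
         where doc.User != null
         select doc.User.Name).ToList();
}
```

A document whose User is null contributes nothing, while a document whose User exists but has a null Name still contributes a null value, so the two cases stay distinguishable without a MISSING_VALUE marker.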

Bruno Lopes

Dec 7, 2010, 5:21:31 AM
to rav...@googlegroups.com
May be better for diagnostics? 
I know I sometimes peek into the index to figure out what's wrong with a particular index.

Not indexing the whole document feels troublesome, imo. If you're working on an index with many fields and the error is in a different field than the one you're working on, or if you forget to set a property, it can take a while to figure out what's wrong. At least until we get some more diagnostics (RavenProf :P )

Indexing as NULL makes doc.User.Name being null indistinguishable from doc.User being null.

Between not indexing the value and indexing it as MISSING_VALUE, I'd prefer the MISSING_VALUE option.

Matt Warren

Dec 7, 2010, 6:44:24 AM
to ravendb
Is this a test? Are we supposed to think of another option, as per your
last blog post? ;-)
See http://ayende.com/Blog/archive/2010/12/04/the-trap-of-choices.aspx.

Chris Marisic

Dec 7, 2010, 9:54:43 AM
to ravendb
I'm a fan of MISSING_VALUE; indexes shouldn't stop working entirely
because a document is incorrect. The only question I have is what
implications this has for the client API. Perhaps the client API
should somehow show that this error case occurred?

Perhaps the real reason this is brought up is that error collection for
indexes is lacking? Because the indexing error collection is just
about completely worthless as it stands now.

However, it would help if the error results were more along the lines of:

Index: UserByID, document: user/1 (link to document), select new
{ doc.User.Name } results in NullReferenceException.

Backtracking to my earlier statement, perhaps it would make sense to
return all of this information to the client when you query an index,
so that you get something like .Advanced.IndexErrors, or have it come
back with the statistics object or something.

Another thing to consider is perhaps this shouldn't just be a single
defined convention, maybe it should be defined as a convention for
Raven and then added to indexes as a Raven option that lets you choose
what happens with indexing errors: raise error, ignore entire
document, ignore field / return null.
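
A per-index error option plus a richer error record could look roughly like the sketch below. Every name in it (IndexingErrorPolicy, IndexingError, IndexErrorLog, Handle) is hypothetical and not part of RavenDB; the message format follows the example error line above, minus the link.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-index option: what to do when a field cannot be indexed.
public enum IndexingErrorPolicy { RaiseError, IgnoreDocument, IgnoreField }

// Hypothetical error record carrying enough context to find the problem.
public sealed class IndexingError
{
    public string Index;
    public string DocumentId;
    public string Expression;
    public string Problem;

    public override string ToString() =>
        $"Index: {Index}, document: {DocumentId}, {Expression} results in {Problem}";
}

public sealed class IndexErrorLog
{
    private readonly List<IndexingError> errors = new List<IndexingError>();

    public IReadOnlyList<IndexingError> Errors => errors;

    // Records the error, then applies the policy chosen for the index.
    // Returns true when the rest of the document should still be indexed.
    public bool Handle(IndexingErrorPolicy policy, IndexingError error)
    {
        errors.Add(error);
        switch (policy)
        {
            case IndexingErrorPolicy.RaiseError:
                throw new InvalidOperationException(error.ToString());
            case IndexingErrorPolicy.IgnoreDocument:
                return false;
            default: // IgnoreField
                return true;
        }
    }
}
```

Because the error is recorded before the policy is applied, the log keeps the document id and failing expression even when the policy is to ignore the field or the document silently.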

Tobi

Dec 7, 2010, 12:00:28 PM
to rav...@googlegroups.com
On 07.12.2010 at 09:52, Ayende Rahien wrote:

> What is the meaning of MISSING_VALUE vs. Not indexing it in the first place?

Just what I explained in my sample queries. With the following index:

select new { doc.Title, doc.User.Name }

...when a document has a Title != null and a User == null, I still want
to be able to query this index for documents by Title.

Query<Doc>(doc => doc.Title == "something")

When doc.User is null then doc.User.Name is missing. But because
there's also doc.Title in the index, the document must be indexed.

In order to do that, the doc.User.Name field in the index probably
needs a special "MISSING_VALUE" which can be distinguished from null or
any real value. (I might want to query for doc.User.Name == null, which
must not include docs where doc.User == null)

A document may only be completely ignored if ALL fields in the index
are missing.
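
The distinction can be sketched with a sentinel object that is neither null nor a real value. This is only an in-memory model under assumed names (Missing, Entry, Ids), not how RavenDB or Lucene stores index terms.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MissingVersusNullSketch
{
    // Hypothetical sentinel: distinct from null and from any real string.
    public static readonly object Missing = new object();

    // One index entry per document for "select new { doc.Title, doc.User.Name }".
    public sealed class Entry
    {
        public string Id;
        public object Title;
        public object UserName;
    }

    public static readonly List<Entry> Entries = new List<Entry>
    {
        new Entry { Id = "docs/1", Title = "something", UserName = "Bob" },
        new Entry { Id = "docs/2", Title = "something", UserName = null },    // User exists, Name is null
        new Entry { Id = "docs/3", Title = "something", UserName = Missing }, // User itself is null
    };

    // Returns the ids of all entries matching the predicate.
    public static List<string> Ids(Func<Entry, bool> predicate) =>
        Entries.Where(predicate).Select(e => e.Id).ToList();
}
```

With these entries, filtering on Title equal to "something" returns all three documents, filtering on UserName == null returns only docs/2, and docs/3 matches neither a "Bob" filter nor a null filter, which is the behavior described above.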

Tobias

Ayende Rahien

Dec 7, 2010, 3:39:19 PM
to ravendb
No

Ayende Rahien

Dec 7, 2010, 3:42:13 PM
to ravendb
inline

On Wed, Dec 8, 2010 at 12:54 AM, Chris Marisic <ch...@marisic.com> wrote:
> I'm a fan of MISSING_VALUE; indexes shouldn't stop working entirely
> because a document is incorrect. The only question I have is what
> implications this has for the client API. Perhaps the client API
> should somehow show that this error case occurred?
>
> Perhaps the real reason this is brought up is that error collection for
> indexes is lacking? Because the indexing error collection is just
> about completely worthless as it stands now.
>
> However, it would help if the error results were more along the lines of:
>
> Index: UserByID, document: user/1 (link to document), select new
> { doc.User.Name } results in NullReferenceException.

Oh! I like this. I'll add this to the SL UI.

> Backtracking to my earlier statement, perhaps it would make sense to
> return all of this information to the client when you query an index,
> so that you get something like .Advanced.IndexErrors, or have it come
> back with the statistics object or something.

The problem is that, for the most part, you don't care when you are querying.
But maybe we can put something like [1230/43932] in the UI next to the index name?

> Another thing to consider is perhaps this shouldn't just be a single
> defined convention, maybe it should be defined as a convention for
> Raven and then added to indexes as a Raven option that lets you choose
> what happens with indexing errors: raise error, ignore entire
> document, ignore field / return null.

That is a very good point, yes.
Does anyone want to implement this?

Ayende Rahien

Dec 8, 2010, 10:13:39 PM
to ravendb
I just pushed this to the transitive-null branch. There are 3 failing tests, and I am literally in the queue to board the plane.
If you can, please take a look at this.

Remco Ros

Dec 9, 2010, 7:34:39 AM
to ravendb
I had a quick look. It looks fine. My queries now run as expected too.

Remco