Worst mistakes in ravendb design / api / etc


Oren Eini (Ayende Rahien)

Feb 9, 2016, 6:40:02 AM2/9/16
to ravendb
Guys,
We are doing a lot of work on 4.0, and one of the things we are looking at is not just adding new stuff, but removing bad old stuff.

What are the things that you regret having in RavenDB? Pitfalls, issues, points of confusion, etc.?


Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

 

Chris Marisic

Feb 9, 2016, 10:56:21 AM2/9/16
to RavenDB - 2nd generation document database
#1 It is still absurdly hard to index nested objects, especially nested collections within nested objects, and for people to get the behavior they expect.

#2 System administration needs to be made first class; the tools that exist now are more akin to dev tools than actual production server administration tools.

#3 Better ways to do boosting and other fine-grained results control.

Kijana Woodard

Feb 9, 2016, 11:02:00 AM2/9/16
to rav...@googlegroups.com
- Allowing int and guid as ids for the client.
- Remove attachments [already planned?].
- Remove identities or make them a "first class citizen" [documents?]. They are hard to manage over time.
- Invoking concurrency checks with session.Store is not really clear. [maybe just docs / intellisense issue]
- Invoking DatabaseCommands through session.Advanced should default to the db of the session. Fun bugs happened there.
- Drop sql replication and instead provide a gist demonstrating how to do it with Data Subscriptions.
- Consider dropping the original patch commands and stick with js patches.
- Consider some consolidation between map/reduce and dynamic aggregation. Perhaps Data Subscriptions would be better for some cases.
- Using a transformer within a transformer is awkward. 
- Trying to session.Load/Query to "a different model" is awkward. Sometimes I just want to coerce the data into a different model and ignore the class in metadata. It works surprisingly well if the metadata model is not in the project.
- Where you can and can't assert generic parameters in session.Include/Query is confusing.
- The order of generic parameters is confusing between Queries and Transformers. I always expect Transformers parameters to be swapped.

I'd like some consolidation between Stream, Data Subscriptions, Changes, LoadStartingWith, increased Take limits + Take(int.MaxValue), and deep paging. They all kinda sorta cover the same ground, but with quirks that make each of them difficult to work with deterministically.

Right now, replication is fraught with "bad choices" for failure cases. Hopefully raft [3.5?] alleviates that issue.

Minor nit - it'd be nice to be able to pass null to session.Load and get null back instead of throwing an exception. I usually end up having to "check null twice" in some way.
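A sketch of the "check null twice" pattern being described (the extension method name is hypothetical):

```csharp
// Hypothetical helper wrapping session.Load to tolerate a null id:
public static T LoadOrNull<T>(this IDocumentSession session, string id)
{
    if (id == null)              // first check: Load(null) would throw
        return default(T);
    return session.Load<T>(id);  // second check happens at the call site,
                                 // since the document may simply not exist
}
```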




 


Oren Eini (Ayende Rahien)

Feb 9, 2016, 12:15:11 PM2/9/16
to ravendb
1) What do you mean, hard to index nested objects?
2) Can you be more specific? 
3) What is missing?


Oren Eini (Ayende Rahien)

Feb 9, 2016, 12:18:08 PM2/9/16
to ravendb
inline



On Tue, Feb 9, 2016 at 6:01 PM, Kijana Woodard <kijana....@gmail.com> wrote:
- Allowing int and guid as ids for the client.

Yes, that is something that we want to remove.
 
- Remove attachments [already planned?].

Done
 
- Remove identities or make them a "first class citizen" [documents?]. They are hard to manage over time.

What do you mean? You want to store this as a document like hilo?
 
- Invoking concurrency checks with session.Store is not really clear. [maybe just docs / intellisense issue]

What do you mean?
 
- Invoking DatabaseCommands through session.Advanced should default to the db of the session. Fun bugs happened there.

Yes, that needs to be fixed.
 
- Drop sql replication and instead provide a gist demonstrating how to do it with Data Subscriptions.

Why?
That means that every user will have to write their own replication system, which is decidedly non trivial, even if subscriptions handles a lot of it.
 
- Consider dropping the original patch commands and stick with js patches.

Planned.
 
- Consider some consolidation between map/reduce and dynamic aggregation. Perhaps Data Subscriptions would be better for some cases.

Not following.
 
- Using a transformer within a transformer is awkward. 
- Trying to session.Load/Query to "a different model" is awkward. Sometimes I just want to coerce the data into a different model and ignore the class in metadata. It works surprisingly well if the metadata model is not in the project.
- Where you can and can't assert generic parameters in session.Include/Query is confusing.
- The order of generic parameters is confusing between Queries and Transformers. I always expect Transformers parameters to be swapped.

I'd like some consolidation between Stream, Data Subscriptions, Changes, LoadStartingWith, increased Take limits + Take(int.MaxValue), and deep paging. They all kinda sorta cover the same ground, but with quirks that make each of them difficult to work with deterministically.


Can you expand on that?
 

Right now, replication is fraught with "bad choices" for failure cases. Hopefully raft [3.5?] alleviates that issue.


What do you mean?

Chris Marisic

Feb 9, 2016, 12:31:07 PM2/9/16
to RavenDB - 2nd generation document database


On Tuesday, February 9, 2016 at 11:15:11 AM UTC-6, Oren Eini wrote:
1) What do you mean, hard to index nested objects?

Foo {
    Bar {
        List<Baz> {
            each Baz {
                SubInnerBaz {
                    List<Address>
                }
            }
        }
    }
}

I want to find Foo by Baz's address. Try free-coding that answer and see how many attempts at the index definition you need before it works in real code. So much of Raven has always felt like "I don't know, this should work", followed by an hour of whacking it with a wrench until something slightly different works for no obvious reason, and then you move on.
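For illustration, the index this seems to call for might look like the following. A sketch only, not a tested definition; the property names (Bazs, Addresses, City) are guesses extrapolated from the pseudo-model above:

```csharp
public class Foos_ByBazAddress : AbstractIndexCreationTask<Foo>
{
    public Foos_ByBazAddress()
    {
        // Fan out across every nested collection to reach Address
        Map = foos => from foo in foos
                      from baz in foo.Bar.Bazs
                      from address in baz.SubInnerBaz.Addresses
                      select new { AddressCity = address.City };
    }
}
```

Getting each `from` clause right, and knowing what shape the indexed fields end up with, is exactly where the trial and error happens.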
 
2) Can you be more specific? 

All of them. All of the tools are awful. You need a singular admin tool that supports local and remote backup, recovery, restoration, some basic level of shard balancing, replication control, and deployment & upgrades, so that multiple servers can be easily administered concurrently through a single application.
 
3) What is missing?

So take a simple full text search example: Product Title and Product Description. I want to FTS across all of the text there, but prioritize matches in the title.

Similarly, things like better built-in support for ngram indexes and other more advanced scenarios.
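Today that title-vs-description prioritization can be approximated with query-time boosting; a sketch, where the index name and field names are illustrative:

```csharp
var results = session.Advanced.DocumentQuery<Product>("Products/Search")
    .Search("Title", searchTerms).Boost(10)  // title matches score higher
    .OrElse()
    .Search("Description", searchTerms)
    .ToList();
```

The complaint stands: this works for the simple case, but finer-grained relevance control quickly gets awkward.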

Kijana Woodard

Feb 9, 2016, 1:22:59 PM2/9/16
to rav...@googlegroups.com
inline

On Tue, Feb 9, 2016 at 11:17 AM, Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:
inline



On Tue, Feb 9, 2016 at 6:01 PM, Kijana Woodard <kijana....@gmail.com> wrote:
- Allowing int and guid as ids for the client.

Yes, that is something that we want to remove.
 
- Remove attachments [already planned?].

Done 
 
- Remove identities or make them a "first class citizen" [documents?]. They are hard to manage over time.

What do you mean? You want to store this as a document like hilo?
 

Yeah. I've settled on creating documents manually that hold identities. Makes them easier to manage, export/import, exclude, patch, etc.

- Invoking concurrency checks with session.Store is not really clear. [maybe just docs / intellisense issue]

What do you mean?

Between forcing a concurrency check with an etag and overwriting by specifying the id, the session.Store overloads are confusing. If you set the string id, what does that mean for the id property on your document instance? If you've set concurrency on the session [or as the default], the string overload will never work. With etag and string... I guess it's saying overwrite if the etag hasn't changed. The etag story [assuming you've sent it from the client] is even harder to reason about in a multi-master scenario.
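For reference, the overloads in question (3.x client), which mix "overwrite" and "concurrency check" semantics:

```csharp
session.Store(order);                    // id generated by convention (hilo)
session.Store(order, "orders/1");        // explicit id: last write wins
session.Store(order, etag);              // write only if the etag still matches
session.Store(order, etag, "orders/1");  // both at once

// ...unless the session forces a check everywhere:
session.Advanced.UseOptimisticConcurrency = true;
```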
 
- Invoking DatabaseCommands through session.Advanced should default to the db of the session. Fun bugs happened there.

Yes, that needs to be fixed.
 
- Drop sql replication and instead provide a gist demonstrating how to do it with Data Subscriptions.

Why?
That means that every user will have to write their own replication system, which is decidedly non trivial, even if subscriptions handles a lot of it.

When there are issues, and there are many on the mailing list, you have to wait for a new release and for that to become stable, or patch it yourself. Given reliable data subscriptions, maintaining the sql statements in C# seems preferable to shipping js. Is there "enough left over" to justify a supported feature, or is there "so much variety" in what people want to do that it's better to simply demonstrate a few ways via a gist / sample app and let people glue the data subscription to the external store of their choice? For instance, write to Elasticsearch, or to files, or whatever.
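A rough sketch of the kind of gist being proposed: glue a data subscription to an external store. The 3.x subscription API is approximated from memory, and WriteToSqlServer is a hypothetical helper standing in for whatever external write the user needs:

```csharp
// Create (once) and open a subscription over all Order documents
var id = store.Subscriptions.Create(new SubscriptionCriteria<Order>());
var subscription = store.Subscriptions.Open<Order>(
    id, new SubscriptionConnectionOptions());

subscription.Subscribe(order =>
{
    WriteToSqlServer(order);  // or Elasticsearch, flat files, etc.
});
```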
 
- Consider dropping the original patch commands and stick with js patches.

Planned.
 
- Consider some consolidation between map/reduce and dynamic aggregation. Perhaps Data Subscriptions would be better for some cases.

Not following.
It's not clear today how to achieve multi-level reduce. The advice is to do one level of reduce and then dynamic aggregation. That's more art than science. If multi-level reduce is not [easily] achievable, then maybe data subscriptions provide a clearer path forward. Fwiw, having reduce results as documents might be interesting here. The documents are mapped and possibly reduced [re-reduced]. Almost sounds like SIR at that point [see new entry below].
 
- Using a transformer within a transformer is awkward. 
- Trying to session.Load/Query to "a different model" is awkward. Sometimes I just want to coerce the data into a different model and ignore the class in metadata. It works surprisingly well if the metadata model is not in the project.
- Where you can and can't assert generic parameters in session.Include/Query is confusing.
- The order of generic parameters is confusing between Queries and Transformers. I always expect Transformers parameters to be swapped.
Do all of these make sense? 

I'd like some consolidation between Stream, Data Subscriptions, Changes, LoadStartingWith, increased Take limits + Take(int.MaxValue), and deep paging. They all kinda sorta cover the same ground, but with quirks that make each of them difficult to work with deterministically.


Can you expand on that?
 
Ultimately, all of these options represent ways to bypass the "safe by default" 128 - 1024 Load/Query restriction. Fwiw, I'm happy with that restriction.

But when you decide you need to "go through all the results", there are several options that can accomplish that. They all have trade offs that appear to be dead ends. I'm coming to believe that Data Subscriptions is the best bet except for a few caveats: 

- You can't use your own id, which means you have to save the subscription id to a document, which raises its own failure scenario issues.

- They don't work well in a fail over scenario between primary and secondary. I think raft will address this problem.

I use Stream fairly regularly. Invariably, I end up bringing the stream into memory [ToList, et al] so as not to get caught in the "reading too slow" trap. That's especially annoying when the bottleneck is writing back to the db. Of course, bringing the stream into memory defeats the purpose of "streaming" in the first place. Then you have to be careful about not loading "too much". Both the "reading too slow" timeout and "read all" workaround introduce non-deterministic errors into the system. User activity around document creation can create these issues long after the software is released.
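The workaround described above, sketched: drain the stream into memory up front so slow per-document processing doesn't trip the read timeout, at the cost of defeating the streaming. ProcessAndWriteBack is a hypothetical stand-in for the slow work:

```csharp
var buffered = new List<Order>();
using (var enumerator = session.Advanced.Stream(session.Query<Order>()))
{
    while (enumerator.MoveNext())
        buffered.Add(enumerator.Current.Document);
}

// The slow work now runs against memory, not the open stream,
// so a large result set trades the timeout for memory pressure.
foreach (var order in buffered)
    ProcessAndWriteBack(order);
```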

I should add here that Scripted Patch is sometimes an alternative. That's equally hard to reason about. 
- If multiple patches are started, will the operations interleave?
- Can a patch timeout on the server?
- If so, is the work that's been done in a transaction?
- If not, how would one determine where to "restart"...assuming one even knew that failure occurred?

Right now it's kind of "fire and pray". Stream and Data Subscriptions at least give you reliable semantics around partial success.

By consolidation, I wonder if Stream could be replaced by a non-persistent Data Subscription. Do we even need LoadStartingWith considering the overloads for Stream/Data Subscriptions? Changes is fine as Data Subscription support, but does it still have sufficient value as a front line api element? Deep paging [Skip(12300).Take(100)] doesn't work very well [slow, iirc] and is kind of a silly UI concept. Take(int.MaxValue) has the same problem [memory pressure] as Stream().ToList().

Right now, replication is fraught with "bad choices" for failure cases. Hopefully raft [3.5?] alleviates that issue.


When the Primary goes down, how can your code reason about the value of the Secondary? You can allow writes, and then deal with conflict resolution. You can allow reads, but risk bewildered users who "swear I just changed that". Even reads during normal operations risk interleaved results if you happen to bounce between nodes [two web servers pinned to opposite dbs]. I think raft will address those issues because the write is either on the majority or not. The next piece would be validating that a read is from a server "at least as up to date" as the end client.

What do you mean?
 
Minor nit - it'd be nice to be able to pass null to session.Load and get null back instead of throwing an exception. I usually end up having to "check null twice" in some way.
Any thoughts on this one?
 
Another candidate for removal: is SIR pulling its weight as a feature? I like the idea, but again, it's possibly supplanted by data subscriptions and a tiny amount of custom C#.

Oren Eini (Ayende Rahien)

Feb 9, 2016, 2:25:38 PM2/9/16
to ravendb
 

What do you mean? You want to store this as a document like hilo?
 

Yeah. I've settled on creating documents manually that hold identities. Makes them easier to manage, export/import, exclude, patch, etc.


Is this still the case, since smuggler works with identities? And you can modify them in the studio?
There are good reasons why we want to keep them out of documents. To start with, they aren't documents, really.

 
 
- Drop sql replication and instead provide a gist demonstrating how to do it with Data Subscriptions.

Why?
That means that every user will have to write their own replication system, which is decidedly non trivial, even if subscriptions handles a lot of it.

When there are issues, and there are many on the mailing list, you have to wait for a new release and for that to become stable, or patch it yourself. Given reliable data subscriptions, maintaining the sql statements in c# seems preferable to shipping js. Is there "enough left over" to justify a supported feature or is there "so much variety" of what people want to do, it's better to simply demonstrate a few ways via gist / sample app and let people glue the data subscription to the external store of choice. For instance, write to elastic search or to files or whatever.
 

Absolutely disagreeing with you here. For several reasons.
One, the SQL Replication is a major feature in the way people consider RavenDB. 
Second, saying "here is how you can do that" because we have reliable subscriptions seems very much like this mindset:
Inline image 1

Third, I don't think we are seeing very many SQL Replication issues in the past year or so. We had a lot with index replication, but SQL Replication has been pretty great & stable.

 
- Consider some consolidation between map/reduce and dynamic aggregation. Perhaps Data Subscriptions would be better for some cases.

Not following.
It's not clear today how to achieve multi-level reduce. Advice is to do one level of reduce and then dynamic aggregation. That's more art than science. If multi reduce is not [easily] achievable, then maybe data subscriptions provides a clearer path forward. Fwiw, having reduce results as documents might be interesting here. The documents are mapped and possibly reduced [re-reduced]. Almost sounds like SIR at that point [see new entry below].
 

We already have that in SIR, no?
 
- Using a transformer within a transformer is awkward. 
- Trying to session.Load/Query to "a different model" is awkward. Sometimes I just want to coerce the data into a different model and ignore the class in metadata. It works surprisingly well if the metadata model is not in the project.
- Where you can and can't assert generic parameters in session.Include/Query is confusing.
- The order of generic parameters is confusing between Queries and Transformers. I always expect Transformers parameters to be swapped.
Do all of these make sense? 


Yes. That is an API issue. I wish we had better overall way to handle such complex things in the first place, to be honest.
 
I'd like some consolidation between Stream, Data Subscriptions, Changes, LoadStartingWith, increased Take limits + Take(int.MaxValue), and deep paging. They all kinda sorta cover the same ground, but with quirks that make each of them difficult to work with deterministically.


Can you expand on that?
 
Ultimately, all of these options represent ways to bypass the "safe by default" 128 - 1024 Load/Query restriction. Fwiw, I'm happy with that restriction.

But when you decide you need to "go through all the results", there are several options that can accomplish that. They all have trade offs that appear to be dead ends. I'm coming to believe that Data Subscriptions is the best bet except for a few caveats: 

Out of the items you listed, only subscriptions and streams will actually give you the full data set.
 

- You can't use your own id, which means you have to save the subscription id to a document, which raises its own failure scenario issues.


This is by design; otherwise you run into a lot of edge cases with two clients reading from the same subscription, or stealing it from one another.
 
- They don't work well in a fail over scenario between primary and secondary. I think raft will address this problem.


Probably not, actually. As it currently stands, 4.0 is going to have separate etags for each server.
The current plan is to use raft to coordinate the cluster, and to move replication to a more gossip-like protocol to support a higher number of interconnected nodes in a cluster.
I would like to discuss a good solution for this, but it is actually a really hard problem, and probably deserves a separate thread.
 
I use Stream fairly regularly. Invariably, I end up bringing the stream into memory [ToList, et al] so as not to get caught in the "reading too slow" trap. That's especially annoying when the bottleneck is writing back to the db. Of course, bringing the stream into memory defeats the purpose of "streaming" in the first place. Then you have to be careful about not loading "too much". Both the "reading too slow" timeout and "read all" workaround introduce non-deterministic errors into the system. User activity around document creation can create these issues long after the software is released.

We fixed a bunch of issues around that in bulk insert + streaming. In particular, data subscriptions are no longer susceptible to "processing time takes too long" issues.
 

I should add here that Scripted Patch is sometimes an alternative. That's equally hard to reason about. 
- If multiple patches are started, will the operations interleave?

Yes
 
- Can a patch timeout on the server?

No
 
- If so, is the work that's been done in a transaction?

Patches run in multiple separate transaction batches.
 
- If not, how would one determine where to "restart"...assuming one even knew that failure occurred?


You can look at the operation stats you get back.

And you can't restart; you have to run it again.
 
Right now it's kind of "fire and pray". Stream and Data Subscriptions at least give you reliable semantics around partial success.

By consolidation, I wonder if Stream could be replaced by a non-persistent Data Subscription. Do we even need LoadStartingWith considering the overloads for Stream/Data Subscriptions?

LoadStartingWith is actually quite common in scenarios such as "give me this month's transactions": prefix: "accounts/1234/txs/2016-01"
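That prefix scenario in code (the Transaction type and key scheme are from the example above):

```csharp
// All of this month's transactions for one account, by document key prefix:
Transaction[] txs = session.Advanced.LoadStartingWith<Transaction>(
    "accounts/1234/txs/2016-01");
```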

Subscriptions are for getting all the documents matching a particular (relatively simple) criteria.
This is an ongoing effort, which under write load may never end. 

Streams take the full result of a query and return it. This can be a map/reduce output, all documents matching a complex query, etc.

"Stream me all the credit card transactions made within a 7 mile radius of the bank robbery" is not something that you can do in a subscription.
 
Changes is fine as Data Subscription support, but does it still have sufficient value as a front line api element? Deep paging [Skip(12300).Take(100)] doesn't work very well [slow, iirc] and is kind of a silly UI concept. Take(int.MaxValue) has the same problem [memory pressure] as Stream().ToList().


I don't understand the last statement.
Changes() are a way to get notifications from the database about what is happening right now.

 
Right now, replication is fraught with "bad choices" for failure cases. Hopefully raft [3.5?] alleviates that issue.


When the Primary goes down, how can your code reason about the value of the Secondary? You can allow writes, and then deal with conflict resolution. You can allow reads, but risk bewildered users who "swear I just changed that". Even reads during normal operations risk interleaved results if you happen to bounce between nodes [two web servers pinned to opposite dbs]. I think raft will address those issues because the write is either on the majority or not. The next piece would be validating that a read is from a server "at least as up to date" as the end client.


That isn't what we're doing. We are using Raft to select the leader, and that is the server that will accept writes. There is still a small chance that a leader being deposed will accept a write, but that would be replicated to its siblings.
The leader can move between nodes on the fly.
 
What do you mean?
 
Minor nit - it'd be nice to be able to pass null to session.Load and get null back instead of throwing an exception. I usually end up having to "check null twice" in some way.
Any thoughts on this one?
 

Yes.
 
Another candidate for removal: is SIR pulling it's weight as a feature? I like the idea, but again, it's possibly supplanted by data subscriptions and a tiny amount of custom c#.


I think that it is too complex to be really useful in many scenarios.
We might just replace it with an option to say "persist this map/reduce index results as documents".

Kijana Woodard

Feb 9, 2016, 3:50:57 PM2/9/16
to rav...@googlegroups.com
On Tue, Feb 9, 2016 at 1:25 PM, Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:
 

What do you mean? You want to store this as a document like hilo?
 

Yeah. I've settled on creating documents manually that hold identities. Makes them easier to manage, export/import, exclude, patch, etc.


Is this still the case, since smuggler works with identities? And you can modify them in the studio?
There are good reasons why we want to keep them out of documents. To start with, they aren't documents, really.

Yes. We ran into an issue the other day where you get all the identities when you export and you can't filter them out. We nearly overwrote the identities, but the export size was suspicious [single collection]. We handled it by editing the raven dump, but... it's something else you need to remember operationally.
 
 
- Drop sql replication and instead provide a gist demonstrating how to do it with Data Subscriptions.

Why?
That means that every user will have to write their own replication system, which is decidedly non trivial, even if subscriptions handles a lot of it.

When there are issues, and there are many on the mailing list, you have to wait for a new release and for that to become stable, or patch it yourself. Given reliable data subscriptions, maintaining the sql statements in c# seems preferable to shipping js. Is there "enough left over" to justify a supported feature or is there "so much variety" of what people want to do, it's better to simply demonstrate a few ways via gist / sample app and let people glue the data subscription to the external store of choice. For instance, write to elastic search or to files or whatever.
 

Absolutely disagreeing with you here. For several reasons.
One, the SQL Replication is a major feature in the way people consider RavenDB. 
Second, saying "here is how you can do that" because we have reliable subscriptions seems very much like this mindset:
Inline image 1

Third, I don't think we are seeing very many SQL Replication issues in the past year or so. We had a lot with index replication, but SQL Replication has been pretty great & stable.
 
I concede this point. The strongest argument for me is that it's a compelling selling point to say, "Need your data in an RDBMS? Turn on SQL Replication."

It'd be nice to have an example of "doing sql replication with data subscriptions" to stave off feature requests for edge cases or if someone wanted to write to another kind of store.

 
- Consider some consolidation between map/reduce and dynamic aggregation. Perhaps Data Subscriptions would be better for some cases.

Not following.
It's not clear today how to achieve multi-level reduce. Advice is to do one level of reduce and then dynamic aggregation. That's more art than science. If multi reduce is not [easily] achievable, then maybe data subscriptions provides a clearer path forward. Fwiw, having reduce results as documents might be interesting here. The documents are mapped and possibly reduced [re-reduced]. Almost sounds like SIR at that point [see new entry below].
 

We already have that in SIR, no?
 
I think we'll cover this below.
- Using a transformer within a transformer is awkward. 
- Trying to session.Load/Query to "a different model" is awkward. Sometimes I just want to coerce the data into a different model and ignore the class in metadata. It works surprisingly well if the metadata model is not in the project.
- Where you can and can't assert generic parameters in session.Include/Query is confusing.
- The order of generic parameters is confusing between Queries and Transformers. I always expect Transformers parameters to be swapped.
Do all of these make sense? 


Yes. That is an API issue. I wish we had better overall way to handle such complex things in the first place, to be honest.
 
I'd like some consolidation between Stream, Data Subscriptions, Changes, LoadStartingWith, increased Take limits + Take(int.MaxValue), and deep paging. They all kinda sorta cover the same ground, but with quirks that make each of them difficult to work with deterministically.


Can you expand on that?
 
Ultimately, all of these options represent ways to bypass the "safe by default" 128 - 1024 Load/Query restriction. Fwiw, I'm happy with that restriction.

But when you decide you need to "go through all the results", there are several options that can accomplish that. They all have trade offs that appear to be dead ends. I'm coming to believe that Data Subscriptions is the best bet except for a few caveats: 

Out of the items you listed, only subscriptions and streams will actually give you the full data set.
 
IIRC, LoadStartingWith still gives everything. But you could also write a loop and page. That's probably not a great idea given stream and data subscriptions.
 

- You can't use your own id, which means you have to save the subscription id to a document, which raises its own failure scenario issues.


This is by design; otherwise you run into a lot of edge cases with two clients reading from the same subscription, or stealing it from one another.
 
That pushes the problem to the user. I need to experiment more here, but it seems like one would have to try to save a document with a given name, then create the subscription, then save the subscription id in the document... and address the failures that can happen along the way. For instance, if I fail to save the subscription id, how do I clean up the stranded subscription? We can query the list of subscriptions, but how can an arbitrary piece of code know which one is "no good"?
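The bookkeeping dance being described might look like this; the tracker document name, SubscriptionTracker type, and its property are hypothetical:

```csharp
var tracker = session.Load<SubscriptionTracker>("subscriptions/orders");
if (tracker == null)
{
    // Two separate, non-atomic operations: a crash between Create and
    // SaveChanges strands a subscription nothing points to.
    var subId = store.Subscriptions.Create(new SubscriptionCriteria<Order>());
    session.Store(new SubscriptionTracker { SubscriptionId = subId },
                  "subscriptions/orders");
    session.SaveChanges();
}
```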
 
- They don't work well in a fail over scenario between primary and secondary. I think raft will address this problem.


Probably not, actually. As it currently stands, 4.0 is going to have separate etags for each server.
The current plan is to use raft to coordinate the cluster, and to move replication to a more gossip-like protocol to support a higher number of interconnected nodes in a cluster.
I would like to discuss a good solution for this, but it is actually a really hard problem, and probably deserves a separate thread.
 
I use Stream fairly regularly. Invariably, I end up bringing the stream into memory [ToList, et al] so as not to get caught in the "reading too slow" trap. That's especially annoying when the bottleneck is writing back to the db. Of course, bringing the stream into memory defeats the purpose of "streaming" in the first place. Then you have to be careful about not loading "too much". Both the "reading too slow" timeout and "read all" workaround introduce non-deterministic errors into the system. User activity around document creation can create these issues long after the software is released.

We fixed a bunch of issues around that in bulk insert + streaming. In particular, data subscriptions are no longer susceptible to "processing time takes too long" issues.
 
Nice. I'll try again with 3.5. 
 

I should add here that Scripted Patch is sometimes an alternative. That's equally hard to reason about. 
- If multiple patches are started, will the operations interleave?

Yes
 
- Can a patch timeout on the server?

No
 
- If so, is the work that's been done in a transaction?

Patches run in multiple separate transaction batches.
 
- If not, how would one determine where to "restart"...assuming one even knew that failure occurred?


You can look at the operation stats you get back.

And you can't restart; you have to run it again.

Right. With patches, developers need to make them idempotent and resilient to changes by other code doing concurrent patches. I don't know if people think about it at that level. I think there's an [incorrect] perception that these are going to run serially.
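An example of writing such a patch defensively (index, query, and field names are illustrative): since batches from concurrent patches can interleave and a failed run can only be re-run from scratch, the script guards its own work so re-running is safe:

```csharp
store.DatabaseCommands.UpdateByIndex(
    "Orders/ByStatus",
    new IndexQuery { Query = "Status:Pending" },
    new ScriptedPatchRequest
    {
        // Idempotent: a re-run skips documents already handled,
        // so partial failure followed by "run it again" is harmless.
        Script = @"if (this.Reviewed !== true) { this.Reviewed = true; }"
    });
```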
 
Right now it's kind of "fire and pray". Stream and Data Subscriptions at least give you reliable semantics around partial success.

By consolidation, I wonder if Stream could be replaced by a non-persistent Data Subscription. Do we even need LoadStartingWith considering the overloads for Stream/Data Subscriptions?

LoadStartingWith is actually quite common in scenarios such as "give me this month's transactions": prefix: "accounts/1234/txs/2016-01"

Subscriptions are for getting all the documents matching a particular (relatively simple) criteria.
This is an ongoing effort, which under write load may never end. 

Streams take the full result of a query and return it. This can be a map/reduce output, all documents matching a complex query, etc.

"Stream me all the credit card made within 7 miles radius of the bank robbery" is not somtehing that you can do in a subscription.
Stream has a StartsWith parameter. Why not drop LoadStartingWith? We just reported an issue with it that was fixed 30037. Not saying it's not useful [I happen to really like it], but does it pull its weight when there's another way to do the same thing and more [api surface consolidation]?
 
Changes is fine as Data Subscription support, but does it still have sufficient value as a front line api element? Deep paging [Skip(12300).Take(100)] doesn't work very well [slow, iirc] and is kind of a silly UI concept. Take(int.MaxValue) has the same problem [memory pressure] as Stream().ToList().


I don't understand the last statement.
Changes() are a way to get notifications from the database about what is happening right now.
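A minimal sketch of that usage (3.x client, from memory):

```csharp
// Live notifications only -- nothing is persisted, so anything that
// happens while the connection is down is simply missed.
var changes = store.Changes()
    .ForDocumentsStartingWith("orders/")
    .Subscribe(change => Console.WriteLine("{0}: {1}", change.Type, change.Id));
// later: changes.Dispose();
```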
 
Right. I'm thinking of Changes as "non-persistent Data Subscriptions". Could "do not persist" be a subscription option?  


 
Right now, replication is fraught with "bad choices" for failure cases. Hopefully raft [3.5?] alleviates that issue.


When the Primary goes down, how can your code reason about the value of the Secondary? You can allow writes, and then deal with conflict resolution. You can allow reads, but risk bewildered users who "swear I just changed that". Even reads during normal operations risk interleaved results if you happen to bounce between nodes [two web servers pinned to opposite dbs]. I think raft will address those issues because the write is either on the majority or not. The next piece would be validating that a read is from a server "at least as up to date" as the end client.


That isn't what we're doing. We are using Raft to select the leader, and that is the server that will accept writes. There is still a small chance that a leader being deposed will accept a write, but that would be replicated to its siblings. 
The leader can move between nodes on the fly.
 
What do you mean?
 
Minor nit: it'd be nice to be able to pass null to session.Load and get null back instead of throwing an exception. I usually end up having to "check null twice" in some way.
Any thoughts on this one?
 

Yes.
 
Another candidate for removal: is SIR [Scripted Index Results] pulling its weight as a feature? I like the idea, but again, it's possibly supplanted by data subscriptions and a tiny amount of custom c#.


I think that it is too complex to be really useful in many scenarios.
We might just replace it with an option to say "persist this map/reduce index results as documents".
 
I think that would be nice. It seems that with "no extra work", one could write a map/reduce on those persisted result documents and now you have unlimited reduce [endless loops aside]. Rollup by city/state/country/continent is straightforward to implement and you don't have to worry about "too much data" clogging up dynamic aggregation.
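To illustrate the rollup idea, a sketch of the second-level map/reduce, assuming the city-level reduce results were persisted as SalesByCity documents (all names here are hypothetical):

```csharp
public class Sales_ByState : AbstractIndexCreationTask<SalesByCity, Sales_ByState.Result>
{
    public class Result
    {
        public string State;
        public decimal Total;
    }

    public Sales_ByState()
    {
        // Map over the *persisted* city-level reduce output...
        Map = cities => from c in cities
                        select new Result { State = c.State, Total = c.Total };
        // ...and reduce again, one level up.
        Reduce = results => from r in results
                            group r by r.State into g
                            select new Result { State = g.Key, Total = g.Sum(x => x.Total) };
    }
}
```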


One additional one: I think Sharding is poorly understood and/or under-utilized. I think it should be more popular than it is.

Oren Eini (Ayende Rahien)

unread,
Feb 9, 2016, 5:30:44 PM2/9/16
to ravendb
inline


It'd be nice to have an example of "doing sql replication with data subscriptions" to stave off feature requests for edge cases or if someone wanted to write to another kind of store. 

What do you need more than getting the json and outputting the SQL?

 
Out of the items you listed, only subscriptions and streams will actually give you the full data set.
 
IIRC, LoadStartingWith still gives everything. But you could also write a loop and page. That's probably not a great idea given stream and data subscriptions.
 

That isn't correct. It is paged like everything else.
 

- You can't use your own id which means you have to save the subscription id to a document which raises its own failure scenario issues.


This is by design, otherwise you run into a lot of edge cases with two clients reading from the same subscription, or stealing it off one another.
 
That pushes the problem to the user. I need to experiment more here, but it seems like one would have to try to save a document with a given name, then create the subscription, then save the subscription id in the document....and address failures that can happen along the way. For instance, I failed to save the subscription id, how do I clean up the stranded subscription? We can query the list of subscriptions, but how can an arbitrary piece of code know which one is "no good"?
 

You can query the subscriptions, yes. And a subscription that isn't opened doesn't actually use any resources whatsoever.
So "leaking" a subscription has no cost.

Note that the whole idea is that you'll use subscriptions for very long running tasks, such as getting documents from a database as they change over months and years.
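For reference, the create/open flow being discussed looks roughly like this in the 3.5 client (from memory — exact names may differ):

```csharp
// Create once and persist the returned id somewhere durable...
var id = store.Subscriptions.Create(new SubscriptionCriteria
{
    KeyStartsWith = "orders/"
});

// ...then (re)open it on startup. The server remembers the last
// acknowledged batch, so processing resumes where it left off.
var subscription = store.Subscriptions.Open<Order>(
    id, new SubscriptionConnectionOptions());
subscription.Subscribe(order => Process(order));
```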

 
And you can't restart; you have to run it again.

Right. With patches, developers need to make them idempotent and make them resilient to changes by other code doing concurrent patches. I don't know if people think about it at that level. I think there's a [incorrect] perception that these are going to run serially.
 

They _are_ going to run serially. In the sense that each patch is applied independently.
It is just that if you have multiple UpdateByIndex operations running, they will run concurrently and can interleave (on different documents).


"Stream me all the credit card made within 7 miles radius of the bank robbery" is not somtehing that you can do in a subscription.
Stream has a StartsWith parameter. Why not drop LoadStartingWith? We just reported an issue with it that was fixed 30037. Not saying it's not useful [I happen to really like it], but does it pull its weight when there's another way to do the same thing and more [api surface consolidation]?
 

LoadStartingWith is not a streaming / unlimited API.
It loads the documents into the session identity map, and they get change tracking.
 
Changes is fine as Data Subscription support, but does it still have sufficient value as a front line api element? Deep paging [Skip(12300).Take(100)] doesn't work very well [slow, iirc] and is kind of a silly UI concept. Take(int.MaxValue) has the same problem [memory pressure] as Stream().ToList().


I don't understand the last statement.
Changes() are a way to get notifications from the database about what is happening right now.
 
Right. I'm thinking of Changes as "non-persistent Data Subscriptions". Could "do not persist" be a subscription option?  

What is the scenario that you are trying to enable?

 
I think that it is too complex to be really useful in many scenarios.
We might just replace it with an option to say "persist this map/reduce index results as documents".
 
I think that would be nice. It seems that with "no extra work", one could write a map/reduce on those persisted result documents and now you have unlimited reduce [endless loops aside]. Rollup by city/state/country/continent is straightforward to implement and you don't have to worry about "too much data" clogging up dynamic aggregation.


One additional one: I think Sharding is poorly understood and/or under-utilized. I think it should be more popular than it is.
 

Sharding probably requires us to do it completely on the server side with dynamic scale up & down.
That isn't a simple problem, and we aren't going to address it in 4.0 in any big way right now. 

Oren Eini (Ayende Rahien)

unread,
Feb 9, 2016, 5:40:54 PM2/9/16
to ravendb

Kijana Woodard

unread,
Feb 9, 2016, 5:56:28 PM2/9/16
to rav...@googlegroups.com
On Tue, Feb 9, 2016 at 4:30 PM, Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:
inline


It'd be nice to have an example of "doing sql replication with data subscriptions" to stave off feature requests for edge cases or if someone wanted to write to another kind of store. 

What do you need more than getting the json and outputting the SQL?
 
That simplicity is why I suggested killing the feature in the first place. But as I said, I concede it's useful for marketing, convincing other stakeholders, etc. I think it'd be a nice article that would get people comfortable with raven in general and Data Subscriptions in particular.

 
Out of the items you listed, only subscriptions and streams will actually give you the full data set.
 
IIRC, LoadStartingWith still gives everything. But you could also write a loop and page. That's probably not a great idea given stream and data subscriptions.
 

That isn't correct. It is paged like everything else.
 
Ok. Haven't tried it in a while with something that would break a page barrier, but iirc, you can just say pageSize: 5000 [or whatever]. I don't think it's limited to 1024.
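A sketch of what that looks like with explicit paging, so the loop works regardless of whatever the server-side cap turns out to be (parameter names as I recall them from the 3.x client):

```csharp
// Page through a prefix instead of assuming one call returns everything.
var start = 0;
const int pageSize = 1024;
while (true)
{
    var batch = session.Advanced.LoadStartingWith<Order>(
        "foo/", start: start, pageSize: pageSize);
    if (batch.Length == 0)
        break;
    foreach (var doc in batch)
        Process(doc);
    start += batch.Length;
}
```

(In a real app you'd likely open a fresh session per page so the identity map doesn't grow without bound.)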
 

- You can't use your own id which means you have to save the subscription id to a document which raises its own failure scenario issues.


This is by design, otherwise you run into a lot of edge cases with two clients reading from the same subscription, or stealing it off one another.
 
That pushes the problem to the user. I need to experiment more here, but it seems like one would have to try to save a document with a given name, then create the subscription, then save the subscription id in the document....and address failures that can happen along the way. For instance, I failed to save the subscription id, how do I clean up the stranded subscription? We can query the list of subscriptions, but how can an arbitrary piece of code know which one is "no good"?
 

You can query the subscriptions, yes. And a subscription that isn't opened doesn't actually use any resources whatsoever.
So "leaking" a subscription has no cost.

Note that the whole idea is that you'll use subscriptions for very long running tasks, such as getting documents from a database as they change over months and years.

 
And you can't restart; you have to run it again.

Right. With patches, developers need to make them idempotent and make them resilient to changes by other code doing concurrent patches. I don't know if people think about it at that level. I think there's a [incorrect] perception that these are going to run serially.
 

They _are_ going to run serially. In the sense that each patch is applied independently.
It is just that if you have multiple UpdateByIndex operations running, they will run concurrently and can interleave (on different documents).
 
Right. Serial per document, but a developer needs to reason about the impact of concurrent patches across the set. I'm not sure this is well understood in the community.


"Stream me all the credit card made within 7 miles radius of the bank robbery" is not somtehing that you can do in a subscription.
Stream has a StartsWith parameter. Why not drop LoadStartingWith? We just reported an issue with it that was fixed 30037. Not saying it's not useful [I happen to really like it], but does it pull its weight when there's another way to do the same thing and more [api surface consolidation]?
 

LoadStartingWith is not a streaming / unlimited API.
It loads the documents into the session identity map, and they get change tracking.
 
Changes is fine as Data Subscription support, but does it still have sufficient value as a front line api element? Deep paging [Skip(12300).Take(100)] doesn't work very well [slow, iirc] and is kind of a silly UI concept. Take(int.MaxValue) has the same problem [memory pressure] as Stream().ToList().


I don't understand the last statement.
Changes() are a way to get notifications from the database about what is happening right now.
 
Right. I'm thinking of Changes as "non-persistent Data Subscriptions". Could "do not persist" be a subscription option?  

What is the scenario that you are trying to enable?

None in particular. I'm trying to reduce the API surface area to make it more understandable. It seems there are several api methods that differ only in subtle ways. That leads to confusion about what is the right choice for a given scenario. Another approach could be making those scenarios explicit options of one api. 

Even the above about LoadStartingWith putting docs in the session identity map, it's straightforward to put Stream docs into the session if needed. 

Another difference is LoadStartingWith is ACID whereas the Stream startsWith parameter is not. Subtle. Confusing.

 
I think that it is too complex to be really useful in many scenarios.
We might just replace it with an option to say "persist this map/reduce index results as documents".
 
I think that would be nice. It seems that with "no extra work", one could write a map/reduce on those persisted result documents and now you have unlimited reduce [endless loops aside]. Rollup by city/state/country/continent is straightforward to implement and you don't have to worry about "too much data" clogging up dynamic aggregation.


One additional one: I think Sharding is poorly understood and/or under-utilized. I think it should be more popular than it is.
 

Sharding probably requires us to do it completely on the server side with dynamic scale up & down.
That isn't a simple problem, and we aren't going to address it in 4.0 in any big way right now. 
 
Makes sense. I think there are "immediately achievable use cases" that aren't as popular as they should be. One example from recent forum activity would be Orders by Month. Each month gets a separate db and data growth is contained. 

Not sure if the lack of popularity is "an api problem" or there just needs to be more blogs / samples to give people more inspiration about what's possible.

Bruno Lopes

unread,
Feb 9, 2016, 6:06:08 PM2/9/16
to RavenDB - 2nd generation document database
Kijana, just wanted to chime in on Scripted Index Results and SQL Replication as happy customers.

We actually use both and are happy with how they work. SQL Replication gives us a very cheap way to just shuttle data off to a sql database for analytics. I hadn't looked at data subscriptions, but from the docs (https://ravendb.net/docs/article-page/3.0/csharp/client-api/data-subscriptions/what-are-data-subscriptions) it looks like I'd have to have a service running, monitoring for changes, right?
In our case, we install one "instance" of the web app per tenant, so we'd probably also need a service per tenant. And it would be another app running in development.
YMMV, but we're happy with it, and I don't think it's a case of "demoware feature".

We also don't see much pain in scripted index results, and being able to do math in javascript helps a bit.

Perhaps we're just comfortable enough with js for this to not be a pain? Or we're not pushing it where it hurts you?

Tobi

unread,
Feb 9, 2016, 6:11:03 PM2/9/16
to rav...@googlegroups.com
On 09.02.2016 12:39, Oren Eini (Ayende Rahien) wrote:

> What are the things that you regret having in RavenDB? Pitfalls, issues,
> confusing, etc?

- The session interface should not allow to do non-session related
operations via session.Advanced*

- Too much code between embedded client and embedded server, making the
pure embedded use case unnecessarily slow.

- I actually liked the attachments :-)

Tobias

Kijana Woodard

unread,
Feb 9, 2016, 6:27:03 PM2/9/16
to rav...@googlegroups.com
@Bruno

Most of my suggestions were not really meant as knocks on any particular feature, but more as "how can we tighten up the api and make it more discoverable / intuitive".

Agreed on SQL Replication. I don't think it's "demo ware" and concede it's a net positive to remain in the product.

It's been a while since I tried SIR. IIRC, the versioning/deployment story was awkward, and when things didn't work, it was hard to figure out why. Again, not a knock against it, but rather a question of whether one could achieve the same goals with other features. Is it providing "enough" value to keep it around given the weight of maintaining it and the uptake in its usage? Clearly, you'd say yes. I can't imagine having more than a couple of SIR scripts in an app, whereas it's easy to imagine dozens of data subscriptions.

Fwiw, you could run a data subscription in your web app. The api takes care of making sure only one "subscriber" is live at any given time.

Bruno Lopes

unread,
Feb 9, 2016, 6:43:12 PM2/9/16
to ravendb
Right, makes sense. I just wanted to add a positive note to try and keep the feature ;)

Hm, deployment of SIR for us is kinda like indexes. On startup, ensure the script is like this. If we change the script in a major way, add a migration to reset the index to re-run it. I fear replacing it with "persist this map/reduce result as a document" might not be enough for us, because one of the use cases is to update some stats on documents that need to appear in order-bys or filters (no, I don't think it would be possible with a simple map/reduce). So it's a partial update.

I haven't had any use case for data subscriptions, but perhaps it's due to our requirements or the way we built the software, not sure.

Since I've got your attention, in case the web app goes away due to recycling or idling, would I have to manage the "last handled doc", or is it handled by the server via an id we provide?


Kijana Woodard

unread,
Feb 9, 2016, 7:01:15 PM2/9/16
to rav...@googlegroups.com
I haven't used data subscriptions enough, but from memory, the docs indicate that you can subscribe in a way that one of the "survivors" will pick up the subscription and it'll start after the last acknowledged batch.

Fwiw, I think data subscriptions would be a better replacement for SIR than "map/reduce to document". SIR / data subscriptions allow you to update "other documents".

Did you write the "update/migrate SIR script" yourself or is it part of the API?

Bruno Lopes

unread,
Feb 9, 2016, 7:21:08 PM2/9/16
to ravendb
Hm. You're right. Reading it a bit more, it seems to be as you say.

How would you do aggregations and whatnot with data subscriptions? Stuff like "map/reduce votes on entity id, count up, down, update "up" and "down" on entity"? With current SIRs, since they're based on map/reduce indexes, I already have all the math correctly done. Data subscriptions are document-centric, not index-centric.

A SIR is just a document, so all I do is store them when the app starts up.
For migrations we use a modified version of RavenMigrations (I've been meaning to split up our changes, but haven't gotten around to it). In this case, we have a base class for "reset an index". This was also needed in some other cases, and I don't think we needed it all that often.

We also have some code to load the javascript from an embedded resource (it's rather trivial, but makes the pattern very readable). Makes it easy to just have a .js file in the project and reference it, instead of managing a long c# string with javascript in it.
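That pattern is simple enough to sketch (the resource name below is made up; the .js file just needs its Build Action set to Embedded Resource):

```csharp
// Read a script that ships as an embedded resource in this assembly.
static string LoadScript(string resourceName)
{
    var assembly = Assembly.GetExecutingAssembly();
    using (var stream = assembly.GetManifestResourceStream(resourceName))
    using (var reader = new StreamReader(stream))
        return reader.ReadToEnd();
}

// usage: var script = LoadScript("MyApp.Scripts.UpdateVoteTotals.js");
```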




Oren Eini (Ayende Rahien)

unread,
Feb 10, 2016, 1:35:11 AM2/10/16
to ravendb
 
What do you need more than getting the json and outputting the SQL?
 
That simplicity is why I suggested killing the feature in the first place. But as I said, I concede it's useful for marketing, convincing other stakeholders, etc. I think it'd be a nice article that would get people comfortable with raven in general and Data Subscriptions in particular.


Consider what this means from an ops perspective. 
With SQL Replication, you just deploy ravendb, and your ops team can manage replication, change it, modify it, track it, monitor it, the works.
With Subscriptions, you have to do all of that yourself, deploy an additional endpoint, and any changes have to go through a dev cycle.
 
 
Out of the items you listed, only subscriptions and streams will actually give you the full data set.
 
IIRC, LoadStartingWith still gives everything. But you could also write a loop and page. That's probably not a great idea given stream and data subscriptions.
 

That isn't correct. It is paged like everything else.
 
Ok. Haven't tried it in a while with something that would break a page barrier, but iirc, you can just say pageSize: 5000 [or whatever]. I don't think it's limited to 1024.
 

Yes, it is limited to 1024. 
 

 
What is the scenario that you are trying to enable?

None in particular. I'm trying to reduce the API surface area to make it more understandable. 

That is why we are having this discussion, yes. I feel there is some cruft in the API and I want to take the time in a point release to clear it.

It seems there are several api methods that differ only in subtle ways. That leads to confusion about what is the right choice for a given scenario. Another approach could be making those scenarios explicit options of one api. 

For changes & subscriptions, I'm not really sure that those are subtle differences.

Even the above about LoadStartingWith putting docs in the session identity map, it's straightforward to put Stream docs into the session if needed. 

Not really, no. You _can_ do that, but you wouldn't do that in any of the common scenarios involving streaming and large objects.
The streaming API is _intentionally_ awkward to consume into memory, remember. Very different usages.


Another difference is LoadStartingWith is ACID whereas the the Stream startsWith parameter is not. Subtle. Confusing.

Stream is ACID. No change.



Sharding probably requires us to do it completely on the server side with dynamic scale up & down.
That isn't a simple problem, and we aren't going to address it in 4.0 in any big way right now. 
 
Makes sense. I think there are "immediately achievable use cases" that aren't as popular as they should be. One example from recent forum activity would be Orders by Month. Each month gets a separate db and data growth is contained. 

But that is really trivial to do in RavenDB.
.ShardOn<Order>(x => x.OrderDate.Year + "-" + x.OrderDate.Month);
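Expanded into a fuller sketch (3.x shard API from memory — the shipped method is ShardingOn, iirc, and the shard names/urls here are invented):

```csharp
// One store per month-shard; the sharding function routes Orders by month.
var shards = new Dictionary<string, IDocumentStore>
{
    { "2016-1", new DocumentStore { Url = "http://raven-1:8080" } },
    { "2016-2", new DocumentStore { Url = "http://raven-2:8080" } },
};

var strategy = new ShardStrategy(shards)
    .ShardingOn<Order>(x => x.OrderDate.Year + "-" + x.OrderDate.Month);

var store = new ShardedDocumentStore(strategy).Initialize();
```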

Oren Eini (Ayende Rahien)

unread,
Feb 10, 2016, 1:37:02 AM2/10/16
to ravendb
We are dropping the distinction between embedded and standard code.
They'll both use the same exact mechanisms.

What do you mean, non session related?



Tobi

unread,
Feb 10, 2016, 4:45:44 AM2/10/16
to rav...@googlegroups.com
On 10.02.2016 07:36, Oren Eini (Ayende Rahien) wrote:

> We are dropping the distinction between embedded and standard code.
> They'll both use the same exact mechanisms.

What exactly does that mean? Will embedded then even need to connect to
localhost?


> What do you mean, non session related?

Things like session.Advanced.Stream<>, session.Advanced.DocumentStore.*.
Both allow you to get entities not tracked by the session, so they are not
exactly related to the session.

Tobias


Oren Eini (Ayende Rahien)

unread,
Feb 10, 2016, 6:51:00 AM2/10/16
to ravendb
inline



On Wed, Feb 10, 2016 at 11:45 AM, Tobi <lista...@e-tobi.net> wrote:
On 10.02.2016 07:36, Oren Eini (Ayende Rahien) wrote:

> We are dropping the distinction between embedded and standard code.
> They'll both use the same exact mechanisms.

What exactly does that mean? Will embedded then even need to connect to
localhost?


Yes, it will connect to localhost, to the same process.
 

> What do you mean, non session related?

Things like session.Advanced.Stream<>, session.Advanced.DocumentStore.*.
Both allow you to get entities not tracked by the session, so they are not
exactly related to the session.



Kijana Woodard

unread,
Feb 10, 2016, 7:00:47 AM2/10/16
to rav...@googlegroups.com
Please retain some mechanism to get from a session to the non-session methods. I've been caught in corners where those "references" were handy. While session.Advanced.Stream<> doesn't track entities in the session, it does set the database for the Stream.

Bruno Lopes

unread,
Feb 10, 2016, 7:02:12 AM2/10/16
to ravendb
+1


Kijana Woodard

unread,
Feb 10, 2016, 8:23:08 AM2/10/16
to rav...@googlegroups.com
On Wed, Feb 10, 2016 at 12:34 AM, Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:
 
What do you need more than getting the json and outputting the SQL?
 
That simplicity is why I suggested killing the feature in the first place. But as I said, I concede it's useful for marketing, convincing other stakeholders, etc. I think it'd be a nice article that would get people comfortable with raven in general and Data Subscriptions in particular.


Consider what this means from an ops perspective. 
With SQL Replication, you just deploy ravendb, and your ops team can manage replication, change it, modify it, track it, monitor it, the works.
With Subscriptions, you have to do all of that yourself, deploy an additional endpoint, and any changes have to go through a dev cycle.
 
I accept all of that. I'll yield this point here.
 
 
Out of the items you listed, only subscriptions and streams will actually give you the full data set.
 
IIRC, LoadStartingWith still gives everything. But you could also write a loop and page. That's probably not a great idea given stream and data subscriptions.
 

That isn't correct. It is paged like everything else.
 
Ok. Haven't tried it in a while with something that would break a page barrier, but iirc, you can just say pageSize: 5000 [or whatever]. I don't think it's limited to 1024.
 

Yes, it is limited to 1024. 
 

 
What is the scenario that you are trying to enable?

None in particular. I'm trying to reduce the API surface area to make it more understandable. 

That is why we are having this discussion, yes. I feel there is some cruft in the API and I want to take the time in a point release to clear it.
 
 Yup. So I'm probing for consolidation opportunities.

It seems there are several api methods that differ only in subtle ways. That leads to confusion about what is the right choice for a given scenario. Another approach could be making those scenarios explicit options of one api. 

For changes & subscriptions, I'm not really sure that those are subtle differences.

Response below. 

Even the above about LoadStartingWith putting docs in the session identity map, it's straightforward to put Stream docs into the session if needed. 

Not really, no. You _can_ do that, but you wouldn't do that in any of the common scenarios involving streaming and large objects.
The streaming API is _intentionally_ awkward to consume into memory, remember. Very different usages.

I see. I wrote an extension method which allows me to process them "normally", so I've hidden the awkward parts by converting the enumerator into an IEnumerable. Until recently, consuming them "too slowly" resulted in failure, so I either foreach or ToList the enumerable.
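The extension method in question is tiny; roughly:

```csharp
// Wrap the stream enumerator so callers can foreach over plain documents.
public static IEnumerable<T> AsDocuments<T>(this IEnumerator<StreamResult<T>> enumerator)
{
    while (enumerator.MoveNext())
        yield return enumerator.Current.Document;
}

// usage: foreach (var order in session.Advanced.Stream(query).AsDocuments()) { ... }
```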


Another difference is LoadStartingWith is ACID whereas the the Stream startsWith parameter is not. Subtle. Confusing.

Stream is ACID. No change.
 
Even more reason to remove LoadStartingWith. Stream(startWith: 'foo/').ToList();

I'm not sure why you consider this so different. 
Assume I have an enumerable from Stream and I want to limit for safety: Stream(startWith: 'foo/').Take(1024).ToList().

Given Stream startsWith is ACID, I honestly can't think of a reason why I would continue using LoadStartingWith. In every case I've used it, I know there is a well-defined and bounded set of matches, otherwise I'd query or stream anyway.

If I'm sure there will always be less than 128 results, Stream(startWith: 'foo/').ToList() will fit inside the internal paging of Stream.

If I'm not sure how many results there will be *and* I want to page, using Skip/Take on the Query or the Stream enumerable is a "nicer api" than passing in options to LoadStartingWith.

If I ignore the details and look at the various mechanisms at a high level.

1. Load gets a document(s) by id and is ACID.
2. LoadStartingWith gets documents by id prefix and is ACID.
3. Query, properly, has defaults and limits and is BASE. 
4. Stream enhances query to "yup I'm going to go through all of them, I know what I'm doing" and is BASE.
4b. Stream might also be ACID if you use startsWith.
5. Changes is "what's happening now".
6. Data Subscriptions is a persisted Changes/Stream for "what happened since I last checked and, once I catch up, what's happening now".

I'd propose the "public api" collapse to 3 or 4 with Load and Query being obvious "winners".




Sharding probably requires us to do it completely on the server side with dynamic scale up & down.
That isn't a simple problem, and we aren't going to address it in 4.0 in any big way right now. 
 
Makes sense. I think there are "immediately achievable use cases" that aren't as popular as they should be. One example from recent forum activity would be Orders by Month. Each month gets a separate db and data growth is contained. 

But that is really trivial to do in RavenDB.
.ShardOn<Order>(x=>x.OrderDate.Year +"-" + x.OrderDate.Month);

Yup. And yet my perception, which could easily be wrong, is very few people use Sharding. I'm not sure why.
In other words, I think the API that exists solves problems that people have, but they aren't attracted to this solution for some reason.

Oren Eini (Ayende Rahien)

unread,
Feb 10, 2016, 8:33:30 AM2/10/16
to ravendb

Stream is ACID. No change.
 
Even more reason to remove LoadStartingWith. Stream(startWith: 'foo/').ToList();


* OutOfMemoryException
* Request takes 30 seconds to complete.

 
I'm not sure why you consider this so different. 

Because they have very different semantics by definition.
The streaming is supposed to be just that, you are doing something with each item as they come by, you don't hold on to the potentially very large set.
 
Assume I have an enumerable from Stream and I want to limit for safety: Stream(startWith: 'foo/').Take(1024).ToList().

Safe by default, not unsafe by default.
 

Given Stream startsWith is ACID, I honestly can't think of a reason why I would continue using LoadStartingWith. In every case I've used it, I know there is a well-defined and bounded set of matches, otherwise I'd query or stream anyway.

You aren't the only user of RavenDB, however. And we do see people who need the API to guide them toward the appropriate solution.


1. Load gets a document(s) by id and is ACID.
2. LoadStartingWith gets documents by id prefix and is ACID.

Has limits

3. Query, properly, has defaults and limits and is BASE. 
4. Stream enhances query to "yup I'm going to go through all of them, I know what I'm doing" and is BASE.

Streams are identical to load if you are asking by prefix or by etag.
If you stream a query, same as query.


4b. Stream might also be ACID if you use startsWith.
5. Changes is "what's happening now".
6. Data Subscriptions is a persisted Changes/Stream for "what happened since I last checked and, once I catch up, what's happening now".

Correct

Kijana Woodard

unread,
Feb 10, 2016, 9:34:16 AM2/10/16
to rav...@googlegroups.com
On Wed, Feb 10, 2016 at 7:33 AM, Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:

Stream is ACID. No change.
 
Even more reason to remove LoadStartingWith. Stream(startWith: 'foo/').ToList();


* OutOfMemoryException
* Request takes 30 seconds to complete.

 
I'm not sure why you consider this so different. 

Because they have very different semantics by definition.
The streaming is supposed to be just that: you are doing something with each item as it comes by; you don't hold on to the potentially very large set.
 
Assume I have an enumerable from Stream and I want to limit for safety: Stream(startWith: 'foo/').Take(1024).ToList().

Safe by default, not unsafe by default.

Query is safe by default. Stream is a semantic shift to loosen those restrictions. Stream is *still* safe by default in that you're getting a page at a time in memory unless you write code to get around it [e.g. list.Add(enumerator.Current.Document)].

Here's the problem with "Safe by default": it's only safe from one point of view, the health of the db and the server. Assume a developer thinks LoadStartingWith("foo/") should always and forever return <128 results. Data growth unexpectedly leads to 129+ results. From a user's perspective, the app is broken. From the programmer's perspective, it's awkward [at best] to detect this situation and deal with it using LoadStartingWith.
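
For illustration, one awkward way to detect the overflow today is to probe for one more document than the assumed bound. This is a sketch: the Foo type is hypothetical, and the parameter names are from memory of the 3.x client:

```csharp
// Sketch: probing whether a prefix outgrew an assumed bound of 128 docs.
// Foo and the parameter names are assumptions, not verified API.
var docs = session.Advanced.LoadStartingWith<Foo>("foo/", start: 0, pageSize: 129);
if (docs.Length > 128)
{
    // The "always < 128 results" assumption no longer holds - fail loudly
    // instead of silently working with a truncated set.
    throw new InvalidOperationException("Prefix foo/ outgrew the assumed bound.");
}
```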

Various people have fought against the safe by default paradigm for Query and I disagree with that. I think Stream is enough of a "speed bump" to alert you that you're in more dangerous territory. The fact that you pass a Query to a Stream is brilliant in this regard. 

For Query, it's clear that you should setup paging or filtering to explore the data. If you're in a situation where you "want everything", Stream gives you that. If LoadStartingWith doesn't give you everything and has a sub-optimal paging mechanism, it's redundant.

Please note that LoadStartingWith has been one of my favorite APIs from my earliest usage of ravendb.
 

Given Stream startsWith is ACID, I honestly can't think of a reason why I would continue using LoadStartingWith. In every case I've used it, I know there is a well-defined and bounded set of matches, otherwise I'd query or stream anyway.

You aren't the only user of RavenDB, however. And we do see people who need the API to guide them toward the appropriate solution.
 
Agreed...which is why I think having disparate "top level" apis that do nearly the same thing is confusing. If it was an explicit StreamOption [or whatever], then it'd be clear the choice you're making without having to read the docs or make the semantic connection between Load and LoadStartingWith.


1. Load gets a document(s) by id and is ACID.
2. LoadStartingWith gets documents by id prefix and is ACID.

Has limits

3. Query, properly, has defaults and limits and is BASE. 
4. Stream enhances query to "yup I'm going to go through all of them, I know what I'm doing" and is BASE.

Streams are identical to load if you are asking by prefix or by etag.
If you stream a query, same as query.

Fwiw, it's awkward that you can't pass etag *and* prefix to Stream. I realize that adding that overload converges with Data Subscriptions...hence these suggestions. Further, looking at SubscriptionOptions and the options for Changes, I can see someone wanting to Stream over those as well - a non-persistent Subscription.


4b. Stream might also be ACID if you use startsWith.
5. Changes is "what's happening now".
6. Data Subscriptions is a persisted Changes/Stream for "what happened since I last checked and, once I catch up, what's happening now".

Correct

I'd propose the "public api" collapse to 3 or 4 with Load and Query being obvious "winners".



Oren Eini (Ayende Rahien)

unread,
Feb 10, 2016, 9:46:16 AM2/10/16
to ravendb

Query is safe by default. Stream is a semantic shift to loosen those restrictions. Stream is *still* safe by default in that you're getting a page at a time in memory unless you write code to get around it [e.g. list.Add(enumerator.Current.Document)].

Stream gets you all the data. And it certainly does that a page at a time.
 

Here's the problem with "Safe by default": it's only safe from one point of view, the health of the db and the server. Assume a developer thinks LoadStartingWith("foo/") should always and forever return <128 results. Data growth unexpectedly leads to 129+ results. From a user's perspective, the app is broken. From the programmer's perspective, it's awkward [at best] to detect this situation and deal with it using LoadStartingWith.

Better to have an explicit limit like that than to bring the entire system down.
 
Various people have fought against the safe by default paradigm for Query and I disagree with that. I think Stream is enough of a "speed bump" to alert you that you're in more dangerous territory. The fact that you pass a Query to a Stream is brilliant in this regard. 

Yes, that is entirely the point. You can do that to bypass that limitation explicitly. Note that when you do, you take on the onus of protecting yourself from those details.
That is why the API is the way it is.
 
For Query, it's clear that you should setup paging or filtering to explore the data. If you're in a situation where you "want everything", Stream gives you that. If LoadStartingWith doesn't give you everything and has a sub-optimal paging mechanism, it's redundant.

Except that it isn't. We use it quite often for the purpose it is meant for, and it works great for that.
 

Fwiw, it's awkward that you can't pass etag *and* prefix to Stream. I realize that adding that overload converges with Data Subscriptions...hence these suggestions. Further, looking at SubscriptionOptions and the options for Changes, I can see someone wanting to Stream over those as well - a non-persistent Subscription.

The reason you can't is that this would force us to scan the entire dataset from that etag and filter everything.
This can lead to a very long pause in some cases.
 

João Bragança

unread,
Feb 10, 2016, 11:05:35 AM2/10/16
to RavenDB - 2nd generation document database
Oh, one other thing: get rid of Raven.Imports.Newtonsoft.Json (or internalize it). If I understand the setup correctly, once the session serializes your data, the server just works with the JSON and doesn't know about your types (how could it, since the server won't have your assembly?). Instead, expose a Func<object, string> and a Func<string, Type, object> on configuration somewhere so that we may use whatever serialization mechanism we want. See https://github.com/damianh/Cedar.CommandHandling/blob/master/src/Cedar.CommandHandling.Http/Http/CommandHandlingSettings.cs#L30 for an example of this.
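
To make the proposal concrete, here is a purely hypothetical sketch of such an extension point. None of these properties exist in the current client; the names are made up to illustrate the suggestion:

```csharp
// Hypothetical sketch of the proposed serialization hooks - NOT a real API.
public class SerializationConventions
{
    // Entity -> JSON string, e.g. backed by Json.NET, Jil, whatever you prefer.
    public Func<object, string> Serialize { get; set; }

    // JSON string + target type -> entity.
    public Func<string, Type, object> Deserialize { get; set; }
}

// Usage sketch (the Conventions.Serialization property is imaginary):
// store.Conventions.Serialization = new SerializationConventions
// {
//     Serialize = obj => JsonConvert.SerializeObject(obj),
//     Deserialize = (json, type) => JsonConvert.DeserializeObject(json, type),
// };
```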

Kijana Woodard

unread,
Feb 10, 2016, 11:15:33 AM2/10/16
to rav...@googlegroups.com
On Wed, Feb 10, 2016 at 8:45 AM, Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:

Query is safe by default. Stream is a semantic shift to loosen those restrictions. Stream is *still* safe by default in that you're getting a page at a time in memory unless you write code to get around it [e.g. list.Add(enumerator.Current.Document)].

Stream gets you all the data. And it certainly does that a page at a time.
 

Here's the problem with "Safe by default": it's only safe from one point of view, the health of the db and the server. Assume a developer thinks LoadStartingWith("foo/") should always and forever return <128 results. Data growth unexpectedly leads to 129+ results. From a user's perspective, the app is broken. From the programmer's perspective, it's awkward [at best] to detect this situation and deal with it using LoadStartingWith.

Better to have an explicit limit like that than to bring the entire system down.

 ??? -  "Stream gets you all the data. And they certainly do that at a page at a time." 
 
Various people have fought against the safe by default paradigm for Query and I disagree with that. I think Stream is enough of a "speed bump" to alert you that you're in more dangerous territory. The fact that you pass a Query to a Stream is brilliant in this regard. 

Yes, that is entirely the point. You can do that to bypass that limitation explicitly. Note that when you do, you take on the onus of protecting yourself from those details.
That is why the API is the way it is.

We agree.
 
For Query, it's clear that you should setup paging or filtering to explore the data. If you're in a situation where you "want everything", Stream gives you that. If LoadStartingWith doesn't give you everything and has a sub-optimal paging mechanism, it's redundant.

Except that it isn't. We use it quite often for the purpose it is meant for, and it works great for that.

I'd assert that a non-deterministic result set with a snowflake paging method is precisely the sort of api cruft you're looking for. I'm not saying it's not useful functionality, quite the contrary. I'm saying it could/should be afforded under another api method.

Let's say we want to deal with documents by prefix. How is that handled across the api?

session.Advanced.LoadStartingWith<Employee>("employees/");
store.Subscriptions.Create(new SubscriptionCriteria { KeyStartsWith = "employees/"});
store.Changes.ForDocumentsStartingWith("employees/")
session.Advanced.Stream(startsWith: "employees/")
store.DatabaseCommands.StreamDocs(etag: null, startsWith: "employees/");

Seems random. Rather, it seems that it grew organically. For a new user though.....

So obviously, I think there's some opportunity for api consolidation here. If I'm alone in that, it's ok because I can use the bits and pieces that make sense to me.

Kudos on opening this thread in the first place.

sources:

StartsWith overload is missing from the docs

At first, the StreamDocs api makes it seem like you can use etag and prefix, but a comment in docs confirms you can't.
 

Fwiw, it's awkward that you can't pass etag *and* prefix to Stream. I realize that adding that overload converges with Data Subscriptions...hence these suggestions. Further, looking at SubscriptionOptions and the options for Changes, I can see someone wanting to Stream over those as well - a non-persistent Subscription.

The reason you can't is that this would force us to scan the entire dataset from that etag and filter everything.
This can lead to a very long pause in some cases.

Would "collection name" and etag be better? IIRC, there are now mechanisms for indexing to be "collection aware".

Gareth Thackeray

unread,
Feb 10, 2016, 11:30:37 AM2/10/16
to RavenDB - 2nd generation document database
Here's my comments on what's been said in this thread:

1. SQL replication: I think you roundly convinced Kijana of its usefulness!  I am not a fan of the fact that errors can just be swallowed up, however.  I'd like an alternate mode where it won't continue till the error is fixed.

2. SIR: this is at the core of our app to turn holiday bookings and "holds" into updates to availability in the search index.  We have two SIRs, and while one of them in fact does just materialise map/reduce results into documents, the other has a significant amount of logic in the JS.

I was actually ruing this decision originally as we had many concurrency errors, but since v3 these have dried up (I think simply because the SIR bundle now has a retry on concurrency error).

All this said, I've long assumed that as we scale we may need to replace this internal mechanism with external queues and updates, but it has been very useful from the point of view of a) getting something up and running quickly without thinking about / managing external queues and b) having a version that just works on each developer's machine without needing a queueing infrastructure.

Short story: our app breaks quite badly if you remove SIR!

3. Streaming etc: I've experienced the problem Kijana mentioned with there being no clear way to process each document in a collection exactly once.  As with him, I've often taken to loading them all (or if they are large docs, their ids) into memory initially.  A clear and obvious way to achieve this would be good.

4. Just a few improvements to Studio:
- Coming from the SL version, I miss that "starting with" matches come up as "related documents"; they no longer do.  (E.g. products/1234/extradetails if you're looking at products/1234; the converse would also be nice.)
- when you go to a doc it should say the name of the doc in the "go to document" box so you can easily change it
- choosing which columns you want in a result set is clunky

OTOH, I disagree with Chris about the tooling being substandard.  The studio is mostly very nice and improves over time.  I did a full db export and import yesterday and it seemed much sturdier than I remembered.  I normally expect to have to use Smuggler to do that on a db of any size.


Cheers,

G

Oren Eini (Ayende Rahien)

unread,
Feb 10, 2016, 2:51:52 PM2/10/16
to ravendb
That is certainly something that we would really like to do.
This might cause issues for users if we want to do certain things (because we can do stuff to the JSON).
We also need some way to avoid just passing strings all over the place, but I would really like to just use the standard json instead of my own copy, yes.



Oren Eini (Ayende Rahien)

unread,
Feb 10, 2016, 2:55:56 PM2/10/16
to ravendb
  • session.Advanced.LoadStartingWith<Employee>("employees/");
Loads the first page of the relevant documents and tracks them in the session. Typically done for specific reasons, and not over expected large data sets.
  • store.Subscriptions.Create(new SubscriptionCriteria { KeyStartsWith = "employees/"});
Creates a long-running subscription that will give you the documents with this prefix.
Typically used for background processing, jobs, etc.
  • store.Changes.ForDocumentsStartingWith("employees/")
Get notified (transient) when those documents change. Does not give you the documents themselves. Typically used for notifications, clearing caches, etc.
  • session.Advanced.Stream(startsWith: "employees/")
Gets all the documents (unlimited size) with the given prefix. Turns them into CLR types, but doesn't track them.
Typically this is used to generate reports, etc.
  • store.DatabaseCommands.StreamDocs(etag: null, startsWith: "employees/");
This is just the lower-level API of the previous call.
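
For instance, the session-level streaming call is typically consumed like this. This is a sketch; the Employee type is from the examples above and the exact signature is from memory of the 3.x client:

```csharp
// Sketch: streaming every employee without session tracking.
// Signature details are from memory and may differ slightly.
using (var enumerator = session.Advanced.Stream<Employee>(startsWith: "employees/"))
{
    while (enumerator.MoveNext())
    {
        Employee employee = enumerator.Current.Document; // not tracked
        // ... write a report row, push to an export, etc.
    }
}
```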

Tobi

unread,
Feb 10, 2016, 3:02:39 PM2/10/16
to rav...@googlegroups.com
inline

On 10.02.2016 12:50, Oren Eini (Ayende Rahien) wrote:

> On Wed, Feb 10, 2016 at 11:45 AM, Tobi <lista...@e-tobi.net
> <mailto:lista...@e-tobi.net>> wrote:
>
> On 10.02.2016 07:36, Oren Eini (Ayende Rahien) wrote:
>
> > We are dropping the distinction between embedded and standard code.
> > They'll both use the same exact mechanisms.
>
> What exactly does that mean? Will embedded then even need to connect to
> localhost?
>
>
> Yes, it will connect to localhost to the same process


I was afraid you would say that. Wouldn't this make things even slower than they already are?

Tobias

Oren Eini (Ayende Rahien)

unread,
Feb 10, 2016, 3:04:03 PM2/10/16
to ravendb
No, they wouldn't. Loopback calls are actually going to be faster, in the same sense that your tests show that running against the server is faster than embedded, because we do less work (stupid, I know, but that is what ends up happening).




Tobias

--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.

Oren Eini (Ayende Rahien)

unread,
Feb 10, 2016, 3:16:04 PM2/10/16
to ravendb



Matthias De Ridder

unread,
Feb 15, 2016, 5:50:09 AM2/15/16
to RavenDB - 2nd generation document database
Do not increase the Etag of documents during indexing. This increases the chance of concurrency exceptions a lot, while the document hasn't actually changed in the sense relevant to a concurrency exception.

Oren Eini (Ayende Rahien)

unread,
Feb 15, 2016, 5:55:48 AM2/15/16
to ravendb
We don't do that.
We only update the document etag if you use LoadDocument and the source document changed.



On Mon, Feb 15, 2016 at 12:50 PM, Matthias De Ridder <orcr...@gmail.com> wrote:
Do not increase the Etag of documents during indexing. This increases the chance of concurrency exceptions a lot, while the document hasn't actually changed in the sense relevant to a concurrency exception.


Matthias De Ridder

unread,
Feb 15, 2016, 6:08:49 AM2/15/16
to RavenDB - 2nd generation document database
I'm sorry for explaining the cause in the wrong way, but the result is the same: concurrency exceptions might happen although the document has not changed.

On Monday, February 15, 2016 at 11:55:48 AM UTC+1, Oren Eini wrote:

Chris Marisic

unread,
Feb 16, 2016, 10:36:24 AM2/16/16
to RavenDB - 2nd generation document database
If the document was modified concurrently, you need to replay your command against the current version of the document, or abandon the request as an error. There's no reason for you to care whether it was LoadDocument (which should very rarely ever be used) or another user doing something.

njy

unread,
Feb 17, 2016, 10:33:30 AM2/17/16
to RavenDB - 2nd generation document database
> Worst mistakes in ravendb design

I know it has already been said, and multiple times, but please: throw an exception when a "safe by default" limit is hit (like the 128 docs/request).

I mean, if you don't want to change the default behaviour, ok, fine, but at least provide a per-request or per-session or per-something flag: whatever you want except manually changing a global config file (a showstopper on the cloud). Please give us something controllable to avoid RavenDB silently limiting the results, because that is insane (sorry if I seem harsh, I'm just trying to communicate my raw feelings).

Just to be clear: I agree with the "safe by default" philosophy to avoid bad practices that will cause bad performance.

BUT.

What happens already, right now, when more than 30 req/session are executed? An exception is thrown, it's not that you get back a response (silently) limited to zero docs, right?
What happens already, right now, when a req takes more than 30 sec (or whatever timeout it is)? Again, an exception is thrown, it's not that you get back a response (silently) limited to the docs already processed in the allowed timeframe, right?
But what happens in RavenDB if you ask for 129 docs? You silently get back only the first 128 docs.

Sorry, but there is no way this makes any sense, at all, on top of being inconsistent with the rest of the normal behaviours inside RavenDB.

Case in point: I've seen a lot of people doing all sorts of tricks to get around this bad design, like:
- wrapping query calls with methods that check whether there is no "take amount" or whether it is > 128, and in that case throw an exception;
- post-analyzing the query stats and then throwing an exception in case a limit was hit;
- creating custom ReSharper rules that block the friggin' compilation of the entire project if a query does not have a "take amount" specified or if it is > 128;

I don't think these can be considered "developer friendly", while RavenDB in general is positioned as exactly that, with "smart defaults" and so on, and that positioning is absolutely true!

It is absolutely true, except for this silent truncation thing. This is just an insanely bad design, sorry.


On Tuesday, February 9, 2016 at 12:40:02 PM UTC+1, Oren Eini wrote:
Guys,
We are doing a lot of work on 4.0, and one of the things we are looking at is not just adding new stuff, but removing bad old stuff.

What are the things that you regret having in RavenDB? Pifalls, issues, confusing, etc?


Kijana Woodard

unread,
Feb 17, 2016, 11:00:55 AM2/17/16
to rav...@googlegroups.com
First off, I understand what you mean [see my posts in this thread].

"But what happens in RavenDB if you ask for 129 docs? You silently get back only the first 128 docs."

If you *actually* ask for 129 [e.g. .Take(129)], you get back that many, if there are at least that many.
Consider Query as having an implicit Take(128).

Therefore, if you want "all the documents that match this query", you should use Stream.
Query should immediately tell the developer - "I'd better have paging set up for the client, or they should be ok with Top N, or I should switch to Stream". 

Since Stream enhances query, it's fairly easy to upgrade.
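
For example, the upgrade path looks roughly like this. This is a sketch against the 3.x client; the Order type and its properties are made up for illustration:

```csharp
// Safe-by-default query, paged explicitly:
var query = session.Query<Order>().Where(o => o.Region == "EU");
var firstPage = query.Take(128).ToList();

// Upgraded to "give me everything that matches" - same query, streamed:
using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
        Console.WriteLine(enumerator.Current.Document.Id); // Id is illustrative
}
```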


Kijana Woodard

unread,
Feb 17, 2016, 11:03:27 AM2/17/16
to rav...@googlegroups.com
Follow-up on issues with the identity api - they don't replicate.
Failing over means reusing an identity and possibly overwriting documents or generating replication conflicts.

Also, you can't [easily] delete them.

Oren Eini (Ayende Rahien)

unread,
Feb 17, 2016, 11:03:40 AM2/17/16
to ravendb
It would be relatively easy to add, since we know the total number of results.
We can do that when the take is implicit, and then throw, but that seems like a good way to break your app in prod.

We intentionally limit the size in this manner because we want to still work over time when the data size grows.


Oren Eini (Ayende Rahien)

unread,
Feb 17, 2016, 11:05:48 AM2/17/16
to ravendb
Identity isn't meant to replicate.
That is by design.

Otherwise you get into generating unique incremental numbers in a distributed fashion, which is complex.
And you can't end up overwriting documents using identity.

Kijana Woodard

unread,
Feb 17, 2016, 11:12:57 AM2/17/16
to rav...@googlegroups.com
"We can do that when the take is implicit, and then throw, but that seems like a good way to break you up in prod."

I could live with this compromise. Naked Queries without Take are not good coding practice anyway. 

Oren Eini (Ayende Rahien)

unread,
Feb 17, 2016, 11:15:06 AM2/17/16
to ravendb
But they are incredibly common, and that would be killing your system as data grows.
It would be much more severe than "I can't see the oldest posts".

Kijana Woodard

unread,
Feb 17, 2016, 11:20:53 AM2/17/16
to rav...@googlegroups.com
That's why it's a compromise. ;-)

For the case @njy is talking about, to the users it's not "old posts" depending on sort order. It could be, there are no "new orders". The system is *broken* in their view. It would almost be better to crash and fix the code appropriately [Take, paging, Stream, whatever] than to have users making decisions from what they see as incorrect data. The system gets more broken as they limp along desperately resubmitting data.

The real question comes down to "what's the default behavior" and how can you opt into the opposite behavior.

Oren Eini (Ayende Rahien)

unread,
Feb 17, 2016, 11:21:57 AM2/17/16
to ravendb
In that case, I would rather have a better implicit take convention,
so you can explicitly say something like: without Take, limit to 10 items, and throw if there are more.
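
To make the idea concrete, here is a purely hypothetical sketch of what such a convention could look like. None of these properties exist in the client today; the names are invented to illustrate the proposal:

```csharp
// Hypothetical sketch of an explicit implicit-Take convention - NOT a real API.
public class QueryConventions
{
    // Page size used when a query specifies no .Take().
    public int ImplicitTakeSize { get; set; } = 10;

    // If true, a query without .Take() throws when the index reports
    // more matches than ImplicitTakeSize, instead of silently truncating.
    public bool ThrowOnImplicitTakeOverflow { get; set; } = true;
}
```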

Kijana Woodard

unread,
Feb 17, 2016, 11:22:24 AM2/17/16
to rav...@googlegroups.com
"And you can't get overwriting documents using identity"

Server 1 - write "foos/1"
Fail to Server 2 - write "foos/"

Replication conflict resolver set to "Take latest".
Doc is overwritten.

It's an extra config step to get there, but it was non-obvious you were getting into that decision matrix when you decided to use identities.

What's the "safe" way to use the identity api without having to consider knock-on effects from decisions made about seemingly unrelated features?

Oren Eini (Ayende Rahien)

unread,
Feb 17, 2016, 11:24:30 AM2/17/16
to ravendb
With Hilo, we have the HiloPrefix to handle that scenario. 

With identity, it is a bit more complex, because you usually care about the sequence of numbers.

Kijana Woodard

unread,
Feb 17, 2016, 11:25:01 AM2/17/16
to rav...@googlegroups.com
I think many numbers are reasonable defaults [10, 25, 100, 128].

Should that be configurable too? I want to say no: use .Take() and set what you want. Use an extension method if you must define a central take.

Fwiw, for paging through lists we have extension methods for various reasons anyway, so our Take is centralized.
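
A sketch of what such a centralized extension might look like; the numbers and names here are just examples, not what we actually use:

```csharp
// Centralized paging so every query goes through one explicit Take.
public static class QueryPagingExtensions
{
    public const int DefaultPageSize = 25;

    public static List<T> ToPagedList<T>(this IQueryable<T> query,
        int page = 0, int pageSize = DefaultPageSize)
    {
        return query.Skip(page * pageSize).Take(pageSize).ToList();
    }
}

// Usage: session.Query<Order>().ToPagedList(page: 2);
```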

Kijana Woodard

unread,
Feb 17, 2016, 11:28:21 AM2/17/16
to rav...@googlegroups.com
"With identity, it is a bit more complex, because you usually care about the sequence of numbers."

So in what scenario would I want to reuse a number from the secondary?

For the case of "keep invoices sequenced without gaps", that sounds disastrous. You'll definitely get conflicts on fail over and the solution involves changing the id of one of the documents.

Oren Eini (Ayende Rahien)

unread,
Feb 17, 2016, 12:44:04 PM2/17/16
to ravendb
Yes, in those cases, you don't use identity, because you need to validate that you don't generate duplicate invoice numbers.

njy

unread,
Feb 17, 2016, 1:15:09 PM2/17/16
to RavenDB - 2nd generation document database
> Since Stream enhances query, it's fairly easy to upgrade.

I agree that it's not hard to upgrade, but imho the problem is still there: the default behaviour in this case is not aligned with the other standard RavenDB behaviours (throw when a limit is hit), and that creates confusion (see all the existing, and used, workarounds).
Just to be extra clear: I'm not proposing "if I ask raven for 1000000 docs it should give me them", because I see the implications for performance, reliability and so on. What I'm pointing at is that it silently strips away some results, without telling you.

All in all, the problem is not that there is no other way, but that there is a way (and, btw, the *common* way) that silently does not give you back what you asked for.

Kijana Woodard

unread,
Feb 17, 2016, 1:25:56 PM2/17/16
to rav...@googlegroups.com
How do you feel about the proposal to throw if there are more results than given by the implicit Take?
In that case, given 129 docs, Query<>() will throw, but Query<>().Take(128) [back to the current default] won't throw.

How about the idea to limit the implicit Take to 10?
Config to opt out of this behavior?

njy

unread,
Feb 17, 2016, 1:28:07 PM2/17/16
to RavenDB - 2nd generation document database
Thanks for the quick response:

> We can do that when the take is implicit, and then throw, but that seems like a good way to break you up in prod.
I see your point, but enabling this behaviour via a per-query or per-session (or even per-store) flag would not create surprises or suddenly break in prod, because the dev *explicitly* told Raven to throw.
On top of that, I think "breaking the prod" is way better than "not breaking the prod but silently skipping some data", because the latter is way harder to detect, while the problem is still there.
Think about it this way: the current way of handling the scenario is like a try/catch with an empty catch block. Yes, I can think of some scenarios where that could make sense, but 99% of the time you should handle the exception, not pretend everything went fine.
Does my analogy make sense?

> We intentionally limit the size in this manner because we want to still work over time when the data size grows.
The fact is that it would not still work: it would not crash, ok, but it would not be doing what it should do, and it would fail in an almost invisible, hard-to-detect way.

My 2 cents: the ideal situation I have in my mind is that raven should throw an exception (like it throws in the other cases where a limit is hit), but it should be possible to specify a flag to say something like "if there are too many resulting docs, limit them reasonably" (that is the current behaviour). In this way, 99% of the time you will get an error and you'll need to do something about it, and in the 1% of the time when "less data than what was asked for" is ok, you can say so with a simple flag, per query. Doing this will instantly eliminate all the workarounds currently used, while still allowing those few cases where it's ok not to get everything you asked for, with a simple flag.

What do you think? Kijana, you?

Btw thanks for your time, it's very nice to see such an open discussion about such a delicate and internal topic, really appreciate that.

njy

unread,
Feb 17, 2016, 1:50:06 PM2/17/16
to RavenDB - 2nd generation document database
I'm making some tests on this subject; I'll come back to this as soon as I'm done (don't want to force you to lose more time on this because of some misunderstanding of mine).

Thanks for now.

Bruno Lopes

unread,
Feb 17, 2016, 2:10:07 PM2/17/16
to ravendb
Is this a solution to Njy's problem that can be used right away?
    public class ThrowIfDeveloperForgotAboutTake : IDocumentQueryListener
    {
        public void BeforeQueryExecuted(IDocumentQueryCustomization queryCustomization)
        {
            queryCustomization.BeforeQueryExecution(q =>
            {
                if (!q.PageSizeSet)
                    throw new InvalidOperationException(
                        "Forgot to set .Take(<number>);");
            });
        }
    }

Throwing if the return set is larger than the default page size sounds like a problem that will be caught in production but not during dev, and will crash the app.

With this listener the developer would be hit in the face with the problem even if on the local machine there's not enough data.
It doesn't involve wrapping queries, and seems like it would force the .Take to be explicit.
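
If I read the listener API right, wiring it up would look roughly like this (the store URL is made up, and RegisterListener's exact placement is from memory of the 3.x client):

```csharp
// Sketch: registering the query listener on the document store.
var store = new DocumentStore { Url = "http://localhost:8080" };
store.Initialize();
store.RegisterListener(new ThrowIfDeveloperForgotAboutTake());
```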



njy

unread,
Feb 17, 2016, 7:52:36 PM2/17/16
to RavenDB - 2nd generation document database
> If you *actually* ask for 129 you get back that many
I've made a mistake while exposing my arguments: I didn't mean to say 128, but 1024.
128 is the default take amount, whereas I meant to talk about the 1024 limit: sorry about that, I created a little bit of confusion.

Please see below in 2 minutes, I'll go on there for the sake of continuity.


On Wednesday, February 17, 2016 at 5:00:55 PM UTC+1, Kijana Woodard wrote:

njy

unread,
Feb 17, 2016, 8:17:02 PM2/17/16
to RavenDB - 2nd generation document database
Ok, tests done: as I just said to Kijana I've used the wrong limit while exposing my arguments.
When I said 128 what I really meant was 1024.

Having said that, my 2 points remain (now hopefully exposed more clearly and with the right values):
1) NO AUTO-TRUNCATION: if, for example, there are 8000 docs in a collection and I ask for 4000 of them, the system should not automatically truncate the results at 1024 docs, silently.
I understand that asking for, I don't know, 1000000 docs may crash/stall/hang the system, and I'm not suggesting that raven should return that many docs: what I'm saying is to "not truncate the result set without throwing an error".
So if I'm asking for something stupid that may pose a threat to the stability of the system, tell me that it's not possible to fulfill my request, upfront, in a clear way (exception). That way I will not have one result in dev and another in prod.
2) NO DEFAULT TAKE AMOUNT: if it is important to specify how many docs I want in a request (and it is), make that mandatory. A "smart" default like 128 will only hide the problem, both in dev and in prod.

For both points think about the principle of least astonishment: https://en.wikipedia.org/wiki/Principle_of_least_astonishment

People are accustomed to the SQL way of "if I don't say anything, give me all the rows". That is wrong, imho, because problems arise in prod, so Raven wants to change that, which is good.
But the solution is not to have some default value, because that will just temporarily hide the problem, causing another kind of problem in prod, where not all data will be processed.
Neither is silently truncating the result set to 1024 docs when too many of them are requested.

What I'd like to see is something like this: if I do a query with a take amount > 1024, an exception is thrown, with a message like "It's not possible to request more than 1024 docs in a single request with this API, otherwise the system may become unstable or perform poorly. Please use the streaming API" or something similar. And it should not matter how many docs there are: if I'm querying a collection with 3 docs but requesting 4000 of them, throw the exception right away, because that will be a problem in production nonetheless.
In this way dev/prod environments will be the same, and there would not be any surprise at all.
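Sketched as a simple client-side guard (the `MaxPageSize` constant and the message text are illustrative, taken from the numbers discussed in this thread):

```csharp
// Desired behavior sketched: validate the requested page size before the
// query runs, regardless of how many docs actually exist in the collection.
const int MaxPageSize = 1024;   // the server-side limit discussed here

static void EnsureTakeIsAllowed(int take)
{
    if (take > MaxPageSize)
        throw new InvalidOperationException(
            "It's not possible to request more than " + MaxPageSize + " docs in a " +
            "single request with this API. Please use the streaming API.");
}

// EnsureTakeIsAllowed(4000) throws, even against a 3-document collection.
```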

Does this make sense?

Again, sorry for having initially stated my points with the wrong values (128 instead of 1024), I know that created some confusion.

Kijana Woodard

unread,
Feb 17, 2016, 8:38:21 PM2/17/16
to rav...@googlegroups.com
Thanks for clarifying that point.

So, to start, Oren's suggestion of "throw if take not specified" still holds...

You're asking for it to throw always and not try to calculate if there are more results. Seems easier to implement.

You're also saying if you specify more than the Take, throw always.

Should the max take configuration simply be removed? If you want more than 1024, Stream. Frankly, I have a hard time justifying more than 128 in a query, now that we have Stream.

I suppose there should be a config flag to opt into the old behavior for people that want that. Thoughts on that? I could be persuaded it's not necessary in the face of a major update, but friction is friction and you don't want people sticking to old versions.


Oren Eini (Ayende Rahien)

unread,
Feb 17, 2016, 10:23:05 PM2/17/16
to ravendb
Did you see Bruno's listener solution?

Kijana Woodard

unread,
Feb 17, 2016, 10:35:47 PM2/17/16
to rav...@googlegroups.com
Yes. Does that work? [didn't verify it myself].

Are you considering putting that listener in by default, and removing the listener for the current behavior?

Oren Eini (Ayende Rahien)

unread,
Feb 17, 2016, 10:37:55 PM2/17/16
to ravendb
It should.

njy

unread,
Feb 18, 2016, 4:37:06 AM2/18/16
to RavenDB - 2nd generation document database
> You're asking for it to throw always and not try to calculate if there are more results. Seems easier to implement.
Yes, and more consistent, I think: otherwise you'll have things like asking for 4000 docs in dev (where there are only 10 docs) and everything runs fine, then you go to prod and everything is wrong (either a truncated result set like now, or exceptions thrown like I'm proposing). Surprise surprise.
If it is decided that, for stability and performance, asking for 4000 docs with the normal .Query() API is wrong, that should throw right away.
What do you think, does it make sense?

> You're also saying if you specify more than the Take, throw always.
> Should the max take configuration simply be removed?
If I understood correctly, there is a max in the .Query() API to prevent people from causing perf & stability problems, which is a novel approach compared to the traditional SQL way (if I ask for 1,000,000 rows, give them to me no matter what). And I second that!
What I'm saying is not to allow people to get 1,000,000 docs in a single query, but to throw an exception right away instead of silently limiting the result set to the first 1024 docs, which would cause a whole different class of problems that would also be harder to spot.
And - probably more importantly - it is different from how Raven already behaves when hitting the other safe limits (30 req/session, timeouts), which is to throw.

> If you want more than 1024, Stream.
Yes, absolutely! Just say that in the exception message.

> Frankly, I have a hard time justifying more than 128 in a query, now that we have Stream.
I can understand your point, and I don't know if the right value for the limit is 128, 1024 or whatever: the baseline is that I accept the concept of a limit on the .Query() API, because it prevents a lot of problems in production (it is not a way to do queries that scale well, in contrast with the streaming API).

> I suppose there should be a config flag to opt into the old behavior for people that want that. Thoughts on that?
Totally agree: for me the most important thing is that if it is decided that the new behaviour (throw an exception) is the right way, it should be the default. I mean, we are talking about a major version bump, V4. It's like removing an obsolete feature: it can happen, and a major version bump is the right moment to do it.
But ideally there should be a flag to ease the upgrade for people coming from older versions, something like .Query(...).DontThrowIfTooManyResults() or .AutoLimitResults() or something like that (can't come up with a good name right now, sorry).
Maybe also a doc/session flag, to ease the upgrade of entire apps, I don't know right now.

> I could be persuaded it's not necessary in the face of a major update, but friction is friction and you don't want people sticking to old versions.
100% Agree.

Thanks!

Oren Eini (Ayende Rahien)

unread,
Feb 18, 2016, 4:52:30 AM2/18/16
to ravendb
Okay, how about the following?

If you ask for more than the max page size, you get an error, with a pointer to streams.
Implicit take size will be 25 (convention based), and if your query has more results, it will throw.
Explicit take will work as it does not.

The configuration would be controlled from the conventions.

njy

unread,
Feb 18, 2016, 4:57:59 AM2/18/16
to RavenDB - 2nd generation document database
Hi Oren, yes I saw that solution, and it's a nice way to avoid queries without a take specified; it could be extended to also stop queries requesting more than 1024 docs.

More importantly, I'd like to understand whether you agree that (1) the current way Raven handles requests with take > 1024 is wrong and not coherent with the other Raven behaviours in similar situations, and whether you agree that (2) not specifying a take amount is logically wrong, just like asking for "all the docs" (even though for different reasons).

I'm asking this because if you, as the creator/designer of Raven, think that the current way is the right way, then it is my problem to find a trick to work around this (at least to me and some other people) strange Raven behaviour.
If instead you come to think that the new behaviour (throw an exception) is in fact the right one, then Raven should change, and it becomes (eventually) a problem for people upgrading from older versions (for which there's the flag Kijana and I talked about above).

Thanks

Oren Eini (Ayende Rahien)

unread,
Feb 18, 2016, 5:04:28 AM2/18/16
to ravendb
I don't think that the behavior is wrong. And I want to avoid making it awkward to do the obvious things.

Right now you have the option of defining the behavior we specified using a listener, which means that you can do so once, and pretty much forget about it then.

njy

unread,
Feb 18, 2016, 5:07:08 AM2/18/16
to RavenDB - 2nd generation document database
> Okay, how about the following?
> If you ask for more than the max page size, you get an error, with a pointer to streams.
Amazing, I couldn't have asked for more :-) !

> Implicit take size will be 25 (convention based), and if your query has more results, it will throw.
I'm not convinced about the implicit take, because in dev you have 10 docs and it works, and in prod you have 10000 docs and it will throw: I know it seems a little bit radical, but hey, "safe by default" is a new way of thinking, we are no longer in SQL land :-)
My opinion is that just throwing right away (see: fail fast) if something as important as the take amount is not specified is the best thing: it does not default to "take all" like SQL (which would be bad perf/stability-wise) and it does not default to some number (which may lead to surprises in prod).
Thoughts?

> Explicit take will work as it does not.
Good.

> The configuration would be controlled from the conventions.
Yes, that is a good place to control this behaviour "globally", makes sense.
Just to be clear: the new behaviour we are talking about would be the new default, right?

Oren Eini (Ayende Rahien)

unread,
Feb 18, 2016, 5:12:15 AM2/18/16
to ravendb
Implicit take is important to make for simple common queries.

njy

unread,
Feb 18, 2016, 6:47:02 AM2/18/16
to RavenDB - 2nd generation document database
> Implicit take is important to make for simple common queries.
I understand the ease, but I think it creates problems in prod, because silent auto-limiting of results is hard to spot.

Anyway, for me personally the main point would be throwing exceptions (pointing to streams) when a query is requesting too many docs, so to avoid the case of explicitly requesting 4000 docs and silently getting back only 1024, that is the most fundamental thing.

When the take amount is not specified, I can live with some default value. Now that I think about it, we could have something like store.Conventions.DisableImplicitTakeAmount = true or store.Conventions.AllowImplicitTakeAmount = false, which would probably be the best of both worlds.
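As a sketch, the conventions proposed here might look like this (`DisableImplicitTakeAmount` / `AllowImplicitTakeAmount` are the hypothetical names from this post, not a shipping API):

```csharp
// Hypothetical conventions, using the names suggested above; neither
// property exists in the current client.
var store = new DocumentStore { Url = "http://localhost:8080" };
store.Conventions.AllowImplicitTakeAmount = false;   // throw when .Take() is missing
// or, framed as an opt-out:
// store.Conventions.DisableImplicitTakeAmount = true;
store.Initialize();
```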

Oren Eini (Ayende Rahien)

unread,
Feb 18, 2016, 6:54:01 AM2/18/16
to ravendb

Bruno Lopes

unread,
Feb 18, 2016, 6:57:19 AM2/18/16
to ravendb
Oren,

And what about the case where the "truncation" happens server-side? (when we request 2000 but the server only answers with 1024?)



njy

unread,
Feb 18, 2016, 7:02:07 AM2/18/16
to RavenDB - 2nd generation document database
Bruno, I think it is the first line in the issue description: "If you ask for more than the max page size, you get an error, with a pointer to streams."

Bruno Lopes

unread,
Feb 18, 2016, 7:03:21 AM2/18/16
to ravendb
Right, brainfart, sorry for that.

njy

unread,
Feb 18, 2016, 7:06:17 AM2/18/16
to RavenDB - 2nd generation document database
This is awesome, thanks!

Can you clarify a couple of things, please?

1) when you say "Explicit take will work as it does not", what do you mean exactly? I already said "good!" before, but re-reading the phrase I'm not sure I understood it correctly.
2) the code samples for the conventions look really, really good, I'm just not sure if those are the default values or not

Thanks!

Oren Eini (Ayende Rahien)

unread,
Feb 18, 2016, 7:27:16 AM2/18/16
to ravendb
As it does now, I meant.

The conventions are not the default, not sure what those will be yet.

njy

unread,
Feb 18, 2016, 9:10:31 AM2/18/16
to RavenDB - 2nd generation document database
> The conventions are not the default, not sure what those will be yet.
Ok, thanks for the clarification. Btw, my vote is for more coherent defaults, and the V3 -> V4 major version change is probably a good place to introduce them ;-)

While we are on the subject: if you are thinking about what the defaults for V4 should be, can I suggest a public poll?
With a reasonable minimum number of votes, that may be a good indication of what the preferred usage would be.

Oren Eini (Ayende Rahien)

unread,
Feb 18, 2016, 9:31:25 AM2/18/16
to ravendb
Create an issue for this with the stuff you want to change, sure.

njy

unread,
Feb 19, 2016, 5:33:48 AM2/19/16
to RavenDB - 2nd generation document database

Chris Marisic

unread,
Feb 23, 2016, 5:25:16 PM2/23/16
to RavenDB - 2nd generation document database

#2 system administration needs to be made first class, the tools that exist now are more akin to dev tools than actual production server administrator tools.

On Tuesday, February 9, 2016 at 11:15:11 AM UTC-6, Oren Eini wrote:

2) Can you be more specific? 


Smuggler not moving all documents ... (?) https://groups.google.com/forum/#!topic/ravendb/X1yYIpj-_4g 

Is just flatly unacceptable for an administration tool.

Kamran Ayub

unread,
Feb 29, 2016, 6:32:56 PM2/29/16
to RavenDB - 2nd generation document database
I've been using RavenDB in production now for about a year, as an individual developer for a site I manage, hosted on Azure and RavenHQ.
  • Cheaper. I wish using Raven was cheaper for individuals. For 1 Dev DB and Standard Replicated Production (2 instances), at ~1GB per database, I pay $75/mo out of my pocket. I wish there wasn't such a charge on storage--$10/GB is steep, especially because RavenHQ doesn't auto-compact databases and I don't think compression is available (regardless I think it needs to be enabled at the start). Basic 2X is enticing but 2GB limit is way too small for $16/mo + cloud VM hosting because the next step up is a whopping $700/yr for Professional. Why is there no middle-ground, like Basic 3X or something? The pricing is just crazy complicated--why even force a subscription for Professional? I'd rather pay an affordable license fee ONCE for each major version, pay a reduced upgrade fee per major version, and then just handle hosting costs--at least then it feels more justified spending the upfront amount.

  • Smart, optimized load-balancing. I wish there was an option to create a smart load-balanced environment. Correct me if I'm wrong, but at the moment if I have master-slave replication, if I write to the master, it will replicate the change and BOTH databases will reindex at once. This means incoming read-only queries (to both instances) are slowed by the massive IO. I admit that partly the problem is with my initial design causing a save to trigger 15 indexes to rebuild at once, but I wish that Raven would smartly load balance read-only queries to optimize performance. Maybe there's an option I'm missing to limit the # of simultaneous indexing operations. I understand this is hard but man would it be nice--Raven is billed as being easy for developers but when it's so easy to make a bad design mistake early on, it costs you immensely downstream when it's hard to make a change or realize you made a bad decision. If Raven could offset my ignorance and help optimize my index load, that'd be amazing.

  • Patterns and Practices. Most of my time learning Raven was spent in the documentation but a lot of the practices I know now weren't really mentioned or recommended in the documentation, instead gleaned from StackOverflow questions or this group. For example, even though Int IDs are supported, they aren't supported well in certain scenarios causing grief. I wish I had just been advised to ALWAYS use string IDs and not bother with numeric IDs. Or how to realistically deal with concurrency issues and ETags, index staleness vs. UX, caching recommendations, etc.

  • Performance recommendations. It would be nice if Raven documented well-known or recommended performance practices, or even provided sample applications with performance baselines. In other words, I wish when I was learning Raven, I could have seen a "real" production application(s), run it locally, and understand what the performance baselines were to compare against.

  • Clearer performance stats in Studio. I couldn't tell you how bad or good my index performance is. I understand some of that information is available in the dashboard but it seems like I need a PhD to understand it. I want a Google Analytics for my stats, I want some clear values to see and know "Oh, okay, my indexing sucks--how can I fix that?" I need better insights into my data and indexes. I want to know how my indexes are performing over time, what factors are making them perform worse or better, etc. I like the "merge suggestions" feature, I want more of that and more high-level insights.

  • Stable updates with bug fixes. I can't tell you how frustrating it has been running into bugs and then being forced to wait until RavenHQ updates. I couldn't use MoreLikeThis from May 2015 until like September/October, when 3xxxxx builds came out on HQ. It's also impossible to recommend Raven internally at work until there's a way to patch an instance without risking issues with new features. Nobody is okay with saying, well, we have 50 dev, QA, and Production RavenDB databases, we have to update them to the latest unstable to fix this one bug and by the way, that project is done so no one has capacity to fix the app if there are incompatibility issues--no way that's going to fly, it's not realistic. Our platform team and DBAs support hundreds of applications, and if we introduce Raven and let's say over a year 20 apps are built with it, over the year there will certainly be bugs and updates that need to be applied, it's not realistic to expect to upgrade to an unstable version and possibly cause production outages, let alone doing this over the course of years. I understand this should be addressed with 3.5 and onwards.
That's what I can think of.

Oren Eini (Ayende Rahien)

unread,
Mar 1, 2016, 1:15:27 AM3/1/16
to ravendb
inline
You have the option of a one-time fee of $1K.
We are thinking about changing the model for 4.0, but we have nothing concrete yet.
 
  • Smart, optimized load-balancing. I wish there was an option to create a smart load-balanced environment. Correct me if I'm wrong, but at the moment if I have master-slave replication, if I write to the master, it will replicate the change and BOTH databases will reindex at once. This means incoming read-only queries (to both instances) are slowed by the massive IO. I admit that partly the problem is with my initial design causing a save to trigger 15 indexes to rebuild at once, but I wish that Raven would smartly load balance read-only queries to optimize performance. Maybe there's an option I'm missing to limit the # of simultaneous indexing operations. I understand this is hard but man would it be nice--Raven is billed as being easy for developers but when it's so easy to make a bad design mistake early on, it costs you immensely downstream when it's hard to make a change or realize you made a bad decision. If Raven could offset my ignorance and help optimize my index load, that'd be amazing.


That isn't actually how it works. A new document (or documents) coming in and being indexed shouldn't generate any additional I/O. And indexing doesn't happen all at once.
 
  • Patterns and Practices. Most of my time learning Raven was spent in the documentation but a lot of the practices I know now weren't really mentioned or recommended in the documentation, instead gleaned from StackOverflow questions or this group. For example, even though Int IDs are supported, they aren't supported well in certain scenarios causing grief. I wish I had just been advised to ALWAYS use string IDs and not bother with numeric IDs. Or how to realistically deal with concurrency issues and ETags, index staleness vs. UX, caching recommendations, etc.

That is why we have this post. Because we want to be sure that the 4.0 version is aligned properly. Also, did you see the book? github.com/ayende/book

  • Performance recommendations. It would be nice if Raven documented well-known or recommended performance practices, or even provided sample applications with performance baselines. In other words, I wish when I was learning Raven, I could have seen a "real" production application(s), run it locally, and understand what the performance baselines were to compare against.

 

  • Clearer performance stats in Studio. I couldn't tell you how bad or good my index performance is. I understand some of that information is available in the dashboard but it seems like I need a PhD to understand it. I want a Google Analytics for my stats, I want some clear values to see and know "Oh, okay, my indexing sucks--how can I fix that?" I need better insights into my data and indexes. I want to know how my indexes are performing over time, what factors are making them perform worse or better, etc. I like the "merge suggestions" feature, I want more of that and more high-level insights.
Have you seen the indexing stats we already have in the studio?
[inline screenshot of the Studio indexing stats]
 

  • Stable updates with bug fixes. I can't tell you how frustrating it has been running into bugs and then being forced to wait until RavenHQ updates. I couldn't use MoreLikeThis from May 2015 until like September/October, when 3xxxxx builds came out on HQ. It's also impossible to recommend Raven internally at work until there's a way to patch an instance without risking issues with new features. Nobody is okay with saying, well, we have 50 dev, QA, and Production RavenDB databases, we have to update them to the latest unstable to fix this one bug and by the way, that project is done so no one has capacity to fix the app if there are incompatibility issues--no way that's going to fly, it's not realistic. Our platform team and DBAs support hundreds of applications, and if we introduce Raven and let's say over a year 20 apps are built with it, over the year there will certainly be bugs and updates that need to be applied, it's not realistic to expect to upgrade to an unstable version and possibly cause production outages, let alone doing this over the course of years. I understand this should be addressed with 3.5 and onwards.

Yes, that is going to take effect with 3.5.
 
That's what I can think of.


Gareth Thackeray

unread,
Mar 1, 2016, 3:56:31 AM3/1/16
to ravendb
Oren,

I feel like you've waved away a couple of important points here.

1. RaccoonBlog: I hadn't actually looked at this for years, since deciding it wasn't a very useful reference.  I just looked now and there is no readme, no documentation and no comments.  So I don't doubt that it's full of best practices, but you have to go searching through all the code to find them, and even then they are not signposted in any way.  In particular it is not useful for seeing what makes for good performance, as who knows what isn't there - e.g. too many LoadDocuments in index definitions.  Of Raven's many and brilliant features, there is not much to tell you what you should and shouldn't be making heavy use of.

2. Indexing stats: try to look at that screenshot you posted as if you weren't super-familiar with the workings of Raven!  I'm a fairly smart guy, but I've got a lot to do, and whenever I look at the index stats in my app I normally just close it again, feeling fairly grateful I don't have any pathological performance to debug at this time.

I think what Kamran is saying (and what I would like) is something that tells you simply what your most expensive indexes are (in total + per document I guess) and what you could do to address it.  And maybe some documentation about what the various data in the different tabs means (apologies if this exists somewhere already).

Cheers,

Gaz


Kamran Ayub

unread,
Mar 1, 2016, 9:10:40 AM3/1/16
to ravendb
Oren,

Per the performance issue--I just took a bunch of profiling screenshots of how the database performs during indexing. With 11+ indexes reindexing, queries to the DB take between 500ms and 1500ms--this is ridiculous, if what you say is true. Clearly there is some CPU/IO bottleneck in my Starter instance in RavenHQ to cause this. With <5 indexes, it's more reasonable and fluctuates between 20ms and 200ms, which I'd consider normal. Note that these aren't complex queries, my homepage is 3 queries, the first one is typically 304 Not Modified, and during indexing, that same query can take up to 1500ms just to return a 304.

I think I'm running into a problem so I'll create another thread for it with more details.

Gareth covered my other points.

Per the 1K per year--yeah, that's a THOUSAND DOLLARS (plus hosting costs!), that's insane for an individual to pay per year for a hobby project that I don't make money from (can't is more like it, according to TOS). That's a brand new Steelcase chair per year and I can hardly justify buying that to my wife, can you imagine trying to justify paying for $1000 software per year? I'm looking for something in-between... maybe $200-300 per year; that's at least a bit more justifiable (I still don't understand why it's a subscription, can't I just pay once? Maybe some people like the subscription model). I pay $75/mo which is about $900 per year, but remember at least in HQ it's replicated and hosted for me. Anyway, please consider this use case for future pricing since using Raven is price-prohibitive right now.

Daniel Häfele

unread,
Mar 1, 2016, 9:39:00 AM3/1/16
to RavenDB - 2nd generation document database
Ehm, the 1k is actually a one-time thing.
See the three attached images...

Kamran Ayub

unread,
Mar 1, 2016, 10:22:06 AM3/1/16
to RavenDB - 2nd generation document database
Oh man, that was totally not clear from the pricing page! I might consider this then, since at least it'd be $1000 one-time and then the cost of hosting which over the long-term would be cheaper than what I do now. Still, $1000 is a lot--I'd even be okay with 2-3 core power for less, like $600. The most I care about is not worrying about storage and having enough power to make it worth paying for my own hosting. Any VM with >4 cores is probably too much for me anyway.

Kamran Ayub

unread,
Mar 1, 2016, 10:44:58 AM3/1/16
to RavenDB - 2nd generation document database
Man it's even worse--apparently I can do $400/yr for Standard with no RavenFS or support. Why wouldn't the pricing page say "Starting from $400/yr"? Why would it default to including the RavenFS price, I don't use RavenFS! That's just silly. And before you ask, no, I never added Standard to cart because I saw $700/yr and said "Nope." I'm sure I'm not the only one...

With that in mind, it's a bit more reasonable but it isn't clear from the pricing page or the cart whether Standard License includes upgrades as that would influence my decision (since $1000 spread over 3 years saves me $200 overall vs. per year).

So overall, my feedback on pricing is--make it simpler to understand and advertise the cheap options first!

Gleb Chermennov

unread,
Mar 2, 2016, 9:46:06 AM3/2/16
to RavenDB - 2nd generation document database
For me the biggest hurdle was backup vs export in the 2.5.x series. I couldn't restore backups on another machine, so I switched to Smuggler. Luckily, the application didn't have that much data, so it wasn't a problem.
I assume this problem is gone with Voron in 3.x - just sharing my 2 cents here.

On Tuesday, March 1, 2016 at 9:15:27 AM UTC+3, Oren Eini wrote:

João Bragança

unread,
Mar 5, 2016, 7:42:40 AM3/5/16
to RavenDB - 2nd generation document database
Here's something I bumped into yesterday. I store a document of a certain type. Using the streaming API, I retrieve this document but wish to deserialize it into a different type (the second type has an extra property), change the Id, assign the additional property and store it in the session. IOW, I want to make a copy. RavenDB won't let me do that - I get an invalid cast exception. I guess TDocument is not being passed to deserialization???

As far as I'm concerned, if I store something as a Chicken and want to bring it back later as a Dog, that's my business.
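Roughly the scenario, sketched with `Chicken`/`Dog` as the illustrative types and the 3.x `session.Advanced.Stream` API; this is the shape of code that currently fails with the invalid cast:

```csharp
// Sketch: copy stored Chicken documents into Dogs (Dog = Chicken + one extra
// property). Today the cast fails because deserialization follows the type
// recorded in the document metadata rather than TDocument.
using (var session = store.OpenSession())
{
    var stream = session.Advanced.Stream<Dog>(startsWith: "chickens/");
    while (stream.MoveNext())
    {
        Dog copy = stream.Current.Document;   // InvalidCastException here
        copy.Id = "dogs/1";                   // illustrative new id
        copy.ExtraProperty = "some value";    // illustrative extra property
        session.Store(copy);
    }
    session.SaveChanges();
}
```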

Kijana Woodard

unread,
Mar 5, 2016, 10:34:25 AM3/5/16
to rav...@googlegroups.com
Yup. That gets awkward. If you now only have Dog defined in your project, what you say will work.

The existing behavior is using the type defined in metadata. For Load<object> or dynamic or loading a list of ids where the
types differ, this behavior is essential.

And yet, it would be nice to be able to coerce to a new type in certain cases.


Kamran Ayub

unread,
Mar 5, 2016, 11:22:30 AM3/5/16
to rav...@googlegroups.com
There is also the awkwardness of using the Stream API with a starts-with prefix: it's not able to deserialize to the desired type because some documents that match the prefix don't match the desired generic type. You have to do the workaround of streaming "object" and then inspecting the type yourself, an extra step. The Stream API is also missing the "exclude" parameter, so it has to get all the documents; you can't exclude any (unlike LoadStartingWith, which does have exclude).
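The workaround described above, sketched (assuming the 3.x `Raven-Clr-Type` metadata key; exactly how `Stream<object>` materializes the document - typed entity vs. `RavenJObject` - can vary, so the cast is illustrative):

```csharp
// Sketch: stream everything under a prefix as object, then filter by the CLR
// type recorded in metadata, since Stream<T> can't skip non-matching docs.
using (var session = store.OpenSession())
{
    var stream = session.Advanced.Stream<object>(startsWith: "orders/");
    while (stream.MoveNext())
    {
        var clrType = stream.Current.Metadata.Value<string>("Raven-Clr-Type");
        if (clrType == null || !clrType.StartsWith("MyApp.Models.Order"))
            continue;                         // documents under the prefix we don't want
        var order = stream.Current.Document as Order;   // illustrative cast
        if (order != null)
        {
            // ... process order ...
        }
    }
}
```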


Kijana Woodard

unread,
Mar 9, 2016, 12:57:00 PM3/9/16
to rav...@googlegroups.com
I can't believe I forgot this one.

The server changes the casing of the id depending on how you get it back. That means your code always has to be on guard and do nasty case insensitive string comparisons for id equality else risk subtle bugs.

And yet, if you try to do a case-insensitive search in a where clause, the linq code blows up with "could not understand query". This means you can't simply apply a style rule like "all string equality must be case insensitive"; each case has to be evaluated independently. The session handles the case insensitivity just fine, but the server behavior forces us to handle it manually everywhere in our own code.

I'd prefer that ids were either always forced lower case or the server always return them in the original casing.
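The defensive comparison this pushes into client code looks like this (plain C#, nothing RavenDB-specific):

```csharp
// Because "users/1" and "Users/1" may both refer to the same document,
// id equality has to ignore case everywhere ids are compared by hand.
static bool SameDocument(string idA, string idB)
{
    return string.Equals(idA, idB, StringComparison.OrdinalIgnoreCase);
}

// SameDocument("Users/1", "users/1")  -> true
// "Users/1" == "users/1"              -> false, the subtle bug in question
```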

Oren Eini (Ayende Rahien)

unread,
Mar 9, 2016, 1:24:10 PM3/9/16
to ravendb
The actual reason here is a bit complex.
Basically, when you load documents as a result of a query, we go to the index to get the ids, and if you are just getting the values from the index, we get the document id stored in the index, which is stored as lower case (for Lucene reasons).

Kijana Woodard

unread,
Mar 9, 2016, 1:33:27 PM3/9/16
to rav...@googlegroups.com
Yes. But that plays havoc when you then use that Id somewhere else [property on another doc, posted value from client].

IIRC, we added something to always Store(Id) in the index, which I think is supposed to preserve casing, but that wasn't bulletproof. In any case, we're always doing case-insensitive string compares, which is annoying, but it's better than hitting a subtle bug weeks later.

Is there a better workaround?

Fwiw, when doing session.Load(listOfIds), I generally still do session.Load(id) for each id to avoid this issue.

Oren Eini (Ayende Rahien)

unread,
Mar 9, 2016, 1:37:26 PM3/9/16
to ravendb
The actual reason we have to do that is to avoid hitting Esent for the id in the proper case. In 4.0, we are using Voron and we explicitly handle this case much better.

Kijana Woodard

unread,
Mar 9, 2016, 1:45:58 PM3/9/16
to rav...@googlegroups.com
Should Store(Id) work around the issue?

Oren Eini (Ayende Rahien)

unread,
Mar 9, 2016, 1:47:24 PM3/9/16
to ravendb
Yes, because it stores that in a separate field from __document_id.
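A sketch of that workaround as an index definition (index, entity, and field choices are illustrative; the idea is that a stored field keeps the id's original casing, unlike the lower-cased `__document_id`):

```csharp
// Illustrative index that stores the id explicitly, preserving its casing,
// so projections read the stored value instead of __document_id.
public class Users_ByName : AbstractIndexCreationTask<User>
{
    public Users_ByName()
    {
        Map = users => from user in users
                       select new { user.Name, user.Id };
        Store(x => x.Id, FieldStorage.Yes);   // separate stored field, original casing
    }
}
```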

Chris Marisic

unread,
Mar 9, 2016, 1:58:08 PM3/9/16
to RavenDB - 2nd generation document database
Related to this is the insanity that ensues when you get an ID that has a tab, carriage return, or linefeed included. I know that has been a constant break/fix scenario. I have no idea if it's currently working everywhere or not.