Fanout Total Count... any news or is there a better model?

108 views
Skip to first unread message

Piers Lawson

unread,
Oct 19, 2016, 6:03:33 AM10/19/16
to RavenDB - 2nd generation document database
We have a hierarchy where a Parent object contains a list of Child objects. Up till now we have stored these denormalised so a single Parent document also contains a fully populated list of Child objects. An index on the Parent documents was created that exposed properties of the Children. This allows us, for example to look for a parent who has both a child older than 8 and a female child. The index looks like this:

    public class Parent_ByChildProperties : AbstractIndexCreationTask<Parent>
    {
        public class Result
        {
            public string ParentId { get; set; }
            public string ParentName { get; set; }
            public IList<int> ChildAges { get; set; }
            public IList<string> ChildGenders { get; set; }
        }

        public Parent_ByChildProperties()
        {
            Map = funds => from parent in parents
                           select new
                           {
                               ParentId = parent.IdAsGuid,
                               ParentName = parent.Name,
                               ChildAges = parent.Children.Select(c => c.Age),
                               ChildGender = parent.Children.Select(c => c.Gender)
                           };
        }
    }


This works relatively well but has some issues.

Firstly, the number of Child properties we can search on is quite large and the number of children averages 8 but can go higher, which means the documents can get quite big. Since we only need the "Parent" information back from the search, we are pulling a lot more data back from the database than we need, effecting overall performance.

Secondly, our index looks at each Child property in isolation, which means it does not allow us to retrieve a Parent who has a Child that is older than 8 and that same child is also female.


We thought the solution to our problems would be to:
  •  Store the Parent document with just a list of Child Ids
  •  Store the Children as individual documents
  •  Use a fanout index to create an index entry for each Parent/Child combination
Our new index is along the lines of:

    public class Parent_ByChildProperties : AbstractIndexCreationTask<Parent>
    {
        public class Result
        {
            public string ParentId { get; set; }
            public string ParentName { get; set; }
            public int ChildAge { get; set; }
            public string ChildGender { get; set; }
        }

        public Parent_ByChildProperties()
        {
            Map = funds => from parent in parents
                           from childId in parent.childIds
                                  select new
                                  {
                                      ParentId = parent.Id,
                                      ParentName = parent.Name,
                                      ChildAge = LoadDocument<Child>(childId).Age,
                                      ChildGender = LoadDocument<Child>(childId).Gender
                                  };
        }
    }

 
This works fine and searches return the correct parent objects. The only problem is paging. The same issue was raised in this question https://groups.google.com/d/msg/ravendb/l2fkV6szrmQ/AryVlnS7CgAJ and similarly we don't want to force our clients to have to manage the "odd" paging method themselves... they don't care that paging doesn't work as they might expect because we are using RavenDB and it is paging through "Tampered Results". It seemed the only solution to Gluber's question was that he needed to rethink his design.


Are there any new thoughts on this issue? Can anyone suggest a re-modelling that might get round this problem?

As an aside, I realise the LoadDocument will make this index slow to apply... but in practice with a test database of 200,000 parents and 800,000 children, applying the index was relatively fast, if the index was already in the database before the data was loaded. It was extremely slow if the index was applied after the data was loaded. This appeared to be down to batch size. If the index was added to an already populated database, the batch size quickly ramped up (~13000) and the large batches worked at a much slower rate per Parent than the small batches (up to 5 times slower). If the index was in place before the data was loaded (200 parents at a time) the batch sizes stayed small (~1000) and the indexing was completed quickly. Is there a way to limit the batch size on an Index by Index basis?

Thank you for your time

Piers

Michael Yarichuk

unread,
Oct 19, 2016, 9:56:58 AM10/19/16
to RavenDB - 2nd generation document database
Couple of thoughts
1) Fanout causes performance issues in some use-cases. I'd try to avoid fanout if possible. 
2) It is possible to specify batch size limits for indexes only system-wide. I'd be very careful with those settings - I've seen legit looking config values cause heavy performance issues under several use-cases
3) LoadDocument in an index has another effect - when a document that is loaded in an index is changed, the relevant entry in an index will need to be reindexed as well. In the case you describe it would mean that changes to child documents would trigger change in Parent_ByChildProperties index. This will have perf overhead for the index in case of changes to child documents in parallel to indexing.
4) Why not make reverse relationship? --> I mean, if each child can have one parent, do an index over the children and load parent document into an index. This should avoid the fanout.

--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Best regards,

 

Hibernating Rhinos Ltd  cid:image001.png@01CF95E2.8ED1B7D0

Michael Yarichuk l RavenDB Core Team 

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

 

RavenDB paving the way to "Data Made Simple"   http://ravendb.net/  

Piers Lawson

unread,
Oct 19, 2016, 10:48:25 AM10/19/16
to RavenDB - 2nd generation document database
Thanks Michael

Yes, we considered using the reverse, i.e. having the index over the children. But that leaves us with the same / similar paging issue... say we want to return the first 25 parents, we don't know how many children we need to get from the "Child" based index in order to get 25 unique Parents. Remember, the client wants to page over Parents, not Children. When reading about Fanout indexes they seemed to give us exactly what we wanted... the ability to search for parents by their children. It wasn't until we put a Fanout index into practice that we realised that RavenDB was exposing how it is actually working under the covers rather than let us page nicely over the results. 

Note: the help on Fanout indexes https://ravendb.net/docs/article-page/3.0/csharp/indexes/fanout-indexes makes no reference to this limitation.

We could possibly do a Map Reduce over the children but that returns us to a situation where we can't search on a child that is both 11 and female.... we would only be able to find a parent that has a child that is 11 and a child that is female (i.e. they may have an 11 year old boy and an 6 year girl).

Is there a way to create an index that doesn't just contain value types and lists of value types? We really need an index entry which contains parent properties and a list in which each entry in the list is a "bag" of properties to represent each child. So each entry in the list would represent a child and have both an age and a gender property. Then we wouldn't need to fan out.

We realised using a Fanout index and LoadDocument would cause indexing to be slow. However, this database is loaded once a day (we get a daily feed of data) and therefore its contents are fixed once the index has been applied. Once loaded it works very well. The batch size issue is really just an aside... as the index was being applied, Raven DB seemed to be hunting for the best batch size, swapping between a large batch and a small batch ... but both the small and large batches were getting bigger over time. Consequently the rate at which the index was getting created slowed down dramatically the further on the process got. Our work around is to apply the index before loading the data, then the load process naturally keeps the batch size small.

Piers
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Oren Eini (Ayende Rahien)

unread,
Oct 20, 2016, 3:12:27 AM10/20/16
to ravendb
Piers,
Fanout causes problem when the size is unbounded, for example, you have 3000 index entries per document.
For smaller stuff (a few dozens), that shouldn't be an issue

Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: + 972-52-548-6969

To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.

Piers Lawson

unread,
Oct 20, 2016, 5:50:05 AM10/20/16
to RavenDB - 2nd generation document database
Thanks for your information Oren.

Regarding the performance, in my test each parent only fanned out to four or five children, so it sounds like that shouldn't be an issue... these were the stats I saw as the index was applied:

#Inputs Total Time(ms)  Avg Time(ms) per input
102     2782            27.2745098
205     3974            19.38536585
409     9565            23.38630807
819     17611           21.5030525
1639    30572           18.6528371
3277    51196           15.62282576
6553    157914          24.0979704
9831    471126          47.92249008
5335    85172           15.96476101
11431   664623          58.14215729
5870    119291          20.32214651
12807   851373          66.47716093
7938    362983          45.72726127
13107   813348          62.05447471
13108   853396          65.10497406
13107   854950          65.22850385
13107   915089          69.81681544
13107   927475          70.76180667
13107   1553730         118.5420005
13108   985343          75.17111688
13107   1026358         78.30609598
13107   1041021         79.42481117
13107   1035646         79.01472496
2717    63526           23.38093485

This shows the batch size gradually increasing and as it does so the average time to process a single input also increased.

I realise now I should not have mentioned performance in the same question as I asked about the fanout functionality... the two responses so far have largely been about the performance of applying the index (for which we have a work around) ;-)

My real question is about how to page through search results without having to expose the internal workings of RavenDB to our clients. I get that the implementation of a Fanout Index means the underlying index has an entry for each Parent/Child combination, but I still think of it as an index against the Parent documents and would generally be using it to find Parents, therefore conceptually (perhaps my SQL experience coming out here) I would expect to be able to get a TotalCount that is based on Parents and be able to Skip / Take based on Parents. The fact that RavenDB implements this internally by applying a Distinct query would ideally not be something that is surfaced to me as a user of RavenDb... I would just see the TotalCount reflecting the number of matching Parent objects and SkippedResults always being zero.

I guess it works how it works and we will have to drop this approach and return to the less ideal index that provides the "right" statistics but doesn't let us look for a child with two matching properties. The only alternative I can see, that avoids our clients needing to track "SkippedResults" is for us to iterate through the pages until we get the page the client requested... which still doesn't address getting a TotalCount that reflects the correct number of parent objects that a search could return.

Thank you for your thoughts so far.

Oren Eini (Ayende Rahien)

unread,
Oct 20, 2016, 10:10:30 AM10/20/16
to ravendb
Focusing on just the fanout issue.
Index like this:
select new
                           {
                               ParentId = parent.IdAsGuid,
                               ParentName = parent.Name,
                               ChildAges = parent.Children.Select(c => c.Age),
                               ChildGender = parent.Children.Select(c => c.Gender)
                           };

Is preferred, because of the reasons you outlined.

To go with the SQL metaphor, fanout would be a cartesian product.

Note that you can do complex queries on the children, if you know what they are upfront.

select new 
{
   ChildQuery = parent.Children.Select(c=>new{ c.Age, c.Gender})
}

And then query using the full object on the client side.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.

Piers Lawson

unread,
Oct 20, 2016, 11:05:46 AM10/20/16
to RavenDB - 2nd generation document database
Thanks Oren

Yes, it looks like we will have to stick with the Index that stores the Children's properties in individual lists and put up with not being able to look for Parents who have an individual Child that matches both of the search criteria.

Regarding your complex queries... I think if I'm understanding you correctly you are suggesting if we are only going to support a small number of queries we could set up an index specific to each query. We do know the properties we want to search by, but not the combination that a client might want. Imagine the Child also has height, eye colour and hair colour properties... and that the client can search by any combination of properties (and ideally ranges such as aged between 5 and 10). Would your approach require 31 different indexes to cope with every combination of the 5 child properties? If not, I'm not quite sure what your index would look like... perhaps this?:

    public class Parent_ByChildProperties : AbstractIndexCreationTask<Parent>
    {
        public class Result
        {
            public string ParentId { get; set; }
            public string ParentName { get; set; }

            public ???? ChildQuery { get; set; }
        }

        public Parent_ByChildProperties()
        {
            Map = funds => from parent in parents
                           select new
                           {
                               ParentId = parent.IdAsGuid,
                               ParentName = parent.Name,

                               ChildQuery = parent.Children.Select(c => new {c.Age, c.Gender, c.Height, c.EyeColor, c.HairColor})
                           };
        }
    }


And how could we query that? Perhaps I'm misunderstanding your suggestion.

It is such a shame the fanout paging is difficult because the index works perfectly otherwise ;-)

Oren Eini (Ayende Rahien)

unread,
Oct 21, 2016, 1:05:58 AM10/21/16
to ravendb
Take a look at query intersection, it might be what you want
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.

Piers Lawson

unread,
Oct 21, 2016, 5:27:07 AM10/21/16
to RavenDB - 2nd generation document database
Intersections look interesting, but the results and statistics using it against my fanout index seem very strange. For example the SkippedResults is greater than the TotalCount. Perhaps the Query help pages could explain what information will be returned by Statistics() for each type of query. I certainly can't work out how I could page over the results from a query such as:

 var intersections = session
  .Query<Parent_ByChildProperties.Result, Parent_ByChildProperties>()
  .Statistics(out stats)
  .Where(x => x.Age == 6 && x.Gender == "Female")
  .Intersect()
  .Where(x => x.Age == 10 && x.Gender == "Male")
  .OrderBy(x => x.ParentName)
  .OfType<Parent>()
  .ToList();

Intersections are interesting, but are more complex than we are after... I simply want to be able have a Parent document that contains/refernces a list of Child objects and be able to write a query such as:

 var intersections = session
  .Query<Parent_ByChildProperties.Result, Parent_ByChildProperties>()
  .Statistics(out stats)
  .Where(child => child.Age == 6 && child.Gender == "Female")
  .OrderBy(x => x.ParentName)
  .OfType<Parent>()
  .ToList();

And to be able to get a count of the total number of distinct Parent objects that have a child that is both aged 6 and female. Then at the same time be able to use Skip and Take to a specific point in that distinct list of Parents without having to track a variable such as SkippedResults, for example I'd like to write:

 var intersections = session
  .Query<Parent_ByChildProperties.Result, Parent_ByChildProperties>()
  .Statistics(out stats)
  .Where(child => child.Age == 6 && child.Gender == "Female")
  .OrderBy(x => x.ParentName)
  .OfType<Parent>()
  .Skip(page * pageSize).Take(pageSize)
  .ToList();

The only approach that seems to allow me to write Where(child => child.Age == 6 && child.Gender == "Female") in the query (and for those to be applied to the same Child) is to use Fanout Indexes... but this type of index does not provide any way to find the total number of objects, of the type being indexed, that could be returned and no simple way to go to a specific page of the results.

I'm still not sure how your suggestion to do complex queries would work:

Note that you can do complex queries on the children, if you know what they are upfront.
 
select new 
{
   ChildQuery = parent.Children.Select(c=>new{ c.Age, c.Gender})
}
 
And then query using the full object on the client side.
Thank you for your time on this... it is very much appreciated... and this issue is certainly not stopping us using Raven DB (which is performing very well for us) and I'm learning more about its capabilities!

Oren Eini (Ayende Rahien)

unread,
Oct 21, 2016, 8:52:42 AM10/21/16
to ravendb
The problem is that doing the kind of query you want dynamically doesn't allow us to use an index, it would require a table scan, which isn't desired.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages