Document Counts in presence of fanout indices

Gluber

Nov 27, 2015, 8:17:25 PM
to RavenDB - 2nd generation document database
Hi !

I am currently working on a difficult problem and have reached the point where I think RavenDB cannot help me any further (it is already doing enough in our case), but I just want to check whether I am missing something before I do some crazy stuff to make my case work :-)

I have "Offer" documents and an index that fans out and creates a few index entries per "Offer".

When I query this index I need to get each matching document only once, even if multiple index entries match for that document.

This already works fine when I project the query back to the actual document type.

However, the count reported by the query is off: it returns the number of matching index entries instead of the number of matching documents.
(I need the count for paging through the result set.)

This behaviour makes total sense to me and I am prepared to work around it. However, I want to know whether I am missing something and RavenDB can already solve this for me and return the count in documents instead of index entries.
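To make the mismatch concrete, here is a minimal in-memory sketch in plain LINQ, with no RavenDB involved. `IndexEntry` and `FanoutCountDemo` are made-up names for illustration only, and the two entries mirror the two inheritance paths that the test data further down would produce for a single offer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for one fanout index entry; not a RavenDB type.
public sealed class IndexEntry
{
    public string DocumentId { get; set; }
    public string[] AvailabilityZipCodes { get; set; }
}

public static class FanoutCountDemo
{
    // One offer reachable via two paths, so the "index" holds two entries for it.
    public static readonly IndexEntry[] Entries =
    {
        new IndexEntry { DocumentId = "offers/1", AvailabilityZipCodes = new[] { "1020", "1030" } },
        new IndexEntry { DocumentId = "offers/1", AvailabilityZipCodes = new[] { "1010", "1020" } },
    };

    // Counts matching index entries: a document is counted once per matching entry.
    public static int CountEntries(string zip)
    {
        return Entries.Count(e => e.AvailabilityZipCodes.Contains(zip));
    }

    // Counts matching documents: each document is counted once, which is what paging needs.
    public static int CountDocuments(string zip)
    {
        return Entries.Where(e => e.AvailabilityZipCodes.Contains(zip))
                      .Select(e => e.DocumentId)
                      .Distinct()
                      .Count();
    }
}
```

For zip 1020 both entries match, so `CountEntries` and `CountDocuments` disagree; for the other zip codes they agree.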

(In case you're wondering: I have a hierarchy of leaflet flights, leaflets, leaflet pages and offers. At each level certain properties need to be overridable and are inherited down to the actual offers. Since some data is shared between levels, this leads to a classic diamond inheritance problem, where I have to create one index entry per inheritance path.)

Here is my simple code (case 4, the query for zip code 1020, returns the wrong count):

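    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Raven.Client;
    using Raven.Client.Indexes;
    using Raven.Tests.Helpers;

    public class LeafletFlight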
    {
        public LeafletFlight()
        {
            LeafletReferences = new List<LeafletReference>();
            AvailabilityZipCodes = new List<string>();
        }

        public string                   Id { get; set; }

        public IList<LeafletReference>  LeafletReferences { get; set; } 

        public IList<string>            AvailabilityZipCodes { get; set; } 
    }

    public class LeafletReference
    {
        public string                   LeafletId { get; set; }
    }

    public class Leaflet
    {
        public Leaflet()
        {
            AvailabilityZipCodes = new List<string>();
            LeafletPageReferences = new List<LeafletPageReference>();
        }

        public string                       Id { get; set; }

        public string                       LeafletFlightId { get; set; }

        public IList<LeafletPageReference>  LeafletPageReferences { get; set; } 

        public IList<string>                AvailabilityZipCodes { get; set; }
    }

    public class LeafletPageReference
    {
        public string                       LeafletPageId { get; set; }
    }

    public class LeafletPage
    {
        public LeafletPage()
        {
            OfferReferences = new List<OfferReference>();
            LeafletIds = new List<string>();
        }

        public string                       Id { get; set; }

        public IList<OfferReference>        OfferReferences { get; set; } 

        public IList<string>                LeafletIds { get; set; } 
    }

    public class OfferReference
    {
        public OfferReference()
        {
            AvailabilityZipCodes = new List<string>();
        }

        public string           OfferId { get; set; }

        public IList<string>    AvailabilityZipCodes { get; set; } 
    }

    public class Offer
    {
        public Offer()
        {
            AvailabilityZipCodes = new List<string>();
            LeafletPageIds = new List<string>();
        }

        public string           Id { get; set; }

        public string           Name { get; set; }

        public IList<string>    LeafletPageIds { get; set; } 

        public IList<string>    AvailabilityZipCodes { get; set; }
    }

    public class OfferIndex : AbstractIndexCreationTask<Offer>
    {
        public class Entry
        {
            public IEnumerable<string> AvailabilityZipCodes { get; set; } 

            public int PathCount { get; set; }
        }

        public OfferIndex()
        {
            Map = offers => from offer in offers
                let paths =
                    offer.LeafletPageIds.Select(p => LoadDocument<LeafletPage>(p))
                        .SelectMany(p => p.LeafletIds.Select(l => new
                        {
                            LeafletPage = p,
                            Leaflet = LoadDocument<Leaflet>(l),
                            LeafletFlight = LoadDocument<LeafletFlight>(LoadDocument<Leaflet>(l).LeafletFlightId),
                            Offer = offer
                        }))
                from path in paths
                let offerAvailabilityZips = path.Offer.AvailabilityZipCodes 
                let offerReferenceAvailabilityZips = path.LeafletPage.OfferReferences.First(o => o.OfferId == path.Offer.Id).AvailabilityZipCodes
                let leafletAvailabilityZips = path.Leaflet.AvailabilityZipCodes
                let leafletFlightAvailabilityZips = path.LeafletFlight.AvailabilityZipCodes
                select new
                {
                    // The first non-empty list wins: offer, then offer reference, then leaflet, then leaflet flight.
                    AvailabilityZipCodes = offerAvailabilityZips.Any()
                        ? offerAvailabilityZips
                        : (offerReferenceAvailabilityZips.Any()
                            ? offerReferenceAvailabilityZips
                            : (leafletAvailabilityZips.Any() ? leafletAvailabilityZips : leafletFlightAvailabilityZips))
                };

            MaxIndexOutputsPerDocument = 2048;
        }
    }

    public class Test : RavenTestBase
    {
        public void Execute()
        {
            using (var store = NewDocumentStore(port:8084,indexes:new AbstractIndexCreationTask[] {new OfferIndex()}))
            {
                using (var session = store.OpenSession())
                {
                    var leafletFlight = new LeafletFlight();

                    leafletFlight.AvailabilityZipCodes.Add("1010");
                    leafletFlight.AvailabilityZipCodes.Add("1020");

                    session.Store(leafletFlight);

                    var leaflet1 = new Leaflet();
                    leaflet1.AvailabilityZipCodes.Add("1020");
                    leaflet1.AvailabilityZipCodes.Add("1030");
                    leaflet1.LeafletFlightId = leafletFlight.Id;
                    
                    var leaflet2 = new Leaflet();
                    leaflet2.LeafletFlightId = leafletFlight.Id;

                    session.Store(leaflet1);
                    session.Store(leaflet2);

                    leafletFlight.LeafletReferences.Add(new LeafletReference() { LeafletId = leaflet1.Id});
                    leafletFlight.LeafletReferences.Add(new LeafletReference() { LeafletId = leaflet2.Id});

                    var leafletPage = new LeafletPage();
                    leafletPage.LeafletIds.Add(leaflet1.Id);
                    leafletPage.LeafletIds.Add(leaflet2.Id);
                    
                    session.Store(leafletPage);

                    leaflet1.LeafletPageReferences.Add(new LeafletPageReference() {LeafletPageId = leafletPage.Id});
                    leaflet2.LeafletPageReferences.Add(new LeafletPageReference() {LeafletPageId = leafletPage.Id});

                    var offer = new Offer();
                    offer.Name = "Test01";
                    offer.LeafletPageIds.Add(leafletPage.Id);

                    session.Store(offer);

                    leafletPage.OfferReferences.Add(new OfferReference() { OfferId = offer.Id});

                    session.SaveChanges();

                    WaitForIndexing(store);

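                    // Expected document counts per zip code. Zip 1020 is reachable via both
                    // leaflet paths, so the index holds two matching entries for the single offer.
                    ValidateCount(session,"1050",0);
                    ValidateCount(session,"1010",1);
                    ValidateCount(session,"1030",1);
                    ValidateCount(session,"1020",1);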
                }
            }
        }

        private void ValidateCount(IDocumentSession session, string zipCode,int count)
        {
            RavenQueryStatistics statistics = null;

            var queryResult =
                session.Query<OfferIndex.Entry, OfferIndex>()
                    .Statistics(out statistics)
                    .Where(o => o.AvailabilityZipCodes.Any(z => z == zipCode)).As<Offer>().ToList();

            if (count != statistics.TotalResults)
            {
                Console.WriteLine($"INVALID for zip {zipCode}, expected {count}, got {statistics.TotalResults}");
            }
            else
            {
                Console.WriteLine($"VALID for zip {zipCode}");
            }
        }
    }

Gluber

Nov 27, 2015, 11:08:01 PM
to RavenDB - 2nd generation document database
which explains paging through tampered results ( which is my case )

However, this seems to be inadequate for my usage, since:
a) It requires keeping a running sum of skipped results (hard to do in a RESTful web API scenario, since you would probably need the client to do this).
b) It requires paging through the results in sequence (so no jumping to the last page, for example).
c) It does not give a proper total result count, so it cannot be used to display a pagination UI in the first place.
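Points (a) and (b) can be illustrated with a small in-memory simulation. This is only a sketch under simplifying assumptions, not RavenDB's implementation: the index is modelled as a flat list of document ids (one element per index entry), and `SkippedPagingDemo`, `ReadPage` and `ReadAllPages` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SkippedPagingDemo
{
    // Reads up to pageSize distinct document ids, starting at entry offset `start`.
    // `skipped` reports how many entries were passed over because their document
    // had already been seen (either before `start` or earlier in this page).
    public static List<string> ReadPage(IReadOnlyList<string> entryDocIds, int start, int pageSize, out int skipped)
    {
        var seen = new HashSet<string>(entryDocIds.Take(start));
        var page = new List<string>();
        skipped = 0;
        for (int i = start; i < entryDocIds.Count && page.Count < pageSize; i++)
        {
            if (seen.Add(entryDocIds[i]))
                page.Add(entryDocIds[i]);
            else
                skipped++;
        }
        return page;
    }

    // Pages through everything in sequence. The start offset of page N depends on the
    // skipped counts of all pages before it, which is why jumping ahead is not possible.
    public static List<List<string>> ReadAllPages(IReadOnlyList<string> entryDocIds, int pageSize)
    {
        var pages = new List<List<string>>();
        int pageNumber = 0;
        int totalSkipped = 0;
        while (true)
        {
            int skipped;
            int start = pageNumber * pageSize + totalSkipped;
            var page = ReadPage(entryDocIds, start, pageSize, out skipped);
            if (page.Count == 0) break;
            pages.Add(page);
            totalSkipped += skipped;
            pageNumber++;
        }
        return pages;
    }
}
```

The caller has to carry `totalSkipped` from one request to the next, which is the state that is awkward to keep in a stateless web API.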

Michael Yarichuk

Nov 28, 2015, 2:59:01 AM
to RavenDB - 2nd generation document database
Before all else, note that your index is potentially doing _lots_ of LoadDocument() calls. When you do this, RavenDB stores a reference between the documents, so that when one of them changes, re-indexing occurs. This is needed because if a document that you loaded, and whose values you used while building an index entry, has changed, the data in the index entry might be invalid.

This is usually not an issue, but if used too much it may become a performance problem.

Gluber

Nov 28, 2015, 10:49:54 AM
to RavenDB - 2nd generation document database
Thanks for your hint.

I am fully aware of that, and it sucks (not RavenDB's fault) that I have to do it this way, but I see no other solution for our problem, sadly.

Oren Eini (Ayende Rahien)

Nov 29, 2015, 4:09:45 AM
to ravendb
Gluber,
The easiest solution would be for you to avoid emitting multiple index entries. You can write your index to be (gmail code):
// Untested sketch: emits a single index entry per offer by merging all paths.
Map = offers => from offer in offers
    let leafletPages = offer.LeafletPageIds.Select(p => LoadDocument<LeafletPage>(p))
    let leaflets = leafletPages.SelectMany(p => p.LeafletIds).Select(l => LoadDocument<Leaflet>(l))
    // Each level becomes null when it has no zip codes, so ?? falls through to the next level.
    let offerAvailabilityZips = offer.AvailabilityZipCodes.Any() ? offer.AvailabilityZipCodes : null
    let offerReferenceAvailabilityZips_Temp = leafletPages.SelectMany(x => x.OfferReferences)
          .Where(o => o.OfferId == offer.Id).SelectMany(o => o.AvailabilityZipCodes)
    let offerReferenceAvailabilityZips = offerReferenceAvailabilityZips_Temp.Any() ? offerReferenceAvailabilityZips_Temp : null
    let leafletAvailabilityZips_Temp = leaflets.SelectMany(x => x.AvailabilityZipCodes)
    let leafletAvailabilityZips = leafletAvailabilityZips_Temp.Any() ? leafletAvailabilityZips_Temp : null
    let leafletFlightAvailabilityZips = leaflets.Select(l => LoadDocument<LeafletFlight>(l.LeafletFlightId)).SelectMany(x => x.AvailabilityZipCodes)
    select new
    {
       AvailabilityZipCodes =
           offerAvailabilityZips ?? offerReferenceAvailabilityZips ?? leafletAvailabilityZips ?? leafletFlightAvailabilityZips
    };


Gluber

Nov 29, 2015, 8:51:27 AM
to RavenDB - 2nd generation document database
Thanks Oren for your help.

However, your idea would not work (as long as I have not missed anything). If there were only one valid path to consider, then yes, this would work.

But in our case offers get referenced in multiple pages with different settings, for example, so you get multiple valid paths and index values.

The paths have to be queried in an "OR" fashion: the offer should be found only if at least one path matches in full by itself (and only once if several paths match).

I'll write an example:

We have leaflet 1 and leaflet 2 (by the way, we are talking about advertisement leaflets that get mailed out).
Both have the same first page (but other pages might be different).

On this shared first page we have an offer for, let's say, beef. This offer is also shared.

Leaflet 1 gets mailed to zip codes 101 and 102 and is valid from December 1st.
Leaflet 2 gets mailed to zip codes 103 and 104 and is valid from December 10th.

Now I have two distinct paths, each of which must match in full by itself, but which are combined by an OR.

So the offer should be returned when searching for zip code 101 from December 1st onwards, but when searching for 103 it should only be returned from December 10th onwards.

If I just combine everything, this no longer works: a search for 103 would also return the offer for December 2nd, for example.

I omitted the validity date issue before since I just wanted a simple example to illustrate the multiple path problem.

(Also note that validity dates are not the only additional path values; publish dates, visibility settings and workflow states are included too.)
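The leaflet example above can be sketched in memory to show why merging all paths into one entry gives false positives. This is plain LINQ with hypothetical names (`OfferPath`, `PathMatchingDemo`); the year 2015 is an assumption, since the example only gives day and month.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one inheritance path of an offer; not a type from the index above.
public sealed class OfferPath
{
    public string OfferId { get; set; }
    public string[] ZipCodes { get; set; }
    public DateTime ValidFrom { get; set; }
}

public static class PathMatchingDemo
{
    // The shared beef offer, reachable through leaflet 1 and leaflet 2 (year assumed).
    public static readonly OfferPath[] Paths =
    {
        new OfferPath { OfferId = "offers/beef", ZipCodes = new[] { "101", "102" }, ValidFrom = new DateTime(2015, 12, 1) },
        new OfferPath { OfferId = "offers/beef", ZipCodes = new[] { "103", "104" }, ValidFrom = new DateTime(2015, 12, 10) },
    };

    // One entry per path: zip code and date must both match on the same path.
    public static bool MatchesPerPath(string zip, DateTime date)
    {
        return Paths.Any(p => p.ZipCodes.Contains(zip) && p.ValidFrom <= date);
    }

    // One merged entry: zip codes and dates are flattened, so their pairing is lost.
    public static bool MatchesMerged(string zip, DateTime date)
    {
        var allZips = Paths.SelectMany(p => p.ZipCodes);
        var allValidFrom = Paths.Select(p => p.ValidFrom);
        return allZips.Contains(zip) && allValidFrom.Any(d => d <= date);
    }
}
```

Searching for zip 103 on December 2nd matches the merged entry (zip 103 comes from leaflet 2, the date from leaflet 1) but matches neither path on its own.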

You might ask why we use such a complicated structure. The problem is that, for one vendor here in Germany for example, we get 1000 leaflets per week which are 90% identical and differ only in some pages, and we need to extract the data from them via a manual process. That's why we try to deduplicate and share as much data as possible. I would much prefer a simple leaflet-offer structure, which would make all this a piece of cake, but we had that before and ran into volume issues.

Kijana Woodard

Nov 29, 2015, 9:53:59 AM
to rav...@googlegroups.com
Is it possible the model is doing too much? What happens if you separate leaflet design/construction from leaflet delivery?

Gluber

Nov 30, 2015, 11:45:47 AM
to RavenDB - 2nd generation document database
Thanks for the input, I just did not expect that much interest and help with my problem :-)

As for your tip, that is not a solution either.

I should probably explain a bit more: the company I am doing this for has nothing to do with leaflet delivery at all.
They basically take leaflets from other vendors, index them, and make the offers in them available to users via mobile apps.

Basically an "OK, I need to buy some beer, where is it on offer in my vicinity?" type of use case.

We have to model where the offers in a leaflet are available location-wise (hence the zip codes), but also when each offer is "valid" time-wise. (There are further criteria that I omitted.)
The complicated model stems from the fact that, for example, a large chain of stores puts out a leaflet for EACH store EACH week, which contains 90% duplicated offers but also some special offers PER store that have different validity dates etc.

 

Kijana Woodard

Nov 30, 2015, 11:53:46 AM
to rav...@googlegroups.com
Yes. But your index, which answers the question about leaflet availability [zips/dates], starts at the page level.

If you separated the what [pages/store-level leaflet] from the when/where [zips/dates], it might sort things out naturally.

Chris Marisic

Nov 30, 2015, 12:28:52 PM
to RavenDB - 2nd generation document database


On Saturday, November 28, 2015 at 9:49:54 AM UTC-6, Gluber wrote:
 but i see no other solution for our problem sadly.


The moment this statement is uttered, you have likely solved the entirely wrong problem, which then leads to a cascade of problems from the wrong solution.

Federico Lois

Nov 30, 2015, 4:27:43 PM
to rav...@googlegroups.com
You run into problems like this when you are really trying to solve a summarization problem but are storing only the raw data for the summarization. For better performance and your own sanity, my suggestion is to move to an out-of-process summarization using data subscriptions. Then your indexes will be dead simple, your code performance will be far better, and your code will become something you can reason about again.

You may need to use a few hashing tricks to avoid processing an element twice, but that's about it.
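One possible reading of the hashing trick, sketched in memory: fingerprint each element's content and skip anything whose fingerprint has already been handled. This is an assumption about what is meant, not a description of data subscriptions, and `DedupDemo`, `Fingerprint` and `ProcessOnce` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class DedupDemo
{
    // Stable fingerprint of an element's content: equal content gives an equal fingerprint.
    public static string Fingerprint(string content)
    {
        using (var sha = SHA256.Create())
        {
            var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(content));
            return BitConverter.ToString(bytes);
        }
    }

    // Calls `process` once per distinct content and returns how many elements were processed.
    public static int ProcessOnce(IEnumerable<string> items, Action<string> process)
    {
        var seen = new HashSet<string>();
        var processed = 0;
        foreach (var item in items)
        {
            // Add returns false when the fingerprint is already in the set.
            if (!seen.Add(Fingerprint(item))) continue;
            process(item);
            processed++;
        }
        return processed;
    }
}
```

In a real out-of-process summarizer the set of seen fingerprints would have to be persisted somewhere; here it lives only for the duration of one call.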
