Patterns for Date Range Index Queries


Chris Mcvittie

Nov 22, 2012, 12:24:24 PM
to rav...@googlegroups.com
Hi, 
I have the following model:

class Plan {
  string Name {get;set;}
  DateTime StartDate {get;set;}
  DateTime EndDate {get;set;}
  List<string> CategoryIDs {get;set;}
  int RequiredHours {get;set;}
}

class Log {
  DateTime Date {get;set;}
  int Hours {get;set;}
  string CategoryId {get;set;}
}

And this desired result:
class PlanSearchResult {
  string PlanName {get;set;}
  int RequiredHours {get;set;}
  int LoggedHours {get;set;}
}

And I need to calculate the "Logged Hours" for a plan, which is the sum of the Hours property on all logs within the plan's start and end date, where the log category matches any of the plan's categories.  I will typically have 2-3 plans, and 2000 or so logs, of which 200 will fall within the date range, and of those most will match a category in the plan.

From an indexing strategy, I'm not sure what to do.  I need per-day granularity on the date range, so I could create an index with the sum of hours grouped by category id and YYYY-MM-DD, and then query this index for all my plans.

Or alternatively, as I'm fronting the DB with an API, and data access is fast do I just pull everything back with queries, and aggregate in API memory?

Or am I missing a better way of doing things, either from an indexing perspective or a modelling one?
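For illustration, the "aggregate in API memory" option would look roughly like this (just a sketch, assuming `plans` and `logs` have already been pulled back by two broad queries):

```csharp
// Sketch of the in-memory option: pull everything back, aggregate in the API.
// Assumes 'plans' and 'logs' were already loaded by two queries.
var results = plans.Select(plan => new PlanSearchResult
{
    PlanName = plan.Name,
    RequiredHours = plan.RequiredHours,
    LoggedHours = logs
        .Where(log => log.Date >= plan.StartDate
                   && log.Date <= plan.EndDate
                   && plan.CategoryIDs.Contains(log.CategoryId))
        .Sum(log => log.Hours)
}).ToList();
```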

Thanks, 
Chris




Matt Johnson

Nov 22, 2012, 4:50:27 PM
to rav...@googlegroups.com
A few questions first:
 
1) Are Plan and Log only related by their CategoryIds?  If so, then you don't really log against a plan, but against a category, and then the plan contains allowed categories, correct?  If not, then I would expect a PlanId on each Log item.
 
2) What sort of DateTime values are in each Log?  Is it just a local calendar date, as in 5 hours on 1/1 and 6 hours on 1/2, or is it at specific time-of day?  How about on the dates in the Plan?
 
3) Are your StartDate and EndDate ranges fully inclusive?  Inclusive/Exclusive? Or something else?  How do you deal with overlaps?
 
4) Are all of your users in the same time zone? Are you having to translate between time zones, or back and forth to UTC?
 
5) Is your math always strictly a sum of integer hours? Or do you do ANY math against the actual date values?  If so, please elaborate.  It matters.
 
Context is very important with datetime concerns.  I can better answer with more information.
 
Thanks,
Matt

Oren Eini (Ayende Rahien)

Nov 22, 2012, 6:44:57 PM
to rav...@googlegroups.com
In addition, is it possible to have a log against a category on a plan _outside_ the Plan's dates?

Chris Mcvittie

Nov 23, 2012, 5:17:35 AM
to rav...@googlegroups.com
Replies inline.

Thanks


On Thursday, 22 November 2012 23:45:19 UTC, Oren Eini wrote:
In addition, is it possible to have a log against a category on a plan _outside_ the Plan's dates?
Yes - logs and plans are independent. Typically someone will have a 12 month plan, and then another for the next year.  They then just log to their journal on a daily/weekly basis.

On Thu, Nov 22, 2012 at 11:50 PM, Matt Johnson <mj1...@hotmail.com> wrote:
A few questions first:
 
1) Are Plan and Log only related by their CategoryIds?  If so, then you don't really log against a plan, but against a category, and then the plan contains allowed categories, correct?  If not, then I would expect a PlanId on each Log item.
Yes - logs and plans are independent. Typically someone will have a 12 month plan, and then another for the next year.  They then just log to their journal on a daily/weekly basis. 
 
2) What sort of DateTime values are in each Log?  Is it just a local calendar date, as in 5 hours on 1/1 and 6 hours on 1/2, or is it at specific time-of day?  How about on the dates in the Plan?
Logs have a StartDateTime and EndDateTime; however, for the purposes of plan hour calculation, it should be based on the start date.  So logging "Reading for 2 hours from 3pm-5pm on 23/11/12" - from a plan perspective, 2 hours and 23/11/12 are the only important bits of information.
 
3) Are your StartDate and EndDate ranges fully inclusive?  Inclusive/Exclusive? Or something else?  How do you deal with overlaps?
Inclusive.  By using only start date a log will either contribute to a plan or it won't.  Plans can overlap, so the same log can contribute to multiple plans.
 
4) Are all of your users in the same time zone? Are you having to translate between time zones, or back and forth to UTC?
Presently same time zone.
 
5) Is your math always strictly a sum of integer hours? Or do you do ANY math against the actual date values?  If so, please elaborate.  It matters.
It is a sum of integer hours - no math on date values - they are only used to identify logs.

Matt Johnson

Nov 23, 2012, 2:40:14 PM
to ravendb
Ok, so you say your Logs actually have StartDateTime and EndDateTime,
and you are calculating the Hours.  Be careful there - even in the
same timezone, you may have ambiguity around daylight saving time
transitions.  In general with .NET (not just Raven), you never want to
fully trust a duration calculated by the difference between two
DateTime values.  You can, however, trust a duration calculated by the
difference between two DateTimeOffset values.  So my first
recommendation would be to use DateTimeOffset in your Log items.  You
say you are "presently" in a single time zone.  If this were to ever
change, you would need to do that anyway.  You can certainly still
take the duration ahead of time, before putting it in the Hours field
 - or you could calculate it on the fly if you were to store both
start and end times.  You may also want to consider a decimal hours
field instead of integer - unless you are rounding to the nearest hour
or something.
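A quick sketch of that ambiguity (the offsets here assume the UK's 2012 spring-forward on 25 March):

```csharp
// Two DateTime values, 01:30 and 03:30 on a spring-forward day, look
// like 2 hours apart - but only 1 hour of real time elapsed.
// DateTimeOffset carries the UTC offset, so the subtraction is exact:
var start = new DateTimeOffset(2012, 3, 25, 1, 30, 0, TimeSpan.Zero);         // before the clocks change, UTC+0
var end   = new DateTimeOffset(2012, 3, 25, 3, 30, 0, TimeSpan.FromHours(1)); // after the change, now UTC+1
TimeSpan elapsed = end - start; // 1 hour of elapsed time, not 2
```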

You also need to consider what you want to do with Logs that cross
midnight.  On the surface, you may just say that it all goes to the
start date.  But what if it's a 10-hour duration and only 5 minutes
of it fell before midnight? Are you sure you want it all on the prior
day?  Something to consider in your application design.

With regard to the plan dates, consider a plan that goes from Jan 1
2012 to Dec 31 2012. What value are you storing in the EndDate?
 2012-12-31T00:00:00  or  2012-12-31T23:59:59 ? The latter will work,
but it's somewhat of a hack in that you have to manipulate the time
and be aware of the precision.  The former will only work if you never
log time on Dec 31, or if you are translating your log time to
midnight with .Date  - which is also a hack (IMHO).  A better solution
is to use inclusive/exclusive ranges, basically store
2013-01-01T00:00:00 for the end date.  A date is inside the range if
StartDate <= theDateInQuestion < EndDate.  However, I must admit, I am
extremely opinionated in this area, and many people prefer the fully
inclusive 23:59:59 approach.
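As a sketch, the half-open test is just:

```csharp
// Inclusive/exclusive range test: EndDate is stored as the first moment
// *after* the plan, e.g. 2013-01-01T00:00:00 for a plan covering all of 2012.
static bool InRange(DateTime date, DateTime startDate, DateTime endDate)
{
    return startDate <= date && date < endDate;
}
```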

Now to your original question, what to do about indexing.  You
identified that you need per-day granularity, so that helps.  I would
approach this with a map/reduce index of logs, grouped by category and
calendar date (at midnight via .Date property), summing the duration
in the reduce. I also think you are missing a UserId of some type
here?  You may want to include that in the Log and group by it as
well.  If you are just getting the plan totals for a single plan, then
you can query directly into this index with the date range in question
for the categories in your plan (probably you need to
use .In(categories))
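That index might look something like this (a sketch only, using the property names from the original model and assuming a single Date on each log):

```csharp
// Map/reduce: sum of hours grouped by { CategoryId, calendar date }.
public class LogHours_ByCategoryAndDate
    : AbstractIndexCreationTask<Log, LogHours_ByCategoryAndDate.Result>
{
    public class Result
    {
        public string CategoryId { get; set; }
        public DateTime Date { get; set; }
        public int Hours { get; set; }
    }

    public LogHours_ByCategoryAndDate()
    {
        Map = logs => from log in logs
                      select new
                      {
                          log.CategoryId,
                          Date = log.Date.Date, // truncate to midnight
                          log.Hours
                      };

        Reduce = results => from r in results
                            group r by new { r.CategoryId, r.Date } into g
                            select new
                            {
                                g.Key.CategoryId,
                                g.Key.Date,
                                Hours = g.Sum(x => x.Hours)
                            };
    }
}
```

You would then query it per plan with something like `.Where(r => r.CategoryId.In(plan.CategoryIDs) && r.Date >= start && r.Date <= end)` and sum the hours client-side.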

However, if you want a list of all plans including their totals, then
you should build a multi-map index. You'll map from Plan and Log
separately, and then join them together in the reduce, using similar
logic (date between date range, category in categories, etc.).  This
may be more challenging to write, but it will avoid a SELECT N+1 issue
with just calling the other index for each plan.

Hope this sheds some light on the issues you will encounter.  BTW - I
have very similar concerns in my core work application doing employee
time-and-attendance tracking.  If I can be of any further help, please
ask.

-Matt


Chris Mcvittie

Nov 27, 2012, 4:47:56 AM
to rav...@googlegroups.com
However, if you want a list of all plans including their totals, then 
you should build a multi-map index. You'll map from Plan and Log 
separately, and then join them together in the reduce, using similar 
logic (date between date range, category in categories, etc.).  This 
may be more challenging to write, but it will avoid a SELECT N+1 issue 
with just calling the other index for each plan. 

For now, this is what I need - a list of all plans and their totals.  So I tried it - but my effort isn't right.  I've put my code below, but I can't quite work out how to make a log's hours "count towards" multiple plans.  But maybe that is a symptom of me doing it wrong.

 public class PlanStats : AbstractMultiMapIndexCreationTask<PlanStats.PlanLogStats>
    {
        public class PlanLogStats
        {
            public int LogHours { get; set; }
            public DateTime LogStartDateTime { get; set; }
            public DateTime PlanStartDateTime { get; set; }
            public DateTime PlanEndDateTime { get; set; }
            public string PlanName { get; set; }

        }

        public PlanStats()
        {

            AddMap<Plan>(plans => from plan in plans
                                  select new
                                             {
                                                 LogHours = 0,
                                                 LogStartDateTime = DateTime.MinValue,
                                                 PlanStartDateTime = plan.StartDateTime,
                                                 PlanEndDateTime = plan.FinishDateTime,
                                                 PlanName = plan.Title,
                                             });

            AddMap<Log>(logs => from log in logs
                                select new
                                {
                                    LogHours = log.CpdHours,
                                    LogStartDateTime = log.StartDateTime,
                                    PlanStartDateTime = DateTime.MinValue,
                                    PlanEndDateTime = DateTime.MinValue,
                                    PlanName = default(string),
                                });

            Reduce = results => from result in results
                                group result by new
                                                     {
                                                         result.PlanName,
                                                         result.PlanStartDateTime,
                                                         result.PlanEndDateTime,
                                                     }
                                    into agg
                                    select new
                                               {
                                                   LogHours = agg
                                                                .Where(l => l.LogStartDateTime >= agg.Key.PlanStartDateTime && l.LogStartDateTime < agg.Key.PlanEndDateTime)
                                                                .Sum(x => x.LogHours),
                                                   LogStartDateTime = DateTime.MinValue,
                                                   agg.Key.PlanName,
                                                   agg.Key.PlanStartDateTime,
                                                   agg.Key.PlanEndDateTime,
                                               };
        }
    }

Matt Johnson

Nov 27, 2012, 12:13:42 PM
to ravendb
Can you provide an example of actual data for a plan and a few logs?
A mock-up of desired results would also be good.

Chris Mcvittie

Nov 28, 2012, 6:38:01 AM
to rav...@googlegroups.com
Here is a simple test that is currently yielding "0" for me...  I've included stuff about categories, but these aren't accounted for in the index... yet...

 public class PlanTests : RavenTest
    {
        [Fact]
        public void CompletedHoursCalculation()
        {
            documentStore.ExecuteIndex(new PlanStats());

            using (IDocumentSession session = documentStore.OpenSession())
            {
                session.Store(new Plan
                                  {
                                      Title = "Plan 1",
                                      StartDateTime = new DateTime(2010, 01, 01),
                                      FinishDateTime = new DateTime(2013, 01, 01),
                                      TotalRequiredHours = 400,
                                      RequiredHours = new List<RequiredHours>
                                                          {
                                                              new RequiredHours
                                                                  {
                                                                      Category = new CategoryReference
                                                                                     {
                                                                                         Id = "category/1",
                                                                                         Name = "Learning by Doing"
                                                                                     },
                                                                      Hours = 60
                                                                  }
                                                          }
                                  });
                session.Store(new Log
                                  {
                                      Title = "Attending Conference",
                                      Content = "Some notes here on the log event",
                                      Category = new CategoryReference
                                                     {
                                                         Id = "category/1",
                                                         Name = "Learning by Doing"
                                                     },
                                      StartDateTime = new DateTime(2011, 01, 01),
                                      CpdHours = 20
                                  });

                var res = session.Query<PlanStats.PlanLogStats, PlanStats>().ToList();

                Assert.NotNull(res);
                Assert.Equal(20, res.Sum(x => x.LogHours));
            }
        }
    }

Matt Johnson

Nov 28, 2012, 4:25:02 PM
to ravendb
Chris, I'm working on a solution for you, but encountering a bug in
raven that is in the way. I am working on a test case so they can
work on it.

A couple of points while we're waiting:

- You forgot to call session.SaveChanges() in your test.
- You will also need .Customize(x => x.WaitForNonStaleResults()) in
your query for the test. You won't need this in production.
- You aren't going to be able to filter by range in the summation part
of your reduce. There are multiple reduce steps from multiple
sources. You could do something like that in TransformResults, but
that wouldn't help with your need for aggregation.
- The main thing that binds your Plan and Log together is the
Category, but you've omitted it from your index. The CategoryId is
the main thing to group by. The other part is the date. In other
words, to join in the manner you are looking for, you will only have
precision to the nearest whole day, and you would group by the
{ categoryid, date } pair. In order to do this, you need your plan
projected for every discrete date within the range, which is where the
bug I encountered is.

The more I look at your model, the more it looks like you're creating
more complexity than is needed. Are you sure you want all time logged
against a category to apply to ALL plans that contain that category
and have the date in range? That's what your model currently does.
If you were only logging time against a single plan, then a PlanId on
the Log would be in order - and would drastically simplify your index.

Matt Johnson

Nov 28, 2012, 5:53:54 PM
to ravendb
For any lurkers:
http://issues.hibernatingrhinos.com/issue/RavenDB-757

Matt Johnson

Nov 29, 2012, 1:27:13 AM
to ravendb
Thinking about this some more - perhaps when you actually write the Log
you could simply query for matching plans and write an array of
PlanIds to the Log?  That would make things much simpler to index.

Chris Mcvittie

Nov 29, 2012, 7:28:36 AM
to rav...@googlegroups.com
Hi Matt, 

The context for this is a "Continuing Professional Development" system, where users log things that benefit their career, e.g. a training course.  Typically a user will be a member of one or more professional bodies, each of whom will require certain evidence of logs for a given time period.  We are creating the plan entity to group these logs together so that users can see how much CPD they need to schedule for the next x months.  While generally a plan is known upfront, it is possible for one to be added at any time - and in that case a user would expect retrospective logs of the appropriate category to apply to it.

So I think the complexity in the model is necessary.  But possibly the complexity of implementing it as a map-reduce is not.

The main view which is relevant to the discussion is the list of plans - of which there might be 3 active at any point in time.  This has plan name, completed hours, target hours, and number of logs as its basic view.  As this is all I want to maintain, it may make sense to do the cleverness on save: on log update, update the plan stats; and on plan update/create, find all affected logs.

On this subject - there is one thing I'm not sure about in Raven.  "Safe by Default" stipulates the max result set size, so where I want to do an operation like finding all logs in a date range, is there a sensible pattern for "applying it to all" without overriding the safe 128-document page size?  (In any case my where statement below doesn't work, but that should be fixable...)

e.g.
        public override object OnPut(Plan dto)
        {
            dto.Id = dto.Id.Replace("%2F", "/");
            using (var session = DBSession)
            {
                var categoryIds = dto.RequiredHours.Select(x => x.Category.Id);
                var logSum = session.Query<Log>()
                    .Where(x => x.StartDateTime > dto.StartDateTime && x.StartDateTime < dto.FinishDateTime && categoryIds.Contains(x.Category.Id))
                    .Sum(x => x.CpdHours);
                dto.CompleteRequiredHours = logSum;
                session.Store(dto);
            }

            return new HttpResult()
            {
                StatusCode = HttpStatusCode.NoContent,
                Headers = {
                                         { HttpHeaders.Location, dto.Id }
                                     }
            };
        }

Chris Marisic

Nov 29, 2012, 8:45:35 AM
to rav...@googlegroups.com


On Thursday, November 29, 2012 7:28:36 AM UTC-5, Chris Mcvittie wrote:

On this subject - there is one thing I'm not sure about in Raven.  "Safe by Default" stipulates the max result set size, so where I want to do an operation, like finding all logs in date range - is there a sensible pattern for "applying it to all" while not override the safe 128 document page size?  (In any case my where statement doesn't work below, but that should be fixable...).  


The 128 is just semi-arbitrary.  The safe-by-default aspect really means that you can't (normally) ever hit the N+1 problem, or load your entire database into memory the way you easily can with an ORM.  The default max per query is 1024.  This value can be changed on the server, but it is strongly recommended not to alter the 1024 limit.

This leaves you with working with paging: skip/take.

When we deal with large batch operations we walk the database sorted by Created date (linear order) and take batches of 1024.
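That batch walk is roughly (a sketch; stale-index handling and the session request limit are omitted for brevity):

```csharp
// Walk all logs in start-date order, 1024 at a time.
const int pageSize = 1024;
var fetched = 0;
while (true)
{
    var batch = session.Query<Log>()
        .OrderBy(x => x.StartDateTime)
        .Skip(fetched)
        .Take(pageSize)
        .ToList();

    if (batch.Count == 0)
        break;

    // ... process the batch ...
    fetched += batch.Count;
}
```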

Chris Mcvittie

Nov 29, 2012, 9:02:04 AM
to rav...@googlegroups.com
Going slightly more off-topic, but in the context of this being a "local" function - i.e. my API server (which is doing the query) and RavenDB are currently on the same box, and if not will be on a fast network - is there a technical benefit to this limit that warrants the extra code I have to write, and the extra connections made by sending, say, 5 queries instead of 1 for 5000 items?

Troy

Nov 29, 2012, 9:19:49 AM
to rav...@googlegroups.com
This may be way off base... but could you use a ScriptedPatch and UpdateByIndex?  Or possibly an IndexedProperty?

Chris Mcvittie

Nov 29, 2012, 9:31:29 AM
to rav...@googlegroups.com
I don't think I can see how to make any of those work - because I'm trying to update my "plans" with an aggregate of "logs", and if my understanding is correct, IndexedProperty applies within a document, whereas my logs are within a collection.

Troy

Nov 29, 2012, 9:43:24 AM
to rav...@googlegroups.com
I was thinking you could update your Plans by CategoryId ... and looking at the class structure again, you are correct.

Matt Johnson

Nov 29, 2012, 11:23:06 AM
to ravendb
Thanks for the context, Chris.  Now I have a much better idea of what
you need.

First, forget what I said about inclusive/exclusive intervals. Those
are good for up-to-the-moment instantaneous-time ranges, but your
ranges are whole calendar-time dates. So you are better served by
inclusive/inclusive DateTimes (with DateTime.Kind of "unspecified",
not local or utc). I imagine your UI will have start and end date
fields, which you will parse in to DateTimes. So your ranges would be
like [2010-01-01 - 2010-12-31], followed by the next range of
[2011-01-01 - 2011-12-31]. There is no risk of overlap or ambiguity,
because you are dealing with whole calendar dates. (I think I should
write a blog on this...)

Second, I think most of what you want to accomplish can be done in the
domain logic without anything fancy from Raven indexing. You were
trying to approach this with loose coupling between Log and Plan,
letting Raven be responsible for coupling them in the index. It's
probably easier and more efficient to make these more tightly
coupled. Consider the following:

public class Plan
{
    public string Id {get; set;}
    public string Title {get; set;}
    public string UserId {get; set;}
    public DateTime StartDate {get; set;}
    public DateTime FinishDate {get; set;}
    public List<PlanItem> Items {get; set;}
}

public class PlanItem
{
    public string CategoryId {get; set;} // or use a CategoryReference
    public decimal RequiredHours {get; set;}
    public decimal CompletedHours {get; set;}
}

public class Activity // a better contextual name than "Log"
{
    public string Id {get; set;}
    public string UserId {get; set;}
    public DateTime Date {get; set;}
    public decimal LogHours {get; set;}
    public string Description {get; set;}
    public string CategoryId {get; set;} // or use a CategoryReference
    public List<string> PlanIds {get; set;}
}

When you record activity, you can find the plans that are applicable,
update those plans CompletedHours, and record the planid with the
activity.

When you create or update a plan, you can examine the activities that
currently match the plan, and update the PlanItems that are in the
plan, and update activities to reference the new plan if needed.

You can either paginate over 1024 results, or use the Patching API to
make these changes. You would make sure to update both items in the
same unit of work, for transactional guarantees. You also may want to
retry if you encounter a concurrency exception.
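The "record activity" side might be sketched like this (assuming the plan's item list is named `Items`; query shape and concurrency handling omitted):

```csharp
public void RecordActivity(IDocumentSession session, Activity activity)
{
    // Find the user's plans whose date range covers this activity.
    var plans = session.Query<Plan>()
        .Where(p => p.UserId == activity.UserId
                 && p.StartDate <= activity.Date
                 && activity.Date <= p.FinishDate)
        .ToList();

    foreach (var plan in plans)
    {
        var item = plan.Items.FirstOrDefault(i => i.CategoryId == activity.CategoryId);
        if (item == null)
            continue; // this activity's category isn't part of the plan

        item.CompletedHours += activity.LogHours;
        activity.PlanIds.Add(plan.Id);
    }

    session.Store(activity);
    session.SaveChanges(); // queried plans are tracked, so both sides commit together
}
```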

This would give you a view from both sides, without relying on any
custom map/reduce indexing. You *could* still do some of it in
indexing if you wanted to - such as calculating CompletedHours as a
map/reduce summation. But you can now group by PlanId and CategoryId
instead of having to do anything complicated with dates.

Kijana Woodard

Nov 29, 2012, 12:05:01 PM
to rav...@googlegroups.com
@matt +1

Having the PlanItem document gives you a hook for IndexedProperty as well.

Chris Mcvittie

Nov 29, 2012, 12:22:20 PM
to rav...@googlegroups.com
Are there docs on IndexedProperties?

Matt Warren

Nov 29, 2012, 12:29:48 PM
to ravendb
Not at the moment, currently the best guide is the tests here https://github.com/ayende/ravendb/tree/indexed-prop-array/Bundles/Raven.Bundles.Tests/IndexedProperties

I'm planning to do some docs soon as people keep asking for it. However I was chatting with Oren the other night and there are plans to overhaul the entire bundle. You would then be able to write JS patches to control how you move the properties from one doc to the other, similar to eval patching (http://ayende.com/blog/157185/awesome-ravendb-feature-of-the-day-evil-patching). This would make it much more powerful.

Having said all that, at the moment the IndexedProperties bundle allows you to update a "destination" doc when a "source" doc changes and causes a re-index (generally a Map/Reduce index).

Chris Mcvittie

Nov 29, 2012, 12:35:10 PM
to rav...@googlegroups.com
@matt - thanks for your considerable time and effort in understanding my issue, and advising on it.  The solution seems reasonable to me, though I am a little surprised by the level of effort required versus the SQL equivalent... but maybe I just need to get that out of my head!