Date Range vs Int Range and Distinct

52 views
Skip to first unread message

Imran Shafiq

unread,
Jun 30, 2015, 1:13:00 AM6/30/15
to rav...@googlegroups.com

RavenDb noob here.. so pardon my ignorance :)

My simplified data model

public class Patient
{

string Name;

List<Problem> Problems;
List<Visit> Visits; 

}

public class Problem 
{
string Code;
DateTime Date;
}

public class Visit
{
string Code;
DateTime Date;
}

I want to run a query that returns all patients that had any activity within a date range i.e. had a problem diagnosed or a visit occured in the queried date range. 
Currently I am getting all patients via streamed query and filtering client side whether a patient qualifies for the date range or not but of course I have to change that.

I could not create index on patient doc since it has the fan out problem. one patient document could spit out 10 dates or 300 dates, I cant know in advance. So raven index dint work for me.

I created a separate document myself called PatientMonthMap (to start with I decided on keeping month to make it easier .. I am assuming int range queries would work better somehow?)

PatientMonthMap has Month as "yyyymm" and patient as "patientId"

The idea being that I could query this document for starting month and ending month and it would give me all patient ids, I can use a Server Transform on that to load the patient documents in question ->  that I could then use the Streamed Query to load. 
I am populating this document manually when I import new patient data. Patient data never changes in RavenDB, it only changes when I import new data from source DB, import happens nightly or weekly.

So the first question is: Does this make any sense :)

Am I better off using the full date to range query on? Does this yyyyMM (intrange) gives me any benefit?

How do I load "distinct" patient documents on top of this? e.g. a patient may have activity in June and July so his id is repeated in PatientMonthMap  

I can replace this strategy with a better technique if you guys have one :)


Oren Eini (Ayende Rahien)

unread,
Jun 30, 2015, 4:24:42 AM6/30/15
to ravendb
What are the date ranges that you care about? Month? 

Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

 


--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Imran Shafiq

unread,
Jun 30, 2015, 6:09:16 AM6/30/15
to rav...@googlegroups.com
Ideally I should be getting patients per date range i.e.

Active patients during June 1st, 2015 and December 30th, 2015

but to start with I though I could just do June, 2015 and December 2015, I assumed int range would work faster/simpler so I wanted to query where period >= 201506 && period <=201512

My main questions are:
Am I better off with this int range approach or should I do date range
Do I have to keep this manual map document to avoid the fanout limitation or is there a better approach
How do I get distinct patients from this map document that would be given as query to streamed session as my patients may be in the >50K range?
I am processing each patients data as the stream comes in but the patients have to be distinct. 
The processing itself per patient takes some milliseconds <=3ms, load takes more depending on the size of the patient doc 

Oren Eini (Ayende Rahien)

unread,
Jun 30, 2015, 9:36:46 AM6/30/15
to ravendb
Why not create a map/reduce index that would list the number of visits per person per day?

Federico Lois

unread,
Jun 30, 2015, 1:07:26 PM6/30/15
to rav...@googlegroups.com
Up to 1m records and a fan out in the low 1000s you are safe using a map reduce with the fanout limitation disabled. After that you are a bit more limited in options and should probably either change the data structure or use subscriptions and essentially do the summarization yourself (if you are doing complex calculations).

If you are not even near there, don't worry and use whatever is easier. Use ticks for date ranges they are 2x faster than dates.

Federico

From: Oren Eini (Ayende Rahien)
Sent: ‎30/‎06/‎2015 10:36
To: ravendb
Subject: Re: [RavenDB] Date Range vs Int Range and Distinct

Imran Shafiq

unread,
Jun 30, 2015, 7:20:55 PM6/30/15
to rav...@googlegroups.com
So after some experimenting, I used a simple index in RavenDB over the patient document as one possible solution, please comment if it is OK and won't have any performance issues.

Index creates Array of YearMonths per Patient doc. 

YearMonth: [201409, 201502, 201506]   _document_id: 101
YearMonth: [201301, 201302]   _document_id: 102

Now patients are distinct as there is one record per patient and no fanout.

Also, I can now do query like YearMonth == 201409   as well as YearMonth>=201301 && YearMonth<=201312

However, I dont know whats the limit on array items/length in an index? e.g. in the above example patient 101 has 3 yearmonth items and patient 102 has 2 items
I may have some patients that have up to 60 YearMonth Items, is that OK?

Also, this index will be working on around 200K patient documents   

Oren Eini (Ayende Rahien)

unread,
Jul 1, 2015, 2:22:46 AM7/1/15
to ravendb
Yes, that would work just fine.

Reply all
Reply to author
Forward
0 new messages