best practice for storing and searching large amounts of data

210 views
Skip to first unread message

Tom Mount

unread,
Sep 15, 2015, 2:13:21 PM9/15/15
to Firebase Google Group
I have a webapp where visitors fill out a short survey, the results of which are stored in Firebase. A sample set of responses may look like this:

{
 
"results": {
 
"randomid1": {
   
"age": 29,
   
"height": 57,
   
"weight": 42,
   
"submit_date":
1442339543203
  },
 
"randomid2": {
   
"age": 34,
   
"height": 63,
   
"weight": 53,
   
"submit_date": 1442246673772
  }
 }
}

At the moment I have probably 2.1+ million records just like this. I store the aggregate counters in a separate branch:

{
 
"aggregates": {
 
"age": {
   
"19": 0,
   
"29": 1,
   
"39": 1
 
},
 
"height": {
   
"40": 0,
   
"50": 0,
   
"60": 1,
   
"70": 1
 
}
 
}
}

This is great for reporting - when a customer completes the survey, the actual results are written to results and the appropriate aggregate counters are updated. All fine and well, and for the overview report, where we're showing the aggregate totals, it's perfect. But the customer also wants to search and summarize the raw results in a one-off way, which makes things difficult, because individual searches can take over three minutes to return data, which doesn't seem right. Firebase should be faster at that, right? The client wants to search ranges of submit_date and age at the same time, for example. So what I find myself doing is this:

var matches = {};
ref.child('results').orderByChild('submit_date').startAt(startDate).endAt(endDate).once('value', function(s) {
 s
.forEach(function(result) {
 
if (result.val().age >= startAge && result.val().age <= endAge)
   matches[result.key()] = result.val();
 
});
});

What's taking forever (according to the web socket results panel in Chrome developer tools) is the actual retrieval of data. My current index is set up like this:

{
 
"rules": {
 
"results": {
   
".indexOn": [
   
"submit_date",
   
"age"
   
]
 
}
 
}
}

The way I understand the docs is, that should allow me to sort or search the entire results collection by submit_date pretty quickly. However, like I said, with over 2.1 million results, searching by those dates is taking several minutes at a time. And it looks like it's the actual search; the dev tools show data streaming in pretty quick once the search response is received. Is there any way to modify the indexes or restructure the data somehow so that searches don't take so long to return results?

Tom Mount

unread,
Sep 15, 2015, 5:58:21 PM9/15/15
to Firebase Google Group
Made two changes based on emails with Kato - first, I changed instances of .once('value') to .on('value') in order to ensure we didn't chew up our bandwidth and crash the server trying to load the entire instance into server memory for just one operation; second, I modified my security rules to ensure that the indexing was being done correctly:

{
 
"rules": {
 
"results": {
   
".indexOn": "submit_date"
 
}
 
}
}

Since the only thing I search on is submit_date, reducing the index to just that works just fine for me. So far, what I'm seeing is that the first query takes a little while to load the dataset into memory on the server-side, but subsequent queries are nearly instantaneous.

So I think at this point my problem is mostly solved. I'm okay with the first query taking 30 seconds if the next five take 2 seconds.

Mario Giambanco

unread,
Sep 15, 2015, 6:32:36 PM9/15/15
to Firebase Google Group
Tom,

See if my question and the resulting answers can help you.

Kinda sucks having to go outside of firebase and use a 3rd party tool (Apache Drill) but if thats the way it is now, might be better then having to dump the data into a relational database and querying it that way.


-Mario


On Tuesday, September 15, 2015 at 2:13:21 PM UTC-4, Tom Mount wrote:

Tom Mount

unread,
Sep 16, 2015, 10:43:06 AM9/16/15
to Firebase Google Group
The info came *out* of a relational database. While it's a clever workaround I don't think it fits what what my customer needs. I think setting the indexes correctly will do it for them, but time will tell.
Reply all
Reply to author
Forward
0 new messages