Handling Hot Data with ActiveRecord


Mohit Sindhwani

Jul 21, 2015, 10:31:20 AM
to rubyonra...@googlegroups.com
Hello!

We are thinking about different ways to handle hot data in our system. We
receive a lot of information that is highly relevant only for a period of
time. So, for example, we have some data that we query on the basis of:
> counts over last 30 minutes
> data collected in the last 30 minutes
> items collected today
> items collected within the past week
...and so on.

We are looking at different strategies to keep this data updated and to
manage scaling the database. So, we are looking at horizontal sharding
[splitting data into multiple tables that inherit from a master and have
exclusion constraints] and are also considering having tables that hold
data as:
> All the data
> Data for the last 24 hours
> Data for the last 2 hours
That way, when we want something recent, we would just query the most
recent table, but in the few occasions that we need something more, we
go to the larger tables that are sharded.
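The inheritance-based sharding mentioned above might be sketched in 2015-era PostgreSQL roughly as follows (table names and column types are assumptions based on the query later in this post; a real setup would add more columns and a partition-routing trigger):

```sql
-- Master table: queries against this see all child partitions.
CREATE TABLE data_store_v2 (
    id          bigserial,
    group_id    integer NOT NULL,
    recorded_on timestamptz NOT NULL
);

-- One child per week; the CHECK constraint lets the planner skip
-- irrelevant partitions when constraint_exclusion is enabled.
CREATE TABLE data_store_v2_2015w30 (
    CHECK (recorded_on >= '2015-07-20' AND recorded_on < '2015-07-27')
) INHERITS (data_store_v2);

CREATE INDEX ON data_store_v2_2015w30 (recorded_on);

-- A time-bounded query against the master then only scans children
-- whose CHECK ranges can possibly match:
--   SELECT count(*) FROM data_store_v2
--   WHERE recorded_on >= now() - interval '24 hours';
```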

Just for reference, we are doing something like this:
select count(*), count(distinct group_id) from
  (select group_id, recorded_on from data_store_v2
   order by id DESC limit 900000) td
where recorded_on >= (now() - '24 hr'::INTERVAL);

We are getting 800,000 data items a day right now, and the above query
takes around 14 seconds on a single table that is not sharded and has
around 268 million records. Every week, this table becomes slightly
slower, since we add close to 6 million records per week.

I've read this:
https://www.amberbit.com/blog/2014/2/4/postgresql-awesomeness-for-rails-developers/
and am looking for ways to manage everything under Rails, if possible.

So, the questions (and thanks for reading this far) are:
> What is a good way to do this while still working within ActiveRecord?
> This is a capability that we'd like to attach to any model that might
need it. What would be a good way to approach a gem for it?
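One possible shape for such a gem is sketched below in framework-free Ruby. All names here (HotData, hot_data_on, recent) are hypothetical; a real gem would likely build this as an ActiveSupport::Concern that adds scopes to an ActiveRecord model:

```ruby
# Hypothetical mixin adding time-window helpers to any model class
# that exposes a `where`-style query method (as ActiveRecord does).
module HotData
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Configure which timestamp column marks "hotness" (hypothetical
    # configuration hook; defaults to recorded_on).
    def hot_data_on(column)
      @hot_data_column = column.to_s
    end

    def hot_data_column
      @hot_data_column ||= "recorded_on"
    end

    # Records whose timestamp falls within the last `seconds` seconds.
    # Delegates to the including class's `where`, so on an ActiveRecord
    # model this returns an ordinary relation.
    def recent(seconds)
      where("#{hot_data_column} >= ?", Time.now - seconds)
    end
  end
end
```

A model would then opt in with `include HotData` (plus `hot_data_on :created_at` if the column differs) and call `Model.recent(30 * 60)` for the last 30 minutes. Routing such a scope to a separate "last 24 hours" table is the harder part and is not shown here.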

I'm sure there will be a few more questions as we progress on this.

Thanks,
Mohit.


Elizabeth McGurty

Jul 21, 2015, 3:33:02 PM
to Ruby on Rails: Talk
First, I would work to learn which time-increment patterns occur most often, and plan from there.  Then I would design the DB code and servers for optimal master-slave replication.  I'm fairly sure that with replication, WHERE clauses like (where recorded_on >= (now() - '24 hr'::INTERVAL)) could be eliminated at the user level. But I am not certain.

Mohit Sindhwani

Jul 22, 2015, 9:31:49 AM
to rubyonra...@googlegroups.com
Hi Elizabeth,

Thank you for replying.

On 22/7/2015 3:33 AM, Elizabeth McGurty wrote:
> First I would work to learn patterns in time increments, which occur
> most often, and plan from there.

We do know that - we have a few peak hours when more data comes in, and
then there are hours (like night time) when the data drops to a trickle
for some of the record types. However, there are some status emails, etc.
that come in at a fixed frequency - a few times an hour, irrespective of
the time of day.

> Then I would design with regard to DB code and servers, optimal
> Master-slave replication. Pretty sure that in replication WHERE
> clauses like (where recorded_on >= (now() - '24 hr'::INTERVAL) could
> be eliminated at the user level. But I am not certain.

While this is something that might help, I'm more looking at solutions
that involve having hot data tables in the manner:
> All the data
> Data for the last 24 hours
> Data for the last 2 hours
That way, when we want something recent, we would just query the most
recent table, but in the few occasions that we need something more, we
go to the larger tables that are sharded.

Best Regards,
Mohit.


Scott Ribe

Jul 22, 2015, 9:54:36 AM
to rubyonra...@googlegroups.com, Mohit Sindhwani
On Jul 21, 2015, at 8:30 AM, Mohit Sindhwani <mo_...@onghu.com> wrote:
>
> select count(*), count(distinct group_id) from
> (select group_id, recorded_on from data_store_v2 order by id DESC limit 900000) td
> where recorded_on >= (now() - '24 hr'::INTERVAL);

Some suggestions in a slightly different direction:

1) rewrite without that completely unnecessary subquery, and see if the query time improves ;-)

2) check that appropriate indexes exist and are being used by the optimizer

3) really analyze the query execution plan, and look for more advanced opportunities, for instance, order by recorded_on instead of id, since, presumably, the query will already access the rows by an index on recorded_on; consider dropping the limit & order altogether; take that query to a PostgreSQL list...

4) put some effort into learning SQL better; OK, we all make mistakes sometimes and maybe this is just that; but it sure looks to me like someone who doesn't really understand SQL struggling to throw together various clauses until the correct answer pops out (note that in addition to the issue I pointed out in 1, the subquery selects a column which is completely unused--it probably doesn't affect anything, but it's another sign that the person writing the query did not understand it).
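For point 2, a quick check looks something like this (a sketch; the index name is hypothetical, and whether the index helps depends on the actual plan):

```sql
-- See whether the planner uses an index for the 24-hour filter.
EXPLAIN ANALYZE
SELECT count(*)
FROM data_store_v2
WHERE recorded_on >= now() - interval '24 hours';

-- If the plan shows a sequential scan over the whole table, an index
-- on the timestamp column is the usual first step; CONCURRENTLY
-- avoids blocking writes while it builds.
CREATE INDEX CONCURRENTLY idx_data_store_v2_recorded_on
    ON data_store_v2 (recorded_on);
```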

--
Scott Ribe
scott...@elevated-dev.com
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice





Elizabeth McGurty

Jul 22, 2015, 5:32:01 PM
to Ruby on Rails: Talk
Amen! Amen! To what Scott has written.  I actually wanted to offer an alternative SQL procedure, but thought that I shouldn't, as it would be out of context for RoR...

Mohit Sindhwani

Jul 22, 2015, 11:11:14 PM
to rubyonra...@googlegroups.com
Hi Scott,

Thanks for your email. Your inputs are certainly useful.

On 22/7/2015 9:54 PM, Scott Ribe wrote:
> On Jul 21, 2015, at 8:30 AM, Mohit Sindhwani <mo_...@onghu.com> wrote:
>> select count(*), count(distinct group_id) from
>> (select group_id, recorded_on from data_store_v2 order by id DESC limit 900000) td
>> where recorded_on >= (now() - '24 hr'::INTERVAL);
> Some suggestions in a slightly different direction:
>
> 1) rewrite without that completely unnecessary subquery, and see if the query time improves ;-)

We have tried this and the query is quite a bit slower. Filtering to
the last 900k records before doing the recorded_on part helped speed it up.

> 2) check that appropriate indexes exist and are being used by the optimizer
>
> 3) really analyze the query execution plan, and look for more advanced opportunities, for instance, order by recorded_on instead of id, since, presumably, the query will already access the rows by an index on ordered_on; consider dropping the limit & order altogether; take that query to a PostgreSQL list...

Your email got me to go back and look at all the parts again, since
obviously the query should be using the index and it was still slow.
Further searching last night made me realize that it's not the indexes
that are a problem. The problem is the count(distinct group_id) part,
which seems to be quite slow in PostgreSQL. This:
select count(*) from
  (select distinct group_id from data_store_v2
   where recorded_on >= '2015-06-06') td;
is a lot faster than:
select count(distinct group_id) from data_store_v2
where recorded_on >= '2015-06-06';

as explained here:
https://www.periscope.io/blog/use-subqueries-to-count-distinct-50x-faster.html

So, I guess the real problem was being masked by something else and an
incorrect assumption on my part :)

> 4) put some effort into learning SQL better; OK, we all make mistakes sometimes and maybe this is just that; but it sure looks to me like someone who doesn't really understand SQL struggling at throwing together various clauses until the correct answer pops out (note that in addition to the issue I pointed out in 1, the subquery is selecting a column which is completely unused--probably doesn't affect anything, but just another sign that the person writing the query did not understand it).

Thanks for the analysis :D
I do understand SQL, and I didn't think I was throwing things together...
for my understanding, which column was unnecessary? I thought we needed
all of them:
> group_id for counting the distinct group_id
> recorded_on for the subsequent query on it
> id only for getting the most recent records

Most of my personal work is with SQLite3 on embedded platforms, but this
discussion resolves the problem for now. It now moves away from being a
Rails issue to being a PostgreSQL issue.

Best Regards,
Mohit.

Scott Ribe

Jul 22, 2015, 11:54:36 PM
to rubyonra...@googlegroups.com, Mohit Sindhwani
On Jul 22, 2015, at 9:10 PM, Mohit Sindhwani <mo_...@onghu.com> wrote:
> We have tried this and the query is quite a bit slower. Filtering to the last 900k records before doing the recorded_on part helped speed it up.

I don't understand how that could possibly be the case if there's an index on recorded_on.

> Your email got me going back to look at all the parts again since obviously the query should be using the index and it was still slow. Further search last night made me realize that it's not the indexes that are a problem. The problem is the count(distinct group_id) part which seems to be quite slow in PostgreSQL. This is a lot faster:
> select count(*) from
> (select distinct group_id from
> data_store_v2 where recorded_on >= '2015-06-06') td;
> than:
> select count(distinct group_id) from
> data_store_v2 where recorded_on >= '2015-06-06';
>
> as explained here: https://www.periscope.io/blog/use-subqueries-to-count-distinct-50x-faster.html
>
> So, I guess the real problem was being masked by something else and an incorrect assumption on my part :)

I would expect the select count(distinct...) to be a major contributor to the time taken by the query, just given the amount of work it must do. The select count(*) from (select distinct...) alternative is a nice tip :)

>
> Thanks for the analysis :D
> I do understand SQL and I thought I'm not throwing things together... for my understanding, which column was unnecessary? I thought we needed all:
> > group_id for counting the distinct group_id
> > recorded_on for the subsequent query on it
> > id only for getting the most recent records

I apologize--I misread the query structure. I got it into my head as:

select ... from (select ... from ... where recorded_on ... order by ... limit ...)

I think you can see how THAT query would have better fit my description of being poorly constructed.

I'm glad that my somewhat off-base pontification still managed to point you in a useful direction!

Mohit Sindhwani

Jul 23, 2015, 1:28:24 AM
to rubyonra...@googlegroups.com
Hi Scott,

On 23/7/2015 11:54 AM, Scott Ribe wrote:
> On Jul 22, 2015, at 9:10 PM, Mohit Sindhwani <mo_...@onghu.com> wrote:
>> We have tried this and the query is quite a bit slower. Filtering to the last 900k records before doing the recorded_on part helped speed it up.
> I don't understand how that could possibly be the case if there's an index on recorded_on.

Because it was the count(distinct x) that was the problem :)
Doing only a count(*) is faster without the subquery... and is what we
have switched to.

>
>> Your email got me going back to look at all the parts again since obviously the query should be using the index and it was still slow. Further search last night made me realize that it's not the indexes that are a problem. The problem is the count(distinct group_id) part which seems to be quite slow in PostgreSQL. This is a lot faster:
>> select count(*) from
>> (select distinct group_id from
>> data_store_v2 where recorded_on >= '2015-06-06') td;
>> than:
>> select count(distinct group_id) from
>> data_store_v2 where recorded_on >= '2015-06-06';
>>
>> as explained here: https://www.periscope.io/blog/use-subqueries-to-count-distinct-50x-faster.html
>>
>> So, I guess the real problem was being masked by something else and an incorrect assumption on my part :)
> I would expect the select count(distinct...) to be a major contributor to the time taken by the query, just given the amount of work it must do. The select count(*) from (select distinct...) alternative is a nice tip :)

...and that is what the cause was.

>> Thanks for the analysis :D
>> I do understand SQL and I thought I'm not throwing things together... for my understanding, which column was unnecessary? I thought we needed all:
>>> group_id for counting the distinct group_id
>>> recorded_on for the subsequent query on it
>>> id only for getting the most recent records
> I apologize--I misread the query structure. I got it into my head as:
>
> select ... from (select ... from ... where recorded_on ... order by ... limit ...)
>
> I think you can see how THAT query would have better fit my description of being poorly constructed.
>
> I'm glad that my somewhat off-base pontification still managed to point you in a useful direction!
>

Yes, thanks again.

Best Regards,
Mohit.


Scott Ribe

Jul 23, 2015, 8:12:40 AM
to rubyonra...@googlegroups.com, Mohit Sindhwani
On Jul 22, 2015, at 11:27 PM, Mohit Sindhwani <mo_...@onghu.com> wrote:
>
> Hi Scott,
>
> On 23/7/2015 11:54 AM, Scott Ribe wrote:
>> On Jul 22, 2015, at 9:10 PM, Mohit Sindhwani <mo_...@onghu.com> wrote:
>>> We have tried this and the query is quite a bit slower. Filtering to the last 900k records before doing the recorded_on part helped speed it up.
>> I don't understand how that could possibly be the case if there's an index on recorded_on.
>
> Because it was the count(distinct x) that was the problem :)
> Doing only a count(*) is faster without the subquery... and is what we have switched to.

So does the limit reduce it to less than 1 day's rows?

Mohit Sindhwani

Jul 23, 2015, 9:14:40 AM
to rubyonra...@googlegroups.com
Hi Scott,

On 23/7/2015 8:12 PM, Scott Ribe wrote:
>> Because it was the count(distinct x) that was the problem :)
>> Doing only a count(*) is faster without the subquery... and is what we have switched to.
> So does the limit reduce it to less than 1 day's rows?

Yes, the idea of the limit (900k records) was to restrict it to roughly
the records within 1 day and then run the query on the smaller record
set. However, after your comments, when I was looking around, I realized
that the slowness is because of the count(distinct x) and not because of
the indexes, etc.

For reference, on a 230-million-record table, the numbers were roughly
along these lines:
* SELECT count(*), count(distinct group_id) with a limit of 900k records
based on ID DESC, followed by the recorded_on part = 8.6 seconds
* SELECT count(*), count(distinct group_id) on the whole table using
only recorded_on in the WHERE = 14 seconds
* SELECT count(*) only, with a limit of 900k records based on ID DESC,
followed by the recorded_on part = 700 ms
* SELECT count(*) on the whole table using only recorded_on in the WHERE
= 350 ms
--> Clearly, the culprit was the count(distinct group_id) - it benefits
a lot from using a subquery to limit the number of records it considers
* SELECT count(*) from (select distinct group_id from data_store_v2
where recorded_on >= '') --> takes around 900 ms

So, we are combining these in the final query now... this takes around
900 ms to get both values (count and count distinct):
-- get the fields from 2 different subqueries
select * from
  -- first field is for the count(*)
  (select count(*) AS all_count from data_store_v2
   where recorded_on >= (now() AT TIME ZONE 'Asia/Singapore' - '24 hr'::INTERVAL)) as t1,
  -- and joining in the second one for the count(distinct group_id)
  (select count(*) from
    -- this is yet another subquery
    (select distinct drive_id from data_store_v2
     where recorded_on >= (now() AT TIME ZONE 'Asia/Singapore' - '24 hr'::INTERVAL)) td) as t2;


But as I mentioned, this is now a PostgreSQL question, not an
ActiveRecord or Rails question :)

Thanks for digging with me!

Best Regards,
Mohit.






Scott Ribe

Jul 23, 2015, 4:47:10 PM
to rubyonra...@googlegroups.com
On Jul 23, 2015, at 7:14 AM, Mohit Sindhwani <mo_...@onghu.com> wrote:
>
> * SELECT count(*) only with a limit of 900k records based on ID DESC, followed by the recorded_on part = 700ms
> * SELECT count(*) on the whole table using only recorded_on in the WHERE = 350ms

That's kind of what I expect. That adding the count(distinct...) makes them slower is not a surprise, but I was surprised that adding the count(distinct...) inverts the relative performance of those two.

So, anyway, it sounds like you might have gotten it fast enough. If not, bounce the discussion over to pg's general mail list, and we can talk about how to maintain a summary table without going through the hassle of full-on sharding ;-)
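The summary-table idea Scott alludes to could be sketched as below (table, function, and trigger names are hypothetical; this assumes PostgreSQL 9.5's INSERT ... ON CONFLICT, while earlier versions need an update-then-insert loop, and it only rolls up plain counts - distinct counts cannot be summed across buckets this way):

```sql
-- Per-minute rollup maintained by an insert trigger on the base table.
CREATE TABLE data_store_v2_rollup (
    bucket timestamptz PRIMARY KEY,  -- minute the rows arrived in
    n      bigint NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION bump_rollup() RETURNS trigger AS $$
BEGIN
    INSERT INTO data_store_v2_rollup (bucket, n)
    VALUES (date_trunc('minute', NEW.recorded_on), 1)
    ON CONFLICT (bucket)
    DO UPDATE SET n = data_store_v2_rollup.n + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_store_v2_rollup_trg
AFTER INSERT ON data_store_v2
FOR EACH ROW EXECUTE PROCEDURE bump_rollup();

-- "Count over the last 30 minutes" then reads ~30 rows instead of
-- scanning the base table:
SELECT coalesce(sum(n), 0)
FROM data_store_v2_rollup
WHERE bucket >= now() - interval '30 minutes';
```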