table design for massive GPS data

migurus

unread,

Sep 13, 2012, 5:47:42 PM9/13/12

to

Windows Server/SQL Server 2008, we need to store GPS readings from roughly 1000 vehicles, ~ 300 readings per vehicle per day. The data comes in in real-time. Records are never updated after insert. That gives us roughly 120 million records per year. We need to keep last 5 years of records.

The usage of stored data is to query vehicle positions and event attributes within entire day by given vehicle and date. Geospatial features and functions are not really needed for the reporting.

I'd like to discuss table design for these requirements.

Here is a preliminary record content:
- vehicle id
- gps reading X and Y
- gps reading time
- one-letter event attribute
- event id (nullable)

First idea is to keep data normalized in one table, one record per GPS reading. It looks very simple to implement, but I doubt the server will be happy with this high number of small records. I we were to do this, what would be a good candidate for primary key here?
Just for illustration I show table def for idea# 1
create table idea1 (
VEHICLE_ID int not null
, GPS_X real not null
, GPS_Y real not null
, GPS_TIME datetime not null
, EVENT_LETTER char not null
, EVENT_ID int null
);
I don't show indexes as I am not clear what to use.
The typical query would look like this:
select
GPS_X
, GPS_Y
, GPS_TIME
, EVENT_LETTER
, EVENT_ID
from
idea1
where
VEHICLE_ID=@id
and GPS_TIME between @from and @to

Second idea is to keep a small table where records are inserted as they come in the structure shown above and then on the daily basis aggregate it and store as one record per vehicle in the another table, which will be used for queries. By doing this we shrink number of records from 120 million to be ~ 350 thousands per year. This looks much more manageable. The cons here are additional effort to come up with the way of storing and retrieving gps readings as a blob.
So, the table would look:
create table idea2 (
VEHICLE_ID int not null
, RUN_DATE date not null
, GPS_READINGS varbinary or whatever appropriate to keep daily portion of readings
-- we maight need reading_count as int here to help extract them from blob
);
And typical query would look like:
select
GPS_READINGS
from
idea2
where
VEHICLE_ID=@id
and RUN_DATE=@date
This result blob will need to be 'expanded', we might use sp to do it.

Any comments are welcome, critique is appreciated.
Thanks in advance.
migurus.

bradbury9

unread,

Sep 14, 2012, 3:06:10 AM9/14/12

to

120.000.000 record per year up to 5 years = 600.000.000. That is a pretty big database as you already said.

I wonder if it is not better creating an indexed view. That way you avoid creating 547.000 extra records that consume disk space.

BTW, make sure you run the SQL Server quey optimizer so the select used in the view has good indexes.

Bob Barrows

unread,

Sep 14, 2012, 7:33:54 AM9/14/12

to

migurus wrote:
> Windows Server/SQL Server 2008, we need to store GPS readings from
> roughly 1000 vehicles, ~ 300 readings per vehicle per day. The data
> comes in in real-time. Records are never updated after insert. That
> gives us roughly 120 million records per year. We need to keep last 5
> years of records.
>
> The usage of stored data is to query vehicle positions and event
> attributes within entire day by given vehicle and date. Geospatial
> features and functions are not really needed for the reporting.
>
> I'd like to discuss table design for these requirements.
>
> Here is a preliminary record content:
> - vehicle id
> - gps reading X and Y
> - gps reading time
> - one-letter event attribute
> - event id (nullable)
>
> First idea is to keep data normalized in one table, one record per
> GPS reading. It looks very simple to implement, but I doubt the
> server will be happy with this high number of small records.

Why not? Narrow and long tables are perfect for relational databases

> I we
> were to do this, what would be a good candidate for primary key here?

What combination of columns identifies a row? Vehicle_ID and GPS_TIME? If
so, there's your clustered primary key (wtp?)
What is that event id and why is it nullable?

> Just for illustration I show table def for idea# 1
> create table idea1 (
> VEHICLE_ID int not null
> , GPS_X real not null
> , GPS_Y real not null
> , GPS_TIME datetime not null
> , EVENT_LETTER char not null
> , EVENT_ID int null
> );
> I don't show indexes as I am not clear what to use.
> The typical query would look like this:
> select
> GPS_X
> , GPS_Y
> , GPS_TIME
> , EVENT_LETTER
> , EVENT_ID
> from
> idea1
> where
> VEHICLE_ID=@id
> and GPS_TIME between @from and @to

The proposed clustered primary key above would suit this query perfectly

>
>
> Second idea is to keep a small table where records are inserted as
> they come in the structure shown above and then on the daily basis
> aggregate it and store as one record per vehicle in the another
> table, which will be used for queries. By doing this we shrink number
> of records from 120 million to be ~ 350 thousands per year. This
> looks much more manageable. The cons here are additional effort to
> come up with the way of storing and retrieving gps readings as a
> blob.
>

Nah, this data isn't really suitable for aggregation, IMO.

rja.ca...@gmail.com

unread,

Sep 14, 2012, 8:05:05 AM9/14/12

to

Off the top of a semi-qualified head:

If the table is indexed well, then query performance will depend on the
volume of data in a query, but not so much on the size of a table and the
rows that you aren't looking at.

If the typical query is all the records for one vehicle on one calendar day
(one working day), then a primary key of date, vehicle id, time, looks
possible. If you want to count work after midnight for the previous day,
that could be messier - you could think about using datetimeoffset.

I expect such a design also to be efficient when you are deleting
records that are 5 years old every day.

If you use datetime order, then vehicle, then the server has to scan
records of 1000 vehicles to get the data that you actually want to see.
On the other hand, if you want the query for 1000 vehicles, then
it's probably okay if they're mixed together.

If data arrives in real time, then a clustered index of date, vehicle, time,
may slow down updates, while favouring the query. Usually you want it
that way around. The clustered index makes SQL Server physically store
neighbouring records together. So you'll have 1000 distinct points in
the data file where each vehicle's records are being added. You'd also
see data filling pages that are then split into two, a lot. But I expect
the server still to perform well in that case. There also may be an
implication of the growth of the transaction log, but if data is inserted
one row at a time then it won't make much of a difference - the log ray be
pretty big whatever you do.

Your primary key is of course unique and doesn't allow NULL and by
default is a clustered index, but none of this is necessary,
including even having a primary key as such. Indexes that aren't
the primary key will work just as well. The clustered index does
matter; it's also desirable for its key to be short, since it's
incorporated into any nonclustered indexes, so if there isn't a need
to use the clustered index to reorganise the data then a clustered
index on an IDENTITY(int, 1, 1) column can be pretty good.
/Then/ you set your nonclustered primary key (which does have
to be unique and not NULL). Including a non-meaningful IDENTITY
in a primary key wouldn't be right, but it's right if it is,
in fact, the key of your table in relationships with other data.

The point is, if your data is not unique in the date, time, and
vehicle ID fields, then don't get sidetracked into worrying about
a primary key that you can't have. Alternatively, add 1 second
to one of the duplicate key values, and it'll be unique.

Primary key also is a constraint that must have a unique object name
(PK_nameoftable probably), whereas you can and I do name other indexes
IX_1, IX_2, IX_3, on every table in your database. This doesn't
help other developers understand the database, but other developers
don't often help me, so if they want to know what my indexes are,
they can look 'em up. I probably have 500 nearly identical tables
created at the same time. (Arguably I should have one table with
500 times as many rows and a couple of extra columns, but that isn't
how we do things here.) So I don't want to invent 500 meaningful
index names.

By the way, this project sounds important enough that you'll be
thinking carefully about data backup and recovering from disaster
or downtime, if you don't already have that covered thoroughly
for your platform. So, your thinking about backup and recovery
also may bear on your database design. For instance, do you
want that big, fat transaction log that holds everything that
happened in your database between the last backup and the failure -
or do you have files of the last 7 days' GPS data that you can
re-load as part of your disaster recovery process, so that it
doesn't matter if the database is blown away except for the
twice-weekly taken-off-site backup?

migurus

unread,

Sep 14, 2012, 1:13:10 PM9/14/12

to

On Friday, September 14, 2012 4:35:11 AM UTC-7, Bob Barrows wrote:
> migurus wrote:
>
> > Windows Server/SQL Server 2008, we need to store GPS readings from
>
> > roughly 1000 vehicles, ~ 300 readings per vehicle per day. The data
>
> > comes in in real-time. Records are never updated after insert. That
>
> > gives us roughly 120 million records per year. We need to keep last 5
>
> > years of records.
>
> >
>
> > The usage of stored data is to query vehicle positions and event
>
> > attributes within entire day by given vehicle and date. Geospatial
>
> > features and functions are not really needed for the reporting.
>
> >
>
> > I'd like to discuss table design for these requirements.
>
> >
>
> > Here is a preliminary record content:
>
> > - vehicle id
>
> > - gps reading X and Y
>
> > - gps reading time
>
> > - one-letter event attribute
>
> > - event id (nullable)
>
> >
>
> > First idea is to keep data normalized in one table, one record per
>
> > GPS reading. It looks very simple to implement, but I doubt the
>
> > server will be happy with this high number of small records.
>
>
>
> Why not? Narrow and long tables are perfect for relational databases
>
>

Thanks, I did not know that, as I have limited exposure in db field.

>
> > I we
>
> > were to do this, what would be a good candidate for primary key here?
>
>
>
> What combination of columns identifies a row? Vehicle_ID and GPS_TIME? If
>
> so, there's your clustered primary key (wtp?)
>
> What is that event id and why is it nullable?

Some events do not require id, this will not be any foreign key or anything

Thanks for your input.

migurus

unread,

Sep 14, 2012, 1:25:00 PM9/14/12

to

On Friday, September 14, 2012 5:05:05 AM UTC-7, rja.ca...@gmail.com wrote:
> Off the top of a semi-qualified head:
>
>
>
> If the table is indexed well, then query performance will depend on the
>
> volume of data in a query, but not so much on the size of a table and the
>
> rows that you aren't looking at.
>
>

Thanks, I had an impression that hundreds of millions of small records is not a sound design.

>
> If the typical query is all the records for one vehicle on one calendar day
>
> (one working day), then a primary key of date, vehicle id, time, looks
>
> possible. If you want to count work after midnight for the previous day,
>
> that could be messier - you could think about using datetimeoffset.
>
>
>
> I expect such a design also to be efficient when you are deleting
>
> records that are 5 years old every day.
>
>
>
> If you use datetime order, then vehicle, then the server has to scan
>
> records of 1000 vehicles to get the data that you actually want to see.
>
> On the other hand, if you want the query for 1000 vehicles, then
>
> it's probably okay if they're mixed together.
>

One vehicle at a time queries required.

>
>
> If data arrives in real time, then a clustered index of date, vehicle, time,
>
> may slow down updates, while favouring the query. Usually you want it
>
> that way around. The clustered index makes SQL Server physically store
>
> neighbouring records together. So you'll have 1000 distinct points in
>
> the data file where each vehicle's records are being added. You'd also
>
> see data filling pages that are then split into two, a lot. But I expect
>
> the server still to perform well in that case. There also may be an
>
> implication of the growth of the transaction log, but if data is inserted
>
> one row at a time then it won't make much of a difference - the log ray be
>
> pretty big whatever you do.
>

There is no updates, as I mentioned. Inserts only.
I love your idea of building three part index: date, vehicle_id, time.
This is right to the point and not a big overhead space wise.

I was thinking about an hourly log shipping and daily or weekly full backups.
Would really appreciate a comment on this idea.

Thanks for your thoughts.

Erland Sommarskog

unread,

Sep 15, 2012, 1:21:13 PM9/15/12

to

migurus (mig...@yahoo.com) writes:
> First idea is to keep data normalized in one table, one record per GPS
> reading. It looks very simple to implement, but I doubt the server will
> be happy with this high number of small records.

Why not? As long it is properly indexed, I don't see any problem.

What you should consider though, is to partition the table (which
requires Enterprise Edition). When five years has passed, and you
want to purge old data, this is a swift affair with partitioning,
provided that your partition aligns with what you want to purge.

> I we were to do this, what would be a good candidate for primary key here?

(VEHICLE_ID, GPS_TIME). Although, you should probably have GPS_TIME as the
first column in the clustered index to avoid fragmentation. GPS_TIME would
also be your partitioning column.

> So, the table would look:
> create table idea2 (
> VEHICLE_ID int not null
> , RUN_DATE date not null
> , GPS_READINGS varbinary or whatever appropriate to keep daily
> portion of readings

You may save some space with this, because you save row overhead. But you
get a much more complex solution which is more difficult to understand and
maintain. And if the neeeds changes in the future and people want to run
queries like "which vehicles have been in that area at some time?", that
is hopeless with the above. I strongly recommend against it.

--
Erland Sommarskog, SQL Server MVP, esq...@sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx

rja.ca...@gmail.com

unread,

Sep 17, 2012, 11:46:07 AM9/17/12

to

On Saturday, September 15, 2012 6:21:13 PM UTC+1, Erland Sommarskog wrote:
> migurus (mig...@yahoo.com) writes:
> > First idea is to keep data normalized in one table, one record per
> > GPS reading. It looks very simple to implement, but I doubt the
> > server will be happy with this high number of small records.
>
> Why not? As long it is properly indexed, I don't see any problem.
> What you should consider though, is to partition the table (which
> requires Enterprise Edition). When five years has passed, and you
> want to purge old data, this is a swift affair with partitioning,
> provided that your partition aligns with what you want to purge.

Partitioning might be not a great boost to performance, depending on
how it's done. With methods possibly out of date or wrong, I estimate
the original design row size is 23 bytes, counting one byte to map
NULL columns (for up to 8 table columns whether nullable or not),
and an overhead of 11 bytes per table row, although I don't remember
how I worked that out - maybe by trying it out. Then I think that
one day's new data is a little less than 10 megabytes, and five years
is around 20 GB. That's before indexing - which would probably be,
well, somewhat smaller than the data. If it's stupid to make tables
as large as that, I've been stupider.

300,000 transactions per day, however, I put at about 2 GB or more
in the transaction log if it's fully logged, so keeping that under
control might be an important part of the backup/maintenance plan.
On the other hand, since I'm assuming storing one 8 KB page in the
log for storing each 34 bytes row, it's a lousy way to keep a backup
version of the data as far as efficiency goes. But efficient isn't
the same thing as effective. If you can bear the inefficiency, it'll
work. Whoops, I forgot that these additions will probably update
the index as well, so, double that log requirement.

For deleting data /without/ taking table partitioning into account,
you might remove one day's records each day after five years, but
that won't be logged so badly if the rows are grouped together by
clustered index - you're probably okay even if they're ordered by
vehicle ID and /then/ date/time.

And then the physical storage of data rows will be merrily jumbled
within the data file - fragmented somewhat as time goes on -
but if you only want to query one day's and one vehicle's data
at a time, I think that should be fine.

If partitioning the table, instead, what size of partition would you
make - one day, one month, one year - and how would you maintain that?
Can it be done automatically?

Backup should be whatever you need to have in order to restore
as much as you need to restore, as promptly as you need to have it,
with depth of design in case you discover just at the wrong time
that your backup tape or disc isn't any good. There are some
horror stories that it's good to read, but not at bedtime.
So, what you should set up depends on the circumstances, and
the budget. You may need to explain some of the horror stories
to whoever pays for stuff.

I might include separately storing at least a day's worth of GPS
messages to re-load into your database from the external format,
in case of (a) equipment failure and data loss and (b) finding out
that you were mishandling and corrupting the messages for a while.
There are horror stories about that, too. And if you can restore
yesterday's data in that way, then you may only need a daily backup
of the entire database, just after midnight - which will save you
from storing copies of the transaction log.

On the other hand, your recovery scenario may be, you have database
backups, but someone stole all of your computers...

Also, would it be bad if somebody stole your data... such as,
to rob the vehicles. Like, if the truck full of money drives
the same route every Friday... sure, the bad guys could go out
and follow it anyway, but at least then they have to pay for gas.
I don't think you want to tell /us/ if you have trucks full
of money.

So, anyway, you should stop that from happening.