select query with index slow first time, fast second time


Richard Males

Oct 23, 2013, 11:37:56 AM
to spatiali...@googlegroups.com
As described in my post on 'Import Large CSV file via Virtual Table', I have created an 88 million row file. This is a record of vessel locations. Each row contains an integer field called IMO that identifies the vessel. There are many rows for each vessel, i.e. multiple rows per IMO. I created a non-unique index for the IMO field as:

CREATE INDEX aisdata_imo_idx ON "aisdata" (imo);


It took 1 hour and 17 minutes to create the index.

When I run a query from the spatialite GUI to select all rows for a given IMO, it may take 10 to 20 seconds to return the records. When I re-run this query a second time, it takes a fraction of a second. This is true even after I close and re-open the database. I am also doing this processing for all vessels in a Python script. The same thing happens: it is slow the first time I run the script, and fast the second and subsequent times I run the script.

I would like to understand what is happening before I start rebuilding the index or trying other approaches.

Any suggestions much appreciated.  Thanks in advance.

Dick

a.fu...@lqt.it

Oct 23, 2013, 12:04:24 PM
to spatiali...@googlegroups.com
On Wed, 23 Oct 2013 08:37:56 -0700 (PDT), Richard Males wrote:
> I would like to understand what is happening before I start
> rebuilding
> the index or trying other approaches.
>
> Any suggestions much appreciated. Thanks in advance.
>

Hi Dick,

there is absolutely nothing odd or strange in all this; it's exactly
what you can usually expect under such conditions, and it is widely
known as the "cold cache / hot cache" behaviour.

a) SQLite's I/O efficiency largely depends on the filesystem's own
   efficiency
b) your first attempt encounters a "cold cache" condition;
   i.e. every single byte to be processed must be physically read
   from the hard disk, and this is obviously a slow task because
   even the fastest hard disk is always a bottleneck compared to
   the real power of your CPU.
c) any filesystem (NTFS, EXT4 and the like) is usually smart enough
   to keep as much data as possible in memory for as long as possible,
   depending on how much unused free memory is available (I/O buffers)
d) so your second attempt will simply retrieve any required
   data directly from the RAM buffers, and no physical access to
   the HDD will be required (hot cache).

a useful "dirty trick" I always use before processing a huge DB:
just copy it into a different file, and then delete the copy so as
to immediately reclaim the useless disk allocation.

this forces the operating system to read the whole file content
from start to end, thus effectively loading all the data into RAM
(if you have a reasonable amount of free memory),
and any further SQL processing will benefit from a well-established
"hot cache" condition ;-)

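for example, a minimal Python sketch of the same idea (reading the whole file sequentially instead of copying it; the file name aisdata.sqlite is only an illustrative assumption):

# Sketch only: sequentially read the whole database file once, so the
# operating system keeps its pages in the filesystem cache ("hot cache").
# "aisdata.sqlite" is an assumed name, not necessarily the actual file.
def warm_os_cache(path, chunk_size=64 * 1024 * 1024):
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

if __name__ == "__main__":
    size = warm_os_cache("aisdata.sqlite")
    print("read %.1f GB; later queries should hit a hot cache" % (size / 1024.0 ** 3))

whether this actually helps depends on how much free RAM the operating system is willing to dedicate to the file cache.
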
bye Sandro

Richard Males

Oct 23, 2013, 2:37:43 PM
to spatiali...@googlegroups.com
Thank you. I tried your approach; it appears to have removed the pre-existing hot cache and slowed down processing for those IMOs that I had previously processed, such that the times are the same as if each IMO were being handled for the first time.

I have 40,000 separate vessels (IMOs). At the present rate of about 10.5 seconds per IMO under the cold cache condition, this will take around 116 hours to complete the Python script processing. Under the hot cache, I was averaging 0.2 seconds per IMO for those previously processed.

The sqlite database is some 7.3 GB. I am running on a Windows 8 machine, i7 processor, with 8 GB of RAM, 7.89 GB usable.

If anything else occurs to you as to how to speed up the overall process, I would appreciate it. I would consider partitioning the database and copying to an in-memory database if that would work better, or any other suggestions you might have.

Dick




Andrea P.

Oct 23, 2013, 2:40:55 PM
to spatiali...@googlegroups.com
How fast is your hard disk?

I work with a 500 GB SSD to get the fastest speed possible.

a.fu...@lqt.it

Oct 23, 2013, 3:35:38 PM
to spatiali...@googlegroups.com
On Wed, 23 Oct 2013 11:37:43 -0700 (PDT), Richard Males wrote:
> The sqlite database is some 7.3 GB, I am running on a Windows 8
> machine, i7 processor, with 8GB of RAM, 7,89 GB usable.
>
> If anything else occurs to you as to how to speed up the overall
> process, I would appreciate it. I would consider partitioning the
> database and copying to an in-memory database if that would work
> better, or any other suggestions you might have.
>

Dick,

by loading the whole DB into memory you'll surely achieve optimal
performance; in this case you obviously have to use a 64 bit
version of spatialite, because 32 bit addresses are physically
constrained within a 4 GB limit (on Windows this often means
only between 0.8 and 1.5 GB under ordinary conditions).

64 bit addresses are free from any such physical limit; however you
should carefully consider that:
a) if you completely consume all the available RAM, the operating
   system will start to use the swap file on the HDD, thus
   becoming painfully slow.
b) a reasonable amount of RAM should always be left free for the
   operating system itself (I/O buffers and the like).

all this considered, splitting your DB into two halves (about 4 GB
each) and then performing two distinct in-memory runs seems
to be the fastest strategy you can adopt.
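
for example, a minimal (untested) Python sketch of the in-memory approach, assuming the sqlite3 module (Python 3.7+ for Connection.backup) and the illustrative file name aisdata.sqlite:

# Sketch only: copy the on-disk database into a fully in-memory database,
# then run the per-IMO queries against the in-memory copy.
# File name and IMO values are illustrative; loading the SpatiaLite
# extension (if spatial functions are needed) is omitted here.
import sqlite3

disk = sqlite3.connect("aisdata.sqlite")
mem = sqlite3.connect(":memory:")
disk.backup(mem)   # page-by-page copy of the whole DB into RAM (Python 3.7+)
disk.close()

cur = mem.cursor()
for imo in (100001, 100002):   # placeholder IMO values
    rows = cur.execute(
        "SELECT * FROM aisdata WHERE imo = ? ORDER BY timestamp", (imo,)
    ).fetchall()
    # ... process the rows for this vessel ...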

other useful options that boost sqlite's performance in a very
effective way:

----------
PRAGMA journal_mode=OFF;

this will completely disable journaling and transactions; it certainly
is a very dangerous option (your DB could eventually be completely
destroyed if any error, hardware failure or system crash occurs).
anyway, this usually isn't such a bad trade-off for many single-user
tasks, because in that case you can always patiently restart
again from scratch.

---------
PRAGMA cache_size=-4000000;

this authorizes sqlite to consume a lot of RAM for its own
internal cache; the value (please note: *negative*) is interpreted
as being expressed in KB, so -4000000 = 4 GB (half your total RAM).
(obviously this option is completely useless if you load your DB
in-memory, but it is often useful when working directly on the HDD)
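
for instance, a minimal sketch of setting both PRAGMAs from a Python script (the file name and values are only illustrative, and journal_mode=OFF carries the data-loss risk described above):

# Sketch only: apply the suggested PRAGMAs before the heavy processing.
import sqlite3

conn = sqlite3.connect("aisdata.sqlite")
conn.execute("PRAGMA journal_mode=OFF")      # dangerous: no rollback journal
conn.execute("PRAGMA cache_size=-4000000")   # negative value = KB, i.e. ~4 GB
# ... run the per-IMO queries on this connection ...
conn.close()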


last but not least ... VACUUM before starting your process.
such a big DB could easily be highly fragmented (I mean internal
page fragmentation, not fragmentation related to the Windows
filesystem); VACUUMing the DB is often an effective solution,
reorganizing all data pages in the most efficient way.

and don't forget the final "forbidden weapon": disconnect the
network cable and then shut down your anti-virus ;-)
many anti-virus products have a very bad impact on the overall
performance of Windows.

bye Sandro

Jukka Rahkonen

Oct 23, 2013, 3:36:43 PM
to spatiali...@googlegroups.com
Hi,

Based on a feeling for how the SQLite data file is built, but without real knowledge, I would try rewriting the database using "sort by ais". Then perhaps all the rows for each "ais" would be written consecutively into the data file, and if the index bites there should not be much need to jump from one file block to another. Perhaps sorting by ais and timestamp would be even better. But as I said, I am just thinking.
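
For example (just a sketch; this assumes the column meant is imo, and aisdata_sorted is only an illustrative name):

# Sketch only: rewrite the table so rows for each IMO (and timestamp) are
# stored consecutively, then index the sorted copy.
import sqlite3

conn = sqlite3.connect("aisdata.sqlite")
conn.executescript("""
    CREATE TABLE aisdata_sorted AS
        SELECT * FROM aisdata ORDER BY imo, timestamp;
    CREATE INDEX aisdata_sorted_imo_idx ON aisdata_sorted (imo);
""")
conn.close()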

-Jukka Rahkonen-

Richard Males

Oct 24, 2013, 10:39:02 AM
to spatiali...@googlegroups.com

Thanks to all for responses.

I adopted the suggestions that Sandro provided regarding the pragma approach and am currently letting it run. It gives some improvement in processing time, but not the order of magnitude I am looking for. I will explore the in-memory option after I look into methods of reducing my database size.

Andrea, my hard disk is indeed relatively slow, but getting a new, faster one is not currently an option.

I am also interested in the approach that Jukka has suggested; I have some questions about it.

1) My assumption is that a select without an order by does not guarantee any particular order of retrieval, independent of how the records are inserted.  This has always been my understanding of the behavior of relational databases.

2) For my python script, I need my records sorted by IMO and then by timestamp. I have timed the script; the only really time-consuming portion is the retrieval of the data from spatialite for each IMO. The spatialite index is currently only on IMO. I can create a new table that is ordered in that fashion (by IMO and then timestamp) with an autoincrement field (PK_UID), such that successive rows will have increasing timestamps within a given IMO. My question: is an order by PK_UID clause likely to be faster than an order by timestamp clause, e.g.:

select * from AISDATA where IMO=100001 order by PK_UID
or
select * from AISDATA where IMO=100001 order by timestamp

I will certainly post my findings once I have resolved this.

Dick

a.fu...@lqt.it

Oct 24, 2013, 4:06:27 PM
to spatiali...@googlegroups.com
Hi Dick,

> select * from AISDATA where IMO=100001 order by PK_UID
> or
> select * from AISDATA where IMO=100001 order by timestamp
>

I read in some previous post that you've already defined an
index supporting AISDATA:

CREATE INDEX aisdata_imo_idx ON "aisdata" (imo);

please note: the optimal indexes supporting the above queries
are, respectively:

CREATE INDEX aisdata_imo_idx ON "aisdata" (imo, PK_UID);
or
CREATE INDEX aisdata_imo_idx ON "aisdata" (imo, timestamp);

this way SQLite will directly fetch each row already in
the expected order, thus completely avoiding the need to
perform a final sort before returning the resultset.
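
a quick way to check whether the extra sort is really avoided (a sketch, not verified against your exact schema) is EXPLAIN QUERY PLAN: with the composite index in place, the plan should no longer contain a "USE TEMP B-TREE FOR ORDER BY" step:

# Sketch only: inspect the query plan for the ORDER BY timestamp variant.
import sqlite3

conn = sqlite3.connect("aisdata.sqlite")
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM AISDATA WHERE IMO = 100001 ORDER BY timestamp"
).fetchall()
for row in plan:
    print(row)
conn.close()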

bye Sandro


Andrea P.

Oct 24, 2013, 4:31:38 PM
to spatiali...@googlegroups.com
Sorry for the misunderstanding,
but the HD speed is a really important parameter to evaluate for every performance strategy.
For example,
you could use a ramdisk or a memory db (the two are quite different strategies), or put an attached temporary db on a ramdisk (see the sketch below).
But to understand whether this has any chance of giving an enhancement, you always need to start from the initial conditions, and the first parameter is the speed of your own HD.

For example, if you have an SSD (a very fast HD), the option of going to a ramdisk for a performance gain is not so useful.
Instead, you could try to split the DB into more sections.
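
a minimal sketch of the "attached temporary db on a ramdisk" option (assuming a ramdisk is already mounted and visible as drive R:, which is purely illustrative):

# Sketch only: keep the main DB on disk and attach a scratch DB on a ramdisk,
# e.g. to materialize a working subset on the fast device.
import sqlite3

conn = sqlite3.connect("aisdata.sqlite")
conn.execute("ATTACH DATABASE 'R:/scratch.sqlite' AS scratch")
conn.execute(
    "CREATE TABLE scratch.aisdata_subset AS "
    "SELECT * FROM aisdata WHERE imo BETWEEN 100000 AND 200000"
)
conn.commit()
conn.close()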

Regards,

shaya...@gmail.com

Feb 26, 2019, 2:29:42 PM
to SpatiaLite Users
Hi Sandro,

Maybe I'm a little bit late to join this conversation :D I'm facing the same problem and I found this thread on Google. Can you explain more about your "dirty trick"? I tried copying the DB and deleting it, but again, when I run any query for the first time it takes a long time; it is fast on the next attempts. Thanks

a.fu...@lqt.it

Feb 26, 2019, 4:20:08 PM
to spatiali...@googlegroups.com
On Tue, 26 Feb 2019 11:29:42 -0800 (PST), shaya...@gmail.com wrote:
> Hi Sandro,
>
> Maybe I'm a little bit late to join this conversation:D I'm facing
> the
> same problem and I found this thread on google. Can you explain more
> about your "dirty trick"? I tried copying the DB and deleting it but
> again when I run any query for the first time it takes a long time.
> However, fast on next attempts. Thanks
>

Hi Shayan,

modern hardware usually has a large amount of RAM, and the operating
system will surely attempt to use as much memory as possible for
caching frequently accessed filesystem objects.

that said, as a general principle each operating system has its
own caching strategy; Linux obviously isn't the same as Windows,
and WinXT probably differed from Win10 in many details.
and there are many other aspects to be considered: how much
free memory is effectively available for disk caching, the
average CPU workload and so on.
on Windows some antivirus software can often have a strongly negative
impact, practically cancelling any possible speed benefit
coming from disk caching.

there are so many variables to be taken into account that it's
practically impossible to extrapolate any general law;
each individual machine has its own specific idiosyncrasies.

these are just "tricks"; sometimes they effectively work,
but other times they don't, because the whole thing is under
the full control of the operating system.

if you are searching for a more robust and reproducible
mechanism allowing SQLite to take full profit from a
huge amount of RAM, I suggest you issue an explicit directive:

PRAGMA cache_size = size;

this will enable SQLite to permanently allocate a
private cache of its own, one that is relatively
immune to random events happening somewhere
else in the system.

bye Sandro

Anıl KAYGISIZ

Apr 5, 2019, 6:30:10 AM
to SpatiaLite Users
Since I can't explain it in English, let me explain in Turkish.
I experienced this same situation with Sqlite.
I later noticed that I had built the project as x64. In that configuration, the first query took more than 5 seconds to run.
When I rebuilt the project as x86 with the System.Data.Sqlite.x86 dll, the first query dropped to 1 second.


Mark Johnson

Apr 5, 2019, 6:50:47 AM
to spatiali...@googlegroups.com



This is an English-based forum, so posing your question or comment in English will heighten the chances of getting a reply.

Mark 

