import large csv file via virtual table?

Richard Males

Oct 17, 2013, 11:58:40 AM
to spatiali...@googlegroups.com

I have a large csv file, 88 million rows, 8 GB, originally exported from an Oracle database, that I want to import into spatialite. I have tried importing it directly through the spatialite gui, with no success, and also through a virtual table. Following the suggestions in the earlier thread 'Importing csv with GUI changes double to text', I use a CREATE TABLE statement to get the table fields formatted as I wish, and then load from the virtual csv table. This works perfectly with smaller csv files, and I have used it on files with a million rows, but the procedure fails on this large dataset.
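
In outline, the pattern I am using looks roughly like this, driven from Python's sqlite3 module. This is a sketch only: it assumes a Python build that allows loading extensions and that mod_spatialite is on the library path, and the column types are my guesses from the sample below, not the original Oracle schema.

    import sqlite3

    # Sketch of the "typed table + virtual CSV" load pattern.
    conn = sqlite3.connect("aisdata.sqlite")
    conn.enable_load_extension(True)
    conn.load_extension("mod_spatialite")  # provides the VirtualText module

    conn.executescript("""
    CREATE VIRTUAL TABLE vtaisdata
      USING VirtualText('aisdata_20131016.csv', 'CP1252', 1, POINT, DOUBLEQUOTE, ',');

    -- Typed target table; types are guesses from the sample data below.
    CREATE TABLE aisdata (
      IMO INTEGER, VESSELNAME TEXT, VESSELTYPE INTEGER, DESTINATION TEXT,
      NAVSTATUS INTEGER, LAT DOUBLE, LON DOUBLE, SOG INTEGER, AISTIMESTAMP INTEGER
    );

    INSERT INTO aisdata
      SELECT IMO, VESSELNAME, VESSELTYPE, DESTINATION,
             NAVSTATUS, LAT, LON, SOG, AISTIMESTAMP
      FROM vtaisdata;

    DROP TABLE vtaisdata;
    """)
    conn.commit()
    conn.close()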

Here is a small sample of the data file I am working with, as exported from Oracle:

"IMO","VESSELNAME","VESSELTYPE","DESTINATION","NAVSTATUS","LAT","LON","SOG","AISTIMESTAMP"
"9259604","HALUL 31","96","PS 3_HALUL","0","25.6666833333333","52.4118166666667","0","1243824597906"
"9084516","LISTA","81","SLAGEN","1","59.3276666666667","10.514","0","1243823578843"
"8323094","MV VERGINA","70","DJIBOUTI","1","11.61265","43.1387833333333","0","1243824482812"
"7812452","SETUBAL","90","SETUBAL","1","38.467","-8.846","0","1243824599000"
"9083172","GSP LICORN","52","CONSTANTA","0","44.1590333333333","28.64686","0","1243824664203"
"8118554","BIBI-M","0","KUWAIT","0","27.8633766666667","51.0075833333333","17","1243824640062"
"7402477","OCEAN CLEVER","51","DRAUGEN","0","64.3554","7.79892833333333","0","1243824652718"
"8618059","MED PRODIGY","79","ORAN","5","35.70935","-0.645566666666667","0","1243824564328"
"9162655","CAROLIEN  SCH81","30","MAURITANIE","0","20.670045","-17.4571633333333","8","1243824440437"

...

...

Here is the SQL syntax I am using to create the virtual table:

CREATE VIRTUAL TABLE vtaisdata USING VirtualText('aisdata_20131016.csv','CP1252', 1, POINT, DOUBLEQUOTE, ',');

This does create a table vtaisdata, but with only a single field, ROWNO.   Creating the virtual table from the spatialite gui gives the same result.

I have tried this with a smaller subset of the dataset (1000 rows), using a fully qualified path name for the csv file, and that works, as does importing the smaller dataset through the gui.

One thought that occurs to me is that there may be some kind of encoding or formatting error somewhere in the 88 million row csv file that is throwing off the process, but I don't know how to verify that.
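
One simple check, since Python's csv module tracks quoting across lines, is to scan the file and report any record whose field count differs from the header. A rough sketch, using the file name and CP1252 encoding from the VirtualText call above:

    import csv

    # Report csv records whose field count differs from the header's.
    # (Record numbers are csv records, not physical lines, if any quoted
    # field happens to span multiple lines.)
    with open("aisdata_20131016.csv", encoding="cp1252", newline="") as f:
        reader = csv.reader(f)
        expected = len(next(reader))  # header row
        for recno, row in enumerate(reader, start=2):
            if len(row) != expected:
                print(f"record {recno}: {len(row)} fields, expected {expected}")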

Any suggestions or thoughts as to how to accomplish loading this dataset into spatialite would be appreciated, including using external means (e.g. a python script).

Thanks in advance

Dick

Jukka Rahkonen

Oct 17, 2013, 3:25:42 PM
to spatiali...@googlegroups.com
Hi,

The CSV importer in Spatialite is a fine tool, but it does not tolerate any errors in the source data, and as far as I know it does not report the reason or the line number where an error occurs. Therefore I would use other tools with a CSV file as big as yours.

I would first split the huge file into smaller pieces with something like "split" on Linux or "sfk split" from the Swiss File Knife tools on Windows. Then I would use the GDAL/OGR csv driver http://www.gdal.org/ogr/drv_csv.html to do the conversion. It will take an hour to learn how to write a GDAL VRT file, but with your data it will be worth it. I think you will appreciate being able to define the data type of each column exactly by writing a one-line auxiliary .csvt schema file, and being able to map the LAT and LON values directly into geometries.
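
If you prefer to stay in Python for the split step, something like this writes the header into every chunk so each piece remains a valid CSV on its own. The file names and chunk size are just examples, and it assumes no quoted field contains an embedded newline:

    import itertools

    # Split the big CSV into 10-million-row chunks, repeating the header.
    with open("aisdata_20131016.csv", encoding="cp1252") as f:
        header = f.readline()
        for n in itertools.count():
            chunk = list(itertools.islice(f, 10_000_000))
            if not chunk:
                break
            with open(f"aisdata_part{n:02d}.csv", "w", encoding="cp1252") as out:
                out.write(header)
                out.writelines(chunk)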

Once your VRT file is ready you can do the conversion into Spatialite as
ogr2ogr -f SQLite -dsco spatialite=yes -skipfailures output.sqlite input.vrt

If you have split the data as I suggested, you can use ogr2ogr with the -append switch to add data to an existing database and table. However, with an amount of data as big as yours, it is frustrating to repeat all the steps if something goes wrong during the conversion. I would convert the data in batches of, for example, 10 million rows into separate Spatialite databases, or at least into separate tables. After each successful conversion you do not need to repeat it, even if there happens to be trouble with some other batch.
Once all the data are converted, it will be simple to gather everything into one table by attaching the databases one by one to the main database.
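
The gathering step could look roughly like this from Python; the database, table, and alias names here are illustrative only:

    import sqlite3

    # Merge per-batch databases into the main table via ATTACH/DETACH.
    conn = sqlite3.connect("main.sqlite")
    for path in ("batch1.sqlite", "batch2.sqlite"):
        conn.execute(f"ATTACH DATABASE '{path}' AS batch")
        conn.execute("INSERT INTO aisdata SELECT * FROM batch.aisdata")
        conn.commit()  # DETACH fails inside an open transaction
        conn.execute("DETACH DATABASE batch")
    conn.close()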

-Jukka Rahkonen-

Jukka Rahkonen

Oct 17, 2013, 5:12:10 PM
to spatiali...@googlegroups.com
I had to test this myself.

  • I stored your sample data as "aistest.csv"
  • I wrote a file "aistest.csvt" as
    "Integer","String","Integer","String","Integer","Real","Real","Integer","Integer"
  • I wrote VRT file "aistest.vrt" as
    <OGRVRTDataSource>
        <OGRVRTLayer name="aistest">
            <SrcDataSource>aistest.csv</SrcDataSource>
            <GeometryType>wkbPoint</GeometryType>
            <LayerSRS>WGS84</LayerSRS>
            <GeometryField encoding="PointFromColumns" x="LON" y="LAT"/>
        </OGRVRTLayer>
    </OGRVRTDataSource>
  • Test with ogrinfo
    C:\temp>ogrinfo aistest.vrt -al -so
    INFO: Open of `aistest.vrt'
          using driver `VRT' successful.

    Layer name: aistest
    Geometry: Point
    Feature Count: 9
    Extent: (-17.457163, 11.612650) - (52.411817, 64.355400)
    Layer SRS WKT:
    GEOGCS["WGS 84",
        DATUM["WGS_1984",
            SPHEROID["WGS 84",6378137,298.257223563,
                AUTHORITY["EPSG","7030"]],
            TOWGS84[0,0,0,0,0,0,0],
            AUTHORITY["EPSG","6326"]],
        PRIMEM["Greenwich",0,
            AUTHORITY["EPSG","8901"]],
        UNIT["degree",0.0174532925199433,
            AUTHORITY["EPSG","9108"]],
        AUTHORITY["EPSG","4326"]]
    IMO: Integer (0.0)
    VESSELNAME: String (0.0)
    VESSELTYPE: Integer (0.0)
    DESTINATION: String (0.0)
    NAVSTATUS: Integer (0.0)
    LAT: Real (0.0)
    LON: Real (0.0)
    SOG: Integer (0.0)
    AISTIMESTAMP: Integer (0.0)
  • Convert into Spatialite
    ogr2ogr -f SQLite -dsco spatialite=yes aistest.sqlite aistest.vrt

The result is perfect with 9 rows. Let's see how it goes with your 88 million rows.

-Jukka-

Richard Males

Oct 18, 2013, 10:31:28 AM
to spatiali...@googlegroups.com
Many thanks. I went ahead and developed a python script to insert the data; it took 15 hours but completed successfully. I know I can speed it up significantly. I will check out your approach as well and report back with the results.
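
For anyone following along, the main speedups I expect are batching the INSERTs and committing rarely. A rough sketch of that shape, reusing the file name and table layout from earlier in the thread; note that synchronous=OFF trades crash safety for speed:

    import csv
    import itertools
    import sqlite3

    # Batched loader: executemany() in large transactions instead of one
    # INSERT per row. SQLite's type affinity converts the text values to
    # the declared column types.
    conn = sqlite3.connect("aisdata.sqlite")
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("PRAGMA journal_mode = MEMORY")

    with open("aisdata_20131016.csv", encoding="cp1252", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        while True:
            batch = list(itertools.islice(reader, 100_000))
            if not batch:
                break
            conn.executemany(
                "INSERT INTO aisdata VALUES (?,?,?,?,?,?,?,?,?)", batch)
            conn.commit()
    conn.close()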

Dick

Jukka Rahkonen

Oct 20, 2013, 4:44:59 AM
to spatiali...@googlegroups.com
Hi,

I noticed that my command is guaranteed to be terribly slow. You should use something like -gt 65536 to make the transactions much bigger than the default, which is only 200 rows. Also read the other performance hints at http://www.gdal.org/ogr/drv_sqlite.html. I would also consider doing the initial import without a spatial index (-lco spatial_index=no). Using layer creation options may be possible only when the input layer is explicitly given, so the whole command would be

ogr2ogr -f SQLite -dsco spatialite=yes -gt 65536 -lco spatial_index=no aistest.sqlite aistest.vrt aistest
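
If you skip the spatial index at import time, you can build it afterwards, for example from Python. This sketch assumes mod_spatialite is loadable and that ogr2ogr named the geometry column GEOMETRY, which you should verify on your build:

    import sqlite3

    # Build the spatial index after the bulk load has finished.
    conn = sqlite3.connect("aistest.sqlite")
    conn.enable_load_extension(True)
    conn.load_extension("mod_spatialite")
    conn.execute("SELECT CreateSpatialIndex('aistest', 'GEOMETRY')").fetchone()
    conn.commit()
    conn.close()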