Trying to use CSVRead with an auto-increment column

Lizard

Feb 15, 2010, 4:04:33 PM
to H2 Database
So here's what I'm trying to do.
a) I have a CSV file on disk.
b) I read the headers from this file into a string array.
c) I create a table, using a generated name and the fields I just
read, plus an extra column called 'ROW_ID' which is an auto-increment
field.
d) I then want to read the data from the CSV file into this new table.
The problem is that the column counts don't match: the table has one
more column than the CSV file.
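For reference, step (c) above can be sketched as follows. This is a minimal illustration, not code from the thread: the helper name, the sample data, and the blanket VARCHAR typing are all assumptions (real column names would also need quoting and type inference).

```python
import csv
import io

def build_create_table(table_name, csv_text):
    """Build an H2 CREATE TABLE statement from the CSV header row,
    prepending an auto-increment ROW_ID column (H2 IDENTITY syntax).
    All data columns are assumed VARCHAR for simplicity."""
    headers = next(csv.reader(io.StringIO(csv_text)))
    cols = ", ".join("%s VARCHAR" % h for h in headers)
    return "CREATE TABLE %s(ROW_ID IDENTITY, %s)" % (table_name, cols)

# hypothetical sample standing in for the real CSV file
sample = "field1,field2,field3\n1,2,3\n"
sql = build_create_table("amp_20091124", sample)
print(sql)
# CREATE TABLE amp_20091124(ROW_ID IDENTITY, field1 VARCHAR, field2 VARCHAR, field3 VARCHAR)
```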

I tried: "INSERT INTO amp_20091124 VALUES (0, SELECT * FROM
CSVREAD('C:\Temp\Mark Issue 11302009\amp_20091124.csv'))", but this
had the same problem.

It's possible for me to specify all field names (except ROW_ID) in the
"insert into" clause, and then "SELECT field1, field2, ..." in the
values clause, but I'm dealing with files with hundreds or thousands
of columns, and that creates very unwieldy, hard-to-read strings when
I'm trying to debug my SQL, so I'd like to avoid that if possible.

(Meaning, I could do "INSERT INTO MYFILE (field1, field2, ... fieldN)
SELECT field1, field2, ... fieldN FROM CSVREAD(...)". At least, I
assume I could; I haven't tried it yet.)

The other option is to create the table and then add the new column.
My reason for not doing it this way is I plan to deal with very large
tables (hundreds of thousands to millions of rows), and my experience
has been that adding all columns, and specifying indexes, is much
better done before you add data. So that's the worst-case solution at
the moment.

Thank you for any advice on this.

Thomas Mueller

Feb 17, 2010, 1:39:02 PM
to h2-da...@googlegroups.com
Hi,

CREATE TABLE AS SELECT is faster than first creating the table and
then inserting the data. I would try to combine creating the table
with inserting the data. Example:

CREATE TABLE amp_20091124(ROW_ID IDENTITY,...) AS
SELECT NULL, * FROM CSVREAD...

Another idea is:

-- create an empty table
CREATE TABLE amp_20091124 AS SELECT * FROM
CSVREAD('C:\Temp\Mark Issue 11302009\amp_20091124.csv')
WHERE 1=0;

-- add the ROW_ID
-- see http://www.h2database.com/html/grammar.html#alter_table_add
ALTER TABLE amp_20091124 ADD COLUMN ROW_ID IDENTITY;
-- maybe BEFORE firstColumnName

-- insert the data, with NULL as the ROW_ID
INSERT INTO amp_20091124 SELECT *, NULL FROM CSVREAD...
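The three-step pattern above (empty table, add column, bulk insert) can be mirrored with Python's stdlib `sqlite3` driver as a runnable stand-in, since trying H2 itself needs a JDBC connection. Note the caveat in the comments: SQLite has no IDENTITY, so here the added column simply stays NULL, whereas H2 would auto-fill it. The table names and sample rows are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# stand-in for the CSVREAD source (hypothetical sample data)
cur.execute("CREATE TABLE src(field1 TEXT, field2 TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?)",
                [("a", "1"), ("b", "2"), ("c", "3")])

# step 1: create an empty table with the source's column layout
cur.execute("CREATE TABLE amp AS SELECT * FROM src WHERE 1=0")
# step 2: add the extra column (in H2, IDENTITY would auto-fill it;
# SQLite has no IDENTITY, so row_id stays NULL in this stand-in)
cur.execute("ALTER TABLE amp ADD COLUMN row_id INTEGER")
# step 3: bulk-insert, supplying NULL for the added column
cur.execute("INSERT INTO amp SELECT *, NULL FROM src")

cols = [d[0] for d in cur.execute("SELECT * FROM amp").description]
n = cur.execute("SELECT COUNT(*) FROM amp").fetchone()[0]
print(cols, n)  # ['field1', 'field2', 'row_id'] 3
```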

Regards,
Thomas

Lizard

Feb 17, 2010, 1:58:25 PM
to H2 Database
Thank you. These solutions seem to be very helpful. I will try them.

Do you recommend creating the index after the data load, or before? The
index is vital because I need to be able to select arbitrary row
ranges from tables, i.e., get rows 5000 to 6500. Experimenting with the
console on large tables (about 3 GB, 6 million rows, because that's
the size we're dealing with), SELECT with OFFSET and LIMIT took
longer the further "down" the table one went; using an index and
SELECT ... BETWEEN seemed to be mostly linear.
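The range-scan pattern described above can be sketched with `sqlite3` as a stand-in for H2 (the table name and data are made up for illustration): an indexed key lets the database jump straight to the start of the range, whereas OFFSET has to walk past every skipped row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# INTEGER PRIMARY KEY auto-assigns 1, 2, 3, ... in SQLite,
# standing in for H2's IDENTITY column
cur.execute("CREATE TABLE amp(row_id INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO amp(val) VALUES (?)",
                [("row%d" % i,) for i in range(10000)])

# indexed range scan: cost depends on the range size, not on how far
# "down" the table the range starts, unlike OFFSET/LIMIT
rows = cur.execute(
    "SELECT row_id, val FROM amp WHERE row_id BETWEEN 5000 AND 6500"
).fetchall()
print(len(rows), rows[0][0])  # 1501 5000
```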

I know there's a speed hit on INSERT when you have an index, because
you need to update the index; I also know indexing a large table takes
a long time. I'm just wondering which hurts more.

Thomas Mueller

Feb 20, 2010, 5:05:02 AM
to h2-da...@googlegroups.com
Hi,

> Do you recommend creating the index after data load, or before?

To be honest, I don't know. Probably, it depends on the data (I'm not
even sure about that). I suggest you try both variants.

Regards,
Thomas
