mixing Field across databases/tables

32 views
Skip to first unread message

Alok Menghrajani

unread,
Nov 12, 2015, 4:52:32 PM11/12/15
to jooq...@googlegroups.com
Hi,

I wrote some code to migrate data from one database to another. I
wanted my code to be generic, so I wrote something which boils down
to:

Result<R> result = src.selectFrom(srcTable).fetch();
Field[] fields = result.fields();
for (Record row : result) {
InsertSetStep insert = dst.insertInto(dstTable);
InsertSetMoreStep insert2 = null;
for (int i = 0; i < fields.length; i++) {
if (i == 0) {
insert2 = insert.set(fields[i], row.getValue(fields[i]));
} else {
insert2 = insert2.set(fields[i], row.getValue(fields[i]));
}
}
rows += insert2.onDuplicateKeyIgnore().execute();
}

(note: this code assumes that srcTable and dstTable have the same
schema and that they have more than one column.)

This almost works, except that the source table has converters (e.g.
timestamps are converted to long) while the destination table might
not apply the same converters (e.g. timestamps are stored as long).

Depending on the combination of source/destination database engines
and whether the converter is getting triggered or not, I ran into 3
rough edges. I understand mixing fields from one database or table
into a query for another database or table where the types don't
exactly match 1:1 is something crazy. I however have several
suggestions to improve things.

You can see my original code at
https://github.com/square/keywhiz/blob/alok/shadow-write/server/src/main/java/keywhiz/commands/SyncShadowDbCommand.java,
and [pom.xml: http://pastebin.com/aByUsb4F, App.java:
http://pastebin.com/h48wxtXd, TimestampConverter.java:
http://pastebin.com/bgrew4qt] is an attempt to isolate these issues.

The issues I ran into:
1. jooq generates invalid SQL ([insert into t2 (id, date) values
(cast(? as int), cast(? as timestamp))]). I can reproduce this when
going from h2 to h2 (full stack trace http://pastebin.com/f7H7zFnP)
2. the database engine detects that data is getting truncated and
raises an exception. Happens when going from h2 to mysql (stack trace
http://pastebin.com/j9yW9JsY)
3. data gets silently truncated. I end up with the year being stored
as the timestamp, i.e. (2015 instead of 1447364422). Happens when
going from postgres to mysql.

My suggestions to improve things:
1. given that a Field seems to be tied to a specific database/table,
the type system should not allow mixing them. I.e. the issues I ran
into at runtime could perhaps have been prevented statically.
2. the framework could check that a field came from another table and
figure out if a converter needs to be applied or not.
3. the framework could check that a field came from another table and
throw an exception instead of relying on the underlying database.

Thoughts? Also, could I have implemented moving data from one database
to another in some other way?

Alok
P.S. It's also unfortunate that InsertSetStep and InsertSetMoreStep
are two different things. I understand the reason (it prevents calling
execute() when no values have been set). Also, having a way to do
.set(String, Object) would have made things easier in this case, as I
could have simply grabbed the field's name.

Lukas Eder

unread,
Nov 17, 2015, 5:04:41 AM11/17/15
to jooq...@googlegroups.com
Hi Alok,

Thank you very much for your detailed report and sorry for my delay. I will comment inline

2015-11-12 22:52 GMT+01:00 Alok Menghrajani <al...@squareup.com>:
Hi,

I wrote some code to migrate data from one database to another. I
wanted my code to be generic, so I wrote something which boils down
to:

Result<R> result = src.selectFrom(srcTable).fetch();
Field[] fields = result.fields();
for (Record row : result) {
  InsertSetStep insert = dst.insertInto(dstTable);
  InsertSetMoreStep insert2 = null;
  for (int i = 0; i < fields.length; i++) {
    if (i == 0) {
      insert2 = insert.set(fields[i], row.getValue(fields[i]));
    } else {
      insert2 = insert2.set(fields[i], row.getValue(fields[i]));
    }
  }
  rows += insert2.onDuplicateKeyIgnore().execute();
 

(note: this code assumes that srcTable and dstTable have the same
schema and that they have more than one column.)

This almost works, except that the source table has converters (e.g.
timestamps are converted to long) while the destination table might
not apply the same converters (e.g. timestamps are stored as long).

Hmm, indeed, this approach won't work in general. You shouldn't use a field from srcTable in an INSERT statement into dstTable. As long as field names match, what you can do instead is this:

insert.set(dstTable.field(fields[i]), row.getValue(fields[i]));

By using Table.field(Field) (http://www.jooq.org/javadoc/latest/org/jooq/TableLike.html#field-org.jooq.Field-), you can search for a given field in another table, by its unqualified name.

Depending on the combination of source/destination database engines
and whether the converter is getting triggered or not, I ran into 3
rough edges. I understand mixing fields from one database or table
into a query for another database or table where the types don't
exactly match 1:1 is something crazy. I however have several
suggestions to improve things.

You can see my original code at
https://github.com/square/keywhiz/blob/alok/shadow-write/server/src/main/java/keywhiz/commands/SyncShadowDbCommand.java,
and [pom.xml: http://pastebin.com/aByUsb4F, App.java:
http://pastebin.com/h48wxtXd, TimestampConverter.java:
http://pastebin.com/bgrew4qt] is an attempt to isolate these issues.

So, your application expires in 2038, I suspect :) (with Integer timestamps)
 
The issues I ran into:
1. jooq generates invalid SQL ([insert into t2 (id, date) values
(cast(? as int), cast(? as timestamp))]). I can reproduce this when
going from h2 to h2 (full stack trace http://pastebin.com/f7H7zFnP)
2. the database engine detects that data is getting truncated and
raises an exception. Happens when going from h2 to mysql (stack trace
http://pastebin.com/j9yW9JsY)
3. data gets silently truncated. I end up with the year being stored
as the timestamp, i.e. (2015 instead of 1447364422). Happens when
going from postgres to mysql.

My suggestions to improve things:
1. given that a Field seems to be tied to a specific database/table,
the type system should not allow mixing them. I.e. the issues I ran
into at runtime could perhaps have been prevented statically.

I wish it could. I've been thinking about leveraging TableField<R, T> for this and allowing to specify only columns from the INSERT's target table to be specified with the INSERT, not from any "srcTable". This won't play well with a variety of use-cases, though, including plain SQL where we don't have a TableField, but just a Field...

Given your experience with this particular use-case, do you perhaps have an idea of how this kind of mistake could be prevented statically?
 
2. the framework could check that a field came from another table and
figure out if a converter needs to be applied or not.

The field[i] converter is indeed applied on the set() call, but it's the wrong converter in your case, the one you passed to the set() method, not the one in from the target table.
 
3. the framework could check that a field came from another table and
throw an exception instead of relying on the underlying database.

I don't think we should, for these reasons:

1) We like to rely on the underlying database. There are many things that would be extremely hard to check in the client, and very easy to check in the database. Many things in jOOQ work rather well, because we don't eagerly validate user input. I'm aware of competing frameworks that have tons of clever validators built in. Their user group is filled with 50% complaints about those validators' unexpected behaviour.
2) Let's assume you have a table A and a view V AS SELECT * FROM A WHERE [ some predicate ]. You now want to insert into the view V but use A's columns for some reason. Should we prevent that? I don't know... Perhaps we should in your particular case, but perhaps if that's possible in a different setup, it would be awesome, too.
3) Again, using plain SQL might be a use-case here. What if we generate a table A but only with columns COL1, COL2. The real table A also has COL3, but that's omitted from the code generator. In only a single case, we do want to insert into COL3, so we're resorting to plain SQL just for that column.

Preventing these things via validation is very tricky and probably breaks a lot of nice use-cases. However, if many people run into the same issues, I'm very open to improve the API to prevent such issues.

Thoughts? Also, could I have implemented moving data from one database
to another in some other way?

I would use the loader API and insert the whole result in one go. It's probably much faster, as you can fine-tune on three dimensions: bulk size, batch size, commit size:

The manual is outdated, I've just seen. In jOOQ 3.7, we've added support for loading Iterable<? extends Record> for use-cases similar to yours:

Alok
P.S. It's also unfortunate that InsertSetStep and InsertSetMoreStep
are two different things. I understand the reason (it prevents calling
execute() when no values have been set).

I agree that this is tedious. The DSL API is designed to mimick the SQL syntax in Java. We get a lot of requests to simplify its usage for dynamic SQL.

Interestingly, we do have a better API for dynamic SQL (the "model" API), which you can use via DSLContext.insertQuery(). For some reason, people prefer not to use that, because they like the SQL syntax mimicking of the DSL API more. It's a catch-22 situation :-) (which we're hoping to resolve better in jOOQ 4.0)
 
Also, having a way to do
.set(String, Object) would have made things easier in this case, as I
could have simply grabbed the field's name.

You can, but with an extra step: By wrapping the field's name in DSL.field(DSL.name(fieldName)). But there's certainly room for more convenience API, here. I've registered a feature request for this:

Thanks again for your detailed report and for your time. I hope I could answer some of your questions / concerns. If you have any further issues, just let me know, and I'll be very happy to help.

Best Regards,
Lukas
 

Alok Menghrajani

unread,
Nov 30, 2015, 1:41:17 PM11/30/15
to jooq...@googlegroups.com
Thanks for your response Lukas.

On Tue, Nov 17, 2015 at 2:04 AM, Lukas Eder <lukas...@gmail.com> wrote:
>
> 2015-11-12 22:52 GMT+01:00 Alok Menghrajani <al...@squareup.com>:
>>
>
> Hmm, indeed, this approach won't work in general. You shouldn't use a field
> from srcTable in an INSERT statement into dstTable. As long as field names
> match, what you can do instead is this:
>
> insert.set(dstTable.field(fields[i]), row.getValue(fields[i]));
>
>
> By using Table.field(Field)
> (http://www.jooq.org/javadoc/latest/org/jooq/TableLike.html#field-org.jooq.Field-),
> you can search for a given field in another table, by its unqualified name.

This seems to work! It wasn't clear to me that passing a Field to
field() would work, so I converted to/from a String which worked. The
javadoc says "Get a specific field from this table.", the doc for Row
turns out to be a little more clear: "...However, if this is a row
from an aliased table, the field will be aliased accordingly.".

>> Depending on the combination of source/destination database engines
>> and whether the converter is getting triggered or not, I ran into 3
>> rough edges. I understand mixing fields from one database or table
>> into a query for another database or table where the types don't
>> exactly match 1:1 is something crazy. I however have several
>> suggestions to improve things.
>>
>> You can see my original code at
>>
>> https://github.com/square/keywhiz/blob/alok/shadow-write/server/src/main/java/keywhiz/commands/SyncShadowDbCommand.java,
>> and [pom.xml: http://pastebin.com/aByUsb4F, App.java:
>> http://pastebin.com/h48wxtXd, TimestampConverter.java:
>> http://pastebin.com/bgrew4qt] is an attempt to isolate these issues.
>
>
> So, your application expires in 2038, I suspect :) (with Integer timestamps)

It's called job security :P

>> The issues I ran into:
>> 1. jooq generates invalid SQL ([insert into t2 (id, date) values
>> (cast(? as int), cast(? as timestamp))]). I can reproduce this when
>> going from h2 to h2 (full stack trace http://pastebin.com/f7H7zFnP)
>> 2. the database engine detects that data is getting truncated and
>> raises an exception. Happens when going from h2 to mysql (stack trace
>> http://pastebin.com/j9yW9JsY)
>> 3. data gets silently truncated. I end up with the year being stored
>> as the timestamp, i.e. (2015 instead of 1447364422). Happens when
>> going from postgres to mysql.
>>
>> My suggestions to improve things:
>> 1. given that a Field seems to be tied to a specific database/table,
>> the type system should not allow mixing them. I.e. the issues I ran
>> into at runtime could perhaps have been prevented statically.
>
>
> I wish it could. I've been thinking about leveraging TableField<R, T> for
> this and allowing to specify only columns from the INSERT's target table to
> be specified with the INSERT, not from any "srcTable". This won't play well
> with a variety of use-cases, though, including plain SQL where we don't have
> a TableField, but just a Field...
>
> Given your experience with this particular use-case, do you perhaps have an
> idea of how this kind of mistake could be prevented statically?

I don't understand why the plain SQL case prevents the use of
TableField<R, T>. Can't the plain SQL code return a TableField<R,
Table<R>> or TableField<R, Object>?

>> 2. the framework could check that a field came from another table and
>> figure out if a converter needs to be applied or not.
>
>
> The field[i] converter is indeed applied on the set() call, but it's the
> wrong converter in your case, the one you passed to the set() method, not
> the one in from the target table.
>
>>
>> 3. the framework could check that a field came from another table and
>> throw an exception instead of relying on the underlying database.
>
>
> I don't think we should, for these reasons:
>
> 1) We like to rely on the underlying database. There are many things that
> would be extremely hard to check in the client, and very easy to check in
> the database. Many things in jOOQ work rather well, because we don't eagerly
> validate user input. I'm aware of competing frameworks that have tons of
> clever validators built in. Their user group is filled with 50% complaints
> about those validators' unexpected behaviour.
> 2) Let's assume you have a table A and a view V AS SELECT * FROM A WHERE [
> some predicate ]. You now want to insert into the view V but use A's columns
> for some reason. Should we prevent that? I don't know... Perhaps we should
> in your particular case, but perhaps if that's possible in a different
> setup, it would be awesome, too.

Agree.

> 3) Again, using plain SQL might be a use-case here. What if we generate a
> table A but only with columns COL1, COL2. The real table A also has COL3,
> but that's omitted from the code generator. In only a single case, we do
> want to insert into COL3, so we're resorting to plain SQL just for that
> column.
>
> Preventing these things via validation is very tricky and probably breaks a
> lot of nice use-cases. However, if many people run into the same issues, I'm
> very open to improve the API to prevent such issues.

I will let you know if I run into other similar cases. Again, moving
data from one db/table to another is a rare thing so having rough
edges is probably fine. A middle ground between throwing exceptions
and silently truncating data could be having some kind of warning log
messages.

>> P.S. It's also unfortunate that InsertSetStep and InsertSetMoreStep
>> are two different things. I understand the reason (it prevents calling
>> execute() when no values have been set).
>
>
> I agree that this is tedious. The DSL API is designed to mimick the SQL
> syntax in Java. We get a lot of requests to simplify its usage for dynamic
> SQL.
>
> Interestingly, we do have a better API for dynamic SQL (the "model" API),
> which you can use via DSLContext.insertQuery(). For some reason, people
> prefer not to use that, because they like the SQL syntax mimicking of the
> DSL API more. It's a catch-22 situation :-) (which we're hoping to resolve
> better in jOOQ 4.0)

I'll give it a shot next time, I had completely forgotten about it.

>> Also, having a way to do
>> .set(String, Object) would have made things easier in this case, as I
>> could have simply grabbed the field's name.
>
>
> You can, but with an extra step: By wrapping the field's name in
> DSL.field(DSL.name(fieldName)). But there's certainly room for more
> convenience API, here. I've registered a feature request for this:
> https://github.com/jOOQ/jOOQ/issues/4745

DSL.name is another method I didn't know about. Your reply is filled
with wonderful tips, thanks so much!


I have another issue which is related to this topic (the root cause
seems to be how the field converters are applied). We test our app
with 3 types of database (postgres, mysql, h2). In theory, things
should work with every database that jOOQ supports.

I use converters to handle three cases in a database agnostic way:
- mysql doesn't have a boolean type, so we map tinyint to java.lang.Boolean.
- we use bigint to store timestamps in mysql but for legacy reasons
postgres uses timestamp.
- I ran into some issues with postgres, bigint and auto increments. As
a result, the primary keys are int in postgres but bigint in mysql.

My pain point is that jooq's codegen depends on the database (because
of the converters, right?). This means we need to publish a .jar for
each database type, which is a little painful. We could tell people to
compile the code from source (it's an open source project), but that
adds friction. It also means they would need to setup the right
converters for their database engine.

Do you have any suggestions for distributing our app in a way which
would work with all databases?

We use flyway to handle schema migrations. The flyway files are unique
to each database type because of syntax/quirks. Perhaps some
meta-schema which then gets converted to the actual schema could
result in a generic codegen and a generic flyway?

Alok

Lukas Eder

unread,
Dec 1, 2015, 5:08:43 AM12/1/15
to jooq...@googlegroups.com
Hi Alok,

2015-11-30 19:41 GMT+01:00 Alok Menghrajani <al...@squareup.com>:
> By using Table.field(Field)
> (http://www.jooq.org/javadoc/latest/org/jooq/TableLike.html#field-org.jooq.Field-),
> you can search for a given field in another table, by its unqualified name.

This seems to work! It wasn't clear to me that passing a Field to
field() would work, so I converted to/from a String which worked. The
javadoc says "Get a specific field from this table.", the doc for Row
turns out to be a little more clear: "...However, if this is a row
from an aliased table, the field will be aliased accordingly.".

Good point. We'll improve that:
 

--
You received this message because you are subscribed to the Google Groups "jOOQ User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jooq-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Lukas Eder

unread,
Dec 1, 2015, 5:43:26 AM12/1/15
to jooq...@googlegroups.com
Hi Alok

(who ever thought that keyboard shortcuts in browsers were a good idea for E-Mail programs... o_O)

2015-11-30 19:41 GMT+01:00 Alok Menghrajani <al...@squareup.com>:
> By using Table.field(Field)
> (http://www.jooq.org/javadoc/latest/org/jooq/TableLike.html#field-org.jooq.Field-),
> you can search for a given field in another table, by its unqualified name.

This seems to work! It wasn't clear to me that passing a Field to
field() would work, so I converted to/from a String which worked. The
javadoc says "Get a specific field from this table.", the doc for Row
turns out to be a little more clear: "...However, if this is a row
from an aliased table, the field will be aliased accordingly.".

Good point. We'll improve that (and potentially other similar API methods)
 
> I wish it could. I've been thinking about leveraging TableField<R, T> for
> this and allowing to specify only columns from the INSERT's target table to
> be specified with the INSERT, not from any "srcTable". This won't play well
> with a variety of use-cases, though, including plain SQL where we don't have
> a TableField, but just a Field...
>
> Given your experience with this particular use-case, do you perhaps have an
> idea of how this kind of mistake could be prevented statically?

I don't understand why the plain SQL case prevents the use of
TableField<R, T>. Can't the plain SQL code return a TableField<R,
Table<R>> or TableField<R, Object>?

The advantage of plain SQL is to be very easy to write. If people were forced to provide formally correct table meta data every time they wanted to write up a quick column expression as a string, that would make this part of jOOQ rather hard to use.

Or, more specifically: What's <R> in the case of plain SQL? And what would TableField.getTable() return?

> 3) Again, using plain SQL might be a use-case here. What if we generate a
> table A but only with columns COL1, COL2. The real table A also has COL3,
> but that's omitted from the code generator. In only a single case, we do
> want to insert into COL3, so we're resorting to plain SQL just for that
> column.
>
> Preventing these things via validation is very tricky and probably breaks a
> lot of nice use-cases. However, if many people run into the same issues, I'm
> very open to improve the API to prevent such issues.

I will let you know if I run into other similar cases. Again, moving
data from one db/table to another is a rare thing

I have the full picture of all Stack Overflow Q&As, public / private messages about jOOQ, and I dare say I disagree :)
It's certainly rarer than "classic" insertion, but it's not as rare as you might think.
 
so having rough edges is probably fine.

It is never fine :) 
I'll trade usability for type safety any time - and I think we're discussing on a high level already. But usability is what jOOQ is loved for, and what JPA's Criteria API is dreaded for.

Earlier versions of jOOQ (1.x) made extensive use of TableField. That was a huge mistake, and a lot of features today wouldn't be possible if we still had that old jOOQ 1.x API.

A middle ground between throwing exceptions
and silently truncating data could be having some kind of warning log
messages.

But how to *know* that this was really a mistake that should be logged? I'm afraid of logging false positives, which will again result in more work / discussions...

I have another issue which is related to this topic (the root cause
seems to be how the field converters are applied). We test our app
with 3 types of database (postgres, mysql, h2). In theory, things
should work with every database that jOOQ supports.

I use converters to handle three cases in a database agnostic way:
- mysql doesn't have a boolean type, so we map tinyint to java.lang.Boolean.
- we use bigint to store timestamps in mysql but for legacy reasons
postgres uses timestamp.
- I ran into some issues with postgres, bigint and auto increments. As
a result, the primary keys are int in postgres but bigint in mysql.

Curious, what are those issues in PostgreSQL?
 
My pain point is that jooq's codegen depends on the database (because
of the converters, right?).
This means we need to publish a .jar for
each database type, which is a little painful. We could tell people to
compile the code from source (it's an open source project), but that
adds friction. It also means they would need to setup the right
converters for their database engine.

Do you have any suggestions for distributing our app in a way which
would work with all databases?

Converters are really very simple tools to convert from <T> to <U> and vice versa. You could implement data type bindings that take care of binding the user type <U> (which is always the same) to a database-specific type <T> or even <?> (which can differ at runtime). An example is given here:

It's a bit more work per data type, but it will make your use-case possible. Let me know if you have any questions related to this.
 
We use flyway to handle schema migrations. The flyway files are unique
to each database type because of syntax/quirks. Perhaps some
meta-schema which then gets converted to the actual schema could
result in a generic codegen and a generic flyway?

I can't really comment on "generic flyway" (I don't think that's possible), but you can certainly use database-agnostic code generation with jOOQ, e.g. by using the XMLDatabase. An example is given in this blog post:

Essentially, you need to bring your schema meta information into this form:

You can also implement your own jOOQ-meta Database to provide the code generator with meta information, if that's easier for you.

Hope this helps,
Lukas
Reply all
Reply to author
Forward
0 new messages