How to migrate to UUID references in db without messing up reference fields


Ian W. Scott

May 14, 2014, 11:44:43 AM
to web...@googlegroups.com
I need to take an existing db and implement a UUID referencing system so that I can sync local db's with a central remote version. But I'm concerned that this will break reference fields that refer to the newly synced rows.

My understanding is that the UUID field is necessary because a csv import will assign different row ids to the new entries in the target db than the ones they had in the source db (especially if new records have been added to the target db in the meantime). The UUID is supposed to overcome this, by allowing the db to recognize that rows are equivalent even if they have different ids. But in that case, won't reference fields in other tables often be pointing to the wrong ID number in the target db? (i.e., they'll keep the row ID of the db version in which they were created, and if this ID changes in the target db they will then be referencing a different record.)

Sorry if this explanation is overly complicated. I'm just trying to get things clear in my own mind.

Thanks

Ian W. Scott

May 24, 2014, 12:04:10 AM
to
After 10 days I've received no help on this. Is there something about the question that is inappropriate? Is my question unclear?

Massimo Di Pierro

May 24, 2014, 3:08:58 AM
to web...@googlegroups.com
Hello Ian, Sorry we overlooked your email.

You can easily add a UUID field:

from gluon.utils import web2py_uuid
db.define_table('person', Field('name'),
                Field('uuid', compute=lambda row: web2py_uuid()))

You can also create a table that references that field:

db.define_table('thing', Field('name'),
                Field('owner', requires=IS_IN_DB(db, 'person.uuid', 'name')))

Except the reference will not be enforced at the DB level, only at the web2py level.
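To see why linking through the UUID survives re-import, here is a minimal standalone sketch using only the standard library's sqlite3 and uuid modules rather than web2py's DAL (the table and column names are illustrative, not from the thread):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, uuid TEXT UNIQUE);
    CREATE TABLE thing  (id INTEGER PRIMARY KEY, name TEXT, owner TEXT);
""")

# Insert a person with a generated UUID, then a thing whose 'owner'
# column stores the person's uuid instead of the integer row id.
person_uuid = str(uuid.uuid4())
conn.execute("INSERT INTO person (name, uuid) VALUES (?, ?)", ("Ian", person_uuid))
conn.execute("INSERT INTO thing (name, owner) VALUES (?, ?)", ("laptop", person_uuid))

# The join goes through the UUID, so it still resolves correctly even if
# person.id comes out different after a CSV round-trip to another instance.
row = conn.execute("""
    SELECT person.name FROM thing
    JOIN person ON person.uuid = thing.owner
    WHERE thing.name = 'laptop'
""").fetchone()
print(row[0])  # -> Ian
```

As Massimo notes, the trade-off is that the database itself no longer enforces the link; only the application layer does.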

Massimo




Philip Kilner

May 24, 2014, 4:02:37 AM
to web...@googlegroups.com
Hi Ian,

I see that Massimo has answered you, but on re-reading your question I
wasn't quite clear what you were asking. I have an application of my own
that needs to use UUIDs, so I thought it worth jumping in.

On 14/05/14 16:44, Ian W. Scott wrote:
> I need to take an existing db and implement a UUID referencing system so
> that I can sync local db's with a central remote version. But I'm
> concerned that this will break reference fields that refer to the newly
> synced rows.
>

Is this an existing web2py-generated db, or a legacy db? Am assuming the
former, but tell me if I'm wrong...

Do you have existing ID fields which you wish to /convert/ to UUIDs, so
that you are using UUIDs *instead* of IDs?

Or are the IDs exposed in the application, such that you wish to
/supplement/ them with a UUID field to support replication?


> My understanding is that the UUID field is necessary because a csv
> import will assign different row ids to the new entries in the target db
> than the ones they had in the source db (especially if new records have
> been added to the target db in the meantime). The UUID is supposed to
> overcome this, by allowing the db to recognize that rows are equivalent
> even if they have different ids. But in that case, won't reference
> fields in other tables often be pointing to the wrong ID number in the
> target db? (i.e., they'll keep the row ID of the db version in which
> they were created, and if this ID changes in the target db they will
> then be referencing a different record.)
>

It's the mention of CSV imports that I'm stuck on - is the intention to
use this as your replication mechanism, or are you saying that you are
importing data into each instance of the db, and need to preserve
imported IDs?

I may be misreading you, but I would very strongly recommend the use of
the db's existing native replication mechanism if at all possible. You
don't specify your back-end database, but we've had good results from
Postgres replication, FWIW.


> Sorry if this explanation is overly complicated. I'm just trying to get
> things clear in my own mind.
>

It does sound complicated, but I'm not 100% certain that I'm clearly
distinguishing between the outcome you want and your thoughts as to how
to achieve it.

FWIW, I have a similar issue, and what I would like to do is use UUID
PKs throughout in web2py, and use the db's *native* replication
mechanism, which should work OOTB with UUID PKs.

Am I right in thinking that this approach would solve your
ID/replication problem, but that your situation is complicated by the
fact that your starting point is an existing app with IDs?

Apologies for bombarding you with questions!


--

Regards,

PhilK


'a bell is a cup...until it is struck'

Ian W. Scott

May 24, 2014, 11:46:33 AM
to web...@googlegroups.com
Thanks, Phil. I think you've understood my problem fairly well. If I'm unclear it's partly because I'm new to db mechanics, but I'll try to lay things out more clearly:

My existing setup:
  • I have two existing instances of the same web2py application
  • Currently the models just use web2py's native ID for referencing.
  • I have a lot of *existing* records with reference fields that use those ID numbers.
  • My back-end is SQLite.
My goal:
  • I need to set up a way to synchronize the database information between the two instances (not just replicate an identical db instance).
  • It is quite likely that different data may be created in *both* instances between synchronizations.
My partial solution:
  • I wanted to allow either instance to draw in new (or updated) data from the other via a csv file (generated by the other instance).
  • I was focusing on this approach because it's the only one suggested in the web2py manual for syncing db instances.
  • Adding a UUID field to the existing model would allow me to recognize duplicate records accurately.
My problem with this solution:
  • I have a whole lot of reference fields in multiple tables that use the current ID numbers. When I start syncing based on UUID, these ID references will be broken on one or the other app instance.
Possible workarounds?
  1. I could follow the approach in the web2py manual (UUID fields, csv as the vehicle for getting new data) and migrate all of my reference fields to use the UUID field instead of ID. I'm a bit nervous about the potential for data corruption and errors in executing such a large-scale migration with my own migration script. This would also mean my reference fields are (as you say) only enforced at the web2py level, not at the db (sqlite) level. I think it would also limit my use of web2py's model features (e.g., list-reference fields would have to become simple list-string fields?).
  2. I could move the whole db to use UUIDs for the PK of each table. I much prefer this idea. It sounds more elegant and would allow me to continue using my current model structure (with regular reference fields, list-reference fields, etc.). But I have no idea how to go about setting this up, either on the sqlite end (can sqlite use UUIDs as PKs?) or on the web2py end.
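On the question in workaround 2, "can sqlite use UUIDs as PKs?": SQLite has no native UUID type, but a TEXT primary key holding a UUID string works fine at the database level, as this stdlib-only sketch shows (table names are illustrative). The separate question is the web2py end: the stock DAL assumes an auto-incrementing integer `id` field, so this layout would likely need the DAL's keyed-table (`primarykey=`) support, which has its own limitations.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# SQLite happily accepts a TEXT primary key; we store UUID strings in it.
conn.executescript("""
    CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE thing  (id TEXT PRIMARY KEY, name TEXT,
                         owner TEXT REFERENCES person(id));
""")

pid = str(uuid.uuid4())
conn.execute("INSERT INTO person VALUES (?, ?)", (pid, "Ian"))
conn.execute("INSERT INTO thing VALUES (?, ?, ?)",
             (str(uuid.uuid4()), "laptop", pid))

# Reference fields now carry the UUID itself, so they survive any sync
# that preserves primary keys verbatim.
owner_name = conn.execute(
    "SELECT person.name FROM thing JOIN person ON person.id = thing.owner"
).fetchone()[0]
print(owner_name)  # -> Ian
```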

Any help would be greatly appreciated!

Ian

Ian W. Scott

May 24, 2014, 11:58:39 AM
to web...@googlegroups.com
Thanks Massimo. I explained my problem a bit better (I hope) in my reply to Philip below. There are a couple of reasons why the approach you suggest isn't ideal, from my point of view:
  1. I'm working with existing database instances (sqlite). So if I change all of the reference fields to use uuid connections as you suggest, I will have to perform a large-scale migration with a custom script. I'm nervous about the potential for error here.
  2. This approach would prevent my use of some great web2py dal features (reference fields, list-reference fields, recursive selects, cascading deletes, etc.). I'd really rather not have to sacrifice so much web2py functionality just to synchronize two db instances.

I wonder whether this doesn't point to a potential area for improvement in web2py. It seems like a framework like this would benefit from having an easier, built-in solution for synchronizing db instances without breaking dal features. So if you can think of a way to facilitate synchronization that's more elegant, I'd be happy to contribute back any code I write to implement it.

Thanks again,

Ian

Massimo Di Pierro

May 26, 2014, 11:49:20 AM
to web...@googlegroups.com
I now understand your problem better.

Have you looked into these?

db.export_to_csv_file(file)
db.import_from_csv_file(file,id_map={})

When importing from the file, the id_map will recognize records by their uuid (assuming they have one) and "fix" the references accordingly. So the references in the imported records will be different but will point to the right record. This may fail if there are circular references, because it would be unable to build the references. Moreover, it requires comparing the entire db and cannot be used on a single table.
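A rough, plain-Python illustration of the id_map idea (not web2py's actual implementation; all names and data are made up): a first pass matches incoming rows to target rows by uuid, recording an old-id to new-id map, and a second pass rewrites reference fields through that map.

```python
# Rows exported from the source db: their ids, uuids, and a reference
# field ('owner') that still uses the *source* db's row ids.
people_in = [
    {"id": 1, "uuid": "aaa", "name": "Ian"},
    {"id": 2, "uuid": "bbb", "name": "Phil"},
]
things_in = [{"id": 1, "name": "laptop", "owner": 1}]

# The target db already holds uuid 'aaa', but under a different row id.
target_people = {7: {"uuid": "aaa", "name": "Ian"}}

id_map = {}  # source person.id -> target person.id
next_id = max(target_people) + 1

# Pass 1: match on uuid; insert rows the target has never seen.
for row in people_in:
    existing = next((tid for tid, r in target_people.items()
                     if r["uuid"] == row["uuid"]), None)
    if existing is None:
        target_people[next_id] = {"uuid": row["uuid"], "name": row["name"]}
        existing, next_id = next_id, next_id + 1
    id_map[row["id"]] = existing

# Pass 2: rewrite reference fields through the map.
things_out = [dict(t, owner=id_map[t["owner"]]) for t in things_in]
print(id_map)      # -> {1: 7, 2: 8}
print(things_out)  # 'owner' now points at the target's row id 7
```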

I agree a better solution would be desirable.

Massimo 

Philip Kilner

May 27, 2014, 12:42:49 PM
to web...@googlegroups.com
Hi Ian,

I've spent some time looking at available options, and I think I can
solve my own issues without using UUIDs at all.

I really wanted to continue to use integer IDs, partly because in some
cases these fields are exposed to my users as order/document/transaction
IDs. I also wanted to stay as close to web2py's default way of working,
and use the database platform's own continuous replication, rather than
doing it in a batch.

Using Postgres, the ID fields are populated by a generator, and
changing the generator DDL to prevent clashing IDs should work for me,
as there will only ever be a small number of instances of the database.

It's actually very easy to do - you can interleave the values, so if you
had two instances of the database, you could have one issue odd nos. and
the other even: -

Instance 1: -

START WITH 1 INCREMENT BY 2

Instance 2: -

START WITH 2 INCREMENT BY 2

Alternatively, you could assign a range to each instance: -

Instance 1: -

START WITH 1 INCREMENT BY 1
MAXVALUE 99999999

Instance 2: -

START WITH 100000000 INCREMENT BY 1
MAXVALUE 199999999

I don't know how useful this will be to you if you are using SQLite, but
you may find that Postgres is a more robust platform for replication, so
it could be worth looking into in any event.

Ian W. Scott

Jun 3, 2014, 1:07:21 PM
to web...@googlegroups.com
Thanks very much Massimo and Philip. I had missed the fact that the id_map parameter triggers smart updating of the reference fields. That's great, and for the moment it solves my problem.

I'll keep thinking about a longer-term solution for syncing. Here's what I'm thinking would be desirable:

  1. 2-way intelligent sync (look at whatever update-time info is available in the table), maybe using websockets
  2. automated backup of the original data for rollback if necessary
  3. can be performed on one table
  4. reference field ids are updated on both ends as necessary
  5. no extra fields need be added to db model
  6. can easily be scheduled or triggered by user activity on one end
In the long run this is something I'll need, so if I make any progress I'll let you know.
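As a starting point for item 1 in the list above, a last-write-wins merge of one table keyed by uuid might look like the following sketch (pure Python with made-up data; the timestamps stand in for whatever update-time info the table carries):

```python
def merge(a, b):
    """Last-write-wins merge of two {uuid: (modified_on, data)} tables.
    Covers only item 1 of the wish list: real code would also back up the
    originals first (item 2) and push the merged rows back to both
    instances, remapping reference ids as needed (item 4)."""
    merged = dict(a)
    for key, (ts, data) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, data)
    return merged

site_a = {"aaa": (10, "Ian"), "bbb": (5, "Phil")}
site_b = {"aaa": (12, "Ian S."), "ccc": (7, "Massimo")}
print(merge(site_a, site_b))
# -> {'aaa': (12, 'Ian S.'), 'bbb': (5, 'Phil'), 'ccc': (7, 'Massimo')}
```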

Ian