Search & replace an object in the database

38 views
Skip to first unread message

jools

unread,
Jan 21, 2022, 10:05:58 AM1/21/22
to Django users
Dear Django users,

here’s an interesting case and I’m curious whether somebody can point me in the right direction.

Here’s the problem:
My company has a database of customers. Over the years, many duplicates have been created. For cleaning up duplicates, I’d like to have a search-and-replace functionality.

I.e.: Replace all references in the database to customer <old> with customer <new>.
The database schema has quite a bit of complexity, so I’m aiming to find a very generic solution.

Approach:
Django has a bit of functionality to find all references to an existing object, namely django.db.models.deletion.Collector and I’m using it to find all references.

Though, the “replace” logic seems quite hard to get right:
  • It has to keep parent links intact
  • It has to recognize references in parent models (Customer model is derived from the concrete model Actor)
  • It has to recognize generic relations, built with Django’s content types

My stub implementation comes below.

  • Has anybody else implemented sth like this, e.g. in helper library?
  • Do you think the approach is right?
  • What would you differently?
  • Any caveats that you know of?
Best regards
Jools

Stub implementation:

from django.db.models.deletion import Collector, ProtectedError, RestrictedError


def replace_object(old_obj, new_obj, using='default'):
    """
    Replace an old object with a new object throughout the database.
    """
    collector = Collector(using)

    try:
        collector.collect([old_obj])
    except ProtectedError:
                pass
    except RestrictedError:
                pass

    for model, obj in collector.instances_with_model():
        for field_opts in obj._meta.get_fields():
            if __is_qualified(field_opts, new_obj):
                setattr(obj, field_opts.name, new_obj)
                obj.save()

def __is_qualified(field_opts, new_obj):
    if not hasattr(field_opts, 'related_model'):
        return False
   
    # This check is wrong for multi-table inheritance, I suppose?
    if field_opts.related_model != new_obj.__class__:
        return False
    if hasattr(field_opts, 'parent_link') and field_opts.parent_link is True:
        return False

    return True

jools

unread,
Jan 21, 2022, 10:44:17 AM1/21/22
to Django users
Here’s an update:

I sticked to the idea of using collector; it categorizes models from a deletion perspective into:
  • fast_deletes - models that can be easily deleted, without checking for other relations
  • update_fields - field that have SET_NULL, i.e. relations to the customer objects that are only updated
  • … all the rest
So I kept the structure from Collector, looping the three categories mentioned above. It works ok; for each instance, you’ll have the check instance._meta.get_fields() to find the field related to the old customer record.
Fixing GenericRelations require the most code, because you’ll have to find the content type, object id (and maybe content object) fields that are related to the old customer record.

Moreover, you’ll have to be aware that you may violate foreign key constraints when replacing the old id.

My implementation is attached, if it’s interesting for anybody.

Have a nice weekend y’all
jools
base.py

Yeboah Dominic

unread,
Jan 21, 2022, 2:43:03 PM1/21/22
to django...@googlegroups.com
Okay bro try this script hope it works

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/4cd485cd-cc20-481d-a7be-3179bcc5c98bn%40googlegroups.com.
deleteDB.py

Bernard Mallala

unread,
Jan 21, 2022, 3:16:35 PM1/21/22
to Django users
I would approach this differently and from the database side mostly
  1. Backup the database so you can restore if something happens
  2. In Django, Create an an empty copy of the Customer database e.g CustomerB
  3. Using a SQL query tool, select all duplicate copies from Customer and insert into CustomerB
  4. If needed, update the model for CustomerB so that you are able to associate or reference every duplicate in CustomerB with the right record in Customer
  5. Using a SQL query too delete all duplicate copies of Customers from Customer table
  6. If this breaks your application in any way, account for that by ensuring that every duplicate record in CustomerB has a primary record in Customer that suffices and returns results. This will ensure that all relations with other models and records are honored.
  7. Update or fix, both models to ensure that duplicates can no longer be inserted
  8. Perform any other schema updates

Thomas Lockhart

unread,
Jan 21, 2022, 5:37:11 PM1/21/22
to django...@googlegroups.com
Yes this exactly. A couple of possible additions:

I’ve had great luck using python and SQL queries to manipulate a database including converting from other schemas and merging duplicate records.

Make a test database from your backup, and beat it up with your scripted code until you are happy with the result. Just keep deleting the test database and reconstructing it from the backup to try again. The python code can do things easily such as sucking in all records, looking for overlaps on various fields, and merging the results.

If this is not a one-time thing, then you might consider doing something similar through the admin interface, but breaking the problem down into separate phases using intermediate models. Then you could review each possible merge before going on to the next phase of actually doing a merge, or choosing to ignore a merge candidate. You could also do additional screening to, for example, select *which* customer record should be considered complete and which ones should be retired.

hth

- Tom

jools

unread,
Jan 24, 2022, 4:48:09 AM1/24/22
to Django users
Thanks! I’ll consider this.

The whole thing is supposed to be a tool for everyday work (although: a very dangerous one!), so the “fix the database once” strategy isn’t going to work for me.

I’ve added some tests for checking different relation types (m2m, fk, one-to-one, generic relation, parent-child links) and it seems to work fine; I tested using a bunch of complex records from production.

> If this is not a one-time thing, then you might consider doing something similar through the admin interface, but breaking the problem down into separate phases using intermediate models. Then you could review each possible merge before going on to the next phase of actually doing a merge, or choosing to ignore a merge candidate. You could also do additional screening to, for example, select *which* customer record should be considered complete and which ones should be retired.

That’s a good point. I think I’ll do something similar but more simple by providing an overview of all merges beforehand. This way, people will know what’s going to happen and have an opportunity to fix any issues upfront.

cheers, j
Reply all
Reply to author
Forward
0 new messages