Find duplicates with regex

32 views
Skip to first unread message

yakka...@gmail.com

unread,
Jan 25, 2015, 11:58:23 PM1/25/15
to django...@googlegroups.com
I posted this question to stackover flow but didn't get a good answer: http://stackoverflow.com/questions/28080545/django-find-duplicates-with-queryset-and-regex

I want to find duplicates in db fields with a regex.

I know I can use this to find duplicates:
self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

I have a model like this:
class company(models.Model):
   Website = models.URLField(blank=True, null=True )

The problem is that www and non-www websites are marked as different websites.  I want some thing that will return duplicates where it realizes www and non-www are the same website.

I know I can use a regex like this for www and non-www:
Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

Here is an example:
Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')
Company.objects.create(Website='http://example.org', Name='a')
Company.objects.create(Website='http://example.org', Name='b')

When I call: Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

It returns:
1.  http://example.org (from name=a) and http://example.org (from name=b)

This is missing that example.com and www.example.com are the same website and duplicates.

I want to use a regex so that I can tell django that example.com and www.example.com are the same websites.

I want the return to be:
1.  http://example.org (from name=a) and http://example.org (from name=b)
2.  example.com www.example.com
   

Stephen J. Butler

unread,
Jan 26, 2015, 12:20:53 AM1/26/15
to django...@googlegroups.com
The benefit to doing stuff like this in the database is that the
database can do it smarter than you can in Python. Is that true in
this case? I think not. If we were talking raw MySQL I'd suggest
running your Website column through something like REPLACE(Website,
'://www.', '') and then the count would work reasonably well.

But we're not talking raw MySQL. And if you want to use QuerySet I
assume you're still trying to stay database agnostic. That breaks it
down for me into two cases:

1. This is a query you'll be frequently and must run quickly. In that
case, I'd say to alter your model to include a normalized_website
field that is calculated from Website. Strip the 'www.' either in an
overridden save method or a pre_save signal. And don't use regexp for
this work; urlparse and urlunparse are much more appropriate (never
use a regexp to parse structured data when you can help it). Write a
data migration to handle the existing table. Then you can run your
query over normalized_website and things will work fine.

2. This is a query you'll run infrequently and doesn't need to run
quickly. In that case, just pull out all the records and do the
count/filter with Python code.
> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to django-users...@googlegroups.com.
> To post to this group, send email to django...@googlegroups.com.
> Visit this group at http://groups.google.com/group/django-users.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-users/452fad73-1319-4954-b004-7d0604705f30%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages