Removing Duplicates from a List


Julien

Apr 9, 2010, 6:19:56 AM
to Redis DB
hey,

We have a massive list: 3M items.

It has duplicates... and we want to nuke them. What is the best
strategy for this?

The only one I can think of is to dump the whole list, remove the duplicates
with a client, and repopulate the list. Even though it works, I'm
looking for something less "bold"!

Any idea?

Konstantin Merenkov

Apr 9, 2010, 6:24:48 AM
to redi...@googlegroups.com
Consider using a set or a sorted set.


--
Best Regards,
Konstantin Merenkov

Demis Bellot

Apr 9, 2010, 6:25:29 AM
to redi...@googlegroups.com
Sounds like you want to keep these items in a sorted set instead?

- Demis


Julien

Apr 9, 2010, 6:27:55 AM
to Redis DB
Sorry, a set is not an option: we use the list as a ring, and we need a
very fast pop/push mechanism (up to 60k RPOPLPUSH/sec at the
moment!).
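
(For reference, rotating a ring that way is an RPOPLPUSH with the same source and destination key, O(1) per call - a minimal redis-rb sketch, where "ring" is an assumed key name:)

# Rotate the ring by one position: pop from the tail, push onto the head.
item = @redis.rpoplpush("ring", "ring")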


Julien

Apr 9, 2010, 6:28:37 AM
to Redis DB
However, we could use sets on the side to remove the duplicates, if
that's the only option!


Julien

Apr 9, 2010, 6:33:31 AM
to Redis DB
I guess LREM key count value with a count of -2 would work, but it has O(N)
complexity... and since our list is pretty big, that's scary!
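
(A note on LREM semantics as a redis-rb sketch - "ring" and "some-value" are assumed names:)

# LREM key count value: count > 0 removes matches from head to tail,
# count < 0 removes from tail to head, count = 0 removes all occurrences.
# Every call scans the list, hence the O(N) cost.
@redis.lrem("ring", -2, "some-value")   # remove the last 2 occurrences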

Demis Bellot

Apr 9, 2010, 6:35:27 AM
to redi...@googlegroups.com
Honestly, for the best performance, I think you're best off just reading the entire list into a hash on the client, marking the values that need to be deleted, and then sending the appropriate remove commands.
Using a Redis sorted set would require much more traffic.
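
(A minimal sketch of that client-side approach with redis-rb, keeping the first occurrence of each value - "ring" is an assumed key name:)

# Dump the list once and count occurrences on the client.
counts = Hash.new(0)
@redis.lrange("ring", 0, -1).each { |v| counts[v] += 1 }

# For every duplicated value, remove the extra copies from the tail,
# which keeps the earliest occurrence. Each LREM is an O(N) scan.
counts.each do |value, n|
  @redis.lrem("ring", -(n - 1), value) if n > 1
end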

Of course you could always hassle Salvatore to implement the LREMDUPE op :)

- Demis


Demis Bellot

Apr 9, 2010, 6:41:33 AM
to redi...@googlegroups.com
Actually this is best served by a special kind of LIST (which we don't have yet). There was talk of a wish to mark a list as 'LIST CAPPED 1000' so that it becomes a rolling list, saving the redundant LTRIM op per LPUSH.
'LIST UNIQUE' would be another candidate, which would obviously be ideal for implementing idempotent message queues.
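
(For now the rolling list has to be emulated - a redis-rb sketch of the LPUSH + LTRIM idiom that 'LIST CAPPED' would make redundant, with "recent" and item as assumed names:)

@redis.pipelined { |pipeline|
  pipeline.lpush("recent", item)      # newest entry goes to the head
  pipeline.ltrim("recent", 0, 999)    # keep only the newest 1000 entries
}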

- Demis

Julien

Apr 9, 2010, 11:42:22 AM
to Redis DB
Demis,

Yes, I think the best option for us would be to have LREMDUPE.
The problem is that right now lists don't even have a command to delete
the item at a specific index.

I guess that for now the only option we have is to dump the whole
list, remove duplicates by hand and reload the list :(

Cheers,
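
(A sketch of that dump/dedupe/reload approach with redis-rb - "ring" is an assumed key name, and writers would have to be paused between the dump and the reload:)

items = @redis.lrange("ring", 0, -1)   # dump the whole list
deduped = items.uniq                   # first occurrence wins

@redis.pipelined { |pipeline|
  pipeline.del("ring")
  deduped.each { |v| pipeline.rpush("ring", v) }   # reload in the original order
}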


Demis Bellot

Apr 9, 2010, 11:48:22 AM
to redi...@googlegroups.com
I've come across that limitation as well, and am using a really lame hack with what's available: replace the element at the index (LSET) with a UUID/GUID, then LREM that UUID/GUID by value.
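
(A redis-rb sketch of that hack - the helper name ldel_at is made up here:)

require 'securerandom'

# Delete the element at a given index: overwrite it with a sentinel value
# that can't collide with real data, then remove that sentinel by value.
def ldel_at(redis, key, index)
  sentinel = SecureRandom.uuid
  redis.lset(key, index, sentinel)
  redis.lrem(key, 1, sentinel)
end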

- Demis


Marc Byrd

Apr 9, 2010, 12:20:05 PM
to redi...@googlegroups.com
# Only push k if we haven't seen it before; track seen values in a side set.
unless @redis.sismember(listSet, k)
  @redis.pipelined { |pipeline|
    pipeline.sadd(listSet, k)
    pipeline.lpush(listQ, k)
  }
end

Rather than removing dups, perhaps it's better to keep the list clean in the first place...

If it's a one-time deal to nuke the dups, and you don't want to lose the (circular) order, you could use RPOPLPUSH or RPOP, check SISMEMBER, then LPUSH - and use LLEN up front to know when you've gone all the way through...
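
(A sketch of that one-time pass with redis-rb, assuming writers are paused - "ring" and "seen" are assumed key names:)

len = @redis.llen("ring")       # snapshot the length before rotating
len.times do
  v = @redis.rpop("ring")       # take one element off the tail
  unless @redis.sismember("seen", v)
    @redis.sadd("seen", v)
    @redis.lpush("ring", v)     # keep this copy, preserving circular order
  end
end
@redis.del("seen")              # drop the scratch set when done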

m