Idiomatic way to remove duplicates in a slice

23914 views
Skip to first unread message

Sathish VJ

unread,
Aug 11, 2012, 10:33:54 PM8/11/12
to golan...@googlegroups.com
Is there an efficient, idiomatic way to remove duplicates in a slice?  
If possible, I do want to retain the original order of items.  

Hailiang

unread,
Aug 12, 2012, 12:47:40 AM8/12/12
to golan...@googlegroups.com
There is an implementation in a previous post "Set implementation benchmarking" by Karl Seguin, and I think it is simple enough and retain the order of elements:
func (this *OnDemandArraySet) RemoveDuplicates() {
  length := len(this.data) - 1
  for i := 0; i < length; i++ {
    for j := i + 1; j <= length; j++ {
      if (this.data[i] == this.data[j]) {
        this.data[j] = this.data[length]
        this.data = this.data[0:length]
        length--
        j--
      }
    }
  }
}

Hailiang

Paul Hankin

unread,
Aug 12, 2012, 2:43:20 AM8/12/12
to golan...@googlegroups.com
On Sunday, 12 August 2012 04:33:54 UTC+2, Sathish VJ wrote:
Is there an efficient, idiomatic way to remove duplicates in a slice?  
If possible, I do want to retain the original order of items.  

I'd do the obvious, and build a map of the things found so far when iterating over the slice, omitting ones that have already been found. Like this:

package main

import (
"fmt"
)

func RemoveDuplicates(xs *[]string) {
found := make(map[string]bool)
j := 0
for i, x := range *xs {
if !found[x] {
found[x] = true
(*xs)[j] = (*xs)[i]
j++
}
}
*xs = (*xs)[:j]
}

func main() {
xs := []string{"to", "be", "or", "not", "to", "be", "that"}
RemoveDuplicates(&xs)
fmt.Println(xs)
}


-- 
Paul

DisposaBoy

unread,
Aug 12, 2012, 3:11:39 AM8/12/12
to golan...@googlegroups.com


On Sunday, August 12, 2012 3:33:54 AM UTC+1, Sathish VJ wrote:
Is there an efficient, idiomatic way to remove duplicates in a slice?  
If possible, I do want to retain the original order of items.  

If your dataset isn't so large that you have to start to being paranoid about memory usage then the simplest thing I can think is to just use a map along-side the slice.
Two simple approaches are:

Michael Jones

unread,
Aug 12, 2012, 5:34:12 AM8/12/12
to DisposaBoy, golan...@googlegroups.com
These are all great answers. The "better" question seems less about slices and more about context--how do you know if there are duplicates (where 'how' stands for how many, how efficiently, how much extra storage, etc.)

The first answer is an "inverse insertion sort" that will use no extra memory and do O(n**2) comparisons. The other answers use Go's map feature and will be linear to logarithmic in n (depending on the hidden implementation and data sizes vs. table allocations.) This is a huge difference if you have 1,000,000 values to de-duplicate. 

Even so, the first way could* easily be faster for small n. If you know the range of n, the size of the slice's data type, and the available working storage, then you could properly decide between these two paths, since each is better in some cases.

Michael

*Pure conjecture...
--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Glenn Brown

unread,
Aug 12, 2012, 6:21:11 AM8/12/12
to Sathish VJ, Glenn Brown, golan...@googlegroups.com

> Is there an efficient, idiomatic way to remove duplicates in a slice?
> If possible, I do want to retain the original order of items.

This one runs in linear time: http://play.golang.org/p/HJEm6qy5wb
It uses a map to track which entries have been seen, so it only works with map-able types.

package main

import "fmt"

func removeDuplicates(a []int) []int {
result := []int{}
seen := map[int]int{}
for _, val := range a {
if _, ok := seen[val]; !ok {
result = append(result, val)
seen[val] = val
}
}
return result
}

func main() {
x := []int{9, 1, 2, 3, 5, 6, 9, 2, 3, 3, 6, 8, 1}
x = removeDuplicates(x)
fmt.Println(x)
}

Tim Harig

unread,
Aug 12, 2012, 7:16:36 AM8/12/12
to golang-nuts
On Sat, Aug 11, 2012 at 07:33:54PM -0700, Sathish VJ wrote:
> Is there an efficient, idiomatic way to remove duplicates in a slice?
> If possible, I do want to retain the original order of items.

It seems to me, that if you do not want duplication, don't allow
duplicates into the slice. Keep the data itself in a map or btree
structure that will make duplicates obvious as you are trying to
store them. You can then use a slice of pointers to the objects in
the map/btree to preserve your order if you really want to preserver
linearity.

If you need to represent duplication in your slice at some point, then
use a form of indirection that includes reference counting. Then all
you have to do, is remove any pointer in your slice to an object that
contains a reference count greater than one. This can be done in the
data type itself represented by an interface or by using an intermediate
smart pointer.

ali oygur

unread,
Dec 14, 2016, 6:41:28 AM12/14/16
to golang-nuts
there is good gist.

https://gist.github.com/alioygur/16c66b4249cb42715091fe010eec7e33

12 Ağustos 2012 Pazar 06:33:54 UTC+4 tarihinde Sathish VJ yazdı:

Michael Jones

unread,
Dec 14, 2016, 12:15:01 PM12/14/16
to ali oygur, golang-nuts

Do we know anything about the structure of the slice?

 

It is inherent in the general notion that there must be a comparison of a new item with existing ones to establish uniqueness. Even so, knowledge about the structure of the input can make this comparison easier.

 

If the input is already ordered, then testing against the prior element is a test against all elements.

 

If the input is integers in a small range then a bit mask presence test is good.

 

If avoiding allocation is important one could sort the list O(n log n) and then scan it O(n)

 

The hash (map/dictionary) approach is good, general, and slightly more space efficient with a slice of struct{} since presence testing is all that is necessary.

 

An O(n) scan to build a histogram would let you know about ranges and potential sections for subsequent processing.

 

Many well-known approaches.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ayan George

unread,
Dec 14, 2016, 12:36:01 PM12/14/16
to golan...@googlegroups.com
On 12/14/2016 12:14 PM, Michael Jones wrote:
> Do we know anything about the structure of the slice?
>
>

Just FYI -- this thread was started on 8/11/2012 and you, Michael, even
responded to it at the time.. I think it was pretty well hashed out
back then and I'm not sure why this was resurrected:

https://groups.google.com/forum/#!topic/golang-nuts/-pqkICuokio

-ayan

Michael Jones

unread,
Dec 14, 2016, 1:07:11 PM12/14/16
to Ayan George, golan...@googlegroups.com
Ha ha! Thought it seemed a familiar question. Sometimes they resurface and it seems kind to treat new ones fairly. I should have looked though. Thank you!

John Souvestre

unread,
Dec 14, 2016, 1:45:01 PM12/14/16
to golang-nuts

Ø  there is good gist.

 

Ø  this algorithm 10x faster …

 

I don’t believe it is faster as the slice size increases.  What you propose is O(N**2) vs O(N).  Try a test with a slice 10 or 100 times the size.

 

John

    John Souvestre - New Orleans LA

--

Dave Cheney

unread,
Dec 14, 2016, 2:49:37 PM12/14/16
to golang-nuts, alio...@gmail.com
If you slice implements sort.Sort, you can remove duplicates using Kevin Gillete's set package.

Reply all
Reply to author
Forward
0 new messages