distinct() on selected columns only?

2,355 views
Skip to first unread message

Thomas Hopper

unread,
Nov 12, 2017, 4:52:58 PM11/12/17
to manipulatr
I have a function that reads in data from an external source, and I'd like to eliminate rows that duplicate previous rows (i.e. by using unique() or distinct()), but I want to ignore selected columns in determining whether a row is a duplicate.

It seems like a standard eval version of distinct() should be able to handle this, but I'm stumped as to how to code it.

Here's a minimal example:

library(dplyr)
my_df
<- data_frame(a = c(1:3,1), b = c(1:3,1), c = letters[1:4])
df_names
<- names(my_df)

# doesn't throw an error, but doesn't produce the right results
my_df
%>% distinct(!! df_names[df_names != "c"])

which produces

# A tibble: 4 x 3
      a     b     c
 
<dbl> <dbl> <chr>
1     1     1     a
2     2     2     b
3     3     3     c
4     1     1     d


What I expect is that row 4 would be dropped, because columns a and b are being duplicated.

In my real data set, most of the column names are generally not known ahead of time, but due to some pre-processing, column "c" is always named the same, but may contain the wrong data; this is why I want to drop column "c" from the determination of whether or not the rows contain duplicate data.

Can this be done in dplyr?

Thank you,

Tom

Brandon Hurr

unread,
Nov 12, 2017, 7:18:32 PM11/12/17
to Thomas Hopper, manipulatr
https://stackoverflow.com/questions/46245527/how-to-keep-only-unique-rows-but-ignore-a-column

Tom, 

Looks like an answer there. Rather than do -c, you name a and b and keep_all. 

distinct(my_df, a, b, .keep_all = TRUE)

HTH, 
B

--
You received this message because you are subscribed to the Google Groups "manipulatr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to manipulatr+unsubscribe@googlegroups.com.
To post to this group, send email to manip...@googlegroups.com.
Visit this group at https://groups.google.com/group/manipulatr.
For more options, visit https://groups.google.com/d/optout.

Tom Hopper

unread,
Nov 12, 2017, 7:52:17 PM11/12/17
to Brandon Hurr, manipulatr
Brandon,

I don’t think that helps. I don’t know the names of the other columns, or even how many columns will be imported. I only know there’s a column named “c” that needs to be excluded from the check for uniqueness.

Thanks,

Tom
signature.asc

Thomas Hopper

unread,
Nov 13, 2017, 6:24:51 AM11/13/17
to manipulatr
To expand on my example, here's some code that does what I want; I'm just hoping that I can do it simply in dplyr:

library(dplyr)
my_df
<- data_frame(a = c(1:3,1), b = c(1:3,1), c = letters[1:4])


df_names
<- names(my_df)


# What I'd like:
#my_df %>% distinct(!! df_names[df_names != "c"])


# What works:
cols_to_check
<- which(df_names != "c")
temp_df
<- my_df[,cols_to_check]
rows_to_keep
<- !duplicated(temp_df)
my_df
<- my_df[rows_to_keep,]


Thank you,

Tom

Thomas Hopper

unread,
Nov 13, 2017, 7:56:10 AM11/13/17
to manipulatr
I had a typo in that; cols_to_check assignment should read:

cols_to_check <- which(colnames(df_names) != "c")

Hadley Wickham

unread,
Nov 13, 2017, 11:30:10 AM11/13/17
to Thomas Hopper, manipulatr
Two problems:

library(dplyr)
my_df <- data_frame(a = c(1:3,1), b = c(1:3,1), c = letters[1:4])
df_names <- names(my_df)

# First you need to convert the variable names to symbols. That's
# because distinct has mutate semantics so you were doing the
# equivalent of distinct(df, "a", "b")
vars <- rlang::syms(setdiff(df_names, "c"))

# Second, you need to use !!! to generate distinct(a, b) instead of
distinct(c(a, b))
my_df %>% distinct(!!!vars)

In the future, we'll try and do a better job of labelling function as
either using select() or mutate() semantics as this is quite important
when you start using tidy eval.

Hadley
> --
> You received this message because you are subscribed to the Google Groups
> "manipulatr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to manipulatr+...@googlegroups.com.
> To post to this group, send email to manip...@googlegroups.com.
> Visit this group at https://groups.google.com/group/manipulatr.
> For more options, visit https://groups.google.com/d/optout.



--
http://hadley.nz

Tom Hopper

unread,
Nov 13, 2017, 12:05:22 PM11/13/17
to manipulatr
Thank you, Hadley; that’s just what I was looking for!

Regards,

Tom
signature.asc

Brandon Hurr

unread,
Nov 13, 2017, 12:09:54 PM11/13/17
to Hadley Wickham, Thomas Hopper, manipulatr
Thanks for chiming in Hadley. I asked ROpenSci last night and had similar responses from Jenny and Romain. 

To functionalize it all: 
library(dplyr)
library(rlang)

df <- data_frame(a = c(1:3,1), b = c(1:3,1), c = letters[1:4])

my_distinct3 <- function(df, unsel) {
  df_names <- names(df)
  sel <- rlang::syms(setdiff(df_names, unsel))  
  
  df %>%
    dplyr::distinct(!!!sel, .keep_all = TRUE)
}

my_distinct3(df, "c")

# A tibble: 3 x 3
      a     b     c
  <dbl> <dbl> <chr>
1     1     1     a
2     2     2     b
3     3     3     c

I tried to figure out how to pass `c` without quotation, but wasn't able to get that to work. 

HTH, 
Brandon


> To post to this group, send email to manip...@googlegroups.com.
> Visit this group at https://groups.google.com/group/manipulatr.
> For more options, visit https://groups.google.com/d/optout.



--
You received this message because you are subscribed to the Google Groups "manipulatr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to manipulatr+unsubscribe@googlegroups.com.

Tom Hopper

unread,
Nov 13, 2017, 2:34:07 PM11/13/17
to manipulatr
Very nice, Brandon; thank you!
signature.asc

Romain Francois

unread,
Nov 14, 2017, 3:39:09 AM11/14/17
to manipulatr
I guess you are looking for a generalized version of Jenny's option, i.e. using select semantics to qualify the names of the variables you want to omit, something like: 

df <- data_frame(a = c(1:3,1), b = c(1:3,1), c = letters[1:4])

my_distinct4 <- function(df, ...) {
  names_df <- names(df)
  sel <- setdiff( 
    names_df, 
    select_vars( names_df, !!!quos(...) )
  )
  
  df %>%
    dplyr::distinct(!!!sel, .keep_all = TRUE)
}

my_distinct4(df, c)

This way you can use anything select would handle, e.g. 


my_distinct4(df, starts_with("c"))

> To post to this group, send email to manip...@googlegroups.com.
> Visit this group at https://groups.google.com/group/manipulatr.
> For more options, visit https://groups.google.com/d/optout.




--
You received this message because you are subscribed to the Google Groups "manipulatr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to manipulatr+...@googlegroups.com.

Brandon Hurr

unread,
Nov 14, 2017, 11:29:25 AM11/14/17
to Romain Francois, manipulatr
Romain, 

Thanks for following up with this. Really helps me understand what's going on a bit better. I had to fix the distinct call to change them to symbols as Hadley mentioned, and your function works exactly as you say with that change. 

Awesome. 

library(dplyr)
library(rlang)

df <- data_frame(a = c(1:3,1), b = c(1:3,1), c = letters[1:4])

my_distinct5 <- function(df, ...) {
  names_df <- names(df)
  sel <- setdiff( 
    names_df, 
    select_vars( names_df, !!!quos(...) )
  )
  print(sel)
  df %>%
    dplyr::distinct(!!!syms(sel), .keep_all = TRUE)
}

# naked name of column
my_distinct5(df, c)

#using a select function
my_distinct5(df, starts_with("c"))

# using multiple naked column names in case
#you need something that removes more than one column
my_distinct5(df, b, c)


B

To unsubscribe from this group and stop receiving emails from it, send an email to manipulatr+unsubscribe@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages