Locating duplicate observations using dplyr

Bob

Sep 20, 2014, 9:01:48 AM
to manip...@googlegroups.com
Hi All,

A simple class example on finding duplicate observations accidentally turned into an example of dplyr's impact on row names. I had just duplicated the first five rows of a data frame and told the class to look at the ID variable to verify this.

> mySubset <- Talent[1:5, 1:6]
> double5 <- rbind(mySubset, mySubset)
> double5
   ID School_Size Region Age Gender Career
1   1           3      1  17   Male     22
2   2           2      1  18   Male     34
3   3           3      1  17   Male     19
4   4           1      1  18   Male     36
5   5           3      1  17   Male      4
6   1           3      1  17   Male     22
7   2           2      1  18   Male     34
8   3           3      1  17   Male     19
9   4           1      1  18   Male     36
10  5           3      1  17   Male      4

Then I located the duplicates using subscripts and pointed out that the row names show it found the last five rows:

> double5[ duplicated(double5) , ]
   ID School_Size Region Age Gender Career
6   1           3      1  17   Male     22
7   2           2      1  18   Male     34
8   3           3      1  17   Male     19
9   4           1      1  18   Male     36
10  5           3      1  17   Male      4
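
For anyone following along without the Talent data, here is a minimal sketch of why the row names point at the last five rows. The two-column data frame is a made-up stand-in, not the original data; duplicated() and its fromLast argument are base R.

```r
# Hypothetical stand-in for Talent[1:5, ]; only two columns, values retyped from the post.
mySubset <- data.frame(ID = 1:5, Career = c(22, 34, 19, 36, 4))
double5 <- rbind(mySubset, mySubset)

# duplicated() scans top to bottom, so it flags the second copy of each row.
later <- which(duplicated(double5))

# Scanning from the bottom flags the first copies instead.
earlier <- which(duplicated(double5, fromLast = TRUE))

# Base subsetting keeps the row names, so they still show the original positions.
kept_names <- rownames(double5[duplicated(double5), ])

print(later)       # 6 7 8 9 10
print(earlier)     # 1 2 3 4 5
print(kept_names)  # "6" "7" "8" "9" "10"
```

So the row names are only a reliable position marker as long as every step in between preserves them.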

Finally, I said that you could do the selection using dplyr too:

> filter(double5, duplicated(double5))
  ID School_Size Region Age Gender Career
1  1           3      1  17   Male     22
2  2           2      1  18   Male     34
3  3           3      1  17   Male     19
4  4           1      1  18   Male     36
5  5           3      1  17   Male      4

For a second I thought dplyr had somehow selected the first five observations instead of the last, and then realized it was just dplyr's disregard for the sanctity of row names. :-)  The result was a handy example that taught two lessons instead of the one I expected.
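
One way to keep track of the original positions when going through dplyr is to store them in an ordinary column first. A rough sketch, assuming dplyr is installed; the data frame below retypes the values printed above rather than using Talent itself, and orig_row is a name made up for illustration.

```r
library(dplyr)

# Stand-in for Talent[1:5, 1:6], retyped from the output shown in the post.
mySubset <- data.frame(
  ID          = 1:5,
  School_Size = c(3, 2, 3, 1, 3),
  Region      = 1,
  Age         = c(17, 18, 17, 18, 17),
  Gender      = "Male",
  Career      = c(22, 34, 19, 36, 4)
)
double5 <- rbind(mySubset, mySubset)

# Flag duplicates before adding the position column; otherwise the new
# column would make every row unique and nothing would be flagged.
is_dup <- duplicated(double5)

kept <- double5 %>%
  mutate(orig_row = row_number()) %>%  # hypothetical column holding the original position
  filter(is_dup)

print(kept$orig_row)  # 6 7 8 9 10
```

Because the position lives in a regular column, it survives the filter no matter what happens to the row names.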

Cheers,
Bob

Hadley Wickham

Sep 22, 2014, 8:53:05 AM
to Bob, manipulatr
Yeah, dplyr always drops row names. I wonder if I should make
print.tbl_df() not display them to emphasise that they're not used by
dplyr.

Hadley

--
http://had.co.nz/