Using separate() in tidyr on multiple columns?

John Mola

Mar 1, 2017, 8:08:39 PM

to davi...@googlegroups.com

Hi all,

Not a very good drug user, so please be gentle with me.

Here is some dummy data:

Ind = c("SNP1","SNP1","SNP2","SNP2","SNP3")

SNP1 = c("AA","TT","AT","TT","TT")

SNP2 = c("GC","GG","GG","CC","GC")

SNP3 = c("GG","GC","GG","CC","GC")

df = data.frame(Ind,SNP1,SNP2,SNP3)

df_filt= distinct(df, Ind, .keep_all = TRUE)

df_filt

> df_filt

   Ind SNP1 SNP2 SNP3

1 SNP1   AA   GC   GG

2 SNP2   AT   GG   GG

3 SNP3   TT   GC   GC

(The dummy data looks a bit odd because there's an intermediate step in there, where I kept only the unique "SNP" rows.)

So now, I'd like to split the AA's, GC's, etc into separate columns. I manage to do this with separate() in one column:

> separate(data = df_filt, col = SNP1, into = c("SNP1.1","SNP1.2"), sep=c(1))

   Ind SNP1.1 SNP1.2 SNP2 SNP3

1 SNP1      A      A   GC   GG

2 SNP2      A      T   GG   GG

3 SNP3      T      T   GC   GC

But I can't figure out a way to automate this across many (eventually thousands of) columns at once. separate() does not accept a vector of column names, for instance.

Any thoughts?

Thanks!

John

--

John M. Mola
jmm...@ucdavis.edu
john...@gmail.com

johnmola.weebly.com

John Mola

Mar 1, 2017, 8:12:08 PM

to davi...@googlegroups.com

Oh yeah. Apologies for the column/row names matching. I was messing around with my data.

Here:

> Ind = c("Bee1","Bee1","Bee2","Bee2","Bee3")

> SNP1 = c("AA","TT","AT","TT","TT")

> SNP2 = c("GC","GG","GG","CC","GC")

> SNP3 = c("GG","GC","GG","CC","GC")

> df = data.frame(Ind,SNP1,SNP2,SNP3)

> df_filt= distinct(df, Ind, .keep_all = TRUE)

> df_filt

   Ind SNP1 SNP2 SNP3

1 Bee1   AA   GC   GG

2 Bee2   AT   GG   GG

3 Bee3   TT   GC   GC

>

> separate(data = df_filt, col = SNP1, into = c("SNP1.1","SNP1.2"), sep=c(1))

   Ind SNP1.1 SNP1.2 SNP2 SNP3

1 Bee1      A      A   GC   GG

2 Bee2      A      T   GG   GG

3 Bee3      T      T   GC   GC

--
Check out our R resources at http://d-rug.github.io/
---
You received this message because you are subscribed to the Google Groups "Davis R Users' Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to davis-rug+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/davis-rug.
For more options, visit https://groups.google.com/d/optout.

Vince S. Buffalo

Mar 1, 2017, 8:22:52 PM

to davi...@googlegroups.com

So a few things:

First I'd use a tibble:

df <- tibble(Ind,SNP1,SNP2,SNP3)

df_filt <- distinct(df, Ind, .keep_all = TRUE)

and since you don't have a separator character, you can't split on a delimiter with separate(); you need to use extract() with a grouping regular expression. This is a bit permissive, but works:

df_filt %>% gather(snp, value, -Ind) %>% extract(value, into=c('h1', 'h2'), '(.)(.)')

# A tibble: 9 × 4

Ind snp h1 h2

* <chr> <chr> <chr> <chr>

1 Bee1 SNP1 A A

2 Bee2 SNP1 A T

3 Bee3 SNP1 T T

4 Bee1 SNP2 G C

5 Bee2 SNP2 G G

6 Bee3 SNP2 G C

7 Bee1 SNP3 G G

8 Bee2 SNP3 G G

9 Bee3 SNP3 G C
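
If the permissive pattern is a concern, one option is to restrict each group to the four bases. This is only a sketch: the [ACGT] character class, the anchors, and the made-up "NN" genotype are assumptions for illustration, not something from the real data. extract() fills in NA for values that do not match the pattern.

```r
library(tidyr)

# Hypothetical genotype calls; "NN" stands in for a value that '(.)(.)' would accept
calls <- data.frame(value = c("AT", "GC", "NN"), stringsAsFactors = FALSE)

# Each capture group must be exactly one of A, C, G, T
strict <- extract(calls, value, into = c("h1", "h2"),
                  regex = "^([ACGT])([ACGT])$")

# Rows that do not match the pattern come back as NA in both new columns
strict
```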

Note how I use gather here to gather all SNP columns. It's easier to apply this operation to long data and then recast to wide data using spread().
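
For what it's worth, here is a rough sketch of the full round trip back to wide data with one column per allele. The intermediate names (allele, base, snp_allele) are made up for illustration, and gathering a second time before spread() is just one way to do it.

```r
library(dplyr)
library(tidyr)

Ind  <- c("Bee1", "Bee1", "Bee2", "Bee2", "Bee3")
SNP1 <- c("AA", "TT", "AT", "TT", "TT")
SNP2 <- c("GC", "GG", "GG", "CC", "GC")
SNP3 <- c("GG", "GC", "GG", "CC", "GC")

df_filt <- distinct(tibble(Ind, SNP1, SNP2, SNP3), Ind, .keep_all = TRUE)

# Long format: one row per individual and SNP, split into two alleles
long <- df_filt %>%
  gather(snp, value, -Ind) %>%
  extract(value, into = c("h1", "h2"), regex = "(.)(.)")

# Back to wide: stack the two allele columns, build names like "SNP1.h1",
# then spread those names out into columns again
wide <- long %>%
  gather(allele, base, h1, h2) %>%
  unite(snp_allele, snp, allele, sep = ".") %>%
  spread(snp_allele, base)

wide
```

Because nothing in the pipeline names a SNP column explicitly, the same code should carry over to a data frame with many more SNP columns.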

HTH,

Vince

Vince Buffalo

@vsbuffalo :: vincebuffalo.com

Coop Lab :: Population Biology Graduate Group
University of California, Davis

Jaime Ashander

Mar 1, 2017, 8:55:52 PM

to davi...@googlegroups.com

As mentioned, the key thing is using gather() to make the data long. You could still use
separate() in place of Vince's extract(); you'd just need to pass it the value column
name you gave to gather() (value in this case), so this would work too:

df_filt %>% gather(snp, value, -Ind) %>% separate(col = value, into = c("SNP1.1","SNP1.2"), sep=1)

(when numeric, sep is interpreted as a position in the string)
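
As a quick check, here is a small sketch comparing the positional separate() with the regex extract() on the same long data. The filtered data is typed in directly, and the column names h1 and h2 are reused from Vince's example since the split applies to every SNP, not only SNP1.

```r
library(dplyr)
library(tidyr)

# The filtered data from the thread, entered directly
df_filt <- tibble(
  Ind  = c("Bee1", "Bee2", "Bee3"),
  SNP1 = c("AA", "AT", "TT"),
  SNP2 = c("GC", "GG", "GC"),
  SNP3 = c("GG", "GG", "GC")
)

long <- gather(df_filt, snp, value, -Ind)

# sep = 1 splits after the first character of each string
by_position <- separate(long, col = value, into = c("h1", "h2"), sep = 1)

# Same split expressed as two single-character capture groups
by_regex <- extract(long, value, into = c("h1", "h2"), regex = "(.)(.)")

# Compare the two results column by column
all.equal(as.data.frame(by_position), as.data.frame(by_regex))
```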

John Mola

Mar 1, 2017, 9:39:58 PM

to davi...@googlegroups.com

By god. It works. Thank you very much y'all!

Cheers,

John
