Extract a part of a string between two characters

Sharon Dabach

Mar 31, 2017, 4:15:25 PM

to Davis R Users' Group

Hi,

Sorry for the simple question but I just can't wrap my head around this regexp thing.

I have a string:

"H3D2_clay no geo, obs 1/ObsNod.out"

I need to get from it only the "clay no geo" part.

I know the "H3D2_" part will always be present and also the comma.

I tried using "_(.*)," but it includes both the underscore and comma which I don't need.

Also "grep" gave me the whole string, not just the part of it I need.

Thanks for any help

Vince S. Buffalo

Mar 31, 2017, 4:19:43 PM

to davi...@googlegroups.com

Is this what you're looking for?

> gsub('H3D2_([^,]+).*', '\\1', "H3D2_clay no geo, obs 1/ObsNod.out")

[1] "clay no geo"

Maybe wrap it in a function to make the code a bit cleaner.
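A minimal sketch of such a wrapper, assuming the prefix is plain text with no regex metacharacters. The name extract_label and its prefix argument are hypothetical, made up for illustration:

```r
# Hypothetical helper: returns the text between a fixed prefix and the first comma.
# Assumes `prefix` contains no regex metacharacters.
extract_label <- function(x, prefix = "H3D2_") {
  pattern <- paste0(prefix, "([^,]+).*")
  # sub() is enough here: the trailing .* consumes the rest of the string,
  # so there can be at most one match per element.
  sub(pattern, "\\1", x)
}

extract_label("H3D2_clay no geo, obs 1/ObsNod.out")
# [1] "clay no geo"
```

Because sub() is vectorized, the same call should also work on a whole character vector of such strings.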

HTH,

Vince

--
Check out our R resources at http://d-rug.github.io/
---
You received this message because you are subscribed to the Google Groups "Davis R Users' Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to davis-rug+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/davis-rug.
For more options, visit https://groups.google.com/d/optout.

--

Vince Buffalo

@vsbuffalo :: vincebuffalo.com

Coop Lab :: Population Biology Graduate Group
University of California, Davis

Sharon Dabach

Mar 31, 2017, 4:28:57 PM

to Davis R Users' Group

Yes, thanks!!!!

Can you explain it a bit?

Why the parentheses ()?

and the .* at the end?

and the \\1?

Vince S. Buffalo

Mar 31, 2017, 4:45:40 PM

to davi...@googlegroups.com

Sure —

gsub('H3D2_([^,]+).*', '\\1', "H3D2_clay no geo, obs 1/ObsNod.out")

So the regular expression part is the fixed string "H3D2_" plus a capture group (everything in the parentheses), plus .* to match everything after (and including) the comma. gsub() captures the stuff in the parentheses and this is the first capture group, which we refer to as "\\1" in the replacement.

The important part is the regex pattern ([^,]+), which means capture one or more characters up to (but not including) the first comma. Inside brackets, a leading ^ means negate (so any character that's not a comma). The + is still greedy, but because the class can't match a comma it can never run past one. This is a more conservative (and generally good) way to construct regular expressions than a bare .*.
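To see why ([^,]+) is the safer choice, here is a small comparison against a greedy (.*) and a lazy (.*?) capture. The input string is hypothetical (the original with an extra comma added), and the variable names are just for illustration:

```r
# Hypothetical input with two commas, to expose the difference.
x <- "H3D2_clay, no geo, obs 1/ObsNod.out"

# Greedy .* runs to the LAST comma it can still match before.
greedy <- sub("H3D2_(.*),.*", "\\1", x)

# The negated class cannot cross a comma, so it stops before the first one.
negated <- sub("H3D2_([^,]+).*", "\\1", x)

# A lazy quantifier also stops at the first comma (perl = TRUE for PCRE syntax).
lazy <- sub("H3D2_(.*?),.*", "\\1", x, perl = TRUE)

print(greedy)   # [1] "clay, no geo"
print(negated)  # [1] "clay"
print(lazy)     # [1] "clay"
```

With only one comma in the string, as in the original question, all three patterns give the same answer; the difference only shows up once a second comma appears.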

Capture groups are super useful. You can use them to capture multiple groups too. In genomics, we often need to extract out chromosome/start/end position info formatted as "chrom:start-end". As an example, this could be done using gsub/strsplit with:

> gsub('(chr\\w+):(\\d+)-(\\d+)', '\\1;;;\\2;;;\\3', 'chr13:123-12313')

[1] "chr13;;;123;;;12313"

then:

> strsplit(gsub('(chr\\w+):(\\d+)-(\\d+)', '\\1;;;\\2;;;\\3', 'chr13:123-12313'), ';;;')

[[1]]

[1] "chr13" "123" "12313"

which can then be coerced into different forms. Or, for the tidy way:

> library(tidyverse)

> tibble(pos='chr13:123-12313') %>% extract(pos, into=c('chrom', 'start', 'end'), '(chr\\w+):(\\d+)-(\\d+)', convert=TRUE)

# A tibble: 1 × 3
  chrom start   end
* <chr> <int> <int>
1 chr13   123 12313
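As a possible base R alternative that skips the ";;;" placeholder step, regexec() plus regmatches() returns the capture groups directly. This is only a sketch; the variable names (pos, m, region) are made up for illustration:

```r
pos <- "chr13:123-12313"

# regexec() records the positions of the full match and of each capture group;
# regmatches() then pulls the corresponding substrings out.
m <- regmatches(pos, regexec("(chr\\w+):(\\d+)-(\\d+)", pos))[[1]]

# m[1] is the full match; m[2], m[3], m[4] are the three capture groups.
region <- data.frame(
  chrom = m[2],
  start = as.integer(m[3]),
  end   = as.integer(m[4]),
  stringsAsFactors = FALSE
)
print(m)
# [1] "chr13:123-12313" "chr13"           "123"             "12313"
```

If the pattern does not match, the element returned by regmatches() is an empty character vector, so real code would want to check length(m) before indexing.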

HTH,

Vince

Sharon Dabach

Mar 31, 2017, 5:02:31 PM

to Davis R Users' Group

So H3D2_ tells it that the capture starts after it and the [^,] tells it to capture until the comma?

So why did you need .* ?

Michael Hannon

Mar 31, 2017, 7:41:25 PM

to davi...@googlegroups.com

Try it both ways:

> gsub('H3D2_([^,]+).*', '\\1', "H3D2_clay no geo, obs 1/ObsNod.out")
[1] "clay no geo"

> gsub('H3D2_([^,]+)', '\\1', "H3D2_clay no geo, obs 1/ObsNod.out")
[1] "clay no geo, obs 1/ObsNod.out"
>

gsub only replaces the part of the string that the pattern matches and
returns the rest unchanged, so you need to be sure your pattern matches
the whole string, from start to finish. The ".*" says to "consume" all
the characters from the comma onward.
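The same thing happens at the front of the string. As a sketch, assuming a hypothetical input that has extra text before "H3D2_", anchoring the pattern with ^ and $ makes it cover the whole string:

```r
# Hypothetical input with leading text before the fixed prefix.
x <- "run7 H3D2_clay no geo, obs 1/ObsNod.out"

# Unanchored: the text before the match is left in place.
partial <- gsub("H3D2_([^,]+).*", "\\1", x)

# Anchored: ^.* and .*$ consume everything around the capture group.
whole <- sub("^.*H3D2_([^,]+).*$", "\\1", x)

# No match at all: the input comes back unchanged.
untouched <- sub("^.*H3D2_([^,]+).*$", "\\1", "no prefix here")

print(partial)    # [1] "run7 clay no geo"
print(whole)      # [1] "clay no geo"
print(untouched)  # [1] "no prefix here"
```

The last case is worth keeping in mind: a string that does not match is returned as-is rather than as NA or "", so a failed extraction can go unnoticed.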

-- Mike
