[R] help with regexp

Jannis

Oct 5, 2011, 7:56:52 AM

to r-h...@stat.math.ethz.ch

Dear list members,

I am stuck with using regular expressions.

Imagine I have a vector of character strings like:

test <- c('filename_1_def.pdf', 'filename_2_abc.pdf')

How could I use regular expressions to extract only the 'def'/'abc' parts of these strings?

My own attempt yielded no results:

testresults <- grep('(?<=filename_[[:digit:]]_).{1,3}(?=.pdf)', perl = TRUE, value = TRUE)

I seem to be missing some important concept here. Until now I have always used nested sub() calls like:

testresults <- sub('.pdf$', '', sub('^filename_[[:digit:]]_', '' , test))

but this tends to become cumbersome, and I was wondering whether there is a more elegant way to do this.

Thanks for any help

Jannis

______________________________________________
R-h...@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Albert-Jan Roskam

Oct 5, 2011, 8:37:14 AM

to Jannis, r-h...@stat.math.ethz.ch

Hello!

library(gsubfn)

test <- c('filename_1_def.pdf', 'filename_2_abc.pdf')

gsubfn("(.+_)([a-z]+)(\\.pdf)", "\\2", test)

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Eik Vettorazzi

Oct 5, 2011, 9:11:30 AM

to Jannis, r-h...@stat.math.ethz.ch

Hi Jannis,
just use backreferences in gsub; see the 'replacement' argument in ?gsub.

test <- c('filename_1_def.pdf', 'filename_2_abc.pdf')

gsub(".*_([A-Za-z]+)\\.pdf", "\\1", test)
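
One caveat with the backreference approach, sketched here under the assumption that some inputs might not follow the filename_N_xyz.pdf pattern: sub() and gsub() return non-matching strings unchanged, so it can help to check with grepl() first. The third string and the names `pattern` and `extracted` are made up for illustration.

```r
# 'notes.txt' is a hypothetical non-matching entry added for illustration
test <- c('filename_1_def.pdf', 'filename_2_abc.pdf', 'notes.txt')

# anchored version of the pattern, with a single capture group
pattern <- "^.*_([A-Za-z]+)\\.pdf$"

# sub() leaves strings that do not match the pattern untouched
sub(pattern, "\\1", test)

# return NA for non-matching strings instead of the original string
extracted <- ifelse(grepl(pattern, test), sub(pattern, "\\1", test), NA_character_)
extracted
```

Whether NA is the right placeholder depends on what happens to the result afterwards; dropping the non-matching entries with test[grepl(pattern, test)] would be another option.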

hth.

--
Eik Vettorazzi
Institut für Medizinische Biometrie und Epidemiologie
Universitätsklinikum Hamburg-Eppendorf

Martinistr. 52
20246 Hamburg

T ++49/40/7410-58243
F ++49/40/7410-57790


Gabor Grothendieck

Oct 5, 2011, 11:13:31 AM

to Jannis, r-h...@stat.math.ethz.ch

On Wed, Oct 5, 2011 at 7:56 AM, Jannis <bt_j...@yahoo.de> wrote:
> Dear list members,
>
> I am stuck with using regular expressions.
>
> Imagine I have a vector of character strings like:
>
> test <- c('filename_1_def.pdf', 'filename_2_abc.pdf')
>
> How could I use regular expressions to extract only the 'def'/'abc' parts of these strings?
>
> My own attempt yielded no results:
>
> testresults <- grep('(?<=filename_[[:digit:]]_).{1,3}(?=.pdf)', perl = TRUE, value = TRUE)
>
> I seem to be missing some important concept here. Until now I have always used nested sub() calls like:
>
> testresults <- sub('.pdf$', '', sub('^filename_[[:digit:]]_', '' , test))
>
> but this tends to become cumbersome, and I was wondering whether there is a more elegant way to do this.
>

Here are a couple of solutions:

# remove everything up to the last _ as well as everything from the . onwards
gsub(".*_|[.].*", "", test)

# extract everything that is not a _ provided it is immediately followed by .
library(gsubfn)
strapply(test, "([^_]+)[.]", simplify = TRUE)
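
As for the lookaround attempt in the question: grep() only reports which elements match (or, with value = TRUE, returns those elements whole), and the call shown also omits the input vector. A minimal base R sketch passes a similar pattern to regexpr() instead. It assumes an R version that provides regmatches(), which I believe arrived in R 2.14.0. The only change to the pattern is the escaped dot, and `m` is just an illustrative name.

```r
test <- c('filename_1_def.pdf', 'filename_2_abc.pdf')

# regexpr() locates the first match in each string; the lookbehind and
# lookahead are zero-width, so only the middle part counts as the match
m <- regexpr('(?<=filename_[[:digit:]]_).{1,3}(?=\\.pdf)', test, perl = TRUE)

# regmatches() extracts the matched substrings
testresults <- regmatches(test, m)
testresults
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input.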

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Jannis

Oct 6, 2011, 2:01:50 PM

to Gabor Grothendieck, E.Vett...@uke.de, fo...@yahoo.com, r-h...@stat.math.ethz.ch

Thanks to all who replied! With all these possible solutions it will be hard to find the best one :-).

--- Gabor Grothendieck <ggroth...@gmail.com> wrote on Wed, Oct 5, 2011:
