Using regex to match quotation parts across tab-spaced text

Christoph Ruehlemann

May 23, 2018, 1:14:23 PM

to corplin...@googlegroups.com

Hi,

I have story transcripts such as this one with quote marks around coherent pieces of direct speech:

> story

                                                                   V1
1 Mim:\ty’ know how teachers go “E::wawawawawu” (’n tha’ sor’ o’ th’)
2                           \tso I say, I sa’, when you meet anybody
3                            \tand they say to him “how's teaching?”
4                                                    \t“Re::wawawawu”
5                                                \tI say “J’s STOp it
6                                  Ter:\tHheh heh heh [heh    he he]
7                 Mim:\t\t     [you’re really] actually enjoying it”
8                                        \t[(so I say “don't do it”)]

Using regex, I want to extract all instances of direct speech enclosed between a left quote mark “ and a right quote mark ”.

This regex does extract all direct speech that's within a single line:

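pattern <- "“[^”]*”"  # left quote, any run of characters other than a right quote, right quote
matches <- gregexpr(pattern, story$V1)  # match positions, one list element per transcript line
quotes <- regmatches(story$V1, matches)  # extract the matched substrings
quotes <- unlist(quotes)  # flatten into a single character vector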

quotes
[1] "“E::wawawawawu”" "“how's teaching?”" "“Re::wawawawu”" "“don't do it”"

It fails, however, to match the direct speech that spans lines 5-7 (“J’s STOp it ... [you’re really] actually enjoying it”).

I've tried to include \t as an optional element, thus:

pattern <- "“((\\t{1,})?)[^”]*”"

But that returns the same matches as the regex above.

Can anybody help?

Thanks in advance!

Best

Chris

--

https://www.uni-marburg.de/fb10/iaa/institut/personal/ruehlemann

ἰχθύς

Stefan Th. Gries

May 23, 2018, 1:42:42 PM

to CorpLing with R

If you read the story in line by line (sep="\n"), the regex will of
course not go across vector elements to find the beginning double
quote in line 5 (“J’s STOp it), everything in line 6, and then the end
of the direct speech part in line 7 (Mim:\t\t [you’re really]
actually enjoying it”). In other words, the tab has nothing to do with
it, it's that regexes won't search across vector elements. One way to
address this is to paste together the story into one string.
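
A minimal sketch of that suggestion, assuming a UTF-8 session. The small data frame below is a hypothetical stand-in for the real story$V1 (only four of the eight transcript lines, with the padding removed), and the name whole is made up for illustration. The idea is to collapse the vector with paste() before matching, so that the same pattern can run across what used to be separate elements:

```r
# Hypothetical stand-in for story$V1: one element per transcript line
story <- data.frame(V1 = c(
  "Mim:\tty’ know how teachers go “E::wawawawawu”",
  "\tI say “J’s STOp it",
  "Ter:\tHheh heh heh [heh    he he]",
  "Mim:\t\t[you’re really] actually enjoying it”"
), stringsAsFactors = FALSE)

# Collapse all lines into ONE string; the separator (a space here) is a free choice
whole <- paste(story$V1, collapse = " ")

# Same pattern as in the original post, now applied to the single string
pattern <- "“[^”]*”"
quotes <- unlist(regmatches(whole, gregexpr(pattern, whole)))
quotes
```

Note that the multi-line match then contains everything between the two quote marks, including the intervening turn by Ter and the speaker labels, so some further cleaning would probably be needed depending on what counts as direct speech.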

Bob Green

May 23, 2018, 5:24:38 PM

to corplin...@googlegroups.com

Hello,

Does anyone know of sources of autobiographical texts by famous or
well-known people in a format suitable for text analysis?

Any assistance is appreciated,

Bob

Christoph Ruehlemann

May 24, 2018, 3:42:55 AM

to corplin...@googlegroups.com

Thanks, worked!
