Can regex match underlining or italics?


Christoph Ruehlemann

Oct 25, 2019, 7:18:58 AM
to corplin...@googlegroups.com
Hi all,

a simple question: can underlined text or italicized text be matched in regex? For example, if you have this vector:

example <- c("hello", "that's it!", "what?", "what the hell")

Which regex would find element 2 (italicized), element 4 (underlined), or both?

Best
Chris

--
Albert-Ludwigs-Universität Freiburg
Projekt-Leiter DFG-Projekt "Analyse multimodaler Interaktion im Geschichtenerzählen"
ἰχθύς

Earl Brown

Oct 26, 2019, 10:11:52 PM
to CorpLing with R
I don't think R preserves bold, italics, or underlining once text is imported into an R session: a character vector holds only the characters, so there is nothing for a regex to match. However, here's a way to do what you want if the formatted text is in a .docx file (like the attached .docx file). First, manually change the .docx extension to .zip, and then unzip it (cf. thread here). When you unzip it, a new directory named "word" will be created, and within it there are several XML files, one of which is named "document.xml". The following R code should return the paragraphs with bold, italicized, or underlined text (after changing the path to "document.xml" to match its location on your hard drive):
library(xml2)
library(stringr)

doc <- read_html("/Users/ekb5/Downloads/word/document.xml")

paragraphs <- unlist(str_extract_all(as.character(doc), "<p.*?/p>"))

extract_formatted_text <- function(txt) {
  # Return paragraphs with bold, italicized, or underlined text
  if (str_detect(txt, "</[biu]>")) {
    output <- str_replace_all(txt, "<.*?>", "")
  } else {
    output <- NA
  }
  return(output)
}

# keep paragraphs with some bold, italicized, or underlined text
result <- unlist(lapply(paragraphs, function(x) extract_formatted_text(x)))
result <- result[!is.na(result)]
print(result)
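
As a possible alternative to matching tags with regex, here is a rough sketch that parses the file as XML and selects paragraphs with XPath. It assumes the usual WordprocessingML layout, where bold, italics, and underline appear as w:b, w:i, and w:u elements inside a run's w:rPr. The XML string below is a hand-written stand-in for document.xml (not the attached file), so that the snippet runs on its own:

```r
library(xml2)

# Hypothetical, hand-written stand-in for word/document.xml (an assumption,
# not the contents of the attached sample.docx)
xml_txt <- paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body>',
  '<w:p><w:r><w:t>hello</w:t></w:r></w:p>',
  "<w:p><w:r><w:rPr><w:i/></w:rPr><w:t>that's it!</w:t></w:r></w:p>",
  '<w:p><w:r><w:t>what?</w:t></w:r></w:p>',
  '<w:p><w:r><w:rPr><w:u w:val="single"/></w:rPr><w:t>what the hell</w:t></w:r></w:p>',
  '</w:body></w:document>'
)

doc <- read_xml(xml_txt)
ns <- xml_ns(doc)  # picks up the "w" prefix declared in the document

# Paragraphs with at least one run marked bold, italic, or underlined
xpath <- "//w:p[.//w:r/w:rPr/w:b or .//w:r/w:rPr/w:i or .//w:r/w:rPr/w:u]"
formatted <- xml_find_all(doc, xpath, ns)

# xml_text() concatenates the text of all runs in each matching paragraph
result <- xml_text(formatted)
print(result)
```

For a real file, unzip("sample.docx", files = "word/document.xml", exdir = tempdir()) should extract the XML without renaming anything, since a .docx is a zip archive; the extracted path can then go to read_xml(). Two caveats on the sketch: it does not check for elements that switch formatting off (for example w:val="0" on w:b, or w:val="none" on w:u), and it will not see formatting that comes from a named style rather than from the run itself.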
Cheers!

sample.docx