Ngram splitting.rule?


Johan de Joode

Jan 10, 2018, 5:26:44 AM
to computationalstylistics
Dear all

First of all, thanks for a great piece of software: well-documented, easy to use, and relevant. It does not get any better than that, I think.

I have a practical question. Is it possible to respect segment boundaries when using the stylo() function? For instance, is it possible to make sure that sentence boundaries are not crossed when extracting n-grams?

This is relevant for us, as we are working with a fragmented corpus.

I see that there is an option splitting.rule in load.corpus.and.parse. Would that be helpful in this case? If so, does stylo() pass its options on to load.corpus.and.parse()?

Thanks
j


Maciej Eder

Jan 16, 2018, 10:21:58 AM
to computationalstylistics
Hi Johan,

your question turned out to be really tricky. Frankly, it took a long while to find a solution. In short, 'stylo' does not respect sentence boundaries by default. You can blame me for that, but truth be told, I did not think of it when the first versions of the package were released. Now, it would involve quite a lot of coding to implement the feature.

Below, I provide a tailored script to overcome the problem. It invokes a few low-level functions from the package 'stylo', as well as some primitive R functions, such as lapply(), sapply(), unlist(), and so on. The script is fairly slow -- I didn't even try to optimize the CPU performance, assuming that this is but a provisional solution. Have fun!

All the best,
Maciej




library(stylo)

# first, we load the texts from the specified subdirectory
# (type help(load.corpus) to get the applicable options)
raw_texts = load.corpus(files = "all", corpus.dir = "corpus")

# we start an empty list
tokenized_texts = list()

# next, we iterate over the loaded texts 
for(i in 1:length(raw_texts)) {

    # a tricky way of iterating over the elements of a given text (i.e. its lines),
    # getting tokenized words for each element
    # type help(parse.corpus) for further details
    current_words = sapply(raw_texts[[i]], parse.corpus, language = "English.all")

    # sanitizing the input, so that each line (paragraph, sentence) can be split
    # into n-grams; in other words, making sure that each line 
    # has at least N words (>= ngram.size)  
    current_words = current_words[sapply(current_words, length) >= 3]

    # finally, splitting the lines into n-grams
    current_features = unlist(lapply(current_words, make.ngrams, ngram.size = 3))
    
    # ... and getting rid of insanely long names of the elements
    names(current_features) = NULL

    # aggregating the results 
    tokenized_texts[[i]] = current_features
}

# inheriting the names of the texts from the original corpus
names(tokenized_texts) = names(raw_texts)

# running `stylo` with a pre-defined corpus
stylo(parsed.corpus = tokenized_texts)









--
You received this message because you are subscribed to the Google Groups "computationalstylistics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to computationalstylistics+unsub...@googlegroups.com.
Visit this group at https://groups.google.com/group/computationalstylistics.
For more options, visit https://groups.google.com/d/optout.

Maciej Eder

Jan 16, 2018, 10:32:33 AM
to computationalstylistics
Hi again, 

I forgot to mention that your input texts should be slightly pre-processed. The procedure assumes that each input line contains one of the units whose boundaries you want to keep; these can be sentences, paragraphs, or whatever else. The procedure does not build any n-grams that would span two (or more) lines of your input dataset.

Best,
Maciej
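[To make that constraint concrete, here is a minimal Python sketch -- hypothetical code, not part of 'stylo', with an illustrative function name -- of word n-grams built line by line, so that no n-gram ever spans two lines:]

```python
def line_bounded_ngrams(lines, n=3):
    """Build word n-grams within each line only (hypothetical helper,
    not part of 'stylo'); no n-gram ever spans two lines."""
    features = []
    for line in lines:
        words = line.split()
        # a line with fewer than n words contributes no n-grams at all
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))
    return features

fragments = ["the first sentence goes here", "and a second one"]
print(line_bounded_ngrams(fragments, n=3))
```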



Johan de Joode

Jan 18, 2018, 3:28:13 PM
to computationalstylistics
Dear Maciej

I am ever so grateful for your response. Unfortunately, I was not able to run the script on my dataset (650k tokens). I adapted it to extract character n-grams, but after consuming 14 GB of memory, the interpreter hung.

I did write a quick Python alternative. Just to make sure I understand things correctly and my results are consistent, here is a quick follow-up question. What exactly is the input for parsed.corpus? Is it a list of character vectors where each item in a vector is an n-gram? How does make.ngrams use that vector?

Would the following method make sense? I want to run character 3-grams on our fragmented corpus. If I use Python or R to parse the texts into 3-grams that respect the fragmentation boundaries, and then write them out with one 3-gram per line, can I read these files back as character vectors and feed them to stylo() as parsed.corpus? If so, I guess it would only make sense to use word uni-grams or char 3-grams as settings in the GUI. Is that correct?

Sample:

# files in the corpus are parsed as follows (one trigram per line);
# this assumes the trigrams respect the desired boundaries
# (quotes added here to make the spaces visible):
"Thi"
"his"
"is "
"s i"
" is"
"is "
"s a"
" a "
"a t"
" te"
"tes"
"est"
"st."
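[For what it's worth, the sample above could be generated with a short Python sketch like the following -- hypothetical code, not the author's: it extracts character 3-grams within each fragment only and prints one per line:]

```python
def char_ngrams(text, n=3):
    # all character n-grams of a single fragment (sliding window)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# boundaries between fragments are never crossed:
# each fragment is n-grammed on its own
fragments = ["This is", "a test."]
ngrams = []
for frag in fragments:
    ngrams.extend(char_ngrams(frag, 3))

# one 3-gram per line, ready to be read back in R with readLines()
print("\n".join(ngrams))
```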


# first, we load the texts from the specified subdirectory
# (type help(load.corpus) to get the applicable options)
raw_texts = load.corpus(files = "all", corpus.dir = "corpus")

# we start an empty list
tokenized_texts = list()

# next, we iterate over the loaded texts, reading the pre-computed
# n-grams (one per line) for each file; aggregating the results
for(i in names(raw_texts)) {
    tokenized_texts[[i]] = readLines(paste('corpus/', i, sep = ""))
    # alternatively: readLines(paste('corpus_ngramified/', i, sep = ""))
}

# inheriting the names of the texts from the original corpus
names(tokenized_texts) = names(raw_texts)

# running `stylo` with the pre-defined corpus
x <- stylo(parsed.corpus = tokenized_texts)

Many thanks
Johan

Johan de Joode

Jan 23, 2018, 9:24:21 AM
to computationalstylistics
Dear Maciej

Could you just confirm that, with your solution, the GUI should be configured as: char-grams, length 3.

Is that right?

Many thanks
Johan

Maciej Eder

Feb 12, 2018, 10:40:34 AM
to computationalstylistics
Dear Johan,

it seems I haven't responded to your question! It is as simple as this:

the input format for the function stylo() is either text files (obviously), or a table of frequencies to be loaded via

library(stylo)

stylo(frequencies = table_with_frequencies)


or using a parsed corpus, via the parameter parsed.corpus:

stylo(parsed.corpus = the_corpus)


note that "the_corpus" is a list, the elements of which are the texts, and their elements are the features. In the case of character 3-grams, the list would contain the texts, each with its 3-grams as elements. E.g. the 5th sample might look as follows:

sample 5

"tib_carm_3"

   [1] "i m"
   [2] " ma"
   [3] "mar"
   [4] "art"
   [5] "rti"
   [6] "tis"
   [7] "is "
   [8] "s r"
   [9] " ro"
  [10] "rom"
   ... ...



Please find a tiny test variable attached. You load it into R using the following command:


load("test.Rdata")



then you can inspect it:

char_3grams



and finally use stylo() with it:

stylo(parsed.corpus = char_3grams)


and then, in the GUI, you use standard word 1-grams, because there is no need to split your features again!


I hope this helps.

All the best,
Maciej





test.Rdata