Decreasing the dimension of a word co-occurrence matrix

Emre

Apr 27, 2015, 7:56:21 AM

to s-spac...@googlegroups.com

Hi all,

I'm constructing my word co-occurrence matrix with the code below. Because my corpus is very large, the matrix is very big and hard to process. To solve this, I want to reduce the size of the matrix. For example, I want to ignore words that occur fewer than 20 times in my corpus (i.e., restrict my vocabulary to lemmas occurring at least 20 times). Can I do this with S-Space?

My code :

*******************

int windowSize = 2;
GenericWordSpace gws = new GenericWordSpace(windowSize);
BufferedReader in;
File folder = new File("D:\\sspace_example_files");
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles)
{
    in = new BufferedReader(new FileReader(file));
    gws.processDocument(in);
}

List<String> wordsInSpace = new ArrayList<String>(gws.getWords());
List<DoubleVector> vectors = new ArrayList<DoubleVector>(wordsInSpace.size());
for (String word : wordsInSpace)
    vectors.add(Vectors.asDouble(gws.getVector(word)));

Matrix cooccurrenceMatrix = Matrices.asMatrix(vectors);

*******************

Thanks in advance...

David Jurgens

Apr 28, 2015, 11:28:28 AM

to s-spac...@googlegroups.com

Hi Emre,

Yes, you can do this type of filtering, but it needs to be done prior to constructing the word space instance. The S-Space Package handles filtering in a somewhat strange way (due to legacy code): you must supply the words you would like included (or excluded) in a file, rather than specifying some criterion or passing a list.

However, the current code does support only retaining vectors for specific words, which might solve your problem. In your code below, you could do something like this:

import java.io.File;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import edu.ucla.sspace.util.Counter;
import edu.ucla.sspace.util.LineReader;
import edu.ucla.sspace.util.ObjectCounter;

// First pass over the corpus: count how often each token occurs.
Counter<String> wordCounts = new ObjectCounter<String>();
File folder = new File("D:\\sspace_example_files");
for (File file : folder.listFiles()) {
    for (String line : new LineReader(file)) {
        // If you care about punctuation, you can also strip it off the tokens here.
        for (String token : line.split("\\s+"))
            wordCounts.count(token);
    }
}

// Keep only the words that occur at least MIN_FREQUENCY times.
int MIN_FREQUENCY = 20;
Set<String> wordsToInclude = new HashSet<String>();
for (Map.Entry<String,Integer> wordAndCount : wordCounts) {
    if (wordAndCount.getValue() >= MIN_FREQUENCY)
        wordsToInclude.add(wordAndCount.getKey());
}

int windowSize = 2;
GenericWordSpace gws = new GenericWordSpace(windowSize);

// Using this method, the GenericWordSpace will use all words
// as contextual features but only keep vectors for the words
// in the set, which will make the memory usage much lower.
gws.setSemanticFilter(wordsToInclude);

// YOUR CODE GOES HERE ...

If you really want to exclude those rare words as features, you could add a final step after processing all the documents, when saving the matrix, that excludes the columns mapped to words that are not in the wordsToInclude set.
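That final column-filtering step could look roughly like the sketch below. It is plain Java and does not use any S-Space calls: it assumes you have already copied the matrix into a double[][] and that you have a list giving the feature word for each column. How you obtain that column-to-word mapping from the word space is not shown, and the names ColumnFilter, keepColumns, and columnWords are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ColumnFilter {

    // Returns a copy of `matrix` containing only the columns whose feature
    // word appears in `wordsToInclude`. ASSUMPTION: columnWords.get(j) is the
    // word that column j represents (a hypothetical mapping you must supply),
    // and every row has exactly columnWords.size() entries.
    static double[][] keepColumns(double[][] matrix,
                                  List<String> columnWords,
                                  Set<String> wordsToInclude) {
        // Collect the indices of the columns to keep, in their original order.
        List<Integer> kept = new ArrayList<Integer>();
        for (int j = 0; j < columnWords.size(); j++) {
            if (wordsToInclude.contains(columnWords.get(j)))
                kept.add(j);
        }
        // Copy the kept columns of every row into a narrower matrix.
        double[][] result = new double[matrix.length][kept.size()];
        for (int i = 0; i < matrix.length; i++) {
            for (int k = 0; k < kept.size(); k++)
                result[i][k] = matrix[i][kept.get(k)];
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy data: three feature columns, of which "rare" is filtered out.
        double[][] matrix = { {1, 2, 3}, {4, 5, 6} };
        List<String> columnWords = Arrays.asList("the", "rare", "cat");
        Set<String> wordsToInclude =
            new HashSet<String>(Arrays.asList("the", "cat"));
        double[][] filtered = keepColumns(matrix, columnWords, wordsToInclude);
        System.out.println(Arrays.deepToString(filtered)); // [[1.0, 3.0], [4.0, 6.0]]
    }
}
```

A dense double[][] is used here only to keep the example short. For a large corpus you would probably want to apply the same index selection to a sparse representation instead.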

I hope this answers your question but if not, please let me know.

Thanks,

David
