Decreasing the dimension of word co-occurrence matrix.

Emre

Apr 27, 2015, 7:56:21 AM
to s-spac...@googlegroups.com
Hi all,

I'm constructing my word co-occurrence matrix with the code below. Because my corpus is very large, the matrix is very big and hard to process. To solve this, I want to reduce the size of the matrix. For example, I want to ignore words that occur fewer than 20 times in my corpus (i.e., restrict my vocabulary to lemmas occurring at least 20 times). Can I do this with S-Space?

My code :

*******************
int windowSize = 2;
GenericWordSpace gws = new GenericWordSpace(windowSize);

BufferedReader in;

File folder = new File("D:\\sspace_example_files");
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles)
{
    in = new BufferedReader(new FileReader(file));
    gws.processDocument(in);
}

List<String> wordsInSpace = new ArrayList<String>(gws.getWords());

List<DoubleVector> vectors = new ArrayList<DoubleVector>(wordsInSpace.size());
for (String word : wordsInSpace)
    vectors.add(Vectors.asDouble(gws.getVector(word)));

Matrix cooccurrenceMatrix = Matrices.asMatrix(vectors);
*******************

Thanks in advance...

David Jurgens

Apr 28, 2015, 11:28:28 AM
to s-spac...@googlegroups.com
Hi Emre,

  Yes, you can do this type of filtering, but it needs to be done prior to constructing the word space instance.  The S-Space Package handles filtering in a somewhat strange way (due to legacy code): you must supply the words you would like included (or excluded) in a file, rather than specifying some criterion or passing a list.

  However, the current code does support retaining vectors only for specific words, which might solve your problem.  Before the code you posted, you could do something like this:

import java.io.File;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import edu.ucla.sspace.gws.GenericWordSpace;
import edu.ucla.sspace.util.Counter;
import edu.ucla.sspace.util.LineReader;
import edu.ucla.sspace.util.ObjectCounter;

// First pass over the corpus: count how often each token occurs.
Counter<String> wordCounts = new ObjectCounter<String>();

File folder = new File("D:\\sspace_example_files");
for (File file : folder.listFiles()) {
    for (String line : new LineReader(file)) {
        // If you care about punctuation, you can also strip it off the tokens here.
        for (String token : line.split("\\s+")) {
            // Blank lines (and leading whitespace) produce an empty token; skip it.
            if (!token.isEmpty())
                wordCounts.count(token);
        }
    }
}

int MIN_FREQUENCY = 20;

// Keep only the words that occur at least MIN_FREQUENCY times.
Set<String> wordsToInclude = new HashSet<String>();
for (Map.Entry<String,Integer> wordAndCount : wordCounts) {
    if (wordAndCount.getValue() >= MIN_FREQUENCY)
        wordsToInclude.add(wordAndCount.getKey());
}

int windowSize = 2;
GenericWordSpace gws = new GenericWordSpace(windowSize);
// Using this method, the GenericWordSpace will use all words
// as contextual features but only keep vectors for the words
// in the set, which will make the memory usage much lower.
gws.setSemanticFilter(wordsToInclude);

// YOUR CODE GOES HERE ...


If you really want to exclude those rare words as features too, you could add a final step after processing all the documents, when saving the matrix, that excludes the columns mapped to words that are not in the wordsToInclude set.
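
Here is a minimal sketch of that final column-filtering step. It assumes you have already copied the matrix into a plain double[][] and that you have a list giving the word for each column index. How to get that column-to-word mapping out of S-Space is not shown here, and ColumnFilter, columnWords, and both method names are made up for illustration; they are not part of the S-Space API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of S-Space: drops the columns of a dense
// matrix whose feature word is not in the set of words to keep.
final class ColumnFilter {
    private ColumnFilter() {}

    // Returns the indices of the columns whose word is in wordsToInclude,
    // in their original order. columnWords.get(c) is assumed to be the
    // word that column c represents.
    static int[] columnsToKeep(List<String> columnWords, Set<String> wordsToInclude) {
        List<Integer> kept = new ArrayList<>();
        for (int c = 0; c < columnWords.size(); c++) {
            if (wordsToInclude.contains(columnWords.get(c)))
                kept.add(c);
        }
        int[] result = new int[kept.size()];
        for (int i = 0; i < result.length; i++)
            result[i] = kept.get(i);
        return result;
    }

    // Copies only the selected columns of each row into a new, narrower matrix.
    static double[][] keepColumns(double[][] matrix, int[] columns) {
        double[][] filtered = new double[matrix.length][columns.length];
        for (int r = 0; r < matrix.length; r++) {
            for (int i = 0; i < columns.length; i++)
                filtered[r][i] = matrix[r][columns[i]];
        }
        return filtered;
    }
}
```

You would call columnsToKeep once with the wordsToInclude set from above, then pass the result to keepColumns. For a very large corpus a dense double[][] may not fit in memory, so the same index-selection idea would need to be applied to whatever sparse representation you are using.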

I hope this answers your question but if not, please let me know.

  Thanks,
  David


