I'm building a word co-occurrence matrix with the code below. Because my corpus is very large, the resulting matrix is also very large and hard to process, so I want to reduce its size. For example, I want to ignore words that occur fewer than 20 times in my corpus (i.e., restrict my vocabulary to lemmas occurring at least 20 times). Can I do this with S-Space?
int windowSize = 2;
GenericWordSpace gws = new GenericWordSpace(windowSize);

File folder = new File("D:\\sspace_example_files");
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles)
{
    // try-with-resources closes each reader once its document is processed
    try (BufferedReader in = new BufferedReader(new FileReader(file)))
    {
        gws.processDocument(in);
    }
}

// One row per word in the space, in the order returned by getWords()
List<String> wordsInSpace = new ArrayList<String>(gws.getWords());
List<DoubleVector> vectors = new ArrayList<DoubleVector>(wordsInSpace.size());
for (String word : wordsInSpace)
{
    vectors.add(Vectors.asDouble(gws.getVector(word)));
}
Matrix cooccurrenceMatrix = Matrices.asMatrix(vectors);
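
To make the kind of filtering I mean concrete, here is a minimal plain-Java sketch of the frequency cutoff. It does not use any S-Space API: the class `VocabularyFilter` and the methods `countTokens` and `frequentWords` are hypothetical names of my own, and it assumes the documents are already lemmatized and whitespace-tokenized.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of S-Space: builds a vocabulary of
// words that meet a minimum corpus frequency.
public class VocabularyFilter {

    /** Counts whitespace-separated tokens across all documents. */
    static Map<String, Integer> countTokens(List<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (String token : doc.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns the words whose count is at least minCount, sorted alphabetically. */
    static Set<String> frequentWords(Map<String, Integer> counts, int minCount) {
        Set<String> keep = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minCount) {
                keep.add(entry.getKey());
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Toy corpus with a cutoff of 2; my real cutoff would be 20.
        List<String> docs = Arrays.asList("the cat sat", "the dog sat", "the bird flew");
        Set<String> vocabulary = frequentWords(countTokens(docs), 2);
        System.out.println(vocabulary); // prints [sat, the]
    }
}
```

With a set like this I could skip words outside the vocabulary when filling `vectors`, but as far as I can tell that would only drop rows of the matrix, not columns, so I would prefer a built-in S-Space option if one exists.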
Thanks in advance...