Creating a word co-occurrence matrix using S-Space?


Emre

Apr 11, 2015, 10:09:56 AM
to s-spac...@googlegroups.com
Hi all,

I have a large corpus and need a tool for building a word co-occurrence matrix using a sliding window. I want to specify a fixed window size and have the tool construct the matrix, which I will then move to another environment.

Can the S-Space package help me with this? If so, what steps should I follow?

Thanks in advance.

David Jurgens

Apr 13, 2015, 5:31:22 PM
to s-spac...@googlegroups.com
Hi Emre,

  Yes, it should be possible to construct this type of co-occurrence matrix and then port it to a new environment.  The procedure is a bit clumsy at the moment, but you can do something like this:


int windowSize = 2; // Adjust as necessary
GenericWordSpace gws = new GenericWordSpace(windowSize);
// Process your documents here
// gws.processDocument(...);

// Get the words in the semantic space whose vectors will form the rows of the
// matrix.  (These are the words in the center of the sliding window)
List<String> wordsInSpace = new ArrayList<String>(gws.getWords());
List<DoubleVector> vectors = new ArrayList<DoubleVector>(wordsInSpace.size());
for (String word : wordsInSpace) {
    vectors.add(Vectors.asDouble(gws.getVector(word)));
}

// Convert the vectors into a matrix
Matrix cooccurrenceMatrix = Matrices.asMatrix(vectors);

// TODO: save your matrix.  Check out the MatrixIO class for an easy way to write it out.

// If you want to know the features associated with each column, you can do this
// too, but it's completely optional.  These words are the words appearing in
// the sliding window as features (but not the target word).
// 
// (NOTE: since the features are contiguous,
// you could do this code with a List<String> where the indices are the Map's
// keys, but I used a map to make it conceptually easier.)
Map<Integer,String> wordFeaturePerColumn = new HashMap<Integer,String>();
for (int dimension = 0; dimension < gws.getVectorLength(); ++dimension) {
    wordFeaturePerColumn.put(dimension, gws.getDimensionDescription(dimension));
}

   I hope this helps and let me know if you have any questions.
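For the "save your matrix" step, one library-independent option is to dump the values as plain text, one row per line, which most other environments can read. The sketch below is only an illustration under that assumption: the class and method names are hypothetical, it works on a plain double[][] (which you could fill from cooccurrenceMatrix.get(i, j)), and it does not use MatrixIO.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of S-Space.
public class DenseTextExport {

    /** Formats each matrix row as one line of space-separated values. */
    public static List<String> toLines(double[][] matrix) {
        List<String> lines = new ArrayList<>();
        for (double[] row : matrix) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < row.length; j++) {
                if (j > 0) {
                    sb.append(' ');
                }
                sb.append(row[j]);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    /** Writes the formatted rows to the given file as UTF-8 text. */
    public static void write(double[][] matrix, Path out) throws IOException {
        Files.write(out, toLines(matrix), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        double[][] demo = {{0, 1}, {1, 0}};
        Path out = Files.createTempFile("cooccurrence", ".txt");
        write(demo, out);
        System.out.println("Wrote " + out);
    }
}
```

A dense text dump like this is only practical for small vocabularies; for a large corpus a sparse format is likely a better fit.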

  Thanks,
  David


Emre

Apr 15, 2015, 4:37:42 AM
to s-spac...@googlegroups.com
Hi David,

Thanks very much for answering.

I tried your code. As an example, I created a file with the contents: A D C E A D F E B A C E D

When I build the matrix and print its contents with the code below, I get this result:

Code:

for(int i=0; i<cooccurrenceMatrix.rows(); i++)
{
    for(int j=0; j<cooccurrenceMatrix.columns(); j++)
    {
        System.out.print(cooccurrenceMatrix.get(i, j) + " ");
    }
    System.out.println();
}

Result:

2.0 3.0 0.0 3.0 1.0 1.0 
0.0 1.0 1.0 1.0 1.0 0.0 
2.0 0.0 3.0 2.0 0.0 1.0 
0.0 2.0 2.0 4.0 1.0 0.0 
4.0 2.0 3.0 0.0 1.0 1.0 
1.0 0.0 1.0 1.0 0.0 1.0 

The window size is 2.

But I actually expected a result like this:

0 1 3 2 3 1
1 0 1 0 1 1
3 1 0 2 2 0
2 0 2 0 4 1
3 1 2 4 0 1
1 1 0 1 1 0

As you can see, my expected matrix is symmetric and its diagonal elements are 0. For example, matrix(4,5)=4 because D (4) co-occurs with E (5) 4 times.
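For reference, the expected counts can be reproduced with a small standalone sketch that does not use S-Space at all. It assumes the window covers 2 words on each side of the target word and that rows and columns follow the alphabetical order A to F; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical standalone counter, not part of S-Space.
public class WindowCooccurrence {

    /**
     * For every pair of positions at most windowSize apart, adds one
     * co-occurrence in each direction, so the result is symmetric.
     * vocab fixes the row and column order and must contain every token.
     * (A word repeated inside its own window would add 2 to the diagonal;
     * that never happens in the example below.)
     */
    public static int[][] count(List<String> tokens, List<String> vocab, int windowSize) {
        int[][] counts = new int[vocab.size()][vocab.size()];
        for (int i = 0; i < tokens.size(); i++) {
            int row = vocab.indexOf(tokens.get(i));
            int end = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = i + 1; j <= end; j++) {
                int col = vocab.indexOf(tokens.get(j));
                counts[row][col]++;
                counts[col][row]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("A D C E A D F E B A C E D".split(" "));
        // TreeSet gives the alphabetical vocabulary A, B, C, D, E, F.
        List<String> vocab = new ArrayList<>(new TreeSet<>(tokens));
        for (int[] row : count(tokens, vocab, 2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```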

Am I wrong?

Thanks...


On Tuesday, April 14, 2015 at 00:31:22 UTC+3, David Jurgens wrote:

Emre

Apr 15, 2015, 7:39:58 AM
to s-spac...@googlegroups.com
Hi again,

After looking into the issue, I realized that the matrices are actually the same :) But the columns are in a different order!

The rows are ordered alphabetically, i.e. the first row is for A, the second row for B, and so on.

But the column order is D C A E F B.

The current matrix layout is:

     D  C  A  E  F  B
A
B
C
D
E
F

But my desired matrix should be:

      A  B  C  D  E  F
A
B
C
D
E
F

How can I achieve this?

On Wednesday, April 15, 2015 at 11:37:42 UTC+3, Emre wrote:

Emre

Apr 20, 2015, 3:32:35 AM
to s-spac...@googlegroups.com
David, you didn't reply. Do you have any idea? This is very important to me.

Thanks... 

On Wednesday, April 15, 2015 at 14:39:58 UTC+3, Emre wrote:

David Jurgens

Apr 28, 2015, 11:03:05 AM
to s-spac...@googlegroups.com
Hi Emre,

  The GenericWordSpace class associates each dimension with a single word, which you can recover using the getDimensionDescription() method.  If you need a specific order, first iterate over the dimensions to find out which word is associated with which dimension, and then reorder the matrix columns according to your desired word order.
  If this isn't clear, let me know and I can go into more detail.
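  As a rough sketch of that reordering step, the example below works on a plain double[][] copy of the matrix so that it runs without S-Space. The class and method names are hypothetical. With S-Space, the currentOrder list would be filled by calling gws.getDimensionDescription(d) for each dimension d, and the array from cooccurrenceMatrix.get(i, j). The data is the 6x6 example from earlier in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of S-Space.
public class ColumnReorder {

    /**
     * Returns a copy of matrix whose columns follow desiredOrder.
     * currentOrder.get(c) is the word that describes column c of the input.
     */
    public static double[][] reorder(double[][] matrix,
                                     List<String> currentOrder,
                                     List<String> desiredOrder) {
        double[][] out = new double[matrix.length][desiredOrder.size()];
        for (int newCol = 0; newCol < desiredOrder.size(); newCol++) {
            String word = desiredOrder.get(newCol);
            int oldCol = currentOrder.indexOf(word);
            if (oldCol < 0) {
                throw new IllegalArgumentException("No column for word: " + word);
            }
            for (int r = 0; r < matrix.length; r++) {
                out[r][newCol] = matrix[r][oldCol];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Column order reported in the thread.
        List<String> currentOrder = Arrays.asList("D", "C", "A", "E", "F", "B");
        // Desired order: alphabetical, matching the row order.
        List<String> desiredOrder = new ArrayList<>(currentOrder);
        Collections.sort(desiredOrder);

        double[][] matrix = {
            {2, 3, 0, 3, 1, 1},
            {0, 1, 1, 1, 1, 0},
            {2, 0, 3, 2, 0, 1},
            {0, 2, 2, 4, 1, 0},
            {4, 2, 3, 0, 1, 1},
            {1, 0, 1, 1, 0, 1}};
        for (double[] row : reorder(matrix, currentOrder, desiredOrder)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

  Copying column by column keeps the row order untouched, so the result has both rows and columns in alphabetical order.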

  Thanks,
  David