Using S-Space for email content analysis

Dileepa Jayakody

Jun 3, 2014, 4:19:48 AM

to s-spac...@googlegroups.com

Hi All,

My name is Dileepa Jayakody from Sri Lanka. I'm an MSc research student and a GSoC student this year, working on my project ReputationBox using Apache Isis [1].

ReputationBox is an email reputation analysis tool that evaluates the goodness of emails based on the importance of the content and the senders.

To evaluate the goodness of the email content, I'm planning to use TF-IDF, LSA and email speech act theory to cluster emails and classify new emails based on similarity.

I came across S-Space and I feel it's a great library for NLP and semantic space related work. Thank you for your wonderful efforts. I think I can perform most of my email content analysis tasks using S-Space.

I will ask questions along the way and hope you guys will give me advice to make my project a success.

Thanks,

Dileepa

[1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/dileepaj/5662278724616192

David Jurgens

Jun 11, 2014, 9:32:20 AM

to s-spac...@googlegroups.com

Hi Dileepa,

Great to hear. If you run into any issues, please let us know!

Thanks,

David

--

---
You received this message because you are subscribed to the Google Groups "S-Space Package Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to s-space-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dileepa Jayakody

Jun 12, 2014, 10:35:49 AM

to s-spac...@googlegroups.com

Hi David and all,

I would like to know if there is any incremental data analysis technique. TF-IDF, LSA and most other analysis techniques require the document corpus (N documents) to be defined in advance.

In my case, emails are added to the corpus incrementally, which means I will have to recalculate TF-IDF/LSA periodically for the whole document set.

Is there a good option for incremental data analysis in S-Space, or in data analysis in general?
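To illustrate the recalculation problem for TF-IDF specifically: one general workaround is to keep running document-frequency counts and compute IDF only when a weight is requested, so adding an email never requires re-reading the older ones. The sketch below is plain Java and does not use S-Space. The class name IncrementalTfIdf, its methods, and the smoothed IDF formula are assumptions chosen for illustration, and it does not help with LSA.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of S-Space: keeps running counts so that
// adding a document only touches the terms in that document.
class IncrementalTfIdf {
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;

    // Register one tokenized document: bump N and each distinct term's document frequency.
    void addDocument(String[] tokens) {
        numDocs++;
        Set<String> seen = new HashSet<>();
        for (String t : tokens) {
            if (seen.add(t)) {
                docFreq.merge(t, 1, Integer::sum);
            }
        }
    }

    // Smoothed IDF from the current counts: log((1 + N) / (1 + df)) + 1.
    double idf(String term) {
        int df = docFreq.getOrDefault(term, 0);
        return Math.log((1.0 + numDocs) / (1.0 + df)) + 1.0;
    }

    // TF-IDF weights (raw term count times IDF) for one document,
    // measured against the corpus seen so far.
    Map<String, Double> weigh(String[] tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            weights.put(e.getKey(), e.getValue() * idf(e.getKey()));
        }
        return weights;
    }
}
```

The trade-off is that weights computed earlier go stale as N and the document frequencies change, so they should be recomputed on demand with weigh(...) rather than stored.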

Thanks,

Dileepa

David Jurgens

Jun 12, 2014, 5:59:13 PM

to s-spac...@googlegroups.com

Hi Dileepa,

Yes, the Random Indexing model and the HAL model (without dimensionality reduction) can be updated incrementally. If you have a lot of data, Random Indexing is probably a better choice because it produces more compact representations. Both algorithms should be usable off the shelf.

In general, there are some really nice distributional representations coming out of the deep learning community that can be continuously trained. Mikolov (2013) is a great example. It would be great to get these in the S-Space package one day.
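For readers unfamiliar with why Random Indexing can be updated incrementally, here is a minimal self-contained sketch of the general idea. It is not S-Space's implementation: the class TinyRandomIndexing and its parameters (dims, nonZeros, window, seed) are hypothetical. Each word gets a fixed sparse index vector of +1/-1 entries, and a word's context vector is the running sum of the index vectors of its neighbours, so a new document only adds to existing sums.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Illustrative toy model only; names and parameters are assumptions.
class TinyRandomIndexing {
    private final int dims;
    private final int nonZeros;
    private final int window;
    private final Random rng;
    private final Map<String, int[]> indexVectors = new HashMap<>();
    private final Map<String, int[]> contextVectors = new HashMap<>();

    TinyRandomIndexing(int dims, int nonZeros, int window, long seed) {
        if (nonZeros > dims) {
            throw new IllegalArgumentException("nonZeros must not exceed dims");
        }
        this.dims = dims;
        this.nonZeros = nonZeros;
        this.window = window;
        this.rng = new Random(seed);
    }

    // Lazily create a sparse ternary index vector: nonZeros random positions,
    // alternating +1 and -1, every other position 0.
    private int[] indexVector(String word) {
        return indexVectors.computeIfAbsent(word, w -> {
            int[] v = new int[dims];
            int placed = 0;
            while (placed < nonZeros) {
                int i = rng.nextInt(dims);
                if (v[i] == 0) {
                    v[i] = (placed % 2 == 0) ? 1 : -1;
                    placed++;
                }
            }
            return v;
        });
    }

    // Add one document: each token's context vector accumulates the index
    // vectors of the tokens within the window around it.
    void processDocument(String[] tokens) {
        for (int i = 0; i < tokens.length; i++) {
            int[] ctx = contextVectors.computeIfAbsent(tokens[i], w -> new int[dims]);
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j == i) continue;
                int[] idx = indexVector(tokens[j]);
                for (int d = 0; d < dims; d++) {
                    ctx[d] += idx[d];
                }
            }
        }
    }

    // Context vector for a word, or null if the word has not been seen.
    int[] getVector(String word) {
        return contextVectors.get(word);
    }
}
```

Because nothing is ever recomputed globally, processing a new email changes only the vectors of the words that occur in it. The dimensionality stays fixed no matter how many documents or words are added.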

Thanks,

David

Dileepa Jayakody

Jun 13, 2014, 12:54:24 AM

to s-spac...@googlegroups.com

Thanks a lot David for the pointer.

I will check out RandomIndexing in S-Space and come back to you if I have more questions.

Regards,

Dileepa

Marcin Tatjewski

Jun 13, 2014, 8:35:44 PM

to s-spac...@googlegroups.com

David, thanks a lot for that pointer to Mikolov (2013)! It's interesting to see the impact of deep learning for distributional semantics.

Thanks,

Marcin

Dileepa Jayakody

Jul 23, 2014, 7:55:10 AM

to s-spac...@googlegroups.com

Hi David and All,

In my application, I integrated RandomIndexing to represent my emails in a word-space model.

What I do is basically this:

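import java.io.BufferedReader;
import java.io.StringReader;
import edu.ucla.sspace.ri.RandomIndexing;

RandomIndexing randomIndex = new RandomIndexing();
for (Email email : emails) {
    // Feed each email's preprocessed token stream to the model as one document
    String processedTokenStream = email.getTextContent().getTokenStream();
    randomIndex.processDocument(new BufferedReader(new StringReader(processedTokenStream)));
}
// Called once, after all documents have been processed
randomIndex.processSpace(null);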

To get results I invoke:

Set<String> allWords = getRandomIndex().getWords();
for (String word : allWords) {
    Vector contextVector = getRandomIndex().getVector(word);
    String vectorString = "";
    for (int i = 0; i < contextVector.length(); i++) {
        Integer val = (Integer) contextVector.getValue(i);
        vectorString += "[" + i + " : " + val + "], ";
    }
    logger.info(word + " : " + vectorString);
}

I get a large matrix of size num.of.words (cols) x 4000 (rows), but almost all of the values are 0.

When I browsed through the RandomIndexing code, I saw that processSpace(Properties prop) has an empty implementation.

Does this mean the RandomIndexing algorithm is not doing any semantic processing after processing the documents?

/**
 * Does nothing.
 *
 * @param properties {@inheritDoc}
 */
public void processSpace(Properties properties) {
}

Can you please help me learn how to extract results for my emails?

What I need is a result matrix of num.of.words (cols) x no.of.emailDocuments (rows), so that I can feed it to a similarity-based clustering algorithm to identify email topic clusters.
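One common way to get per-email vectors from per-word vectors is to sum the word vectors of the tokens in each email and compare emails by cosine similarity. The thread does not confirm that this is the approach S-Space intends, so treat the following as a sketch under that assumption. Note that it yields one row per email with the word-vector dimensionality as columns, not a words x emails matrix. The class DocumentVectors and its methods are hypothetical, and the word vectors are passed in as plain int arrays (for example, copied out of getVector(word) as in the loop above).

```java
import java.util.Map;

// Hypothetical helper, not part of S-Space.
class DocumentVectors {
    // Sum the word vectors of every token in the email; unknown words are skipped.
    static double[] documentVector(String[] tokens, Map<String, int[]> wordVectors, int dims) {
        double[] doc = new double[dims];
        for (String t : tokens) {
            int[] v = wordVectors.get(t);
            if (v == null) continue;
            for (int d = 0; d < dims; d++) {
                doc[d] += v[d];
            }
        }
        return doc;
    }

    // Cosine similarity; defined as 0 when either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int d = 0; d < a.length; d++) {
            dot += a[d] * b[d];
            normA += a[d] * a[d];
            normB += b[d] * b[d];
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / Math.sqrt(normA * normB);
    }
}
```

Stacking the resulting document vectors row by row gives an emails x dimensions matrix, and the pairwise cosine values can serve as the similarity input to a clustering algorithm.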