Log analysis using doc2vec


Julien Verplanken

Feb 9, 2018, 4:39:29 AM
to gensim
Hi all,

I'm investigating whether there's potential in applying word embeddings to log files, in order to do some form of anomaly detection.
For this purpose, there was some prior development that basically classifies each log line into a single numeric (integer) value.

So you could say that a processed log file would look like this:
...
125
3454
12
15647
1201
78
98897
1544
122
...
I then divided the log sequences into chunks of fixed length (for instance, 1000 integers, representing 1000 log lines).
Obviously, in this scenario the terms "sentence", "word" and "vocabulary" have a different meaning than in traditional NLP problems.

My question(s):
Is it sensible to treat each log chunk as a single sentence when training a doc2vec model? Having only a single "number" per log line isn't going to let the model learn much about word relationships.
I've read somewhere that the sentence size limit is 10000 tokens. Does that mean this should be my maximum chunk length, or should I rethink how to treat the concept of a sentence and use smaller chunks as parts of a single "document" (in this case, a log segment)?
Have any of you attempted this (or know somebody else who did)?
Any recommendations are greatly appreciated!

PS: I've already come across the following experience report: https://hal.laas.fr/hal-01576291/document They use the raw log data (with some minor adaptations) to create document vectors.

Kind regards,

Julien

Ivan Menshikh

Feb 11, 2018, 11:05:30 PM
to gensim
Hello Julien,

About the 10000 limit: it's better to keep len(document) <= 10000 tokens (this is a static limitation of our implementation).
What kind of classification do you use (i.e., what is the rule for converting a log line into an integer)? Also, I don't quite follow the rule for creating one document: should it be based on some semantics, or do you simply pick 10000 numbers in a row?

Julien Verplanken

Feb 12, 2018, 6:10:40 AM
to gensim
Hi,

About how we convert a log line into an integer: we "mask" the log lines by removing all non-alpha terms and use a parse tree to "classify" each log line as a certain (random) number. We're basically just making sure that two identical log lines get the same number and two different log lines get different ones. We then feed these numbers to doc2vec (casting them to strings first).

A "document" is indeed a log sequence; as you mentioned, this is for instance 1000 numbers in a row. We're also investigating creating log chunks per time interval (which results in varying sequence lengths per chunk).

For the anomaly detection, we hope to find out which log sequences are rare/unexpected.
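The mapping described above (identical masked lines get identical IDs) could be sketched roughly as follows. Note this is an illustration only: the real setup uses a parse tree, for which a plain dict stands in here, and the example log lines are invented:

```python
# Rough sketch of the log line -> integer mapping described above.
# A dict stands in for the real parse tree; it still guarantees that
# identical masked lines map to the same ID and different ones don't.
_ids = {}

def mask(line: str) -> str:
    # Drop non-alphabetic terms (numbers, IPs, paths, ...) keeping the template.
    return " ".join(t for t in line.split() if t.isalpha())

def line_id(line: str) -> str:
    key = mask(line)
    if key not in _ids:
        _ids[key] = str(len(_ids))  # arbitrary but stable integer, kept as a string
    return _ids[key]

a = line_id("Connection from 10.0.0.1 refused")
b = line_id("Connection from 10.0.0.2 refused")
c = line_id("Disk /dev/sda1 full")
print(a, b, c)  # a == b (same template), c differs
```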


On Monday, February 12, 2018 at 05:05:30 UTC+1, Ivan Menshikh wrote:

Ivan Menshikh

Feb 13, 2018, 5:22:27 AM
to gensim
It looks like you've built something like a hash.
So I don't think you'll get anything useful out of Doc2Vec in this case (because there are no "document semantics" here), but you could try training Word2Vec and clustering all the word vectors (number vectors, in your case), although that is still a somewhat unusual method for anomaly detection.