Log analysis using doc2vec


Julien Verplanken

Feb 9, 2018, 4:39:29 AM
to gensim
Hi all,

I'm investigating whether there's potential in applying word embeddings to log files, in order to do some form of anomaly detection.
For this purpose, there was some prior development that basically classifies each log line into a single numeric (integer) value.

So you could say that a processed log file would look like this:
...
125
3454
12
15647
1201
78
98897
1544
122
...
I then divided the log sequences into chunks of fixed length (for instance, 1000 integers, representing 1000 log lines).
Obviously, in this scenario the terms "sentence", "word" and "vocabulary" have a different meaning than in traditional NLP problems.

My question(s):
Is it sensible to treat each log chunk as a single sentence when training a doc2vec model? Having only a single "number" per log line isn't going to let the model learn much about word relationships.
I've read somewhere that the sentence size limit is 10000 tokens. Does that mean this should be my maximum chunk length, or should I rethink how to treat the concept of a sentence and use smaller chunks as parts of a single "document" (in this case, a log segment)?
Have any of you attempted this (or know somebody else who did)?
Any recommendations are greatly appreciated!

PS: I've already come across the following experience report: https://hal.laas.fr/hal-01576291/document They use the raw log data (with some minor adaptations) to create document vectors.

Kind regards,

Julien

Ivan Menshikh

Feb 11, 2018, 11:05:30 PM
to gensim
Hello Julien,

About the 10000 limit: it's better to keep len(document) <= 10000 tokens (this is a static limitation of our implementation).
What kind of classification do you use (i.e., what is the rule for converting a log line into an integer)? Also, I don't quite follow the rule for creating one document: should it be based on some semantics, or do you simply pick 10000 numbers in a row?

Julien Verplanken

Feb 12, 2018, 6:10:40 AM
to gensim
Hi,

About how we convert a log line into an integer: we "mask" the log lines by removing all non-alpha terms and use a parse tree to "classify" each log line as a certain (random) number. We're basically just making sure that two identical log lines get the same number and two different log lines get different ones. We then feed these numbers to doc2vec (casting them to strings first).

A "document" is indeed a log sequence; as you mentioned, this is for instance 1000 numbers in a row. We're also investigating creating log chunks per time interval (which results in varying sequence lengths per chunk).

For the anomaly detection, we hope to find out which log sequences are rare/unexpected.
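The mapping described above (identical masked lines get identical IDs) could be sketched roughly as follows. Note this is an illustration only: the real setup uses a parse tree, for which a plain dict stands in here, and the example log lines are invented:

```python
# Rough sketch of the log line -> integer mapping described above.
# A dict stands in for the real parse tree; it still guarantees that
# identical masked lines map to the same ID and different ones don't.
_ids = {}

def mask(line: str) -> str:
    # Drop non-alphabetic terms (numbers, IPs, paths, ...) keeping the template.
    return " ".join(t for t in line.split() if t.isalpha())

def line_id(line: str) -> str:
    key = mask(line)
    if key not in _ids:
        _ids[key] = str(len(_ids))  # arbitrary but stable integer, kept as a string
    return _ids[key]

a = line_id("Connection from 10.0.0.1 refused")
b = line_id("Connection from 10.0.0.2 refused")
c = line_id("Disk /dev/sda1 full")
print(a, b, c)  # a == b (same template), c differs
```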


On Monday, February 12, 2018 at 05:05:30 UTC+1, Ivan Menshikh wrote:

Ivan Menshikh

Feb 13, 2018, 5:22:27 AM
to gensim
It looks like you've built something like a hash.
So I don't think you'll get anything useful out of Doc2Vec in this case (because there are no "document semantics" here), but you could try training Word2Vec and clustering all the word vectors (number vectors, in your case), although that is still a somewhat unusual method for anomaly detection.