Hi all,
I'm investigating whether there's potential in applying word embeddings to log files, in order to do some form of anomaly detection.
For this purpose there was some prior development that basically classifies each log line into a single numeric (integer) value.
So you could say that a processed log file would look like this:
...
125
3454
12
15647
1201
78
98897
1544
122
...
I then proceeded by dividing the log sequences into chunks of fixed length (for instance, 1000 integers, representing 1000 log lines).
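To make the chunking step concrete, here's a minimal sketch of what I mean (the helper name `chunk_ids` is just for illustration; each log line is assumed to already be mapped to an integer ID by the prior development):

```python
def chunk_ids(ids, chunk_size=1000):
    """Split a sequence of log-line IDs into fixed-length chunks.

    The final chunk may be shorter than chunk_size; whether to keep
    or drop it is a modelling choice.
    """
    return [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

# Small chunk size just to demonstrate; in practice it would be ~1000.
ids = [125, 3454, 12, 15647, 1201, 78, 98897, 1544, 122]
chunks = chunk_ids(ids, chunk_size=4)
# chunks -> [[125, 3454, 12, 15647], [1201, 78, 98897, 1544], [122]]
```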
Obviously in this scenario, the terms "sentence", "word" and "vocabulary" have a different meaning from traditional NLP problems.
My question(s):
Is it sensible to treat each log chunk as a single sentence when training a doc2vec model? Treating each line (a single "number") as its own sentence wouldn't let the model learn much about how words relate.
I've read somewhere that there is a sentence size limit of 10,000 words. Does that mean 10,000 should be my maximum chunk length, or should I rethink how to treat the concept of a sentence and use smaller chunks as parts of a single "document" (in this case, a log segment)?
Have any of you attempted this (or know somebody else who did)?
Any recommendations are greatly appreciated!
Kind regards,
Julien