Questions about feature selection and data engineering when using H2O autoencoder for anomaly detection


Fanwei Zeng

Oct 22, 2020, 4:26:41 PM
to H2O Open Source Scalable Machine Learning - h2ostream

I am using the H2O autoencoder in R for anomaly detection. I don't have a separate training dataset, so I use data.hex to train the model and then the same data.hex to calculate the reconstruction errors. The rows in data.hex with the largest reconstruction errors are considered anomalous. The model's mean squared error (MSE), which the model reports itself, is the sum of the squared reconstruction errors divided by the number of rows (i.e., examples). Below is the code for the model.

# Deep learning autoencoder model (assumes h2o is loaded, h2o.init() has run,
# and x / data.hex are already defined)
model.dl <- h2o.deeplearning(x = x, training_frame = data.hex,
                             autoencoder = TRUE, activation = "Tanh",
                             hidden = c(25, 25, 25), variable_importances = TRUE)

# Per-row reconstruction error (anomaly score)
errors <- h2o.anomaly(model.dl, data.hex, per_feature = FALSE)
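
For reference, here is a minimal sketch (not part of the original code) of how the per-row errors can be turned into a ranked anomaly list. The Reconstruction.MSE column name is what h2o.anomaly returns when per_feature = FALSE; the 1% cutoff is an assumed choice, not something prescribed by H2O.

# Sketch: rank rows by reconstruction error and flag the largest ones
err <- as.data.frame(errors)$Reconstruction.MSE
mean(err)                        # should match h2o.mse(model.dl), since the
                                 # model's MSE is the mean of the per-row errors
cutoff <- quantile(err, 0.99)    # assumed cutoff: flag the top 1% of rows
suspects <- as.data.frame(data.hex)[err > cutoff, ]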

There are currently about 10 features (factors) in my data.hex, all of them categorical. I have two questions:

(1) Do I need to perform feature selection to pick a subset of the 10 features before the data go into the deep learning model (with autoencoder = TRUE), in case some features are strongly associated with each other? Or is that unnecessary, since the autoencoder already compresses the data and keeps only the most important information, making feature selection redundant?
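
For context, here is a minimal sketch of how one could check the pairwise association between two categorical columns using Cramér's V in base R. The column names f1 and f2 are placeholders, not actual features from my data.

# Sketch: Cramér's V between two categorical columns (base R only)
# f1 / f2 below are placeholder column names
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  sqrt(as.numeric(chi2) / (n * (min(dim(tab)) - 1)))
}
df <- as.data.frame(data.hex)
cramers_v(df$f1, df$f2)   # values near 1 suggest strong association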

(2) The purpose of using the H2O autoencoder here is to identify the senders in data.hex whose actions are anomalous. Here are two examples of data.hex; Example B is a transformed version of Example A, produced by concatenating all the actions for each sender-receiver pair in Example A.


After running the model separately on the data.hex from Example A and from Example B, here is what I found:

(a) The MSE from Example A (~0.005) is 20+ times larger than the MSE from Example B;

(b) When I put the reconstruction errors in ascending order and plot them (so errors increase from left to right in the plot), the reconstruction error curve from Example A is much steeper at the right end (it shoots up), while the curve from Example B increases more gradually.
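
For reference, a minimal sketch of how such a sorted-error plot can be produced (my own reconstruction, not code from the model run):

# Sketch: plot per-row reconstruction errors in ascending order
err <- as.data.frame(errors)$Reconstruction.MSE
plot(sort(err), type = "l",
     xlab = "rows (sorted by error)", ylab = "reconstruction MSE")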

My question is: which version of data.hex works better for my purpose of identifying anomalies?

Thanks for your insights!

Fanwei Zeng

Oct 22, 2020, 4:33:54 PM
to H2O Open Source Scalable Machine Learning - h2ostream
The image didn't show, so here are the tables as plain text:

Example A of data.hex:

Sender     Receiver    Action
-----------------------------
person1    person2     Action1
person1    person2     Action2
person1    person2     Action3
person1    person2     Action4
person3    person4     Action5
person3    person4     Action6


Example B of data.hex, transformed from Example A above:

Sender     Receiver    Action
--------------------------------------------------------
person1    person2     Action1, Action2, Action3, Action4
person3    person4     Action5, Action6
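
For completeness, a minimal sketch of the A-to-B transformation in plain R (my own reconstruction of the grouping by sender-receiver pair described above):

# Sketch: collapse all actions per sender-receiver pair into one string (A -> B)
dfA <- as.data.frame(data.hex)   # columns: Sender, Receiver, Action
dfB <- aggregate(Action ~ Sender + Receiver, data = dfA,
                 FUN = function(a) paste(a, collapse = ", "))
dfB.hex <- as.h2o(dfB)           # back to an H2OFrame for modeling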

