Questions about feature selection and data engineering when using H2O autoencoder for anomaly detection


Fanwei Zeng

Oct 22, 2020, 4:26:41 PM
to H2O Open Source Scalable Machine Learning - h2ostream

I am using the H2O autoencoder in R for anomaly detection. I don't have a separate training dataset, so I use data.hex to train the model and then the same data.hex to calculate the reconstruction errors. The rows in data.hex with the largest reconstruction errors are considered anomalous. The model's mean squared error (MSE), which the model reports itself, is the sum of the squared reconstruction errors divided by the number of rows (i.e., examples). Below is the code for the model.

# Deep learning autoencoder model (assumes h2o is loaded, h2o.init() has run,
# and x / data.hex are already defined)
model.dl <- h2o.deeplearning(x = x, training_frame = data.hex,
                             autoencoder = TRUE, activation = "Tanh",
                             hidden = c(25, 25, 25), variable_importances = TRUE)

# Per-row reconstruction error (anomaly score)
errors <- h2o.anomaly(model.dl, data.hex, per_feature = FALSE)
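
For reference, here is a minimal sketch (not part of the original code) of how the per-row errors can be turned into a ranked anomaly list. The Reconstruction.MSE column name is what h2o.anomaly returns when per_feature = FALSE; the 1% cutoff is an assumed choice, not something prescribed by H2O.

# Sketch: rank rows by reconstruction error and flag the largest ones
err <- as.data.frame(errors)$Reconstruction.MSE
mean(err)                        # should match h2o.mse(model.dl), since the
                                 # model's MSE is the mean of the per-row errors
cutoff <- quantile(err, 0.99)    # assumed cutoff: flag the top 1% of rows
suspects <- as.data.frame(data.hex)[err > cutoff, ]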

There are currently about 10 features (factors) in my data.hex, all of them categorical. I have two questions:

(1) Do I need to perform feature selection to pick a subset of the 10 features before the data go into the deep learning model (with autoencoder = TRUE), in case some features are strongly associated with each other? Or is that unnecessary, since the autoencoder already compresses the data and keeps only the most important information, making feature selection redundant?
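
For context, here is a minimal sketch of how one could check the pairwise association between two categorical columns using Cramér's V in base R. The column names f1 and f2 are placeholders, not actual features from my data.

# Sketch: Cramér's V between two categorical columns (base R only)
# f1 / f2 below are placeholder column names
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  sqrt(as.numeric(chi2) / (n * (min(dim(tab)) - 1)))
}
df <- as.data.frame(data.hex)
cramers_v(df$f1, df$f2)   # values near 1 suggest strong association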

(2) The purpose of using the H2O autoencoder here is to identify the senders in data.hex whose actions are anomalous. Here are two examples of data.hex; Example B is a transformed version of Example A, produced by concatenating all the actions for each sender-receiver pair in Example A.


After running the model separately on the data.hex from Example A and from Example B, here is what I found:

(a) The MSE from Example A (~0.005) is 20+ times larger than the MSE from Example B;

(b) When I put the reconstruction errors in ascending order and plot them (so errors increase from left to right in the plot), the reconstruction error curve from Example A is much steeper at the right end (it shoots up), while the curve from Example B increases more gradually.
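
For reference, a minimal sketch of how such a sorted-error plot can be produced (my own reconstruction, not code from the model run):

# Sketch: plot per-row reconstruction errors in ascending order
err <- as.data.frame(errors)$Reconstruction.MSE
plot(sort(err), type = "l",
     xlab = "rows (sorted by error)", ylab = "reconstruction MSE")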

My question is: which version of data.hex works better for my purpose of identifying anomalies?

Thanks for your insights!

Fanwei Zeng

Oct 22, 2020, 4:33:54 PM
to H2O Open Source Scalable Machine Learning - h2ostream
The image didn't show, so here are the tables as plain text:

Example A of data.hex:

Sender     Receiver    Action
-----------------------------
person1    person2     Action1
person1    person2     Action2
person1    person2     Action3
person1    person2     Action4
person3    person4     Action5
person3    person4     Action6


Example B of data.hex, transformed from Example A above:

Sender     Receiver    Action
--------------------------------------------------------
person1    person2     Action1, Action2, Action3, Action4
person3    person4     Action5, Action6
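
For completeness, a minimal sketch of the A-to-B transformation in plain R (my own reconstruction of the grouping by sender-receiver pair described above):

# Sketch: collapse all actions per sender-receiver pair into one string (A -> B)
dfA <- as.data.frame(data.hex)   # columns: Sender, Receiver, Action
dfB <- aggregate(Action ~ Sender + Receiver, data = dfA,
                 FUN = function(a) paste(a, collapse = ", "))
dfB.hex <- as.h2o(dfB)           # back to an H2OFrame for modeling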

