Anomaly detection with H2O autoencoders in semi-supervised mode (few tagged records)


search...@gmail.com

Jul 23, 2015, 3:17:49 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,

I want to use H2O autoencoders for anomaly detection. Initially I will not have labeled data, so it will be unsupervised mode.

But I might have a few labeled records in the future, so it will be more like semi-supervised. Should I still use autoencoders in that case? It looks like the autoencoder does not consider the response column.

It would be good if I could reuse the same algorithm.

Please suggest.

Thanks,
Mahesh

Arno Candel

Jul 23, 2015, 12:02:44 PM
to H2O Open Source Scalable Machine Learning - h2ostream, search...@gmail.com
Hi Mahesh,

Thanks for your interest in H2O Deep Learning autoencoders for anomaly detection.

If you have labels, then you can do semi-supervised training by giving the autoencoder the "normal" cases only, making it easier to detect the "abnormal" cases. In that case, you would subset the data yourself (in R, normal.train <- train[train$response == "normal", ] or similar).
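The same subsetting step can be sketched in Python (a minimal illustration with made-up rows and a hypothetical "response" column; in practice you would filter your actual training frame):

```python
# Keep only the "normal" rows for training, so the autoencoder never
# learns to reconstruct the abnormal cases (hypothetical toy data).
rows = [
    {"v1": 0.10, "v2": 0.20, "response": "normal"},
    {"v1": 9.00, "v2": 8.50, "response": "abnormal"},
    {"v1": 0.15, "v2": 0.25, "response": "normal"},
]
normal_train = [r for r in rows if r["response"] == "normal"]
print(len(normal_train))  # 2 of the 3 rows survive the filter
```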

Without labels, you have to make sure the network won't learn to reconstruct the abnormal cases too well (depends on the ratio of normal to abnormal), in order for you to be able to tell abnormal vs normal just from the reconstruction error.
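The scoring step described above can be sketched in a few lines of plain Python: compute a per-row reconstruction error and flag rows above a threshold. The reconstruction here is mocked up; in H2O the per-row MSE would come from anomaly scoring on the trained autoencoder, and the threshold is something you pick by inspecting the error distribution:

```python
def row_mse(row, recon):
    # Per-row mean squared reconstruction error.
    return sum((a - b) ** 2 for a, b in zip(row, recon)) / len(row)

data  = [[0.00, 0.00], [0.10, 0.10], [5.00, 5.00]]
# Pretend autoencoder output: the normal rows reconstruct well,
# the outlier (last row) does not.
recon = [[0.05, 0.00], [0.10, 0.05], [1.00, 1.00]]

scores = [row_mse(r, x) for r, x in zip(data, recon)]
threshold = 1.0  # illustrative; chosen from the error distribution
anomalies = [i for i, s in enumerate(scores) if s > threshold]
print(anomalies)  # -> [2]
```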


You can find some examples here:


Hope this helps,
Arno

search...@gmail.com

Jul 24, 2015, 1:29:38 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Thanks Arno. Really appreciate your help.

It looks like my data has a lot of noise too (a few categorical and a few continuous features). There is a denoising feature for autoencoders on the H2O roadmap; will that help?

As of now, is there any other way to denoise the data?

My other concern: when dealing with multivariate data, the distance between points becomes very small, and H2O (with reproducible = false) uses Hogwild-style multithreading, which allows intentional race conditions. Will that cause problems with accuracy?

Thanks,
Mahesh

Arno Candel

Jul 24, 2015, 1:53:10 AM
to search...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
Hi Mahesh,

You are welcome.

To denoise your features, you can use input_dropout > 0 (works fine with "Tanh" or "Rectifier" activation). I would also suggest using some L1/L2 penalty. Make sure to check convergence (the MSE should go down from initial noise levels) and do some parameter tuning.
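What input dropout does during training can be illustrated in a few lines (a sketch, not H2O's implementation: each input is zeroed with the given probability, forcing the network to reconstruct from a corrupted view of the row):

```python
import random

def input_dropout(row, ratio, rng):
    # Zero each input value with probability `ratio`, the way a
    # denoising autoencoder corrupts its inputs on each training pass.
    return [0.0 if rng.random() < ratio else v for v in row]

rng = random.Random(42)
row = [1.0, 2.0, 3.0, 4.0, 5.0]
noisy = input_dropout(row, 0.2, rng)  # each value kept or zeroed
```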

If you have too many features (especially categorical factor levels, they add up quickly), then you can use unsupervised methods such as GLRM/PCA or K-Means to reduce the dimensionality before running the autoencoder. This all depends on the data you have, so it's not easy to know upfront what works best for you.
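For the numeric part of the data, PCA via SVD can be sketched in a few lines of NumPy (categorical columns would first need encoding, which is where GLRM is the more natural fit; the matrix size and k here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # hypothetical numeric feature matrix
Xc = X - X.mean(axis=0)          # center each column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3                            # keep the top-k principal components
X_reduced = Xc @ Vt[:k].T        # (100, 3) matrix to feed the autoencoder
print(X_reduced.shape)
```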

Hogwild! should be fine for accuracy, if your neural net is large enough (> 10k weights or so). Otherwise, you can compare to reproducible mode (with a seed), and see how it behaves. Again, you have to experiment.

Hope this helps,
Arno

search...@gmail.com

Jul 27, 2015, 12:55:30 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi Arno,

My results vary very wildly if I use "reproducible=true".
Do you think it is due to excessive noise in the data?
I am using the following:
_input_dropout_ratio = 0.2
_activation = Tanh
_l1 = 1e-4
_l2 = 1e-5

Also, if I use dimensionality reduction techniques, does the noise in the data get cancelled out?

Thanks,
Mahesh

Arno Candel

Jul 27, 2015, 1:06:32 AM
to search...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
Mahesh,
You need to set the seed as well, otherwise it's not reproducible. If results vary a lot, that means your parameters aren't leading to a good model; you might need to run for more epochs, or change the network size, number of layers, or other parameters. I can't tell you more without having access to the data and the use case.
What do you mean by "noise gets cancelled out"? Even for deep learning, "garbage in, garbage out" still holds... but yes, denoising with input dropout and L1/L2 should help a bit to filter the high frequencies out and increase the signal-to-noise ratio.
Hope this helps,
Arno

search...@gmail.com

Jul 27, 2015, 1:40:18 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Sorry, a typo on my part:

"My results vary very wildly if I use reproducible=false*"

False is necessary in the real world with large data, for performance reasons.

I guess the above suggestions still hold.

Thanks,
Mahesh
