Java Memory Leak - Expanding Heap Size

stuartgo...@gmail.com

unread,
Oct 5, 2016, 9:57:15 AM10/5/16
to H2O Open Source Scalable Machine Learning - h2ostream
Hi All,

Here is some background on the problem I am solving: I am using H2O to build deep neural networks to forecast security prices. The models are used every day to construct a forecast and are updated (retrained from a checkpoint) every 3 months. I am simulating 50 assets (50 models) over a period of 14 years. In other words, each model is trained 56 times (14 years × 4 updates per year).

I noticed that the memory usage just keeps climbing as the simulation continues. I have a server with 24 GB of RAM and 20 cores. At first I thought this was due to all the temporary files which H2O creates, so I used h2o.ls() to get a list of all the objects and removed everything which isn't a model (i.e. I keep 50 items). But the memory usage was still quite high. Then I thought it might be R, so I have been calling gc() continuously, but it doesn't help.

Through H2O Flow I can see that the memory size (used) on the cluster hovers between 120 and 160 MB. However, the total RAM used by Java just keeps climbing until the entire program crashes. I have replicated this bug with the bleeding edge version, the current stable release (Turing), and the previous stable release (Turchin). Also, even if I set the max memory usage to 8 GB, it will just go over 8 GB (and fairly quickly at that).

I think that it is exactly the same bug as this one logged on JIRA: https://0xdata.atlassian.net/browse/PUBDEV-3203. Is anybody working on this issue? I would label it as critical since it prevents people from using H2O in many use cases. Right now I have no idea what to do. Is there a way to install an even older version of H2O? Or call the Java garbage collector from inside R? Or just fix the problem wherever it is?
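[Editor's note on the "call the Java garbage collector" question: at the JVM level a GC call is only ever a hint. This is a minimal standalone Java sketch (not H2O-specific) of what such a request amounts to; all names here are illustrative.]

```java
public class GcHint {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long usedBefore = rt.totalMemory() - rt.freeMemory();
        System.gc(); // a request only; the JVM may defer or ignore it
        long usedAfter = rt.totalMemory() - rt.freeMemory();
        System.out.println("heap used before GC: " + usedBefore + " bytes");
        System.out.println("heap used after GC:  " + usedAfter + " bytes");
    }
}
```

Note that even after a full collection, most collectors are slow to return freed pages to the operating system, so the process size reported by the OS may not shrink.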

Any help / suggestions would be greatly appreciated!

Kind regards
Stuart Gordon Reid

stuartgo...@gmail.com

unread,
Oct 5, 2016, 12:02:32 PM10/5/16
to H2O Open Source Scalable Machine Learning - h2ostream, stuartgo...@gmail.com

I just wanted to add one more piece of information: the computer is running Ubuntu Server.

stuartgo...@gmail.com

unread,
Oct 5, 2016, 12:13:55 PM10/5/16
to H2O Open Source Scalable Machine Learning - h2ostream, stuartgo...@gmail.com
I have also taken a screenshot of the memory usage according to the H2O Flow user interface and another screenshot of the memory usage according to htop (a command-line "task manager" for Linux). You can see in the htop photo that the memory ("mem") is sitting at 5.16 GB, but the memory usage according to H2O Flow is only 124 MB. The memory usage according to htop just keeps climbing until it starts using the swap file and then crashes. I'm not sure how to reconcile these two measurements.

Image 1 (H2O Flow): https://drive.google.com/file/d/0B8BsJ1DWY28VczdVNEpSUXdpWG8/view?usp=sharing
Image 2 (HTOP): https://drive.google.com/file/d/0B8BsJ1DWY28VdGtVcmlvNEFaWkE/view?usp=sharing
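[Editor's note on reconciling the two numbers: H2O Flow reports Java heap usage, while htop reports the whole process's resident set, which also includes JVM overhead such as metaspace, thread stacks, GC structures, and any off-heap (direct) buffers. -Xmx caps only the heap, which is one reason a process can exceed an 8 GB setting. A small standalone Java demonstration of heap vs. off-heap, for illustration only:]

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();
        // Direct buffers live outside the Java heap: they grow the
        // process RSS (what htop shows) without growing heap "used"
        // (what H2O Flow shows).
        ByteBuffer buf = ByteBuffer.allocateDirect(64 << 20); // 64 MB off-heap
        long heapAfter = rt.totalMemory() - rt.freeMemory();
        System.out.println("off-heap buffer: " + buf.capacity() + " bytes");
        System.out.println("heap growth:     " + (heapAfter - heapBefore) + " bytes");
    }
}
```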

Tom Kraljevic

unread,
Oct 5, 2016, 12:26:54 PM10/5/16
to stuartgo...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

I recommend you read the memory tips I wrote in this thread:
https://groups.google.com/forum/#!topic/h2ostream/Dc6l4xzwkaU

Thanks,
Tom

stuartgo...@gmail.com

unread,
Oct 6, 2016, 4:17:49 AM10/6/16
to H2O Open Source Scalable Machine Learning - h2ostream, stuartgo...@gmail.com
Hi Tom,

I did read through your comments, and this is the code I run after each iteration of training / evaluating the models:

# Clear the R-side references and trigger R's garbage collector.
rm(models.trained); rm(h2o.frames)
rm(train); rm(valid)
gc(); gc(); gc()
# Remove every H2O key except the models we still need.
h2o.objs <- as.character(h2o.ls()$key)
keeps <- c(names(forec...@models.trained))
h2o.rm(h2o.objs[which(!h2o.objs %in% keeps)])
# Ask the H2O JVM to run a full garbage collection.
h2o:::.h2o.garbageCollect()
h2o:::.h2o.garbageCollect()
h2o:::.h2o.garbageCollect()

This keeps only the models, so I can update them at the next iteration, and removes everything else I don't need. I ran this overnight and saw that whilst the memory consumption was lower, it was still increasing.

After a few hours this message popped up:

warning is .h2o.__checkConnectionHealth() is behaving slowly

So I followed the link and this is the output from http://10.0.10.88:54321/3/Cloud:

{"__meta":{"schema_version":3,"schema_name":"CloudV3","schema_type":"Iced"},"_exclude_fields":"","skip_ticks":false,"version":"3.11.0.3643","branch_name":"master","build_number":"3643","build_age":"1 day","build_too_old":false,"node_idx":0,"cloud_name":"H2O_started_from_R_root_pwa051","cloud_size":1,"cloud_uptime_millis":54446634,"cloud_healthy":true,"bad_nodes":0,"consensus":true,"locked":true,"is_client":false,"nodes":[{"__meta":{"schema_version":3,"schema_name":"NodeV3","schema_type":"Iced"},"h2o":"/127.0.0.1:54321","ip_port":"127.0.0.1:54321","healthy":true,"last_ping":1475740751649,"pid":94657,"num_cpus":20,"cpus_allowed":19,"nthreads":19,"sys_load":11.83,"my_cpu_pct":-1,"sys_cpu_pct":-1,"mem_value_size":158821376,"pojo_mem":2279624704,"free_mem":7106214912,"max_mem":9544660992,"swap_mem":0,"num_keys":37515,"free_disk":250123124736,"max_disk":268803506176,"rpcs_active":0,"fjthrds":[-1,1,6,6,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,1,-1,2,0,0,0,0,0,0,1],"fjqueue":[-1,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,-1,7,0,0,0,0,0,0,0],"tcps_active":0,"open_fds":60,"gflops":0.5889999866485596,"mem_bw":2.8382048256E10}]}


I am surprised that the number of keys is 37515, because I have run this code whilst printing nrow(h2o.ls()) and I can see that only around 50 objects are stored.

I have no idea what is going on here, but I assure you I am following your advice from the other thread.