My data file is too big. ERDDAP runs out of memory.


Bob Simons

Oct 16, 2018, 6:36:41 PM
to ERDDAP
[A user privately asked a question of general interest. So I'm going to ask and answer it here.]

The Question:
I'm trying to make an ERDDAP dataset from one, very large CSV file. I keep getting the error message: "java.lang.OutOfMemoryError: GC overhead limit exceeded" when ERDDAP tries to load the dataset. I've increased the memory settings for ERDDAP up to 8GB via the Tomcat settings:
  export JAVA_OPTS='-server -Djava.awt.headless=true -Xmx8000M -Xms8000M -d64'
but I keep getting the same error message. What should I do?

Answer: 
There are two options:
1) Leave the file as is. In this case, you need to increase the -Xmx setting until ERDDAP can read the file successfully. And as a rule-of-thumb, the server should have at least 33% more physical memory than you are allocating to Java.  For example, if you set -Xmx to 12000M, the server should have at least 16GB.  Yes, it's true that reading ASCII files takes a surprisingly large amount of memory. Part of it has to do with how Java deals with Strings and there is nothing I can do about that aspect of it.

2) (Recommended) Split the file into multiple files. Ideally, you can split the file into logical chunks. For example, if the file has 20 months' worth of data, split it into 20 files, each with 1 month's worth of data. But there are advantages even if the main file is split up arbitrarily. This approach has multiple benefits: a) It reduces the memory needed to read the data files to 1/20th, because only one file is read at a time. b) ERDDAP can often handle requests much faster because it only has to look in one or a few files to find the data for a given request. c) If data collection is ongoing, the existing 20 files can remain unchanged, and you only need to modify one small, new file to add the next month's worth of data to the dataset.
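For example, splitting the big CSV into one file per month could look something like this (a hypothetical sketch, not an ERDDAP tool; the "time" column name and ISO 8601 timestamps are assumptions about your file):

```python
import csv

# Hypothetical sketch: split one big CSV into one file per month, assuming
# a column named "time" holding ISO 8601 values like "2018-03-14T00:00:00Z".
def split_by_month(in_path, out_dir):
    writers = {}  # "YYYY-MM" -> (open file, csv.writer)
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        time_col = header.index("time")
        for row in reader:
            month = row[time_col][:7]  # "YYYY-MM"
            if month not in writers:
                out = open(f"{out_dir}/data_{month}.csv", "w", newline="")
                w = csv.writer(out)
                w.writerow(header)  # each chunk keeps the header row
                writers[month] = (out, w)
            writers[month][1].writerow(row)
    for out, _ in writers.values():
        out.close()
```

Each output file keeps the original header row, so every chunk remains a valid standalone CSV of the kind EDDTableFromAsciiFiles expects.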

I hope that helps.

tylar...@mail.usf.edu

Oct 17, 2018, 10:54:58 AM
to ERDDAP
Does this error always mean that a file is too large? I have been getting "java.lang.OutOfMemoryError" when running on a large directory of .nc files. I assumed that the dataset itself might be too big to index, but might my problem be solved by splitting any large .nc files?

Bob Simons

Oct 17, 2018, 11:36:04 AM
to ERDDAP
No. There are lots of possible causes of the OutOfMemoryError.  

It is unlikely that a large number of files would make the file index that ERDDAP creates so large that it causes the error. If we assume that each file's entry uses 300 bytes, then 1,000,000 files would only take up 300MB. But in extreme cases, it's possible.

A single huge data file can cause the OutOfMemoryError. But here, it should be obvious because ERDDAP will fail to load the dataset (for tabular datasets) or read data from that file (for gridded datasets). 

A single huge request can cause the OutOfMemoryError. In particular, some of the orderBy options hold the entire response in memory briefly (e.g., to do a sort). If the response is huge, that can lead to the error.

It's always possible that several simultaneous large requests (on a really busy ERDDAP) can combine to cause memory trouble. E.g., 8 requests, each using 1GB, would cause problems for an -Xmx=8GB setup. But it is rare for every request to be at the peak of its memory use simultaneously, and you would easily be able to see that your ERDDAP is really busy with big requests. Still, it's possible.

There are other scenarios. If you look at the log.txt file to see what ERDDAP was doing when the error arose, you can usually get a good clue as to the cause. In most cases, there is a way to minimize that problem (e.g., split the data file into smaller files), but sometimes you just need more memory.

ERDDAP does a lot of checking to enable it to handle the error gracefully (so a request fails, but the system retains its integrity). But sometimes, the error damages system integrity and you have to restart ERDDAP. Hopefully, that is rare.

Bob Simons

Oct 17, 2018, 12:26:39 PM
to ERDDAP
Ah. I forgot to mention: yes, there is another type of index file that ERDDAP creates that can easily be the source of OutOfMemoryErrors.
For tabular datasets that use the <subsetVariables> attribute, ERDDAP makes a table of unique combinations of the values of
those variables. 
For huge datasets or when <subsetVariables> is misconfigured, this table can be large enough to cause OutOfMemoryErrors.
The solution is to remove variables that have a large number of unique values from the list of <subsetVariables>,
removing variables as needed until the size of that table is reasonable and the OutOfMemoryErrors go away.
The parts of ERDDAP that use the subsetVariables system don't work well
(e.g., web pages load very slowly) when there are more than 100,000 rows in that table.
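As a quick sanity check before listing variables in <subsetVariables>, you could count the unique combinations yourself. Here is an illustrative helper (my own sketch, not part of ERDDAP) for a CSV source, to check candidate columns against the ~100,000-row guideline above:

```python
import csv

# Illustrative helper: count the unique combinations of candidate
# subsetVariables columns in a CSV file. If the count is far above
# ~100,000, trim the candidate list before configuring the dataset.
def count_unique_combos(path, columns):
    combos = set()
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            combos.add(tuple(row[c] for c in columns))
    return len(combos)
```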

tylar...@mail.usf.edu

Oct 17, 2018, 1:00:39 PM
to ERDDAP
I am looking forward to revisiting my issue armed with this new knowledge; as you say I think it is a misconfiguration on my part related to sorting / indexing. Thanks for sharing the original question & great follow-up info.

MATHEW BIDDLE

Jul 15, 2019, 3:59:16 PM
to ERDDAP
I'm curious about this topic. I have a dataset that is a 747MB tsv file with 3,503,267 lines. When I ran GenerateDatasetsXml.sh using EDDTableFromAsciiFiles with 10GB memory, it failed with OutOfMemoryError. So I bumped it up to 11GB and that was enough to get it through. 

Could I get a little more detail about the comment:
"Yes, it’s true that reading ASCII files takes a surprisingly large amount of memory. Part of it has to do with how Java deals with Strings and there is nothing I can do about that aspect of it."
11GB of memory to read a 747MB file seems extraordinarily high.

To note: The only variables we have assigned as subsetVariables are variables with one unique value. So, the table for subsetVariables isn't large. 4 variables with 1 value each. 

Matt

Bob Simons

Jul 15, 2019, 4:55:13 PM
to ERDDAP
I admit that the increase from file size to memory usage is huge. I'm sorry about that.

Problems:
* 2X because Java stores all chars as 2 byte Unicode characters. 
* Another 2X because ERDDAP reads the entire file to parse it into lines, then processes the file.
  It does this because ERDDAP auto-senses a lot about the file type in the reader, so it sometimes needs to read many lines a few times
  (e.g., to auto-sense the column-names row, the data start row, and the separator).
* Another 2X (of the original) because ERDDAP holds the ASCII lines and the parsed strings (each cell) in memory at the same time.  
* Plus there's certainly a lot of overhead when dealing with 3M line strings and 30M(?) substrings (stored in the heap, with pointers).
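
The multipliers above compound. As a rough back-of-envelope check (my own illustrative arithmetic, not ERDDAP's actual accounting), applying them to a 747MB file:

```python
# Rough, illustrative estimate of peak memory when reading a tabular ASCII
# file, compounding the multipliers listed above (not ERDDAP's real numbers).
def rough_peak_mb(file_mb):
    as_java_chars = file_mb * 2            # Java stores chars as 2-byte values
    text_plus_lines = as_java_chars * 2    # whole file text + parsed lines
    lines_plus_cells = text_plus_lines * 2 # lines + per-cell strings
    return lines_plus_cells                # ~8x the file size, before overhead

print(rough_peak_mb(747))  # 5976 -- ~6GB, and per-string object overhead pushes it higher
```

That ~6GB plus heap and garbage-collection overhead makes the observed 10-11GB requirement less surprising.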

One can complain about various aspects of Java, but the advantages are huge. 
* 2 byte chars gain Unicode support and speed of processing (vs UTF8) at the expense of memory. A classic computer trade-off.
* The heap/dynamic memory allocation in Java is a wondrous thing compared to e.g., C's handling. 
  It frees up my time programming and it drastically reduces bugs and security vulnerabilities, but at the expense of memory and speed (notably garbage collection).
* The code can run unchanged on virtually any computer (Raspberry Pi to PC to server to super computer) on basically any OS and any chip.

Yes, ERDDAP could use less memory if it didn't do the auto-sensing and immediately parsed values into specific data types. But then generateDatasetsXml wouldn't be nearly as useful.
Yes, I could probably take time to make this so it can process the file basically as a stream (not via lines in memory).
Yes, it is possible I could optimize other aspects of this. 
But it is already pretty complicated code (very far from repeatedly: read a line, split on ",", grab the values).
I prefer to add features than spend tons of time optimizing. For at least 5 years, I have had 10+ years of feature requests on my To Do list.
(I note that I didn't do one of the BCO-DMO requests for the last version. I'm sorry.)

As with the 2GB limit on .nc3 files as output, sometimes these limits are useful hints that you should do things differently.
Here, without seeing the file, I am pretty darn sure the dataset would be more efficient in ERDDAP (regardless of the current parser or an optimized one) if the file were broken into logical chunks (e.g., by station or something similar (if any), and certainly by time), because then ERDDAP could often reply to requests by opening one or a few small files instead of one giant file. ASCII files always carry a time penalty because the items must be parsed and converted to binary (especially date conversions, which are quite complicated).
I understand that BCO-DMO probably doesn't want to split the files. Okay. It's your choice.

In any case, memory is cheap, and has been for years, especially right now. Additional memory has lots of advantages for ERDDAP.
As an example (although it is almost certainly not right for your particular computer), you can get 32GB for $128.
I thought Roy was overdoing it when he bought a server with 16 (or 32) GB a few years ago, but it has come in handy for a few datasets with huge <subsetVariables> tables.
A day of your or my time is way more than $128. 
Heck, I probably just spent that in responding to this email (salary + benefits + overhead, which is high in a big organization like NOAA).  :-(  or :-O or ;-)

I hope that helps.
Best wishes.

Adam Shepherd

Jul 15, 2019, 5:15:57 PM
to Bob Simons, ERDDAP

Hi Bob,

Thanks so much for this review of what's happening behind the scenes. We love ERDDAP and appreciate all the hard work that's gone into it by you and others. This description helps us justify to our IT department why we need the memory we are asking for. Thank you!

cheers, Adam

--
You received this message because you are subscribed to the Google Groups "ERDDAP" group.
To unsubscribe from this group and stop receiving emails from it, send an email to erddap+un...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/erddap/e829c6a8-a734-4901-b6e1-a84bbb7344fe%40googlegroups.com.

Roy Mendelssohn - NOAA Federal

Jul 15, 2019, 5:43:13 PM
to Adam Shepherd, Bob Simons, ERDDAP
Yes, generally I throw a lot of memory at ERDDAP. Most of the time it doesn't need it, but some of the time it does, and it keeps things running smoothly for those times. Also, my anecdotal experience is that it helps Tomcat a lot. I am not certain whether it improves garbage collection, or helps when dealing with errors (we get a large number of malformed requests), or something else, but Tomcat behaves better.

As for the 2GB .nc limit, I know Bob has looked at nc4 as an output format (serving existing .nc4 files already works), but I don't know where that stands or whether it is being worked on at all. There are a number of issues: for example, there is no pure-Java implementation of the required libraries (to write .nc4 files), which raises questions about how we test and distribute it. There have been several neat contributions to ERDDAP lately, and this might be another one someone could work on; thoughtful additions to the service are welcomed, and my own feeling is that its modular design makes this quite possible.

-Roy

**********************
"The contents of this message do not reflect any position of the U.S. Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
***Note new street address***
110 McAllister Way
Santa Cruz, CA 95060
Phone: (831)-420-3666
Fax: (831) 420-3980
e-mail: Roy.Men...@noaa.gov www: http://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected"
"the arc of the moral universe is long, but it bends toward justice" -MLK Jr.

Roy Mendelssohn - NOAA Federal

Jul 15, 2019, 6:23:07 PM
to ERDDAP, Adam Shepherd, Bob Simons


> On Jul 15, 2019, at 2:43 PM, Roy Mendelssohn - NOAA Federal <roy.men...@noaa.gov> wrote:
>
> As for the 2GB .nc limit, I know Bob has looked at nc4 as an output format (serving .nc4 files already exists), I don't know where it stands or if it is being worked on at all, there are a number of issues, such as there is no pure Java implementation of the required libraries (to write .nc4 files), and raises issues as how for us to test and distribute.

I want to expand on this. It is reasonable to expect a consistent list of return types in ERDDAP across different installations of ERDDAP. The Java interface for writing nc4 files uses JNI to call the C libraries. That means that if there is to be a consistent list of return types, the NetCDF (and HDF5 and compression) libraries must be installed. Installing the libraries can be non-trivial. This puts an extra burden on an ERDDAP admin, not only to get the libraries built but to have the expertise to do so, as compared to dropping a war file in a directory.

A number of years ago a fellow at NCAR made an effort to write a pure-Java HDF5 library that could do reads and writes (or at least enough kinds of writes to satisfy the needs of netCDF), but the effort was dropped (it looks like 5 years ago). See:

https://github.com/NCAR/nujan

Now that would be an interesting project to revive.

-Roy

Bob Simons

Jul 15, 2019, 7:31:08 PM
to ERDDAP
Roy's right about the 64-bit nc3 and nc4 things.
I'll add:
The coding is mostly done. I just wanted to think more about the other consequences, like the things Roy brought up.
I think the best solution is to turn on support for 64-bit nc3 files (which is still all-Java) and add that as a standard feature in ERDDAP. That significantly expands the allowed file size but doesn't add the complications of .nc4 (needing to install an external C library). ERDDAP currently doesn't really need the other .nc4 features (groups, and arguably compression; longs and UTF-8 would be nice), but that could change. I tried to get CF and Unidata to make small but important improvements to a new version of 64-bit .nc3 (like longs and UTF-8), but got nowhere. (If you want those things, ask Unidata. But they're really busy, too.)

Bob Simons

Jul 17, 2019, 11:57:58 AM
to ERDDAP
For the sake of completeness and to have all the info in one place for when I (hopefully someday) work on this, I'll add:
I forgot to mention some other possible reasons for huge memory needs:

1) When ERDDAP starts reading an ASCII file, it has no idea how many rows there will be. 10? 10 million?
ERDDAP uses ArrayList-like data structures for this, mostly for their ability to grow efficiently as needed. 
But it comes at a cost of 1 - 2X wasted space at any given moment.

2) ASCII files can be quite memory efficient.
A missing value takes only 1 byte (for the delimiter, e.g.,  ",").
Integers often take 2 bytes, e.g., ",0".

In their binary representation,
32 bit integers and floats take 4 bytes, while doubles take 8 bytes.
Strings are hard to quantify, but in one form in Java (subset of original string), they can take 16 bytes minimum.

So if an ASCII file has missing values or simple values, those values may take 2-4 times (worst case 8 times) more memory in their binary form.
If the file has lots of those values, e.g., millions, that is a significant amount of memory.

It's true, ASCII files can be wasteful of space. 
But few integers in ASCII files are as large as 2,000,000,000.
And few doubles from sensors utilize the full decimal precision of doubles. (Yes, derived double values often do use the full precision.)

So a typical ASCII file has some values that are more memory efficient than their binary form, and some that are less so.
But extreme cases occur often.
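
To make the comparison concrete, here is a tiny illustrative calculation (my own sketch, using the nominal sizes mentioned above):

```python
# Compare the ASCII size of a value (its text plus 1 delimiter byte) with
# the nominal binary sizes from the discussion above: 4 bytes for a 32-bit
# int or float, 8 bytes for a double.
def ascii_bytes(value):
    return len(str(value)) + 1  # e.g., ",0" is 2 bytes; a bare "," is 1

print(ascii_bytes(""))           # missing value: 1 byte in ASCII vs 4-8 in binary
print(ascii_bytes(0))            # 2 bytes in ASCII vs 4 as a 32-bit int
print(ascii_bytes(2000000000))   # 11 bytes in ASCII vs 4 as a 32-bit int
```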

Rob Fuller

Jul 18, 2019, 1:52:31 AM
to Bob Simons, ERDDAP
Hi. It would probably be a big change, but storing strings in a trie data structure could give big memory savings...
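For readers unfamiliar with the idea, here is a minimal trie sketch (illustrative only, not ERDDAP code): strings that share a prefix share nodes, so values like "station_A1" and "station_A2" store the "station_A" prefix only once.

```python
# Minimal trie: each node maps one character to a child node, so common
# prefixes are stored once no matter how many strings share them.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.terminal = False  # True if a stored string ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, s):
        node = self.root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def __contains__(self, s):
        node = self.root
        for ch in s:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.terminal
```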


Bob Simons

Jul 24, 2019, 11:15:38 AM
to ERDDAP
I worked on this and hopefully solved the problem. The changes will be in the next release (v2.02), hopefully relatively soon.

Basically:
ERDDAP's method of storing arrays of strings (StringArray) is now much more memory efficient. 
These are used throughout ERDDAP, notably when reading tabular ASCII data files.
Also, other changes make reading tabular ASCII data files even more memory efficient.
The result is: for a 764 MB ASCII data test file (now compressed to a 52MB .gz file) with 3,503,266 rows and 33 columns, 
peak memory usage went from 10GB down to 1.6GB.
The time to read it went from ~7 minutes (but varies greatly with how much physical memory is in the computer)
down to ~42 seconds (largely because Java doesn't have to do so much memory management now).
Many other places in ERDDAP will benefit from this increased memory efficiency.

I explored a variant (storing strings in StringArray as UTF-8-encoded byte arrays).
That variant reduces memory usage another ~33%, but at the cost of ~33% slowdown. 
Compared to the system that is now being used, that seemed like a bad trade-off.
It's easier to give a computer more memory (buy more memory for ~$200)
than to make it faster (buy a whole new computer).

Big thanks to everyone (especially Tylar Murray and Mathew Biddle) for bringing this up and pointing out that ERDDAP was using way, way too much memory when reading large tabular ASCII files, and for providing a good test file. 
Thanks also for prodding me to explain why this was happening -- that led me to see where things could be improved.

Bob Simons

Jul 29, 2019, 3:05:59 PM
to ERDDAP
I made some more optimizations. 
Now, ERDDAP can read the test file in 33s, using only 0.6GB at peak.