I admit that the increase from file size to memory usage is huge. I'm sorry about that.
Problems:
* 2X because Java stores all chars as 2-byte Unicode characters.
* Another 2X because ERDDAP reads the entire file to parse it into lines, then processes the file.
It does this because ERDDAP does a lot of auto-sensing of the file type in the reader, so it sometimes needs to read many lines more than once
(e.g., to auto-sense the column names row, the data start row, and the separator).
* Another 2X (of the original) because ERDDAP holds the ASCII lines and the parsed strings (each cell) in memory at the same time.
* Plus there's certainly a lot of overhead when dealing with 3M line strings and 30M(?) substrings (stored in the heap, with pointers).
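To make that arithmetic concrete, here's a rough back-of-the-envelope sketch. The constants are assumptions (2 bytes per char for UTF-16 String storage, plus a guessed per-object overhead; Java 9+ compact strings can halve the char storage for Latin-1 text), but it shows why millions of line Strings and tens of millions of cell Strings add up fast:

```java
public class MemoryEstimate {
    // Assumed constants: 2 bytes per char (UTF-16 String storage),
    // plus a rough guess for the String header + backing array overhead.
    static final long BYTES_PER_CHAR = 2;
    static final long PER_STRING_OVERHEAD = 40; // assumed, varies by JVM

    // Estimate heap bytes needed to hold nStrings totaling nChars characters.
    static long estimate(long nStrings, long nChars) {
        return nChars * BYTES_PER_CHAR + nStrings * PER_STRING_OVERHEAD;
    }

    public static void main(String[] args) {
        // e.g., a ~1 GB ASCII file held as ~3M line Strings,
        // plus ~30M parsed cell Strings in memory at the same time
        long lines = estimate(3_000_000L, 1_000_000_000L);
        long cells = estimate(30_000_000L, 1_000_000_000L);
        System.out.println((lines + cells) / (1024 * 1024) + " MB (rough)");
    }
}
```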
One can complain about various aspects of Java, but the advantages are huge.
* 2-byte chars gain Unicode support and speed of processing (vs. UTF-8) at the expense of memory. A classic computer trade-off.
* The heap/dynamic memory allocation in Java is a wondrous thing compared to, e.g., C's handling.
It frees up my programming time and drastically reduces bugs and security vulnerabilities, but at the expense of memory and speed (notably garbage collection).
* The code can run unchanged on virtually any computer (Raspberry Pi to PC to server to super computer) on basically any OS and any chip.
Yes, memory usage would be much lower if ERDDAP didn't do the auto-sensing and instead immediately parsed each line into specific data types. But then generateDatasetsXml wouldn't be nearly as useful.
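For illustration, separator auto-sensing of the kind described above might be sketched like this (a hypothetical approach, not ERDDAP's actual algorithm): pick the candidate separator that appears most consistently across the first few lines.

```java
public class SeparatorSensor {
    // Candidate separators to consider (an assumed list).
    static final char[] CANDIDATES = {',', '\t', ';', '|'};

    // Count occurrences of c in line.
    static int count(String line, char c) {
        int n = 0;
        for (int i = 0; i < line.length(); i++)
            if (line.charAt(i) == c) n++;
        return n;
    }

    // Guess the separator: the candidate whose minimum per-line count
    // (over the sample lines) is largest, i.e., the most consistent one.
    public static char sense(java.util.List<String> lines) {
        char best = ',';
        int bestScore = -1;
        for (char c : CANDIDATES) {
            int min = Integer.MAX_VALUE;
            for (String line : lines)
                min = Math.min(min, count(line, c));
            if (min > bestScore) { bestScore = min; best = c; }
        }
        return best;
    }
}
```

Note that this kind of sensing is exactly why the reader needs the first chunk of lines in memory before it can commit to a parsing strategy.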
Yes, I could probably take the time to rework this so it processes the file as a stream (rather than via lines held in memory).
Yes, it is possible I could optimize other aspects of this.
But it is already pretty complicated code (very far from a simple loop of: read a line, split on ",", grab the values).
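For contrast, that simple loop (which the real code is far from) would look something like this minimal sketch: streaming line by line, a hard-coded separator, no quoting support, no auto-sensing.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class NaiveCsvReader {
    // Read a CSV file one line at a time (streaming, not all-in-memory),
    // split on "," and collect the values.
    // No quoted fields, no type sensing, no header detection.
    public static List<String[]> read(String fileName) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(line.split(",", -1)); // -1 keeps trailing empty fields
            }
        }
        return rows;
    }
}
```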
I prefer adding features to spending tons of time optimizing. For at least 5 years, I have had 10+ years of feature requests on my To Do list.
(I note that I didn't do one of the BCO-DMO requests for the last version. I'm sorry.)
As with the 2GB limit of .nc3 files as output, sometimes these limits are useful to hint to you to do things differently.
Here, without seeing the file, I am pretty darn sure that the dataset would be more efficient in ERDDAP (regardless of the current parser or an optimized one) if the file were broken into logical chunks: by station or something similar (if any), and certainly by time. Then ERDDAP could often reply to requests by opening one or a few small files instead of one giant file. ASCII files always carry a time penalty because the items must be parsed and converted to binary (especially date conversions, which are quite complicated).
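As a hypothetical illustration (the column layout and ISO 8601 time format are assumptions, since I haven't seen the file), splitting a big CSV into one file per year could be as simple as:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class SplitByYear {
    // Split a CSV into one file per year, assuming (hypothetically) that
    // column timeCol holds an ISO 8601 time like 2012-03-04T05:06:07Z.
    // The header line is copied into each output chunk.
    public static void split(String inName, String outPrefix, int timeCol)
            throws IOException {
        Map<String, PrintWriter> writers = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(inName))) {
            String header = in.readLine();
            String line;
            while ((line = in.readLine()) != null) {
                String year = line.split(",", -1)[timeCol].substring(0, 4);
                PrintWriter out = writers.get(year);
                if (out == null) {
                    out = new PrintWriter(outPrefix + year + ".csv");
                    out.println(header);
                    writers.put(year, out);
                }
                out.println(line);
            }
        } finally {
            for (PrintWriter out : writers.values()) out.close();
        }
    }
}
```

ERDDAP's EDDTableFromAsciiFiles-style datasets can aggregate a directory of such chunks, so requests for a small time range touch only the relevant files.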
I understand that BCO-DMO probably doesn't want to split the files. Okay. It's your choice.
In any case, memory is cheap and has been for years and especially right now. Additional memory has lots of advantages for ERDDAP.
As an example (although it is almost certainly not right for your particular computer) here's 32GB for $128.
I thought Roy was overdoing it when he bought a server with 16 (or 32) GB a few years ago, but it has come in handy for a few datasets with huge <subsetVariables> tables.
A day of your or my time is way more than $128.
Heck, I probably just spent that in responding to this email (salary + benefits + overhead, which is high in a big organization like NOAA). :-( or :-O or ;-)
I hope that helps.
Best wishes.