Apache POI WorkbookFactory.create throws java.lang.OutOfMemoryError: Java heap space

4 views
Skip to first unread message

advocate via StackOverflow

unread,
Nov 19, 2014, 3:36:22 PM11/19/14
to google-appengin...@googlegroups.com

My issue is pretty simple. I want to validate files up to 50MB in size for correct formatting in App Engine.

This presents a number of large challenges now. The first of which is the Apache XLS/XLSX POI API. When I load 20MB of file data into memory locally before validation it throws:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:128)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:84)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:272)
    at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:79)

I need to open and valite spreadsheets up to 20 to 25MB in size. 50 would be a nice stretch goal if possible. We're talking hundreds of thousands of lines of data all on a single sheet.

Now my traditional code loads the entire file into memory and promptly proceeds to crash the heap of my app engine instance. Here's my traditional code:

    public ErrorLog validateWorkbook(inputWorkbook)
    {
        int sheetCount = inputWorkbook.getNumberOfSheets();
        for (int x = 0; x< sheetCount; x++)
        {
            Sheet currentSheet = inputWorkbook.getSheetAt(x);
            Iterator<Row> rowIterator = currentSheet.rowIterator();
            while(rowIterator.hasNext())
            {
                Iterator<Cell> cellIterator = rowIterator.next().cellIterator();
                while(cellIterator.hasNext())
                {
                    Cell currentCell = cellIterator.next();
                    boolean success = validateCellContents(currentCell);
                    if(!success)
                        ErrorLog.appendError(new Error()); // detailed user error explicitly defining error location, cell value, and recommended steps to fix
                }
            }
        }
        return ErrorLog;
    }

Now there are supposedly event-based ways to handle an actionlistener every time a cell is encountered. But the dummy code here has a reference to:

ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(container); 

I've inspected this object in the debugger and it contains every unique string reference in the current worksheet. This is essentially doing exactly what I'm trying to avoid. It's allocating a large block of memory to store each value in memory beforehand. The ideal solution takes a stream of input bytes and decodes the strings as it traverses the file to reduce the memory footprint.

Because the string table is definitely going to take up a lot of space in memory. I'm working on processing 150,000 to 300,000 line item spreadsheets

Now the quick guide mentions that you can use a File or InputStream and that if you use a File, the input will be buffered. The problem here being that App Engine and the Blob Store Service, have no knowledge of File Objects and only return InputStreams (to the best of my knowledge).

Additionally, the other event-driven model, Default Handler, doesn't seem to have any concept of a column or row for each value in its interface-defined methods that are called when actions take place (and it also allocates the entire Shared String Table up front).

Running out of ideas here! Gonna try to offer up a bounty for this. At least a concrete 'no this isn't possible' would suffice and then I can start looking at workarounds but I get the feeling I'm just not using the vast API as well as I could be.



Please DO NOT REPLY directly to this email but go to StackOverflow:
http://stackoverflow.com/questions/25796777/apache-poi-workbookfactory-create-throws-java-lang-outofmemoryerror-java-heap-s
Reply all
Reply to author
Forward
0 new messages