My issue is pretty simple. I want to validate files up to 50MB in size for correct formatting in App Engine.
This presents a number of large challenges now. The first of which is the Apache XLS/XLSX POI API. When I load 20MB of file data into memory locally before validation it throws:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:128)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:84)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:272)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:79)
That's because I'm using the input stream method instead of the file method as detailed in the Apache POI quick start guide:
When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file.
To the best of my knowledge, the blob store service in App Engine only returns streams of bytes, not actual file objects. This means I have to use the Apache API for a stream which requires the entire file to be read into memory first. Thus the gist of my issue. Here's my sample code:
validatingSpec.validate(WorkbookFactory.create(blobInputStream)
Here are my questions: