So I ran into the OutOfMemoryError below while reading a Parquet file, and I'm not sure how to fix it. The client environment logging shows a 512 MB heap:
2024-03-31T07:30:20,184 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.memory.free=464MB
2024-03-31T07:30:20,184 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.memory.max=512MB
2024-03-31T07:30:20,184 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.memory.total=512MB
My back-of-the-envelope sizing: a standard Parquet file is 256 MB, and Snappy compression means roughly a 5x expansion on decompression, so 1280 MB in memory; two copies are needed at once, so 2560 MB, and that has to stay under 40% of the processing heap, which puts the heap at about 8 GB (arithmetic below). What is not clear is which process needs the value adjusted, as the documentation does not seem to address this.
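Working the numbers through explicitly (using my 5x-expansion and 40%-budget assumptions above):

    256 MB file  x 5 (Snappy expansion)  = 1280 MB decompressed
    1280 MB      x 2 copies              = 2560 MB resident at once
    2560 MB      / 0.40 (heap budget)    = 6400 MB minimum, so ~8 GB rounded up

That is far above the 512 MB maximum shown in the client environment above. The failing read itself looks like this: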
2024-03-31T07:30:26,679 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - RecordReader initialized will read a total of 1372160 records.
2024-03-31T07:30:26,679 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - at row 0. reading next block
2024-03-31T07:30:28,968 INFO [processing-0] org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.snappy]
2024-03-31T07:30:29,048 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - block read in memory in 2368 ms. row count = 679936
2024-03-31T07:31:18,635 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - Assembled and processed 679936 records from 43 columns in 48854 ms: 13.917714 rec/ms, 598.46173 cell/ms
2024-03-31T07:31:18,635 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - time spent so far 4% reading (2368 ms) and 95% processing (48854 ms)
2024-03-31T07:31:18,635 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - at row 679936. reading next block
2024-03-31T07:31:19,310 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - block read in memory in 675 ms. row count = 692224
Terminating due to java.lang.OutOfMemoryError: Java heap space
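For what it's worth, I understand the heap maximum itself is normally raised with the standard JVM flag, something along the lines of:

    java -Xmx8g -jar application.jar    (jar name is a placeholder, not the actual entry point)

but that presupposes knowing which JVM to launch that way, and which launch script or configuration file exposes the setting for this process is exactly what I can't find documented.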