Presto parquet optimized reader issues


Abraham Theodorus

Oct 4, 2017, 4:18:52 AM
to Presto
Hi,

I am facing issues with nested schemas in Presto. I have the following table:

CREATE TABLE hive.lake.product (
    ...
    product row( ..., name varchar, depid array(varchar), price bigint, id varchar),
    date varchar
    ...
)
WITH (
    external_location = 's3a://lake/product',
    format = 'PARQUET',
    partitioned_by = ARRAY['date']
)

1. My first issue: when I set hive.parquet_optimized_reader_enabled to true, I cannot query the array-typed column, because Presto always returns NULL for it.

It is similar to this issue: https://github.com/prestodb/presto/issues/7947

presto:lake> set session hive.parquet_optimized_reader_enabled=true;
SET SESSION
presto:iris> select * from product;
                                                       product                                                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {name=Panasonic E27, depid=null, price=7750, id=179164559} 
...

Query 20171003_101057_00559_6xrfv, RUNNING, 2 nodes
http://localhost:8889/query.html?20171003_101057_00559_6xrfv
Splits: 97 total, 0 done (0.00%)
CPU Time: 0.0s total,     0 rows/s,     0B/s, 100% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.0
0:07 [0 rows, 0B] [0 rows/s, 0B/s]

However, when I disabled it, Presto returned an error:

presto:lake> select * from product limit 10;

Query 20171003_101519_00561_6xrfv, FAILED, 2 nodes
http://localhost:8889/query.html?20171003_101519_00561_6xrfv
Splits: 98 total, 0 done (0.00%)
CPU Time: 0.0s total,     0 rows/s,     0B/s, 23% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.1
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20171003_101519_00561_6xrfv failed: Error opening Hive split s3a://lake/product/date=2017-09-25/part-0000_2017-09-26-13-47-42-260164-b806276b-48b6-4e31-aa26-e52021e10ce1.c000.snappy.parquet (offset=0, length=30890026): Column product.depId type LIST not supported
com.facebook.presto.spi.PrestoException: Error opening Hive split s3a://lake/product/date=2017-09-25/part-0000_2017-09-26-13-47-42-260164-b806276b-48b6-4e31-aa26-e52021e10ce1.c000.snappy.parquet (offset=0, length=30890026): Column product.depId type LIST not supported
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:386)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.<init>(ParquetHiveRecordCursor.java:160)
	at com.facebook.presto.hive.parquet.ParquetRecordCursorProvider.createRecordCursor(ParquetRecordCursorProvider.java:94)
	at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:159)
	at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:87)
	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
	at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
	at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:259)
	at com.facebook.presto.operator.Driver.processInternal(Driver.java:335)
	at com.facebook.presto.operator.Driver.lambda$processFor$6(Driver.java:240)
	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:612)
	at com.facebook.presto.operator.Driver.processFor(Driver.java:235)
	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Column product.depId type LIST not supported
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createGroupConverter(ParquetHiveRecordCursor.java:724)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createConverter(ParquetHiveRecordCursor.java:710)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.access$1100(ParquetHiveRecordCursor.java:102)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor$ParquetStructConverter.<init>(ParquetHiveRecordCursor.java:760)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createGroupConverter(ParquetHiveRecordCursor.java:722)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.access$300(ParquetHiveRecordCursor.java:102)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor$PrestoReadSupport.<init>(ParquetHiveRecordCursor.java:438)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:339)
	... 18 more

2. The second issue concerns an updated schema.

When the Parquet optimized reader is disabled, Presto returns a schema mismatch error whenever the file schema and the table schema differ inside the nested (row) part. I found no issues when the difference is outside the nested part.

Query 20171003_143733_00028_brtjs, FAILED, 3 nodes
http://localhost:8889/query.html?20171003_143733_00028_brtjs
Splits: 64 total, 3 done (4.69%)
CPU Time: 0.1s total, 0 rows/s, 0B/s, 23% active
Per Node: 0.0 parallelism, 0 rows/s, 0B/s
Parallelism: 0.1
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20171003_143733_00028_brtjs failed: Error opening Hive split s3a://lake/product/date=2017-09-27/part-0000_2017-09-27-12-07-36-166979-7b298d70-62e9-43cb-abd1-027d2d54328f.c000.snappy.parquet (offset=0, length=27356): Schema mismatch, metastore schema for row column product has 9 fields but parquet schema has 8 fields
com.facebook.presto.spi.PrestoException: Error opening Hive split s3a://lake/product/date=2017-09-27/part-0000_2017-09-27-12-07-36-166979-7b298d70-62e9-43cb-abd1-027d2d54328f.c000.snappy.parquet (offset=0, length=27356): Schema mismatch, metastore schema for row column product has 9 fields but parquet schema has 8 fields
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:386)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.<init>(ParquetHiveRecordCursor.java:160)
	at com.facebook.presto.hive.parquet.ParquetRecordCursorProvider.createRecordCursor(ParquetRecordCursorProvider.java:94)
	at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:159)
	at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:87)
	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
	at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
	at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:259)
	at com.facebook.presto.operator.Driver.processInternal(Driver.java:335)
	at com.facebook.presto.operator.Driver.lambda$processFor$6(Driver.java:240)
	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:612)
	at com.facebook.presto.operator.Driver.processFor(Driver.java:235)
	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Schema mismatch, metastore schema for row column product has 9 fields but parquet schema has 8 fields
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:399)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor$ParquetStructConverter.<init>(ParquetHiveRecordCursor.java:747)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createGroupConverter(ParquetHiveRecordCursor.java:722)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.access$300(ParquetHiveRecordCursor.java:102)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor$PrestoReadSupport.<init>(ParquetHiveRecordCursor.java:438)
	at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:339)
	... 18 more

Any ideas?

Zhenxiao Luo

Oct 4, 2017, 4:55:32 AM
to presto...@googlegroups.com

Hi Abraham,

For the old Parquet reader, schema evolution is supported in this PR:
#6675

I will work on the fix for the new Parquet reader in Q4 this year.
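
For readers unfamiliar with the term, here is a minimal Python sketch of what nested schema evolution means for the error reported above. It is an illustration only and is not taken from PR #6675 or from Presto's code: the function names and the `brand` field are hypothetical, and the sketch assumes fields are matched by name. A strict reader rejects a file whose row column has fewer fields than the table definition, while a tolerant reader returns NULL (None) for fields the older file does not contain.

```python
from typing import Any, Dict, List


def read_row_strict(table_fields: List[str],
                    file_fields: List[str],
                    values: List[Any]) -> Dict[str, Any]:
    """Positional mapping that refuses any difference in field count."""
    if len(table_fields) != len(file_fields):
        raise ValueError(
            f"Schema mismatch, metastore schema has {len(table_fields)} fields "
            f"but parquet schema has {len(file_fields)} fields"
        )
    return dict(zip(table_fields, values))


def read_row_by_name(table_fields: List[str],
                     file_fields: List[str],
                     values: List[Any]) -> Dict[str, Any]:
    """Name-based mapping: fields missing from the file come back as None."""
    by_name = dict(zip(file_fields, values))
    return {name: by_name.get(name) for name in table_fields}


# Hypothetical example: the table gained a `brand` field after the file was written.
table_fields = ["name", "price", "id", "brand"]
old_file_fields = ["name", "price", "id"]
old_file_values = ["Panasonic E27", 7750, "179164559"]

row = read_row_by_name(table_fields, old_file_fields, old_file_values)
```

With the name-based variant, `row["brand"]` is None for the old file, whereas `read_row_strict` raises on the same input.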


Thanks,

Zhenxiao


--
You received this message because you are subscribed to the Google Groups "Presto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to presto-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hao

Oct 4, 2017, 5:20:51 PM
to Presto
Can you share your Parquet schema? You can generate it with: parquet-tools schema PARQUET-DATA

jamesch...@gmail.com

May 10, 2018, 2:09:45 AM
to Presto
Hi Zhenxiao:

I just want to quickly check which version fixes the #6676 issue.
Thanks.

On Wednesday, October 4, 2017 at 4:55:32 PM UTC+8, Zhenxiao Luo wrote:
> To unsubscribe from this group and stop receiving emails from it, send an email to presto-users...@googlegroups.com.

Zhenxiao Luo

May 16, 2018, 12:55:13 AM
to presto...@googlegroups.com

Hi James,

Do you mean #6675? It is still under review. I think you could just apply that patch to Presto; the nested schema evolution issue should then be fixed.

Thanks,
Zhenxiao

Piyush N

Jun 11, 2018, 5:14:43 PM
to Presto
This change would be really useful to have in master. We've run into this bug as well. I tried applying this change to a slightly more recent version of Presto (based on 0.198) and ran into a ton of merge conflicts, because classes such as InterleavedBlockBuilder have since been dropped and the Block code has changed in other ways. Once I have this working (I am still wading through test code updates), I can take a shot at putting up a review to update #6675 (or open a fresh one), unless someone already has this handy.