--
You received this message because you are subscribed to the Google Groups "Presto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to presto-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Dain, I am testing out your changes and here are some observations:
- I can see a definite improvement with HDFS, but performance degrades significantly with S3.
- It does not work with ORC files that were created via Hive 0.11.
- Just to confirm: is the predicate pushdown to the reader already implemented? From a glance at your changes I did not see it while creating the reader.

- Shubham
On Friday, July 11, 2014 4:58:48 AM UTC-7, Shubham Tagra wrote:
> Dain, I am testing out your changes and here are some observations:
> - I can see a definite improvement in case of hdfs but the performance degrades by a great factor in case of s3.
How are you measuring "performance", and what are you using for the baseline? RCFile? Existing ORC?
My guess is the read shapes are making S3 perform poorly. The way ORC works is we advance through the file in 10k-row chunks. For each 10k chunk we determine which streams are needed and fetch all of them. Depending on how the ORC file is laid out, this could be one large read or many small reads. I have very little experience tuning S3, so can you describe how the system works, what is fast/slow, and how you work around these issues? Once I understand, I can tweak the heuristics. For example, in ORC you can end up with a read plan that looks like: read x MB, skip y MB, read z MB. Should we consolidate that into one read? Should we read these in parallel? If we are doing parallel, at what size should we split a read chunk into two parallel reads?
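[Editor's note: one way to express the "consolidate the read plan" heuristic discussed above is to merge adjacent ranges whose gap is below a threshold, trading wasted bytes for fewer seeks. The class and names below are a hypothetical sketch, not code from the patch.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the gap-consolidation heuristic: a read plan is a
// sorted list of (offset, length) ranges, and two ranges are merged into one
// larger sequential read when the gap between them is at most maxGapBytes.
public class ReadPlanMerger
{
    public record Range(long offset, long length)
    {
        long end() { return offset + length; }
    }

    public static List<Range> merge(List<Range> sorted, long maxGapBytes)
    {
        List<Range> merged = new ArrayList<>();
        for (Range range : sorted) {
            if (!merged.isEmpty()) {
                Range last = merged.get(merged.size() - 1);
                long gap = range.offset() - last.end();
                if (gap <= maxGapBytes) {
                    // absorb the gap: one large read instead of two reads and a seek
                    merged.set(merged.size() - 1,
                            new Range(last.offset(), range.end() - last.offset()));
                    continue;
                }
            }
            merged.add(range);
        }
        return merged;
    }

    public static void main(String[] args)
    {
        // "read 1 MB, skip 64 KB, read 1 MB" collapses to one read when the
        // allowed gap is 128 KB
        List<Range> plan = List.of(
                new Range(0, 1 << 20),
                new Range((1 << 20) + (64 << 10), 1 << 20));
        List<Range> merged = merge(plan, 128 << 10);
        System.out.println(merged.size());
        System.out.println(merged.get(0).length());
    }
}
```

On S3, where each seek is expensive, a larger `maxGapBytes` than on HDFS would plausibly pay off.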
> - Does not work with orc files that were created via hive.11
I tested the code by writing files with "hive.exec.orc.write.format=11", so there must be something different about the writers. I don't have a Hive 11 installation, so can you send me a small data set in ORC 11 and something like CSV?
> - This is just to confirm, is the predicate pushdown to reader already implemented? From a glance of your changes I did not see it while creating the reader.
When creating the OrcDataStream in https://github.com/dain/presto/blob/custom-orc/presto-hive/src/main/java/com/facebook/presto/hive/OrcDataStreamFactory.java , the tuple domain is passed in. This is used to prune the segments. This code doesn't use any of the pruning logic from the Hive code (it seems to be pretty slow and buggy).
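[Editor's note: the pruning described above boils down to an overlap test between the predicate's value range and a segment's recorded statistics. The sketch below illustrates the idea with invented names; it is not the Presto implementation.]

```java
// Hypothetical illustration of statistics-based segment pruning: a segment can
// be skipped when the predicate's range cannot overlap the segment's min/max.
public class SegmentPruning
{
    public record Stats(long min, long max) {}

    // predicate: column BETWEEN lo AND hi
    public static boolean canSkip(Stats segmentStats, long lo, long hi)
    {
        return segmentStats.max() < lo || segmentStats.min() > hi;
    }

    public static void main(String[] args)
    {
        // predicate: x BETWEEN 100 AND 200
        System.out.println(canSkip(new Stats(0, 50), 100, 200));    // segment skipped
        System.out.println(canSkip(new Stats(150, 300), 100, 200)); // segment must be read
    }
}
```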
-dain
On 11-07-2014 22:00, Dain Sundstrom wrote:
> On Friday, July 11, 2014 4:58:48 AM UTC-7, Shubham Tagra wrote:
>> Dain, I am testing out your changes and here are some observations:
>> - I can see a definite improvement in case of hdfs but the performance degrades by a great factor in case of s3.
> How are you measuring "performance", and what are you using for the baseline? RCFile? Existing ORC?

Against existing ORC. I ran some experiments on a small dataset: count(*) takes ~5 sec against HDFS but ~30 sec against S3.
> My guess is the read shapes are making S3 perform poorly. The way ORC works is we advance through the file in 10k-row chunks. For each 10k chunk we determine which streams are needed and fetch all of them. Depending on how the ORC file is laid out, this could be one large read or many small reads. I have very little experience tuning S3, so can you describe how the system works, what is fast/slow, and how you work around these issues? Once I understand, I can tweak the heuristics. For example, in ORC you can end up with a read plan that looks like: read x MB, skip y MB, read z MB. Should we consolidate that into one read? Should we read these in parallel? If we are doing parallel, at what size should we split a read chunk into two parallel reads?
Collected some data around it, and it is the seeks that are killing the performance. Seeks are quite costly in S3; in this case we spent the majority of the time in seek calls.
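[Editor's note: the magnitude reported here is consistent with seek cost dominating. A back-of-the-envelope sketch, with all counts invented for illustration and the per-seek cost taken from the connection-open time mentioned later in the thread:]

```java
// Rough estimate: if every seek on S3 costs a connection open (~50 ms), a read
// plan with a few hundred non-contiguous stream reads alone accounts for tens
// of seconds. Stripe and stream counts below are hypothetical.
public class SeekCostEstimate
{
    public static void main(String[] args)
    {
        int stripes = 100;          // hypothetical stripe count for the dataset
        int streamsPerStripe = 6;   // hypothetical non-contiguous reads per stripe
        double seekMillis = 50.0;   // approximate S3 connection-open cost
        double totalSeconds = stripes * streamsPerStripe * seekMillis / 1000.0;
        System.out.println(totalSeconds);
    }
}
```

Under these assumptions the seeks alone cost about 30 seconds, matching the observed count(*) gap.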
>> - Does not work with orc files that were created via hive.11
> I tested the code by writing files with "hive.exec.orc.write.format=11", so there must be something different about the writers. I don't have a Hive 11 installation, so can you send me a small data set in ORC 11 and something like CSV?

Attaching a file for a small ORC dataset created in Hive 0.11, and the DDL as well. The error is occurring because StripeStatistics is not defined in Hive 0.11.
>> - This is just to confirm, is the predicate pushdown to reader already implemented? From a glance of your changes I did not see it while creating the reader.
> When creating the OrcDataStream in https://github.com/dain/presto/blob/custom-orc/presto-hive/src/main/java/com/facebook/presto/hive/OrcDataStreamFactory.java , the tuple domain is passed in. This is used to prune the segments. This code doesn't use any of the pruning logic from the Hive code (it seems to be pretty slow and buggy).

But I could see the domain set according to the filter clause. Maybe I have your old branch; I will resync and check.
Correction to the last statement: I meant to write "I could not see the domain set according to the filter clause."
--
Sorry, I meant 50 ms to open a connection, not 500 ms, but I have seen 1 s occasionally.
ORC files created from Hive 0.11 have empty stripe stats. We also found that tuple domains weren't being passed in, and that had a performance impact. I'll send you a patch with our changes in a few hours.
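[Editor's note: the empty-stripe-stats case implies pruning must "fail open": with no statistics, a stripe cannot be proven irrelevant and must be read. A minimal sketch of that guard, with invented names, not the actual patch:]

```java
import java.util.Optional;

// Sketch: when a writer (e.g. Hive 0.11) emits no stripe statistics, the
// pruner must treat the stripe as unprunable rather than throwing or skipping.
public class SafePruning
{
    public record Stats(long min, long max) {}

    // predicate: column BETWEEN lo AND hi
    public static boolean canSkip(Optional<Stats> stats, long lo, long hi)
    {
        // no statistics -> cannot prove the stripe is irrelevant -> must read it
        return stats.map(s -> s.max() < lo || s.min() > hi).orElse(false);
    }

    public static void main(String[] args)
    {
        System.out.println(canSkip(Optional.empty(), 100, 200));              // read
        System.out.println(canSkip(Optional.of(new Stats(0, 50)), 100, 200)); // skip
    }
}
```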