Hi,
I've been trying to run large self-join queries on Presto, deployed on a cluster running Hadoop 2.4.0, Hive 0.13.1, and Presto 0.77.
For one reason or another, as the number of rows in the self-joined table grows, I fairly quickly start getting errors like the following in the
server logs:
----------
2014-10-13T02:00:17.249+0300 ERROR SplitRunner-20-75 com.facebook.presto.execution.TaskExecutor Error processing Split 20141012_230000_00000_pyz92.10.1-0 (start = 1413154813122, wall = 4107 ms, cpu = 17 ms, calls = 5)
java.lang.IndexOutOfBoundsException: end index (16391) must not be greater than size (16386)
at io.airlift.slice.Preconditions.checkPositionIndexes(Preconditions.java:80) ~[slice-0.7.jar:0.7]
at io.airlift.slice.Slice.checkIndexLength(Slice.java:984) ~[slice-0.7.jar:0.7]
at io.airlift.slice.Slice.getLong(Slice.java:366) ~[slice-0.7.jar:0.7]
at com.facebook.presto.spi.block.AbstractVariableWidthBlock.getLong(AbstractVariableWidthBlock.java:59) ~[presto-spi-0.77.jar:0.77]
at com.facebook.presto.spi.type.BigintType.compareTo(BigintType.java:72) ~[presto-spi-0.77.jar:0.77]
at com.facebook.presto.spi.block.SortOrder.compareBlockValue(SortOrder.java:62) ~[presto-spi-0.77.jar:0.77]
at com.facebook.presto.$gen.PagesIndexComparator_139.compareTo(Unknown Source) ~[na:na]
at com.facebook.presto.operator.PagesIndex$1.compare(PagesIndex.java:241) ~[presto-main-0.77.jar:0.77]
at com.facebook.presto.operator.WindowOperator.getOutput(WindowOperator.java:281) ~[presto-main-0.77.jar:0.77]
at com.facebook.presto.operator.Driver.process(Driver.java:330) ~[presto-main-0.77.jar:0.77]
at com.facebook.presto.operator.Driver.processFor(Driver.java:272) ~[presto-main-0.77.jar:0.77]
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:541) ~[presto-main-0.77.jar:0.77]
at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:444) ~[presto-main-0.77.jar:0.77]
at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:578) ~[presto-main-0.77.jar:0.77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
----------
The threshold number of rows at which the problem starts to appear is somewhere between 10,000 and 100,000, so the tables aren't that big. The rows in
the table (events) look like this:
----------
3980444,614,2010-06-21 15:16:04.000
3980730,614,2010-06-21 15:16:04.000
3985097,614,2010-06-14 11:42:41.000
3985261,614,2010-06-14 11:42:41.000
----------
For tables with fewer than 10,000 rows, everything seems to work without problems.
The query I'm trying to run looks like this:
----------
WITH events_with_rank_asc AS
(
SELECT
evt_case,
evt_event,
evt_timestamp,
ROW_NUMBER() OVER (PARTITION BY evt_case ORDER BY evt_timestamp ASC, evt_event ASC) AS evt_asc_rank
FROM
events
),
events_with_rank_desc AS
(
SELECT
evt_case,
evt_event,
evt_timestamp,
ROW_NUMBER() OVER (PARTITION BY evt_case ORDER BY evt_timestamp DESC, evt_event DESC) AS evt_desc_rank
FROM
events
)
SELECT *
FROM
(
SELECT
count(1) AS tra_count,
x.evt_event AS tra_event,
y.evt_event AS tra_next_event,
CAST((SUM(y.evt_timestamp - x.evt_timestamp)) AS BIGINT) AS tra_duration
FROM
events_with_rank_asc x
JOIN events_with_rank_asc y ON
((x.evt_case = y.evt_case) AND (x.evt_asc_rank = y.evt_asc_rank-1))
GROUP BY
x.evt_event, y.evt_event
UNION ALL
SELECT
count(1) AS tra_count,
'_START' AS tra_event,
evt_event AS tra_next_event,
CAST(0.0 AS BIGINT) AS tra_duration
FROM
events_with_rank_asc
WHERE
evt_asc_rank = 1
GROUP BY
evt_event
UNION ALL
SELECT
count(1) AS tra_count,
evt_event AS tra_event,
'_END' AS tra_next_event,
CAST(0.0 AS BIGINT) AS tra_duration
FROM
events_with_rank_desc
WHERE
evt_desc_rank = 1
GROUP BY
evt_event
) unionResult
ORDER BY
tra_count DESC, tra_event ASC, tra_next_event ASC;
----------
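To make the intended semantics of the query clear (independent of Presto), here is a minimal plain-Python sketch over hypothetical events: within each case, rows are ordered by timestamp then event (the ROW_NUMBER ranking), each rank-k row is paired with rank k+1 to count transitions and sum durations, and the rank-1 and last-rank rows contribute the `_START` and `_END` edges. The data and names here are made up for illustration only.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical events: (evt_case, evt_event, evt_timestamp as integer seconds)
events = [
    ("c1", "A", 0),
    ("c1", "B", 10),
    ("c1", "C", 25),
    ("c2", "A", 5),
    ("c2", "C", 12),
]

transitions = {}  # (tra_event, tra_next_event) -> (tra_count, tra_duration)

def add(key, duration):
    cnt, dur = transitions.get(key, (0, 0))
    transitions[key] = (cnt + 1, dur + duration)

for case, group in groupby(sorted(events), key=itemgetter(0)):
    # ORDER BY evt_timestamp ASC, evt_event ASC within the case partition
    seq = sorted(group, key=lambda r: (r[2], r[1]))
    add(("_START", seq[0][1]), 0)         # asc rank 1 -> _START edge
    add((seq[-1][1], "_END"), 0)          # desc rank 1 -> _END edge
    for prev, nxt in zip(seq, seq[1:]):   # join asc rank k with rank k+1
        add((prev[1], nxt[1]), nxt[2] - prev[2])
```

On the hypothetical data above this yields, e.g., two `_START -> A` transitions and an `A -> B` transition with a total duration of 10 seconds, matching what the UNION ALL branches of the query should produce.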
Configurations look like this:
jvm.config:
---
-server
-Xmx7G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+CMSClassUnloadingEnabled
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-XX:PermSize=150M
-XX:MaxPermSize=150M
-XX:ReservedCodeCacheSize=150M
-Xbootclasspath/p:<PRESTO_SERVER_INSTALLATION_PATH>/lib/floatingdecimal-0.1.jar
---
config.properties:
---
http-server.http.port=<PRESTO_PORT>
task.max-memory=4GB
discovery-server.enabled=true
discovery.uri=http://<PRESTO_DISCOVERY_HOST>:<PRESTO_PORT>
---
hive.properties:
---
connector.name=hive-hadoop2
hive.metastore.uri=thrift://<NAMENODE>:<HIVE_PORT>
hive.config.resources=<HADOOP_CONFIG_PATH>/core-site.xml,<HADOOP_CONFIG_PATH>/hdfs-site.xml
---
There is also a somewhat related question (which has since been resolved) that I posted earlier about a previous Presto version:
https://groups.google.com/d/msg/presto-users/B5YVhiddkdM/8WmvM0qnLDEJ
Does anyone have any idea what could be causing the failed queries, or how to proceed with the investigation? I'll gladly provide more information if needed. I wouldn't want to
revert to 0.60, since I'm trying to benchmark the latest versions of several different frameworks.
- Markku