Alluxio Worker Error


Shail Shah

Feb 2, 2017, 10:07:24 AM
to Alluxio Users
Hi,

I was able to run Presto with Alluxio successfully, but after running some queries I am facing the issue pasted below:

2017-02-02 11:43:01,926 ERROR logger.type (UnderFileSystemDataServerHandler.java:handleFileReadRequest) - Failed to read ufs file, may have been closed due to a client timeout.
java.net.SocketException: Socket closed
        at java.net.SocketInputStream.read(SocketInputStream.java:203)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
        at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593)
        at sun.security.ssl.InputRecord.read(InputRecord.java:532)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
        at alluxio.org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:78)
        at alluxio.org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:136)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at alluxio.underfs.s3.S3InputStream.read(S3InputStream.java:101)
        at com.google.common.io.CountingInputStream.read(CountingInputStream.java:62)
        at alluxio.underfs.ObjectUnderFileInputStream.read(ObjectUnderFileInputStream.java:75)
        at alluxio.worker.netty.UnderFileSystemDataServerHandler.handleFileReadRequest(UnderFileSystemDataServerHandler.java:83)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:78)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:43)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:346)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)

2017-02-02 11:43:01,927 INFO  httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry) - I/O exception (java.net.SocketException) caught when processing request: Socket Closed
2017-02-02 11:43:01,927 INFO  httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry) - Retrying request
2017-02-02 11:43:02,005 INFO  logger.type (FileUtils.java:createStorageDirPath) - Folder /mnt/ramdisk/alluxioworker/.tmp_blocks/648 was created!

2017-02-02 11:43:02,073 ERROR logger.type (UnderFileSystemDataServerHandler.java:handleFileReadRequest) - Failed to read ufs file, may have been closed due to a client timeout.
alluxio.exception.FileDoesNotExistException: Worker fileId 10206415740131496 is invalid. The worker may have crashed or cleaned up the client state due to a timeout.
        at alluxio.worker.file.UnderFileSystemManager.getInputStreamAtPosition(UnderFileSystemManager.java:432)
        at alluxio.worker.file.DefaultFileSystemWorker.getUfsInputStream(DefaultFileSystemWorker.java:148)
        at alluxio.worker.netty.UnderFileSystemDataServerHandler.handleFileReadRequest(UnderFileSystemDataServerHandler.java:77)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:78)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:43)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollRdHupReady(AbstractEpollStreamChannel.java:772)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:338)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)

2017-02-02 11:43:02,075 ERROR logger.type (UnderFileSystemDataServerHandler.java:handleFileReadRequest) - Failed to read ufs file, may have been closed due to a client timeout.
alluxio.exception.FileDoesNotExistException: Worker fileId 10206415740131490 is invalid. The worker may have crashed or cleaned up the client state due to a timeout.
        ... (same stack trace as above)

2017-02-02 11:43:02,184 ERROR logger.type (UnderFileSystemDataServerHandler.java:handleFileReadRequest) - Failed to read ufs file, may have been closed due to a client timeout.
alluxio.exception.FileDoesNotExistException: Worker fileId 10206415740131455 is invalid. The worker may have crashed or cleaned up the client state due to a timeout.
        at alluxio.worker.file.UnderFileSystemManager.getInputStreamAtPosition(UnderFileSystemManager.java:432)
        at alluxio.worker.file.DefaultFileSystemWorker.getUfsInputStream(DefaultFileSystemWorker.java:148)
        at alluxio.worker.netty.UnderFileSystemDataServerHandler.handleFileReadRequest(UnderFileSystemDataServerHandler.java:77)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:78)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:43)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:346)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)



...re S3 response data streams are always fully consumed or closed.

2017-02-02 11:39:35,002 INFO  logger.type (FileUtils.java:createStorageDirPath) - Folder /mnt/ramdisk/alluxioworker/.tmp_blocks/743 was created!
2017-02-02 11:39:35,016 INFO  logger.type (FileUtils.java:createStorageDirPath) - Folder /mnt/ramdisk/alluxioworker/.tmp_blocks/83 was created!
2017-02-02 11:39:35,162 ERROR logger.type (UnderFileSystemDataServerHandler.java:handleFileReadRequest) - Failed to read ufs file, may have been closed due to a client timeout.
javax.net.ssl.SSLProtocolException: Data received in non-data state: 6
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1109)
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
        at alluxio.org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:78)
        at alluxio.org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:136)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at alluxio.underfs.s3.S3InputStream.read(S3InputStream.java:101)
        at com.google.common.io.CountingInputStream.read(CountingInputStream.java:62)
        at alluxio.underfs.ObjectUnderFileInputStream.read(ObjectUnderFileInputStream.java:75)
        at alluxio.worker.netty.UnderFileSystemDataServerHandler.handleFileReadRequest(UnderFileSystemDataServerHandler.java:83)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:78)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:43)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:346)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)




My underfs address is a local path, and I mount the S3 directory to an Alluxio mount point.

If I free all the data and run the queries again, the queries run fine. Is there a consistency issue while reading from the underfs? Could someone help me figure out the issue?



Thanks,
Shail Shah

Calvin Jia

Feb 2, 2017, 3:53:17 PM
to Alluxio Users
Hey Shail,

Could you try running this with the s3a connector?

Thanks,
Calvin

Shail Shah

Feb 3, 2017, 5:09:13 AM
to Alluxio Users

Hi Calvin,

We weren't able to resolve this issue using the s3a connector. We tried increasing the Netty timeout using the following option:

alluxio.user.network.netty.timeout.ms=60000
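(For reference: this is a client-side property, so it needs to be visible to the Presto JVMs, e.g. passed as a -Dalluxio.user.network.netty.timeout.ms=60000 JVM option, assuming Presto's standard etc/jvm.config mechanism.)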


We tried to run the queries after that; I have pasted the log from one of the workers below.

Basically, it is giving a "socket closed" error, and I receive a different error each time I rerun the query after the failure.


I am using an S3 directory as the UNDERFS path with the s3a connector.


Then, after restarting Alluxio and Presto with the same config, I tested the same query on different amounts of data. First I ran the simple select count(*) from table_name where hour='00'; and gradually increased the number of hours, and those queries ran fine. But when I ran count(*) on the whole table, the query failed with the same error. I am running the queries through Presto.


Could this be due to a large number of s3a connections, or should I change any settings on the Presto client to handle this issue?


2017-02-03 09:50:32,138 ERROR logger.type (BlockDataServerHandler.java:handleBlockReadRequest) - Exception reading block 2231369728
alluxio.exception.BlockDoesNotExistException: lockId 25434 has no lock record
        at alluxio.worker.block.BlockLockManager.validateLock(BlockLockManager.java:249)
        at alluxio.worker.block.TieredBlockStore.getBlockReader(TieredBlockStore.java:173)
        at alluxio.worker.block.DefaultBlockWorker.readBlockRemote(DefaultBlockWorker.java:383)
        at alluxio.worker.netty.BlockDataServerHandler.handleBlockReadRequest(BlockDataServerHandler.java:89)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:70)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:43)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollRdHupReady(AbstractEpollStreamChannel.java:772)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:338)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)


2017-02-03 09:50:32,258 ERROR logger.type (BlockDataServerHandler.java:handleBlockReadRequest) - Exception reading block 2248146944
alluxio.exception.BlockDoesNotExistException: lockId 25427 has no lock record
        ... (same stack trace as above)


2017-02-03 09:50:32,304 ERROR logger.type (BlockDataServerHandler.java:handleBlockReadRequest) - Exception reading block 1912602624
alluxio.exception.BlockDoesNotExistException: lockId 25428 has no lock record
        ... (same stack trace as above)


2017-02-03 09:50:32,724 ERROR logger.type (BlockDataServerHandler.java:handleBlockReadRequest) - Exception reading block 2080374784
alluxio.exception.BlockDoesNotExistException: lockId 25456 has no lock record
        ... (same stack trace as above)


2017-02-03 09:50:32,726 ERROR logger.type (UnderFileSystemDataServerHandler.java:handleFileReadRequest) - Failed to read ufs file, may have been closed due to a client timeout.
java.net.SocketException: Socket closed
        at java.net.SocketInputStream.read(SocketInputStream.java:183)
        at java.net.SocketInputStream.read(SocketInputStream.java:121)
        at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
        at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:200)
        at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
        at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
        at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
        at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
        at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
        at java.security.DigestInputStream.read(DigestInputStream.java:161)
        at com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:59)
        at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
        at alluxio.underfs.s3a.S3AInputStream.read(S3AInputStream.java:97)
        at com.google.common.io.CountingInputStream.read(CountingInputStream.java:62)
        at alluxio.underfs.ObjectUnderFileInputStream.read(ObjectUnderFileInputStream.java:75)
        at alluxio.worker.netty.UnderFileSystemDataServerHandler.handleFileReadRequest(UnderFileSystemDataServerHandler.java:83)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:78)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:43)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:346)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)


2017-02-03 09:50:34,156 ERROR logger.type (UnderFileSystemDataServerHandler.java:handleFileReadRequest) - Failed to read ufs file, may have been closed due to a client timeout.
com.amazonaws.AmazonClientException: Unable to verify integrity of data download.  Client calculated content hash didn't match hash calculated by Amazon S3.  The data may be corrupt.
        at com.amazonaws.services.s3.internal.DigestValidationInputStream.validateMD5Digest(DigestValidationInputStream.java:79)
        at com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:61)
        at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
        at alluxio.underfs.s3a.S3AInputStream.read(S3AInputStream.java:97)
        at com.google.common.io.CountingInputStream.read(CountingInputStream.java:62)
        at alluxio.underfs.ObjectUnderFileInputStream.read(ObjectUnderFileInputStream.java:75)
        at alluxio.worker.netty.UnderFileSystemDataServerHandler.handleFileReadRequest(UnderFileSystemDataServerHandler.java:83)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:78)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:43)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:346)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)



2017-02-03 06:00:11,608 INFO  http.AmazonHttpClient (AmazonHttpClient.java:executeHelper) - Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
        at sun.reflect.GeneratedMethodAccessor65.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
        at com.amazonaws.http.conn.$Proxy39.get(Unknown Source)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
        at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:787)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:630)
        at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:405)
        at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:367)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:318)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3787)
        at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1137)
        at alluxio.underfs.s3a.S3AInputStream.openStream(S3AInputStream.java:127)
        at alluxio.underfs.s3a.S3AInputStream.read(S3AInputStream.java:95)
        at com.google.common.io.CountingInputStream.read(CountingInputStream.java:62)
        at alluxio.underfs.ObjectUnderFileInputStream.read(ObjectUnderFileInputStream.java:75)
        at alluxio.worker.netty.UnderFileSystemDataServerHandler.handleFileReadRequest(UnderFileSystemDataServerHandler.java:83)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:78)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:43)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)





Thanks,

Shail Shah



Calvin Jia

Feb 3, 2017, 1:32:58 PM
to Alluxio Users
Hi,

If you are using s3a and Alluxio 1.4.0, you can increase the number of connections available by changing the configuration (see conf documentation for more details): alluxio.underfs.s3.threads.max
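For example, in conf/alluxio-site.properties on the workers (the value below is only an illustration; tune it to your concurrency):

alluxio.underfs.s3.threads.max=200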

I noticed this exception in the logs you pasted; is it expected in your environment (i.e., an unstable network)?

com.amazonaws.AmazonClientException: Unable to verify integrity of data download.  Client calculated content hash didn't match hash calculated by Amazon S3.  The data may be corrupt.


Thanks,
Calvin

Shail Shah

Feb 6, 2017, 2:03:42 AM
to Alluxio Users
Hi Calvin,

We were able to solve the issue by adding the following parameters:

alluxio.underfs.s3a.request.timeout.ms=0
alluxio.worker.session.timeout.ms=60000000
alluxio.underfs.s3a.socket.timeout.ms=5000000

and also setting the following parameters on the Presto client:

alluxio.user.file.waitcompleted.poll.ms=1000000
alluxio.user.network.netty.timeout.ms=6000000


I think there was some issue due to either a client timeout or an s3a timeout. We were able to resolve it using these parameters.
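In case it helps others, a note on placement (our setup; adjust to yours): the first three are server-side properties and go in conf/alluxio-site.properties on each Alluxio worker, while the alluxio.user.* ones are client-side and were passed to the Presto JVMs as -D options, e.g. in Presto's etc/jvm.config:

-Dalluxio.user.file.waitcompleted.poll.ms=1000000
-Dalluxio.user.network.netty.timeout.ms=6000000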


I am still unable to completely diagnose a previous issue. When I got the previously mentioned errors, queries spanning the same data files would never run again after the timeout error; they threw the same error every time, within just 2 seconds. But if I freed those files from memory and queried the same data, sometimes the query would run and sometimes it wouldn't. Can you think of anything that might lead to this? If a query gets killed due to a timeout, the query should run the next time, irrespective of the availability of the data in any tier.


Thanks,
Shail Shah

Shail Shah

Feb 6, 2017, 7:45:04 AM
to Alluxio Users
Hi,

To continue from the previous post: I get the following error once there is a socket timeout and I rerun the query on the same dataset.

2017-02-06 12:16:20,569 ERROR logger.type (UnderFileSystemDataServerHandler.java:handleFileReadRequest) - Failed to read ufs file, may have been closed due to a client timeout.
alluxio.exception.FileDoesNotExistException: Worker fileId 5018825212407877824 is invalid. The worker may have crashed or cleaned up the client state due to a timeout.
        at alluxio.worker.file.UnderFileSystemManager.getInputStreamAtPosition(UnderFileSystemManager.java:432)
        at alluxio.worker.file.DefaultFileSystemWorker.getUfsInputStream(DefaultFileSystemWorker.java:148)
        at alluxio.worker.netty.UnderFileSystemDataServerHandler.handleFileReadRequest(UnderFileSystemDataServerHandler.java:77)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:78)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:43)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:346)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)

2017-02-06 12:16:20,913 INFO  logger.type (FileUtils.java:createStorageDirPath) - Folder /mnt/ramdisk/alluxioworker/.tmp_blocks/460 was created!
2017-02-06 12:16:21,198 INFO  logger.type (FileUtils.java:createStorageDirPath) - Folder /mnt/ramdisk/alluxioworker/.tmp_blocks/255 was created!

2017-02-06 12:16:21,525 ERROR logger.type (UnderFileSystemDataServerHandler.java:handleFileReadRequest) - Failed to read ufs file, may have been closed due to a client timeout.
alluxio.exception.FileDoesNotExistException: Worker fileId 5018825212407877829 is invalid. The worker may have crashed or cleaned up the client state due to a timeout.
        ... (same stack trace as above)


Calvin Jia

Feb 6, 2017, 2:42:36 PM
to Alluxio Users
Hi Shail,

Could you clarify your last post: does this happen after an S3 socket timeout even with the configuration changes you made in the previous post? Also, when you say the dataset becomes unreadable after one exception, is this only through Presto or also through other applications (for example, Alluxio's command line)?
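For example, something like the following from the Alluxio installation directory (the path below is just a placeholder for one of the affected files) would show whether the worker can still serve the data with Presto out of the loop:

bin/alluxio fs ls /path/to/table
bin/alluxio fs cat /path/to/table/part-00000 > /dev/null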

Thanks,
Calvin

Shail Shah

Feb 7, 2017, 1:18:19 AM
to Alluxio Users
Hi Calvin.

So, we added tiered storage and put all the data on SSD, but we are still facing the same issue, even with the configuration changes mentioned before. Basically, we increased the data we were querying from 80GB to 450GB, but we also increased the above-mentioned timeouts to 10 hours. We are still getting the issue.

I don't know whether the dataset becomes unreadable, but we are unable to run the same query again through Presto on the same dataset. We didn't try through other applications; I will try it and let you know.

Thanks,
Shail Shah

Calvin Jia

Feb 7, 2017, 3:54:10 PM
to Alluxio Users
Hi Shail,

Thanks for the update. I think you are hitting several unrelated problems, and I would like to understand which ones are being addressed and which ones are still outstanding (possibly all of them, but I think some should be addressed by the conf changes you made).

1. alluxio.exception.FileDoesNotExistException: Worker fileId 5018825212407877824 is invalid. The worker may have crashed or cleaned up the client state due to a timeout.
2. alluxio.exception.BlockDoesNotExistException: lockId 25434 has no lock record
3. com.amazonaws.AmazonClientException: Unable to verify integrity of data download.  Client calculated content hash didn't match hash calculated by Amazon S3.  The data may be corrupt.
4. org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
5. java.net.SocketException: Socket closed

If possible, could you provide the client and worker logs?

Thanks,
Calvin

Shail Shah

Feb 7, 2017, 4:22:22 PM
to Alluxio Users
Hi Calvin,

The following two problems have been addressed by increasing the timeouts:
3. com.amazonaws.AmazonClientException: Unable to verify integrity of data download.  Client calculated content hash didn't match hash calculated by Amazon S3.  The data may be corrupt.
4. org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool 

But the other problems arise intermittently. I am running the job again with a fresh install; I will attach the master and worker logs once I encounter the issue again.

Thanks,
Shail Shah

Calvin Jia

Feb 13, 2017, 1:53:38 PM
to Alluxio Users
Hi Shail,

Did you encounter these issues again after the fresh install?

Thanks,
Calvin

Shail Shah

Feb 16, 2017, 4:42:37 PM
to Alluxio Users
Hi Calvin,

After the fresh install, we were able to run Presto, but we are facing the issue intermittently in some situations. We have 10 Presto nodes running co-located with Alluxio; if one of the Presto nodes dies abruptly mid-query, we see the above error. I have attached the worker log of the Alluxio node, as you requested.


Thanks,
Shail Shah

Calvin Jia

Feb 16, 2017, 7:00:54 PM
to Alluxio Users
Hi Shail,

Thanks for providing the logs, they are very helpful. Does the Presto node die because of the error or due to some other reason which then triggers this error? Also, could you provide the configurations you are setting on the workers and clients?

Thanks,
Calvin

Shail Shah

Feb 17, 2017, 2:01:58 AM
to Alluxio Users
Hi Calvin,

The Presto node dies because of some other error, which then triggers this one. Please find the configuration settings for the workers and clients below:
Workers:

alluxio.master.worker.threads.max=4096
alluxio.master.tieredstore.global.levels=2
alluxio.master.tieredstore.global.level0.alias=MEM
alluxio.master.tieredstore.global.level1.alias=SSD

# Worker properties
alluxio.worker.block.threads.max=4096
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=10GB
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/data/alluxio
alluxio.worker.tieredstore.level1.dirs.quota=110GB
alluxio.worker.tieredstore.reserver.enabled=false

# User properties
alluxio.underfs.s3a.request.timeout.ms=0
alluxio.worker.session.timeout.ms=36000000
alluxio.underfs.s3a.socket.timeout.ms=36000000

Users:

-Dalluxio.user.file.waitcompleted.poll.ms=3600000
-Dalluxio.user.network.netty.timeout.ms=36000000
-Dalluxio.user.file.readtype.default=CACHE


Once the error occurs, Presto cannot run any more queries on the same dataset on the remaining live nodes. I am unable to figure out whether this issue is on the client side or the Alluxio side. Also, restarting Alluxio doesn't help; I need to format Alluxio before it works again.

Thanks,
Shail Shah

Calvin Jia

Feb 21, 2017, 1:27:21 PM
to Alluxio Users
Hi Shail,

Could you elaborate on why the Presto node is crashing? From looking at the logs you provided, it looks like the worker has a closed stream cached, and any client which tries to access that stream will fail (this will automatically be cleaned up). However, this stream is cached on a per-client basis, so it should not affect other clients which connect afterward. It is strange that you need to format Alluxio to restore the state; a restart should be sufficient. What kind of errors do you see when you try re-running without formatting?

Thanks,
Calvin 

Deepak Batra

Feb 27, 2017, 2:02:23 AM
to Alluxio Users
Hey Calvin, 
The Presto query dies because of the error mentioned here.

Nodes also used to crash sometimes because of OOM on the worker nodes; we fixed the node-crashing issue by tuning Presto's memory params.

If we don't format Alluxio and restart Presto, we keep seeing the same error as mentioned in the above link, on different files.

Calvin Jia

Feb 27, 2017, 3:56:05 PM
to Alluxio Users
Hi,

Thanks for the pointer, I have a few follow up questions:

When you run into this error, did you have the packet streaming feature enabled? How often does this error occur? When you say format and restart, do you mean presto or Alluxio (same as what Shail mentioned)?

Thanks,
Calvin

Deepak Batra

Feb 28, 2017, 1:27:34 AM
to Alluxio Users
Hey Calvin, 
Thanks for the reply. Here are the answers:

1. did you have the packet streaming feature enabled
No, we haven't enabled the packet streaming feature.

2. How often does this error occur?
It occurs intermittently, and there's no pattern as such. Sometimes it doesn't occur for days and sometimes twice a day. 

3. When you say format and restart, do you mean presto or Alluxio ?
We format alluxio and restart both alluxio and presto.

Calvin Jia

Feb 28, 2017, 4:03:18 PM
to Alluxio Users
Hi Deepak,

Thanks for the responses. Were you able to verify if the file itself was corrupted? Also, does Presto do any update/append operations on the file or is the file immutable?

Cheers,
Calvin

Deepak Batra

Mar 1, 2017, 12:57:31 AM
to Alluxio Users
Hey Calvin, 
The file wasn't corrupted. And every time (after a failure) it shows a different file that it is unable to read. The files are immutable; no updates/appends happen on them at all.

Calvin Jia

Mar 7, 2017, 2:03:09 PM
to Alluxio Users
Hi Deepak,

I've been looking into the Alluxio read path but it doesn't seem like there is anything special which would cause Alluxio to fail with Presto. Do you know of a simple way to reproduce the issue?

Thanks,
Calvin

Deepak Batra

Mar 21, 2017, 3:55:35 AM
to Alluxio Users
Hey Calvin,
There is no simple way as such. We have a 4-node Alluxio cluster with 3 workers and 1 master, and around 1TB of data in ORC format. We keep firing queries at Presto every 5 mins (scheduled queries, but not that heavy). Sometimes it takes 5 mins to crash and sometimes even a day. Another problem is that we have to restart the Alluxio cluster and format it to make it work again. The logs etc. are mentioned in this thread as well: https://groups.google.com/forum/#!topic/alluxio-users/oYQGr-DcNp4. We're kind of stuck over here. Let us know if anything can be worked out.

Deepak Batra

Mar 23, 2017, 3:19:34 AM
to Alluxio Users
Hey Calvin, 
Some headway on the issue: this is most probably a network glitch or S3 temporarily not responding; the same thing arises without Alluxio too. A question on Alluxio: does Alluxio remove the whole file if one of its blocks cannot be fetched because of an underlying file system issue? This might be the reason we need to restart and format Alluxio.

Thanks.

Calvin Jia

Mar 27, 2017, 1:58:46 PM
to Alluxio Users
Hi Deepak,

Alluxio works on a per-block basis, so the other blocks of the file will not be affected. There should be no case where you need to restart and/or format Alluxio.

In your environment, what is the `hive.max-split-size` set to? A possible workaround is to increase this to a size greater than or equal to your Alluxio block size. Shail, this should also apply to the issue you are seeing.
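For example, in Presto's etc/catalog/hive.properties (512MB here matches Alluxio's default block size; use whatever your alluxio.user.block.size.bytes.default is set to):

hive.max-split-size=512MB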

Hope this helps,
Calvin

Deepak Batra

Mar 31, 2017, 3:44:24 PM
to Alluxio Users
Hey Calvin, 
That worked, thanks a lot! :) We're still running scheduled queries to see if the old issue (the one that forced us to restart and format the cluster) occurs again.

Thanks again .. :)

Calvin Jia

Mar 31, 2017, 8:38:44 PM
to Alluxio Users
Great, thanks for confirming the workaround! We are also looking into addressing the root cause.

Aaquib Khwaja

May 17, 2017, 9:49:05 AM
to Alluxio Users
Hey Calvin,

We are trying to load a 4GB text file from S3 into Alluxio, but are running into some S3 connection pool timeout errors. I've added the error logs.

Here are some configs that we are using:

alluxio.underfs.s3a.request.timeout.ms=0
alluxio.worker.session.timeout.ms=6000000

alluxio.user.file.waitcompleted.poll.ms=1000000
alluxio.user.network.netty.timeout.ms=36000000
alluxio.user.rpc.retry.max.sleep.ms=30000
alluxio.user.rpc.retry.max.num.retry=100
alluxio.user.failed.space.request.limits=10
alluxio.user.block.size.bytes.default=128MB

Have attached the worker logs.

Thanks,
Aaquib
(Attachment: worker.log)

Aaquib Khwaja

May 17, 2017, 10:02:32 AM
to Alluxio Users
And we were trying to load the data using the Alluxio CLI. The following was the error:

alluxio.exception.FileDoesNotExistException: Ufs path s3a://bucket/path does not exist

Thanks,
Aaquib

Gene Pang

May 19, 2017, 12:16:10 PM
to Alluxio Users
Hi Aaquib,

Could you create a new topic for your question?

Thanks,
Gene