ERROR : java.net.SocketTimeoutException: Read timed out


Mohammed Azad

Feb 22, 2016, 4:53:40 AM
to Tachyon Users
Hi Guys,

I'm pretty new to Tachyon and Spark. I am trying to write partitioned data into Parquet files using Spark on Tachyon: Spark 1.6.0 / Tachyon 0.8.2 on Hadoop 2.7.1. The following happens as soon as all the writes are done: the Spark job tries to delete the _temporary folder and then starts hitting read timeouts. This causes my job to fail, even though I verified that all records were written without issues. I tried increasing the Spark executor memory, but I am still hitting the same error consistently. I am using CACHE_THROUGH writes directly to HDFS. Any help in getting this resolved is really appreciated.

>> code snippet

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("parquet.enable.summary-metadata", "false")
hadoopConf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
hadoopConf.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
hadoopConf.set("parquet.metadata.read.parallelism", "15")
hadoopConf.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
.......
rtrk_viewership.write
  .partitionBy("market_code", "program_start_date")
  .mode(SaveMode.Append)
  .parquet("tachyon://ip-10-1-83-211.ec2.internal:19998/QH/staging_rtrk_viewership")


<<
Exception details are as follows:

-46ef-9ab2-acbd3b27ee29.gz.parquet): HDFS Path: hdfs://ip-10-1-83-211.ec2.internal:9000/QH/staging_rtrk_viewership/market_code=533/program_start_date=2015-12-05/part-r-00674-eb0e73e9-e64f-46ef-9ab2-acbd3b27ee29.gz.parquet TPath: tachyon://ip-10-1-83-211.ec2.internal:19998/QH/staging_rtrk_viewership/market_code=533/program_start_date=2015-12-05/part-r-00674-eb0e73e9-e64f-46ef-9ab2-acbd3b27ee29.gz.parquet
16/02/22 09:32:07 INFO : File does not exist: tachyon://ip-10-1-83-211.ec2.internal:19998/QH/staging_rtrk_viewership/market_code=533/program_start_date=2015-12-05/part-r-00674-eb0e73e9-e64f-46ef-9ab2-acbd3b27ee29.gz.parquet
16/02/22 09:32:07 INFO : rename(tachyon://ip-10-1-83-211.ec2.internal:19998/QH/staging_rtrk_viewership/_temporary/0/task_201602220835_0000_m_000674/market_code=533/program_start_date=2015-12-05/part-r-00674-eb0e73e9-e64f-46ef-9ab2-acbd3b27ee29.gz.parquet, tachyon://ip-10-1-83-211.ec2.internal:19998/QH/staging_rtrk_viewership/market_code=533/program_start_date=2015-12-05/part-r-00674-eb0e73e9-e64f-46ef-9ab2-acbd3b27ee29.gz.parquet)
16/02/22 09:32:07 INFO : delete(tachyon://ip-10-1-83-211.ec2.internal:19998/QH/staging_rtrk_viewership/_temporary, true)
16/02/22 09:32:37 ERROR : java.net.SocketTimeoutException: Read timed out
tachyon.org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
        at tachyon.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
        at tachyon.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at tachyon.org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
        at tachyon.org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
        at tachyon.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at tachyon.org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at tachyon.org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at tachyon.org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at tachyon.org.apache.thrift.protocol.TProtocolDecorator.readMessageBegin(TProtocolDecorator.java:135)
        at tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at tachyon.thrift.FileSystemMasterService$Client.recv_deleteFile(FileSystemMasterService.java:265)
        at tachyon.thrift.FileSystemMasterService$Client.deleteFile(FileSystemMasterService.java:251)
        at tachyon.client.FileSystemMasterClient.deleteFile(FileSystemMasterClient.java:289)
        at tachyon.client.TachyonFS.delete(TachyonFS.java:377)
        at tachyon.client.AbstractTachyonFS.delete(AbstractTachyonFS.java:109)
        at tachyon.client.TachyonFS.delete(TachyonFS.java:66)
        at tachyon.hadoop.AbstractTFS.delete(AbstractTFS.java:199)
        at tachyon.hadoop.TFS.delete(TFS.java:27)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:381)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:314)
        at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
        at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329)
        at RentrakQHLoadStaging$.main(RentrakQHLoadStaging.scala:213)
        at RentrakQHLoadStaging.main(RentrakQHLoadStaging.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at tachyon.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        ... 50 more
16/02/22 09:32:37 INFO : Tachyon client (version 0.8.2) is trying to connect with FileSystemMaster master @ ip-10-1-83-211.ec2.internal/10.1.83.211:19998
16/02/22 09:32:37 INFO : Client registered with FileSystemMaster master @ ip-10-1-83-211.ec2.internal/10.1.83.211:19998
16/02/22 09:33:07 ERROR : java.net.SocketTimeoutException: Read timed out
tachyon.org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out

Thanks
Azad

Gene Pang

Feb 23, 2016, 11:12:28 AM
to Alluxio Users, alluxi...@googlegroups.com
Hi Azad,

How many workers are you running? Could you look at the worker logs to see if there is any relevant information? Could you also look at the master logs?

Thanks,
Gene

Mohammed Azad

Feb 23, 2016, 11:41:46 PM
to Alluxio Users, tachyo...@googlegroups.com
Hi Gene,

I am running 5 workers, and the worker logs show the same read-timeout exception when they try to connect to the master.

I was able to replicate this issue separately. Try deleting a multi-level directory containing a large number of files (say 4000+) in one shell, e.g. using tachyon tfs rmr /somepath, and then try inserting/writing data into Tachyon from another shell: you will start hitting read timeouts. It looks like the Tachyon master thread is busy with the deletes and is unable to process the second request.

While these large deletes are in progress, even the "Browse file system" menu item in the Tachyon web UI gets stuck. Once the deletes are done, everything goes back to normal; however, by then all jobs will have failed with read timeouts to the master.
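The repro above can be sketched as two concurrent shells. This is a minimal sketch assuming a running Tachyon 0.8.2 cluster with the tachyon CLI on the PATH; the local staging directory, Tachyon paths, and file count are illustrative, not taken from the actual job:

```shell
# Stage ~4000 small files locally and load them into Tachyon (illustrative paths).
mkdir -p /tmp/manyfiles
for i in $(seq 1 4000); do echo "x" > "/tmp/manyfiles/file$i"; done
tachyon tfs copyFromLocal /tmp/manyfiles /somepath

# Shell 1: recursively delete the directory; the master stays busy for minutes.
tachyon tfs rmr /somepath

# Shell 2, started while the rmr above is still running: this write (or any
# Spark job writing to tachyon://...) starts failing with read timeouts.
tachyon tfs copyFromLocal /tmp/manyfiles/file1 /probe
```

Once the rmr completes, the write in shell 2 succeeds again, matching the behavior described above.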

Thanks
Azad 

Gene Pang

Feb 24, 2016, 9:52:35 AM
to Alluxio Users
Hi Azad,

Thanks for the info. Do you know how long the delete of the many files takes by itself?

Unfortunately, that timeout parameter only became configurable in the latest version, 1.0.0. Would you be able to try 1.0.0?

Thanks,
Gene

Haoyuan Li

Feb 25, 2016, 12:30:57 PM
to Gene Pang, Alluxio Users


Mohammed Azad

Feb 26, 2016, 7:02:51 AM
to Alluxio Users
Hi Gene,

That delete took around 3.5 minutes to complete. Yes, I will give version 1.0.0 a try.

Thanks
Azad

Mohammed Azad

Feb 26, 2016, 7:07:30 AM
to Alluxio Users, gene...@gmail.com
Thanks. However, can you please point me to the right timeout parameter to configure for this?


Thanks
Azad

Gene Pang

Feb 26, 2016, 10:50:53 AM
to Alluxio Users, gene...@gmail.com
I'm not exactly sure where the timeout is happening, but I think it is on the client-to-master socket connection. If so, the relevant parameter would be: alluxio.security.authentication.socket.timeout.ms
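If it is indeed the client-to-master connection, one way to try raising it from a Spark job is to pass the property as a JVM system property to the driver and executors. This is an illustrative sketch, assuming Alluxio 1.0.0 (where the parameter became configurable), assuming Alluxio reads the property from JVM system properties, and with a 10-minute value guessed to cover the ~3.5-minute delete; the jar and class names are just placeholders from this thread:

```shell
# Illustrative: raise the Alluxio client socket timeout (in ms) via spark-submit.
# Property name as suggested above; the 600000 ms value is a guess, not a tested setting.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dalluxio.security.authentication.socket.timeout.ms=600000" \
  --conf "spark.executor.extraJavaOptions=-Dalluxio.security.authentication.socket.timeout.ms=600000" \
  --class RentrakQHLoadStaging your-job.jar
```

Setting the same property in alluxio-site.properties on the client classpath should be an equivalent alternative.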

Thanks,
Gene

Gene Pang

Mar 7, 2016, 10:31:59 AM
to Alluxio Users, gene...@gmail.com
Did the parameter help with your issue?

Thanks,
Gene

Gene Pang

Mar 21, 2016, 10:23:08 AM
to Alluxio Users, gene...@gmail.com
Hi Azad,

Did the parameter help your timeout issue?

Thanks,
Gene