InvalidPathException while writing parquet files with spaces from Spark


Jais Sebastian

Aug 14, 2017, 7:56:04 PM
to Alluxio Users
Hello,

While writing Parquet files with column partitioning enabled (specifically, when the partition column's values contain spaces or special characters), Alluxio throws an InvalidPathException:

Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: alluxio.exception.InvalidPathException: Path /dev/default/parquet/query_result_p/_temporary/0/_temporary/attempt_20170814030826_0046_m_000000_3/Director=Adam McKay/part-00000-48384136-22a0-48bb-8217-f47e0d0d1d24.c000.snappy.parquet is invalid
at alluxio.hadoop.AbstractFileSystem.create(AbstractFileSystem.java:179)
at alluxio.hadoop.FileSystem.create(FileSystem.java:25)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:241)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.org$apache$spark$sql$execution$datasources$FileFormatWriter$DynamicPartitionWriteTask$$newOutputWriter(FileFormatWriter.scala:418)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:451)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:440)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator.foreach(AbstractScalaRowIterator.scala:26)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:440)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
... 8 common frames omitted


We are using Spark 2.2.0 and Alluxio 1.5.1, with Hadoop 2.7 as the under storage (UnderFS).

Regards,
Jais

Gene Pang

Aug 16, 2017, 11:35:50 AM
to Alluxio Users
Hi Jais,

Does this fail with any amount of data? Is there a small example which can reproduce the issue?

Thanks,
Gene

Jais Sebastian

Aug 17, 2017, 9:43:16 AM
to Alluxio Users
Hi Gene,
The scenario is very simple: load the attached file as a Spark Dataset, then write it back partitioned by the "Title" column.

Dataset<Row> table = sparkSession.read().parquet(<movies file path>);
table.write()
     .partitionBy("Title")
     .option("compression", "snappy")  // note: "parquet" is not a valid codec name; the trace shows snappy
     .parquet(getAlluxioUrlPath());


This works if we partition by the "Nominated" column instead.

Environment :
Alluxio 1.5.1 running in cluster mode with short-circuit reads enabled.
UnderFs: HDFS 
Spark 2.2.0

Regards,
Jais
Attachment: m2.csv

bin...@alluxio.com

Aug 22, 2017, 7:06:21 PM
to Alluxio Users
Hi Jais,

It looks like the file name contains a space, and Alluxio currently does not support spaces in file names. Can you partition by a column whose values do not contain spaces?

Thanks,
Bin
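[Editor's note: one way to follow Bin's suggestion without dropping the "Title" data is to derive a partition-safe column before calling partitionBy. The helper below is only a sketch; the class and method names are hypothetical, not part of Spark or Alluxio. It maps every character outside a conservative whitelist to an underscore, so a value such as "Adam McKay" produces a directory segment without spaces.]

```java
// Hypothetical helper for sanitizing partition values; not a Spark/Alluxio API.
public class PartitionSanitizer {

    // Replace anything outside [A-Za-z0-9._-] with '_' so the derived
    // partition directory (e.g. Title_safe=Adam_McKay) contains no spaces
    // or special characters that Alluxio rejects.
    public static String sanitize(String value) {
        if (value == null) {
            // Hive's conventional placeholder for null partition values.
            return "__HIVE_DEFAULT_PARTITION__";
        }
        return value.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("Adam McKay")); // prints Adam_McKay
    }
}
```

In Spark the same substitution could be applied with something like `table.withColumn("Title_safe", regexp_replace(col("Title"), "[^A-Za-z0-9._-]", "_"))` and then `partitionBy("Title_safe")`; the original Title column is preserved in the data, so only the directory names change.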

Avinash Meda

Nov 1, 2017, 11:54:54 PM
to Alluxio Users
Hi Bin,
What's the plan to support this?

Thanks,
Avinash

bin...@alluxio.com

Nov 3, 2017, 4:49:50 PM
to Alluxio Users
Hi Avinash,

I don't think there is a plan for that yet. Feel free to create a JIRA ticket for this issue or, even better, submit a fix to the project.

Thanks,
Bin