InvalidPathException while writing parquet files with spaces from Spark


Jais Sebastian

Aug 14, 2017, 7:56:04 PM
to Alluxio Users
Hello,

While writing Parquet files with column partitioning enabled (specifically, when the partition column's values contain spaces or special characters), Alluxio throws an InvalidPathException:

Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: alluxio.exception.InvalidPathException: Path /dev/default/parquet/query_result_p/_temporary/0/_temporary/attempt_20170814030826_0046_m_000000_3/Director=Adam McKay/part-00000-48384136-22a0-48bb-8217-f47e0d0d1d24.c000.snappy.parquet is invalid
at alluxio.hadoop.AbstractFileSystem.create(AbstractFileSystem.java:179)
at alluxio.hadoop.FileSystem.create(FileSystem.java:25)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:241)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.org$apache$spark$sql$execution$datasources$FileFormatWriter$DynamicPartitionWriteTask$$newOutputWriter(FileFormatWriter.scala:418)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:451)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:440)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator.foreach(AbstractScalaRowIterator.scala:26)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:440)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
... 8 common frames omitted


We are using Spark 2.2.0 and Alluxio 1.5.1, with Hadoop 2.7 as the under storage (UnderFS).

Regards,
Jais

Gene Pang

Aug 16, 2017, 11:35:50 AM
to Alluxio Users
Hi Jais,

Does this fail with any amount of data? Is there a small example which can reproduce the issue?

Thanks,
Gene

Jais Sebastian

Aug 17, 2017, 9:43:16 AM
to Alluxio Users
Hi Gene,
The scenario is very simple: load the attached file as a Spark Dataset, then write it back partitioned by the "Title" column.

Dataset<Row> table = sparkSession.read().parquet(<movies file path>);
table.write()
     .partitionBy("Title")
     .option("compression", "snappy")  // note: "parquet" is not a valid codec name; the trace shows snappy
     .parquet(getAlluxioUrlPath());


This works if we partition by the "Nominated" column instead.

Environment :
Alluxio 1.5.1 running in cluster mode with short-circuit reads enabled.
UnderFs: HDFS 
Spark 2.2.0

Regards,
Jais
Attachment: m2.csv

bin...@alluxio.com

Aug 22, 2017, 7:06:21 PM
to Alluxio Users
Hi Jais,

It looks like the file name contains a space, and Alluxio currently does not support spaces in file names. Can you partition by a column whose values do not contain spaces?

Thanks,
Bin
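[Editor's note: one way to follow Bin's suggestion without dropping the "Title" data is to derive a partition-safe column before calling partitionBy. The helper below is only a sketch; the class and method names are hypothetical, not part of Spark or Alluxio. It maps every character outside a conservative whitelist to an underscore, so a value such as "Adam McKay" produces a directory segment without spaces.]

```java
// Hypothetical helper for sanitizing partition values; not a Spark/Alluxio API.
public class PartitionSanitizer {

    // Replace anything outside [A-Za-z0-9._-] with '_' so the derived
    // partition directory (e.g. Title_safe=Adam_McKay) contains no spaces
    // or special characters that Alluxio rejects.
    public static String sanitize(String value) {
        if (value == null) {
            // Hive's conventional placeholder for null partition values.
            return "__HIVE_DEFAULT_PARTITION__";
        }
        return value.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("Adam McKay")); // prints Adam_McKay
    }
}
```

In Spark the same substitution could be applied with something like `table.withColumn("Title_safe", regexp_replace(col("Title"), "[^A-Za-z0-9._-]", "_"))` and then `partitionBy("Title_safe")`; the original Title column is preserved in the data, so only the directory names change.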

Avinash Meda

Nov 1, 2017, 11:54:54 PM
to Alluxio Users
Hi Bin,
What's the plan to support this?

Thanks,
Avinash

bin...@alluxio.com

Nov 3, 2017, 4:49:50 PM
to Alluxio Users
Hi Avinash,

I don't think there is a plan for that yet. Feel free to create a JIRA ticket for this issue or, even better, submit a fix to the project.

Thanks,
Bin