Does scalding-parquet library support reading in snappy compressed Parquet files?


ybro...@ebay.com

Apr 24, 2019, 8:05:10 PM
to Scalding Development
Does scalding-parquet library support reading in snappy compressed Parquet files?

I am trying to read in Parquet files with this schema:
> hadoop jar parquet-tools-1.10.1.jar schema /my/path/part-00000.snappy.parquet
message spark_schema {
  optional fixed_len_byte_array(8) fieldName1 (DECIMAL(18,0));
  optional fixed_len_byte_array(2) fieldName2 (DECIMAL(4,0));
  optional binary fieldName3 (UTF8);
}

I am using the following code:
val fields = new Fields("fieldName1", "fieldName2", "fieldName3")
ParquetTupleSource(fields, inputPath)
  .read
  .write(Tsv(outputPath))

The fieldName3 column produces normal output that matches the input strings; however, the fieldName1 and fieldName2 columns produce garbage output. Does the scalding-parquet library support snappy-compressed Parquet files? Does it support reading the fixed_len_byte_array type, and if so, how do I specify this in the TypedParquet setting?

Thank you for your help!
Best, 
Yuri

Oscar Boykin

Apr 25, 2019, 12:44:03 PM
to ybro...@ebay.com, Scalding Development
Compression is generally handled by Hadoop directly, and all input formats can use it. Compression options in Hadoop are generally configured through the Configuration (string key-value pairs).
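For example (a minimal sketch; parquet.compression is the standard Parquet output codec key, though exactly where you set it depends on how you launch the job):

import org.apache.hadoop.conf.Configuration

// Sketch: set the Parquet output codec on the Hadoop Configuration.
// The same key can also be passed on the command line as -Dparquet.compression=SNAPPY.
val conf = new Configuration()
conf.set("parquet.compression", "SNAPPY")

Reading snappy-compressed files shouldn't need any of this, though; the codec is recorded in the file and picked up automatically.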

Did you try to run with snappy and hit an error?

We use snappy with Parquet at Stripe.


ybro...@ebay.com

Apr 25, 2019, 1:07:20 PM
to Scalding Development
@Oscar
I am able to read in the data, but the fixed_len_byte_array / DECIMAL type fields produce garbage results, so I was wondering if it had to do with the snappy compression. The binary / UTF8 fields read correctly.

Is there an example of how to read in the fixed_len_byte_array / DECIMAL type field?
Thank you

Alex Levenson

Apr 25, 2019, 4:50:23 PM
to ybro...@ebay.com, Scalding Development
Parquet handles data encoding / compression a little differently from most formats: rather than writing a file and then compressing the entire file's bytes, it compresses parts of the file in separate chunks (each page is compressed separately). I think you might get a better answer to this question on the Parquet mailing list. I don't know what would cause this off the top of my head, though.
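If you want to check which codec each column chunk was actually written with, the file footer records it. A sketch using the parquet-hadoop API (the path here is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

// Read the footer metadata; every column chunk records the codec it was written with.
val input  = HadoopInputFile.fromPath(new Path("/my/path/part-00000.snappy.parquet"), new Configuration())
val reader = ParquetFileReader.open(input)
try {
  for {
    block  <- reader.getFooter.getBlocks.asScala
    column <- block.getColumns.asScala
  } println(s"${column.getPath}: codec=${column.getCodec}")
} finally {
  reader.close()
}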
It'd help to see all the settings you used, and specifically how the data appears to be corrupt / garbage.
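Also, if the "garbage" values are just the raw fixed_len_byte_array bytes, note that Parquet stores a DECIMAL backed by fixed_len_byte_array as its unscaled value in big-endian two's-complement, so they can be decoded by hand. A minimal sketch (decodeDecimal is a hypothetical helper, not a scalding-parquet API):

import java.math.{BigDecimal, BigInteger}

// Parquet's DECIMAL-over-fixed_len_byte_array encoding: the bytes are the
// unscaled value in big-endian two's-complement; the scale comes from the schema.
def decodeDecimal(bytes: Array[Byte], scale: Int): BigDecimal =
  new BigDecimal(new BigInteger(bytes), scale)

// For fieldName1, declared DECIMAL(18, 0) over fixed_len_byte_array(8):
//   decodeDecimal(rawBytes, 0)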



--
Alex Levenson
@THISWILLWORK