A sequence file with the value type of List[(Long, Double)]

Alexy Khrabrov

Jul 10, 2012, 4:28:55 PM

to scoobi...@googlegroups.com

Storing time series as plain text turns out to be expensive, so I want to store the day as millis (Long) and the values as Doubles instead. Apparently I need to give Scoobi some evidence to make a Writable out of it:

could not find implicit value for evidence parameter of type com.nicta.scoobi.Scoobi.SeqSchema[List[(Long, Double)]]

-- what's the approach?

A+

Russell Aronson

Jul 10, 2012, 6:52:26 PM

to scoobi...@googlegroups.com, scoobi...@googlegroups.com

Hi Alex,

You either need to use the LongWritable and DoubleWritable types, or you can use the conversion API convertToSequenceFile. From http://nicta.github.com/scoobi/guide/Input%20and%20Output.html

// persist as Int-String Sequence file
val intString: DList[(Int, String)] = ...
persist(convertToSequenceFile(intString, "hdfs://path/to/output"))

// persist as Int-NullWritable Sequence file
val intString: DList[(Int, String)] = ...
persist(convertKeyToSequenceFile(intString, "hdfs://path/to/output"))

// persist as NullWritable-String Sequence file
val intString: DList[(Int, String)] = ...
persist(convertValueToSequenceFile(intString, "hdfs://path/to/output"))

Russell

Ben Lever

Jul 10, 2012, 7:40:37 PM

to scoobi...@googlegroups.com

Hi Alexy,

If you're using any of the convertXxxToSequenceFile APIs, you will need a SeqSchema that can convert your type (e.g. List[(Long, Double)]) to a Writable type. This works great if there is a Writable type on the other end that you're targeting, e.g. Int -> IntWritable, String -> Text, but is no good when there isn't. There are no standard Hadoop Writable types for collections, which is why Scoobi doesn't implement a SeqSchema[List[X]].

So two suggestions:

1. Create a SeqSchema[List[(Long, Double)]] type class instance. This means you'll also need to create a Writable class, e.g. ListLongDoubleWritable, and provide conversions to and from it. Not pretty, which is why Writables suck in general: you have to redo all of this for every new type, e.g. List[(Long, Int)].

2. Persist your time series data to Avro files instead of Sequence files. Avro schemas are very rich and include support for data structures like "lists" and "tuples". You can simply take your DList[List[(Long, Double)]] and write it out using "toAvroFile". You can similarly read it back in using "fromAvroFile".
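To give a feel for the work involved in the first suggestion, here is a minimal sketch of what a hand-written ListLongDoubleWritable might look like. It is an illustration under assumptions, not Scoobi code: the class only mirrors the two methods of Hadoop's Writable contract (write and readFields) on top of java.io so that it runs without Hadoop on the classpath, and the roundTrip helper is a hypothetical name added for the demo. In a real job the class would extend org.apache.hadoop.io.Writable, and a SeqSchema instance would still be needed to map List[(Long, Double)] to and from it.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInput, DataInputStream, DataOutput, DataOutputStream}

// Hypothetical class, named after the suggestion above. It does NOT extend
// org.apache.hadoop.io.Writable here (to stay dependency-free), but it has
// the same two methods that interface declares.
class ListLongDoubleWritable(var values: List[(Long, Double)]) {
  // Writables need a no-argument constructor so they can be created empty
  // and then filled in by readFields.
  def this() = this(Nil)

  // Layout: a 4-byte element count, then (8-byte Long, 8-byte Double) pairs.
  def write(out: DataOutput): Unit = {
    out.writeInt(values.size)
    values.foreach { case (millis, value) =>
      out.writeLong(millis)
      out.writeDouble(value)
    }
  }

  // Reads back exactly what write produced, in the same order.
  def readFields(in: DataInput): Unit = {
    val n = in.readInt()
    values = List.fill(n)((in.readLong(), in.readDouble()))
  }
}

object ListLongDoubleWritableDemo {
  // Hypothetical helper: serialize a series to bytes and read it back.
  def roundTrip(series: List[(Long, Double)]): List[(Long, Double)] = {
    val buffer = new ByteArrayOutputStream()
    new ListLongDoubleWritable(series).write(new DataOutputStream(buffer))
    val copy = new ListLongDoubleWritable()
    copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray)))
    copy.values
  }
}
```

With this layout each point costs 16 bytes plus a 4-byte length prefix per list. All of it would have to be repeated for, say, List[(Long, Int)], which is the argument for the Avro route.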

Unless you're particularly wedded to Sequence files for some external reason, my recommendation would be to use Avro files.

Hope that helps.

Cheers,

Ben.

Alexy Khrabrov

Jul 12, 2012, 2:12:22 AM

to scoobi...@googlegroups.com

Indeed, Avro is a fantastically simple way to go, and achieves about 50% compression (less than I expected, but still good).

A+

Christopher Severs

Jul 12, 2012, 1:41:37 PM

to scoobi...@googlegroups.com

Hi Alexy,

How are you doing the Avro compression? It should get pretty small. I have a small example (using Pig rather than Scoobi but the compression parts are the same). Without specifying anything the Avro output size looks like this:
-rw-r--r-- 3 csevers gid-csevers 4821995 2012-07-03 17:51 /user/csevers/testavro2/part-m-00000.avro

If I add the following in Pig (I know there are regular Hadoop equivalents):

SET avro.mapred.deflate.level 6;
SET mapred.output.compress true;

then for the same input data I get this:
-rw-r--r-- 3 csevers gid-csevers 1468737 2012-07-03 17:48 /user/csevers/testavro/part-m-00000.avro

I don't know if this is possible right now in Scoobi. I think the Avro support in general needs to be slightly modified to be more generic.

Regards,
Chris

Alexy Khrabrov

Jul 12, 2012, 10:29:14 PM

to scoobi...@googlegroups.com

I don't specify anything, and I am in fact very interested to find out: how are we supposed to tweak the Scoobi job configuration for such purposes?

A+

Christopher Severs

Jul 13, 2012, 1:53:05 PM

to scoobi...@googlegroups.com

I think we can get at the Hadoop Configuration object via ScoobiConfiguration. To turn compression on, you can set mapred.output.compress to true (or pass -D mapred.output.compress=true when running the job). To set the compression codec there is an AvroJob.setOutputCodec() function (http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/AvroJob.html#setOutputCodec%28org.apache.hadoop.mapred.JobConf,%20java.lang.String%29). The AvroJob functions are normally just convenience methods, though. If you look at the source, it is likely just modifying the Configuration in a few steps, which would be possible to do manually.
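As a rough sketch of the -D route, the snippet below turns the two properties quoted earlier in this thread into command-line arguments. The object and method names are hypothetical, and it is an assumption (not verified here) that a given Scoobi job forwards -D options to the Hadoop Configuration.

```scala
object CompressionFlags {
  // Property names and values as quoted in this thread (the Pig SET lines).
  val settings: Seq[(String, String)] = Seq(
    "mapred.output.compress"    -> "true",
    "avro.mapred.deflate.level" -> "6"
  )

  // Hypothetical helper: render each property as a "-D key=value" pair,
  // the form Hadoop's generic option parsing accepts.
  def toHadoopArgs(props: Seq[(String, String)]): Seq[String] =
    props.flatMap { case (key, value) => Seq("-D", key + "=" + value) }
}
```

The resulting sequence can be prepended to the job's own arguments. The same key/value pairs could instead be set directly on the Configuration object, if it is reachable as described above.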

Ideally we would build these into the Scoobi Avro methods. I had some initial work done on this and chatted with Ben about it at the Hadoop Summit. When I have a little time I'll try to pick it up again and get in touch with the right person at NICTA.

Regards,
Chris
