Apparently storing time series as plain text is expensive. So I want to use the millis for day and doubles for values instead. Apparently I need to give Scoobi some evidence to make a Writable out of it:
could not find implicit value for evidence parameter of type com.nicta.scoobi.Scoobi.SeqSchema[List[(Long, Double)]]
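To put a rough number on "expensive", here is a minimal sketch comparing the size of one observation as a text line against the same observation as a raw Long plus Double. The object name PointSize, the tab-separated line layout, and the sample numbers are assumptions made up for illustration; nothing here is Scoobi-specific.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object PointSize {
  // One (day, value) observation; both numbers are made-up sample data.
  val day: Long = 1341964800000L   // midnight UTC on 2012-07-11, in epoch millis
  val value: Double = 0.1234567890123

  // Plain text: "day<TAB>value<NEWLINE>", one byte per ASCII character.
  val textBytes: Int = s"$day\t$value\n".getBytes(UTF_8).length

  // Binary: a Long and a Double take 8 bytes each, whatever their magnitude.
  val binaryBytes: Int = java.lang.Long.BYTES + java.lang.Double.BYTES
}
```

The text size grows with the number of digits printed, while the binary size stays fixed at 16 bytes per point.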
// persist as Int-String Sequence file
val intString: DList[(Int, String)] = ...
persist(convertToSequenceFile(intString, "hdfs://path/to/output"))
// persist as Int-NullWritable Sequence file (the DList elements become the keys)
val keys: DList[Int] = ...
persist(convertKeyToSequenceFile(keys, "hdfs://path/to/output"))
// persist as NullWritable-Int Sequence file (the DList elements become the values)
val values: DList[Int] = ...
persist(convertValueToSequenceFile(values, "hdfs://path/to/output"))
Russell
On Jul 11, 2012, at 6:28, Alexy Khrabrov <al...@scalable.pro> wrote:
> Apparently storing time series as plain text is expensive. So I want to use the millis for day and doubles for values instead. Apparently I need to give Scoobi some evidence to make a Writable out of it:
> could not find implicit value for evidence parameter of type com.nicta.scoobi.Scoobi.SeqSchema[List[(Long, Double)]]
If you're using any of the convertXxxToSequenceFile APIs, you will need a SeqSchema that can convert your type (e.g. List[(Long, Double)]) to a Writable type. This works great if there is a Writable type on the other end you're targeting, e.g. Int -> IntWritable, String -> Text, but is no good when there isn't. There are no Hadoop standard Writable types for collections, which is why there isn't a SeqSchema[List[X]] implemented by Scoobi.
So two suggestions:
1. Create a SeqSchema[List[(Long, Double)]] type class instance. This means you'll also need to create a Writable class, e.g. ListLongDoubleWritable, and provide conversions to/from it. Not pretty, which is why Writables suck in general: you have to redo all this for each new type, e.g. List[(Long, Int)].
2. Persist your time series data to Avro files instead of Sequence files. Avro schemas are very rich and include support for data structures like "lists" and "tuples". You can simply take your DList[List[(Long, Double)]] and write it out using "toAvroFile". You can similarly read it back in using "fromAvroFile".
Unless you're particularly wedded to Sequence files for some external reason, my recommendation would be to use Avro files.
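For a sense of what the first suggestion involves, here is a minimal sketch of the serialization logic a hand-written ListLongDoubleWritable would have to carry. It deliberately uses only java.io, so it is not a real Hadoop Writable or a Scoobi SeqSchema; the object name TimeSeriesCodec and its methods are hypothetical, and the length-prefixed layout is just one reasonable choice.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInput, DataInputStream, DataOutput, DataOutputStream}

object TimeSeriesCodec {
  type Series = List[(Long, Double)]

  // Plays the role of Writable.write(DataOutput): a length prefix, then each pair.
  def write(series: Series, out: DataOutput): Unit = {
    out.writeInt(series.length)
    series.foreach { case (millis, value) =>
      out.writeLong(millis)
      out.writeDouble(value)
    }
  }

  // Plays the role of Writable.readFields(DataInput): read the length, then that many pairs.
  def read(in: DataInput): Series = {
    val n = in.readInt()
    List.fill(n)((in.readLong(), in.readDouble()))
  }

  // Convenience helpers for round-tripping through a byte array.
  def toBytes(series: Series): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    write(series, new DataOutputStream(buffer))
    buffer.toByteArray
  }

  def fromBytes(bytes: Array[Byte]): Series =
    read(new DataInputStream(new ByteArrayInputStream(bytes)))
}
```

All of this would need to be repeated for List[(Long, Int)] and every other element type, which is the maintenance cost the second suggestion avoids.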
How are you doing the Avro compression? It should get pretty small. I have a small example (using Pig rather than Scoobi but the compression parts are the same). Without specifying anything the Avro output size looks like this:
-rw-r--r-- 3 csevers gid-csevers 4821995 2012-07-03 17:51 /user/csevers/testavro2/part-m-00000.avro
If I add the following in Pig (I know there are regular Hadoop equivalents):
SET avro.mapred.deflate.level 6;
SET mapred.output.compress true;
for the same input data I get this:
-rw-r--r-- 3 csevers gid-csevers 1468737 2012-07-03 17:48 /user/csevers/testavro/part-m-00000.avro
I don't know if this is possible right now in Scoobi. I think the Avro support in general needs to be slightly modified to be more generic.
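The two listings above work out to roughly 30% of the uncompressed size (1468737 / 4821995 ≈ 0.30). To experiment with the effect of a Deflate level outside Hadoop, a minimal sketch using only java.util.zip might look like the following. The object name DeflateDemo and the synthetic series are made up for illustration, and the ratio you get depends entirely on your data, so it will not match the Pig numbers.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.{Deflater, DeflaterOutputStream}

object DeflateDemo {
  // Compress a byte array with the given Deflate level (0 to 9).
  def deflate(bytes: Array[Byte], level: Int): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val deflater = new Deflater(level)
    val out = new DeflaterOutputStream(buffer, deflater)
    try out.write(bytes) finally out.close()
    deflater.end() // release the native resources held by the Deflater
    buffer.toByteArray
  }

  // Synthetic daily series as text lines (made-up data, purely illustrative).
  val sample: Array[Byte] =
    (0 until 1000)
      .map(i => s"${1341964800000L + i * 86400000L}\t${i % 7}.5\n")
      .mkString
      .getBytes("UTF-8")

  // Level 6 matches the avro.mapred.deflate.level value used in the Pig example.
  val compressed: Array[Byte] = deflate(sample, 6)
}
```

Comparing DeflateDemo.compressed.length with DeflateDemo.sample.length for a few different levels gives a feel for the trade-off before touching the job configuration.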
On Wednesday, July 11, 2012 11:12:22 PM UTC-7, Alexy Khrabrov wrote:
> Indeed, Avro is a fantastically simple way to go, and achieves about 50% > compression (less than I expected, but still good).
> A+
I think we can get at the Hadoop Configuration object via ScoobiConfiguration. To enable compression you can set mapred.output.compress to true (or use -D mapred.output.compress=true when running the job). To set the compression codec there is an AvroJob.setOutputCodec() function (http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/A...). The AvroJob functions are normally just convenience methods though. If you look at the source, it is likely just modifying the Configuration in a few steps, which would be possible to do manually.
Ideally we would build these into the Scoobi Avro methods. I had some initial work done on this and chatted with Ben about it at the Hadoop Summit. When I have a little time I'll try to pick it up again and get in touch with the right person at NICTA.
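Doing it manually might look something like the sketch below. It assumes you already have a Hadoop Configuration in hand (how ScoobiConfiguration exposes it is not shown and would need checking), that hadoop is on the classpath, and that "avro.output.codec" is the key AvroJob.setOutputCodec() writes; please verify that key against the Avro source for your version. The object and method names are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration

object AvroCompressionSettings {
  // Applies the compression properties discussed in this thread to a Hadoop Configuration.
  def enableDeflate(conf: Configuration, level: Int = 6): Configuration = {
    conf.setBoolean("mapred.output.compress", true) // turn output compression on
    conf.set("avro.output.codec", "deflate")        // ASSUMED key behind AvroJob.setOutputCodec
    conf.setInt("avro.mapred.deflate.level", level) // same key as in the Pig SET example
    conf
  }
}
```

The same three properties could presumably also be passed as -D options on the command line, as with mapred.output.compress above.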
On Thursday, July 12, 2012 7:29:14 PM UTC-7, Alexy Khrabrov wrote:
> I don't specify anything, and am in fact very interested to find out how > are we supposed to tweak Scoobi job configuration for such purposes?
> A+