The cascading loganalysis example seems to produce data sorted on the timestamp, at least for small input files and a single shard of output. There seems to be some implicit typing going on under the covers so that the timestamps are treated as longs and sorted as such; where does this coercion happen?
Is there a way to disable the sorting by timestamp, so that the loganalysis benchmark is unconstrained as to the output order, and thus presumably make it run faster? This is to say, treat the log data as a set of events, rather than a sequence of events.
> Robert Henry
> --~--~---------~--~----~------------~-------~--~----~
> You received this message because you are subscribed to the Google
> Groups "cascading-user" group.
> To post to this group, send email to cascading-user@googlegroups.com
> To unsubscribe from this group, send email to cascading-user+unsubscribe@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en
> -~----------~----~----~----~------~----~------~--~---
How is the type of the key values determined? The key values must be
Longs or Dates, somehow, to be sorted correctly. Which of the
pipeline builders knows that it will be dealing with Longs or Dates?
Is there some magic involved with the use of the DateParser object, or
the field named "ts"?
Thanks.
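
To make the concern about key types concrete, here is a minimal sketch in plain Java. It does not use the Cascading API and does not show where Cascading performs any coercion, which this thread leaves open. The class and method names are hypothetical. It only shows that the same timestamps sort differently depending on whether they are compared as text or as longs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration, not part of Cascading or the loganalysis example.
public class KeyTypeSketch {

    // Sort the timestamps as text: comparison goes character by character.
    static List<String> sortAsStrings(List<String> raw) {
        List<String> out = new ArrayList<>(raw);
        Collections.sort(out);
        return out;
    }

    // Parse each timestamp to a long first, then sort numerically.
    static List<Long> sortAsLongs(List<String> raw) {
        List<Long> out = new ArrayList<>();
        for (String s : raw) {
            out.add(Long.parseLong(s));
        }
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("9000", "10000", "950");
        System.out.println(sortAsStrings(raw)); // [10000, 9000, 950]
        System.out.println(sortAsLongs(raw));   // [950, 9000, 10000]
    }
}
```

Output in true time order therefore suggests that the keys are compared as numbers (or dates) at sort time, which is the behaviour being asked about.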
On Tue, Nov 10, 2009 at 9:07 AM, Chris K Wensel <ch...@wensel.net> wrote:
> In MapReduce, sorting is done to support grouping on key values,
> so the results are sorted on the fields that are grouped on.
> In this example, we are grouping on timestamps (minute and second
> intervals) in order to get the metrics for each.
> cheers,
> chris
> On Nov 9, 2009, at 4:49 PM, Robert Henry wrote:
>> The cascading loganalysis example seems to produce data sorted on the
>> timestamp, at least for small input files and a single shard of
>> output. There seems to be some implicit typing going on under the
>> covers so that the timestamps are treated as longs and sorted as such;
>> where does this coercion happen?
>> Is there a way to disable the sorting by timestamp, so that the
>> loganalysis benchmark is unconstrained as to the output order, and
>> thus presumably make it run faster? This is to say, treat the log
>> data as a set of events, rather than a sequence of events.
>> Robert Henry
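
Chris's point, that the sorting exists to support grouping, can be sketched in plain Java. This is a toy stand-in for the idea, not Hadoop or Cascading code: the class name, the method names, and the one-minute bucket size are assumptions made for illustration. Keys are sorted so that equal keys become adjacent, and one linear pass then forms the groups; the ordered output is a side effect of that step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of sort-based grouping; not Cascading code.
public class SortGroupSketch {
    static final long MINUTE_MS = 60_000L;

    // Round a millisecond timestamp down to the start of its minute.
    static long minuteBucket(long tsMillis) {
        return tsMillis - (tsMillis % MINUTE_MS);
    }

    // Count events per minute by sorting the keys and scanning adjacent runs.
    // Returns {bucket, count} pairs in ascending bucket order.
    static List<long[]> countPerMinute(List<Long> timestamps) {
        List<Long> keys = new ArrayList<>();
        for (long ts : timestamps) {
            keys.add(minuteBucket(ts));
        }
        Collections.sort(keys); // stand-in for the sort that precedes grouping

        List<long[]> groups = new ArrayList<>();
        for (long key : keys) {
            long[] last = groups.isEmpty() ? null : groups.get(groups.size() - 1);
            if (last != null && last[0] == key) {
                last[1]++;                        // same key as previous: extend the group
            } else {
                groups.add(new long[] {key, 1});  // new key: start a new group
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(125_000L, 5_000L, 61_000L, 119_000L, 59_999L);
        for (long[] g : countPerMinute(events)) {
            System.out.println(g[0] + " -> " + g[1]);
        }
        // prints:
        // 0 -> 2
        // 60000 -> 2
        // 120000 -> 1
    }
}
```

The input events arrive in no particular order, yet the per-minute counts come out sorted by bucket because grouping was implemented by sorting.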