Mapreduce performance
From: Miroslav Urbanek <miroslav.urba...@gmail.com>
Date: Thu, 21 Jun 2012 18:19:39 +0200
Subject: Mapreduce performance
Dear Riak users,

We are evaluating Riak for storing data similar to the dataset in the
tutorial example ("goog.csv" at
http://wiki.basho.com/Loading-Data-and-Running-MapReduce-Queries.html
). However, MapReduce seems to be very slow. We generated 200k lines
in exactly the goog.csv format and loaded them into Riak. Using the
MapReduce queries from the example, we are unable to get any results:
every query returns a "timeout" error.

A sample query:
{
  "inputs": "goog",
  "query": [
    {"map": {"language": "javascript",
             "source": "function(value, keyData, arg) { var data = Riak.mapValuesJson(value)[0]; return [data.High]; }"}},
    {"reduce": {"language": "javascript", "name": "Riak.reduceMax", "keep": true}}
  ]
}

For comparison, the following one-liner takes under a second:
$ awk 'NR==1 {next} max=="" || $3 > max {max=$3} END {print max}' \
    FS=',' goog-200k.csv
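
To make explicit what both the query and the one-liner are meant to compute, here is a minimal standalone sketch of the same map and reduce steps in plain JavaScript, run outside Riak. The sample rows are made up, and the helpers parseRows, mapHigh and reduceMax are hypothetical names for illustration only; they are not Riak's built-in Riak.mapValuesJson or Riak.reduceMax, and the sketch says nothing about how Riak schedules the work.

```javascript
// Made-up sample rows in the goog.csv column layout
// (Date,Open,High,Low,Close,Volume,Adj Close).
const csv = [
  "Date,Open,High,Low,Close,Volume,Adj Close",
  "2010-05-05,500.98,515.72,500.47,509.76,4566900,509.76",
  "2010-05-04,526.52,526.74,504.21,506.37,6076300,506.37",
  "2010-05-03,526.50,532.92,525.08,530.60,1857800,530.60",
].join("\n");

// Turn CSV text into one object per data row, keyed by header name.
// Every column except Date is converted to a number.
function parseRows(text) {
  const [header, ...lines] = text.trim().split("\n");
  const names = header.split(",");
  return lines.map((line) => {
    const cells = line.split(",");
    const row = {};
    names.forEach((name, i) => {
      row[name] = name === "Date" ? cells[i] : Number(cells[i]);
    });
    return row;
  });
}

// Map step (hypothetical stand-in): emit the High value of one record
// as a one-element list, like the map function in the query.
function mapHigh(row) {
  return [row.High];
}

// Reduce step (hypothetical stand-in): collapse a list of numbers into
// a one-element list holding the maximum, or an empty list if no input.
function reduceMax(values) {
  if (values.length === 0) return [];
  return [values.reduce((a, b) => (a > b ? a : b))];
}

const mapped = parseRows(csv).flatMap(mapHigh);
const result = reduceMax(mapped);
console.log(result); // prints [ 532.92 ]
```

The reduce step returns a list rather than a bare number so that its output can be fed back into the same function, which is the shape a reduce phase needs when it is applied to partial results.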

I know that Riak has to execute JavaScript code and do a lot of
inter-node communication, so the comparison is not completely fair.
However, the entire goog-200k.csv file is only 11 MB, and I expected
Riak to handle it without a problem. We have experimented with
different backends, with a 4-node cluster on a single machine, and with
a cluster of 3 physical nodes, but the results are the same.

I have several questions:
1. What setup do you recommend for this use case, specifically for storing logs?
2. I know that MapReduce over an entire bucket is not recommended, but
how would you calculate statistics over entire buckets, similar to the
queries in the tutorial?
3. We also tried Riak Search, but we were unable to express a query
like this one, that is, finding the highest value in a column. Is there
a way to do this?

Thanks,
Miro

_______________________________________________
riak-users mailing list
riak-us...@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


 