Howdy folks, Don Syme pointed me at this thread and I wanted to offer a few thoughts, and ask a few questions. I work on the Hadoop team at Microsoft, and spend a chunk of time thinking about the ways .NET developers will interact with Hadoop.
First, Jack, we've just been really quiet publicly the last few months, working on what we talked about at TechEd and Hadoop Summit in June. We're releasing bits fairly regularly onto http://www.hadooponazure.com, and I'd encourage you to check that out and let me know if you have any feedback. If you've got a customer who is interested, let me know how/if I can help.
Second is the mention of the performance of F# using Hadoop Streaming. Like most complex perf problems, the answer is usually a giant "it depends." I'll give a few anecdotes, and some data. What I've seen is that it is a combination of algorithmic complexity, data size, and processing pattern. Hardware profile comes into play here as well: in a cloud setting where you don't necessarily have the benefit of locality, your performance observations will vary substantially. I have talked with orgs that are fairly large Hadoop users who opt into streaming for reasons of developer productivity, and are willing to pay the perf penalty incurred (in the two biggest cases I can think of, it was using Python scripts via streaming). In some of our testing, we've seen streaming not be substantially worse, and as data size increased, the performance of the same algorithm written in C# (via streaming) and in Java became similar [the network becomes the bottleneck]. One test does not a robust claim around performance make, but we are investing in taking Carl Nolan's work forward so that .NET developers have an option. For those of you that have looked at Hadoop, I'd love to know a little bit more about the data you were going to be processing (size, shape, speed, etc).
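For anyone who hasn't looked at streaming before, here is a minimal sketch of the contract, in Python since that's what those two orgs were using. The mapper and reducer are just processes that read lines on stdin and write tab-separated key/value lines on stdout, and the framework sorts by key in between. The function names and the word-count task are my own illustration, not anything from our bits; a C# or F# executable would follow the same stdin/stdout pattern.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: records arrive sorted by key, so sum each run of equal keys."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Local stand-in for the shuffle/sort the framework does between stages."""
    return list(reduce_lines(sorted(map_lines(lines))))


# When launched as a streaming stage, pick the role from the first argument
# ("map" or "reduce"; a hypothetical convention for this sketch).
if __name__ == "__main__" and len(sys.argv) > 1:
    stage = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in stage(sys.stdin):
        print(record)
```

Every record crosses a process boundary as text, which is where the streaming overhead comes from, and also why it matters less once the network is the bottleneck.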
Currently, JVM languages dominate the Hadoop space, particularly around extensibility, because the core compute model and runtime are Java. This also pops up with things like Hive and Pig, and other pieces and parts of the Hadoop universe like HBase. We've looked at ways to make it easy for F# and C# developers to write Hive queries using LINQ, for instance, but one place where the pure JVM approach is nice is when I want to push down an arbitrary lambda UDF. This is an area where something like Scalding (a Scala library on top of Cascading) is pretty nice. That said, if you don't need to push down arbitrary lambdas, the authoring convenience of writing queries with LINQ (potentially assisted in F# by type providers working against the metadata store) gives you a pretty nice query experience. If the type of processing you're doing is handled well by Hive, then you might be in good shape here.
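A rough sketch of that distinction, in Python purely for illustration (this is not how LINQ or our Hive work is implemented, and every name here is hypothetical): a simple comparison can be captured as data and translated into query text, while an arbitrary function is opaque and would have to be shipped to the nodes and run there as a UDF.

```python
class Predicate:
    """A comparison captured as data, so it can be translated rather than executed."""

    def __init__(self, column, op, value):
        self.column, self.op, self.value = column, op, value

    def to_hiveql(self):
        # repr() quotes strings and leaves numbers bare. Good enough for a
        # sketch; it is not a substitute for real escaping.
        return f"{self.column} {self.op} {self.value!r}"


class Column:
    """Hypothetical column reference; comparisons build a Predicate."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return Predicate(self.name, ">", value)

    def __lt__(self, value):
        return Predicate(self.name, "<", value)


def build_query(table, predicate):
    """Translate a captured predicate into a Hive-style SELECT statement."""
    return f"SELECT * FROM {table} WHERE {predicate.to_hiveql()}"


def is_suspicious(row):
    """An arbitrary function: nothing here can be translated into query text."""
    return sum(ord(c) for c in row["url"]) % 7 == 0
```

Something like `build_query("logs", Column("bytes") > 1024)` yields plain query text that Hive can plan and run, whereas `is_suspicious` only makes sense if the runtime on the processing nodes can execute it directly.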
As Hadoop looks forward to 2.0, namely YARN, there is a capability there that will let folks bring in distributed runtimes written in any language, and we're certainly looking into what we can do there. If anyone is interested in writing YARN apps using .NET, please let me know. This link has some good followup from Hadoop Summit:
http://www.dbms2.com/2012/07/23/hadoop-yarn-beyond-mapreduce/
Rick brings up a good point, which is that Hadoop in its current form is not a good fit for certain classes of problems. This is true, but the Hadoop community is trying to address a number of those. Bringing YARN and a set of interesting compute models onto a Hadoop cluster means there will be a number of different ways to query and write jobs that need to process your data living on the same set of hardware. A lot of folks are trying to find ways to get around the very batchy nature of Map/Reduce, as well as bringing in distributed computing patterns that don't require reducing down to Map/Reduce.
Mauricio, one interesting thing to look into is Scalding (which I mentioned before). It feels similar to what we could have in F# on top of a LINQ-y way to write data processing jobs... If we could push the predicates all the way down to the processing nodes with minimal overhead, is that something that would be interesting?
If you've got any questions, please feel free to reach out. I'd love to hear about the ways you might be using F# to address big data problems.
--matt winkler
(email is mwinkle at msft )