F# and the JVM

Ryan Riley

31 Aug 2012, 23:05:16

to fsh...@googlegroups.com

I was thinking tonight about F# and Hadoop, and about a comment I recently read that Scala is a more likely candidate than F# for Hadoop use because it is already on the JVM. That got me thinking: would an IKVM type provider make sense? It's too late for me to think through this completely, so I'm just posing the question. One advantage might be using it as a means to interact with the Apache projects, especially Solr, Pig, ZooKeeper, Hadoop, etc. Would a type provider work this way?

Thoughts?

Mauricio Scheffer

1 Sep 2012, 10:44:30

to fsh...@googlegroups.com

Of the projects you mention, I'm very familiar with Solr, and have little to no experience with the others, so take what follows with a grain of salt :)

Solr's primary interface is HTTP, which means that it can be used from any platform. There are already stable, proven Solr clients in .NET. Solr also has an "embedded" mode for other JVM applications (i.e. using it as a library instead of a server), but nobody recommends it as a first choice.
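To make the "any platform" point concrete, here is a minimal sketch of a Solr query as a plain URL. The host, port, and the `title` field are assumptions chosen for illustration; `q`, `rows`, and `wt` are standard Solr query parameters. Nothing here contacts a server, it only builds the URL that any HTTP client could GET.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, query, rows=10):
    """Build a Solr select URL; any HTTP client on any platform can GET it."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical local Solr instance and field name, for illustration only.
url = solr_select_url("http://localhost:8983/solr", "title:hadoop", rows=5)
print(url)
```

A .NET client such as SolrNet ultimately issues requests of this shape and deserializes the response, which is why no JVM interop is needed for the common case.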

In the context of Solr I've been thinking about a type provider that infers a document type from the Solr schema ( http://code.google.com/p/solrnet/issues/detail?id=190 ), but that's it.
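As a rough illustration of the schema-inference idea, the sketch below reads a Solr-style `schema.xml` and derives a typed document class from its `<field>` elements. It is only a runtime analogue in Python: an F# type provider would do this at compile time. The sample schema, the `TYPE_MAP` from Solr type names to Python types, and the function name are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import make_dataclass, field, fields
from typing import List, Optional

# Assumed mapping from Solr field type names to Python types (illustrative only).
TYPE_MAP = {"string": str, "text_general": str, "int": int, "float": float, "boolean": bool}

# Hypothetical schema fragment in the usual schema.xml shape.
SAMPLE_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="price" type="float" indexed="true" stored="true"/>
    <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
  </fields>
</schema>
"""

def document_type_from_schema(schema_xml, class_name="SolrDocument"):
    """Generate a dataclass whose attributes mirror the schema's fields."""
    root = ET.fromstring(schema_xml)
    specs = []
    for f in root.iter("field"):
        name = f.get("name")
        py_type = TYPE_MAP.get(f.get("type"), str)  # unknown types fall back to str
        if f.get("multiValued") == "true":
            specs.append((name, List[py_type], field(default_factory=list)))
        elif f.get("required") == "true":
            specs.append((name, py_type))
        else:
            specs.append((name, Optional[py_type], field(default=None)))
    # Dataclasses need fields without defaults first; the sort is stable.
    specs.sort(key=lambda spec: len(spec) > 2)
    return make_dataclass(class_name, specs)

Doc = document_type_from_schema(SAMPLE_SCHEMA)
doc = Doc(id="42", title="F# and Hadoop")
print(doc)
```

The generated class rejects construction without the required `id` and gives optional fields a default, which is roughly the kind of safety a schema-driven provider would offer.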

I've also started a project to run Solr on .NET using IKVM ( https://github.com/mausch/SolrIKVM ), but as I said above, embedding Solr isn't generally recommended. IKVM is simply run once to generate .NET DLLs from JARs and that's it; I don't see how a type provider would help here.

About Hadoop: Carl Nolan ( http://blogs.msdn.com/b/carlnol/ ) has been blogging at length about submitting .NET jobs using streaming, and he's written a library to expose a simpler API for F# jobs ( http://code.msdn.microsoft.com/Hadoop-Streaming-and-F-f2e76850 ). It would be interesting to see how this compares to other high-level APIs for Hadoop, in particular Scala ones, such as Scrunch, Scoobi, and Scalding. These also seem to compete with Pig. I imagine these Scala APIs are a lot more feature-rich.

About ZooKeeper: there's already a client for .NET: https://github.com/ewhauser/zookeeper/tree/trunk/src/dotnet

Tomas mentioned the idea of a Java type provider a few days ago ( http://twitter.com/tomaspetricek/status/233348878005579777 ). Personally, I don't think it would buy you much over using IKVM as it is. You'd have to deploy the JARs instead of the ikvm-compiled DLLs. You'd still need to deploy the IKVM OpenJDK DLLs. It would act as the dynamic (i.e. on-demand) flavor of IKVM, which means it would be slow; or compile the whole JAR at startup, which would considerably increase the startup time. And you'd still need to wrap all the Java idioms and types.

Still, it might be fun to give it a try anyway and see what happens :)

Cheers

Mauricio

Rick Minerich

1 Sep 2012, 12:48:22

to fsh...@googlegroups.com

The problem with Hadoop streaming is that it comes at a huge performance cost over using the standard JVM API. There is another option, though: newer versions of Hadoop are increasingly being rewritten in C++ and they also include an unmanaged API. While I haven't seen any benchmarks, I would assume it's significantly faster than streaming at least. This might be a good target for F#. The question is: which is faster once you take marshaling into account?

Another thing to consider is that while Hadoop is just now coming to Microsoft platforms, it's old news in the distributed computing world. Due to the platform's inherent limitations with some kinds of algorithms there has been a lot of ongoing work on new platforms that reuse much of the existing Hadoop infrastructure. The favorite of these is called Spark and it's written in Scala.

Cheers,
-Rick

Ryan Riley

1 Sep 2012, 13:21:14

to fsh...@googlegroups.com

Thanks, Mauricio. I assumed you would have a solid understanding as to what was involved. So the Java Type Provider would be the only real player in this mix. I think you are right that this would be something of an F#-specific IKVM and likely not more useful. I suppose another option would be to create type providers that run on various runtimes such as the JVM or a GPGPU from the F# AST.

Ryan Riley

1 Sep 2012, 13:23:32

to fsh...@googlegroups.com

Is it then worth working on something native to F# that would compete with or work similarly to Spark rather than trying to work directly with Hadoop? I don’t have any real use for this at the moment (at least not that I know of) so I’m a little in the dark on the topic. How does Mattias’ Bumblebee or Luca’s L’Agent fit into this area? Do they? It would seem M-Brace fits into this realm nicely.

Rick Minerich

1 Sep 2012, 15:29:49

to fsh...@googlegroups.com

Certainly not trying to rain on any parades here; a solid and fast F# + Windows Hadoop story would be totally awesome. I've just been doing a lot of research on this lately for Bayard Rock and, at least for us, Hadoop isn't a great fit.

Also, building a distributed platform isn't easy at all. There are a lot of bottlenecks to consider, all the way from L1 cache to network. Spark was able to do it much more easily by riding on Hadoop's coattails and using a lot of things they already built. It would be awesome if we could do the same in F# somehow.

Jack Fox

2 Sep 2012, 15:51:50

to fsh...@googlegroups.com

I'm an interested spectator to this conversation because I have a reasonable chance to get F# into a high-profile organization, but a lot depends on the Microsoft Hadoop story. I think it's clear Hadoop has won the hearts and minds of upper management; witness the job postings requiring 10 years of Hadoop experience (that of course is hyperbole, but you know what I mean). At this point I think the MS story is only a little beyond the vaporware stage.

mwinkle

2 Sep 2012, 23:33:42

to fsh...@googlegroups.com

Howdy folks, Don Syme pointed me at this thread and I wanted to offer a few thoughts, and ask a few questions. I work on the Hadoop team at Microsoft, and spend a chunk of time thinking about the ways .NET developers will interact with Hadoop.

First, Jack, we've just been really quiet publicly the last few months working on what we talked about at TechEd and Hadoop Summit in June. We're releasing bits fairly regularly onto http://www.hadooponazure.com, and I'd encourage you to check that and let me know if you have any feedback. If you've got a customer who is interested, let me know how/if I can help.

Second is the mention of the performance of F# using Hadoop Streaming. Like most complex perf problems, the answer is usually a giant "it depends." I'll give a few anecdotes, and some data. What I've seen is that it is a combination of algorithmic complexity, data size, and processing pattern. Hardware profile comes into play here as well: in a cloud setting where you don't necessarily have the benefit of locality, your performance observations will vary substantially. I have talked with orgs that are fairly large Hadoop users, who opt into streaming for reasons of developer productivity, and are willing to pay the perf penalty incurred (in the two biggest cases I can think of, it was using Python scripts via streaming). In some of our testing, we've seen streaming not be substantially worse, and as data size increased, the performance of the same algorithm written in C# (via streaming) and in Java is similar [the network becomes the bottleneck]. One test does not a robust claim around performance make, but we are investing in taking Carl Nolan's work forward so that .NET developers have an option. For those of you that have looked at Hadoop, I'd love to know a little bit more about the data you were going to be processing (size, shape, speed, etc).
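For readers who haven't used streaming, the contract can be sketched as follows: the mapper reads text lines and emits tab-separated key/value lines, the framework sorts them by key, and the reducer reads the sorted lines and aggregates per key. The word-count job and the `run_locally` helper below are assumptions for illustration; in a real job the mapper and reducer would be separate executables reading stdin and writing stdout, handed to the streaming jar via its `-mapper` and `-reducer` options.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum counts per key; relies on the input arriving sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_locally(lines):
    # sorted() stands in for Hadoop's shuffle/sort between the two phases.
    return list(reducer(sorted(mapper(lines))))

result = run_locally(["F# on Hadoop", "Scala on Hadoop"])
print(result)
```

Every record crosses a process boundary as text in both directions, which is where the serialization overhead discussed in this thread comes from, and also why any language that can read and write text can take part.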

Currently, JVM languages dominate the Hadoop space, particularly around extensibility, because the core compute model and runtime are Java. This also pops up with things like Hive and Pig and other pieces and parts of the Hadoop universe like HBase. We've looked at ways to make it easy for F# and C# developers to write Hive queries using LINQ, for instance, but one place where the pure JVM approach is nice is when I want to push down an arbitrary lambda UDF. This is an area where something like Scalding (a Scala library on top of Cascading) is pretty nice. That said, if you don't need to push down arbitrary lambdas, the authoring convenience of writing queries with LINQ (and potentially assisted in F# by type providers working against the metadata store) gives you a pretty nice query experience. If the type of processing you're doing is handled well by Hive, then you might be in good shape here.
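The difference between a LINQ-style query and an arbitrary lambda can be sketched like this: a restricted predicate is captured as an expression tree and translated to a HiveQL WHERE clause, so nothing but query text has to reach the cluster. The classes, the `page_views` table, and the column names are hypothetical, and string escaping is deliberately simplistic.

```python
class Expr:
    """Base node: operators build a tree instead of computing a value."""
    def _bin(self, op, other):
        return BinOp(op, self, other if isinstance(other, Expr) else Lit(other))
    def __gt__(self, other): return self._bin(">", other)
    def __lt__(self, other): return self._bin("<", other)
    def __eq__(self, other): return self._bin("=", other)
    def __and__(self, other): return self._bin("AND", other)
    def __or__(self, other): return self._bin("OR", other)

class Col(Expr):
    def __init__(self, name):
        self.name = name
    def to_hiveql(self):
        return self.name

class Lit(Expr):
    def __init__(self, value):
        self.value = value
    def to_hiveql(self):
        if isinstance(self.value, str):
            # Simplistic quoting, good enough for a sketch only.
            return "'" + self.value.replace("'", "\\'") + "'"
        return str(self.value)

class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def to_hiveql(self):
        return f"({self.left.to_hiveql()} {self.op} {self.right.to_hiveql()})"

def select(table, predicate):
    """Render a SELECT with the predicate tree as its WHERE clause."""
    return f"SELECT * FROM {table} WHERE {predicate.to_hiveql()}"

# Hypothetical table and columns.
query = select("page_views", (Col("country") == "NL") & (Col("duration_ms") > 500))
print(query)
```

This only works while the predicate stays within operators the translator knows. A lambda that calls arbitrary user code has no such textual form, which is the case where the JVM-hosted libraries have the advantage.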

As Hadoop looks forward to 2.0, namely YARN, there is a capability there that will let folks bring in distributed runtimes written in any language, and we're certainly looking into what we can do there. If anyone is interested in writing YARN apps using .NET, please let me know. This link has some good followup from Hadoop Summit: http://www.dbms2.com/2012/07/23/hadoop-yarn-beyond-mapreduce/

Rick brings up a good point, which is that Hadoop in its current form is not a good fit for certain classes of problems. This is true, but the Hadoop community is trying to address a number of those. Bringing YARN and a set of interesting compute models onto a Hadoop cluster means there will be a number of different ways to query and write jobs that need to process your data living on the same set of hardware. A lot of folks are trying to find ways to get around the very batchy nature of Map/Reduce, as well as bringing in distributed computing patterns that don't require reducing down to map/reduce.

Mauricio, one interesting thing to look into is Scalding (that I mentioned before). It feels like something similar to what we could have in F# on top of a LINQ-y way to write data processing jobs... If we could push the predicates all the way down to the processing nodes with minimal overhead, is that something that would be interesting?

If you've got any questions, please feel free to reach out, I'd love to hear about the ways you might be using F# to address big data problems.

--matt winkler

(email is mwinkle at msft )

Jack Fox

3 Sep 2012, 12:33:22

to fsh...@googlegroups.com

Matt, thank you for joining this thread, and I apologize if I talked out of turn about MS's commitment to Hadoop. My knowledge of Hadoop and the JVM is sketchy at best. I will be in touch. I am still formulating my thoughts on how to approach this customer.

I've done very basic experiments with F# Quotations, but I know WebSharper has been successful converting the F# AST to JavaScript, and I believe there is at least one other F# JavaScript generator out in the wild. Has anyone considered generating Java bytecode from F# as a way to push down arbitrary lambdas to the Hadoop cluster?
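To sketch what such a translation might look like in miniature: the example below takes a quoted lambda, walks its syntax tree, and emits Java source for an equivalent static method. Python's `ast` module stands in for F# quotations here, and the output is Java source rather than bytecode, so a real pipeline would still need a compiler or bytecode emitter on the other side. The supported subset (arithmetic on doubles) and the function names are assumptions for illustration.

```python
import ast

BIN_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def to_java_expr(node):
    """Translate a small arithmetic subset of Python's AST to a Java expression."""
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        op = BIN_OPS[type(node.op)]
        return f"({to_java_expr(node.left)} {op} {to_java_expr(node.right)})"
    if isinstance(node, ast.Name):
        return node.id
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return repr(float(node.value))  # every number becomes a Java double literal
    raise NotImplementedError(f"unsupported construct: {ast.dump(node)}")

def lambda_to_java_method(source, method_name="apply"):
    """Turn 'lambda x, y: <arithmetic>' into the source of a static Java method."""
    tree = ast.parse(source, mode="eval").body
    if not isinstance(tree, ast.Lambda):
        raise ValueError("expected a lambda expression")
    params = ", ".join(f"double {a.arg}" for a in tree.args.args)
    return f"static double {method_name}({params}) {{ return {to_java_expr(tree.body)}; }}"

java_src = lambda_to_java_method("lambda x, y: x * 2 + y")
print(java_src)
```

Anything outside the supported subset raises an error, which mirrors the practical limit of this approach: closures over .NET objects or calls into .NET libraries have no direct JVM counterpart to translate to.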
