I was looking around the web for Lucene/Solr and Hadoop integrations and came upon your project. It definitely intrigued me and I am trying to figure out if it fits our use case.
We have a current project that is a traditional app server plus Solr. The app server processes files of 100,000 or so records. It applies some BI to each record, sends a query to a small Solr cluster, applies some additional logic, and finally assembles everything into a results file. In addition to the app server BI, the query itself is quite complex, so each request is a bit on the slow side (about 1-2 seconds). Our Solr index is about 7,000,000 records, but each record is not that large; the entire index is about 7-8GB, so it is not what I would describe as your traditional big data problem. Still, the overall application is slow, and we face demands to process many more files concurrently and to handle a much higher rate of records per minute. Overall performance is now a critical need.
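To put rough numbers on why this hurts (back-of-the-envelope arithmetic only; the 1.5 s figure is just an assumed midpoint of the 1-2 second range above):

```python
# Back-of-the-envelope throughput for the current sequential pipeline.
# Figures from the description above: ~100,000 records per file,
# ~1-2 seconds per record (1.5 s assumed as a midpoint).
RECORDS_PER_FILE = 100_000
SECONDS_PER_RECORD = 1.5  # assumed midpoint of the 1-2 s range

sequential_hours = RECORDS_PER_FILE * SECONDS_PER_RECORD / 3600
print(f"one file, fully sequential: ~{sequential_hours:.0f} hours")

# With N parallel workers the wall-clock time divides (roughly) by N,
# assuming the Solr cluster can absorb the extra query load:
for workers in (10, 100, 1000):
    print(f"{workers:>5} workers: ~{sequential_hours / workers:.2f} hours per file")
```

Even generously assuming perfect scaling, a single file takes on the order of 42 hours sequentially, which is why parallelism is the only realistic way forward for us.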
So, what does interest me is parallel processing. Our overall problem is a MapRunnable problem and may often be a MapReduce problem. Most of the requests to the system are asynchronous jobs (pick up a file and process it), but we do have the occasional need to process synchronous requests. And we need the entire workflow to handle a large amount of parallel processing. Ideally, I would like the entire workflow to be in Hadoop. Sure, we could put the BI in Hadoop and have it call Solr (or a Solr cluster), but that just adds a bottleneck as well as extra HTTP requests.
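To make the shape of that workflow concrete, here is a minimal sketch of the per-record "map" step I have in mind. Everything in it is a hypothetical placeholder: apply_bi(), lookup(), and post_process() stand in for our real logic, and a plain dict stands in for a co-located index so the sketch runs without Solr or Hadoop:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder for a co-located index, standing in for Solr so the
# sketch is self-contained.
INDEX = {"k1": "v1", "k2": "v2"}

def apply_bi(record):
    return record.lower()            # placeholder business logic

def lookup(key):
    return INDEX.get(key, "miss")    # stands in for the (slow) Solr query

def post_process(record, hit):
    return f"{record}:{hit}"         # placeholder post-query logic

def process(record):
    # The "map" step: one record in, one enriched record out.
    key = apply_bi(record)
    return post_process(key, lookup(key))

records = ["K1", "K2", "K3"]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process, records))  # pool.map preserves order
print(results)
```

The point is that each record is independent, so if the index lookup can happen next to the data (rather than over HTTP to a separate Solr cluster), the whole thing fans out cleanly.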
So, my question is: does this seem like a good use case for Blur? I see that you have definitely done a lot of impressive work. Are there others working on this? In addition to evaluating the technology itself, I want to evaluate the health of the project; an active community is what I love about Solr.
Thanks for any responses.
-Kevin