Is this a Blur use case?

osbornk

Jul 3, 2012, 8:04:18 PM
to blur...@googlegroups.com
I was looking around the web for Lucene/Solr and Hadoop integrations and came upon your project. It definitely intrigued me and I am trying to figure out if it fits our use case.

We have a current project that is a traditional app server and Solr. The app server processes files of 100,000 or so records. It applies some BI to each record, sends a query to a small Solr cluster, and then applies some additional logic. Finally, it puts everything together in a results file. In addition to the app server BI, the query is quite complex, so it is a bit on the slow side (about 1-2 seconds). Our Solr index is about 7,000,000 records. However, each record is not that large, so it is not what I would describe as a traditional big data problem. The entire index is about 7-8GB. Still, the overall application is slow, and we have demands to process many more files (at the same time) and a larger number of records per minute. Overall performance is now a critical need.

So, what does interest me is parallel processing. Our overall problem is a MapRunnable problem and may often be a MapReduce problem. Most of the requests to the system are asynchronous jobs (pick up a file and process it), but we do have the occasional need to process synchronous requests. And we need the entire workflow to handle a large amount of parallel processing. Ideally, I would like the entire workflow to be in Hadoop. Sure, we could put the BI in Hadoop and have it call Solr (or a Solr cluster), but that just adds a bottleneck as well as extra HTTP requests.
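
As a rough illustration of the map-only shape described above (each record is processed independently, with no reduce step), here is a minimal Python sketch. Everything in it is a hypothetical stand-in: `INDEX`, `apply_business_logic`, `query_index`, and `process_file` are invented for illustration and are not part of Solr, Hadoop, or Blur, and a thread pool stands in for what would be map tasks in a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the search index (not Solr or Blur).
INDEX = {"apple": ["doc1", "doc7"], "pear": ["doc3"]}

def apply_business_logic(record):
    # Placeholder for the per-record BI step: normalize the lookup key.
    return record.strip().lower()

def query_index(term):
    # Placeholder for the search call; a real query would be far more complex.
    return INDEX.get(term, [])

def process_record(record):
    # One "map" call: BI, then query, then post-processing into a result line.
    term = apply_business_logic(record)
    hits = query_index(term)
    return f"{term}\t{len(hits)}\t{','.join(hits)}"

def process_file(records, workers=4):
    # Records are independent, so they can be mapped in parallel with no reduce.
    # pool.map returns results in the same order as the input records.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_record, records))

if __name__ == "__main__":
    for line in process_file([" Apple ", "pear", "kiwi"]):
        print(line)
```

Because no record depends on another, throughput in this shape is bounded mainly by how fast the index answers queries, which is why keeping the index close to the workers matters more than the parallel framework itself.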

So, my question is whether this seems like a good use case for Blur? I see that you have definitely done a lot of impressive work. Are there others working on this? In addition to evaluating the actual technology, I want to evaluate the health of the project. That is what I love about Solr.

Thanks for any responses.
-Kevin

Aaron McCurry

Jul 3, 2012, 9:49:06 PM
to blur...@googlegroups.com
Kevin,

First, the question of whether Blur is a good fit for your technical challenges.  Blur was designed from the beginning to be tightly integrated with Hadoop, both for storage of the indexes and for the actual indexing process in the MapReduce framework.  So your idea of moving your file processing into Hadoop seems like a good one, considering that's what Hadoop does best (process files).  I can see a few different ways you could change your process to become more parallel using Hadoop and Blur.

Overall system performance has been a priority from the beginning.  Query execution speed has driven a lot of the design decisions.  Scalability has been an equal driver, so that as you add infrastructure you can continue to add data to your system and maintain a high level of performance.  One of the nice things about Blur is that you can either perform live (near real time) updates through Thrift (RPC) or update the indexes through the HDFS filesystem in M/R.  This allows you to move the indexing load away from the search load, assuming that you have multiple clusters of servers.

As for the second question, whether there are others helping me and the overall health of the project: currently I am the primary committer, though there have been a few others that have committed code and I love getting suggestions/help :) .  Besides my primary project, where I use Blur every day, there are a couple of other projects that I know of that are under development using Blur.  Overall it is a young project that I believe can fill a gap in the Hadoop stack.  We have plans for the project that we believe will greatly grow its community and usage; more to come on that.

Hope this helps to answer your questions.

Aaron

Kevin Osborn

Jul 5, 2012, 2:54:00 PM
to blur...@googlegroups.com
Thanks for your response. I will definitely be putting Blur as one of the top ideas I am evaluating for our project. And if we end up using it, we would be glad to help where we can to make it better.

-Kevin
--
KEVIN OSBORN
LEAD SOFTWARE ENGINEER
T 949.399.8714      C 949.310.4677
5 Park Plaza, Suite 600, Irvine, CA 92614


Tim Williams

Jul 7, 2012, 1:40:36 PM
to blur...@googlegroups.com
On Thu, Jul 5, 2012 at 2:54 PM, Kevin Osborn
<kevin....@cbsinteractive.com> wrote:
>
> Thanks for your response. I will definitely putting Blur as one of the top
> ideas I am evaluating for our project. And if we end up using it, we would
> be glad to help where we can to make it better.

Hey Kevin, if you're able to post any information on your evaluation I
think that'd be helpful - even if you don't end up using Blur.

Thanks,
--tim