I wonder what NuGet.org would be like, powered by RavenDb ...


Justin A

Apr 23, 2014, 9:25:34 PM
to rav...@googlegroups.com
We all know NuGet :)

So they've updated their search stuff recently. They were using Lucene (yay!) but were using it wrongly. (ok...)




So they created some new search service thingy that uses Lucene (yay again!) + Azure blobs and stuff (ok .. now I don't get this .. so I'll classify this as "I'm not smart").
I've not looked at any code (*hint* Ayende + blog post/code review *hint*) but my first thought was "er.. why? Isn't there a solution (RavenDB) already out there that can do all of this AND it's free .. cause NuGet.org is OSS..."

So next I thought ->
1) fork NuGet.org
2) replace their search with RavenDb.
3) get metrics to see how they compare.

.. just putting the idea out there if anyone might have the brains and time....

hint.

hint.

maybe?

Oren Eini (Ayende Rahien)

Apr 23, 2014, 11:19:56 PM
to ravendb
Okay, I haven't done any real thinking here, but Lucene on Azure Blob is going to have... interesting perf stats.
Lucene does a LOT of random reads and random writes. It is also very expensive in terms of memory on large indexes, and it needs to merge stuff, all of which creates very interesting results.

Now, the entire nuget data set is < 100K docs. So that isn't really something that should NEED Azure Blob.


I wrote about this a while ago:





Oren Eini

CEO

Mobile: + 972-52-548-6969

Office:  + 972-4-674-7811

Fax:      + 972-153-4622-7811






Itamar Syn-Hershko

Apr 24, 2014, 9:58:02 AM
to rav...@googlegroups.com
Lucene doesn't really do random writes, and it can be used in a way that doesn't require random reads as well. I actually have native Azure support on my TODO list for Lucene.NET and have had MS open-source sponsor that effort.

Merges are simply new writes. Lucene never goes back to fiddling with previously written files.
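
A minimal sketch of that write-once behaviour, in Python. This is a toy model, not Lucene's API: `SegmentStore`, `flush`, `merge` and the `seg_N` file names are all made up for illustration. The only point is that a merge writes a new file and leaves the existing ones untouched.

```python
class SegmentStore:
    """Toy model of a write-once index directory (hypothetical, not Lucene's API)."""

    def __init__(self):
        self.files = {}    # file name -> postings; each name is written exactly once
        self.live = []     # names of the segments that searches currently consult
        self.counter = 0

    def _write_new(self, postings):
        name = f"seg_{self.counter}"
        self.counter += 1
        assert name not in self.files  # an existing file is never overwritten
        self.files[name] = postings
        return name

    def flush(self, docs):
        """Write a batch of (doc_id, text) pairs as one brand-new segment."""
        postings = {}
        for doc_id, text in docs:
            for term in set(text.lower().split()):
                postings.setdefault(term, []).append(doc_id)
        name = self._write_new({t: sorted(ids) for t, ids in postings.items()})
        self.live.append(name)
        return name

    def merge(self, names):
        """Combine segments into a new one; inputs are only dropped from the live list."""
        merged = {}
        for name in names:
            for term, ids in self.files[name].items():
                merged.setdefault(term, []).extend(ids)
        new_name = self._write_new({t: sorted(ids) for t, ids in merged.items()})
        self.live = [n for n in self.live if n not in names] + [new_name]
        return new_name

    def search(self, term):
        """Look the term up in every live segment and combine the hits."""
        hits = []
        for name in self.live:
            hits.extend(self.files[name].get(term.lower(), []))
        return sorted(hits)
```

After two flushes and one merge, the store holds three files but searches only the merged one. This toy keeps the old segment files around forever; a real index would clean up unreferenced files at some point.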

RavenDB isn't really suitable for search only usage when there's a requirement for fine-tuning search performance and relevance. I definitely think using raw Lucene is actually the right move here. With regards to performance on Azure - apparently it isn't so bad.

--

Itamar Syn-Hershko
http://code972.com | @synhershko
Freelance Developer & Consultant

Kijana Woodard

Apr 24, 2014, 10:29:07 AM
to rav...@googlegroups.com
"RavenDB isn't really suitable for search only usage when there's a requirement for fine-tuning search performance and relevance."

Care to expand? What does nuget need to do which requires fine tuning? What prevents RavenDB from being able to do the same?

Oren Eini (Ayende Rahien)

Apr 24, 2014, 10:31:33 AM
to ravendb
You are correct about the random writes. But merges are still very expensive for high frequency updates.
When running on Azure, reading & writing so much data is a killer, because of their latencies.

How do you prevent it from doing random reads? Assuming that you can't hold everything in memory?

Itamar Syn-Hershko

Apr 24, 2014, 12:04:58 PM
to rav...@googlegroups.com
RavenDB wraps Lucene and merely exposes parts of it. When you are building a search service like they do, you want to have as much control as possible over things like when you commit (during indexing, to improve performance) and what query you pass in (you may not necessarily want to pass a string query to be parsed but rather an actual optimized query object with filters and so on, or simply write your own query parser - like they did).
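
As a rough illustration of why control over commits matters during indexing, here is a toy writer in Python that just counts commits. `ToyIndexWriter` and `index_in_batches` are hypothetical names, not part of Lucene or RavenDB. The assumption is simply that every commit carries a fixed cost, so committing once per batch instead of once per document reduces overhead.

```python
class ToyIndexWriter:
    """Stand-in for an index writer; only tracks pending vs. committed docs."""

    def __init__(self):
        self.pending = []       # documents added but not yet visible to searches
        self.committed = []     # documents made durable by a commit
        self.commit_count = 0   # proxy for the total commit overhead paid

    def add(self, doc):
        self.pending.append(doc)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending.clear()
        self.commit_count += 1


def index_in_batches(writer, docs, batch_size):
    """Add docs one by one, committing only every `batch_size` documents."""
    for i, doc in enumerate(docs, start=1):
        writer.add(doc)
        if i % batch_size == 0:
            writer.commit()
    if writer.pending:          # flush the final, partially filled batch
        writer.commit()
```

With 10 documents, a batch size of 1 pays for 10 commits while a batch size of 4 pays for 3, and the committed documents are the same either way.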

Also things like custom analyzers are much easier to handle (like they did here https://github.com/NuGet/NuGet.Services.Search/pull/10 to implement a type-ahead feature), or doing efficient faceting (which RavenDB lacks at the moment; faceting on large data sets is very expensive in terms of time complexity).
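
One common way to get type-ahead out of an inverted index is to index edge n-grams, that is, every prefix of each token. Whether the linked pull request does exactly this is not confirmed here, so treat the following Python as a sketch of the general technique only; `edge_ngrams`, `build_prefix_index` and `suggest` are illustrative names, and splitting package ids on dots, dashes, underscores and spaces is an assumption.

```python
import re


def edge_ngrams(token, min_len=1, max_len=10):
    """Return the lowercase prefixes of `token` between min_len and max_len chars."""
    token = token.lower()
    upper = min(len(token), max_len)
    return [token[:n] for n in range(min_len, upper + 1)]


def build_prefix_index(package_ids):
    """Map every edge n-gram of every id part to the package ids containing it."""
    index = {}
    for pid in package_ids:
        # Assumed tokenization: split on dots, dashes, underscores and whitespace.
        for part in re.split(r"[.\-_\s]+", pid):
            for gram in edge_ngrams(part):
                index.setdefault(gram, set()).add(pid)
    return index


def suggest(index, typed):
    """Exact lookup of what the user typed so far; no scan over all terms."""
    return sorted(index.get(typed.lower(), set()))
```

The cost moves to indexing time: each token is stored once per prefix, so the index grows, but a suggestion becomes a single term lookup instead of a prefix scan at query time.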

When you want to use a search engine, you use a search engine, not something that wraps it, unless it's another search server / service which takes great care not to hide anything (like Elasticsearch does with Lucene).

Itamar Syn-Hershko

Apr 24, 2014, 12:09:07 PM
to rav...@googlegroups.com
You are assuming high frequency updates; that is not always the case. And merges can be controlled, and I wouldn't be surprised if they use or will use an AzureMergePolicy because of that.

Preventing random reads means making sure everything can be read into memory and stay there as long as possible. The general practice in search today is to make sure you can do that to achieve high performance search, so if we are dealing with a lot of data, either use larger instances or shard. There's usually room for optimizations, like removing norms if no boosting is used, using filters only when needed, using the right analyzer, indexing positions yes/no depending on usage etc.

Justin A

Apr 25, 2014, 3:54:05 AM
to rav...@googlegroups.com
ElasticSearch was the other idea I was going to ask about, later on. So Itamar, you're saying that could also be a good fit for NuGet (search) ... definitely over RavenDb (in this single scenario)?

Itamar Syn-Hershko

Apr 25, 2014, 5:46:35 AM
to rav...@googlegroups.com

Yes

