With the NoSQL movement growing around document-based databases, I've been looking at MongoDB lately. I've noticed a striking similarity in how it treats items as "documents", just like Lucene does (and Solr users by extension).
Lucene offers some serious advantages, such as powerful search and weighting systems, not to mention facets in Solr (and Solr is being merged into the Lucene project soon, yay!). You can use Lucene documents to store IDs and access the documents by ID, just like MongoDB. Mix in Solr, and you get a web-service-based, load-balanced solution.
MongoDB's restrictions remind me of using Memcached, but I could use Microsoft's Velocity instead and get more grouping and list-collection power than MongoDB offers (I think). You can't get faster or more scalable than caching data in memory, and even Lucene has a memory provider.
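Here's a minimal sketch of that idea, assuming a recent Lucene version (the field names and sample text are mine): an in-memory index where each document carries a stored id field, so content can be looked up by key much like in MongoDB.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class LuceneAsStore {
    public static void main(String[] args) throws Exception {
        // Lucene's in-memory provider: no disk involved
        RAMDirectory dir = new RAMDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // the stored id makes the document addressable, like a Mongo _id
            doc.add(new StringField("id", "42", Field.Store.YES));
            doc.add(new TextField("body", "powerful searching and weighting", Field.Store.YES));
            writer.addDocument(doc);
        }
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TopDocs hits = searcher.search(new TermQuery(new Term("body", "searching")), 10);
        // read the stored fields back, document-store style
        Document found = searcher.doc(hits.scoreDocs[0].doc);
        System.out.println(found.get("id") + ": " + found.get("body"));
    }
}
```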
MongoDB and friends seem to serve a purpose where there is no requirement for searching and/or faceting. They appear to be a simpler and arguably easier transition for programmers detoxing from the RDBMS world; unless one is already used to them, Lucene & Solr have a steeper learning curve.
There aren't many examples of using Lucene/Solr as a datastore, but the Guardian has made some headway and summarizes it in an excellent slide deck; even they are non-committal about jumping fully onto the Solr bandwagon and are "investigating" combining Solr with CouchDB.
Finally, I'll offer our experience, though unfortunately I can't reveal much about the business case: we work at the scale of several TB of data in a near-real-time application. After investigating various combinations, we decided to stick with Solr. No regrets so far (six months and counting), and we see no reason to switch.
Summary: if you do not have a search requirement, Mongo offers a simple and powerful approach. However, if search is key to your offering, you are likely better off sticking to one technology (Solr/Lucene) and optimizing the heck out of it - fewer moving parts.
[...] However, we observed that Solr's query performance decreases as index size increases. We realized that the best solution is to use Solr and MongoDB together: we store content in MongoDB and create the index with Solr for full-text search. We store only the unique ID of each document in the Solr index and retrieve the actual content from MongoDB after searching in Solr. Getting documents from MongoDB is faster than getting them from Solr because there are no analyzers, no scoring, etc. [...]
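A rough sketch of that pattern, assuming SolrJ and the MongoDB Java driver (the Solr core name, collection names, and field names are made up):

```java
import static com.mongodb.client.model.Filters.eq;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.bson.Document;

public class HybridSearch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();
        MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> articles =
                mongo.getDatabase("cms").getCollection("articles");

        // 1. Full-text search in Solr; the index stores nothing but the id.
        SolrQuery query = new SolrQuery("body:\"full-text search\"");
        query.setFields("id");
        for (SolrDocument hit : solr.query(query).getResults()) {
            // 2. Fetch the actual content from MongoDB by the id Solr returned:
            //    no analyzers or scoring involved, just a key lookup.
            Document article = articles.find(eq("_id", hit.getFieldValue("id"))).first();
            System.out.println(article.toJson());
        }
        solr.close();
        mongo.close();
    }
}
```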
From my experience with both, Mongo is great for simple, straightforward usage. The main Mongo disadvantage we've suffered is poor performance on unanticipated queries (you cannot create Mongo indexes for all the possible filter/sort combinations, you simply can't).
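To make that concrete, a compound index in the Java driver covers only the filter/sort shapes it was built for (collection and field names here are illustrative):

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class IndexDemo {
    public static void main(String[] args) {
        MongoCollection<Document> orders = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("shop").getCollection("orders");
        // Serves: filter on status + sort by created.
        // Does NOT serve e.g. a filter on customer sorted by total;
        // every such combination needs its own index.
        orders.createIndex(Indexes.compoundIndex(
                Indexes.ascending("status"),
                Indexes.descending("created")));
    }
}
```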
Since no one else mentioned it, let me add that MongoDB is schema-less, whereas Solr enforces a schema. So, if the fields of your documents are likely to change, that's one reason to choose MongoDB over Solr.
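For illustration, the same Mongo collection happily accepts documents of different shapes (the names below are made up), whereas Solr would expect the fields to be declared in its schema:

```java
import java.util.Arrays;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SchemalessDemo {
    public static void main(String[] args) {
        MongoCollection<Document> users = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("app").getCollection("users");
        // two documents, two different sets of fields, one collection
        users.insertOne(new Document("name", "alice").append("email", "alice@example.com"));
        users.insertOne(new Document("name", "bob")
                .append("phones", Arrays.asList("555-0100", "555-0101")));
    }
}
```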
@mauricio-scheffer mentioned Solr 4 - for those interested in that, LucidWorks describes Solr 4 as "the NoSQL Search Server", and there's a video at -solr-4-the-nosql-search-server/ where they go into detail on the NoSQL(ish) features. (The -ish is because their version of schemaless is actually a dynamic schema.)
If you just want to store data in key-value format, Lucene is not recommended, because its inverted index wastes too much disk space. And since it saves data on disk, its performance is much slower than NoSQL databases such as Redis, which keeps data in RAM. Lucene's biggest advantage is that it supports many kinds of queries, so fuzzy queries, for example, are possible.
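A minimal sketch of such a fuzzy query, assuming a recent Lucene version (the sample data is mine): FuzzyQuery matches terms within a given edit distance, something most key-value stores cannot do.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class FuzzyDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document d = new Document();
            d.add(new TextField("title", "mongodb tutorial", Field.Store.YES));
            w.addDocument(d);
        }
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        // "mongodp" is a typo; maxEdits=2 still finds "mongodb"
        FuzzyQuery q = new FuzzyQuery(new Term("title", "mongodp"), 2);
        System.out.println(searcher.search(q, 10).scoreDocs.length + " hit(s)");
    }
}
```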
MongoDB Atlas will have a Lucene-based search engine soon. The big announcement was made at this week's MongoDB World 2019 conference. This is a great way to encourage more usage of their high-revenue MongoDB Atlas product.
Third-party solutions, like tailing the Mongo oplog, are attractive. Some thoughts and questions remain about whether these solutions could be tightly integrated, from a development/architecture perspective. I don't expect to see a tightly integrated solution for these features, for a few reasons (somewhat speculative, subject to clarification, and not up to date with development efforts):
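As background, the oplog-tail approach those third-party tools use boils down to a tailable cursor on local.oplog.rs. A sketch with the MongoDB Java driver (requires a replica set; the Solr push is left as a stub):

```java
import com.mongodb.CursorType;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Filters;
import org.bson.BsonTimestamp;
import org.bson.Document;

public class OplogTail {
    public static void main(String[] args) {
        MongoCollection<Document> oplog = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("local").getCollection("oplog.rs");
        BsonTimestamp lastSeen = new BsonTimestamp(0, 0); // resume point
        // a tailable-await cursor blocks waiting for new entries
        try (MongoCursor<Document> cursor = oplog.find(Filters.gt("ts", lastSeen))
                .cursorType(CursorType.TailableAwait).iterator()) {
            while (cursor.hasNext()) {
                Document entry = cursor.next();
                // "op": i = insert, u = update, d = delete; "ns" is the namespace
                System.out.println(entry.getString("op") + " on " + entry.getString("ns"));
                // ...push the corresponding change to Solr here...
            }
        }
    }
}
```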
Might be a good idea as a start; later I may think about storing everything in Mongo/Raven. It's my first project with Umbraco, I'm still exploring the product, and I'm not fully aware of its data model or object model.
There's no way you can remove the SQL database requirement of Umbraco without forking the product and changing the source code yourself. I'm not sure what advantage you're hoping for by using a DDB rather than an RDBMS for Umbraco content.
What I want to do is use Umbraco for a content-sharing web site, to share videos, photos, articles, and so on. That means all registered users will be able to post content, comment, rate, and see the most commented, rated, and viewed items. I expect a large customer base and intensive read/write operations, and I don't want to store part of my data in SQL and another part in Mongo/Raven. I think I'll fork the code and change it, but only for the modules I'll use on my site; this still means I'll need an RDBMS server, which will increase hosting costs, administration ...
@slace: I'm not sure I understand this question... any DDB or ODB is a much more natural fit for Umbraco's data structure than any RDBMS. It is also going to be much faster, and supports sharding, replication, and whatnot out of the box, freeing you to work on the actual CMS instead of its surroundings.
The Umbraco data is stored in an RDBMS, but that is only used for the backend. When you publish content, it is stored in a nicely formatted XML structure which is cached, so you only query memory, which is fast IMHO.
This may be a dead conversation, but I too see a great advantage to using MongoDB on the back end of Umbraco, and I for one would be interested in helping develop it. I definitely agree it would need to be a fork.
In some circumstances the XML cache document becomes a bottleneck; regenerating it each time content is published is also a massively inefficient use of resources. With a high enough frequency of publishes and a large enough set of nodes, this process also fails a lot.
Hi guys, what about doing some experiments with this? Starting with a simple MongoDB cache/replication of the XML tree. I have no experience with NoSQL in real scenarios and would be interested to hear suggestions on how to structure the information (document content, to begin with) in MongoDB.
I'm new to the concept of NoSQL databases and have never used one. Based on what I've read and the little I've understood, I still don't see how they can be particularly useful if you can't make references between data, if there's no concept of a foreign key.
NoSQL shines if you have data structures that are not clearly defined at the time you build the system. I tend to keep user settings in NoSQL, for example. Another example was a system where users needed to be able to add fields at runtime - very painful in an RDBMS and a breeze in NoSQL.
It also works well if your model structure is largely centered around one or a few model objects, with most relationships actually being child objects of the main model objects. In that case you will find you have fairly little need for actual joins. I found that a contact management system can be implemented quite nicely in NoSQL, for example: a person can have multiple addresses, phones, and e-mails. Instead of putting each into a separate table, they all become part of the same model, and you have one person object (see the sketch below).
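A sketch of what that person object looks like as a single MongoDB document (the field values are invented):

```java
import java.util.Arrays;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class EmbeddedPerson {
    public static void main(String[] args) {
        MongoCollection<Document> people = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("crm").getCollection("people");
        // addresses, phones, and e-mails live inside the person document;
        // no separate tables, no joins
        Document person = new Document("name", "Jane Doe")
                .append("addresses", Arrays.asList(
                        new Document("type", "home").append("city", "Oslo"),
                        new Document("type", "work").append("city", "Bergen")))
                .append("phones", Arrays.asList("555-0100"))
                .append("emails", Arrays.asList("jane@example.com"));
        people.insertOne(person);
    }
}
```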
Honestly, the no-join thing sounded quite scary to me at the beginning too. But the trick is to stop thinking in SQL. You have to think in terms of the objects you have in memory while your application is running; these should more or less just be saved into the NoSQL database as they are.
Because you can store your full object graph, with child objects, most of the need for joins is eliminated. And if you find you do need one, you will have to bite the bullet, fetch both objects, and join them in your application code.
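In practice, such an application-side join is just two queries glued together in code (collection and field names here are made up):

```java
import static com.mongodb.client.model.Filters.eq;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class AppSideJoin {
    public static void main(String[] args) {
        MongoDatabase db = MongoClients.create("mongodb://localhost:27017").getDatabase("shop");
        MongoCollection<Document> orders = db.getCollection("orders");
        MongoCollection<Document> customers = db.getCollection("customers");

        // query 1: the order
        Document order = orders.find(eq("orderNo", 1001)).first();
        // query 2: the customer it points to; the "join" happens here, in code
        Document customer = customers.find(eq("_id", order.get("customerId"))).first();
        System.out.println(customer.getString("name"));
    }
}
```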
I would definitely use such a database during the planning stages of a project (before development, before even design) to record data whose structure, relationships, and characteristics are not yet known and subject to analysis. I would try to make everything fit a relational model after that.
Fetching related data can be as simple as loading the comments portion of the document corresponding to a user. This is called denormalization: instead of having two sets with a join, you have one document, and everything you need is inside that document. One query, no joins, better performance.
But in some circumstances this may lead to data duplication, so linking from one document to another can be more suitable. In that case, you may be interested in MongoDB's approach to normalization, foreign keys, and joining: see the Database References page and especially the DBRefs feature.
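A small sketch of storing such a link with the Java driver's DBRef type instead of duplicating the referenced document (the names are illustrative):

```java
import com.mongodb.DBRef;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.bson.types.ObjectId;

public class DbRefDemo {
    public static void main(String[] args) {
        MongoCollection<Document> comments = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("blog").getCollection("comments");
        ObjectId authorId = new ObjectId();
        // instead of embedding (and duplicating) the user document,
        // store a typed reference to it; resolving it takes a second query
        comments.insertOne(new Document("text", "Nice post!")
                .append("author", new DBRef("users", authorId)));
    }
}
```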
NoSQL is a classification of database systems that do not conform to the relational database or SQL standard. They have various roots, from distributed internet databases to object databases, XML databases, and even legacy databases. They have recently become popular because of their use in large-scale distributed databases at Google, Amazon, and Facebook.
EclipseLink's NoSQL support allows the JPA API and JPA annotations/XML to be used with NoSQL data. EclipseLink also supports several NoSQL-specific annotations/XML, including @NoSql, which marks a class as mapping to NoSQL data.
EclipseLink's NoSQL support is based on the EIS support offered since EclipseLink 1.0, which allowed persisting objects to legacy and non-relational databases. Both the EIS and NoSQL support use the Java Connector Architecture (JCA) to access the data source, similar to how EclipseLink's relational support uses JDBC. EclipseLink's NoSQL support is extensible to other NoSQL databases through the creation of an EclipseLink EISPlatform class and a JCA adapter.
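A short sketch of what such a mapping looks like, based on EclipseLink's MongoDB examples (the entity and field names are mine; the annotation lives in the org.eclipse.persistence.nosql.annotations package):

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.eclipse.persistence.nosql.annotations.DataFormatType;
import org.eclipse.persistence.nosql.annotations.Field;
import org.eclipse.persistence.nosql.annotations.NoSql;

@Entity
// MAPPED tells EclipseLink to map the entity's fields onto the
// document structure of the NoSQL store (e.g. a MongoDB document)
@NoSql(dataFormat = DataFormatType.MAPPED)
public class Order {
    @Id
    @GeneratedValue
    @Field(name = "_id")
    private String id;

    @Field(name = "description")
    private String description;

    // getters/setters omitted
}
```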