Hello,
just a short message to announce Cheméo, http://www.chemeo.com, a
search engine for chemical properties built on top of MongoDB. I wrote
up a fairly extensive explanation of the tools and software used,
including quite a few MongoDB tips, here:
http://chemeo.com/doc/technology
A short summary of the most interesting part with respect to MongoDB:
I need to index a large number of properties for each chemical
component and search them by min/max value ranges. This caused
trouble with indexing (70+ properties at the moment, and the number
will grow). So in the end I followed the advice of the MongoDB authors
and gave each component a special key used only for indexing. Here
is a copy/paste of the relevant part of the post:
{i: [
{k: 'myprop',
n: 10.1, // Min value
x: 123.0}, // Max value
{k: 'otherprop',
n: -1234.1,
x: 254.0},
// you can add more properties
]}
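To make the layout concrete, here is a minimal sketch of how such a key could be built from a plain map of property ranges. The helper name buildIndexKey, the `name` field and the input shape are my assumptions for illustration, they are not taken from the Cheméo code.

```javascript
// Hypothetical helper: turn { prop: [a, b] } into the indexable `i` array,
// one { k, n, x } entry per property (k = key, n = min, x = max).
function buildIndexKey(props) {
  return Object.keys(props).sort().map(function (key) {
    var range = props[key];
    return {
      k: key,
      n: Math.min(range[0], range[1]), // min value
      x: Math.max(range[0], range[1])  // max value
    };
  });
}

// Example document; `name` is an illustrative field only.
var component = {
  name: "example",
  i: buildIndexKey({ myprop: [10.1, 123.0], otherprop: [-1234.1, 254.0] })
};
```

Keeping every property in one array means the number of indexes stays fixed no matter how many properties are added.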
Then you need to think about the query. It will always specify the
key and then the min, the max, or both. So we need two indexes:
* Index 1 on ('i.k', 'i.n', 'i.x'), which can also be used to
search on the key only and on the key plus the min value;
* Index 2 on ('i.k', 'i.x'), which can be used to search on the key
plus the max value.
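The two indexes above can be written down as key patterns, the kind of object you would hand to createIndex (ensureIndex in older shells) on your collection. The small pickIndex function is only my sketch of the reasoning in the list, it is not how the MongoDB query planner actually decides.

```javascript
// Key patterns for the two compound indexes described above.
var INDEX_1 = { "i.k": 1, "i.n": 1, "i.x": 1 };
var INDEX_2 = { "i.k": 1, "i.x": 1 };

// Hypothetical helper: which index fits one $elemMatch clause?
// A clause always has k, plus optionally n (min) and/or x (max).
function pickIndex(clause) {
  var hasMin = clause.n !== undefined;
  var hasMax = clause.x !== undefined;
  if (hasMax && !hasMin) {
    return INDEX_2; // key + max only: (k, x) is a direct fit
  }
  return INDEX_1;   // key only, key + min, or key + min + max
}
```

For example, pickIndex({ k: "tc", x: { $gte: 500.0 } }) returns INDEX_2, while a clause on k and n returns INDEX_1.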
This means that now, when looking for a component with the $all and
$elemMatch operators, you will always hit the indexes, yeah! But then
someone will run a search which translates to something like this:
{ i: { $all: [ { $elemMatch: { k: "mw", n: { $lte: 400.0 } } },
               { $elemMatch: { k: "tc", x: { $gte: 500.0 } } },
               { $elemMatch: { k: "hf", n: { $lte: 500.0 },
                               x: { $gte: 50.0 } } }
             ] } }
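A query of this shape is easy to generate from a list of user constraints. The sketch below is one possible way to do it; buildQuery and the constraint fields (key, nLte, xGte) are names I made up for the example.

```javascript
// Hypothetical helper: build the $all / $elemMatch query from constraints.
// Each constraint has a key and optionally an upper bound on the stored
// min (nLte) and/or a lower bound on the stored max (xGte).
function buildQuery(constraints) {
  var clauses = constraints.map(function (c) {
    var match = { k: c.key };
    if (c.nLte !== undefined) { match.n = { $lte: c.nLte }; }
    if (c.xGte !== undefined) { match.x = { $gte: c.xGte }; }
    return { $elemMatch: match };
  });
  return { i: { $all: clauses } };
}

var query = buildQuery([
  { key: "mw", nLte: 400.0 },
  { key: "tc", xGte: 500.0 },
  { key: "hf", nLte: 500.0, xGte: 50.0 }
]);
```

The clauses come out in the same order as the input list, which matters for the problem described next.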
And your server will fall over, because mw is the molecular weight and
Mongo will use the index for the first clause in the $all query and
then scan the matching documents for the other properties without
using the index. In that case, even if only 50 components match the hf
clause, if the mw clause matches 50,000 components, Mongo will scan
50,000 components. Oops, the wrong part of the index is used. You need
to know your data to order your query so that the most selective
clause comes first and hits the index.
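One way to apply this is to reorder the clauses before sending the query, using a rough per-key estimate of how many components each one matches. This is a sketch under the assumption, described above, that the first clause of $all drives the index use. The function name is hypothetical, and of the estimates only mw (50,000) and hf (50) come from the example; the tc value is invented.

```javascript
// Hypothetical helper: sort $elemMatch clauses so the key expected to
// match the fewest components comes first. Unknown keys go last.
function orderClauses(clauses, estimatedMatches) {
  function estimate(clause) {
    var count = estimatedMatches[clause.$elemMatch.k];
    return count === undefined ? Infinity : count;
  }
  return clauses.slice().sort(function (a, b) {
    var ea = estimate(a);
    var eb = estimate(b);
    return ea < eb ? -1 : (ea > eb ? 1 : 0);
  });
}

// Rough counts: mw and hf from the example above, tc is made up.
var estimates = { mw: 50000, tc: 5000, hf: 50 };

var ordered = orderClauses([
  { $elemMatch: { k: "mw", n: { $lte: 400.0 } } },
  { $elemMatch: { k: "tc", x: { $gte: 500.0 } } },
  { $elemMatch: { k: "hf", n: { $lte: 500.0 }, x: { $gte: 50.0 } } }
], estimates);

var orderedQuery = { i: { $all: ordered } };
```

With these estimates the hf clause moves to the front, so the index narrows the search to about 50 components before anything is scanned.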
The rest of the post, http://chemeo.com/doc/technology, talks about
things like Node.js/Python/Open Babel/pyparselet and the other tools
used.
The indexing stuff is really what took me some time to get right as I
am coming from the RDBMS world.
loïc
--
Indefero, project management & code hosting -
http://www.indefero.net
Pluf PHP5 Framework inspired by Django -
http://www.pluf.org
Cheméo, high quality chemical properties -
http://www.chemeo.com