"can't use sharded collection from db.eval" Why?

708 views

Nod

Jun 6, 2012, 12:07:31 PM
to mongodb-user
Hello.

Why does Mongo have the limitation that I cannot run db.eval (a
JavaScript function) against a sharded collection?

Maybe MongoDB has an undocumented way to do it?

---------------------------------------------------------------------
My problem is the following:

I have a "views_raw" collection, which is sharded; every 5 minutes
about 100K rows are written to it.
One row looks like:
{
"_id": ObjectId("4fcf7349d144ad591c016534"),
"sid": 111,
"uh": "www.site.kz",
"cid": "4facd5427701a",
"ses": "4fcf7278f1f19",
"ip": NumberInt(1597732605),
"t": ISODate("2012-06-06T15:08:40.0Z"),
...
}

I need to run a task that writes information about the last access of
each session to another sharded collection, "last_ses_active":
{
"_id": {sid: 111, ses:"4fcf7278f1f19"},
"t": ISODate("2012-06-06T15:08:40.0Z"),
}
From this collection I can then obtain (via MR) the total number of
unique sessions (ses) for each site (sid) over any arbitrary period of
time. For example: site 111 had 123,003 unique sessions in the last 24
hours.
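As a sketch of that counting step (a plain-JS stand-in for the map/reduce, assuming "last_ses_active" holds exactly one document per (sid, ses) pair, as described above):

```javascript
// Count unique sessions per site over a time window, given documents
// shaped like {_id: {sid, ses}, t: Date} from "last_ses_active".
// Each document is already one unique (sid, ses) pair, so counting
// documents per sid in the window counts unique sessions.
function uniqueSessionsPerSite(docs, from, to) {
  var counts = {};
  docs.forEach(function (doc) {
    if (doc.t >= from && doc.t <= to) {
      counts[doc._id.sid] = (counts[doc._id.sid] || 0) + 1;
    }
  });
  return counts;
}
```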


So I want to run JavaScript code via
$MongoDB->command(array('$eval' => $code, 'nolock' => true));
$code = '
var out = db.last_ses_active;
db.views_raw.find(
    {"t": {"$gte": ISODate("2012-05-29T18:00:00.0Z"),
           "$lte": ISODate("2012-05-29T18:59:59.0Z")}}
).forEach(
    function (x) {
        out.update(
            {_id: {sid: x.sid, ses: x.ses}},
            {_id: {sid: x.sid, ses: x.ses}, t: x.t},
            true // upsert ("upsert = true" merely assigns a global and happens to pass true)
        );
    }
);
';
I can parallelize this command, add checks, catch errors, AND IT WILL
BE VERY FAST.
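Since db.eval() refuses sharded collections, the same loop can be driven from the client through mongos instead. The helper below (a hypothetical name, not from the thread) builds the per-document upsert; mongos can route it because the query carries the full _id:

```javascript
// Build the upsert for one raw view document, keyed by (sid, ses).
// This is the same write the eval script issues, but shaped so a
// client-side cursor can send it through mongos.
function lastSessionUpsert(x) {
  var key = {sid: x.sid, ses: x.ses};
  return {
    query:   {_id: key},
    update:  {_id: key, t: x.t},
    options: {upsert: true}
  };
}

// Client-side usage (shell sketch):
// db.views_raw.find({t: {$gte: from, $lte: to}}).forEach(function (x) {
//   var op = lastSessionUpsert(x);
//   db.last_ses_active.update(op.query, op.update, op.options.upsert);
// });
```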

Of course, I could use MapReduce with {merge: "last_ses_active"}, but
it is very expensive and slow:
- MapReduce converts between BSON and JS and back many times (I can't
use the jsMode flag, because there will be too much data in the
future, more than 500K unique keys);
- MapReduce creates temporary collections (which my task does not
need; I can write the data directly).

Of course, I could do it from the client. But I don't want to move a
huge amount of data between client and server.

Please help me solve my problem: how to count unique sessions for
each site over some period of time (from now into the past).
Thanks.

P.S. And why doesn't JavaScript work on sharded collections? Indeed!

Nod

Jun 6, 2012, 12:15:40 PM
to mongodb-user
Simply put, I need an analog of MySQL's "INSERT ... SELECT ..." in
Mongo for sharded collections, something I can run from the client
side.

Nod

Jun 7, 2012, 2:23:58 AM
to mongodb-user
Please vote in JIRA for the improvement "db.eval() for sharded
collections": https://jira.mongodb.org/browse/SERVER-5731

Sam Millman

Jun 7, 2012, 3:34:50 AM
to mongod...@googlegroups.com
The schema and method could be heavily simplified if you orient your data around your querying. You're trying to do dynamic aggregation, which will just become harder and harder the more keys you get; you should look into pre-aggregation of essential data. Also, I am not sure why you house two session collections? It seems as though with some re-orientation of your schema you could house it all in a single insert and then aggregate on a single atomic upsert/modification of already existing data.
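A minimal sketch of the pre-aggregation this suggests, simulated in plain JS: each hit becomes one atomic upsert into an hour bucket with $addToSet semantics. The collection name, bucket granularity, and helper names here are illustrative assumptions, not from the thread:

```javascript
// Simulates the single atomic upsert per hit:
//   db.site_hours.update(
//     {_id: {sid: hit.sid, hour: hourOf(hit.t)}},
//     {$addToSet: {ses: hit.ses}},
//     true // upsert
//   );
function hourOf(t) {
  return Math.floor(t.getTime() / 3600000); // whole hours since the epoch
}

function applyHit(buckets, hit) {
  var key = hit.sid + ":" + hourOf(hit.t);
  if (!buckets[key]) {
    buckets[key] = {sid: hit.sid, hour: hourOf(hit.t), ses: []}; // upsert
  }
  if (buckets[key].ses.indexOf(hit.ses) === -1) {
    buckets[key].ses.push(hit.ses); // $addToSet: only unseen sessions added
  }
  return buckets;
}
```

Hourly reports then read pre-built buckets instead of scanning raw views.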

On 7 June 2012 07:23, Nod <pavel.ch...@gmail.com> wrote:
Please vote in JIRA for the improvement "db.eval() for sharded
collections": https://jira.mongodb.org/browse/SERVER-5731

--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to
mongodb-user...@googlegroups.com
See also the IRC channel -- freenode.net#mongodb

Sam Millman

Jun 7, 2012, 9:06:39 AM
to mongod...@googlegroups.com
Message accidentally sent to me:

Actually, the full document in the "views_raw" collection looks like this:
{
   "_id": ObjectId("4fd09e12d144ad3b020152ac"),
   "sd": NumberInt(15498), // shard day - part of the shard key, for parallel writes and efficient hourly MapReduce reports
   "sh": NumberInt(12),    // shard hour - same purpose
   "sr": NumberInt(754),   // shard random - same purpose
   "sid": NumberInt(53255), // site id
   "uh": "knigo.info", // site host
   "up": "\/publ\/avtorskie_proekty\/sharzhi_s_alekseeva\/30", // site uri
   "cid": "4fd09c85647e5", // client id
   "ses": "4fd09c8564e99", // session id
   "ip": NumberInt(1595598202), // client ip
   "t": ISODate("2012-06-07T12:20:21.0Z"), // request time
   "hs": NumberInt(1), // number of hits (the "front-end" button servers do 5-minute pre-aggregation)
   "ix": {
     "0": NumberInt(55),  // IX - encodes browser, screen size, operating system
     "1": NumberInt(180),
     "2": NumberInt(770),
     "3": NumberInt(862),
     "4": NumberInt(1973)
   },
   "cm": {
     "0": NumberInt(57389)  // CM - encodes country and mobile device
   }
}

Data is written to this collection from 3 "front-end" servers
(actually the buttons shown to users). These servers process
user-agents and cookie data, obtain the IP, determine the country, and
do other pre-processing.

One type of report is per-site statistics on unique hosts, IPs, and
users for the past 24 hours. This report is generated every hour.
So the most preferable way, out of many options, to build this report
is to create a collection {_id: {site_id, session}, t: DateTime()},
make an index on the field "t", and delete data older than 25 hours.

Of course, I don't want to write this collection from the "front-end"
servers, because I would also need to write similar collections for
users and IPs, and in the future I may need additional collections for
new reports.

So my decision is to write raw data from the button into one
collection, and do the rest of the processing inside Mongo.
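The retention step described above (index on "t", delete anything older than 25 hours) only needs a cutoff computation; a small sketch, with the era's shell calls shown as comments:

```javascript
// Cutoff for the retention window: "now" minus the given number of hours.
function retentionCutoff(now, hours) {
  return new Date(now.getTime() - hours * 3600000);
}

// Periodic maintenance (shell sketch); the index on "t" makes this cheap:
// db.last_ses_active.ensureIndex({t: 1});
// db.last_ses_active.remove({t: {$lt: retentionCutoff(new Date(), 25)}});
```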


---------- Forwarded message ----------
From: Sam Millman <sam.m...@gmail.com>
Date: 7 June 2012 14:05
Subject: Re: "can't use sharded collection from db.eval" Why?
To: Nod <pavel.ch...@gmail.com>


In gmail you can just forward your original mail from the sent folder to mongodb-user and it will forward the entire thing (the full message is quoted above).


On 7 June 2012 13:58, Nod <pavel.ch...@gmail.com> wrote:
Sorry, I mistakenly clicked "Reply to author", so my answer (the
previous letter) went only to your personal mail. Could you send me a
message at pavel.ch...@gmail.com so that I can repost it for everyone
in the group? It is difficult for me to write such a message again )))

Thank you.


On 7 Jun, 13:34, Sam Millman <sam.mill...@gmail.com> wrote:
> The schema and method could be heavily simplified if you orient your
> data around your querying. You're trying to do dynamic aggregation, which
> will just become harder and harder the more keys you get; you should look
> into pre-aggregation of essential data. Also, I am not sure why you house
> two session collections? It seems as though with some re-orientation of
> your schema you could house it all in a single insert and then aggregate
> on a single atomic upsert/modification of already existing data.
>