[mongodb-user] How can I find and fix invalid utf-8 strings in the documents ?

Nicolas Fouché

Apr 28, 2010, 6:47:30 AM
to mongodb-user
Hi,

This follows on from http://jira.mongodb.org/browse/SERVER-1056, which
is mostly about being able to re-run map/reduce commands after the C++
code of MongoDB has raised an exception.

I am trying to find all documents containing invalid UTF-8 strings, so
that once I have fixed the bug that produced them, I can drop these
documents and re-insert them into MongoDB with my fixed code.

So I wrote a JS function and ran it with db.eval():

function find_invalid_utf8() {
  db.my_collection.find().forEach(function (obj) {
    try {
      obj.email;
    } catch (e) {
      print(e.name + " " + obj._id); // never goes here
    }
  });
}
db.eval(find_invalid_utf8);

Reading "obj.email" raises an exception in engine_spidermonkey.cpp,
and this exception is not caught by the JavaScript code.

Is there a way to catch it?

Or do I have to recompile MongoDB so that it logs the error instead of
raising it? If I query the document from Ruby, it retrieves the invalid
string without error, and I can check whether it is valid UTF-8 with
String#isutf8. The drawback is that this will take longer than using
db.eval.
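
For reference, here is a minimal sketch of that kind of client-side validity check. It is written in JavaScript rather than Ruby to match the script above, and it assumes a runtime that provides the standard TextDecoder API (Node.js 11 or later, or a browser), which the mongo shell of this era does not. The function name isValidUtf8 and the sample bytes are hypothetical.

```javascript
// Sketch: report whether a byte sequence is well-formed UTF-8.
// Assumption: the standard TextDecoder global is available
// (Node.js 11+ or a browser); this is not a mongo shell API.
function isValidUtf8(bytes) {
  try {
    // fatal: true makes decode() throw a TypeError on malformed input
    // instead of silently substituting U+FFFD.
    new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(bytes));
    return true;
  } catch (e) {
    return false;
  }
}

// Hypothetical sample: "cécile" encoded as Latin-1. The lone 0xE9 byte
// starts a multi-byte sequence that is never completed.
const latin1Bytes = [0x63, 0xe9, 0x63, 0x69, 0x6c, 0x65];
// The same text encoded as UTF-8: é becomes the pair 0xC3 0xA9.
const utf8Bytes = [0x63, 0xc3, 0xa9, 0x63, 0x69, 0x6c, 0x65];

console.log(isValidUtf8(latin1Bytes)); // false
console.log(isValidUtf8(utf8Bytes));   // true
```

A check like this has to run on the raw bytes returned by the driver, before anything has decoded them into a string, which is why it lives on the client side rather than inside db.eval.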

Or do you have a better idea for doing this detection? (A mongodump/
mongorestore round trip keeps the document as it is.)

Thanks,

Nicolas

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

Eliot Horowitz

Apr 28, 2010, 2:46:37 PM
to mongod...@googlegroups.com
I will see if I can convert this to a JS error.

Eliot Horowitz

Apr 28, 2010, 3:06:33 PM
to mongod...@googlegroups.com
See: http://jira.mongodb.org/browse/SERVER-1063

This should be changed in master, so it will be in tomorrow's nightly.

Nicolas Fouché

Apr 29, 2010, 4:38:31 AM
to mongodb-user
Perfect. With the above script, it now iterates over all the documents
and logs the UTF-8 problem on one specific key:

Thu Apr 29 10:33:25 connection accepted from 127.0.0.1:64630 #14
decode failed. probably invalid utf-8 string ["c?cilegigi63"@orange.fr]
why: TypeError: malformed UTF-8 character sequence at offset 2
=== Error catched by the JS map function: Error, for document:
4ad044b00f8bc5399300002c ===
decode failed. probably invalid utf-8 string ["c?cilegigi"@orange.fr]
why: TypeError: malformed UTF-8 character sequence at offset 2
=== Error catched by the JS map function: Error, for document:
4ad044b00f8bc5399300002c ===
dbeval slow, time: 855ms veronica_production.$cmd
Thu Apr 29 10:33:26 query veronica_production.$cmd ntoreturn:1
command: { $eval: function find_invalid_utf8() {
var Utf8 = {encode:function (string... } reslen:61 874ms
Thu Apr 29 10:33:26 end connection 127.0.0.1:64630

So I'm ready to write a bigger script to check all the keys containing
the problematic strings.
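
One possible shape for that bigger script, as a sketch only. It assumes, as the log above suggests, that reading a field holding an invalid string now throws an ordinary JavaScript error that try/catch can see. The helper name findUnreadableKeys and the sample document are hypothetical, and the throwing getter only simulates the shell's behaviour so that the sketch can run outside MongoDB.

```javascript
// Collect the paths of all fields whose values cannot be read.
// Nested objects are walked recursively; paths use dot notation.
function findUnreadableKeys(doc, prefix) {
  var bad = [];
  prefix = prefix || "";
  for (var key in doc) {
    var path = prefix + key;
    var value;
    try {
      value = doc[key]; // assumption: decoding happens on access
    } catch (e) {
      bad.push(path);
      continue;
    }
    if (value !== null && typeof value === "object") {
      bad = bad.concat(findUnreadableKeys(value, path + "."));
    }
  }
  return bad;
}

// Stand-in for a document with one undecodable field (hypothetical data).
var fake = { _id: 1, name: "ok", contact: {} };
Object.defineProperty(fake.contact, "email", {
  enumerable: true,
  get: function () {
    throw new TypeError("malformed UTF-8 character sequence");
  }
});

console.log(findUnreadableKeys(fake)); // [ 'contact.email' ]

// Inside db.eval, the same helper could be applied to each document:
//   db.my_collection.find().forEach(function (obj) {
//     var bad = findUnreadableKeys(obj);
//     if (bad.length) print(obj._id + " " + bad.join(", "));
//   });
```

Returning the list of paths rather than printing inside the helper keeps the detection separate from the reporting, so the same function can feed a log line or a list of _id values to re-insert later.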

Thanks.
