Thanks for the links. We're sticking with Ruby 1.8 for now, and I
haven't had much success with iconv. I don't really know the source
encoding: the content could have been copied and pasted from different
web pages, so the special characters can come from various, mixed
sources. I tried utf8, ascii, and latin1 as the source encoding, with
the //TRANSLIT//IGNORE options, but the special characters just get
stripped out or mapped to some code that doesn't look right when I
open the file in Excel to check it.
Just out of curiosity, does anyone know why this same data imports
into MySQL without a problem? For example, my data has a right-side
curly quote. The unix "less" command highlights it as "<94>". The
Mongo import complains that it is malformed UTF-8. But I can import it
into MySQL without any warnings or errors, and when I look at the data
in phpMyAdmin it looks fine, like a curly quote in a browser. In the
mysql console the character shows up as a question mark... does this
mean that MySQL just imported the byte as-is without converting it to
UTF-8, or just that the mysql console can't display non-ASCII
characters? In the Ruby console, when I pull up that data through an
ActiveRecord object, it looks like "\224". But when I write that
string out to a file and look at it in "less" or "vi", it again looks
like <94>. So nothing was done to it?
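For what it's worth, "\224" is an octal escape for the single byte
0x94, the same byte that "less" shows as <94>. That byte is not valid
UTF-8 on its own, but it is the right double curly quote in
Windows-1252. The sketch below only inspects the byte and shows what
its UTF-8 form would be, assuming Windows-1252 really is where it came
from (a guess, nothing in the data confirms it). It uses only
pack/unpack, which exist in Ruby 1.8.

```ruby
# "\224" (octal) is one raw byte; unpack('C*') lists the byte values.
raw = "\224"
bytes = raw.unpack('C*')
puts bytes.map { |b| '0x%02X' % b }.join(' ')

# Assumption: the byte came from a Windows-1252 page, where 0x94 is
# U+201D (right double quotation mark). pack('U') emits that code
# point as UTF-8 bytes.
quote_utf8 = [0x201D].pack('U')
utf8_bytes = quote_utf8.unpack('C*')
puts utf8_bytes.map { |b| '0x%02X' % b }.join(' ')
```

If the first line printed is a lone byte in the 0x80-0x9F range, the
string was most likely stored untouched rather than converted.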
I'm just so confused. I could have the data team go and manually fix
these UTF-8 issues in the CSV files, but there are literally thousands
of them, and since we don't have to do anything on the MySQL side,
they will ask me why this transliteration can't be done
programmatically... but I haven't found a solution for that.
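One possible programmatic approach for mixed input, sketched under
assumptions: keep any byte sequence that is already well-formed UTF-8,
and treat every other high byte as Windows-1252. The names
CP1252_HIGH, utf8_sequence_length and to_utf8 are hypothetical helpers
made up for this sketch, not part of Ruby or the Mongo driver. Guessing
Windows-1252 for the stray bytes is a heuristic, so text in some other
legacy encoding would come out wrong.

```ruby
# Windows-1252 bytes 0x80-0x9F that differ from Latin-1, mapped to
# Unicode code points. The five bytes missing here (0x81, 0x8D, 0x8F,
# 0x90, 0x9D) are undefined in Windows-1252.
CP1252_HIGH = {
  0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
  0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
  0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
  0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
  0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
  0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
  0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178
}

# Length of the well-formed multi-byte UTF-8 sequence starting at
# bytes[i], or 0 if there is none.
def utf8_sequence_length(bytes, i)
  b = bytes[i]
  len = case b
        when 0xC2..0xDF then 2
        when 0xE0..0xEF then 3
        when 0xF0..0xF4 then 4
        else return 0
        end
  tail = bytes[i + 1, len - 1]
  return 0 if tail.nil? || tail.length < len - 1
  return 0 unless tail.all? { |t| t >= 0x80 && t <= 0xBF }
  # Reject overlong forms, surrogates, and values above U+10FFFF.
  return 0 if b == 0xE0 && tail[0] < 0xA0
  return 0 if b == 0xED && tail[0] > 0x9F
  return 0 if b == 0xF0 && tail[0] < 0x90
  return 0 if b == 0xF4 && tail[0] > 0x8F
  len
end

# Return a copy of raw in which every byte that is not part of valid
# UTF-8 has been re-encoded as if it were Windows-1252.
def to_utf8(raw)
  bytes = raw.unpack('C*')
  out = []
  i = 0
  while i < bytes.length
    b = bytes[i]
    len = b < 0x80 ? 1 : utf8_sequence_length(bytes, i)
    if len > 0
      out.concat(bytes[i, len])          # ASCII or valid UTF-8: keep
      i += len
    else
      # Undefined Windows-1252 bytes fall back to the same-numbered
      # code point, as do 0xA0-0xFF (identical to Latin-1).
      cp = CP1252_HIGH[b] || b
      out.concat([cp].pack('U').unpack('C*'))
      i += 1
    end
  end
  result = out.pack('C*')
  # Ruby 1.9+ only: tag the bytes as UTF-8. Ruby 1.8 has no encodings.
  result.force_encoding('UTF-8') if result.respond_to?(:force_encoding)
  result
end
```

Running each description field through something like
to_utf8(field) before the insert would be the idea. One known blind
spot: two Windows-1252 characters whose bytes happen to form a valid
UTF-8 pair are left alone.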
Is there a way to allow non-UTF-8 characters into MongoDB?
H
> On Sat, Jul 3, 2010 at 4:32 PM, Henry <homanc...@gmail.com> wrote:
> > I have CSV files with product data that I want to import into MongoDB.
> > Some of the records are being rejected on import because the
> > description fields (which were copy-pasted from manufacturers'
> > websites) contain certain characters, like copyright, trademark,
> > plus-or-minus, and degree symbols, that apparently are not UTF-8. I'm
> > using the Ruby driver. Is there anything I can do in the import
> > process to cast those non-UTF-8 characters into something MongoDB can
> > accept, short of removing the characters or doing something like
> > HTML-encoding entities on the whole thing?
>
> > Thanks.
>
> > Henry