Handling the long dash character

Vamsee Kanakala

Jun 25, 2009, 10:39:21 AM
to nokogi...@googlegroups.com
Hi,

I'm noticing that Nokogiri is not converting the long dash character (–)
properly. It seems to be parsed correctly, but calling to_s on it gives
some UTF junk. For example, this sentence:

Social Media News and Web Tips – Mashable – The Social Media Guide

Gets converted to:

Social Media News and Web Tips \342\200\223 Mashable \342\200\223 The
Social Media Guide

And does not render properly. What am I missing?

Thanks,
Vamsee.

Aaron Patterson

Jun 29, 2009, 2:23:23 PM
to nokogi...@googlegroups.com

You should either a) convert the UTF-8 into HTML entities, or b)
declare the charset as UTF-8 (I am assuming you're trying to view the
output in a browser).
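
As an aside, the \342\200\223 in your output is not really junk: it is the three UTF-8 bytes of U+2013 (EN DASH) written as octal escapes, which is how Ruby 1.8's inspect prints non-ASCII bytes (that you are on 1.8 is my assumption). A minimal check in plain Ruby 1.9 or later, with made-up method names:

```ruby
# Hypothetical helpers for illustration; only core Ruby is used.

# Reinterpret raw bytes as UTF-8 text without changing them.
def bytes_as_utf8(raw)
  raw.b.force_encoding(Encoding::UTF_8)
end

# Render each byte of a string as a backslash-octal escape.
def octal_escapes(text)
  text.bytes.map { |byte| format('\\%o', byte) }.join
end

dash = bytes_as_utf8("\342\200\223")
puts dash.valid_encoding?        # whether the three bytes form valid UTF-8
puts format('U+%04X', dash.ord)  # code point of the decoded character
puts octal_escapes(dash)         # back to the escaped form
```

So the bytes themselves are fine; what matters is which encoding the thing displaying them assumes.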

For solution a, you could do something like the following:

require 'nokogiri'

doc = Nokogiri::HTML('<div>Social Media News and Web Tips – Mashable – The Social Media Guide</div>')
puts doc.at('div').child.to_html
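
If to_html does not emit entities in your setup, converting by hand is one fallback. This is only a sketch in core Ruby: `to_numeric_entities` is a name I made up, and it assumes the text is valid UTF-8 and that &, < and > are already escaped.

```ruby
# Hypothetical helper: replace every non-ASCII character with a decimal
# numeric character reference. ASCII characters are left untouched.
def to_numeric_entities(text)
  text.gsub(/[^\x00-\x7F]/) { |ch| format('&#%d;', ch.ord) }
end

puts to_numeric_entities('Social Media News and Web Tips – Mashable')
```

The result is pure ASCII, so it renders the same whatever charset the browser assumes.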

For b, add the charset to your Content-Type header:

Content-Type: text/html; charset=utf-8

http://www.w3.org/International/O-HTTP-charset

--
Aaron Patterson
http://tenderlovemaking.com/

Vamsee Kanakala

Jul 2, 2009, 10:45:56 AM
to nokogi...@googlegroups.com
Aaron Patterson wrote:
> You should either a) convert the UTF-8 into HTML entities, or b)
> declare the charset as UTF-8 (I am assuming you're trying to view the
> output in a browser).
>

Sorry for the late reply; I didn't get a chance to test earlier. I don't
think I explained the problem properly, so I'll try again with examples,
which will probably be clearer:


doc1 = Nokogiri::HTML(open('http://mashable.com'))
doc1.at('title')
=> <title>Social Media News and Web Tips – Mashable – The Social
Media Guide</title>


However, it doesn't have a problem with these Japanese characters:


doc2 = Nokogiri::HTML(open('http://konieczny.be/unicode.html'))
doc2.at('i')
=> <i>Have no idea how to write 時掌握天下 or other <em>kanji</em> sign?
We will help you!
</i>


Both files have their Content-Type set to utf-8, so I'm wondering why
the first example is showing odd characters? Also, using to_html is not
converting the utf-8 characters to equivalent html entities; it just
shows characters like this:


doc2.at('i').to_html
=> "<i>Have no idea how to write
\346\231\202\346\216\214\346\217\241\345\244\251\344\270\213\302\240 or
other <em>kanji</em> sign? We will help you!\n </i>"


Shouldn't this give the html entity equivalent of the utf-8 character?
Something like &#26178;&#25484;&#25569;&#22825;&#19979; perhaps?

Thanks,
Vamsee.

Aaron Patterson

Jul 2, 2009, 12:01:15 PM
to nokogi...@googlegroups.com
On Thu, Jul 2, 2009 at 7:45 AM, Vamsee Kanakala<vaml...@gmail.com> wrote:
>
> Aaron Patterson wrote:
>> You should either a) convert the UTF-8 into HTML entities, or b)
>> declare the charset as UTF-8 (I am assuming you're trying to view the
>> output in a browser).
>>
>
> Sorry for the late reply; I didn't get a chance to test earlier. I don't
> think I explained the problem properly, so I'll try again with examples,
> which will probably be clearer:
>
>
> doc1 = Nokogiri::HTML(open('http://mashable.com'))
> doc1.at('title')
> => <title>Social Media News and Web Tips – Mashable – The Social
> Media Guide</title>

In this example, unfortunately, the encoding can't be intuited until
*after* the title tag. I believe there is a bug open for libxml2 for
this problem. But if you do this:

doc1 = Nokogiri::HTML(open('http://mashable.com'), nil, 'UTF-8')
doc1.at('title')

You'll get the right output. Since this particular page is XHTML, you
could use the XML parser. If you use the XML parser, you do not need to
provide an encoding, and you'll get the right content:

doc1 = Nokogiri::XML(open('http://mashable.com'))
puts doc1.at('title').content
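
To see why a wrong guess produces that "â": assuming the parser falls back to a single-byte encoding such as ISO-8859-1 until it reaches the charset declaration (my reading of the behavior, not something confirmed here), each byte of the dash becomes its own character. A plain-Ruby sketch with hypothetical helper names:

```ruby
# Hypothetical helpers for illustration; only core Ruby is used.

# Decode UTF-8 bytes as if they were ISO-8859-1, one character per byte.
def misread_as_latin1(utf8_text)
  utf8_text.b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Undo the misreading: turn each character back into one byte, then
# label those bytes as UTF-8 again.
def reread_as_utf8(garbled)
  garbled.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
end

garbled = misread_as_latin1('–')
p garbled.codepoints            # one code point per original byte
puts reread_as_utf8(garbled)    # the original dash again
```

The first of the three characters is "â"; the other two are control characters that usually do not display, which would match the output in your example.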

> However, it doesn't have a problem with these Japanese characters:
>
>
> doc2 = Nokogiri::HTML(open('http://konieczny.be/unicode.html'))
> doc2.at('i')
> => <i>Have no idea how to write 時掌握天下 or other <em>kanji</em> sign?
> We will help you!
> </i>

This is well after the encoding declaration, so I'm not surprised this
one works. :-)

> Both files have their Content-Type set to utf-8, so I'm wondering why
> the first example is showing odd characters? Also, using to_html is not
> converting the utf-8 characters to equivalent html entities; it just
> shows characters like this:

Actually, it looks like this page has its content set to ISO-8859-2.
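
If you want to check what a response actually declares, the charset is a parameter of the Content-Type value. A minimal sketch in plain Ruby; `charset_from_content_type` is a made-up helper, and the pattern only covers the simple `type; charset=name` form:

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type
# value. Returns the lowercased charset name, or nil if none is declared.
def charset_from_content_type(value)
  match = value.to_s.match(/;\s*charset\s*=\s*"?([^";\s]+)"?/i)
  match && match[1].downcase
end

p charset_from_content_type('text/html; charset=ISO-8859-2')
p charset_from_content_type('text/html')
```

Whatever it returns can be passed as the third argument to Nokogiri::HTML, as in the example above.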

> doc2.at('i').to_html
> => "<i>Have no idea how to write
> \346\231\202\346\216\214\346\217\241\345\244\251\344\270\213\302\240 or
> other <em>kanji</em> sign? We will help you!\n </i>"
>
>
> Shouldn't this give the html entity equivalent of the utf-8 character?
> Something like &#26178;&#25484;&#25569;&#22825;&#19979; perhaps?

I suspect it's to do with the version of libxml2 you're using. I'm
using 2.7.3, and I get the entity equivalents. I find this
interesting though. In an HTML document, I don't think you're
required to use entities. If the browser knows the encoding of the
document, why bother with entities?

Anyway, if you're running nokogiri 1.3.x, run "nokogiri -v" to check
the version of libxml2 you're using. If you can, try upgrading to
2.7.3.

Hope that helps!

Vamsee Kanakala

Jul 2, 2009, 2:15:15 PM
to nokogi...@googlegroups.com
Aaron Patterson wrote:
> In this example, unfortunately, the encoding can't be intuited until
> *after* the title tag. I believe there is a bug open for libxml2 for
> this problem. But if you do this:
>
> doc1 = Nokogiri::HTML(open('http://mashable.com'), nil, 'UTF-8')
> doc1.at('title')
>
> You'll get the right output.

Thanks so much, that's exactly what I needed. In retrospect, I should've
guessed it from the documentation. Sorry for the bother.

> I suspect it's to do with the version of libxml2 you're using. I'm
> using 2.7.3, and I get the entity equivalents. I find this
> interesting though. In an HTML document, I don't think you're
> required to use entities. If the browser knows the encoding of the
> document, why bother with entities?
>

You're right, I just wanted the html entities to cross check if it was
parsing the UTF chars correctly. Didn't realize it's a libxml2 problem.
Thanks again.


Vamsee.

Aaron Patterson

Jul 2, 2009, 9:48:29 PM
to nokogi...@googlegroups.com
On Thu, Jul 2, 2009 at 11:15 AM, Vamsee Kanakala<vaml...@gmail.com> wrote:
>
> Aaron Patterson wrote:
>> In this example, unfortunately, the encoding can't be intuited until
>> *after* the title tag.  I believe there is a bug open for libxml2 for
>> this problem.  But if you do this:
>>
>>   doc1 = Nokogiri::HTML(open('http://mashable.com'), nil, 'UTF-8')
>>   doc1.at('title')
>>
>> You'll get the right output.
>
> Thanks so much, that's exactly what I needed. In retrospect, I should've
> guessed it from the documentation. Sorry for the bother.

No problem. Dealing with encoding is a PITA if you ask me.

>> I suspect it's to do with the version of libxml2 you're using.  I'm
>> using 2.7.3, and I get the entity equivalents.  I find this
>> interesting though.  In an HTML document, I don't think you're
>> required to use entities.  If the browser knows the encoding of the
>> document, why bother with entities?
>>
>
> You're right, I just wanted the html entities to cross check if it was
> parsing the UTF chars correctly. Didn't realize it's a libxml2 problem.
> Thanks again.

No problem, glad to help. :-)
