Problem with unicode and Tidy


Kim Mosley

Mar 4, 2011, 11:35:47 AM
to BBEdit Talk
When I use Tidy (either clean or reflow) the unicode is converted to
an em dash... which then is displayed as trash in Safari. Please help!
I need to use Tidy for formatting... but I don't want it to change
these.


Winedale Spring Festival—Last weekend of March or First weekend
in April

Robert A. Rosenberg

Mar 6, 2011, 11:11:27 PM
to bbe...@googlegroups.com
At 08:35 AM -0800 on 03/04/2011, Kim Mosley wrote about Problem with
unicode and Tidy:

I assume that when you say the Unicode is converted to an em dash, you
mean that a numeric entity (&#8212;) is converted into the character
it represents (i.e. the em dash "—"). You also state this is displayed
as junk, by which I assume you mean as 3 characters (xE2 x80 x94, i.e.
â € ”). That means that it is being saved as UTF-8. Is your charset
UTF-8 or ISO-8859-1? If the latter, then that is your problem. You
must tell Safari that the HTML is UTF-8 for it to display correctly
by converting the 3-byte UTF-8 sequence back into the em dash
glyph.
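
The three-byte explanation above can be checked with a short Python sketch. It is illustrative only: the variable names are made up for this example, and it assumes the browser falls back to Windows-1252, which is how browsers commonly treat a page labelled ISO-8859-1.

```python
# Illustrative sketch: how one em dash becomes three junk characters.
em_dash = "\u2014"                       # the character that &#8212; refers to

utf8_bytes = em_dash.encode("utf-8")     # what ends up in the saved file
hex_bytes = utf8_bytes.hex(" ")          # the three bytes, as hex

# Assumption: the browser reads the file as Windows-1252 (the usual
# fallback for pages labelled ISO-8859-1) instead of UTF-8.
misread = utf8_bytes.decode("cp1252")    # three separate characters

# Read with the right charset, the same bytes are one em dash again.
reread = utf8_bytes.decode("utf-8")

print(hex_bytes)
print([f"U+{ord(ch):04X}" for ch in misread])
print(reread == em_dash)
```

The bytes in the file never change; only the charset used to interpret them does, which is why declaring UTF-8 is enough to fix the display.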

Kim Mosley

Mar 7, 2011, 8:46:59 AM
to bbe...@googlegroups.com
What I'd like is for the unicodes to be let alone when I use Tidy. It seems that is the only way for browsers to display the right character. Is there a way to use Tidy, and not have the code of the page altered (the unicodes converted to their actual character)?

Thanks,

Kim


--
Kim Mosley
mrkim...@gmail.com
Website: http://kimmosley.com
Blog: http://kimmosley.com/blog

Robert Huttinger

Mar 7, 2011, 9:55:06 AM
to bbe...@googlegroups.com
Is there a way to use regex? You can do a search and replace for anything that isn't Unicode.

Rich Siegel

Mar 7, 2011, 10:19:23 AM
to bbe...@googlegroups.com
On Monday, March 7, 2011, Kim Mosley <mrkim...@gmail.com> wrote:

>What I'd like is for the unicodes to be let alone when I use Tidy. It
>seems that is the only way for browsers to display the right
>character. Is there a way to use Tidy, and not have the code of the
>page altered (the unicodes converted to their actual character)?

First: I think there's a terminology problem here that is mixing
things up. :-)

It sounds like you're using the term "unicodes" to refer to HTML
entities: they begin with an ampersand, have a few numeric
characters (or sometimes a name), and end with a semicolon. For
example: "&#8212;" or "&copy;". "Unicode" has a specific
meaning, and it's something else entirely. :-)

Second: Note that Tidy does many things, and converting entities
to actual Unicode characters is one of them. Depending on why
you're using Tidy, this can either be a help, or a nuisance.
Tidy isn't just a pretty printer: part of what it does is
rewrite your code in ways that it thinks are appropriate -- even
if you don't. :-)

Third: If Tidy is converting entities to actual characters, and
they are displaying incorrectly in the browser, then your
document either has an incorrect character set declaration, or
the web server is misconfigured and providing the document in
the wrong character set. If you're simply previewing the
documents and you observe incorrect display, then a missing or
incorrect character set declaration is the most likely explanation.

Finally: if you're simply trying to pretty-print your code, the
built-in formatters (not Tidy) are the way to go. (See the
formatter commands on the Utilities submenu of the Markup menu.)
The formatters will change the layout of your code, but will not
perform entity conversion or any other content transformations.
The "Pretty Print" option is probably your best bet.
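
To make the entity versus character distinction concrete, here is a small Python sketch using the standard library's html.unescape. It only mimics the kind of conversion described above; it is not what Tidy runs internally, and the sample string is invented for illustration.

```python
import html

# Hypothetical sample line containing two HTML entities.
source = "Winedale Spring Festival&#8212;&copy; 2011"

# html.unescape replaces character references with the characters
# they stand for, roughly the transformation Tidy applies.
converted = html.unescape(source)

print(converted)
print(ascii(converted))   # shows the non-ASCII characters as escapes
```

After the conversion the file contains real non-ASCII characters, so how they display depends entirely on the declared character set.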

R.
--
Rich Siegel Bare Bones Software, Inc.
<sie...@barebones.com> <http://www.barebones.com/>

Someday I'll look back on all this and laugh... until they
sedate me.

Kim Mosley

Mar 7, 2011, 10:20:15 AM
to bbe...@googlegroups.com
This must be a simple fix... otherwise Tidy is useless. 

Starting with this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">

<html>
<head>
  <title></title>
</head>

<body>
  <P>Santa&#8211;Claus</P>
</body>
</html>

When I do Tidy the &#8211; is replaced by an en dash, which sometimes displays correctly and sometimes doesn't. I want to keep &#8211; as it is. How do I do that with Tidy?

Thanks,

Kim

Alex Satrapa

Mar 7, 2011, 6:47:35 PM
to bbe...@googlegroups.com
On 08/03/2011, at 02:20 , Kim Mosley wrote:

> When I do Tidy the &#8211; is replaced by an en dash, which sometimes displays correctly and sometimes doesn't. I want to keep &#8211; as it is. How do I do that with Tidy?

Interestingly enough, when I run your sample through Tidy, it replaces the &#8211; with a hyphen-minus (i.e. &#x002D;).

This appears to be because the "remove bogus markup" option of the "Tidy" command sends the "-bare" option to the html-tidy program. This option, according to the html-tidy documentation, "strips out smart quotes and em dashes, etc." If so, I would suggest to Bare Bones that the "remove bogus markup" option be renamed to "Use braindead punctuation" ('cos it's removing "smart" quotes).

So assuming the problem you describe is in fact the en-dash being replaced with a hyphen, your two options are:
- Use the format command (Markup -> Utilities -> Format…) (to avoid Tidy's obscure vagaries)
- Un-tick the "remove bogus markup" option when running the "Tidy" command
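
If the entities really do need to survive, one possible workaround (not something suggested in this thread, and only a sketch) is to turn non-ASCII characters back into numeric entities after formatting. The Python below assumes the text has already been read into a string; the function name to_numeric_entities is made up for this example.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal entity."""
    # "xmlcharrefreplace" is a built-in error handler that emits &#NNNN;
    # for anything that cannot be encoded as ASCII.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# An en dash (U+2013) as it would appear after entity conversion.
tidied = "<p>Santa\u2013Claus</p>"
restored = to_numeric_entities(tidied)
print(restored)
```

This cannot bring back a dash that has already been replaced with a plain hyphen, since a hyphen is ordinary ASCII and passes through unchanged.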

Hope this helps!
Alex

LuKreme

Mar 7, 2011, 6:48:49 PM
to bbe...@googlegroups.com
On Mar 7, 2011, at 8:20, Kim Mosley <mrkim...@gmail.com> wrote:

> When I do Tidy the &#8211; is replaced by an en dash, which sometimes displays correctly and sometimes doesn't. I want to keep &#8211; as it is. How do I do that with Tidy?

Short answer: you don't.

Longer answer: you don't need to. Set the character encoding of your HTML properly to UTF-8 and everything will work.
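
As a quick check of whether a page declares its encoding at all, the following Python sketch looks for a charset in a meta tag. It is a rough regular-expression test, not a real HTML parser; the function name declared_charset and the sample strings are assumptions made for illustration, and it ignores any charset sent in the HTTP headers.

```python
import re
from typing import Optional

# Matches both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
_META_CHARSET = re.compile(r'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(html_text: str) -> Optional[str]:
    """Return the charset named in a meta tag, or None if there is none."""
    match = _META_CHARSET.search(html_text)
    return match.group(1).lower() if match else None


# Hypothetical pages: one without and one with a charset declaration.
no_declaration = "<html><head><title></title></head><body></body></html>"
with_declaration = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=UTF-8"><title></title></head>'
    "<body></body></html>"
)

print(declared_charset(no_declaration))
print(declared_charset(with_declaration))
```

A page that returns None here leaves the browser to guess the encoding, which would fit the "sometimes displays correctly and sometimes doesn't" symptom.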

Alex Satrapa

Mar 7, 2011, 7:18:36 PM
to bbe...@googlegroups.com
On 08/03/2011, at 10:48 , LuKreme wrote:

> Longer answer: you don't need to. Set the character encoding of your HTML properly to UTF-8 and everything will work.

Only if you don't have “remove bogus markup” ticked.

One opines that “remove bogus markup” is mislabelled, since curly quotes*, em-dash, en-dash and various other punctuation aren't ‘bogus’ markup so much as ‘unfashionable according to some people.’

So perhaps “VT100 conformant punctuation”, or “remove bogan markup” would be better, since both express the opinion behind the process :)

*Curly quotes are Unicode. “Smart Quotes” is a mis-feature of Microsoft products which used proprietary extensions to ANSI/ASCII to display left/right single/double quotation marks. The “smart” part refers to the quotes being automatically curled depending on where you typed them, so " followed by a word would be corrected to '“', while " following a word would be corrected to ”. As Microsoft used proprietary encoding, “Smart Quotes” are bad. Curly quotes (i.e.: left and right quotation marks) expressed as Unicode are not “Smart Quotes” since they are Unicode, and thus should be displayed as the typographer intended.

If someone can give me an example of where “curly” quotes are “bad” I'll stop using them — perhaps there is a popular screen reader used by blind people that chokes on Unicode punctuation.

Alex

Doug McNutt

Mar 7, 2011, 7:56:53 PM
to bbe...@googlegroups.com
At 11:18 +1100 3/8/11, Alex Satrapa wrote:
>If someone can give me an example of where "curly" quotes are "bad" I'll stop using them - perhaps there is a popular screen reader used by blind people that chokes on Unicode punctuation.
>

There is nothing like an overzealous email client that defaults to changing ASCII quotes to curly versions when the information being transmitted is a bunch of shell script or C source.

And more on topic it appears that Tidy is trying to be helpful. My experience with such things is pretty much always bad. Microsoft Excel is terrible that way.

We are on the verge of HTML-5. Does any one know if such things as &xxx; are going to get deprecated in favor of unicode?

--
--> The best programming tool is a soldering iron <--

Alex Satrapa

Mar 7, 2011, 10:23:53 PM
to bbe...@googlegroups.com
On 08/03/2011, at 11:56 , Doug McNutt wrote:

> There is nothing like an overzealous email client that defaults to changing ASCII quotes to curly versions when the information being transmitted is a bunch of shell script or C source.

Ahh, the subtle difference between "Smart Quotes" and curly quotes.

> And more on topic it appears that Tidy is trying to be helpful. My experience with such things is pretty much always bad. Microsoft Excel is terrible that way.

From the web page of ‘demoroniser’ (http://www.fourmilab.ch/webtools/demoroniser/):

> A little detective work revealed that, as is usually the case when you encounter something shoddy in the vicinity of a computer, Microsoft incompetence and gratuitous incompatibility were to blame. Western language HTML documents are written in the ISO 8859-1 Latin-1 character set, with a specified set of escapes for special characters. Blithely ignoring this prescription, as usual, Microsoft use their own "extension" to Latin-1, in which a variety of characters which do not appear in Latin-1 are inserted in the range 0x82 through 0x95--this having the merit of being incompatible with both Latin-1 and Unicode, which reserve this region for additional control characters.


Thus my expectation is that “Remove bogus markup” is written by people who don’t mind throwing the baby (typographical punctuation) out with the bathwater (Microsoft’s “Smart Quotes”). They’re trying to be ‘helpful’ but in their typically overzealous manner they aren’t settling for tilting at windmills, they’re razing them. Rather than improve screen readers to be Unicode compatible, these people are dumbing down the content to suit the screen readers. Would they insist that Georges Seurat use only rectilinear grids for, “A Sunday Afternoon on the Island of La Grande Jatte”?

> We are on the verge of HTML-5. Does any one know if such things as &xxx; are going to get deprecated in favor of unicode?

Apart from a small set, HTML 5 deprecates named entities in favour of Unicode numerical entities, expressed in decimal. That is, rather than ‘&ldquo;’ or ‘&#x201c;’ use ‘&#8220;’.

I don’t follow the HTML5 community, so I’m not aware of any plans to deprecate entities in favour of straight Unicode. I prefer using Unicode characters, since that’s what I’m using in my day to day work.

Alex

LuKreme

Mar 7, 2011, 11:44:09 PM
to bbe...@googlegroups.com
On Mar 7, 2011, at 17:56, Doug McNutt <doug...@macnauchtan.com> wrote:
> We are on the verge of HTML-5. Does any one know if such things as &xxx; are going to get deprecated in favor of unicode?

Entities are not deprecated, but with UTF-8 they are largely unnecessary.

Charlie Garrison

Mar 8, 2011, 12:37:20 AM
to bbe...@googlegroups.com
Good afternoon,

Do you have any references for that? I've found a couple of
references that say they are, and I also got errors (or maybe
warnings) from w3c validator when using any but just a few
*named* entities.

This page indicates there are only 5 *named* entities which are
still valid:

<http://www.html-5.com/cheat-sheet/html-character-codes.html>

Of course with HTML5 spec still being draft, things could change.


Charlie

--
Ꮚ Charlie Garrison ♊ <garr...@zeta.org.au>

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
http://www.ietf.org/rfc/rfc1855.txt

LuKreme

Mar 8, 2011, 12:48:32 AM
to bbe...@googlegroups.com




On Mar 7, 2011, at 22:37, Charlie Garrison <garr...@zeta.org.au> wrote:

> This page indicates there are only 5 *named* entities which are still valid:
>
> <http://www.html-5.com/cheat-sheet/html-character-codes.html>

Ah, I had not seen that.

> Of course with HTML5 spec still being draft, things could change.

I don't mind entities going away, but I would not be surprised to see them allowed, even if marked as officially deprecated.

I am pretty certain they will continue to work.

Watts Martin

Mar 9, 2011, 6:27:22 PM
to BBEdit Talk
On Mar 7, 7:23 pm, Alex Satrapa <gr...@goldweb.com.au> wrote:

> Apart from a small set, HTML 5 deprecates named entities in favour of Unicode numerical entities, expressed in decimal. That is, rather than ‘&ldquo;’ or ‘&#x201c;’ use ‘&#8220;’.
>
> I don’t follow the HTML5 community, so I’m not aware of any plans to deprecate entities in favour of straight Unicode. I prefer using Unicode characters, since that’s what I’m using in my day to day work.

I don't see anything in the HTML5 spec that says anything about this
deprecation, although it's admittedly a rather dense spec and I may
easily have missed it. Nonetheless, there's a table in the HTML5 spec
right now reading "Named character references": "this table lists the
character reference names supported by HTML and the code points to
which they refer." This at least suggests that you're still going to
be able to use named entities.

http://dev.w3.org/html5/spec/Overview.html#named-character-references

Alex Satrapa

Mar 9, 2011, 7:52:04 PM
to bbe...@googlegroups.com
On 10/03/2011, at 10:27 , Watts Martin wrote:

> On Mar 7, 7:23 pm, Alex Satrapa <gr...@goldweb.com.au> wrote:
>
>> Apart from a small set, HTML 5 deprecates named entities in favour of Unicode numerical entities, expressed in decimal. That is, rather than ‘&ldquo;’ or ‘&#x201c;’ use ‘&#8220;’.
>>
>> I don’t follow the HTML5 community, so I’m not aware of any plans to deprecate entities in favour of straight Unicode. I prefer using Unicode characters, since that’s what I’m using in my day to day work.
>
> I don't see anything in the HTML5 spec that says anything about this deprecation …

Hrmm… it appears that I have misrepresented my in-house recommendations as worldwide standards :\

Apologies to the list.

Charlie Garrison

Mar 9, 2011, 8:17:43 PM
to bbe...@googlegroups.com
Good afternoon,

Yep, I saw that too, but one has to infer that named entities
are OK. Doing a validation with the w3c tool I get errors (or
maybe it was warnings) when using all but a few named entities;
so one would infer from that they are deprecated.

And this page explicitly states all but a few (5 of them) are no
longer valid:

<http://www.html-5.com/cheat-sheet/html-character-codes.html>

So until there is some clear documentation (that I can find),
I'm going with the most clear documentation rather than docco
that I have to make inferences from.

OK, I hadn't previously read the "8.1.4 Character references"
section in the URL you gave above. That does seem to clearly
indicate that the list of named references is allowed.

And I just tried validating pages again which contain named
references, and I'm not getting any errors now. I don't know
whether they have since fixed the validator, or maybe I was
dealing with a compound error problem when I got the named
entity errors (eg. maybe I had an <?xml?> stanza at the time).

Yep, that was it, when including an xml stanza then named
entities (except 5 of them listed on page I gave above) are not
allowed. Eg. I get the following error when using "&copy;":

reference to undeclared general entity copy

There is a way to declare the named entities, but I found it
easier to just use numeric entities, especially since the Entity
palette in BBEdit lists all the numbers with the names.

The <?xml ...?> stanza is only *recommended* for html5, and is
only needed if serving pages as application/xhtml+xml (rather than
text/html) so I chose to leave off the xml stanza.

<http://www.html-5.com/tags/xml-declaration/index.html>
<http://www.html-5.com/tags/doctype-declaration/index.html> this
page shows how to declare named entities (although I couldn't
get it to validate)

Bottom line, based on my testing (I'd still like to find some
clear documentation), named entities are OK for html5 documents
served as text/html (with no <?xml?> stanza) and named entities
are NOT OK for documents served as application/xhtml+xml (with
<?xml?> stanza).

The following named entities are always OK since they are needed
as part of the XML spec:

&amp; &lt; &gt; &quot; &apos;
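
The difference can be reproduced with Python's standard library, used here only as a stand-in for the validator: an XML parser with no DTD knows just the five predefined entities, while HTML-style decoding accepts the full named list. The helper name parses_as_xml is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is well-formed XML without a DTD."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# The five predefined XML entities are accepted.
print(parses_as_xml("<p>&amp; &lt; &gt; &quot; &apos;</p>"))
# A named HTML entity is undeclared as far as XML is concerned.
print(parses_as_xml("<p>&copy;</p>"))
# The numeric form of the same character is fine.
print(parses_as_xml("<p>&#169;</p>"))
# HTML-style decoding knows the named entity.
print(html.unescape("&copy;") == "\u00a9")
```

This matches the observation above that numeric entities are the safer choice whenever a document might be processed as XML.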

LuKreme

Mar 10, 2011, 8:21:39 PM
to bbe...@googlegroups.com
On 9-Mar-2011, at 18:17, Charlie Garrison wrote:
>
> Bottom line, based on my testing (I'd still like to find some clear documentation), named entities are OK for html5 documents served as text/html (with no <?xml?> stanza) and named entities are NOT OK for documents served as application/xhtml+xml (with <?xml?> stanza).
>
> The following named entities are always OK since they are needed as part of the XML spec:
>
> &amp; &lt; &gt; &quot; &apos;

And UNnamed entities are, as far as I can tell, perfectly ok.

&#xF8FF; for example


--
It was where the city kept all those things it occasionally needed but
was uneasy about, like the Watch-house, the theatres, the prison and the
publishers. It was the place for all those things which might go off
bang in unexpected ways.
