Charset in TAB files


Uffe Kousgaard

Feb 18, 2009, 9:31:29 AM
to mapi...@googlegroups.com
In the MapBasic help there is a chapter on the Charset clause. It lists
"Neutral" as one of the charsets where no character conversion is
performed. If you create a table in MapInfo from scratch, the charset is
set to "WindowsLatin1". Is this the same as "Neutral"?

"WindowsLatin1" is not listed in the documentation.

If I create a TAB file with the MITAB library, it is written as "Neutral",
and I would like to know whether this is really the same thing.

Regards
Uffe Kousgaard

Spencer Simpson

Feb 18, 2009, 10:42:55 AM
to mapi...@googlegroups.com
I suspect the other user's question about UTF-8 caused you to look into
Charset just like it caused me to.

Anyway, MapBasic 5.0's help topic on the Charset clause does have a
"WindowsLatin1" entry. Its absence in your version is probably a
documentation error.

Think of a Charset clause the same way you would think of a Coordsys clause.
It defines the "projection" from the character codes in the data to
characters in the display font.

Recall that That Other GIS Program lets you display data with no coordinate
system on top of data with a defined coordinate system. If the coordinates
in the layer without a coordinate system happen to match coordinates in the
other layer's coordinate system, objects in the two layers will line up
correctly; if not, they won't. If both layers have a coordinate system, it
(like MapInfo) will convert coordinates in the map to make the objects line
up.

"Neutral" means "no character conversions" (analogous to "non-earth"). So
if character codes in your display font don't match character codes your
data was created for, MapInfo will display the wrong characters.

If you specify "WindowsLatin1" and there's a character code in your display
font that doesn't match a Windows Latin 1 code (e.g. something from a Mac),
MapInfo should translate the character code to the corresponding code in the
other charset and text should display properly.
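Spencer's "projection" idea can be seen at the byte level: the same byte value names different characters in different charsets. A minimal Python sketch (the codec names "mac_roman" and "cp1252" are Python's spellings, not MapInfo's charset keywords):

```python
# The byte 0x8E encodes 'e with acute' in Mac Roman but 'Z with caron'
# in Windows Latin 1 (cp1252). Reading Mac-encoded bytes with no
# conversion (the "Neutral" behaviour) shows the wrong glyph; decoding
# with the declared charset recovers the original text.
mac_bytes = "café".encode("mac_roman")   # b'caf\x8e'

wrong = mac_bytes.decode("cp1252")       # no conversion: 'cafŽ'
right = mac_bytes.decode("mac_roman")    # declared charset: 'café'

print(wrong)
print(right)
```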

I'm just disappointed there isn't an entry for EBCDIC.

________________________________

Spencer

Uffe Kousgaard

Feb 18, 2009, 11:20:35 AM
to mapi...@googlegroups.com
Hi Spence,

Thanks for the explanation and comparison with coordsys.

My need is actually to write TAB files using MITAB, but from a Unicode
string source. I think I will have to make some changes to MITAB to get
it to produce more exact TAB files than just "Neutral". Of course, true
Unicode isn't possible (and that was a reply to that other user).

It is the MapBasic 9.5 manual that is missing WindowsLatin1 from the list.

Regards
Uffe

Eric_Bl...@mapinfo.com

Feb 18, 2009, 5:03:43 PM
to mapi...@googlegroups.com

I will forward this documentation error on!

The explanation that Spencer gave is pretty reasonable, except for the part about display.

Charset Neutral in a .tab means that we will make no attempt to convert, and we just hope and pray that it works. You might be able to tell that I don't like CharSet Neutral and would have pushed for its removal, were it not for the fact that we have had it in the past. It is like hoping to line up data that has no coordinate system.

What is important to understand is that putting in a CharSet is a declaration, just like a Coordsys. It declares that all the characters in the table are from that set.

At runtime, if the charset matches your system, no conversion occurs. If it is anything else, we convert. Not all conversions are very useful, and any characters that cannot be converted become an underscore ("_"). Note that your system charset will always be one of the Windows character sets. So if you have data in ISO 8859-1 (Latin 1) and your locale is one of the Western European or English-speaking countries, a conversion still occurs.
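Eric's conversion rule, where unconvertible characters become "_", can be imitated in a few lines of Python. The "underscore" error-handler name below is made up for the illustration, and cp1252 stands in for the Windows Latin 1 system set:

```python
import codecs

# MapInfo replaces characters that cannot be converted with "_".
# Python's default replacement is "?", so register a handler that
# substitutes an underscore instead and resumes after the bad character.
codecs.register_error("underscore", lambda err: ("_", err.end))

text = "Ærø Ω"   # the Greek Ω has no code point in Windows Latin 1
converted = text.encode("cp1252", errors="underscore")

print(converted)   # the convertible characters survive, Ω becomes b'_'
```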

I must have missed the thread about UTF-8, so here's what I can tell you about that.
  1. We are looking at support for UTF-8.
  2. Currently there is no absolutely safe way to use UTF-8 data. If the data is entirely ASCII, then I would just put WindowsLatin1 as the character set; the binary codes are identical. If you have data with Western European diacritics (accents, umlauts, tildes, Danish and Swedish special characters), then you will not be able to store it as UTF-8 and have it read correctly by Professional. You will need to convert it to Windows Latin 1 before storing. Conversion can be done fairly easily via the Windows API, and there are probably tools out there.
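Eric mentions converting via the Windows API; as a rough sketch of the same pre-conversion step in Python (cp1252 is Python's name for Windows Latin 1, and the street name is made-up sample data):

```python
# UTF-8 source data containing Western European diacritics, converted
# to Windows Latin 1 (cp1252) before being written into a TAB file.
utf8_data = "Århus Søndergade".encode("utf-8")   # Å is two bytes: 0xC3 0x85

latin1_data = utf8_data.decode("utf-8").encode("cp1252")

print(latin1_data)   # one byte per character again: Å is the single byte 0xC5
```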

Note that what glyphs get used at draw time is a completely different issue. Any strings sent to Windows for drawing are expected to already be converted to the current Windows set.

Eric Blasenheim
Chief Product Architect

Group Technology Office
Pitney Bowes Business Insight




Uffe Kousgaard

Feb 19, 2009, 2:07:00 AM
to mapi...@googlegroups.com
Hi Eric,

Perhaps you could get another "bug" corrected as well: it is not possible to copy from the documentation. If you select some text and press Ctrl+C, nothing happens. This is due to a setting in the software you are using to build the CHM documentation, and it makes it almost impossible to reuse sample code in your own applications.

Regards
Uffe Kousgaard

Gentreau

Feb 19, 2009, 2:31:51 AM
to mapi...@googlegroups.com
Uffe,
 
What does work, however, is right-clicking the selected text and choosing 'Copy'.
Still a pain, but a workaround for the time being.
 
Eric, if you give some attention to the help files, can you please also look at this issue: when I call the help with a keyword selected,
the help file opens with the keyword selected in the left window, but not in the right window. You have to hit Enter to actually read the help.
 
Thanks
Gentreau.
 



Mats Elfström

Feb 19, 2009, 2:51:05 AM
to mapi...@googlegroups.com
Slightly OT, and friendly advice, not marketing:

May I recommend Author-It for making help files (or many other kinds of structured documentation).
It's a database-driven, single-source, multiple-output publishing system which I have used for many years to create and maintain the help system for a complex piece of software.
In Uffe's case above, the trick is to name the .htm topic file and then use that name in the call to the .chm help file from the context-sensitive help button.

Regards, Mats.E



Uffe Kousgaard

Feb 19, 2009, 2:57:23 AM
to mapi...@googlegroups.com
Hi,

I will probably have forgotten about that workaround by the next time it is relevant. I'm too keyboard-focused, I guess.

Regards
Uffe Kousgaard

Uffe Kousgaard

Feb 19, 2009, 3:05:26 AM
to mapi...@googlegroups.com
Still OT:

Author-It starts at 5000 USD. Next time you should try Help & Manual at
500 USD, with more or less the same specifications as you list. I use it
for all my software.

Regards
Uffe Kousgaard

Lars I. Nielsen (GisPro)

Feb 19, 2009, 3:35:09 AM
to mapi...@googlegroups.com
Hi Eric,

Always good to read your in-depth comments.


> Currently there is no absolutely safe way to use UTF-8 data. If the data is entirely ASCII, then I would just put WindowsLatin1 as the character set. The binary codes are identical. If you have data with Western European diacritics (accents, umlauts, tildes, Danish, Swedish special characters) then you will not be able to store them as UTF-8 and have them read correctly by Professional. You will need to convert them to Windows Latin1 before storing.

Isn't this a perfect example of a case where one should use Charset Neutral? ;-)


> We are looking at support for UTF-8.

As far as I know, UTF-8 isn't a character set as such; it's just an 8-bit encoding. You'd still have to work with different 8-bit charsets, like Latin 1 vs. Latin 2 etc., unless of course you're always encoding/using the full 16-bit Unicode charset.

Support for 16-bit Unicode (in 8-bit Windows) is of course a whole different ballgame.

Or am I missing a point or two here?


Best regards / Med venlig hilsen
Lars I. Nielsen
GIS & DB Integrator
GisPro


Eric_Bl...@mapinfo.com skrev:

Uffe Kousgaard

Feb 19, 2009, 3:49:56 AM
to mapi...@googlegroups.com
Hi,

As I understand it, UTF-8/16/32 are all full character sets with different encodings (binary representations). Codepages such as WindowsLatin1 are subsets.

So UTF-8 is just as correct for Unicode as UTF-16.

XML (and GML, KML) uses UTF-8, so generally UTF-8 is more relevant in the GIS world than UTF-16.

This means you can always translate from a codepage to any UTF encoding, while the other way may not be possible.
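Uffe's point about the two directions can be checked directly in Python, with cp1252 playing the role of the WindowsLatin1 codepage:

```python
# A codepage string always survives the round trip through a UTF encoding...
s = "Ærø façade"
assert s.encode("cp1252").decode("cp1252").encode("utf-8").decode("utf-8") == s

# ...but the reverse direction can fail: code points outside the
# codepage simply have nowhere to go.
try:
    "中文".encode("cp1252")
    print("representable")
except UnicodeEncodeError:
    print("not representable in cp1252")
```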

See more here: http://en.wikipedia.org/wiki/Unicode

Regards
Uffe

Cox, Stuart TRAN:EX

Feb 19, 2009, 4:17:12 PM
to mapi...@googlegroups.com
I've just looked at Author-It's site and gave up after failing to quickly locate a page addressing the making of .chm or .hlp help files. Lots of "we're big, we're great", but a tough site to find stuff on.

Stu Cox
Project Management Technician
Southern Interior Region
Ministry of Transportation and Infrastructure
342-447 Columbia St.,
Kamloops, BC V2C 2T3
p: 250-828-4320
f: 250-828-4229
stuar...@gov.bc.ca

 



Eric_Bl...@mapinfo.com

Feb 19, 2009, 6:47:29 PM
to mapi...@googlegroups.com

Good and reasonable questions.

First, I will forward the list of issues with .CHM as a group. Ctrl+C does work for the MapBasic Help and for the MapXtreme.NET Reference, so that must be an oversight.

On the nature of UTF-8. First of all, "charset" in Professional and "codepage" are interchangeable, so forgive me if I mix up the terms. "Encoding" is technically different, although the terms are often used the same way.
  • Yes, UTF-8 is a binary encoding of Unicode and therefore different from subsets such as codepages, or what we call character sets. UTF-8 is just as correct for Unicode. However, UTF-8 optimizes ASCII data for size. Many other characters encode as two bytes in UTF-8, which is not bad compared with UTF-16 (where everything is 2 bytes). If most of your data were Chinese or Japanese (3-4 bytes), then UTF-8 encodes larger than UTF-16.
  • It is not really correct to say that XML, GML, and KML ALWAYS use UTF-8. UTF-8 is the default in XML: if you don't put an encoding declaration at the top, all parsers assume UTF-8. However, other encodings are valid. Remember, that is a declaration, not a conversion! You can't write out files in MapBasic, for example, and just "say" they are UTF-8 or anything else.
  • Experience has shown that not all XML readers will respect the encoding unless it is UTF-8. We had an issue where Google Earth stopped accepting Windows Japanese.
  • Charset Neutral will not work for UTF-8 unless, as I said, all your data is 127 or below (ASCII), in which case your system set (such as WindowsLatin1) would also work. I cannot think of a reason to use Neutral other than that you don't know what the charset is and you just want to hope for the best. Presumably, you could take data marked Neutral and try it on Windows machines with every system set we support until one worked, but that only works if the data is in some Windows character set!
  • However, if the data is partially non-ASCII but encoded as UTF-8, neither a standard Charset nor Neutral will work. Again, the reason is that we will do no conversion, and any characters beyond ASCII will be interpreted as if they were already in the system set. I am sure Uffe and Lars both have data with the Danish Å character. (If this does not come through in your email encoding, it is a capital A with a small circle on top.) Its value in Windows Latin 1 is 197 (0xC5). However, in UTF-8 that character is represented as two bytes: 195 (0xC3) 133 (0x85). If that UTF-8 data were read by Professional in any of our "western" countries as charset Neutral or WindowsLatin1, I would expect two characters to be shown: Ã (capital A with a tilde on top) and then three dots (…), as those are the two characters assigned to those byte values in Windows Latin 1.
  • What we are discussing is supporting UTF-8 as a character set (not technically correct, I know) and being able to successfully read UTF-8 encoded data, assuming the resulting characters have a match in the current codepage. If they don't, then "_" will replace the character. Part of the reason is that data vendors are delivering files in UTF-8 even though all the characters are from one codepage, and we would like to be able to read them. You could then save them in your standard character set.
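Eric's Å example can be verified byte for byte in Python, with cp1252 standing in for Windows Latin 1:

```python
# One byte in Windows Latin 1, two bytes in UTF-8.
assert "Å".encode("cp1252") == b"\xc5"       # 197
assert "Å".encode("utf-8") == b"\xc3\x85"    # 195, 133

# Read those two UTF-8 bytes with no conversion, as if they were
# already Windows Latin 1, and you get exactly the mojibake Eric
# predicts: A-tilde followed by an ellipsis.
misread = "Å".encode("utf-8").decode("cp1252")
print(misread)   # 'Ã…'
```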
Hope this helps!

Eric Blasenheim
Chief Product Architect
Pitney Bowes Business Insight




Shindigo

Feb 20, 2009, 2:13:29 PM
to MapInfo-L
I'm not sure why I can't reply to the last post of this thread, but...

Pardon me for jumping in at the end of the discussion; I have a
follow-up question.

I've been given a sample road database file in TAB format for a Chinese
city. The road names in the TAB file are all numbers, which reference a
separate data file that maps the numbers to Chinese road names. The
encoding of that data file (according to Visual Studio) is "Unicode
(Big-Endian) - Codepage 1201". The file looks fine in the Visual Studio
editor.

I have written a .NET program that converts the source TAB file to a
TAB file with different columns, one of which will be the names of the
roads in Chinese. I am using MITAB to create the TAB file. MITAB creates
the file with "!charset Neutral", and when I open it in Geoset Manager
all the road names appear as "????". I have tried various guesses at how
!charset should be set to show the Chinese correctly, but I really
haven't a clue what values can go here.

What can I do to get the resulting TAB file into a form that will show
Chinese in Geoset Manager?

Thanks for any help you can provide.

Shindigo.


Warren Vick

Feb 21, 2009, 7:18:19 AM
to mapi...@googlegroups.com
Hello Shindigo,

Based on my experience supporting Japanese data in MapInfo software, you will need to run a local (i.e. Chinese) version of Windows, or use a program like Microsoft AppLocale (a free download) to trick the application into thinking it's running on a Chinese system. As long as you have the font support loaded, you should see your road names appear. You see "?" and "_" when characters are not decoded correctly.
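The "????" symptom is consistent with Chinese text being pushed through a Western codepage. A small Python illustration (cp936 is Python's name for the Simplified Chinese Windows codepage; which !charset keyword MITAB and MapInfo expect for it is a separate question):

```python
name = "北京路"   # a sample Chinese road name

# Forced through Windows Latin 1, every character is unrepresentable
# and degrades to "?", which matches what Geoset Manager shows.
print(name.encode("cp1252", errors="replace"))   # b'???'

# Encoded with the Simplified Chinese codepage (cp936/GBK), the name
# survives a round trip intact, two bytes per character.
gbk = name.encode("cp936")
assert gbk.decode("cp936") == name
print(len(gbk))   # 6
```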

Regards,
Warren Vick
Europa Technologies Ltd.
http://www.europa.uk.com