How to get the encoding of a file?

Khamis Abuelkomboz

Oct 5, 2008, 8:36:52 AM

Hi

Files can have different encodings.

I want to open, modify, and save files in the encoding they were created in. The files are modified in the Tcl
text editor.

Is there a general way to do this in Tcl? Something like [encoding info $file]?

thanks
khamis

--
Try Code-Navigator on http://www.codenav.com
a source code navigating, analysis and developing tool.
It supports following languages:
* C/C++
* Java
* .NET (including CSharp, VB.Net and other .NET components)
* Classic Visual Basic
* PHP, HTML, XML, ASP, CSS
* Tcl/Tk,
* Perl
* Python
* SQL,
* m4 Preprocessor
* Cobol

Donal K. Fellows

Oct 5, 2008, 3:08:09 PM

Khamis Abuelkomboz wrote:
> files could have different encodings.
>
> I want to open/modify/save files in the encoding they are created in.
> Modifying files using the tcl text editor.
>
> is there a general way to do this in tcl? something like [encoding info
> $file]

There's no truly general way to guess the encoding of an arbitrary file,
even if you look inside it. The issue is that you might have a file
containing just ASCII characters but which is "in ISO 8859-1" or 8859-15
or UTF-8 or one of loads of others. (Note that some file formats are
self-describing and so don't have this problem.)

However, it's often possible to do better than that in practice through
a mixture of guessing (Tcl almost always gets [encoding system] right,
and that's the default encoding) and suitable metadata (i.e. MIME and
HTTP headers). This works almost all the time, at least as long as
people only want to type characters on their keyboard. (This last point
sounds obvious, but is actually a heavy restriction for Power Users.)
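
When the encoding is known (or assumed to be the system default), reading and
writing a file in that same encoding might look roughly like the sketch below.
The proc name editFile and its arguments are hypothetical, made up for this
illustration; only [open], [fconfigure -encoding], [read], [puts] and
[encoding system] are standard Tcl commands. It assumes Tcl 8.5 or later
because of the {*} syntax.

```tcl
# Hypothetical helper (not part of Tcl): rewrite a file in place, using the
# same encoding for reading and writing. If no encoding is given, fall back
# to the system encoding, which is also Tcl's default for channels.
proc editFile {path transform {enc ""}} {
    if {$enc eq ""} {
        set enc [encoding system]
    }

    # Read the whole file, decoding its bytes with the chosen encoding.
    set in [open $path r]
    fconfigure $in -encoding $enc
    set text [read $in]
    close $in

    # Apply the caller's modification; $transform is a command prefix
    # that receives the text as its last argument.
    set text [{*}$transform $text]

    # Write back with the same encoding so the file stays consistent.
    set out [open $path w]
    fconfigure $out -encoding $enc
    puts -nonewline $out $text
    close $out
    return $enc
}
```

A call such as `editFile notes.txt {string toupper} iso8859-1` would upper-case
the file while keeping it in ISO 8859-1. Note that the default line-ending
translation still applies, so line endings may be normalised on the way.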

Ideally everyone would use UTF-8 and the problem would go away. :-)

Donal.

Khamis Abuelkomboz

Oct 5, 2008, 4:25:29 PM

to Donal K. Fellows

Donal K. Fellows wrote:

Hi Donal

Thank you for the clarification, now I understand the problem.
I wondered how Microsoft's Notepad program can guess what encoding a file has. I found
references to the Microsoft library mlang.dll.
See http://msdn.microsoft.com/en-us/library/aa741220(VS.85).aspx

Another interesting project is the ICU project. It has a C/C++ implementation of the library that
can be compiled for both Windows and Unix, which makes more sense for a Tcl extension based on
this library.
See http://www.icu-project.org/apiref/icu4c/

Both solutions would need to be adapted for the Tcl world. Has somebody already done this work, or
does anyone know of a solution?

regards

Donal K. Fellows

Oct 5, 2008, 7:22:30 PM

Khamis Abuelkomboz wrote:
> Thank you for the clarification, now I understand the problem.
> I wondered how Microsoft's Notepad program can guess what encoding
> a file has. I found references to the Microsoft library mlang.dll.
> See http://msdn.microsoft.com/en-us/library/aa741220(VS.85).aspx

That looks for Byte-Order Marks (in a few formats) and otherwise falls
back to guessing the value that Tcl's [encoding system] is supposed to
match. (Even better things can be achieved with XML parsing, but that's
getting a bit further off-topic.)

Duplicating the BOM-guesser is a few lines of code, and would be a fine
addition to tcllib. If someone hasn't done it already, that is. :-)
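
A minimal sketch of such a BOM check follows. The proc name bomEncoding is
hypothetical, and this is not the tcllib implementation (if one exists). It
returns a descriptive label rather than a Tcl encoding name, on the assumption
that the names accepted by [fconfigure -encoding] for UTF-16 and UTF-32 vary
between Tcl versions, so the mapping is left to the caller.

```tcl
# Hypothetical helper: inspect the first bytes of a file for a Byte-Order
# Mark and return a label for the encoding it indicates, or the empty
# string when no BOM is found.
proc bomEncoding {path} {
    set f [open $path r]
    fconfigure $f -translation binary
    set head [read $f 4]
    close $f

    # Turn the leading bytes into a lowercase hex string, e.g. "efbbbf68".
    set hex ""
    binary scan $head H* hex

    # Longer marks are tested first, because the UTF-32LE mark begins with
    # the UTF-16LE one. (A UTF-16LE file whose first character is U+0000
    # would be misread as UTF-32LE; that ambiguity is inherent to BOMs.)
    switch -glob -- $hex {
        0000feff* { return UTF-32BE }
        fffe0000* { return UTF-32LE }
        efbbbf*   { return UTF-8 }
        feff*     { return UTF-16BE }
        fffe*     { return UTF-16LE }
        default   { return "" }
    }
}
```

An empty result means "no BOM", in which case falling back to
[encoding system] matches the behaviour described above.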

> Both solutions would need to be adapted for the Tcl world. Has somebody
> already done this work, or does anyone know of a solution?

I'd favour having it scripted in a library rather than making it a
universal feature of the Tcl library. This is because it is the script
author who knows when they are writing an application where having a
guess is appropriate.

Donal.

suchenwi

Oct 7, 2008, 5:15:56 AM

On 5 Oct., 14:36, Khamis Abuelkomboz <kha...@web.de> wrote:
> I want to open/modify/save files in the encoding they are created in. Modifying files using the tcl
> text editor.
>
> is there a general way to do this in tcl? something like [encoding info $file]

No - because one byte value above 7F can be valid in most if not all
encodings, even though it corresponds to a different character. And
except for XML (and Unicode BOM), encoding is not stored in a file.
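
To illustrate the point, the short sketch below decodes the same single byte
with several encodings. The proc name decodeByte is hypothetical; the encoding
names are ones that a standard Tcl installation normally ships, which is an
assumption worth checking with [encoding names].

```tcl
# Hypothetical helper: decode one byte with the given encoding and return
# the resulting Unicode code point as an integer.
proc decodeByte {enc byte} {
    set ch [encoding convertfrom $enc $byte]
    return [scan $ch %c]
}

# The byte 0xA4 decodes without complaint in every one of these encodings,
# but the character it stands for differs (for instance the currency sign
# U+00A4 in ISO 8859-1 versus the euro sign U+20AC in ISO 8859-15).
foreach enc {iso8859-1 iso8859-15 koi8-r} {
    puts [format "%-11s U+%04X" $enc [decodeByte $enc "\xA4"]]
}
```

Since no decoding step fails, nothing in the byte itself tells you which
reading was intended.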

But you could do a statistical approach: for a number of files of
different known encodings, compute the relative frequency of byte
values 00 .. FF. For the unknown file in question, compute that as
well, and see which of the known "encoding classes" matches best.
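
A rough sketch of that idea, assuming Tcl 8.5 or later: build a byte-frequency
profile per known encoding, then pick the profile closest to the unknown
data. The proc names and the squared-difference distance are choices made for
this illustration only; a real detector would need large reference samples and
probably a better measure.

```tcl
# Hypothetical helper: relative frequency of each byte value 00..FF in
# $data, returned as a list of 256 numbers.
proc byteProfile {data} {
    set counts [lrepeat 256 0]
    set bytes {}
    binary scan $data cu* bytes
    foreach b $bytes {
        lset counts $b [expr {[lindex $counts $b] + 1}]
    }
    set n [llength $bytes]
    if {$n == 0} {
        return $counts
    }
    set profile {}
    foreach c $counts {
        lappend profile [expr {double($c) / $n}]
    }
    return $profile
}

# Sum of squared differences between two profiles (smaller means closer).
proc profileDistance {p q} {
    set sum 0.0
    foreach a $p b $q {
        set sum [expr {$sum + ($a - $b) * ($a - $b)}]
    }
    return $sum
}

# $known is a dict mapping an encoding label to a reference profile built
# from files whose encoding is known. Returns the best-matching label.
proc bestMatch {data known} {
    set target [byteProfile $data]
    set best ""
    set bestDist -1
    dict for {label profile} $known {
        set d [profileDistance $target $profile]
        if {$bestDist < 0 || $d < $bestDist} {
            set best $label
            set bestDist $d
        }
    }
    return $best
}
```

The data passed in should be raw bytes, i.e. read from a channel configured
with -translation binary, so that the profile reflects the file's bytes rather
than decoded characters.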

schlenk

Oct 7, 2008, 6:40:19 AM

On Oct 7, 11:15 am, suchenwi <richard.suchenwirth-

There is code for something like that inside the Mozilla project, and
there is a Python port of it, but that's mostly tuned for web pages, so
it might not be useful for your problem. Have a look at:
http://chardet.feedparser.org/docs/faq.html

If needed you can use that from Tcl via Tclpython or rewrite it in Tcl
(or wrap the original Mozilla C++ code as a Tcl extension).

Michael

LeandroAB

Oct 15, 2008, 8:10:09 AM

On 5 Oct, 16:08, "Donal K. Fellows" <donal.k.fell...@manchester.ac.uk>
wrote:

That's not always the case; sometimes UTF-8 does not work.

For example, I'm writing a configuration file for Apache 2.0.59.
When I saved it in UTF-8 encoding, Apache couldn't read it.

But in many cases it works.

LeandroAB