File encoding auto detection

Neil Hodgson

Aug 2, 2011, 5:43:23 AM
to scite-interest
There has been another discussion of encoding auto-detection on the
feature request tracker:
http://sourceforge.net/tracker/?func=detail&atid=352439&aid=3324341&group_id=2439

There was a similar discussion on this list about a year ago but
from a different perspective. I'm now fairly convinced that there is
no single approach which will satisfy everybody, so it's time for that
old programming favourite: 'another level of indirection'. By
providing an option to run a user-supplied program after loading a
file, it is possible to have that program be responsible for setting
the encoding and any other properties desired. SciTE collects these
properties into a new per-buffer property set which overrides other
property sets. The called program prints out a list of property
settings, one per line just like a .properties file.

While it is somewhat slow to call an external program, it only
takes between 50 and 100 milliseconds to run an example script on my
main installation so this is probably OK for many people.

If the program thinks the file is in Shift-JIS, it may print out, for example:

code.page=932
character.set=128

The property for the command is currently called
props.discovery.script and it may be set up something like:

props.discovery.script=python C:\Users\Neil\FileDetect.py "$(FilePath)"

This code is available from

http://www.scintilla.org/scite.zip Source
http://www.scintilla.org/wscite.zip Windows 7 or Vista executable

The copy of Scintilla in that download uses Direct2D/DirectWrite so
will not run on XP. For those on XP, copy in a Scintilla DLL from a
working SciTE installation. This is also working on Linux but has to
be compiled.

An example script is attached to this mail. It works by reading up
to 1000 lines from the file, discarding any encodings that have
errors. When finished, it prints out the properties associated with
the highest priority working encoding. This isn't a high quality
detector and I'm sure better ones will be written. For debugging, it
also prints out other successful encoding properties but commented
out. Finally it prints out a change to the caret colour so you can see
it was called:

code.page=936
character.set=134
#code.page=949
#character.set=129
caret.fore=#4499FF
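
For readers without the attachment, here is a minimal sketch of the elimination approach described above. It is not the attached FileDetect.py. The candidate list, its ordering, the helper names and the UTF-8 entry (code.page=65001 with character.set=0) are assumptions for illustration; only the 932/128, 936/134 and 949/129 pairs come from the examples in this message.

```python
import sys
from itertools import islice

# (Python codec, code.page, character.set) in priority order.
# Assumed list: only the three CJK pairs appear in the message above.
CANDIDATES = [
    ("utf-8", 65001, 0),
    ("cp932", 932, 128),
    ("cp936", 936, 134),
    ("cp949", 949, 129),
]

def surviving_candidates(lines, candidates=CANDIDATES):
    """Return the candidates whose codec decodes every line without error."""
    alive = list(candidates)
    for line in lines:
        still_alive = []
        for candidate in alive:
            try:
                line.decode(candidate[0])
            except UnicodeDecodeError:
                continue  # discard an encoding as soon as it fails
            still_alive.append(candidate)
        alive = still_alive
        if not alive:
            break
    return alive

def property_lines(alive):
    """Format survivors as property settings; all but the first are commented out."""
    out = []
    for index, (_, code_page, char_set) in enumerate(alive):
        prefix = "" if index == 0 else "#"
        out.append(f"{prefix}code.page={code_page}")
        out.append(f"{prefix}character.set={char_set}")
    return out

def detect(path, max_lines=1000):
    """Read up to max_lines lines as bytes and return the property lines."""
    with open(path, "rb") as handle:
        lines = list(islice(handle, max_lines))
    return property_lines(surviving_candidates(lines))

if __name__ == "__main__" and len(sys.argv) > 1:
    # SciTE would pass the file name through $(FilePath).
    print("\n".join(detect(sys.argv[1])))
```

Splitting on newline bytes is only safe here because none of the assumed candidates uses 0x0A inside a multi-byte character; UTF-16 or UTF-32 would need different handling.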

Neil

FileDetect.py

Jingcheng Zhang

Aug 2, 2011, 12:47:27 PM
to scite-i...@googlegroups.com
Thanks very much! This feature is very useful for CJK users.


--
Best regards,
Jingcheng Zhang
Beijing, P.R.China

Neil Hodgson

Aug 4, 2011, 8:42:37 PM
to scite-interest
The file encoding detection command has been committed. The
property name was changed to 'command.discover.properties'.

Available from Hg or from
http://www.scintilla.org/scite.zip Source
http://www.scintilla.org/wscite.zip Windows executable

The Windows executable is back to running on XP - no DirectWrite code.

Neil

Sylvain Brunerie

Aug 7, 2011, 7:12:43 PM
to scite-i...@googlegroups.com
I'm glad to see that a compromise has eventually been found. I will
try this soon. Thanks!

— Sylvain Brunerie
http://innsbay.toile-libre.org



jmfauth

Aug 15, 2011, 12:02:34 PM
to scite-interest


On 2 Aug, 11:43, Neil Hodgson <nyamaton...@gmail.com> wrote:
>    There has been another discussion of encoding auto-detection
> ...


Hi,

I came across this thread more or less by chance. I'm a long-time
SciTE user on Windows and I know the Scintilla control quite
well. That much by way of introduction.

On to the subject.

Detecting the character coding of a file, or of any byte stream,
is simply impossible. A byte stream that is supposed to represent
plain text is valid only if you know its coding. No more, no less.

The only thing that can be done is to detect the "token" in the byte
stream which holds the coding information: a BOM in the five Unicode
UTF formats, or an explicit coding declaration, e.g.
# -*- coding: cp1252 -*- for Python files. SciTE does this
job very well.
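
As an illustration of what looking for such a token involves (a sketch only, not SciTE's implementation), the following checks for the five BOMs and for a Python-style coding declaration in the first two lines. The function name is hypothetical and the pattern is adapted from PEP 263.

```python
import codecs
import re

# Longer BOMs first, so that UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Coding declaration pattern, adapted from PEP 263.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")

def declared_encoding(data):
    """Return the coding announced by a BOM or declaration, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None  # nothing in the bytes states the coding

print(declared_encoding(b"# -*- coding: cp1252 -*-\nprint('x')\n"))
```

When the function returns None, the bytes themselves carry no statement of their coding, which is the case the rest of this message is about.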

Even assuming that detecting the coding can be done safely, it does
not solve the problem. Such a tool may detect a coding. This is a
*possible* coding, but it may not correspond to the real coding
which is supposed to be used. The problem is acute for an editor.
"utf-8 without BOM" is the typical example of this mess.

There are some tools to detect a coding, like chardet. One
should understand in which context, how and when they are used.
Such a tool can be used for a viewer or a web browser, where a wrong
or improper detection "only" leads to a wrong character display
without any further consequence. That is far away from an editor,
which is supposed to write absolutely correct code.
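
The gap between a possible coding and the real one can be shown with a short sketch. The sample text and the list of candidate codecs are chosen here for illustration only; the same bytes decode without error under each of them, and only one reading is what the author wrote.

```python
def valid_decodings(data, encodings):
    """Map each encoding that decodes `data` without error to the resulting text."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue
    return results

# Bytes as a UTF-8 editor would write them, with no BOM.
sample = "caf\u00e9".encode("utf-8")

readings = valid_decodings(sample, ["utf-8", "cp1252", "cp932"])
for name, text in readings.items():
    print(name, repr(text))
```

All three codecs accept the bytes, so the absence of decoding errors says nothing about which one was intended.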

To take the problem from the other side: why do Python,
Xe(La)TeX or a web browser require a coding directive to work
safely?

You wrote: "I'm now fairly convinced that there is
no single approach ..."

Indeed, there is simply no solution to this. It is an unsolvable
problem. This is the nature of character coding. Usually every
attempt to "magically solve" it introduces more trouble than it
solves.


Technical note:

cp65001 is not equivalent to utf-8 (as far as I know; I have never
looked deeply into this coding).
Ditto, for the "java utf-8" version.
Ditto, for CESU-8

The Python core developers know this ;-)

Regards,
jmf

Neil Hodgson

Aug 16, 2011, 7:35:26 PM
to scite-i...@googlegroups.com
jmfauth:

> cp65001 is not equivalent to utf-8 (as far as I know; I have never
> looked deeply into this coding).

Older versions of Windows (pre-Vista) encoded lone surrogates or
mismatched surrogate pairs incorrectly. Not something I'd consider a
problem as these are already errors. Scintilla is happy to handle
invalid UTF-8 by displaying the errors as hex blobs.
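
To make the lone surrogate case concrete, here is a small sketch using Python's codecs. It illustrates the general issue only; it is not the Windows cp65001 conversion or Scintilla's drawing code, and the helper names are hypothetical.

```python
LONE = "\ud800"  # a high surrogate with no matching low surrogate

def strict_utf8(text):
    """Return UTF-8 bytes, or None if strict encoding rejects the text."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None

def show_with_hex_blobs(data):
    """Decode UTF-8, rendering undecodable bytes as \\xNN escapes."""
    return data.decode("utf-8", "backslashreplace")

# Strict UTF-8 refuses the lone surrogate; "surrogatepass" forces it through.
print(strict_utf8(LONE))
raw = LONE.encode("utf-8", "surrogatepass")
print(raw)

# A strict decoder rejects those bytes, but they can still be shown as hex.
print(show_with_hex_blobs(b"a" + raw + b"b"))
```

The forced bytes are not valid UTF-8, which matches the point that such input is already an error before any detection question arises.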

> Ditto, for the "java utf-8" version.
> Ditto, for CESU-8

These should not be used for output text files so shouldn't be a concern.
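
For anyone unfamiliar with CESU-8, a rough sketch of how it differs from UTF-8 for characters outside the Basic Multilingual Plane follows. The encoder is a hypothetical helper written for this illustration, not a library function.

```python
def cesu8_encode(text):
    """Encode text CESU-8 style: each UTF-16 code unit becomes its own UTF-8 sequence."""
    utf16 = text.encode("utf-16-be")
    units = (int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2))
    # "surrogatepass" lets the surrogate halves through as 3-byte sequences.
    return "".join(chr(unit) for unit in units).encode("utf-8", "surrogatepass")

def is_strict_utf8(data):
    """Report whether a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

emoji = "\U0001F600"
print(emoji.encode("utf-8"))              # 4 bytes in UTF-8
print(cesu8_encode(emoji))                # 6 bytes in CESU-8
print(is_strict_utf8(cesu8_encode(emoji)))
```

For text limited to the Basic Multilingual Plane the two forms produce identical bytes, so the difference only shows up with supplementary characters.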

> The Python core developers know this ;-)

Python codecs have to handle cases where encoding is being done for
complex serialization and non-text communication. SciTE only has to
deal with text files.

Neil
