There was a similar discussion on this list about a year ago but
from a different perspective. I'm now fairly convinced that there is
no single approach which will satisfy everybody so it's time for that
old programming favourite: 'another level of indirection'. By
providing an option to run a user supplied program after loading a
file, it is possible to have that program be responsible for setting
the encoding and any other properties desired. SciTE collects these
properties into a new per-buffer property set which overrides other
property sets. The called program prints out a list of property
settings, one per line just like a .properties file.
While it is somewhat slow to call an external program, it only
takes between 50 and 100 milliseconds to run an example script on my
main installation so this is probably OK for many people.
If the program thinks the file is in Shift-JIS, it may print out, for example:
code.page=932
character.set=128
The property for the command is currently called
props.discovery.script and it may be set up something like:
props.discovery.script=python C:\Users\Neil\FileDetect.py "$(FilePath)"
This code is available from
http://www.scintilla.org/scite.zip Source
http://www.scintilla.org/wscite.zip Windows 7 or Vista executable
The copy of Scintilla in that download uses Direct2D/DirectWrite so
will not run on XP. For those on XP, copy in a Scintilla DLL from a
working SciTE installation. This is also working on Linux but has to
be compiled.
An example script is attached to this mail. It works by reading up
to 1000 lines from the file, discarding any encodings that have
errors. When finished, it prints out the properties associated with
the highest priority working encoding. This isn't a high quality
detector and I'm sure better ones will be written. For debugging, it
also prints out other successful encoding properties but commented
out. Finally it prints out a change to the caret colour so you can see
it was called:
code.page=936
character.set=134
#code.page=949
#character.set=129
caret.fore=#4499FF
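For readers without the attachment, the approach described above can be sketched roughly as follows. This is not the attached script, only a minimal Python 3 illustration of the same idea. The function names (`working_encodings`, `format_properties`) are hypothetical, the priority order is an assumption, and the UTF-8 entry (code.page=65001) was added so that plain ASCII files are not reported as Shift-JIS. The other code.page and character.set pairs are the ones shown in this message.

```python
import sys
from itertools import islice

# Candidate encodings in (assumed) priority order, each with the
# SciTE properties to print when that encoding is chosen.
CANDIDATES = [
    ("utf-8", {"code.page": "65001"}),
    ("cp932", {"code.page": "932", "character.set": "128"}),
    ("gbk", {"code.page": "936", "character.set": "134"}),
    ("cp949", {"code.page": "949", "character.set": "129"}),
]


def _decodes(raw, encoding):
    """Return True if the bytes decode without error in this encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def working_encodings(lines, candidates=CANDIDATES):
    """Discard every candidate that fails on any line; keep priority order."""
    alive = list(candidates)
    for raw in lines:
        alive = [c for c in alive if _decodes(raw, c[0])]
        if not alive:
            break
    return alive


def format_properties(working):
    """Highest priority encoding is printed live, the rest commented out."""
    out = []
    for index, (_name, props) in enumerate(working):
        prefix = "" if index == 0 else "#"
        for key, value in props.items():
            out.append(f"{prefix}{key}={value}")
    return out


def main(path, max_lines=1000):
    try:
        with open(path, "rb") as f:
            lines = list(islice(f, max_lines))  # read at most max_lines lines
    except OSError:
        return  # unreadable file: emit no properties
    for line in format_properties(working_encodings(lines)):
        print(line)
    print("caret.fore=#4499FF")  # visible sign that the script ran


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved as FileDetect.py, a script like this could be hooked up through the props.discovery.script setting shown earlier. Decoding line by line in binary mode keeps the check cheap, but it is still a crude detector: many byte sequences are valid in several of these code pages at once, which is why the priority order matters.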
Neil
--
You received this message because you are subscribed to the Google Groups "scite-interest" group.
To post to this group, send email to scite-i...@googlegroups.com.
To unsubscribe from this group, send email to scite-interes...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/scite-interest?hl=en.
Available from Hg or from
http://www.scintilla.org/scite.zip Source
http://www.scintilla.org/wscite.zip Windows executable
The Windows executable is back to running on XP - no DirectWrite code.
Neil
— Sylvain Brunerie
http://innsbay.toile-libre.org
2011/8/2 Neil Hodgson <nyama...@gmail.com>
>
> There has been another discussion of encoding auto-detection on the
> feature request tracker:
> http://sourceforge.net/tracker/?func=detail&atid=352439&aid=3324341&group_id=2439
>
> There was a similar discussion on this list about a year ago but
> from a different perspective. I'm now fairly convinced that there is
> no single approach which will satisfy everybody so it's time for that
> old programming favourite: 'another level of indirection'. By
> providing an option to run a user supplied program after loading a
> file, it is possible to have that program be responsible for setting
> the encoding and any other properties desired. SciTE collects these
> properties into a new per-buffer property set which overrides other
> property sets. The called program prints out a list of property
> settings, one per line just like a .properties file.
> cp65001 is not equivalent to utf-8 (as far as I know; I have never
> looked deeply into this encoding).
Older versions of Windows (pre-Vista) encoded lone surrogates or
mismatched surrogate pairs incorrectly. Not something I'd consider a
problem as these are already errors. Scintilla is happy to handle
invalid UTF-8 by displaying the errors as hex blobs.
> Ditto, for the "java utf-8" version.
> Ditto, for CESU-8
These should not be used for output text files so shouldn't be a concern.
> The Python core developers know this ;-)
Python codecs have to handle cases where encoding is being done for
complex serialization and non-text communication. SciTE only has to
deal with text files.
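
As a rough illustration of the surrogate and CESU-8 points in this reply, the Python 3 snippet below shows how a strict UTF-8 codec treats surrogates. It only demonstrates Python's own codec behaviour and assumes nothing about how Windows cp65001 or Scintilla are implemented. The sample character U+1F600 and the variable names are chosen for illustration.

```python
# U+1F600 is outside the BMP, so UTF-16 represents it as the
# surrogate pair D83D DE00.
real_utf8 = "\U0001F600".encode("utf-8")  # one 4-byte sequence

# CESU-8 style bytes: each surrogate encoded as its own 3-byte sequence.
cesu_bytes = b"\xed\xa0\xbd\xed\xb8\x80"

# A strict UTF-8 decoder treats encoded surrogates as errors.
try:
    cesu_bytes.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# A lone surrogate cannot be encoded as strict UTF-8 either.
try:
    "\ud800".encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# Showing the offending bytes as escapes instead of failing is similar
# in spirit to displaying invalid bytes as hex blobs.
visible = cesu_bytes.decode("utf-8", errors="backslashreplace")

print(real_utf8, strict_decode_ok, strict_encode_ok, visible)
```

For a text editor, the useful property is the last one: invalid sequences remain visible and round-trippable in the display rather than aborting the load.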
Neil