[FarGroup/FarManager] master: Use IMultiLanguage2 for codepage detection (1e2de69c1)

farg...@farmanager.com

Oct 4, 2022, 5:15:51 PM
to farco...@googlegroups.com
Repository : https://github.com/FarGroup/FarManager
On branch : master
Link : https://github.com/FarGroup/FarManager/commit/1e2de69c18bffac42dc0a1762832b5b5dfad1062

>---------------------------------------------------------------

commit 1e2de69c18bffac42dc0a1762832b5b5dfad1062
Author: Alex Alabuzhev <alab...@gmail.com>
Date: Tue Oct 4 22:03:54 2022 +0100

Use IMultiLanguage2 for codepage detection


>---------------------------------------------------------------

1e2de69c18bffac42dc0a1762832b5b5dfad1062
far/FarCze.hlf.m4 | 30 ++++++-------------
far/FarEng.hlf.m4 | 30 ++++++-------------
far/FarGer.hlf.m4 | 30 ++++++-------------
far/FarHun.hlf.m4 | 30 ++++++-------------
far/FarPol.hlf.m4 | 25 +++++-----------
far/FarRus.hlf.m4 | 27 +++++------------
far/FarSky.hlf.m4 | 30 ++++++-------------
far/FarUkr.hlf.m4 | 30 ++++++-------------
far/filestr.cpp | 75 ++++++++++++++++++++++++++++++++++--------------
far/platform.headers.hpp | 1 +
10 files changed, 123 insertions(+), 185 deletions(-)

diff --git a/far/FarCze.hlf.m4 b/far/FarCze.hlf.m4
index f82b7bdb6..a03cf37fd 100644
--- a/far/FarCze.hlf.m4
+++ b/far/FarCze.hlf.m4
@@ -5596,27 +5596,15 @@ If current value of an option is other than the default, the option is marked wi

@Codepages.NoAutoDetectCP
$ #far:config Codepages.NoAutoDetectCP#
- This string parameter defines the code pages which will be excluded
-from Universal Codepage Detector (UCD) autodetect. Sometimes, especially
-on small files, UCD annoyingly chooses wrong code pages.
-
- The default value is empty string #""#. In this case all code pages
-detectable by UCD (about 20, much less than there is usually available
-in the system) are enabled.
-
- If this parameter is set to string #"-1"# and the #Other# section
-of the ~Code pages~@CodePagesMenu@ menu is hidden (#Ctrl+H# key
-combination), only #System# (ANSI, OEM), #Unicode#, and #Favorites# code
-pages will be enabled for UCD. If the #Other# section is visible, all
-code pages are enabled.
-
- Otherwise, this parameter should contain comma separated list
-of code page numbers disabled for UCD. For example,
-#"1250,1252,1253,1255,855,10005,28592,28595,28597,28598,38598"#.
-
- Since Unicode code pages (1200, 1201, 65001) are detected outside
-of UCD, they cannot be disabled even if they appear on the exclusions
-list.
+ This parameter allows to exclude specific code pages from the heuristic code page detection results.
+Such detection is unreliable by definition: it depends on statistical data and could guess wrong, especially when the amount of input data is small.
+
+ By default the parameter is empty and there are no restrictions which code pages could be detected heuristically.
+
+ If this parameter is set to #-1#, only the code pages, currently visible in the ~Code pages~@CodePagesMenu@ menu, will be accepted.
+You can control which code pages are visible there with the #Ctrl+H# key combination and the #Favorites# section.
+
+ If this parameter contains a comma-separated list of code page numbers, all the specified code pages will be excluded from the heuristic detection.

This parameter can be changed via ~far:config~@FarConfig@ only.

diff --git a/far/FarEng.hlf.m4 b/far/FarEng.hlf.m4
index 874c63485..aa9f369db 100644
--- a/far/FarEng.hlf.m4
+++ b/far/FarEng.hlf.m4
@@ -5563,27 +5563,15 @@ If current value of an option is other than the default, the option is marked wi

@Codepages.NoAutoDetectCP
$ #far:config Codepages.NoAutoDetectCP#
- This string parameter defines the code pages which will be excluded
-from Universal Codepage Detector (UCD) autodetect. Sometimes, especially
-on small files, UCD annoyingly chooses wrong code pages.
-
- The default value is empty string #""#. In this case all code pages
-detectable by UCD (about 20, much less than there is usually available
-in the system) are enabled.
-
- If this parameter is set to string #"-1"# and the #Other# section
-of the ~Code pages~@CodePagesMenu@ menu is hidden (#Ctrl+H# key
-combination), only #System# (ANSI, OEM), #Unicode#, and #Favorites# code
-pages will be enabled for UCD. If the #Other# section is visible, all
-code pages are enabled.
-
- Otherwise, this parameter should contain comma separated list
-of code page numbers disabled for UCD. For example,
-#"1250,1252,1253,1255,855,10005,28592,28595,28597,28598,38598"#.
-
- Since Unicode code pages (1200, 1201, 65001) are detected outside
-of UCD, they cannot be disabled even if they appear on the exclusions
-list.
+ This parameter allows to exclude specific code pages from the heuristic code page detection results.
+Such detection is unreliable by definition: it depends on statistical data and could guess wrong, especially when the amount of input data is small.
+
+ By default the parameter is empty and there are no restrictions which code pages could be detected heuristically.
+
+ If this parameter is set to #-1#, only the code pages, currently visible in the ~Code pages~@CodePagesMenu@ menu, will be accepted.
+You can control which code pages are visible there with the #Ctrl+H# key combination and the #Favorites# section.
+
+ If this parameter contains a comma-separated list of code page numbers, all the specified code pages will be excluded from the heuristic detection.

This parameter can be changed via ~far:config~@FarConfig@ only.

diff --git a/far/FarGer.hlf.m4 b/far/FarGer.hlf.m4
index b4be1cd8e..2f21cb219 100644
--- a/far/FarGer.hlf.m4
+++ b/far/FarGer.hlf.m4
@@ -5654,27 +5654,15 @@ If current value of an option is other than the default, the option is marked wi

@Codepages.NoAutoDetectCP
$ #far:config Codepages.NoAutoDetectCP#
- This string parameter defines the code pages which will be excluded
-from Universal Codepage Detector (UCD) autodetect. Sometimes, especially
-on small files, UCD annoyingly chooses wrong code pages.
-
- The default value is empty string #""#. In this case all code pages
-detectable by UCD (about 20, much less than there is usually available
-in the system) are enabled.
-
- If this parameter is set to string #"-1"# and the #Other# section
-of the ~Code pages~@CodePagesMenu@ menu is hidden (#Ctrl+H# key
-combination), only #System# (ANSI, OEM), #Unicode#, and #Favorites# code
-pages will be enabled for UCD. If the #Other# section is visible, all
-code pages are enabled.
-
- Otherwise, this parameter should contain comma separated list
-of code page numbers disabled for UCD. For example,
-#"1250,1252,1253,1255,855,10005,28592,28595,28597,28598,38598"#.
-
- Since Unicode code pages (1200, 1201, 65001) are detected outside
-of UCD, they cannot be disabled even if they appear on the exclusions
-list.
+ This parameter allows to exclude specific code pages from the heuristic code page detection results.
+Such detection is unreliable by definition: it depends on statistical data and could guess wrong, especially when the amount of input data is small.
+
+ By default the parameter is empty and there are no restrictions which code pages could be detected heuristically.
+
+ If this parameter is set to #-1#, only the code pages, currently visible in the ~Code pages~@CodePagesMenu@ menu, will be accepted.
+You can control which code pages are visible there with the #Ctrl+H# key combination and the #Favorites# section.
+
+ If this parameter contains a comma-separated list of code page numbers, all the specified code pages will be excluded from the heuristic detection.

This parameter can be changed via ~far:config~@FarConfig@ only.

diff --git a/far/FarHun.hlf.m4 b/far/FarHun.hlf.m4
index 8896283d9..6e36dbe89 100644
--- a/far/FarHun.hlf.m4
+++ b/far/FarHun.hlf.m4
@@ -5672,27 +5672,15 @@ If current value of an option is other than the default, the option is marked wi

@Codepages.NoAutoDetectCP
$ #far:config Codepages.NoAutoDetectCP#
- This string parameter defines the code pages which will be excluded
-from Universal Codepage Detector (UCD) autodetect. Sometimes, especially
-on small files, UCD annoyingly chooses wrong code pages.
-
- The default value is empty string #""#. In this case all code pages
-detectable by UCD (about 20, much less than there is usually available
-in the system) are enabled.
-
- If this parameter is set to string #"-1"# and the #Other# section
-of the ~Code pages~@CodePagesMenu@ menu is hidden (#Ctrl+H# key
-combination), only #System# (ANSI, OEM), #Unicode#, and #Favorites# code
-pages will be enabled for UCD. If the #Other# section is visible, all
-code pages are enabled.
-
- Otherwise, this parameter should contain comma separated list
-of code page numbers disabled for UCD. For example,
-#"1250,1252,1253,1255,855,10005,28592,28595,28597,28598,38598"#.
-
- Since Unicode code pages (1200, 1201, 65001) are detected outside
-of UCD, they cannot be disabled even if they appear on the exclusions
-list.
+ This parameter allows to exclude specific code pages from the heuristic code page detection results.
+Such detection is unreliable by definition: it depends on statistical data and could guess wrong, especially when the amount of input data is small.
+
+ By default the parameter is empty and there are no restrictions which code pages could be detected heuristically.
+
+ If this parameter is set to #-1#, only the code pages, currently visible in the ~Code pages~@CodePagesMenu@ menu, will be accepted.
+You can control which code pages are visible there with the #Ctrl+H# key combination and the #Favorites# section.
+
+ If this parameter contains a comma-separated list of code page numbers, all the specified code pages will be excluded from the heuristic detection.

This parameter can be changed via ~far:config~@FarConfig@ only.

diff --git a/far/FarPol.hlf.m4 b/far/FarPol.hlf.m4
index 3ffded8fa..3cd67ec51 100644
--- a/far/FarPol.hlf.m4
+++ b/far/FarPol.hlf.m4
@@ -5560,28 +5560,17 @@ Jeżeli bieżąca wartość opcji jest inna niż domyślna, opcja jest oznaczona

@Codepages.NoAutoDetectCP
$ #far:config Codepages.NoAutoDetectCP#
- Ten parametr tekstowy definiuje strony kodowe, które będą wyłączane
-z autodetekcji Universal Codepage Detector (UCD). Czasami, szczególnie
-w przypadku małych plików, UCD irytująco wybiera niewłaściwe strony kodowe.
+ This parameter allows to exclude specific code pages from the heuristic code page detection results.
+Such detection is unreliable by definition: it depends on statistical data and could guess wrong, especially when the amount of input data is small.

- Domyślną wartością jest pusty łańcuch #""#. W takim przypadku wszystkie
-strony kodowe wykrywane przez UCD (około 20, znacznie mniej niż zazwyczaj
-jest dostępnych w systemie) są włączone.
+ By default the parameter is empty and there are no restrictions which code pages could be detected heuristically.

- Jeżeli ten parametr jest ustawiony na #"-1"# i sekcja #Pozostałe# w menu
-~Strony kodowe~@CodePagesMenu@ jest ukryta (kombinacja klawiszy #Ctrl+H#),
-tylko strony kodowe #Systemowe# (ANSI, OEM), #Unicode#, i #Ulubione# będą
-włączone dla UCD. Jeżeli sekcja #Pozostałe# jest widoczna, to wszystkie
-strony kodowe są włączone.
+ If this parameter is set to #-1#, only the code pages, currently visible in the ~Code pages~@CodePagesMenu@ menu, will be accepted.
+You can control which code pages are visible there with the #Ctrl+H# key combination and the #Favorites# section.

- W przeciwnym wypadku, parametr powinien zawierać listę stron kodowych
-wyłączonych dla UCD, oddzieloną przecinkami. Np.:
-#"1250,1252,1253,1255,855,10005,28592,28595,28597,28598,38598"#.
+ If this parameter contains a comma-separated list of code page numbers, all the specified code pages will be excluded from the heuristic detection.

- Ponieważ strony kodowe Unicode (1200, 1201, 65001) są wykrywane poza UCD,
-nie można ich wyłączyć, nawet jeżeli znajdują się na liście wykluczeń.
-
- Ten parametr można zmienić tylko w ~far:config~@FarConfig@.
+ This parameter can be changed via ~far:config~@FarConfig@ only.


@Help.ActivateURL
diff --git a/far/FarRus.hlf.m4 b/far/FarRus.hlf.m4
index b5336b6a5..33821ebda 100644
--- a/far/FarRus.hlf.m4
+++ b/far/FarRus.hlf.m4
@@ -5638,30 +5638,17 @@ $ #Редактор конфигурации#

@Codepages.NoAutoDetectCP
$ #far:config Codepages.NoAutoDetectCP#
- Этот строковый параметр задаёт кодовые страницы, которые будут
-исключены из автоматического определения Universal Codepage Detector'ом
-(UCD). Иногда (особенно на небольших файлах) UCD назойливо выбирает
-неподходящие кодовые страницы.
+ This parameter allows to exclude specific code pages from the heuristic code page detection results.
+Such detection is unreliable by definition: it depends on statistical data and could guess wrong, especially when the amount of input data is small.

- Значение по умолчанию -- это пустая строка #""#. В этом случае все
-кодовые страницы, которые может определить UCD (около двух десятков,
-гораздо меньше, чем обычно доступно в системе) разрешены.
+ By default the parameter is empty and there are no restrictions which code pages could be detected heuristically.

- Если параметр равен строке #"-1"#, и раздел #Прочие# в меню
-~кодовых страниц~@CodePagesMenu@ скрыт (комбинация клавиш #Ctrl+H#),
-то для UCD будут разрешены только #Системные# (ANSI, OEM), #Юникодные#
-и #Избранные# кодовые страницы. Если раздел #Прочие# виден, все кодовые
-страницы разрешены.
+ If this parameter is set to #-1#, only the code pages, currently visible in the ~Code pages~@CodePagesMenu@ menu, will be accepted.
+You can control which code pages are visible there with the #Ctrl+H# key combination and the #Favorites# section.

- В противном случае параметр должен содержать список номеров кодовых
-страниц, запрещённых для UCD. Например,
-#"1250,1252,1253,1255,855,10005,28592,28595,28597,28598,38598"#.
+ If this parameter contains a comma-separated list of code page numbers, all the specified code pages will be excluded from the heuristic detection.

- Поскольку юникодные кодовые страницы (1200, 1201, 65001) проверяются
-отдельно от UCD, они не могут быть запрещены, даже если они есть
-в списке исключений.
-
- Изменить этот параметр можно только через ~far:config~@FarConfig@.
+ This parameter can be changed via ~far:config~@FarConfig@ only.


@Help.ActivateURL
diff --git a/far/FarSky.hlf.m4 b/far/FarSky.hlf.m4
index 75ef8552f..9beb7302b 100644
--- a/far/FarSky.hlf.m4
+++ b/far/FarSky.hlf.m4
@@ -5557,27 +5557,15 @@ If current value of an option is other than the default, the option is marked wi

@Codepages.NoAutoDetectCP
$ #far:config Codepages.NoAutoDetectCP#
- This string parameter defines the code pages which will be excluded
-from Universal Codepage Detector (UCD) autodetect. Sometimes, especially
-on small files, UCD annoyingly chooses wrong code pages.
-
- The default value is empty string #""#. In this case all code pages
-detectable by UCD (about 20, much less than there is usually available
-in the system) are enabled.
-
- If this parameter is set to string #"-1"# and the #Other# section
-of the ~Code pages~@CodePagesMenu@ menu is hidden (#Ctrl+H# key
-combination), only #System# (ANSI, OEM), #Unicode#, and #Favorites# code
-pages will be enabled for UCD. If the #Other# section is visible, all
-code pages are enabled.
-
- Otherwise, this parameter should contain comma separated list
-of code page numbers disabled for UCD. For example,
-#"1250,1252,1253,1255,855,10005,28592,28595,28597,28598,38598"#.
-
- Since Unicode code pages (1200, 1201, 65001) are detected outside
-of UCD, they cannot be disabled even if they appear on the exclusions
-list.
+ This parameter allows to exclude specific code pages from the heuristic code page detection results.
+Such detection is unreliable by definition: it depends on statistical data and could guess wrong, especially when the amount of input data is small.
+
+ By default the parameter is empty and there are no restrictions which code pages could be detected heuristically.
+
+ If this parameter is set to #-1#, only the code pages, currently visible in the ~Code pages~@CodePagesMenu@ menu, will be accepted.
+You can control which code pages are visible there with the #Ctrl+H# key combination and the #Favorites# section.
+
+ If this parameter contains a comma-separated list of code page numbers, all the specified code pages will be excluded from the heuristic detection.

This parameter can be changed via ~far:config~@FarConfig@ only.

diff --git a/far/FarUkr.hlf.m4 b/far/FarUkr.hlf.m4
index d45878805..be456efce 100644
--- a/far/FarUkr.hlf.m4
+++ b/far/FarUkr.hlf.m4
@@ -5649,27 +5649,15 @@ If current value of an option is other than the default, the option is marked wi

@Codepages.NoAutoDetectCP
$ #far:config Codepages.NoAutoDetectCP#
- This string parameter defines the code pages which will be excluded
-from Universal Codepage Detector (UCD) autodetect. Sometimes, especially
-on small files, UCD annoyingly chooses wrong code pages.
-
- The default value is empty string #""#. In this case all code pages
-detectable by UCD (about 20, much less than there is usually available
-in the system) are enabled.
-
- If this parameter is set to string #"-1"# and the #Other# section
-of the ~Code pages~@CodePagesMenu@ menu is hidden (#Ctrl+H# key
-combination), only #System# (ANSI, OEM), #Unicode#, and #Favorites# code
-pages will be enabled for UCD. If the #Other# section is visible, all
-code pages are enabled.
-
- Otherwise, this parameter should contain comma separated list
-of code page numbers disabled for UCD. For example,
-#"1250,1252,1253,1255,855,10005,28592,28595,28597,28598,38598"#.
-
- Since Unicode code pages (1200, 1201, 65001) are detected outside
-of UCD, they cannot be disabled even if they appear on the exclusions
-list.
+ This parameter allows to exclude specific code pages from the heuristic code page detection results.
+Such detection is unreliable by definition: it depends on statistical data and could guess wrong, especially when the amount of input data is small.
+
+ By default the parameter is empty and there are no restrictions which code pages could be detected heuristically.
+
+ If this parameter is set to #-1#, only the code pages, currently visible in the ~Code pages~@CodePagesMenu@ menu, will be accepted.
+You can control which code pages are visible there with the #Ctrl+H# key combination and the #Favorites# section.
+
+ If this parameter contains a comma-separated list of code page numbers, all the specified code pages will be excluded from the heuristic detection.

This parameter can be changed via ~far:config~@FarConfig@ only.
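
The three modes of Codepages.NoAutoDetectCP described in the help text above (empty, #-1#, comma-separated list) can be read as a choice between two acceptance predicates. The following is a minimal, portable sketch of that idea, not the code from this commit: `split_list`, `make_filter` and the `VisibleCodepages` set are hypothetical names, and the set is a simplified stand-in for "code pages currently visible in the Code pages menu". It assumes list tokens are compared verbatim, without trimming whitespace.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper: splits "1250,1252;866" on ',' or ';' and drops empty tokens.
static std::set<std::string> split_list(std::string_view List)
{
    std::set<std::string> Result;
    std::size_t Start = 0;

    while (Start <= List.size())
    {
        const auto End = List.find_first_of(",;", Start);
        const auto Token = List.substr(Start, End == List.npos? List.npos : End - Start);

        if (!Token.empty())
            Result.emplace(Token);

        if (End == List.npos)
            break;

        Start = End + 1;
    }

    return Result;
}

// Hypothetical helper: builds the acceptance predicate for a parameter value.
// "-1"            -> accept only code pages from VisibleCodepages (whitelist).
// anything else   -> accept everything not named in the list (blacklist);
//                    an empty value therefore accepts every code page.
static std::function<bool(unsigned)> make_filter(std::string_view Param, std::set<unsigned> VisibleCodepages)
{
    if (Param == "-1")
    {
        return [Visible = std::move(VisibleCodepages)](unsigned const Cp)
        {
            return Visible.count(Cp) != 0;
        };
    }

    return [Excluded = split_list(Param)](unsigned const Cp)
    {
        return Excluded.count(std::to_string(Cp)) == 0;
    };
}
```

For example, `make_filter("1250,1252", {})` would reject 1250 and 1252 and accept any other code page, while `make_filter("-1", {1251, 65001})` would accept only 1251 and 65001.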

diff --git a/far/filestr.cpp b/far/filestr.cpp
index 4b721b779..e58009238 100644
--- a/far/filestr.cpp
+++ b/far/filestr.cpp
@@ -42,8 +42,10 @@ THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#include "config.hpp"
#include "codepage_selection.hpp"
#include "global.hpp"
+#include "log.hpp"

// Platform:
+#include "platform.com.hpp"

// Common:
#include "common/algorithm.hpp"
@@ -322,34 +324,65 @@ static bool GetUnicodeCpUsingWindows(const void* Data, size_t Size, uintptr_t& C
return false;
}

-static bool GetCpUsingUniversalDetectorWithExceptions(std::string_view const Str, uintptr_t& Codepage)
+static bool GetCpUsingML(std::string_view Str, uintptr_t& Codepage, function_ref<bool(uintptr_t)> const IsCodepageAcceptable)
{
- if (!GetCpUsingUniversalDetector(Str, Codepage))
- return false;
+ SCOPED_ACTION(os::com::initialize);

- // This whole block shouldn't be here
- if (Global->Opt->strNoAutoDetectCP.Get() == L"-1"sv)
- {
- if (Global->Opt->CPMenuMode && none_of(Codepage, encoding::codepage::ansi(), encoding::codepage::oem()))
- {
- const auto CodepageType = codepages::GetFavorite(Codepage);
- if (!(CodepageType & CPST_FAVORITE))
- {
- return false;
- }
- }
- }
- else
+ os::com::ptr<IMultiLanguage2> ML;
+ if (const auto Result = CoCreateInstance(CLSID_CMultiLanguage, {}, CLSCTX_INPROC_SERVER, IID_IMultiLanguage2, IID_PPV_ARGS_Helper(&ptr_setter(ML))); FAILED(Result))
{
- if (contains(enum_tokens(Global->Opt->strNoAutoDetectCP.Get(), L",;"sv), str(Codepage)))
- {
- return false;
- }
+ LOGWARNING(L"CoCreateInstance(CLSID_CMultiLanguage): {}"sv, os::format_error(Result));
+ return false;
}

+ int Size = static_cast<int>(Str.size());
+ DetectEncodingInfo Info[10];
+ int InfoCount = static_cast<int>(std::size(Info));
+
+ if (const auto Result = ML->DetectInputCodepage(MLDETECTCP_NONE, 0, const_cast<char*>(Str.data()), &Size, Info, &InfoCount); FAILED(Result))
+ return false;
+
+ const auto Scores = span(Info, InfoCount);
+ std::sort(ALL_CONST_RANGE(Scores), [](DetectEncodingInfo const& a, DetectEncodingInfo const& b) { return a.nDocPercent > b.nDocPercent; });
+
+ const auto It = std::find_if(ALL_CONST_RANGE(Scores), [&](DetectEncodingInfo const& i) { return i.nLangID != 0xffffffff && IsCodepageAcceptable(i.nCodePage); });
+ if (It == Scores.cend())
+ return false;
+
+ Codepage = It->nCodePage;
return true;
}

+static bool GetCpUsingHeuristicsWithExceptions(std::string_view const Str, uintptr_t& Codepage)
+{
+ const auto IsCodepageNotBlacklisted = [](uintptr_t const Cp)
+ {
+ return !contains(enum_tokens(Global->Opt->strNoAutoDetectCP.Get(), L",;"sv), str(Cp));
+ };
+
+ const auto IsCodepageWhitelisted = [](uintptr_t const Cp)
+ {
+ if (!Global->Opt->CPMenuMode)
+ return true;
+
+ if (any_of(Cp, encoding::codepage::ansi(), encoding::codepage::oem()))
+ return true;
+
+ const auto CodepageType = codepages::GetFavorite(Cp);
+ return (CodepageType & CPST_FAVORITE) != 0;
+ };
+
+ const auto IsCodepageAcceptable =
+ Global->Opt->strNoAutoDetectCP == L"-1"sv?
+ function_ref(IsCodepageWhitelisted) :
+ function_ref(IsCodepageNotBlacklisted);
+
+ if (GetCpUsingUniversalDetector(Str, Codepage) && IsCodepageAcceptable(Codepage))
+ return true;
+
+ return GetCpUsingML(Str, Codepage, IsCodepageAcceptable);
+}
+
// If the file contains a BOM this function will advance the file pointer by the BOM size (either 2 or 3)
static bool GetFileCodepage(const os::fs::file& File, uintptr_t DefaultCodepage, uintptr_t& Codepage, bool& SignatureFound, bool& NotUTF8, bool& NotUTF16, bool UseHeuristics)
{
@@ -398,7 +431,7 @@ static bool GetFileCodepage(const os::fs::file& File, uintptr_t DefaultCodepage,

NotUTF8 = true;

- return GetCpUsingUniversalDetectorWithExceptions({ Buffer.data(), ReadSize }, Codepage);
+ return GetCpUsingHeuristicsWithExceptions({ Buffer.data(), ReadSize }, Codepage);
}

uintptr_t GetFileCodepage(const os::fs::file& File, uintptr_t DefaultCodepage, bool* SignatureFound, bool UseHeuristics)
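
The selection step in GetCpUsingML above (sort the detection results by nDocPercent in descending order, then take the first entry that has a valid language ID and passes the filter) can be tried without COM or mlang.h. The sketch below is only an illustration under stated assumptions: `candidate` is a hypothetical stand-in for the three DetectEncodingInfo fields the commit reads, `pick_codepage` is a made-up name, and it uses std::stable_sort rather than the commit's std::sort so that ties keep their original order.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical stand-in for the DetectEncodingInfo fields used by the commit
// (nLangID, nCodePage, nDocPercent).
struct candidate
{
    std::uint32_t LangId;
    std::uint32_t CodePage;
    std::int32_t DocPercent;
};

// Returns the best-scoring acceptable code page, or nothing if every
// candidate is rejected. Candidates is taken by value so it can be sorted.
static std::optional<std::uint32_t> pick_codepage(
    std::vector<candidate> Candidates,
    std::function<bool(std::uint32_t)> const& IsAcceptable)
{
    // Highest document percentage first; stable_sort keeps ties in input order
    // (an assumption of this sketch, the commit itself uses std::sort).
    std::stable_sort(Candidates.begin(), Candidates.end(), [](candidate const& a, candidate const& b)
    {
        return a.DocPercent > b.DocPercent;
    });

    // Like the commit, skip entries whose language ID is 0xffffffff.
    const auto It = std::find_if(Candidates.cbegin(), Candidates.cend(), [&](candidate const& i)
    {
        return i.LangId != 0xffffffff && IsAcceptable(i.CodePage);
    });

    if (It == Candidates.cend())
        return std::nullopt;

    return It->CodePage;
}
```

Passing the filter in as a callable mirrors the `IsCodepageAcceptable` parameter in the diff: the same selection routine then serves both the whitelist and the blacklist mode.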
diff --git a/far/platform.headers.hpp b/far/platform.headers.hpp
index 66473a85f..91a028515 100644
--- a/far/platform.headers.hpp
+++ b/far/platform.headers.hpp
@@ -96,6 +96,7 @@ THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#include <ntddscsi.h>
#include <lmdfs.h>
#include <dbgeng.h>
+#include <mlang.h>

#define _NTSCSI_USER_MODE_


