searching through UTF-16 files?


Richard Simões

Aug 23, 2018, 2:32:57 PM
to ack users
Is it possible to configure ack to search through UTF-16 files? Setting the system locale isn't a workable solution for me, as there cannot be an en_US.utf-16 locale for Linux. I've been able to use iconv to first convert the files to be searched, but doing this is suboptimal for casual ack usage.

David Cantrell

Aug 23, 2018, 6:26:30 PM
to ack-...@googlegroups.com
On 2018-08-23 19:25, Richard Simões wrote:

> Is it possible to configure ack to search through UTF-16 files? Setting
> the system locale isn't a workable solution for me, as there cannot be
> an en_US.utf-16 locale for Linux.

I'd be shocked if there *can't* be. It's probably just that the necessary
locale data isn't installed on your machine. If you're on Debian, the
incantation is:

$ sudo dpkg-reconfigure locales

--
David Cantrell | Enforcer, South London Linguistic Massive

Perl: the only language that makes Welsh look acceptable

Bill Ricker

Aug 23, 2018, 7:58:03 PM
to ack-...@googlegroups.com
On Thu, Aug 23, 2018 at 6:26 PM David Cantrell <da...@cantrell.org.uk> wrote:
On 2018-08-23 19:25, Richard Simões wrote:

> Is it possible to configure ack to search through UTF-16 files?

Ack doesn't have a FAQ answer for that yet, since this is the first time it's been asked.
However, I'm intrigued, so I'm going to look into it to see if it's possible!

 
Setting
> the system locale isn't a workable solution for me, as there cannot be
> an en_US.utf-16 locale for Linux.

I'd be shocked if there *can't* be. It's probably
...

Alas, Richard is correct: Linux reportedly requires system-wide locales to be ASCII-compatible, which UTF-16 is not, so en_US.utf-16 is not the solution.

I will have to poke and prod a bit to see if there's a Perl solution.


--

Bill Ricker

Aug 24, 2018, 1:01:43 PM
to ack-...@googlegroups.com


On Thu, Aug 23, 2018, 19:57 Bill Ricker <bill....@gmail.com> wrote:


On Thu, Aug 23, 2018 at 6:26 PM David Cantrell <da...@cantrell.org.uk> wrote:
On 2018-08-23 19:25, Richard Simões wrote:

> Is it possible to configure ack to search through UTF-16 files?

It looks, so far, as if it should be possible to implement a patch to enable this, at least. We already peek at files' magic numbers and shebang #! lines, so checking for a Unicode prefix shouldn't disrupt existing code much.

I'm going to wander through CPAN and see if there's a Unicode module that will automate detecting the BOM and setting the UTF binmode accordingly.

(It's vaguely possible one of them would work without a patch, via
  perl -MUnicode::something bin/ack
but I'm not that optimistic -- yet. It's plausible, though.)
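
To make the BOM idea concrete, here's roughly the shape such a check could take. This is just a sketch, not ack code: the helper name detect_bom_layer is invented, and a real patch would also need to be careful about the decoded BOM (U+FEFF) showing up at the start of the first line.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical helper: peek at a file's first bytes raw and map a
    # recognizable BOM to a PerlIO :encoding layer; undef means no BOM.
    sub detect_bom_layer {
        my ($path) = @_;
        open my $fh, '<:raw', $path or return;
        my $head = '';
        read $fh, $head, 4;
        close $fh;
        return ':encoding(UTF-32LE)' if $head =~ /^\xFF\xFE\x00\x00/;  # FF FE 00 00
        return ':encoding(UTF-32BE)' if $head =~ /^\x00\x00\xFE\xFF/;  # 00 00 FE FF
        return ':encoding(UTF-16LE)' if $head =~ /^\xFF\xFE/;          # FF FE
        return ':encoding(UTF-16BE)' if $head =~ /^\xFE\xFF/;          # FE FF
        return ':encoding(UTF-8)'    if $head =~ /^\xEF\xBB\xBF/;      # EF BB BF
        return;
    }

    # Usage sketch: open with the detected layer, or fall back to the default.
    binmode STDOUT, ':encoding(UTF-8)';
    my $path  = shift @ARGV;
    my $layer = detect_bom_layer($path) || '';
    open my $fh, "<$layer", $path or die "$path: $!";
    while (my $line = <$fh>) {
        $line =~ s/^\x{FEFF}//;    # drop the BOM character on the first line
        print $line;
    }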

Richard, do you happen to know of a public (CC or PD etc) file that is already UTF-16 that we can use as test data?

Bill

Bill Ricker

Aug 24, 2018, 5:53:30 PM
to ack-...@googlegroups.com
YES!  Sort-of, mostly.

First, I found some test files in UCS-2 encoding (which is almost the same as UTF-16)
at http://www.humancomp.org/unichtm/unichtm.htm
(for anyone who wants to follow along at home but doesn't have UTF-16 files sitting around).
(But alas they are academic fair use of presumably copyrighted sources, so we can't include them in the test suite -- Debian requires CC, GPL, etc. -- which means if we're going to support this usage, or any better usage later, we'll need to find something with an explicit license or a PD/USG copyright waiver.)

So, ok: yes, we can trick ack into processing UCS-2/UTF-16 files even without patching or making a feature request, but it's ugly -- we inject a global encoding declaration for all file opens:

$ perl  -C '-Mopen IO=>":encoding(UCS-2LE)"' ~/bin/ack --noenv 'langues|wastes'  russmnvr.htm tongtwst.htm unilang.htm

tongtwst.htm
164:    The wild wolf roams the wintry wastes.

unilang.htm
35:L'enseignement et l'étude des langues


CAVEATS
  • if your 'ack' is somewhere other than ~/bin/ack, use that path instead, obviously enough.
  • note the quotes; they're required so perl receives the open pragma's arguments as a single -M option. (Reverse the "" and '' pairing on Windows, I suspect.)
  • examples work with both Ack2 and (pre-release beta) Ack3 !
  • the -C is required for the UTF-8 output to STDOUT to be correct ... otherwise étude will have a blot instead of an e-accent.
  • I have to include the --noenv flag, because otherwise the universal UCS/UTF open option gets applied to my .ackrc file too, which of course is ASCII (or at most UTF-8 per locale), not UCS-2/UTF-16, and ack chokes.
    So if you have -S / --smart-case etc. in your .ackrc, you'll need to repeat it (or -i) on your command line,
    along with any --type-set or --pager definitions there that are needed for the immediate search.
    If you don't have a .ackrc, you might not need the --noenv. (But you should have one!)
  • The IO=> prefix above is actually optional,
    since -Mopen understands a bare :encoding to mean IO=>:encoding,
    but the long form documents that we're setting a PerlIO layer, so it's preferred.
    (When it only costs 4 characters ... I'll do it the long way.)
  • If your files are true UTF-16 (meaning they use surrogate pairs for code points above U+FFFF),
    it will tell you with
       UCS-2LE:no surrogates allowed
    in which case replace UCS-2LE above with UTF-16LE;
    and if the files are big-endian (00 nn 00 nn) instead of little-endian (nn 00 nn 00), replace LE with BE. (The error will mention FFFE or FEFF.)
  • If files have bad codepoints, it is a fatal error, and ack will not move on to the next file;
    so you will need to filter out unclean UTF/UCS files if searching recursively or with glob wildcards
    (either by iterating until no fatal errors, or by filtering with a UTF validator earlier in the pipeline -- see the validator sketch after this list).
    (Why? The way we've universally injected UTF-ness via the open pragma doesn't get us the permissive CHECK=>0 flavor of Unicode en/decoding, because that's not an option for this inject-it-everywhere trick. :-/ )
    (If we decide to add a feature to handle UTF-16/UCS-2, we should be able to use the CHECK=>0 flavor to replace bad characters with blots.)
    (Example: one of the test files at the above page, the Zen bibliography ("Most of a large bibliography related to Zen Buddhism, containing works in Chinese, Japanese, and other languages: UCS-2, UTF-8"), claims to be UCS-2 but has bogus surrogate pair values, and is rejected hard both as UCS-2 (surrogate pairs not allowed) and as UTF-16 (bad HI surrogate). Firefox likes it just fine, though?!
    Sometimes we don't even get GIGO; it just chokes.)
  • the command-line search PCRE pattern is Latin-1 (or ASCII, or your locale).
    I tried including a Greek string |παπια| from "Tongue Twisters in Many Languages: UCS-2, UTF-8" in my search pattern, and it doesn't work, but I can search for English or French just fine. :-(
    (Alas, unicode_start in the terminal session didn't make this better either; I'm not sure where this gets mangled. May need to sort out Unicode patterns with UTF-8 first!)
  • I initially presumed I could just use \N{NAME} named characters to build a pattern matching παπια verbosely, a la \N{GREEK SMALL LETTER PI}, but no: for security, ack builds the regex from what is effectively a single-quoted string, so it never gets double-quote interpolation, and \N{NAME} and some other escapes don't work.
    But I can use numeric Unicode escapes to match a word made only of pi, alpha, iota:
           \b[\N{U+03C0}\N{U+03B1}\N{U+03B9}]+\b
    (see below for an example using that; and obviously one could spell out the exact word that way too)
  • Even though the test files have the BOM (Byte Order Mark) at the start, I had to specify the byte order explicitly with LE or BE (as well as UCS-2 vs UTF-16).
    If you guess wrong, it will report something like
    UTF-16BE:Unicode character fffe is illegal at /usr/bin/ack line 739.
    UCS-2LE:Unicode character feff is illegal at /usr/bin/ack line 739.
    In which case, swap LE and BE and try again.
    (This makes me sad; I thought the BOM should disambiguate this for me. I guess that only works if I slurp first and then decode, which can't be injected -- it would require patching.)
  • The above means all the files in one pass of ack must have the same encoding and the same byte order.
    (You should be able to mix UTF-8 and ASCII/locale files.)
    (Workaround: filter a mixed collection into subsets first; consider using xargs with a list of each type if you're not sorting them into folders/directories.)
    (If you don't have a Unicode validator to check what type each file is and whether it's clean, let me know and we'll think of something.)
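
As a first cut, here's the sort of validator filter I mean -- just a sketch, not a polished tool; the script name utf16-clean.pl is made up, and you'd adjust the encoding name to match your files. It prints the names of files that decode cleanly, so only the survivors get handed to the injected-encoding ack run.

    #!/usr/bin/perl
    # utf16-clean.pl -- print the names of files that decode cleanly as UTF-16LE
    use strict;
    use warnings;
    use Encode qw(decode);

    for my $file (@ARGV) {
        open my $fh, '<:raw', $file or do { warn "$file: $!\n"; next };
        my $octets = do { local $/; <$fh> };   # slurp raw bytes
        close $fh;
        if ( eval { decode('UTF-16LE', $octets, Encode::FB_CROAK); 1 } ) {
            print "$file\n";                   # clean: safe to search
        }
        else {
            warn "skipping $file: not clean UTF-16LE\n";
        }
    }

and then (assuming filenames without spaces):

    $ perl utf16-clean.pl *.htm |
        xargs perl -C '-Mopen IO=>":encoding(UTF-16LE)"' ~/bin/ack --noenv 'langues|wastes'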
Extra example -- searching Greek tongue twisters:

$ perl   -C  '-Mopen ":encoding(UTF-16LE)"' ~/bin/ack --noenv '(?x: langues | \b[\N{U+03C0}\N{U+03B1}\N{U+03B9}]+\b | wastes)'   [a-y]*.htm
tongtwst.htm
50:    Μια παπια μα πια παπια?
164:    The wild wolf roams the wintry wastes.

unilang.htm
35:L'enseignement et l'étude des langues

The (?x: ) wrapper turns on extended syntax, so the extra spaces (added to make the disjunction more readable) aren't matched.
(Oddly, I didn't find the (?u:) wrapper, to force the pattern to be treated as Unicode, to be required.)
(Why [a-y]*.htm? To avoid the zenbibl.htm file with the bad codepoint.)

DISCUSSION
Since we primarily position Ack as a programmer's code search / spelunking tool, I'm not confident we'll accept a feature request to enable UTF-16/UCS-2 and auto-detection via BOM of 8/16/32 and BE/LE ... but I'm tempted to try anyway; no promises!

This may go into Ack3's "Cookbook" documentation -- it's not quite a FAQ, but it's worth recording, both to show what's possible and to show the limits.

I hope this helps -- hit me with any follow-up questions this raises.

And thank you for a fun rabbit hole to go spelunking in !


// Bill

David Cantrell

Aug 28, 2018, 8:44:25 AM
to ack-...@googlegroups.com
On Fri, Aug 24, 2018 at 05:53:17PM -0400, Bill Ricker wrote:

> *DISCUSSION*
> Since we primarily position Ack as a programmer's code search / spelunking
> tool, I'm not confident we'll accept a feature request to enable
> UTF-16/UCS-2 and auto-detection via BOM of 8/16/32 and BE/LE ... but I'm
> tempted to try anyway; no promises!

FWIW while I do mostly use ack for grovelling over code, I also use it
for grovelling over documentation and config data, both of which can
contain non-ASCII code points. Thankfully *I* only have to work (at
least currently) with utf8 and ascii documentation and data, but I can
imagine that there are poor souls out there working with other
encodings.

I'd love to see ack spawn something like an --encoding=utf-16 tentacle,
with maybe an --encoding=automatic that could be stuffed into a .ackrc.

--
David Cantrell | Cake Smuggler Extraordinaire

Immigration: making Britain great since AD43

Bill Ricker

Aug 28, 2018, 11:34:47 AM
to ack-...@googlegroups.com
On Tue, Aug 28, 2018 at 8:44 AM David Cantrell <da...@cantrell.org.uk> wrote:
On Fri, Aug 24, 2018 at 05:53:17PM -0400, Bill Ricker wrote:

> *DISCUSSION*
> Since we primarily position Ack as a programmer's code search / spelunking
> tool, I'm not confident we'll accept a feature request to enable
> UTF-16/UCS-2 and auto-detection via BOM of 8/16/32 and BE/LE ... but I'm
> tempted to try anyway; no promises!

FWIW while I do mostly use ack for grovelling over code, I also use it
for grovelling over documentation and config data, both of which can
contain non-ASCII code points.

Likewise! I collect a lot of out-of-copyright history/historic books as PDF and txt, and export my .doc[x]/xls[x]/od[stp] files as txt, and then use the tree of TXT files as an index to all the PDFs etc. via ack.

I am however aware that using Ack as part of my ad hoc CLI Document Retrieval System is an "off label" use of a code-search tool.

How often do programmers whose native language isn't English use accented characters in their variable/subroutine/Module::method names?
(We may not see much of it on CPAN, but it may be more prevalent in in-house code?)
How often do we need to search for emoji that aren't expressed as hex or \N{NAME} but appear as visible emoji characters in the source file?

(I wonder, is the shebang `#! perl` processed properly if it's preceded by a UTF BOM ?  Is it LOCALE dependent?)


Thankfully *I* only have to work (at
least currently) with utf8 and ascii documentation and data, but I can
imagine that there are poor souls out there working with other
encodings.

UTF-16/32 will be rare outside of shops processing Asian languages, where there are tradeoffs, but I can see how word-as-codepoint would have the same advantage (in today's big-RAM world) that our former byte-as-codepoint had in our innocent (= naive/guilty) days.

Running across a UTF file that has no BOM and has been separated from the metadata specifying which UTF it is will be problematic.
I'd love to see ack spawn something like an --encoding=utf-16 tentacle,
with maybe an --encoding=automatic that could be stuffed into a .ackrc.

Inspired by the above thread, I ran an experiment, with hopes of implementing the above suggestions.

Automatic encoding detection can work reliably only when a BOM (Byte Order Mark) prefix is included at the front of the file.
Detection of encodings without a BOM is intrinsically heuristic at best.
Binary data that may or may not be some UTF can't reliably be distinguished by trying every decoding and catching conversion failures; there will be false positives, which can only be accepted or rejected by recognizing plausible content, and that is a natural-language problem. I can't tell whether a series of Chinese etc. ideographs is nonsense or sensible, and neither can Perl. (OTOH, Google Translate tells me the sequence of Chinese ideographs emitted by a wrong UTF guess is nonsense.)

(Automatic BOM detection would even help with UTF-8 files currently being detected as ASCII.)

I've added a feature request to the Ack3 RT queue for --encoding=utf32 and for automagic interpretation of the BOM (with flags to enable/disable it).
But reading the document as Unicode opens a can of worms regarding Unicode REs ... when is the RE to be treated as (?u:)? When does /[c]/ match "ç"? When does /\w/ match "à á â ç è é ê ì í î ô ü µ 𝛷 𝛹 𝛳 ô 𝟇 𝝿 𝜎 τ"?
Not sure what the prognosis would be.

Richard Simões

Sep 2, 2018, 3:48:36 AM
to ack users
Oh, wow, I forgot I even posted this. My current necessity for UTF-16 searching is admittedly an unusual situation for a programmer: I'm receiving CSVs to process from various collaborators, all of whom are using Microsoft Excel on either Windows or OS X. Amazingly, it turns out no version of Excel on any platform can output a CSV in UTF-8: if there are any characters outside the ASCII range, Excel will encode in the given operating system's historical proprietary encoding (i.e., Windows-1252 or Mac OS Roman). This inevitably led to the collaborators corrupting their own files when passing them among themselves. My suggestion that everyone switch to LibreOffice was met with blank stares.

After some research, we discovered that Excel on all platforms can output tab-separated values encoded in UTF-16 (w/ BOM). LibreOffice Calc can do this, too: Save a file with the "Text CSV" format and tick the "Edit filter settings" checkbox to be presented with encoding and delimiter options. This solution was acceptable to everyone, including me, until the first time I tried to ack through one of the new UTF-16-encoded files.

Bill Ricker

Sep 2, 2018, 4:16:30 PM
to ack-...@googlegroups.com
On Sun, Sep 2, 2018 at 3:48 AM Richard Simões <rsi...@gmail.com> wrote:
Oh, wow, I forgot I even posted this. My current necessity for UTF-16 searching is admittedly an unusual situation for a programmer: I'm receiving CSVs to process from various collaborators, all of whom are using Microsoft Excel on either Windows or OS X. Amazingly, it turns out no version of Excel on any platform can output a CSV in UTF-8: if there are any characters outside the ASCII range, Excel will encode in the given operating system's historical proprietary encoding (i.e., Windows-1252 or Mac OS Roman). This inevitably led to the collaborators corrupting their own files when passing them among themselves.

Passing spreadsheets around is problematic in the best of times but wow, that's special.

May I ask what sort of non-ASCII in the Excel was forcing encoding of the CSV? The usual Latin-1 mix of multinational characters, emojis, or non-Western scripts?

It did kinda make sense historically to default to the OS's proprietary extended character set for a native app, even a multi-platform native app like Excel, but wow, it doesn't have a UTF-8 option yet??? Adding UTF-16 as an option does make sense for Asian-language file sharing; lucky it's there for you!

My suggestion that everyone switch to LibreOffice was met with blank stares.

Sigh. Yeah, folks with sunk costs, both actual $$ and training/experience with MS Office, are resistant to the idea that they can get better for less, since it challenges the validity of their prior decisions. And they don't want to straddle two programs: one to collaborate with this group and one to collaborate with everyone else.

For some use cases, spreadsheets in the cloud can be better for shared data than passing files around -- Office 365 or, better yet, Google Sheets. Especially if it's data collection: Google Sheets has data-entry forms. (Not appropriate for highly sensitive information, even at the "only people with the URL can see it" level, of course.)

After some research, we discovered that Excel on all platforms can output tab-separated values encoded in UTF-16 (w/ BOM).

That is useful information!
( Let us be thankful it is with BOM! )
LibreOffice Calc can do this, too: Save a file with the "Text CSV" format and tick the "Edit filter settings" checkbox to be presented with encoding and delimiter options.

Cool!
This solution was acceptable to everyone, including me,

This is somewhat surprising; I didn't expect to see UTF-16 be useful outside of Asian text processing!

until the first time I tried to ack through one of the new UTF-16-encoded files.

Since Ack is positioned as a programmers' code search tool, with data search as a supplementary use and natural-language search as an "off label" use (nice if it works, but not really supported), we've been rather casual in our UTF-8 support. Yes, Perl and some other languages allow UTF-8 encoding of source files -- which allows non-ASCII high-bit or multi-octet characters in variable and subroutine etc. names and in string constants in code files, not just in data files/streams -- and it has mostly worked (sometimes requiring a PERL_UNICODE=SAD environment prefix). This is the first time we've run across a need to detect a BOM tag for file encoding.
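
(For the curious, that prefix looks like the line below -- the pattern and path are just placeholders. S makes the standard streams UTF-8, A decodes @ARGV as UTF-8, and D makes UTF-8 the default PerlIO layer for opens. It helps with UTF-8 files, but not with UTF-16, which is the whole problem here.)

    $ PERL_UNICODE=SAD ack 'étude' lib/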

UTF-16 is not in good odor on Linux for historical reasons. (I forget the details.) Hence LOCALE=en_US.utf-16 is not an option to force Perl to deal with it. So the simple layer of tricks we use to force UTF-8 onto Ack without explicit decoding support won't work for UTF-16. (Although if one installed an Asian UTF-32 locale on one's Ubuntu etc., they might work? Untried with Ack!)

That you're searching DATA -- that you may have programs processing it and so are using Ack to peek into the data while debugging (since we debug the GIGO data as often as we debug code!) -- adds just a little weight to the idea of detecting BOM prefixes and doing the right thing with them (decoding to internal). BOM detection and processing looks simple to _do_, but it's not so simple to expand the test suite adequately to assure it doesn't interact badly elsewhere.
// Bill

Richard Simões

Sep 2, 2018, 5:08:31 PM
to ack users

On Sunday, September 2, 2018 at 3:16:30 PM UTC-5, bill....@gmail.com wrote:

May I ask what sort of non-ASCII in the Excel was forcing encoding of the CSV? The usual Latin-1 mix of multinational characters, emojis, or non-Western scripts?

Just the usual Latin-1 mix: The data includes full names of people, some of which are Hispanic or French, and have the expected diacritics.

For some use-cases, Spreadsheets in the cloud can be better for shared data than passing them around -- Office 360 or better yet Google Sheets.  Especially  if it's data collection, Google Sheets have data entry forms.  (Not appropriate for highly sensitive information, even at the "only people with URL can see" level, of course.)

This option was considered, but Google Sheets had intolerable performance issues for the larger files. And I had so much hope for V8.
 
This solution was acceptable to everyone, including me,

This is somewhat surprising, i didn't expect to see UTF-16 be useful outside of Asian text processing!

It was definitely a surprise, and a relief. For my own purposes it was just a matter of explicitly setting I/O encodings in the project codebase. It took a couple of minutes and yielded zero complaints from the test suite.
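
(Concretely, "setting I/O encodings" just means opening the exports with an explicit layer, along the lines of the sketch below -- the file name and field handling are invented for illustration, and :encoding(UTF-16LE) matches the Excel/Calc export described above.)

    use strict;
    use warnings;
    use open ':std', ':encoding(UTF-8)';     # decoded text out to the terminal

    # Read a "UTF-16 with BOM" tab-separated export.
    my $file = 'names.tsv';                  # hypothetical file name
    open my $fh, '<:encoding(UTF-16LE)', $file or die "$file: $!";
    while (my $row = <$fh>) {
        $row =~ s/^\x{FEFF}//;               # drop the BOM on the first line
        chomp $row;
        my @fields = split /\t/, $row;
        print join(' | ', @fields), "\n";
    }
    close $fh;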
 

until the first time I tried to ack through one of the new UTF-16-encoded files.

That you're searching DATA -- that you may have programs processing and so are using Ack to peek into the data while debugging (since we debug the GIGO data as often as we debug code!) -- will add just a little weight to the idea of detecting BOM prefixes and doing the right thing with them (decoding to internal).  BOM detection and processing looks simple to _do_ but not simple to expand the test suite adequately to assure it doesn't result interact  badly elsewhere.
// Bill

For what it's worth, I'd happily settle for an explicit encoding flag that could be tucked away in a working directory's .ackrc file. Either way, thanks for the useful discussion!

--
Richard Simões
Internet

Bill Ricker

Sep 2, 2018, 5:14:29 PM
to ack-...@googlegroups.com
For what it's worth, I'd happily settle for an explicit encoding flag that could be tucked away in a working directory's .ackrc file. Either way, thanks for the useful discussion!

That, perhaps including a BOM-magic option, seems least unlikely.

Can you confirm that my suggested hack with explicit Perl invocation and IO code injection works with ack and your BOM-UTF-16 data?

Richard Simões

Sep 4, 2018, 2:08:16 AM
to ack users
Can you confirm that my suggested hack with explicit Perl invocation and IO code injection works with ack and your BOM-UTF-16 data?

With Perl 5.26 and ack 2.18:

$ perl  -C '-Mopen IO=>":encoding(UTF-16LE)"' /usr/local/bin/ack --noenv test
UTF-16LE:Partial character at /usr/local/bin/ack line 525.
UTF-16LE:Partial character at /usr/local/bin/ack line 525.
UTF-16LE:Partial character at /usr/local/bin/ack line 543, <__ANONIO__> line 1.

Bill Ricker

Sep 4, 2018, 11:45:11 AM
to ack-...@googlegroups.com
I'm shocked, shocked that an MS App writes a naughty character in UTF-16.
(Pleasantly surprised you didn't get bad surrogate pair warnings.)

Guessing they've not UTF-16LE encoded the trailing Ctrl-Z EOF ?
(Confusingly in UTF-16LE, EOF should be ^Z \000 .  Any app not doing UTF should treat as Binary and ignore ^Z !)


If you 2>/dev/null,  to mask the noise, and search for a word that appears in the files, do you get sensible results ?

Richard Simões

Sep 6, 2018, 3:40:48 PM
to ack users

On Tuesday, September 4, 2018 at 10:45:11 AM UTC-5, bill....@gmail.com wrote:
I'm shocked, shocked that an MS App writes a naughty character in UTF-16.
(Pleasantly surprised you didn't get bad surrogate pair warnings.)

Guessing they've not UTF-16LE encoded the trailing Ctrl-Z EOF ?
(Confusingly in UTF-16LE, EOF should be ^Z \000 .  Any app not doing UTF should treat as Binary and ignore ^Z !)


If you 2>/dev/null,  to mask the noise, and search for a word that appears in the files, do you get sensible results ?


Sorry, I was being foolish: There were other files ack was searching and croaking on. Against just one file of the right encoding your trick works perfectly.

Bill Ricker

Sep 6, 2018, 5:12:08 PM
to ack-...@googlegroups.com
> Sorry, I was being foolish: There were other files ack was searching and croaking on. Against just one file of the right encoding your trick works perfectly.

Ah yes, with either the UTF-8 SAD hack or the general-case Unicode injection hack, the list of files to search must be uniformly encoded -- whether the list is implicit (segregated by directory) or explicit (filtered and fed as a list, e.g. via xargs or $(cmd)).
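
For example, assuming your file(1) labels UTF-16 files as such (the exact wording varies by version, and you may want to add "little-endian" to the grep if you have a mix of byte orders), something along these lines would feed only the UTF-16 files to the injected-encoding run. A rough sketch only -- it assumes filenames without spaces or colons:

    $ file *.tsv | grep -i 'utf-16' | cut -d: -f1 |
        xargs perl -C '-Mopen IO=>":encoding(UTF-16LE)"' ~/bin/ack --noenv 'pattern'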

(Which is the advantage of adding a BOM-detector option. I don't know
if I can pull off a full fait accompli on that, as the impact on the
test suite would add complication elsewhere.)