On 2018-08-23 19:25, Richard Simões wrote:
> Is it possible to configure ack to search through UTF-16 files?
> Setting the system locale isn't a workable solution for me, as there
> cannot be an en_US.utf-16 locale for Linux.
I'd be shocked if there *can't* be. It's probably
...
On Thu, Aug 23, 2018 at 6:26 PM David Cantrell <da...@cantrell.org.uk> wrote:
> On 2018-08-23 19:25, Richard Simões wrote:
> > Is it possible to configure ack to search through UTF-16 files?
On Fri, Aug 24, 2018 at 05:53:17PM -0400, Bill Ricker wrote:
> *DISCUSSION*
> Since we primarily position Ack as a programmer's code search / spelunking
> tool, I'm not confident we'll accept a feature request to enable
> UTF-16/UCS-2 and auto-detection via BOM of 8/16/32 and BE/LE ... but I'm
> tempted to try anyway; no promises!
FWIW while I do mostly use ack for grovelling over code, I also use it
for grovelling over documentation and config data, both of which can
contain non-ASCII code points.
Thankfully *I* only have to work (at
least currently) with utf8 and ascii documentation and data, but I can
imagine that there are poor souls out there working with other
encodings.
I'd love to see ack spawn something like an --encoding=utf-16 tentacle,
with maybe an --encoding=automatic that could be stuffed into a .ackrc.
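Until something like that exists, one workaround is to transcode on the fly before searching. A minimal sketch, assuming GNU/glibc iconv (which honors the BOM for a plain `UTF-16` source encoding); the sample file and search word are made up, and grep stands in for ack here only so the sketch runs anywhere:

```shell
# Make a tiny UTF-16LE sample (BOM + the word "needle"), purely for
# illustration, then transcode it to UTF-8 on the fly and search the
# stream. In real use you'd pipe into ack instead of grep.
printf '\377\376n\000e\000e\000d\000l\000e\000\n\000' > sample16.tsv
iconv -f UTF-16 -t UTF-8 sample16.tsv | grep needle
```

The same pipeline works for ad-hoc searches today: `iconv -f UTF-16 -t UTF-8 file | ack pattern`.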
Oh, wow, I forgot I even posted this. My current necessity for UTF-16 searching is admittedly an unusual situation for a programmer: I'm receiving CSVs to process from various collaborators, all of whom are using Microsoft Excel on either Windows or OS X. Amazingly, it turns out no version of Excel on any platform can output a CSV in UTF-8: if there are any characters outside the ASCII range, Excel will encode in the given operating system's historical proprietary encoding (i.e., Windows-1252 or Mac OS Roman). This inevitably led to the collaborators corrupting their own files when passing them among themselves.
My suggestion that everyone switch to LibreOffice was met with blank stares.
After some research, we discovered that Excel on all platforms can output tab-separated values encoded in UTF-16 (w/ BOM).
LibreOffice Calc can do this, too: Save a file with the "Text CSV" format and tick the "Edit filter settings" checkbox to be presented with encoding and delimiter options.
This solution was acceptable to everyone, including me,
until the first time I tried to ack through one of the new UTF-16-encoded files.
May I ask what sort of non-ASCII content in the Excel files was forcing the CSV encoding? The usual Latin-1 mix of multinational characters, emoji, or non-Western scripts?
For some use-cases, spreadsheets in the cloud can be better for shared data than passing files around -- Office 365 or, better yet, Google Sheets. Especially if it's data collection, Google Sheets has data-entry forms. (Not appropriate for highly sensitive information, even at the "only people with URL can see" level, of course.)
> This solution was acceptable to everyone, including me, until the first
> time I tried to ack through one of the new UTF-16-encoded files.
This is somewhat surprising; I didn't expect to see UTF-16 be useful outside of Asian text processing!
That you're searching DATA -- that you may have programs processing it, and so are using ack to peek into the data while debugging (since we debug the GIGO data as often as we debug code!) -- will add just a little weight to the idea of detecting BOM prefixes and doing the right thing with them (decoding to internal). BOM detection and processing looks simple to _do_, but it's not simple to expand the test suite adequately to assure it doesn't interact badly with anything elsewhere.
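For anyone who wants to experiment before ack grows such a feature, the BOM check itself is indeed the easy part. A minimal shell sketch (the function name and filenames are made up); it reads four bytes because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM:

```shell
# Hypothetical helper: report the Unicode BOM (if any) at the start of
# a file, by hex-dumping its first four bytes.
detect_bom() {
  b=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  case $b in
    efbbbf*)   echo "UTF-8 (with BOM)" ;;
    fffe0000*) echo "UTF-32LE" ;;   # must be tested before the UTF-16LE case
    0000feff*) echo "UTF-32BE" ;;
    fffe*)     echo "UTF-16LE" ;;
    feff*)     echo "UTF-16BE" ;;
    *)         echo "no BOM" ;;
  esac
}
# Illustration with a made-up UTF-16LE file ("hi" after the BOM):
printf '\377\376h\000i\000' > sample_le.txt
detect_bom sample_le.txt   # prints: UTF-16LE
```

The hard part, as noted above, is everything after detection: wiring the decode into ack's file handling without regressing anything else.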
// Bill
For what it's worth, I'd happily settle for an explicit encoding flag that could be tucked away in a working directory's .ackrc file. Either way, thanks for the useful discussion!
Can you confirm that my suggested hack with explicit Perl invocation and IO code injection works with ack and your BOM-UTF-16 data?
I'm shocked, shocked that an MS app writes a naughty character in UTF-16. (Pleasantly surprised you didn't get bad-surrogate-pair warnings.) Guessing they've not UTF-16LE-encoded the trailing Ctrl-Z EOF? (Confusingly, in UTF-16LE, EOF should be ^Z \000. Any app not doing UTF should treat the file as binary and ignore ^Z!) If you 2>/dev/null to mask the noise, and search for a word that appears in the files, do you get sensible results?
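The Ctrl-Z question above can be checked directly by dumping the file's trailing bytes: a properly UTF-16LE-encoded ^Z would appear as `1a 00`, while a legacy single-byte EOF marker shows up as a lone `1a`. A sketch with a fabricated sample (real Excel output may differ):

```shell
# Fabricate a tiny UTF-16LE-ish file ("hi", with a bare un-encoded ^Z
# appended) matching the hypothesis above, then dump its last bytes.
printf 'h\000i\000\032' > sample_eof.txt
tail -c 3 sample_eof.txt | od -An -tx1   # shows: 69 00 1a
```

Here the trailing `1a` has no following `00`, i.e. the ^Z was appended as a raw byte rather than as a UTF-16LE code unit.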