Subject: Re: Character Encoding (Was: while loop taking input from file via iconv)
Java Jive wrote:
> On 13/08/2021 20:28, Java Jive wrote:
>> I have the following lines in a shell script ...
>>
>> while [ -n "${LINE}" ]
>> do
>> if [ -n "${LINE}" ]
>> then
>> # Do processing
>> fi
>> done < "${DATA}"
>>
>> .... and this works fine for all but two lines in the data file, which
>> contain accented characters. A file erroneously named with an e acute
>> needs to be renamed to have an e grave, and a filename containing an e
>> umlaut needs to be moved to a new location and given a new name.
>
> Uggghhh! The reason for this disgust will become clear shortly!
>
> This is a follow up question about character encodings ...
>
> Previously I have released to my family two versions of the same archive
> of family documents going back to the reign of Queen Anne, some items
> possibly a little earlier. These documents were scanned (1o for
> original scan) and then put through four possible stages of
> post-processing:
> 2n Contrast 'normalised' using pnnorm
> 3t Textcleaned
> 4nt n followed by t
> 5tn t followed by n
>
> For each document, the best result was copied into the main archive,
> while the above preprocessing stages were left in an '_all'
> sub-directory structure, with five subdirectories named as above, each
> of which has beneath it a directory tree mirroring the main archive.
>
> The main version of the archive, which most family members seem to have
> downloaded, only included the main archive and didn't include the _all
> subdirectory with all the pre-processing results; the full version
> included this directory. IIRC, the former was compressed by WinZip from
> the archive as it existed on a Windows PC at the time, but WinZip threw
> a wobbly over the size of the full archive, so for that I had to use 7zip.
>
> Now the crunch, when I unzip these on a Linux machine, I see different
> bastardisations of accented characters. So, for example where the full
> 7zip archive when extracted shows an e acute correctly in both a console
> and a file manager listing ...
> "Chat Botté, Le" [e is correctly acute]
> ... (if you're wondering, a French children's picture book version of
> apparently 'Puss In Boots'), while with the WinZip main archive a
> console listing shows a very odd character sequence instead of the e
> acute ...
> "Chat Bott'$'\302\202'', Le"
> ... and a file manager listing has a graphic character resembling a 2x2
> matrix, concerning which note that while \302 octal = \xC2 hex, and
> \202 octal = \x82 hex, only the second of these and not the first
> appears in the symbol:
> |00|
> |82|
>
> My problem is that I can't find a search term to trap this strange
> character to correct it, for example the following, and a few similar
> that I've tried, don't work because they don't find the directory:
> mv "Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
> mv Chat\ Bott\'$\'\\302\\202\'\',\ Le "Chat Botté, Le"
>
> I could use a glob wildcard character such as '?', but currently all the
> filenames are within quotes, where globbing doesn't seem to work, and it
> would be a hell of a business removing the quotes, because many names in
> the archive use many characters that would each need to be anticipated
> and escaped for in an unquoted filename, such as spaces, ampersands,
> brackets, etc.
>
> Can anyone suggest a sequence that will find the file, when put inside
> quotes as the filename in the controlling data file mentioned previously
> in the thread, so that it can just be treated like all the other lines?
> As someone here suggested the data file is now stored as UTF-8 rather
> than ANSI as it was formerly, and some example lines are given below in
> a form for easier readability in a ng - in reality the fields are tab
> separated but here are separated by double spacing and have been further
> abbreviated to keep them from wrapping; leading symbols such as '+' and
> '=' have special meanings for the program doing the work; and, yes, the
> commands are basically DOS commands which for Linux are translated to
> their bash equivalents:
>
> =ATTRIB -R "./F H/Close/Sts Mary & John Churchyard Monuments.pdf"
> =RD "./F H /_all/1o/Blessig & Heyder"
> REN "./Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
> MOVE "./Photo - D & M Close.png" "./Photos/D & M Close.png"
> [etc]
>
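On the two mv attempts quoted above: the console listing appears to be
three shell-quoted pieces glued together, 'Chat Bott' then $'\302\202'
then ', Le'. Wrapping the whole thing in double quotes turns the $'...'
part into literal text, so the name no longer matches. A minimal sketch,
assuming bash (for its $'...' quoting) and assuming the directory name
really contains the two bytes \302\202 as listed. The function name
rename_bad_name is made up for illustration.

```shell
#!/usr/bin/env bash
# rename_bad_name DIR
# Rename the mis-encoded directory inside DIR (hypothetical helper).
rename_bad_name() {
    local dir=$1
    # $'...' makes bash turn the octal escapes into raw bytes.
    local old=$'Chat Bott\302\202, Le'   # bytes C2 82, as in the listing
    local new=$'Chat Bott\303\251, Le'   # bytes C3 A9, e acute in UTF-8
    # "--" stops option parsing, in case a name ever starts with "-".
    mv -- "$dir/$old" "$dir/$new"
}
```

Called as rename_bad_name "." from the directory holding the bad name.
The variables are double-quoted when used, so spaces, ampersands and
brackets in the rest of the name need no escaping.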
https://stackoverflow.com/questions/4177783/xc3-xa9-and-other-codes/4177813#4177813
It looks like this "text string" for the filename perhaps
went through some web encoding at some point. With a hex
editor, I can change C3 A9 to E9 hex, and the character in
the hex editor (on the right-hand side) looks visually correct.
https://i.postimg.cc/TP57bLD9/C3-A9-to-E9.gif
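To compare the byte values in play, here is a small bash sketch. It
assumes bash plus the standard od and tr tools, and hex_of is a made-up
helper name. Note that the console listing quoted above shows \302\202,
which is C2 82 hex rather than C3 A9. C2 82 is the UTF-8 encoding of
U+0082, which would fit the 00/82 box in the file manager, and 82 hex is
e acute in the old DOS code pages (CP437, CP850), so a DOS-encoded name
read as Latin-1 is one possible explanation, though that is a guess.

```shell
#!/usr/bin/env bash
# hex_of STRING
# Print the bytes of STRING as lowercase hex with no separators.
hex_of() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

hex_of $'\302\202'; echo   # the pair from the console listing -> c282
hex_of $'\303\251'; echo   # e acute encoded as UTF-8           -> c3a9
hex_of $'\351'; echo       # e acute encoded as ISO-8859-1      -> e9
```

Piping the output of ls through the same od command shows which of
these a given filename actually contains.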
You could do such an operation, in Perl, right on the
file system.
*********************** rename2.ps *************************
use Cwd;    # provides getcwd
printf("this is a test\n");
$start = "Chat Bott";
$finish = ", Le";
# Double-quoted strings, so the \x{..} escapes become raw bytes.
$naughty1 = "\x{C3}\x{A9}" ;    # two bytes: e acute as UTF-8
$naughty2 = "\x{E9}" ;          # one byte: e acute as Latin-1
$x = $start.$finish ;
$y = $start.$naughty1.$finish ;
$z = $start.$naughty2.$finish ;
# Create (or touch) one empty test file per spelling of the name.
open(OUT, ">>$x") || die("Cannot create X");
close(OUT);
open(OUT, ">>$y") || die("Cannot create Y");
close(OUT);
open(OUT, ">>$z") || die("Cannot create Z");
close(OUT);
$c = getcwd ;
printf("Making a mess in %s\n", $c );
# Untested: enable this (and comment out the creation of $z) to try the rename.
#rename( $y , $z );
exit(0);
*********************** end of rename2.ps *************************
I ran this in Windows 11 by double-clicking the file. I
could not run it using one of their terminals. I just thought
it was mildly amusing what the filenames looked like.
The idea of the script above is that you run it multiple times,
commenting out a line here or there, while you do your tests.
For example, comment out the creation of file $z and
enable the rename($y, $z) command near the bottom, to see
if the created $y can be renamed to the presumed operational $z value.
https://i.postimg.cc/gksLyGFL/rename2-output.gif [Picture]
So far, I only tested it as copy/pasted above. I haven't
tested the rename.
Then you'd need a recursive directory-walk ("find-next-file")
type pattern that looks at each filename returned, checks
whether it contains $naughty1 as a substring, and renames
it if so. Maybe something like one of the examples here.
https://stackoverflow.com/questions/5089680/how-to-find-files-folders-recursively-in-perl-script
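The same walk-and-rename idea can be sketched in bash rather than Perl.
This is only a sketch under assumptions: bash, a find that supports
-print0 (GNU and BSD both do), and the byte pair C2 82 from the console
listing as the thing to replace. The name fix_names is made up. Try it
on a copy of the tree first.

```shell
#!/usr/bin/env bash
# fix_names ROOT
# Under ROOT, rename every entry whose name contains the byte pair
# C2 82, replacing that pair with C3 A9 (e acute in UTF-8).
fix_names() {
    local root=$1
    local bad=$'\302\202'
    local good=$'\303\251'
    local path dir base
    # -depth lists a directory's contents before the directory itself,
    # so renaming a directory never breaks paths still to be processed.
    # LC_ALL=C makes find compare names byte by byte.
    LC_ALL=C find "$root" -depth -name "*${bad}*" -print0 |
    while IFS= read -r -d '' path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        # Replace every occurrence of the bad pair in the last component.
        mv -- "$path" "$dir/${base//"$bad"/$good}"
    done
}
```

Called as fix_names "." from the top of the archive. Only the last path
component is rewritten on each rename, which is why the deepest entries
have to be handled first.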
File renaming is the only thing I've done with Perl :-)
I'll never be a Perl person I guess.
Paul