On 13/08/2021 20:28, Java Jive wrote:
> I have the following lines in a shell script ...
>
> while [ -n "${LINE}" ]
> do
> if [ -n "${LINE}" ]
> then
> # Do processing
> fi
> done < "${DATA}"
>
> .... and this works fine for all but two lines in the data file, which
> contain accented characters. A file erroneously named with an e acute
> needs to be renamed to have an e grave, and a filename containing an e
> umlaut needs to be moved to a new location and given a new name.
Uggghhh! The reason for this disgust will become clear shortly!
This is a follow-up question about character encodings ...
Previously I have released to my family two versions of the same archive
of family documents going back to the reign of Queen Anne, some items
possibly a little earlier. These documents were scanned (1o for
original scan) and then put through four possible stages of post-processing:
2n Contrast 'normalised' using pnmnorm
3t Textcleaned
4nt n followed by t
5tn t followed by n
For each document, the best result was copied into the main archive,
while the results of the above post-processing stages were left in an
'_all' sub-directory structure, with five subdirectories named as above,
each of which has beneath it a directory tree mirroring the main archive.
The main version of the archive, which most family members seem to have
downloaded, included only the main archive, without the _all
subdirectory holding all the post-processing results; the full version
included this directory. IIRC, the former was compressed by WinZip from
the archive as it existed on a Windows PC at the time, but WinZip threw
a wobbly over the size of the full archive, so for that I had to use 7zip.
Now the crunch: when I unzip these on a Linux machine, I see different
bastardisations of accented characters. For example, the full 7zip
archive when extracted shows an e acute correctly in both a console and
a file manager listing ...
"Chat Botté, Le" [e is correctly acute]
... (if you're wondering, apparently a French children's picture book
version of 'Puss In Boots'), whereas with the WinZip main archive a
console listing shows a very odd character sequence instead of the e
acute ...
"Chat Bott'$'\302\202'', Le"
... and a file manager listing has a graphic character resembling a 2x2
matrix. Note that while \302 octal = \xC2 hex and \202 octal = \x82 hex,
only the second of these, and not the first, appears in the symbol:
|00|
|82|
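For anyone wanting to check the byte values, here is a minimal sketch,
assuming only a POSIX shell with the standard printf, od and tr tools.
The variable names are mine, purely for illustration. It rebuilds the two
bytes from the octal escapes and dumps them as hex. As far as I can tell,
C2 82 is the UTF-8 encoding of the single code point U+0082, which would
be consistent with the 00/82 box shown by the file manager.

```shell
#!/usr/bin/env bash
# The two bytes that the console listing shows as \302\202 (octal).
bad=$(printf '\302\202')

# Dump them as hex, stripping od's spacing, to confirm the
# octal-to-hex correspondence (\302 = c2, \202 = 82).
hex=$(printf '%s' "$bad" | od -An -tx1 | tr -d ' \n')
echo "$hex"
```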
My problem is that I can't find a search term to trap this strange
character to correct it. For example, the following, and a few similar
variants that I've tried, don't work because they don't find the directory:
mv "Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
mv Chat\ Bott\'$\'\\302\\202\'\',\ Le "Chat Botté, Le"
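One thing that may be worth trying, offered as a sketch rather than a
confirmed fix: as far as I understand it, the '$'\302\202'' text in the
console listing is how ls displays the name in shell-quoted form, and
bash only expands $'...' when it sits outside double quotes. The example
below assumes bash and works in a throwaway directory (the variable
'work' is hypothetical); it builds the name from adjacent quoted pieces.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)

# Recreate the badly named directory; its name holds the raw bytes C2 82.
mkdir "$work/Chat Bott"$'\302\202'", Le"

# $'...' is expanded by bash only when it sits outside double quotes,
# so the old name is assembled from adjacent quoted pieces.
mv "$work/Chat Bott"$'\302\202'", Le" "$work/Chat Botté, Le"

ls "$work"
```

The design point is that the double quotes are closed just before the
$'...' piece and reopened just after it, so everything else in the name
stays quoted as before.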
I could use a glob wildcard character such as '?', but currently all the
filenames are within quotes, where globbing doesn't seem to work, and it
would be a hell of a business removing the quotes, because names in the
archive use many characters that would each need to be anticipated and
escaped in an unquoted filename, such as spaces, ampersands, brackets,
etc.
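On the globbing point, a hedged sketch of a middle way: the quotes need
not be removed wholesale, because a wildcard can sit between two quoted
pieces while the spaces, ampersands and brackets around it stay quoted
and therefore literal. The directory name and the variables 'work', 'odd'
and 'found' below are made up for illustration. I have used '*' rather
than '?' because whether '?' matches one byte or one character depends
on the locale.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)
odd=$(printf '\302\202')          # the troublesome two bytes

# A made-up name mixing awkward characters with the odd bytes.
mkdir "$work/D & M (Close) ${odd} [1]"

# Only the * is unquoted; the quoted parts, brackets included, are literal.
found=
for d in "$work/D & M (Close) "*" [1]"
do
    found=$d
done
printf '%s\n' "$found" | od -c | head -n 3
```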
Can anyone suggest a sequence that will find the file, when put inside
quotes as the filename in the controlling data file mentioned previously
in the thread, so that it can just be treated like all the other lines?
As someone here suggested, the data file is now stored as UTF-8 rather
than ANSI as it was formerly. Some example lines are given below in a
form for easier readability in a ng: in reality the fields are tab
separated, but here they are separated by double spacing and have been
further abbreviated to keep them from wrapping. Leading symbols such as
'+' and '=' have special meanings for the program doing the work, and,
yes, the commands are basically DOS commands which for Linux are
translated to their bash equivalents:
=ATTRIB -R "./F H/Close/Sts Mary & John Churchyard Monuments.pdf"
=RD "./F H /_all/1o/Blessig & Heyder"
REN "./Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
MOVE "./Photo - D & M Close.png" "./Photos/D & M Close.png"
[etc]
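To make the question concrete, here is a rough sketch of the kind of
handling I have in mind. It is not the real program: the file name
'changes.tsv', the three-field layout and the REN-only case statement
are assumptions for illustration. The idea it tests is that if the data
file holds the raw bytes C2 82 in the old name, rather than the escaped
form that the console listing displays, then the line can be read and
quoted exactly like any other.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)
data="$work/changes.tsv"          # hypothetical control file
tab=$(printf '\t')
odd=$(printf '\302\202')          # raw bytes, not the escaped display form

mkdir "$work/Chat Bott${odd}, Le"

# One tab-separated REN line: command, quoted old path, quoted new name.
printf 'REN\t"%s"\t"%s"\n' "./Chat Bott${odd}, Le" "Chat Botté, Le" > "$data"

while IFS=$tab read -r cmd src dst
do
    # Strip the surrounding double quotes from each field.
    src=${src#\"}; src=${src%\"}
    dst=${dst#\"}; dst=${dst%\"}
    case "$cmd" in
        # DOS REN takes a bare new name, so keep the old directory part.
        REN) ( cd "$work" && mv -- "$src" "$(dirname -- "$src")/$dst" ) ;;
    esac
done < "$data"

ls "$work"
```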