I wanted to see what a *.doc file uses for the 'single quote char',
so I started from the default mc 'script' which shows most
file formats as plain text:
... catdoc -w %f || word2x -f text %f - || strings %f
[which I think means: try catdoc, else word2x, else strings
on the file]. Since I've only got 'strings', I tried:
cat <*.doc> | strings | fmt > doc2text
which AFAIK simply extracts only the ASCII text, formats it to
a line length of about 74, and then saves it to the file doc2text.
BUT: the single quotes were all missing, as in "man's hat".
Searching in the original *.doc is problematic too,
eg. # cat headsofargumentoffirstand.doc | grep tate
== Binary file (standard input) matches
because grep treats input containing non-text bytes as binary
and only reports that it matches, so I can't see the single-quote char.
Eg. mc's edit finds: "...constitutes a breach of the States.."
where the missing single-quote-char is probably there between
the "e" & "s", but, being non-ascii/M$hit-style is unrendered.
BTW: 'fmt' cuts the lines to a manageable length, but still
# cat x | fmt | grep State
== Binary file (standard input) matches
And man grep says:
" -a, --text
Process a binary file as if it were text; this is equivalent to
the --binary-files=text option. "
which prompted me to try: # cat x | fmt | grep -a State
== shows eg. "..the States positive obligation.."
So here's the 2nd question: what common util will show the
hex/binary/octal value of the byte(s) at a specifiable position in
a file?
In this particular case, perhaps sed could extract the known
'line' [after applying fmt] to a file, where mc would show
the value of every char, including the non-ASCII ones?
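For the second question above, one commonly available candidate is od, which can skip to a byte offset and dump a fixed number of bytes in hex. The sketch below is only an illustration: the sample file is made up, and it assumes the apostrophe is stored as the single Windows-1252 byte 0x92 (octal 222), which may not hold for every .doc. The -a, -b and -o options of grep are GNU/BSD extensions, not POSIX.

```shell
#!/bin/sh
# Hypothetical sample standing in for the real .doc: the byte \222 (0x92)
# is the Windows-1252 right single quote.
sample=$(mktemp)
printf 'breach of the State\222s positive obligation\n' > "$sample"

# -a: treat binary input as text, -b: print the byte offset of each match,
# -o: print only the matched part. Keep the offset of the first match.
offset=$(LC_ALL=C grep -a -b -o 'State' "$sample" | head -n 1 | cut -d: -f1)

# Dump 8 bytes starting at that offset: -A d prints decimal offsets,
# -t x1 prints one hex byte per column, -j skips, -N limits the count.
dump=$(od -A d -t x1 -j "$offset" -N 8 "$sample")
printf '%s\n' "$dump"

rm -f "$sample"
```

In the dump, any byte with a value of 80 hex or above is outside ASCII; here it is the 92 that sits between the bytes for "e" (65) and "s" (73).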
== Chris Glur.
PS. This is the type of 'transparent' post, showing the reasoning
behind it, which *I'd* also like to get on Usenet, rather than
the common format:
"do wizz-bang-wow",
without any background explanation.
MS uses their own exotic versions of many common *ASCII* punctuation
characters. For some reason, known *only* to the people at MS, the
*standard* ASCII characters are not good enough. I have *heard* it has
something to do with variable-width fonts, but I don't understand why
these fonts can't use suitable glyphs at the *standard* ASCII character
positions. Hotmail.com footers often use these exotic characters as
well, which does bad things when such messages are read by a plain-ASCII
E-Mail client.
--
Robert Heller -- 978-544-6933
Deepwoods Software -- Download the Model Railroad System
http://www.deepsoft.com/ -- Binaries for Linux and MS-Windows
hel...@deepsoft.com -- http://www.deepsoft.com/ModelRailroadSystem/
> MS uses their own exotic versions of many common *ASCII* punctuation
> characters. For some reason, known *only* to the people at MS, the
> *standard* ASCII characters are not good enough.
Purposeful un-interoperability is the reason.
>> MS uses their own exotic versions of many common *ASCII* punctuation
>> characters. For some reason, known *only* to the people at MS, the
>> *standard* ASCII characters are not good enough.
>
> Purposeful un-interoperability is the reason.
Not exactly true. They're using matched curling quote marks, have been
since Word 6 IIRC. Very irritating when you wanted just plain quotes but
it is a standard now, covered by Unicode, and the PHB likes it (always a
MS priority. What? You think they got rich by making good software? :) )
Quick google turned up this which seems to cover the basics:
http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
Should be easy enough to transcode the fancy quotes to old fashioned
neutral ones with a bit of sed or something.
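A minimal sketch of that transcoding, using tr rather than sed because octal escapes in tr are portable. It assumes the text is 8-bit Windows-1252, where the curly quotes are the single bytes 0x91 to 0x94; it is not suitable for UTF-8 or UTF-16 text, where those byte values mean something else. The function name dequote and the sample string are made up for illustration.

```shell
#!/bin/sh
# Map the four Windows-1252 "smart quote" bytes to neutral ASCII quotes:
#   \221 (0x91) and \222 (0x92) -> '
#   \223 (0x93) and \224 (0x94) -> "
# LC_ALL=C makes tr work on single bytes regardless of the locale.
dequote() {
    LC_ALL=C tr '\221\222\223\224' \'\'\"\"
}

# Hypothetical sample input; \222 stands in for the apostrophe from Word.
fixed=$(printf 'the State\222s \223positive\224 obligation\n' | dequote)
printf '%s\n' "$fixed"
```

Used as a filter, this would sit at the end of a pipeline, for example after strings or fmt.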
Blumf
I'm not sure why it was expected that opening and closing quotes would
be the same in a Word document, or any other word processing, desktop
publishing, or document markup language. They are not the same
character in real formatted text.
- Kurt
True. But Microsoft doesn't even use ASCII apostrophe for apostrophes
in MS-Word documents! LaTeX uses ` and ' (`` and '' for doubles).
You got me curious - neither does LaTeX. The .tex input file is ASCII -
by definition....
The .dvi file uses 0x60 (`) and 0x27 (') for single quotes. It also
uses 0x22 (") for closing double quotes. But, the .dvi file appears to
use 0x5c for opening double quotes?
Either way, the MS codes for this situation seem to be fairly well
documented. No idea if MS supplied them, or if they had to be reverse
engineered. It is their format, they should be able to do whatever they
want with it. I really can't fault them here.
- Kurt
>> True. But Microsoft doesn't even use ASCII apostrophe for apostrophes
>> in MS-Word documents! LaTeX uses ` and ' (`` and '' for doubles).
>
> Either way, the MS codes for this situation seem to be fairly well
> documented. No idea if MS supplied them, or if they had to be reverse
> engineered. It is their format, they should be able to do whatever they
> want with it. I really can't fault them here.
People use them in email, and if you run a mailing list that converts
single messages into digest mode, it's one of those things you have to
convert....
Well, people are stupid. I still can't understand what exactly people
are gaining where I work with all the fancy formatted email that gets
sent around nowadays. It is as bad as the people who feel the need to
send a jpg image inside of a PowerPoint presentation (no joke, seen it
happen), or an Excel spreadsheet for a short list of items. Hell, some
morons actually compose email in Word, and then send that as an
attachment!
- Kurt
I can beat that; screen shot of a console window, embedded in a Word
doc. Punch line being, half the text I was interested in had scrolled
off the visible area.
Blumf
Wow, and I thought HTML email was annoying...
Honestly, I find that 99% of computer users don't think about file
formats at all. They don't think "ah, this is a Microsoft Word
document, MIME type 'application/msword', and here's the reason I have
chosen to use this format", but rather "this is a Word document. This
is what I use to type stuff."
Damnit, people. *Care* more!
> Honestly, I find that 99% of computer users don't think about file
> formats at all. They don't think "ah, this is a Microsoft Word
> document, MIME type 'application/msword', and here's the reason I have
> chosen to use this format", but rather "this is a Word document. This
> is what I use to type stuff."
>
> Damnit, people. *Care* more!
Most users think a .doc file is text.
It was, back in the heyday of MSDOS...
--
"Ubuntu" -- an African word, meaning "Slackware is too hard for me".
The Usenet Improvement Project: http://improve-usenet.org
Ahhhhhhhh!: http://brandybuck.site40.net/pics/relieve.jpg
> On Fri, 10 Jul 2009 16:31:26 +0000, jellybean stonerfish wrote:
>
>> On Fri, 10 Jul 2009 15:13:06 +0000, Logan Rathbone wrote:
>>
>>> Honestly, I find that 99% of computer users don't think about file
>>> formats at all. They don't think "ah, this is a Microsoft Word
>>> document, MIME type 'application/msword', and here's the reason I have
>>> chosen to use this format", but rather "this is a Word document. This
>>> is what I use to type stuff."
>>>
>>> Damnit, people. *Care* more!
>>
>> Most users think a .doc file is text.
>
> It was, back in the heyday of MSDOS...
...twenty years ago. (I still sometimes find software with plain-text .doc
files. Kinda rare, but still happens.)
--
Alcohol makes you immune to gravity. And bulletproof.
Yea, a library in my current project has a bunch of .doc documentation
files that are all ASCII.
- Kurt
> Not exactly true. They're using matched curling quote marks, have been
> since Word 6 IIRC. Very irritating when you wanted just plain quotes but
> it is a standard now, covered by Unicode, and the PHB likes it (always a
> MS priority. What? You think they got rich by making good software? :) )
To be fair, m4, used to configure sendmail, uses the very odd quote character
on the top left of the keyboard; IIRC so does bash.
Pete
The backquote character is a legitimate ASCII character. M$ uses 8-bit
character codes *instead* of available ASCII characters and does so
where they shouldn't (like in Hotmail.com et al. footers).
>To be fair, m4 used to configure sendmail uses the very odd quote character
>on the top left of the keyboard, iirc so does bash.
Don't use m4 here (directly). It's a backtick '`' (common name).
Bash no longer uses (in recommended usage) the backtick, so commands
in a new shell now use $(), for example 'timestamp=$(date +%F-%T)'
instead of the obsolete backtick version 'timestamp=`date +%F-%T`'.
Slackware scripts are full of backticks, showing their incredible age ;)
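One practical reason for preferring $() shows up when substitutions are nested. The sketch below is only an illustration; the path is hypothetical and is used purely as a string, so it does not need to exist.

```shell
#!/bin/sh
# Hypothetical path, used only as a string; it need not exist.
path=/usr/local/bin/tool

# $( ) nests without any extra escaping.
new_style=$(basename "$(dirname "$path")")

# Backticks need the inner pair escaped with backslashes.
old_style=`basename \`dirname $path\``

printf '%s %s\n' "$new_style" "$old_style"
```

Both forms give the same result here; the $() version stays readable as the nesting gets deeper, while each extra level of backticks needs another layer of backslashes.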
I miss the days (daze?) of 7-bit ascii, so much easier back then, at
least for those with English as only language and only the odd US English
spelling issue to worry about :)
And WordStar used the character high bit (msb, bit 7) as a word delimiter
in doc mode.
Grant.
--
http://bugsplatter.id.au
>The backquote character is a legitimate ASCII character. M$ uses 8-bit
>character codes *instead* of available ASCII characters and does so
>where they shouldn't (like in Hotmail.com et al. footers).
MSFT does all sorts of things they shouldn't -- virtual monopolies are
like that...
Recently I viewed a .pdf file where the author had let the .pdf engine
replace all unisex quotes with 66, 99 and 6, 9 style quotes. Okay, you
think? No, it destroyed all the scripts in the document because a
straight quoted value like "a" became ``a´´ (pair each of #96 'grave
accent' - backtick, and #180 'acute accent').
It's a mess, yes?
Grant.
> On Sat, 11 Jul 2009 10:12:17 -0500, Robert Heller <hel...@deepsoft.com>
> wrote:
>
>>The backquote character is a legitimate ASCII character. M$ uses 8-bit
>>character codes *instead* of available ASCII characters and does so
>>where they shouldn't (like in Hotmail.com et al. footers).
>
> MSFT does all sorts of things they shouldn't -- virtual monopolies are
> like that...
"Virtual"?
--
Armageddon, here we come!