
Copying text from n2479.pdf


Keith Thompson

Sep 25, 2020, 2:13:23 PM
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2479.pdf is a recent
draft of C20.

When I copy text from n2479.pdf, I get things like this:

The
:
strdup function
::::::
creates
::
a
:::: copy::: of::: the:::::: string::::::: pointed:: to::: by:: s:: in:: a ::::: space::::::::: allocated :: as:: if::: by :a:::: call
::
to
:::::::
malloc.
:

(It varies slightly depending on which PDF viewer I use.)

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */

Pankaj Jangid

Sep 25, 2020, 11:18:57 PM
On Fri, Sep 25 2020, Keith Thompson wrote:

> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2479.pdf is a recent
> draft of C20.
>
> When I copy text from n2479.pdf, I get things like this:
>
> The
> :
> strdup function
> ::::::
> creates
> ::
> a
> :::: copy::: of::: the:::::: string::::::: pointed:: to::: by:: s::
> in:: a ::::: space::::::::: allocated :: as:: if::: by :a:::: call
> ::
> to
> :::::::
> malloc.
> :

It is because of those wavy underlines.

Keith Thompson

Sep 26, 2020, 2:06:01 AM
Yes, that explains it, thanks. So I can copy-and-paste from N2478,
which doesn't have the wavy underlining:

The strndup function creates a string initialized with no more than
size initial characters of the array pointed to by s and up to the
first null character, whichever comes first, in a space allocated as
if by a call to malloc .

Paul

Sep 26, 2020, 7:51:57 AM
Keith Thompson wrote:
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2479.pdf is a recent
> draft of C20.
>
> When I copy text from n2479.pdf, I get things like this:
>
> The
> :
> strdup function
> ::::::
> creates
> ::
> a
> :::: copy::: of::: the:::::: string::::::: pointed:: to::: by:: s:: in:: a ::::: space::::::::: allocated :: as:: if::: by :a:::: call
> ::
> to
> :::::::
> malloc.
> :
>
> (It varies slightly depending on which PDF viewer I use.)
>

PDF files can be read into Office Word, but this only works
when the author has generated a dual-representation type of
PDF which holds info Office can use.

LibreOffice Draw can read in PDF, but not likely with any
good purpose in mind. Don't try it on this document!!!
Use it on a single page test PDF just to see how it works.

So far, nothing I have handy here looks immediately useful
in the "pure GUI power tool" department.

*******

I tried this.

mutool convert -F pdf -O decompress,clean -o n2479_out.pdf n2479.pdf # a mess

The underline effect seems to be drawn with a font that has a single
character (a sine wave) in it. In the document, where it underlines the word
"underlining", the stanza looks like this: ten sine waves underneath an
eleven-character word.

/F3 5.9776 Tf
1 0 0 1 230.857 349.568 Tm
[<0001000100010001000100010001000100010001>] TJ
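
The count of ten can be checked by splitting the hex string into character codes. This is only a sketch: it assumes the font uses 2-byte codes, which is a guess from the repeating "0001" pattern and not something confirmed from the font dictionary.

```python
# Split the hex string operand of the TJ operator into character codes.
# Assumption: 2 bytes per code, guessed from the repeating "0001".
hex_operand = "0001000100010001000100010001000100010001"

raw = bytes.fromhex(hex_operand)
codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

print(len(codes))   # how many glyphs the TJ operator shows
print(set(codes))   # which distinct character codes are used
```

Under that assumption the string holds ten copies of one code, which matches the ten sine waves.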

If converted to PostScript, the underline method looks like this.

.895628 .7673 0 0 cmyk
VWZQUL+LASY6*1 [5.9776 0 0 -5.9776 0 0 ]msf
320.52 467.331 mo
(::::::)
[4.98111 4.98114 4.98111 4.98111 4.98114 0 ]xsh

Neither method was of sufficient quality to be part of a workflow.
The document does not convert cleanly enough for this.

*******

Converted to HTML, there were no complaints about font conversion.
Loading the HTML into a browser sorta works OK. The spaghetti below
shows what the HTML section with the "underlining" text looks like.
The color is blue #0000ff.

mutool convert -F html -o n2479.html n2479.pdf

<p style="position:absolute;white-space:pre;margin:0;padding:0;top:480pt;left:111pt">
<span style="font-family:URWPalladioL,serif;font-size:9.024493pt">
text that has been deleted and
</span>

<i><span style="font-family:LASY6,serif;font-size:5.9776pt;color:#0000ff"> <=== to be
::::::::::</span></i> <=== removed

</p>

<p style="position:absolute;white-space:pre;margin:0;padding:0;top:480pt;left:233pt">
<span style="font-family:URWPalladioL,serif;font-size:8.9664pt;color:#0000ff">
underlining
</span>
<span style="font-family:URWPalladioL,serif;font-size:9.024493pt">
text that has been added. Pages that contain changes
</span></p>

The first sed expression below removed some of them, until I found
a ">:: ::: :<" one. The second expression, which also allows spaces,
may have got rid of more of them. What I'm doing is just removing
the strings of colons between tags and replacing them with a blank
>< pair, an empty text string, rather than editing the whole span
in front of it.

sed 's/>:*</></g' n2479.html > n2479sed.html

sed 's/>[: ]*</></g' n2479.html > n2479sed.html

That's as far as I got.
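
The difference between the two sed expressions can be seen on a small sample. This is a sketch in Python's re module, not part of the workflow above; the sample string is made up for illustration.

```python
import re

# Hypothetical sample: one plain run of colons, one run with spaces,
# and one span of real text.
sample = (
    "<span>::::::</span>"
    "<span>:: ::: :</span>"
    "<span>underlining</span>"
)

first = re.sub(r">:*<", "><", sample)      # like sed 's/>:*</></g'
second = re.sub(r">[: ]*<", "><", sample)  # like sed 's/>[: ]*</></g'

print(first)   # the ":: ::: :" run survives
print(second)  # both colon runs are gone
```

One caveat: the second pattern also removes text between tags that consists only of spaces, so some spacing may be lost.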

Still no good HTML to text function has shown up.
I'd like to preserve some of the positioning so the
file is human-readable.
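
One possible approach is a small parser built on Python's standard html.parser module. The sketch below assumes the markup looks like the mutool sample above (absolutely positioned <p> blocks with top/left in pt, underline glyphs in spans whose style names LASY6). The class name PositionedText and the function to_text are made up for illustration. It only orders blocks by position and joins blocks that share the same top value; it does not reproduce horizontal spacing and does not tell deleted text from added text.

```python
import re
from html.parser import HTMLParser


class PositionedText(HTMLParser):
    """Collect text of positioned <p> blocks, skipping spans in one font."""

    def __init__(self, skip_font="LASY6"):
        super().__init__()
        self.skip_font = skip_font
        self.blocks = []      # (top, left, text) for each finished <p>
        self.pos = None       # (top, left) of the <p> being read
        self.parts = []       # text pieces of the current <p>
        self.span_skip = []   # per open <span>: is it in the skipped font?

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        if tag == "p":
            top = re.search(r"top:([\d.]+)pt", style)
            left = re.search(r"left:([\d.]+)pt", style)
            self.pos = (float(top.group(1)) if top else 0.0,
                        float(left.group(1)) if left else 0.0)
            self.parts = []
        elif tag == "span":
            self.span_skip.append(self.skip_font in style)

    def handle_endtag(self, tag):
        if tag == "span" and self.span_skip:
            self.span_skip.pop()
        elif tag == "p" and self.pos is not None:
            # Collapse runs of whitespace inside the block.
            text = " ".join("".join(self.parts).split())
            if text:
                self.blocks.append((*self.pos, text))
            self.pos = None

    def handle_data(self, data):
        if self.pos is not None and not any(self.span_skip):
            self.parts.append(data)


def to_text(html):
    parser = PositionedText()
    parser.feed(html)
    parser.close()
    lines = {}
    for top, left, text in sorted(parser.blocks):
        lines.setdefault(top, []).append(text)   # same top = same line
    return "\n".join(" ".join(chunks) for _, chunks in sorted(lines.items()))


sample = """
<p style="position:absolute;white-space:pre;margin:0;padding:0;top:480pt;left:233pt">
<span style="font-family:URWPalladioL,serif;font-size:8.9664pt;color:#0000ff">underlining</span>
<span style="font-family:URWPalladioL,serif;font-size:9.024493pt">text that has been added.</span></p>
<p style="position:absolute;white-space:pre;margin:0;padding:0;top:480pt;left:111pt">
<span style="font-family:URWPalladioL,serif;font-size:9.024493pt">text that has been deleted and</span>
<i><span style="font-family:LASY6,serif;font-size:5.9776pt;color:#0000ff">::::::::::</span></i>
</p>
"""
print(to_text(sample))
```

Skipping spans by font name avoids guessing at colon patterns, but it depends on the converter naming the font in the style attribute.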

The colored text still has to be corrected. The HTML version
did not preserve the strikethru effect, and if the file is
converted to text, both old and new strings will be
included. And not all red text is strikethru text, so
finding red coloring and removing strings likely won't
work right either.

You can copy/paste out of Firefox after using

firefox n2479sed.html

That should be workable for small samples.

Paul