OT Export Table From PDF

BeeJ

Apr 24, 2013, 9:20:35 PM

I have a 50 page doc with a table spanning many pages.
The source is an HTML webpage that I printed to a .PDF.

I want to extract the table to a .CSV.

So my options are:

Show the HTML page source, which opens in Notepad++. Then what?

Print the HTML page to what? Is there a Win XP printer driver that I
can use to print to a pure text file and then fiddle with it? A lot of work.

Print to PDF as I have done and use Adobe's Save As to convert to one of
several formats, none of which I am sure will easily get me to a .CSV.
Adobe Save As choices, other than image formats like JPG:
EPS
HTML
DOC
PS
RTF
TXT
XML

Maybe the DOC would allow exporting tables to an XLS and then to a .CSV,
but I will have to find Office to do that.

Are there any tools that would help?

ralph

Apr 24, 2013, 9:57:32 PM

Simple mining of the HTML may be enough. And I mean a basic, brute-force
enumerate-and-capture of the data elements.
How are the items contained in the HTML file? In a /table, /list, ...?
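
A minimal sketch of that enumerate-and-capture idea, written in Python rather than VB. It assumes the data sits in an ordinary HTML table whose rows and cells are explicitly closed (`</tr>`, `</td>`); the names `TableRows`, `html_table_to_csv`, and the sample markup are made up for illustration and are not from the original page.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (layout whitespace, captions) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell is one clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html_text):
    """Return the table rows found in html_text as comma-separated text."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical sample standing in for the saved page source.
sample = """
<table>
  <tr><th>Part</th><th>Qty</th></tr>
  <tr><td>Widget, large</td><td>4</td></tr>
</table>
"""
print(html_table_to_csv(sample))
```

The `csv` writer quotes any cell that itself contains a comma, which is the main thing a hand-rolled join tends to get wrong. If the page holds several tables, this sketch lumps all their rows together.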

-ralph

CoderX

Apr 25, 2013, 12:46:34 AM

Google it, lazy fuck. Or join a PDF group; they are out there. You'd know
this, had you done so.

Do you understand how badly you suck?

"BeeJ" <spa...@nospam.com> wrote in message
news:kla0d2$vub$1...@speranza.aioe.org...

Deanna Earley

Apr 25, 2013, 3:54:56 AM

On 25/04/2013 02:20, BeeJ wrote:
> I have a 50 page doc with a table spanning many pages.
> The source is an HTML webpage that I printed to a .PDF.
>
> I want to extract the table to a .CSV.

Most modern spread sheets should allow you to import it and save as CSV.

--
Deanna Earley (dee.e...@icode.co.uk)
iCatcher Development Team
http://www.icode.co.uk/icatcher/

iCode Systems

(Replies direct to my email address will be ignored. Please reply to the
group.)

Deanna Earley

Apr 25, 2013, 4:01:28 AM

On 25/04/2013 08:54, Deanna Earley wrote:
> On 25/04/2013 02:20, BeeJ wrote:
>> I have a 50 page doc with a table spanning many pages.
>> The source is an HTML webpage that I printed to a .PDF.
>>
>> I want to extract the table to a .CSV.
>
> Most modern spread sheets should allow you to import it and save as CSV.

applications ^

BeeJ

Apr 25, 2013, 12:54:43 PM

I finally did it, but with great pain.

All the attempts to use Adobe failed to produce usable output from the
PDF. I got text but with table columns messed up and a blizzard of
blank lines. Adobe produced a 500 MByte .DOC that Libre Office choked
on; it stalled and had to be killed from Task Manager.
I tried all the Adobe Save As formats to see what I could get. Nada.

So I went back to the webpage and viewed the source.
Ctrl-A then Ctrl-C and Ctrl-V into Notepad++.

There I tried to remove blank lines using the \x option, searching for
\n\n and replacing with \n, but that did not work. I could search for
\n and it found it where expected, but it could not find \n\n. Question:
can Notepad++ show those control characters and extended chars as plain
text?

In Notepad++ I could see that the table cells had a tab separator. A
glimmer of hope loomed.

I saved the file as .txt.
I opened the Libre Office spreadsheet and sucked in the text file,
designating the separator as a tab. It loaded quickly and displayed a
spreadsheet that appeared to have one column. I quickly discovered that
the auto column size was on and went about setting the column width.
Then I scanned the file for long text strings and removed all
superfluous text and finally got a good-looking spreadsheet. A little
time consuming, but what the heck. What else do I have to do?

Then I tried to save it as .CSV, only to find that the resultant file
had tabs as separators, not commas. Why? Maybe because I had
specified tab on import. Well, tabs will work too, but I would like to
know why I could not specify the separator like I can in MS Office. I
gave up on Office long ago when they tried to put a ribbon in my hair.
I'm just not that type. I have pulled enough hair on menus, so anyway
there is little left for a ribbon.
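
If the exported file does end up tab-separated, turning it into a comma-separated one afterwards is a small job. The following is only a sketch in Python, assuming the cells themselves contain no tabs; `tsv_to_csv` and the sample text are invented for the illustration.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Re-write tab-separated text as comma-separated text."""
    reader = csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if row:  # csv.reader yields [] for an empty line; skip those
            writer.writerow(row)
    return out.getvalue()


# Hypothetical two-row export; the second row has a comma inside a cell.
sample = "Part\tQty\nWidget, large\t4\n"
print(tsv_to_csv(sample))
```

Going through the `csv` module rather than a plain search-and-replace of tabs matters as soon as a cell contains a comma, because the writer then wraps that cell in quotes.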

ralph

Apr 25, 2013, 1:41:00 PM

Brute force text parsing is probably starting to look good about now.
<g>

-ralph

MikeB

Apr 25, 2013, 2:56:53 PM

"ralph" <nt_con...@yahoo.com> wrote in message
news:kjqin89ogenhbkcvb...@4ax.com...

I am surprised Mayayana hasn't chirped about the level of child's play it
would be to put some (vb)(java)Script in the top of the captured HTML and
use the DOM to output to a disk file to your liking.

> -ralph

MikeB

Apr 25, 2013, 2:58:14 PM

"ralph" <nt_con...@yahoo.com> wrote in message
news:kjqin89ogenhbkcvb...@4ax.com...

Of course, I guess you could do the same thing in VB using the web control.

> -ralph

ralph

Apr 25, 2013, 11:22:00 PM

On Thu, 25 Apr 2013 14:56:53 -0400, "MikeB" <m.by...@frontier.com>
wrote:

>
>I am surprised Mayayana hasn't chirped about the level of child's play it
>would be to put some (vb)(java)Script in the top of the captured HTML and
>use the DOM to output to a disk file to your liking.
>

Might work.

Considering all the trouble the OP has reported with so many tools, I
suspect the original 'data' is presented in a format other than a
simple list or table with no defined associations (records, fields,
...), yet he expects such associations to magically appear if only
given the right wand and incantation.

There is more to the story.

-ralph

Deanna Earley

Apr 26, 2013, 4:32:29 AM

On 25/04/2013 17:54, BeeJ wrote:
> I finally did it, but with great pain.
>
> All the attempts to use Adobe failed to produce usable output from the
> PDF. I got text but with table columns messed up and a blizzard of
> blank lines. Adobe produced a 500 MByte .DOC that Libre Office choked
> on; it stalled and had to be killed from Task Manager.
> I tried all the Adobe Save As formats to see what I could get. Nada.

Have you tried just a very simple copy and paste from the browser to
your spreadsheet application, or opening it directly rather than faffing
around printing it to intermediary documents like PDFs, which aren't
designed for data interchange?

> So I went back to the webpage and viewed the source.
> Ctrl-A then Ctrl-C and Ctrl-V into Notepad++.
>
> There I tried to remove blank lines using the \x option, searching for
> \n\n and replacing with \n, but that did not work. I could search for \n
> and it found it where expected, but it could not find \n\n. Question: can
> Notepad++ show those control characters and extended chars as plain text?

Windows uses CRLF for line endings, so a double new line is \r\n\r\n,
not \n\n. \n is just the LF.
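
To make the CRLF point concrete, here is a small Python sketch that removes blank lines from Windows-style text by first normalising the line endings. The function name `collapse_blank_lines` and the sample string are assumptions made for the example.

```python
def collapse_blank_lines(text):
    """Normalise CRLF and bare CR line endings to LF, then drop blank lines."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    # Keep only lines that contain something other than whitespace.
    kept = [line for line in unified.split("\n") if line.strip()]
    return "\n".join(kept) + ("\n" if kept else "")


# Hypothetical Windows-style text: two data lines separated by blank lines.
sample = "a\tb\r\n\r\n\r\nc\td\r\n"

# Searching for two LFs in a row finds nothing, because a CR sits between them.
print("\n\n" in sample)
# Searching for CRLF CRLF does find the blank line.
print("\r\n\r\n" in sample)
print(repr(collapse_blank_lines(sample)))
```

Note that a line holding only tabs or spaces counts as blank here, which may or may not be wanted for rows of empty cells.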

> I gave up on Office long ago when they tried to put a ribbon in my
> hair. I'm just not that type.

If it's in your hair, then you appear to be #doingitwrong.

Mayayana

Apr 26, 2013, 9:44:14 AM

| I am surprised Mayayana hasn't chirped about the level of child's play it
| would be to put some (vb)(java)Script in the top of the captured HTML and
| use the DOM to output to a disk file to your liking.
|

:) Actually, DOM or straight parsing via script
were my first thought, but the question seemed
too non-specific to answer. BeeJ only said he
wanted to get a CSV from an HTML. And it
sounded to me like he was fervently hoping for a
no-sweat conversion method; as though there
might be some sort of standard, 1-click tool for
conversion between HTML tables and CSV.

MikeB

Apr 26, 2013, 10:29:51 AM

"Mayayana" <maya...@invalid.nospam> wrote in message
news:kle058$b8g$1...@dont-email.me...

This has doubtless been done hundreds (if not thousands) of times by Script
mavens, so a little googling should have borne some fruit.

BeeJ

Apr 26, 2013, 1:12:08 PM

The complication seemed to be that there were multiple frames on the
webpage. Adobe does not handle that well. Adobe 8.
So when the browser I used allowed Source, it seems to have taken only
the source of the frame the mouse was in. That helped me isolate the
table stuff. I saved this source to a .HTML file, then opened that in a
browser, Opera I think. Then I was able to mouse copy and paste the
browser view into notepad.exe. Then I saved it as a text file and
loaded that into the LO spreadsheet. Sorry, too many browsers here.
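
On a framed page each frame is a separate document, so the table lives in whichever file a frame points at. A rough Python sketch for listing those addresses from the outer page's source follows; the class `FrameSources`, the function `frame_urls`, and the example.com address are all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSources(HTMLParser):
    """Record the src of every <frame> and <iframe> as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative addresses against the outer page's URL.
                self.sources.append(urljoin(self.base_url, src))


def frame_urls(html_text, base_url):
    """Return the frame document URLs found in html_text, in page order."""
    parser = FrameSources(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.sources


# Hypothetical frameset standing in for the real page.
sample = """
<frameset cols="20%,80%">
  <frame src="menu.html">
  <frame src="data/table.html">
</frameset>
"""
for url in frame_urls(sample, "http://example.com/report/index.html"):
    print(url)
```

Opening the listed address that holds the table gives a single document to work from, without depending on which frame the mouse happened to be over.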

Yes, I was hoping for a one-click tool, but now that I have a method that
works for some cases I can go with that.

Thanks all for the suggestions.

Expertise is where you find it and I find it here. Google many times
has very out of date info to sort through. So why not go directly to
the horse's mouth? (For those who may not know, I was told that I could
tell the age and thus the life left in the horse by going to the
horse's mouth and counting the missing teeth.) Just don't look in my
mouth. I'm almost done.

ObiWan

Apr 27, 2013, 5:22:22 AM

> I have a 50 page doc with a table spanning many pages.
> The source is an HTML webpage that I printed to a .PDF.
>
> I want to extract the table to a .CSV.
>
> So my options are:
>
> Show the HTML page source, which opens in Notepad++. Then what?

Nope, one of the "Jet ISAM" engines lets you read a recordset from an
HTML page. Basically, the ISAM parses the page and returns a number of
tables; you then decide which one you want (if you have just one,
that's easy) and read it into a recordset. From that point on, things
should be pretty straightforward. As an example:

http://ewbi.blogs.com/develops/2006/12/reading_html_ta.html

http://mumrah.net/export-an-ado-recordset-as-csv-or-xml

HTH