Extracting tables from PDF

Aug 21, 2013, 9:06:08 PM
to pdfne...@googlegroups.com
We are using PDFTron TextExtractor to extract data (especially tabular information) from PDF pages.
Some pages contain tables, and in these cases we may get the lines in the wrong order. For example:


John Doe

Albert Square

00150                 England

This comes out in the following order in C# when we print the extracted text:



John Doe

Albert Square



Do you guys have any solution for this?


Could you please send us a sample document so we can take a look? TextExtractor does not have a built-in capability to recognize structures such as tables, figures, or headers/footers. Unfortunately, this type of structure information is usually not stored explicitly in a PDF, so we need to rely on potentially error-prone techniques (similar to OCR) to reconstruct it.


For example, you could use the text positioning and styling information provided by TextExtractor to work out which text belongs to a table.
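As a rough illustration of the positioning idea, here is a minimal sketch that assumes you have already copied each extracted word and its bounding box into a plain list. It does not call PDFNet at all: `WordBox`, `TableSketch`, `ToRows`, and the two thresholds are hypothetical names and values chosen for this example, and the coordinates are made up to mimic the address sample from the question. The sketch groups words into rows by their vertical position, orders each row left to right, and starts a new cell whenever the horizontal gap is large.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one extracted word and its bounding box,
// in PDF page coordinates (origin at bottom-left, y grows upward).
public sealed class WordBox
{
    public string Text;
    public double X1, Y1, X2, Y2;

    public WordBox(string text, double x1, double y1, double x2, double y2)
    {
        Text = text; X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }
}

public static class TableSketch
{
    // Groups words into rows (top to bottom), then splits each row into
    // cells wherever the gap between neighbouring words exceeds cellGap.
    // Both thresholds are guesses and need tuning per document.
    public static List<List<string>> ToRows(IEnumerable<WordBox> words,
                                            double rowTolerance = 3.0,
                                            double cellGap = 15.0)
    {
        var rows = new List<List<WordBox>>();
        // Higher Y1 first, because PDF y coordinates grow upward.
        foreach (var word in words.OrderByDescending(b => b.Y1))
        {
            var last = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (last != null && Math.Abs(last[0].Y1 - word.Y1) <= rowTolerance)
                last.Add(word);                       // same row
            else
                rows.Add(new List<WordBox> { word }); // new row
        }

        var result = new List<List<string>>();
        foreach (var row in rows)
        {
            var cells = new List<string>();
            WordBox prev = null;
            foreach (var word in row.OrderBy(b => b.X1))
            {
                if (prev != null && word.X1 - prev.X2 <= cellGap)
                    cells[cells.Count - 1] += " " + word.Text; // same cell
                else
                    cells.Add(word.Text);                      // new cell
                prev = word;
            }
            result.Add(cells);
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up coordinates, deliberately listed out of reading order.
        var words = new List<WordBox>
        {
            new WordBox("England", 200, 660, 245, 672),
            new WordBox("Albert", 50, 680, 82, 692),
            new WordBox("John", 50, 700, 75, 712),
            new WordBox("00150", 50, 660, 80, 672),
            new WordBox("Square", 87, 680, 125, 692),
            new WordBox("Doe", 80, 700, 100, 712),
        };

        foreach (var cells in TableSketch.ToRows(words))
            Console.WriteLine(string.Join(" | ", cells));
    }
}
```

With these sample coordinates the rows print in reading order, and the last row is split into two cells ("00150" and "England"). Real pages would need more care, for example rotated text, multi-line cells, and columns that only become apparent when several rows are compared.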


We have implemented a prototype solution (based on TextExtractor) that tries to recognize text and dumps reflow-able HTML that contains tables.


The following C# sample converts a PDF to reflow-able HTML and also recognizes tables:


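using System;
using System.IO;

using pdftron;
using pdftron.Common;
using pdftron.PDF;

namespace pdftron
{
    class test
    {
        static void Main(string[] args)
        {
            // PDFNet must be initialized before any other PDFNet call.
            PDFNet.Initialize();

            // Placeholder paths (not part of the original sample); replace with your own.
            string input_file = "input.pdf";
            string output_path = "output/";

            try
            {
                using (PDFDoc doc = new PDFDoc(input_file))
                {
                    pdftron.PDF.Convert.HtmlOutputOptions options = new pdftron.PDF.Convert.HtmlOutputOptions();

                    // Creates a file with original filename in the given folder
                    pdftron.PDF.Convert.ToHtml(doc, output_path, options);
                }
            }
            catch (PDFNetException e)
            {
                // Report conversion errors instead of failing silently.
                Console.WriteLine(e.Message);
            }
        }
    }
}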

To test drive this functionality you can use one of the following links:


(.NET 4, 64-bit): https://pdftron.com/ID-zJWLuhTffd3c/22jdk340d/PDFNet64DotNet4.zip

(.NET 1.1-3.5, 32-bit): https://pdftron.com/ID-zJWLuhTffd3c/22jdk340d/PDFNet.zip


The other PDFNet variants will be available in the near future.
