Pdfizer, a dumb HTML to PDF converter, in C#

Steven Lee

Feb 8, 2006, 7:22:57 PM

to iTextSharp

This library converts simple HTML documents to PDF.
http://www.codeproject.com/csharp/pdfizer.asp

Introduction
This article presents a basic HTML to PDF converter: with this library,
you can transform simple HTML pages to nice and printable PDF files.

The HTML cleaning is done with NTidy (see [1]), a .NET wrapper for the
HTML Tidy library (see [2]). The PDF generation is done with
iTextSharp, a PDF generation library (see [3]).

Transformation Pipe
Transforming HTML documents to PDF is a fairly complex task.
Fortunately, there are powerful tools available on the web that helped
me accomplish this.

Parsing HTML
The first problem to handle was that HTML is usually "dirty": the
structure is often not well-formed XML, so trying to parse HTML pages
with XmlDocument will usually fail.
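
To make the failure concrete, here is a minimal sketch that uses only the standard System.Xml classes. The class name DirtyHtmlDemo, the helper TryLoad and the two sample strings are made up for illustration; they are not part of Pdfizer.

```csharp
using System;
using System.Xml;

public static class DirtyHtmlDemo
{
    // Hypothetical helper: returns true if XmlDocument accepts the markup,
    // false if the parser rejects it with an XmlException.
    public static bool TryLoad(string markup)
    {
        XmlDocument doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Browser-tolerated HTML: <br> and <p> are never closed.
        string dirty = "<html><body><p>Hello<br>world</body></html>";
        // The same content written as well-formed XML.
        string clean = "<html><body><p>Hello<br/>world</p></body></html>";

        Console.WriteLine(TryLoad(dirty)); // prints "False"
        Console.WriteLine(TryLoad(clean)); // prints "True"
    }
}
```

Any browser would render the first string without complaint, which is why a cleaning step is needed before an XML-style parser can be used.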

To overcome this problem, I had to write a .NET wrapper around HTML
Tidy (see [2]). HTML Tidy is a very useful application that takes
"dirty" HTML and returns it cleaned as much as possible. The .NET
wrapper exposes a DOM-like class structure so that you can use it much
like XmlDocument.

Hence, with NTidy, we can safely parse HTML documents.

Creating PDF
The PDF creation is done by iTextSharp (see [3]), a .NET library hosted
on SourceForge, that gives you the tools to create PDF files easily.
Hence, the PDF creation problem is solved.
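
For readers who have not used iTextSharp, the following is a rough sketch of writing a one-paragraph PDF with it directly. It assumes the iTextSharp assembly is referenced by the project; the class name HelloPdf, the method WriteHelloPdf and the file name hello.pdf are placeholders of mine, and this is not necessarily how Pdfizer calls the library internally.

```csharp
using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class HelloPdf
{
    // Hypothetical helper: writes a single paragraph to a new PDF file.
    public static void WriteHelloPdf(string path)
    {
        Document document = new Document();
        // Bind the document to an output stream; the writer emits the PDF bytes.
        PdfWriter.GetInstance(document, new FileStream(path, FileMode.Create));
        document.Open();
        document.Add(new Paragraph("Hello from iTextSharp"));
        // Closing the document finishes the file and closes the stream.
        document.Close();
    }

    public static void Main()
    {
        WriteHelloPdf("hello.pdf");
        Console.WriteLine(File.Exists("hello.pdf"));
    }
}
```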

Reading, Traversing
With NTidy and iTextSharp in my toolset, I could start to create the
generator. The generator works like this: it first reads the input with
NTidy, then traverses the DOM tree and generates the PDF fragments with
iTextSharp.
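
The read-then-traverse idea can be sketched as follows. To keep the example runnable without extra libraries, it makes two simplifying assumptions: XmlDocument stands in for the NTidy DOM (which is described above as DOM-like), and plain strings stand in for the PDF fragments that iTextSharp would produce. The names TraversalSketch, CollectFragments and Visit are illustrative only and do not come from Pdfizer.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class TraversalSketch
{
    // Walks the tree depth-first and records one fragment per heading or paragraph.
    public static List<string> CollectFragments(XmlNode root)
    {
        List<string> fragments = new List<string>();
        Visit(root, fragments);
        return fragments;
    }

    private static void Visit(XmlNode node, List<string> fragments)
    {
        if (node.NodeType == XmlNodeType.Element)
        {
            switch (node.Name.ToLowerInvariant())
            {
                case "h1":
                    fragments.Add("HEADING: " + node.InnerText.Trim());
                    return; // text already consumed, do not descend further
                case "p":
                    fragments.Add("PARAGRAPH: " + node.InnerText.Trim());
                    return;
            }
        }
        // Any other node: keep walking down the tree.
        foreach (XmlNode child in node.ChildNodes)
        {
            Visit(child, fragments);
        }
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<html><body><h1>Title</h1><p>First <b>bold</b> text.</p></body></html>");
        foreach (string fragment in CollectFragments(doc.DocumentElement))
        {
            Console.WriteLine(fragment);
        }
        // prints:
        // HEADING: Title
        // PARAGRAPH: First bold text.
    }
}
```

A real generator would map each element to a richer PDF object and keep inline formatting such as bold text, which this sketch flattens on purpose.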

Quick Example
The library is used through the HtmlToPdfConverter class. Creating a
PDF file takes the following steps, as illustrated in the example:

1. Create a converter,
2. Open a new PDF file using the Open method,
3. Add a chapter,
4. Feed HTML to the converter,
5. If you want another chapter, go to step 3.
6. When finished, close the PDF file by calling Close.

// create converter
HtmlToPdfConverter html2pdf = new HtmlToPdfConverter();

// open new pdf file
html2pdf.Open(@"test");
// start a chapter
html2pdf.AddChapter(@"Dummy Chapter");
string html = ...;
// convert string
html2pdf.Run(html);
// add a new chapter
html2pdf.AddChapter(@"Boost page");
// read web page
html2pdf.Run(new Uri(@"http://www.boost.org/libs/libraries.htm"));
// close and finish pdf file.
html2pdf.Close();

What to expect and not expect
Don't expect too much from this tool: it will not work with complex
HTML pages, but it gives fairly good results with simple ones.
In particular, tables are not yet supported.