How to skip HTML validation while generating a PDF?

Ezhil

Oct 6, 2023, 6:24:47 AM
to Flying Saucer Users
Team - I am trying to create a PDF from a page URL, but I am getting the following error: "Can't load the XML resource (using TrAX transformer). org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 14; Open quote is expected for attribute "name" associated with an  element type  "meta"."

It looks like renderer.setDocument(urlcheck); checks whether the page at the URL has proper opening and closing HTML tags. Is there any way we can skip this validation?

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.xhtmlrenderer.pdf.ITextRenderer;

try {
    // Define the URL
    String urlcheck = "https://en.wikipedia.org/wiki/IPhone_15";

    // Establish a URL connection
    HttpURLConnection connection = (HttpURLConnection) new URL(urlcheck).openConnection();
    connection.setRequestMethod("GET");

    // Check the response code (200 indicates success)
    int responseCode = connection.getResponseCode();
    if (responseCode == 200) {
        // Get the input stream from the connection
        // (never read below: setDocument(String) fetches the URL itself)
        InputStream urlInputStream = connection.getInputStream();

        // Create an ITextRenderer instance
        ITextRenderer renderer = new ITextRenderer();

        // Load the document from the URL; the SAXParseException is raised here
        renderer.setDocument(urlcheck);

        // Render to PDF
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        renderer.layout();
        renderer.createPDF(outputStream);
        renderer.finishPDF();
    }
} catch (Exception e) {
    // Print the parse or rendering failure
    e.printStackTrace();
}

Peter Brant

Oct 6, 2023, 7:23:55 AM
to flying-sa...@googlegroups.com
FS will attempt to parse the input as an XML document. However, it can take any W3C Document as input too. In turn, there are other parsers out there that can create a W3C Document from e.g. HTML5. Examples include jsoup and the validator.nu HTML5 parser.
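
For anyone unsure what "take any W3C Document" looks like in practice, here is a minimal sketch, assuming jsoup and the Flying Saucer PDF module are both on the classpath. jsoup parses the HTML leniently, its W3CDom helper converts the result to an org.w3c.dom.Document, and that document is handed to the renderer instead of the URL string. The class name HtmlToPdfSketch and its method names are made up for illustration, and the rendered output may still differ from what a browser shows.

```java
import java.io.ByteArrayOutputStream;

import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.xhtmlrenderer.pdf.ITextRenderer;

// Hypothetical helper class for illustration; not part of Flying Saucer or jsoup.
public class HtmlToPdfSketch {

    // Parse possibly malformed HTML with jsoup and convert it to a W3C DOM.
    static org.w3c.dom.Document toW3cDocument(String html, String baseUri) {
        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html, baseUri);
        return new W3CDom().fromJsoup(jsoupDoc);
    }

    // Render an already-built W3C Document to PDF bytes.
    static byte[] renderPdf(org.w3c.dom.Document doc, String baseUri) throws Exception {
        ITextRenderer renderer = new ITextRenderer();
        // Pass the DOM directly so Flying Saucer does not run its own XML parse.
        renderer.setDocument(doc, baseUri);
        renderer.layout();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        renderer.createPDF(out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String url = "https://en.wikipedia.org/wiki/IPhone_15";
        // Let jsoup fetch and parse the page, tolerating invalid markup.
        org.jsoup.nodes.Document page = Jsoup.connect(url).get();
        org.w3c.dom.Document doc = new W3CDom().fromJsoup(page);
        byte[] pdf = renderPdf(doc, url);
        System.out.println("PDF size in bytes: " + pdf.length);
    }
}
```

The base URI passed to setDocument is what relative links to stylesheets and images are resolved against, so it is set to the page URL here.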


Ezhil

Oct 6, 2023, 8:28:24 AM
to Flying Saucer Users
Thanks for your response, Peter Brant. Sorry, I don't understand what you mean by "it can take any W3C Document". I want to generate a PDF from the URL the way a web browser (Chrome, Firefox, etc.) renders it, skipping errors and warnings.

In the same way, is it possible to skip the errors in the HTML and generate the PDF?

Peter Brant

Oct 6, 2023, 8:51:09 AM
to flying-sa...@googlegroups.com
I'm afraid that won't work in general. FS is a pretty complete static implementation of CSS 2.1. It does not support JavaScript or the many, many features subsequently added to CSS and HTML.

In order to limit the number of external dependencies, FS only supports XML input out of the box, but it provides the facilities to use your own parser as long as the output of that parser is a W3C Document value.
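
To make "XML input" concrete, here is a small sketch that uses only the JDK's built-in XML parser, not Flying Saucer itself. It shows the kind of well-formedness check that rejects markup a browser would tolerate, on the assumption that FS's default loading path applies a comparably strict XML parse. The class name WellFormedCheck is made up for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration; uses only JDK classes.
public class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without logging to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Covers SAXParseException, e.g. unclosed tags or unquoted attributes.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not markup errors.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<p>test",                    // unclosed element
            "<p>test</p>",                // closed element
            "<meta name=viewport />",     // unquoted attribute value
            "<meta name=\"viewport\" />"  // quoted attribute value
        };
        for (String sample : samples) {
            System.out.println(sample + " -> well-formed: " + isWellFormed(sample));
        }
    }
}
```

An unquoted attribute value is the same class of problem as the "Open quote is expected" error reported at the top of the thread.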

Ezhil

Oct 6, 2023, 12:43:39 PM
to Flying Saucer Users
Peter Brant - Thanks for your valuable input. In that case, FS may not work directly from a URL. It works on a string that is a W3C-standard document, and it will not create a PDF if the provided HTML contains even a single error, such as an unclosed tag (<p>test).

Is there any other utility or API available to generate a PDF from a URL?

Ezhil

Oct 11, 2023, 1:01:17 PM
to Flying Saucer Users
There is a Java library available to beautify and clean the HTML. I solved the problem by referring to https://statics.teams.cdn.office.net/evergreen-assets/safelinks/1/atp-safelinks.html