How to skip HTML validation while generating a PDF?

Ezhil

Oct 6, 2023, 6:24:47 AM

to Flying Saucer Users

Team, I am trying to create a PDF from a page URL, but I am getting this error: "Can't load the XML resource (using TrAX transformer). org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 14; Open quote is expected for attribute "name" associated with an element type "meta"."

It looks like renderer.setDocument(urlcheck) checks whether the page at the URL has proper start and end HTML tags. Is there any way we can skip this validation?

import java.io.ByteArrayOutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.xhtmlrenderer.pdf.ITextRenderer;

try {
  // Define the URL
  String urlcheck = "https://en.wikipedia.org/wiki/IPhone_15";

  // Establish a URL connection to check that the page is reachable
  HttpURLConnection connection = (HttpURLConnection) new URL(urlcheck).openConnection();
  connection.setRequestMethod("GET");

  // Check the response code (200 indicates success)
  int responseCode = connection.getResponseCode();
  connection.disconnect();
  if (responseCode == 200) {
    // Create an ITextRenderer instance
    ITextRenderer renderer = new ITextRenderer();

    // Load the document from the URL; the page is parsed as XML here,
    // which is where the parsing error above is raised
    renderer.setDocument(urlcheck);

    // Render to PDF
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    renderer.layout();
    // createPDF(OutputStream) also finishes the document, so no separate finishPDF() call is needed
    renderer.createPDF(outputStream);
  }
} catch (Exception e) {
  e.printStackTrace();
}

Peter Brant

Oct 6, 2023, 7:23:55 AM

to flying-sa...@googlegroups.com

FS will attempt to parse the input as an XML document. However, it can take any W3C Document as input too. In turn, there are other parsers out there that can create a W3C Document from e.g. HTML5. Examples include jsoup and the validator.nu HTML5 parser.
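A minimal sketch of the approach described above, assuming jsoup and Flying Saucer's PDF module are both on the classpath. The class name LenientHtmlToPdf and its helper methods are made up for illustration. jsoup parses the markup leniently, its W3CDom helper converts the result to a W3C Document, and that Document is handed to the renderer instead of the raw URL. Whether the output looks acceptable depends on the page, so treat this as a starting point rather than a guaranteed fix.

```java
import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.xhtmlrenderer.pdf.ITextRenderer;

class LenientHtmlToPdf {

  // Parse possibly malformed HTML with jsoup and convert it to a W3C Document.
  static org.w3c.dom.Document toW3cDocument(String html, String baseUrl) {
    org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html, baseUrl);
    return new W3CDom().fromJsoup(jsoupDoc);
  }

  // Hand the already-parsed Document to Flying Saucer, bypassing its XML parser.
  static byte[] renderPdf(org.w3c.dom.Document doc, String baseUrl) throws Exception {
    ITextRenderer renderer = new ITextRenderer();
    // baseUrl is used to resolve relative links to stylesheets and images
    renderer.setDocument(doc, baseUrl);
    renderer.layout();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    renderer.createPDF(out);
    return out.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    String url = "https://en.wikipedia.org/wiki/IPhone_15";
    // Fetch the page with jsoup; outerHtml() returns the markup as a string
    String html = Jsoup.connect(url).get().outerHtml();
    byte[] pdf = renderPdf(toW3cDocument(html, url), url);
    // "out.pdf" is an arbitrary output file name chosen for this sketch
    Files.write(Paths.get("out.pdf"), pdf);
  }
}
```

The parsing and rendering steps are kept separate so that a different HTML5 parser could be swapped in for jsoup, as long as it also produces a W3C Document.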

--
You received this message because you are subscribed to the Google Groups "Flying Saucer Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flying-saucer-u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/flying-saucer-users/362d399b-6780-49e5-a5f4-09114dbcffccn%40googlegroups.com.

Ezhil

Oct 6, 2023, 8:28:24 AM

to Flying Saucer Users

Thanks for your response, Peter Brant. Sorry, I don't understand what you mean by "it can take any W3C Document". I want to generate a PDF from the URL the way a web browser (Chrome, Firefox, etc.) renders it, skipping errors and warnings.

In the same way, is it possible to skip the errors in the HTML and generate the PDF?

Peter Brant

Oct 6, 2023, 8:51:09 AM

to flying-sa...@googlegroups.com

I'm afraid that won't work in general. FS is a pretty complete static implementation of CSS 2.1. It does not support JavaScript or the many, many features subsequently added to CSS and HTML.

In order to limit the number of external dependencies, FS only supports XML input out of the box, but it provides the facilities to use your own parser as long as the output of that parser is a W3C Document value.

Ezhil

Oct 6, 2023, 12:43:39 PM

to Flying Saucer Users

Peter Brant, thanks for your valuable input. In that case FS may not work directly from a URL. It works only when the input is a well-formed document, and it will not create a PDF if there is even a single error in the provided HTML, such as an unclosed tag (<p>test).

Is there any other utility or API available to generate a PDF from a URL?
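To illustrate why a single unclosed tag is enough to stop the conversion, here is a small sketch that runs a few snippets through the JDK's own XML parser. This is only a stand-in for the XML parsing step; the class name WellFormedCheck and the method isWellFormed are hypothetical and are not part of Flying Saucer.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

  // Returns true if the markup parses as well-formed XML, false otherwise.
  static boolean isWellFormed(String markup) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // DefaultHandler ignores warnings and recoverable errors but rethrows fatal ones,
      // and it keeps the parser from printing its own messages to stderr
      builder.setErrorHandler(new DefaultHandler());
      builder.parse(new InputSource(new StringReader(markup)));
      return true;
    } catch (SAXException e) {
      // A well-formedness violation, e.g. an unclosed tag or an unquoted attribute
      return false;
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("<p>test"));                // false: <p> is never closed
    System.out.println(isWellFormed("<p>test</p>"));            // true
    System.out.println(isWellFormed("<meta name=viewport />")); // false: attribute value is not quoted
  }
}
```

Browsers accept all three snippets because HTML parsing is error tolerant, while XML parsing treats any well-formedness violation as fatal.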

Ezhil

Oct 11, 2023, 1:01:17 PM

to Flying Saucer Users

There is a Java library available to beautify and clean the HTML. I solved the problem by referring to https://statics.teams.cdn.office.net/evergreen-assets/safelinks/1/atp-safelinks.html
