JSP file parsing - extend html comments ??

Michael Brown

Apr 10, 2024, 11:17:43 AM

to beautifulsoup

Hi,

I'm working on a very old web app which uses JSPs. I'm using Beautiful Soup to parse the JSP source files used by the web server, not the dynamically rendered code sent to the browser.

Beautiful Soup's html5lib parser fails to parse the HTML unless I replace the JSP scriptlets, denoted by <% %> and <%= %>, with HTML comments (<!-- -->). Surprisingly, html.parser is able to parse the HTML input without replacing <% %> with <!-- -->. Both html5lib and html.parser will parse the HTML when the scriptlet tags are replaced by HTML comments.
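
A minimal sketch of why Python's built-in parser may tolerate the scriptlets: it records the callbacks that `html.parser.HTMLParser` fires for a small JSP-like fragment. The class name `EventRecorder` and the sample markup are made up for illustration. In this simple case the scriptlet should come back as ordinary character data rather than as a tag or a comment, which would explain why no substitution is needed. Scriptlets whose Java code contains something that looks like a tag (for example `a<b`) may behave differently.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records which callbacks HTMLParser fires."""

    def __init__(self):
        super().__init__()
        self.tags = []   # names of start tags seen
        self.text = []   # chunks passed to handle_data()

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


recorder = EventRecorder()
recorder.feed("<p><% int x = 1; %></p>")
recorder.close()

print(recorder.tags)            # start tags the parser recognized
print("".join(recorder.text))   # everything it treated as plain text
```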

Is there a way to extend BeautifulSoup's comment recognition logic to treat JSP scriptlet blocks the same way as html comments?

I'm using:

Beautiful Soup 4.12.3

Python 3.12.1

html5lib 1.1

lxml 5.2.1.0

Regards

Mike

Chris Papademetrious

Apr 10, 2024, 12:09:04 PM

to beautifulsoup

Hi Mike,

Does the lxml parser handle the JSP scriptlets?

I've never heard of this JSP stuff before. If your favorite parser isn't parsing these <% %> constructs, can you simply do a regex substitution right before parsing the content?
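
A rough sketch of that regex substitution, using only the standard `re` module. The function name `jsp_to_comments` is hypothetical, and the pattern assumes scriptlets are never nested and that the Java code inside them does not contain `%>` or `-->`; either would cut the match or the resulting comment short.

```python
import re

# Matches <% ... %>, <%= ... %>, <%! ... %> and <%-- ... --%> alike,
# non-greedily and across line breaks.
SCRIPTLET = re.compile(r"<%(.*?)%>", re.DOTALL)


def jsp_to_comments(source: str) -> str:
    """Rewrite every JSP scriptlet as an HTML comment with the same body."""
    return SCRIPTLET.sub(lambda m: "<!--" + m.group(1) + "-->", source)


jsp = '<p><%= user.getName() %></p><% if (x) { %><b>hi</b><% } %>'
cleaned = jsp_to_comments(jsp)
print(cleaned)
```

The cleaned string could then be passed to `BeautifulSoup(cleaned, "html5lib")`, where each scriptlet should show up as a comment node. A scriptlet inside an attribute value (such as `href="<%= url %>"`) stays part of the attribute text rather than becoming a comment.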

- Chris

Michael Brown

Apr 10, 2024, 1:44:46 PM

to beautifulsoup

Hi Chris,

Yes, lxml does work when the <% %> tags are present.

html.parser is working.

html5lib is not working.

Yes, a regex substitution of <% and %> does seem to allow html5lib to work.

I'm asking to learn if there's a way to extend Beautiful Soup so the regex substitution is not needed.

Regards

Mike

leonardr

Apr 11, 2024, 11:16:24 AM

to beautifulsoup

Mike,

The way to handle the JSP syntax would be to write a TreeBuilder implementation that can handle that syntax and plug in an instance of that class, rather than using the default TreeBuilders created when you ask for lxml/html5lib/html.parser. Since the TreeBuilder implementation would also need to parse regular HTML, the easiest way to do this would be to extend one of the existing TreeBuilder classes or the underlying parser code.

Theoretically html5lib is the most extensible of the three parsers, but I've found its architecture to be very convoluted and difficult to understand. You might have better luck subclassing Python's built-in HTMLParser. I'd start by taking a look at the goahead() method.
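
As a starting point, here is a simplified sketch of an `HTMLParser` subclass that surfaces scriptlets through their own hook. It deliberately does not override `goahead()`, which is an undocumented internal method. Instead it rewrites scriptlets into marked comments inside `feed()` and intercepts them in `handle_comment()`. The names `JspAwareParser`, `handle_scriptlet` and the marker string are invented for this example. It assumes the whole document is passed in a single `feed()` call and that scriptlets contain neither `%>` nor `-->`. Wiring such a parser into a Beautiful Soup TreeBuilder is not shown.

```python
import re
from html.parser import HTMLParser

SCRIPTLET = re.compile(r"<%(.*?)%>", re.DOTALL)
MARKER = "jsp-scriptlet:"  # hypothetical tag used to tell scriptlets from real comments


class JspAwareParser(HTMLParser):
    """Sketch: report JSP scriptlets via handle_scriptlet() instead of as text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def feed(self, data):
        # Disguise each scriptlet as a marked comment before normal parsing.
        marked = SCRIPTLET.sub(
            lambda m: "<!--" + MARKER + m.group(1) + "-->", data
        )
        super().feed(marked)

    def handle_comment(self, data):
        if data.startswith(MARKER):
            self.handle_scriptlet(data[len(MARKER):])
        else:
            self.events.append(("comment", data))

    def handle_scriptlet(self, code):
        # New hook (not part of HTMLParser): receives the raw scriptlet body.
        self.events.append(("scriptlet", code.strip()))

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = JspAwareParser()
parser.feed("<p>Hello <%= name %>!</p><!-- real -->")
parser.close()
for event in parser.events:
    print(event)
```

Genuine HTML comments still arrive as `("comment", ...)` events, so the two kinds of node stay distinguishable downstream.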

Leonard
