JSP file parsing - extend html comments ??


Michael Brown

Apr 10, 2024, 11:17:43 AM
to beautifulsoup
Hi,

I'm working on a very old web app that uses JSPs. I'm using Beautiful Soup to parse the JSP source files used by the web server, not the dynamically rendered HTML sent to the browser.

Beautiful Soup's html5lib parser fails to parse the HTML unless I replace the JSP scriptlets, denoted by <% %> and <%= %>, with HTML comments <!-- -->. Surprisingly, html.parser is able to parse the input without replacing <% %> with <!-- -->. Both html5lib and html.parser parse the HTML once the scriptlet tags are replaced by HTML comments.

Is there a way to extend BeautifulSoup's comment recognition logic to treat JSP scriptlet blocks the same way as html comments?

I'm using:
Beautiful Soup 4.12.3
Python 3.12.1
html5lib 1.1
lxml 5.2.1.0

Regards
Mike

Chris Papademetrious

Apr 10, 2024, 12:09:04 PM
to beautifulsoup
Hi Mike,

Does the lxml parser handle the JSP scriptlets?

I've never heard of this JSP stuff before. If your favorite parser isn't parsing these <% %> constructs, can you simply do a regex substitution right before parsing the content?
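The regex substitution suggested here could look something like the following. This is only a minimal sketch: the function name jsp_to_comments and the "jsp:" marker prefix are made up for illustration, and the pattern assumes that "%>" never appears inside a Java string literal within a scriptlet.

```python
import re

# Matches <% ... %>, <%= ... %>, <%@ ... %> and <%-- ... --%> blocks.
# Non-greedy, so each block ends at the first "%>" that follows it.
JSP_BLOCK = re.compile(r"<%(.*?)%>", re.DOTALL)

# Hypothetical marker so converted scriptlets can be told apart
# from comments that were already in the file.
MARKER = "jsp:"


def _to_comment(match: re.Match) -> str:
    body = match.group(1)
    # Break up runs of dashes so the body cannot contain "--",
    # which could otherwise terminate the HTML comment early.
    body = re.sub(r"-(?=-)", "- ", body)
    return "<!--" + MARKER + body + " -->"


def jsp_to_comments(source: str) -> str:
    """Return source with every JSP block rewritten as an HTML comment."""
    return JSP_BLOCK.sub(_to_comment, source)
```

The converted string can then be passed to BeautifulSoup with the html5lib parser, where each scriptlet should come back as a comment node. Scriptlets that sit inside attribute values are left as literal text within the attribute, so they may need separate handling.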

 - Chris

Michael Brown

Apr 10, 2024, 1:44:46 PM
to beautifulsoup

Hi Chris,

Yes, lxml does work when the <% %> tags are present.
html.parser is working.
html5lib is not working

Yes, a regex substitution of <% and %> does seem to allow html5lib to work.

I'm asking to learn if there's a way to extend Beautiful Soup so the regex substitution is not needed.

Regards
Mike

leonardr

Apr 11, 2024, 11:16:24 AM
to beautifulsoup
Mike,

The way to handle the JSP syntax would be to write a TreeBuilder implementation that can handle that syntax and plug in an instance of that class, rather than using the default TreeBuilders created when you ask for lxml/html5lib/html.parser. Since the TreeBuilder implementation would also need to parse regular HTML, the easiest way to do this would be to extend one of the existing TreeBuilder classes or the underlying parser code.

Theoretically html5lib is the most extensible of the three parsers, but I've found its architecture to be very convoluted and difficult to understand. You might have better luck subclassing Python's built-in HTMLParser. I'd start by taking a look at the goahead() method.
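As a rough illustration of the HTMLParser subclassing route, here is a sketch that stops short of overriding goahead(), since that method is internal and its body differs between Python versions. Instead it swaps in a simpler technique: JSP blocks are rewritten as marked comments inside feed(), and handle_comment() routes them to a separate hook. The class name JSPAwareParser, the handle_jsp() hook, the events list and the "jsp:" marker are all hypothetical, and the sketch assumes the whole document is fed in one call and that no scriptlet contains "-->".

```python
import re
from html.parser import HTMLParser

JSP_BLOCK = re.compile(r"<%(.*?)%>", re.DOTALL)
MARKER = "jsp:"  # hypothetical prefix identifying converted scriptlets


class JSPAwareParser(HTMLParser):
    """HTMLParser subclass that reports JSP blocks through handle_jsp()."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []  # (kind, value) tuples, recorded for illustration

    def feed(self, data):
        # Assumes the complete document arrives in a single feed() call,
        # so no JSP block is split across two chunks.
        wrapped = JSP_BLOCK.sub(
            lambda m: "<!--" + MARKER + m.group(1) + "-->", data
        )
        super().feed(wrapped)

    def handle_comment(self, data):
        if data.startswith(MARKER):
            self.handle_jsp(data[len(MARKER):])
        else:
            self.events.append(("comment", data))

    def handle_jsp(self, code):
        # Hypothetical hook: a tree builder could create its own node here.
        self.events.append(("jsp", code.strip()))

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = JSPAwareParser()
parser.feed("<p>Hi <%= name %></p>")
parser.close()
```

Wiring something like this into Beautiful Soup would still mean writing a TreeBuilder around it, as described above; the sketch only shows where a scriptlet-specific hook could live.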

Leonard