Splitting html input outside html tags


Liam Kirsher

Jun 11, 2022, 7:24:34 AM
to beautifulsoup
Hi, I'm wondering if this is possible with Beautiful Soup, or if anyone can suggest how to do it.

My application sends HTML-formatted messages to Telegram. These are HTML fragments: they don't contain a <body> tag, just various tags for marking up the text.
Telegram limits the message length to 4096 bytes.
Longer messages must be broken up into multiple smaller messages.

Splitting a message at the 4096-byte boundary raises an exception whenever the split falls in the middle of an HTML element.
The allowed HTML tags are a reduced set: b, strong, i, u, img, a.

I thought I might be able to customize html.parser to do this, by keeping a stack that tracks the position in the input of each start/end tag, and then splitting only at places where the stack is empty (i.e. outside of any HTML element). However, this was starting to look complicated.

Any suggestions on how to go about this, and might it be possible with Beautiful Soup?


Alex Krupp

Jun 11, 2022, 8:08:33 AM
to beauti...@googlegroups.com
Mozilla's Bleach library is good at repairing messages that have been cut off by a length limit partway through an HTML tag, or after an opening tag but before its closing tag.
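
To illustrate the kind of repair meant here, below is a rough standard-library sketch. It is not Bleach and does not use Bleach's API. It only shows the same idea: drop a tag that was cut in half, then close whatever is still open. The names OpenTagTracker and close_truncated are hypothetical, and img and br are assumed to be the only tags without a closing tag.

```python
import re
from html.parser import HTMLParser


class OpenTagTracker(HTMLParser):
    """Keep a stack of the tags that are still open at the end of the input."""

    VOID_TAGS = {"img", "br"}  # assumed set of tags that never have a closing tag

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the matching start tag, discarding anything mis-nested.
            while self.open_tags.pop() != tag:
                pass


def close_truncated(fragment):
    """Remove a trailing half-written tag, then append the missing closing tags."""
    # Assumes a literal "<" in text is escaped as &lt;, as in well-formed input.
    fragment = re.sub(r"<[^<>]*$", "", fragment)
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    closers = "".join(f"</{tag}>" for tag in reversed(tracker.open_tags))
    return fragment + closers


if __name__ == "__main__":
    print(close_truncated('<b>bold <i>both</i> and <a href="http://exa'))
```

A fragment repaired this way loses its formatting at the start of the next message unless the tags that were open are re-opened there, so this complements splitting at safe points rather than replacing it.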
