RE: matching text pattern in spite of intervening markup

Peter Constable

Jul 11, 2024, 12:00:37 PM

to beauti...@googlegroups.com

I’d like to gather information from a set of HTML pages created over many years. The items I’d like to compile all have predictable text patterns; for example, “[123-A45] Action Item”. To search for these patterns, I could use this regex:

\[[0-9]+-(AI?|C|M|N)[0-9]+\] (Action Item|Consensus|Motion|Note)

From there, I’d want to find the parent element and get its text content, so that I capture everything that follows the prefix pattern.

However, a problem is that the patterns above can (and in many cases will) be broken up with markup. This is exacerbated by different markup having been used at different times over 25+ years. In current practice for authoring these pages, we’d have markup like the following:

[<a name="179-A1" href="...">179-A1</a>] Action Item for …

So, the string I’d like to search for is broken up by an anchor element.
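To make the problem concrete, here is a minimal sketch using only Python's standard `re` module. It assumes a slightly tightened form of the pattern above (one or more digits, escaped closing bracket), and the plain-text sample string is made up for illustration. The pattern matches the plain text but finds nothing in the raw markup, because the anchor tag sits between the opening bracket and the digits.

```python
import re

# Tightened form of the prefix pattern (an assumption, not the exact original).
PATTERN = re.compile(
    r"\[[0-9]+-(AI?|C|M|N)[0-9]+\] (Action Item|Consensus|Motion|Note)"
)

# Plain text: the prefix is contiguous, so the pattern matches.
plain = "[123-A45] Action Item for someone"
plain_match = PATTERN.search(plain)

# Raw markup as authored today: the <a> element interrupts the prefix.
raw = '[<a name="179-A1" href="...">179-A1</a>] Action Item for someone'
raw_match = PATTERN.search(raw)

print(plain_match.group(0))
print(raw_match)
```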

(The class on the p tag would make this easy if that had always been used, but that’s not the case.)

Is there a way to search for string content that might cross elements, then navigate to the common parent?

Thanks

Peter Constable

Chris Papademetrious

Jul 28, 2024, 8:07:49 AM

to beautifulsoup

Hi Peter,

I cannot think of a generalized solution to this problem that doesn't rely on recursion and have horrific runtime. (Imagine pathological cases where each character of the search string is in its own element, distributed across element hierarchies, etc.)

In your case, is it good enough to search each element's text (obtained with Tag.get_text()) to see what information exists in it?
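A rough sketch of that idea, in case it helps: test each element's `get_text()` against the pattern and keep only the innermost element that still contains the whole prefix, which serves as the common parent. The function name `find_items`, the sample HTML (including the `href` value), and the tightened regex with `\s*` after the bracket are all assumptions for illustration. Note that this calls `get_text()` on every tag and on its descendants, so it can be slow on very large pages.

```python
import re
from bs4 import BeautifulSoup

# Assumed prefix pattern; \s* tolerates whitespace differences between elements.
PREFIX = re.compile(
    r"\[[0-9]+-(?:AI?|C|M|N)[0-9]+\]\s*(?:Action Item|Consensus|Motion|Note)"
)

def find_items(html):
    """Return (tag, text) pairs for the innermost tags whose text holds the prefix."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all(True):
        text = tag.get_text()
        if not PREFIX.search(text):
            continue
        # Skip ancestors: if a descendant also contains the full prefix,
        # that descendant is a closer common parent.
        if any(PREFIX.search(d.get_text()) for d in tag.find_all(True)):
            continue
        # Collapse runs of whitespace so the text reads as one line.
        results.append((tag, " ".join(text.split())))
    return results

# Made-up sample with two different markup styles.
sample = """
<html><body><div>
<p>[<a name="179-A1" href="#179-A1">179-A1</a>] Action Item for Peter: update the page.</p>
<p>Unrelated paragraph.</p>
<p><b>[179-C2]</b> <i>Consensus</i>: adopt the proposal.</p>
</div></body></html>
"""

items = find_items(sample)
for tag, text in items:
    print(tag.name, "|", text)
```

Because the match is made on the concatenated text, it does not matter how the prefix is split across child elements, as long as all the pieces share one reasonably small ancestor.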