I’d like to gather information from a set of HTML pages created over many years. The items I’d like to compile all have predictable text patterns; for example, “[123-A45] Action Item”. To search for these patterns, I could use this regex:
\[[0-9]*-(AI?|C|M|N)[0-9]*\] (Action Item|Consensus|Motion|Note)
From there, I’d want to find the parent element and then get its text content, so that I get all of the content that follows the prefix pattern.
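For concreteness, here is a minimal sketch of that search applied to plain text with no markup in the way. Python is an assumption on my part (nothing above depends on a particular language), and find_prefix is a hypothetical helper name used only for illustration.

```python
import re

# The prefix pattern from above, with the closing bracket escaped.
PREFIX = re.compile(
    r"\[[0-9]*-(AI?|C|M|N)[0-9]*\] (Action Item|Consensus|Motion|Note)"
)

def find_prefix(text):
    """Return (identifier, label) for the first prefix in text, or None."""
    m = PREFIX.search(text)
    if m is None:
        return None
    # m.group(0) looks like "[123-A45] Action Item"; keep what is inside the brackets.
    identifier = m.group(0).split("]")[0].lstrip("[")
    return identifier, m.group(2)

print(find_prefix("[123-A45] Action Item: update the draft"))
```

This works only when the whole prefix sits in one uninterrupted run of text.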
The problem is that the patterns above can (and in many cases will) be broken up by markup. This is exacerbated by different markup conventions having been used at different times over 25+ years. In current authoring practice for these pages, we’d have markup like the following:
<p class="action"><b>[<a name="179-A1" href="...">179-A1</a>] Action Item for</b> … </p>
So, the string I’d like to search for is broken up by an anchor element.
(The class on the p tag would make this easy if that had always been used, but that’s not the case.)
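To make the problem concrete, here is a rough Python sketch run against the sample paragraph above. It assumes the fragment is well-formed enough for the standard library's xml.etree.ElementTree to parse, which will certainly not hold for all of the older pages; text_of and smallest_matches are hypothetical helper names. Searching the raw markup finds nothing, while searching each element's flattened text content does, and the smallest matching element is the common parent of the pieces.

```python
import re
import xml.etree.ElementTree as ET

PREFIX = re.compile(
    r"\[[0-9]*-(AI?|C|M|N)[0-9]*\] (Action Item|Consensus|Motion|Note)"
)

# The sample paragraph from above, elisions left as they are.
SAMPLE = (
    '<p class="action"><b>[<a name="179-A1" href="...">179-A1</a>]'
    ' Action Item for</b> … </p>'
)

def text_of(element):
    """Concatenate all text inside an element, ignoring the tags."""
    return "".join(element.itertext())

def smallest_matches(root):
    """Elements whose text content matches but none of whose children's does."""
    hits = []
    for element in root.iter():
        if not PREFIX.search(text_of(element)):
            continue
        if any(PREFIX.search(text_of(child)) for child in element):
            continue  # a child already contains the whole prefix
        hits.append(element)
    return hits

root = ET.fromstring(SAMPLE)  # assumption: the fragment parses as XML
print(PREFIX.search(SAMPLE))  # the anchor tag interrupts the prefix in raw markup
for element in smallest_matches(root):
    print(element.tag, repr(text_of(element)))
```

Here the smallest match is the b element; getting the rest of the item would still mean walking up to an enclosing block such as the p.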
Is there a way to search for string content that might cross elements, then navigate to the common parent?
Thanks
Peter Constable