Re: Extract text between two strings (or remove everything else)?

Steve

Jan 27, 2013, 8:26:00 PM
to textwr...@googlegroups.com
Do you want to discard everything before and after your search criteria? Also, is the "/?" in "file/?qwe=" a literal forward slash followed by a literal question mark, or is it the word "file" with (or without) a trailing "/" and then a literal "qwe="?

Search:
(?:^.*?file/?qwe=)(.*?)(?:</span></a></h5>)

will match from the start of the line through "file/?qwe=" (where the forward slash can occur 0 or 1 times), but it won't capture it (meaning "\1" will not contain it); then the (.*?) will match all the text up to the next group (and WILL store it in the "\1" reference); then "</span></a></h5>" has to follow immediately, but it is also not captured (that is, it is not stored in a "\2" reference).

Replace:
\1
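
For anyone who wants to try this search outside the editor, here is a minimal Python sketch of the same find/replace. It assumes the asker's "qwe=" spelling. The names OPTIONAL_SLASH, LITERAL_QMARK and keep_capture are made up for illustration, and re.MULTILINE is my assumption for making "^" match at the start of every line, the way a line-oriented editor search would.

```python
import re

# Reading 1: the slash is optional ("/?" is regex syntax), as in the search above.
OPTIONAL_SLASH = re.compile(
    r"(?:^.*?file/?qwe=)(.*?)(?:</span></a></h5>)", re.MULTILINE
)

# Reading 2: "/?" is a literal slash plus a literal question mark,
# so the question mark is escaped with a backslash.
LITERAL_QMARK = re.compile(
    r"(?:^.*?file/\?qwe=)(.*?)(?:</span></a></h5>)", re.MULTILINE
)


def keep_capture(pattern, text):
    """Replace each match with its first capture group (the "\\1" replacement)."""
    return pattern.sub(r"\1", text)


# Hypothetical sample lines, not taken from the asker's file.
print(keep_capture(OPTIONAL_SLASH, "junk file/qwe=wanted</span></a></h5> tail"))
print(keep_capture(LITERAL_QMARK, "junk file/?qwe=wanted</span></a></h5>"))
```

Two things show up when running it: text after the closing tags on the same line is left in place, and the optional-slash version does not match at all when the file really contains "file/?qwe=" with a literal question mark.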

-Steve

On Saturday, January 26, 2013 1:52:30 PM UTC-8, emikysa wrote:
Hello.  I have a very long txt file that has quite a few pieces of information that I would like to extract.  The information I would like is always preceded by "file/?qwe=" and is always followed by "</span></a></h5>".  Can anyone help me with a find/replace search (grep?) or something else that would leave me just with the information I need?
Thank you,
Erik

Tom Humiston

Jan 27, 2013, 10:35:31 PM
to textwr...@googlegroups.com
I'm not an expert at these things, and you haven't included samples of the raw input and desired output, but I'll take a crack at it. Based on your delimiter strings, I suppose you're trying, first, to reduce your file to only those strings that look like this:

<h5><a href="file/?qwe=queryString">
<span>link text</span></a></h5>

and then further parse them down to something like this:

queryString, link text

I have sometimes wondered how to reduce an HTML file to nothing but its links, and you've pushed me to finally figure this out (thank you!):

Find: (?s).+?(<a .+?</a>|\z)
Replace all with: \1\r

In English, that's, "Ignoring line breaks, look for a link (or the end of the file), preceded by any amount of text. Replace all of it with just the link (or end of file) plus a line break."
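
The same idea can be sketched in Python for anyone who wants to test it on a string first. This is only an approximation of the editor behavior: I use "\Z" for the end of input because older Python versions do not accept "\z", and "\n" instead of "\r" for the line break. The name only_links and the sample HTML are invented for the example.

```python
import re

# (?s) lets "." match line breaks; the group holds either a link or the
# empty string found at the very end of the input.
LINKS = re.compile(r"(?s).+?(<a .+?</a>|\Z)")


def only_links(html):
    """Replace each 'text + link' run with just the link and a newline."""
    return LINKS.sub(r"\1\n", html)


# Hypothetical input, not from the original poster's file.
html = 'intro <a href="x.html">X</a> middle\n<a href="y.html">Y</a> trailing'
print(only_links(html))
```

One caveat I noticed while tracing it: because ".+?" needs at least one character before the link, a link sitting at the very first character of the input gets swallowed as "preceding text". A real HTML file starts with other markup, so this should rarely matter.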

To apply this to the specific combination of H5, URL, and span, we can expand the Find string (remembering to escape the URL's question mark with a backslash). Here's one way of doing it:

(?s).+?(<h5><a .*?file/\?qwe=(.+?)".*?<span>(.+?)</span>|\z)

and change the Replace string so it reflects the added pairs of parentheses above:

\2, \3\r

Sure, it leaves a lonely comma on the last line, but it does the job. Whether it's exactly what you need or not, I hope it helps.
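
Here is a rough Python sketch of that find/replace, plus one way to sidestep the lonely comma. It assumes Python 3.5 or later (where an unmatched group in the replacement becomes an empty string), uses "\Z" and "\n" in place of "\z" and "\r", and the names query_and_text and extract_pairs are hypothetical. The second function is my own variation: it collects the matches directly instead of replacing text.

```python
import re

# Find/replace version, mirroring the editor search above.
H5_LINK = re.compile(
    r'(?s).+?(<h5><a .*?file/\?qwe=(.+?)".*?<span>(.+?)</span>|\Z)'
)

# Variation: match only the link itself and collect groups, no replacing.
PAIR = re.compile(r'(?s)<h5><a .*?file/\?qwe=(.+?)".*?<span>(.+?)</span>')


def query_and_text(html):
    """Reduce the input to 'query, link text' lines (last line is a bare comma)."""
    return H5_LINK.sub(r"\2, \3\n", html)


def extract_pairs(html):
    """Return (query, link text) tuples, which avoids the trailing comma."""
    return PAIR.findall(html)


# Sample built from the guessed input shape shown earlier in this post.
sample = (
    "<html><body>\n"
    '<h5><a href="file/?qwe=queryString">\n'
    "<span>link text</span></a></h5>\n"
    "</body></html>"
)
print(query_and_text(sample))
print(extract_pairs(sample))
```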

- TH