Select all instances that meet the search regex


Ivaylo

Jan 10, 2009, 5:23:04 PM
to Regex
Hi,

I'm using EditPad Pro to process a large HTML file. I need to extract
only part of the text. For this purpose, I do the following regex
search:

<td valign=top>(.*?)</td>

in order to find any text between the opening tag "<td valign=top>"
and the closing tag "</td>".

Now I need either:

1. copy all the instances that match the search regex to a separate
file for further processing

or

2. delete all the text that doesn't match the search regex string.

Can you advise me how to do this?

Thank you in advance!

Eugeny Sattler

Jan 11, 2009, 3:00:47 AM
to re...@googlegroups.com
On Sun, Jan 11, 2009 at 2:23 AM, Ivaylo <ivanov...@gmail.com> wrote:
>
> Hi,
>
> I'm using EditPad Pro to process a large HTML file. I need to extract
> only part of the text. For this purpose, I do the following regex
> search:
>
> <td valign=top>(.*?)</td>
>
> in order to find any text between the opening tag "<td valign=top>"
> and the closing tag "</td>".
>
> Now I need either:
>
> 1. copy all the instances that match the search regex to a separate
> file for further processing

For this, PowerGREP is a more suitable tool than EditPad Pro. Use the
same regex <td valign=top>(.*?)</td> and put $1 in the collect field.
PowerGREP is expensive ($149 for a single license), but the trial is
free. Most likely you will be able to solve your problem in trial
mode.
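
For anyone without PowerGREP, the same "collect $1" idea can be sketched
in a few lines of Python. This is only an illustration, not what
PowerGREP does internally: the names CELL_RE and collect_cells are made
up for the example, and the DOTALL and IGNORECASE flags are assumptions
(cells may span several lines, and the tags may be upper case).

```python
import re

# DOTALL lets "." cross line breaks inside a cell (assumption: cells may
# span lines); IGNORECASE also accepts <TD VALIGN=TOP> (assumption).
CELL_RE = re.compile(r"<td valign=top>(.*?)</td>", re.DOTALL | re.IGNORECASE)


def collect_cells(html):
    """Return the text captured by group 1 for every match, in order."""
    return CELL_RE.findall(html)


sample = (
    "<tr><td valign=top>first</td><td>skip</td>\n"
    "<td valign=top>second\nline</td></tr>"
)

# One captured cell per entry; write "\n".join(...) to a file if needed.
for cell in collect_cells(sample):
    print(repr(cell))
```

The list returned by collect_cells can then be written to a separate
file for further processing.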


> 2. delete all the text that doesn't match the search regex string.

This is a far more complex task, as you need to write a regex that
matches everything else.
What I suggest is far from elegant, but have a look at this:
1) manually delete all text from the beginning of the file up to the
first <td valign=top>
2) manually delete all text after the last <td valign=top>(.*?)</td>
3) repeatedly delete all text between two neighbouring
<td valign=top>(.*?)</td> matches until nothing more is found.

Here is the regex I suggest for it, with comments below it (view in a
monospaced font such as Courier New).

(<td valign=top>(.*?)</td>)(.*?)<td valign=top>
\___________$1____________/\___/
                             |
                       to be deleted

and replace it with
$1<td valign=top>

This way you actually delete the text matched by the second (.*?) in
the regular expression (capture group 3).
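
The three steps above can be sketched in Python as follows. This is a
rough illustration under stated assumptions, not a port of the EditPad
Pro procedure: keep_only_cells, GAP_RE and the separator parameter are
names invented for the example. It also deviates from the regex above in
one respect: the trailing <td valign=top> is moved into a lookahead.
With Python's re.sub, an opening tag that the match consumes cannot
start the next match, so some gaps would be left behind.

```python
import re

OPEN_TAG = "<td valign=top>"
CLOSE_TAG = "</td>"

# Group 1 is one whole cell; the lazy ".*?" after it is the gap to delete.
# The lookahead checks for the next opening tag without consuming it, so
# that tag is still available as the start of the next match.
GAP_RE = re.compile(r"(<td valign=top>.*?</td>).*?(?=<td valign=top>)", re.DOTALL)


def keep_only_cells(html, separator="\n"):
    """Delete everything outside <td valign=top>...</td> pairs."""
    start = html.find(OPEN_TAG)
    if start == -1:
        return ""  # no matching cell at all
    html = html[start:]  # step 1: drop text before the first cell
    # step 3: replace each gap between neighbouring cells with the separator
    html = GAP_RE.sub(lambda m: m.group(1) + separator, html)
    # step 2: drop text after the last cell
    last_open = html.rfind(OPEN_TAG)
    end = html.find(CLOSE_TAG, last_open)
    return html if end == -1 else html[: end + len(CLOSE_TAG)]


sample = (
    "head <td valign=top>a</td> junk <td valign=top>b</td>"
    "<td>x</td> more <td valign=top>c</td> tail</td>"
)
print(keep_only_cells(sample))
```

Because the lookahead leaves the next opening tag in place, a single
pass handles every gap and no repeated replacing is needed.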

Please report if it worked.

--
Regards,
Eugeny

Eugeny....@gmail.com

Jan 11, 2009, 3:08:39 AM
to Regex
> and replace it with
> $1 <td valign=top>
    ^
    |_ or put some line breaks here...

in order to visually separate neighbouring <TD>...</TD> tag pairs in
the resulting file.

Ivaylo

Jan 12, 2009, 11:22:38 AM
to Regex
Hi Eugeny,

Thank you for your suggestions.

I found a workaround: I pasted the HTML code into MS Word 2003,
then I used the Search function with the following search string:

\<td valign=top\>*\</td\>

Then I checked the box Select All found instances (or something like
that; I don't remember the exact name of this function). Word found
and selected all instances that matched the search string, and then I
copied and pasted them into a separate file.

It turns out that some editors lack this function.
