Grep for removing injection spam links in a database or html


Jefferis Peterson

Jan 28, 2020, 11:31:38 AM
to BBEdit Talk

I am dealing with old hacked WordPress sites that have spam links injected on images. I also need this for standard HTML sites.

I have access to the database and would like to remove links that look like this:


<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>
 

The value inside the href varies (it might be for Cialis or any other product), but the rest doesn't. I want to remove the entire link, so the result is a single space.


I don't know regex, so I would appreciate the help. I've tried online regex generators but they don't seem to be working.


In this case, the HTML link is attached to an image caption in the database, so what I need is to find that type of string in the database and remove it entirely. If I find and replace just the <a style="text-decoration:none" href=" portion of the source, it will leave a lot of orphaned tags or erase things I don't want erased.


TIA, 

Jeff

Roland Küffner

Jan 29, 2020, 6:55:09 AM
to BBEdit Talk
If only the content of the href attribute varies, this search should do the trick:

<a style="text-decoration:none" href=".+?">\.</a>

If you have no other <a> tags containing just a period, you could probably boil it down to:
<a.+?>\.</a>

Replacement text would be a space (of course).
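
A caveat on the shorter pattern, sketched in Python rather than BBEdit (the sample string is made up for illustration): because .+? keeps expanding until >\.</a> can match, a match that starts at an earlier, legitimate <a> tag on the same line can run all the way to the spam link and take the good link with it.

```python
import re

# Hypothetical sample: a legitimate link followed by a spam link on one line.
html = ('<a href="/about">About us</a> and '
        '<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>')

# The longer pattern is anchored on the exact style attribute.
specific = re.compile(r'<a style="text-decoration:none" href=".+?">\.</a>')
# The shorter pattern accepts any <a ...> opening.
short = re.compile(r'<a.+?>\.</a>')

specific_result = specific.sub(' ', html)
short_result = short.sub(' ', html)

print(repr(specific_result))  # the legitimate link survives
print(repr(short_result))     # the legitimate link is swallowed too
```

So the shorter form is probably only safe when no other <a> tag can appear earlier on the same line as a spam link.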

-Roland

GP

Jan 29, 2020, 8:46:09 AM
to BBEdit Talk
Given your sample link you want to find and replace with a space, the following regular expression will find the hacking links:

<a style="text-decoration:none" href="[^"]*">\.<\/a>

and for the replacement:

 

Note: the replacement is a space character.

The [^"]* matches anything that isn't a " character, so it will match any string (including an empty one) between the quotes in that position, regardless of whether the string is a valid URL. If you need to be more specific or careful about what gets matched in the quoted string, post some more examples of what you want to match, along with strings you specifically don't want matched.
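
For anyone who wants to try the pattern outside BBEdit, here is a minimal Python sketch of the same behavior. The function name strip_spam and the sample strings are assumptions for illustration; only the pattern itself comes from this post.

```python
import re

# Same pattern as above; [^"]* matches any run of non-quote characters,
# including an empty run.
SPAM_LINK = re.compile(r'<a style="text-decoration:none" href="[^"]*">\.<\/a>')

def strip_spam(text):
    """Replace each matching spam link with a single space."""
    return SPAM_LINK.sub(' ', text)

# A spam link with a path in the href is replaced by one space.
with_path = strip_spam(
    'caption<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>more')

# An empty href still matches, because * allows zero characters.
empty_href = strip_spam('<a style="text-decoration:none" href="">.</a>')

# A link whose text is not a lone period is left alone.
kept = '<a style="text-decoration:none" href="/page">Read more</a>'
untouched = strip_spam(kept)

print(repr(with_path))
print(repr(empty_href))
print(untouched == kept)
```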

For a test sample string:

test text
 text preceeding <a style="text-decoration:none" href="/ansaid-retail-cost">.</a> text succeeding <a style="text-decoration:none" href="/ansaid-retail-cost">.</a>

the global find and replace will result in:

test text
 text preceeding   text succeeding  


I suggest doing extensive testing on sample database copies to ensure you're getting the results you want. BBEdit's find and replace can really change a whole lot in a short time so you want to make sure you aren't going to regret it before clicking replace all.
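
One way to rehearse on a copy before committing, sketched in Python under the assumption that the database has been exported to plain text (the dump string and the preview helper are hypothetical): list every match with some surrounding context first, then check that the number of replacements equals the number of matches you reviewed.

```python
import re

SPAM_LINK = re.compile(r'<a style="text-decoration:none" href="[^"]*">\.<\/a>')

def preview(text, context=20):
    """Return (offset, matched text, surrounding text) per match; changes nothing."""
    hits = []
    for m in SPAM_LINK.finditer(text):
        lo = max(0, m.start() - context)
        hi = min(len(text), m.end() + context)
        hits.append((m.start(), m.group(0), text[lo:hi]))
    return hits

# Hypothetical stand-in for an exported copy of the database.
dump = ('Caption one<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>\n'
        'Caption two with a <a href="/gallery">real link</a>\n'
        'Caption three<a style="text-decoration:none" href="/other-product">.</a>\n')

hits = preview(dump)
for offset, matched, around in hits:
    print(offset, matched)

# subn returns the new text plus how many replacements were made.
cleaned, count = SPAM_LINK.subn(' ', dump)
print(count, 'replacements')
```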

Jefferis Peterson

Jan 30, 2020, 4:26:47 PM
to BBEdit Talk
Thank you GP. That worked very well for 80 links, but I found another set of links masked by making the text invisible in the posts:
<a style="text-decoration:none; color: #f9f9f9;" href="http://alwaysvaltrexonline.com">buy valtrex online</a>

There are 3 variations: the addition of colons in the style, the addition of the color #, and, instead of a single period to disguise it, full text in the link.

I tried to modify what you did, but nothing I did seemed to work.
This is the closest I came, modeled on what you did. I thought that this set of characters would find any text: [^”]* So I added it where the strings for the color and the link text are, but it did not find these types of strings:

<a style="text-decoration:none[^”]*" href=“[^”]*”>[^”]*<\/a>

Can you tell me what I did wrong?  And are spaces ignored in the string? 

Rick Gordon

Jan 30, 2020, 5:20:30 PM
to bbe...@googlegroups.com
One thing to watch for is that in your search, you are looking for curly
quotes (actually a mix of straight and curly quotes), which should all
be straight.

Rick Gordon

--------------------
On January 30, 2020 at 2:18:30 PM [-0800], Jefferis Peterson wrote in an
email entitled "Re: Grep for removing injection spam links in a database
or html":
> ___________________________________________
RICK GORDON
EMERALD VALLEY GRAPHICS AND CONSULTING
___________________________________________
WWW: http://www.shelterpub.com

GP

Jan 30, 2020, 10:19:53 PM
to BBEdit Talk


On Thursday, January 30, 2020 at 1:26:47 PM UTC-8, Jefferis Peterson wrote:
> Thank you GP. That worked very well for 80 links, but I found another set of links masked by making the text invisible in the posts:
> <a style="text-decoration:none; color: #f9f9f9;" href="http://alwaysvaltrexonline.com">buy valtrex online</a>
>
> There are 3 variations: the addition of colons in the style, the addition of the color #, and, instead of a single period to disguise it, full text in the link.
>
> I tried to modify what you did, but nothing I did seemed to work.
> This is the closest I came, modeled on what you did. I thought that this set of characters would find any text: [^”]* So I added it where the strings for the color and the link text are, but it did not find these types of strings:

[^”]* doesn't match any text; it matches any text BUT a quote character.

> <a style="text-decoration:none[^”]*" href=“[^”]*”>[^”]*<\/a>
>
> Can you tell me what I did wrong? And are spaces ignored in the string?

Somehow you got typographer's (curly) quotes in your [^”]* additions, whereas the text you're trying to match has straight quotes. So the first [^”]* greedily consumes all the remaining text, because there's no curly quote to stop it, and the rest of the regular expression, which also expects curly quotes around the href value, has nothing it can match. (The one exception would be text that really does contain curly quotes where the trailing href=“[^”]*”>[^”]*<\/a> part expects them.)
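
A quick way to catch this kind of paste problem, sketched in Python (the helper find_curly_quotes and the translation table are assumptions for illustration, not anything built into BBEdit): scan the pattern for curly quote characters before running it, and optionally normalize them to straight quotes.

```python
import re

# Curly quote characters and their straight replacements.
CURLY_TO_STRAIGHT = str.maketrans({'“': '"', '”': '"', '‘': "'", '’': "'"})

def find_curly_quotes(pattern):
    """Return (index, character) for every curly quote in the pattern."""
    return [(i, ch) for i, ch in enumerate(pattern) if ch in '“”‘’']

# The pattern as pasted, with a mix of straight and curly quotes.
pasted = r'<a style="text-decoration:none[^”]*" href=“[^”]*”>[^”]*<\/a>'
html = ('<a style="text-decoration:none; color: #f9f9f9;" '
        'href="http://alwaysvaltrexonline.com">buy valtrex online</a>')

suspects = find_curly_quotes(pasted)
print(len(suspects), 'curly quotes found')

# The pasted pattern cannot match text that only has straight quotes.
print(re.search(pasted, html))

# After normalizing the quotes, this single sample link is found.
normalized = pasted.translate(CURLY_TO_STRAIGHT)
print(re.search(normalized, html) is not None)
```

Normalizing only addresses the quote characters; it does not change what the character classes are asked to exclude.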

Second, using [^”]* in the ”>[^”]*<\/a> portion doesn't stop consuming text when the </a> tag is encountered. You don't want to match everything but a quote character; you want to match everything but a < character, which leaves the </a> unconsumed so the <\/a> part of the regular expression can match it.

Fixing those issues yields:

<a style="text-decoration:none[^"]*" href="[^"]*">[^<]+<\/a>

With the [^<]+, I'm assuming there is always at least one character between the > and the </a> in the bogus links you're trying to get rid of. If that isn't true, you can change the + to a *, which allows a match even when there is no character in that position.
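
To make the + versus * choice concrete, here is a small Python sketch. The sample text, including the link with empty link text, is invented for illustration.

```python
import re

# [^<]+ requires at least one character of link text; [^<]* allows none.
AT_LEAST_ONE = re.compile(
    r'<a style="text-decoration:none[^"]*" href="[^"]*">[^<]+<\/a>')
ALLOW_EMPTY = re.compile(
    r'<a style="text-decoration:none[^"]*" href="[^"]*">[^<]*<\/a>')

# Hypothetical text with a period link, a full-text link, and an empty link.
text = ('A <a style="text-decoration:none" href="/ansaid-retail-cost">.</a> '
        'B <a style="text-decoration:none; color: #f9f9f9;" '
        'href="http://alwaysvaltrexonline.com">buy valtrex online</a> '
        'C <a style="text-decoration:none" href="/empty"></a> D')

plus_result = AT_LEAST_ONE.sub(' ', text)
star_result = ALLOW_EMPTY.sub(' ', text)

print(repr(plus_result))  # the empty link after C is still there
print(repr(star_result))  # all three links are gone
```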

Supplementing the test text with the additional HTML example:

test text
 text preceeding <a style="text-decoration:none" href="/ansaid-retail-cost">.</a> text succeeding <a style="text-decoration:none" href="/ansaid-retail-cost">.</a>

<a style="text-decoration:none; color: #f9f9f9;" href="http://alwaysvaltrexonline.com">buy valtrex online</a>  text preceeding <a style="text-decoration:none" href="/ansaid-retail-cost">.</a> text succeeding <a style="text-decoration:none" href="/ansaid-retail-cost">.</a>

and applying the revised regular expression yields:

test text
 text preceeding   text succeeding  

   text preceeding   text succeeding  

Jefferis Peterson

Jan 31, 2020, 10:41:27 AM
to BBEdit Talk
THANK YOU!!! That worked!!!