Multi-line regex puzzle

nmyshkin

Jan 15, 2025, 12:29:48 PM

to Tasker

Years ago I created a routine using my local newspaper rss feed to produce a digest of stories without ads, etc. About a year ago the "California" feed began to include wildfire bot entries (ranging from a dumpster fire somewhere to an actual cataclysm). The article content is only one or two sentences but the number of bot entries is large and actual news stories are sprinkled among them depending on when a reporter posted.

I'd like to eliminate the fire bot stuff altogether before I start massaging the rss input. I've tried a variety of regular expressions to isolate and delete the offending entries, with no success. The presence of a CR and LF at the end of each line of text is making it doubly difficult.

I've tried replacing all the \r and \n with the actual letters CRLF and then attacking the problem that way (restoring them afterwards works fine), but I still can't seem to isolate the text I want to delete. Here's a brief extract with the \r and \n represented as CRLF:

CRLF
]]> https://www.sanluisobispo.com/news/california/article298476158.htmlCRLF
Update: Auto Fire burns 56 acres in Ventura County since Jan. 13CRLF
https://www.sanluisobispo.com/news/california/fires/CRLF
article298496048.html#storylink=rss Click to Continue »]]> Mon, 13 JanCRLF
2025 20:19:53 PST By CA WILDFIRE BOTCRLF
CRLF

The key for detection is in the last line of the entry (of course), "CA WILDFIRE BOT", but I want to eliminate the entire entry and others like it before I process the rss file.

Any pointers? Maybe an entirely different approach?

nmyshkin

Jan 16, 2025, 2:53:30 PM

to Tasker

Update: I have what is mostly a solution. It doesn't catch the fire articles which are given actual article numbers and listed as "Update", but it does remove all the "Breaking News" wildfire alerts (which is most of them).

]]> https://www.sanluisobispo.com/news/california/fires/.*?BOT\r\n\r\n

This removes both the entry and the (now) extra carriage returns and line feeds, so a "Search and Replace" with the replacement left blank should do the trick. I hope.
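For anyone who wants to try the idea outside Tasker, here is a minimal Python sketch of the search-and-replace above. The sample feed text and the article1/article3 URLs are made up for illustration; only the fires/ prefix and the BOT ending come from the pattern above. The DOTALL flag is an addition so that .*? can cross line breaks (in Java-style regex engines the inline flag (?s) has the same effect); Tasker's own behaviour may differ.

```python
import re

# Made-up sample: two ordinary entries around one "fires/" bot entry.
feed = (
    "]]> https://www.sanluisobispo.com/news/california/article1.html\r\n"
    "Real story\r\n"
    "Mon, 13 Jan 2025 By A REPORTER\r\n"
    "\r\n"
    "]]> https://www.sanluisobispo.com/news/california/fires/\r\n"
    "article2.html Breaking News: fire\r\n"
    "2025 20:19:53 PST By CA WILDFIRE BOT\r\n"
    "\r\n"
    "]]> https://www.sanluisobispo.com/news/california/article3.html\r\n"
    "Another story\r\n"
    "Mon, 13 Jan 2025 By A REPORTER\r\n"
    "\r\n"
)

# Escape the literal prefix, then lazily match up to the first "BOT"
# that is followed by a blank line. DOTALL lets "." match \r and \n.
pattern = re.compile(
    re.escape("]]> https://www.sanluisobispo.com/news/california/fires/")
    + r".*?BOT\r\n\r\n",
    re.DOTALL,
)

# An empty replacement deletes the entry and its trailing blank line.
cleaned = pattern.sub("", feed)
print(cleaned)
```

One caveat with this shape of pattern: the lazy match runs from a fires/ URL to the next BOT ending, so if an entry written by a reporter ever started with the fires/ URL, it would be deleted along with everything up to the next bot entry.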

RSF

Feb 4, 2025, 12:42:24 PM

to Tasker

Multi-line regex processing is tricky, especially trying to process, in effect, multiple entries each of which may be multiple lines long.

I'd try Tasker's Variable Split action to put each article into a separate array entry, then loop through the entries, using a regex match to pick out the non-bot articles (the ones you want to keep) and collecting them with the Array Push action. It's likely better suited to the task than regex by itself. Something like below (and attached). This assumes that each article starts with ]]> at the beginning of a line; if there are other delimiters, update the Variable Split action's splitter setting accordingly.

Task: Multi line Regex Split 2

A1: Variable Set [
Name: %feed
To: ]]> 1st article
etc.

]]> https://www.sanluisobispo.com/news/california/article298476158.html

Update: Auto Fire burns 56 acres in Ventura County since Jan. 13

https://www.sanluisobispo.com/news/california/fires/

article298496048.html#storylink=rss Click to Continue »]]> Mon, 13 Jan

2025 20:19:53 PST By CA WILDFIRE BOT

]]> 3rd article
etc.By CA WILDFIRE BOT

]]> 4th article...
Structure Output (JSON, etc): On ]

A2: Variable Split [
Name: %feed
Splitter: \n]]>
Delete Base: On
Regex: On ]

A3: Array Clear [
Variable Array: %nonbot_articles ]

A4: For [
Variable: %article
Items: %feed()
Structure Output (JSON, etc): On ]

A5: Variable Search Replace [
Variable: %article
Search: ]]>
Replace Matches: On ]

A6: If [ %article !~R By CA WILDFIRE BOT ]

A7: Array Push [
Variable Array: %nonbot_articles
Position: 999999
Value: %article ]

A8: End If

A9: End For

A10: Popup [
Title: Result
Text: # non-bot articles: %nonbot_articles(#)

%nonbot_articles()
Layout: Popup
Timeout (Seconds): 418 ]

Multi_line_Regex_Split_2.tsk.xml
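The same split-then-filter idea can be sketched in Python for testing away from the device. This is only an illustration under the same assumption as the task (each article starts with ]]> at the beginning of a line); the function name non_bot_articles and the "By A REPORTER" byline are hypothetical, and the sample feed is abbreviated from the one in A1.

```python
import re


def non_bot_articles(feed: str, marker: str = "By CA WILDFIRE BOT") -> list[str]:
    """Split the feed into articles and keep those without the bot marker."""
    # Like A2: split wherever a line begins with "]]>".
    chunks = re.split(r"\r?\n\]\]>", feed)
    kept = []
    for chunk in chunks:
        # Like A5: drop any remaining "]]>" text, then trim whitespace.
        text = chunk.replace("]]>", "").strip()
        # Like A6/A7: keep only non-empty articles that lack the marker.
        if text and marker not in text:
            kept.append(text)
    return kept


feed = (
    "]]> 1st article\n"
    "By A REPORTER\n"
    "]]> https://www.sanluisobispo.com/news/california/fires/\n"
    "article298496048.html#storylink=rss Click to Continue »]]> Mon, 13 Jan\n"
    "2025 20:19:53 PST By CA WILDFIRE BOT\n"
    "]]> 3rd article\n"
    "etc.By CA WILDFIRE BOT\n"
    "]]> 4th article"
)

result = non_bot_articles(feed)
print(len(result), "non-bot articles")
for article in result:
    print(article)
    print("---")
```

Because the split only happens at a line start, the ]]> that appears mid-line after "Click to Continue »" does not break an article in two.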
