Multi-line regex puzzle

nmyshkin

Jan 15, 2025, 12:29:48 PM
to Tasker
Years ago I created a routine using my local newspaper rss feed to produce a digest of stories without ads, etc. About a year ago the "California" feed began to include wildfire bot entries (ranging from a dumpster fire somewhere to an actual cataclysm). The article content is only one or two sentences but the number of bot entries is large and actual news stories are sprinkled among them depending on when a reporter posted.

I'd like to eliminate the fire bot stuff altogether before I start massaging the rss input. I've tried a variety of regex patterns to isolate and delete the offending entries, with no success. The presence of a CR and LF at the end of each line of text is making it doubly difficult.

I've tried replacing all the \r and \n with the actual letters CRLF and then attacking the problem that way (restoring them afterward works fine), but I still can't seem to isolate the text I want to delete. Here's a brief extract with the \r and \n represented as CRLF:

CRLF
]]> https://www.sanluisobispo.com/news/california/article298476158.htmlCRLF
Update: Auto Fire burns 56 acres in Ventura County since Jan. 13CRLF
https://www.sanluisobispo.com/news/california/fires/CRLF
article298496048.html#storylink=rss Click to Continue »]]> Mon, 13 JanCRLF
2025 20:19:53 PST By CA WILDFIRE BOTCRLF
CRLF
 
The key for detection is in the last line of the entry (of course), "CA WILDFIRE BOT", but I want to eliminate the entire entry and others like it before I process the rss file.

Any pointers? Maybe an entirely different approach?

nmyshkin

Jan 16, 2025, 2:53:30 PM
to Tasker
Update: I have what is mostly a solution. It doesn't catch the fire articles which are given actual article numbers and listed as "Update", but it does remove all the "Breaking News" wildfire alerts (which is most of them).


This removes both the entry and the (now) extra carriage returns and line feeds, so a "Search and Replace" with the replacement left blank should do the trick. I hope.

RSF

Feb 4, 2025, 12:42:24 PM
to Tasker
Multi-line regex processing is tricky, especially trying to process, in effect, multiple entries each of which may be multiple lines long.
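
To illustrate why, here is a rough sketch of one common pitfall. It uses Python's re module as a stand-in for Tasker's regex engine (an assumption; the two are not identical), and the two-entry feed, including the "By Staff Writer" byline, is made up for the example. A lazy match from ]]> to the bot signature looks reasonable, but the match begins at the first ]]> in the text, so it swallows any real article that sits in front of a bot entry.

```python
import re

# Hypothetical two-entry feed with CRLF line endings: a real story, then a bot entry.
feed = (
    "]]> Real story one\r\n"
    "Mon, 13 Jan 2025 By Staff Writer\r\n"
    "]]> Fire update\r\n"
    "Mon, 13 Jan 2025 By CA WILDFIRE BOT\r\n"
)

# Naive attempt: lazily match from "]]>" to the bot signature,
# with DOTALL so that "." also matches \r and \n.
naive = re.compile(r"\]\]>.*?By CA WILDFIRE BOT", re.DOTALL)

leftover = naive.sub("", feed)

# The match starts at the first "]]>" in the text, so the real story
# is deleted together with the bot entry.
real_story_survived = "Real story one" in leftover
print(real_story_survived)
```

The lazy quantifier only shortens where the match ends, not where it starts, which is why splitting the feed into separate entries first tends to be easier to get right.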

I'd try Tasker's Variable Split action to put each article into a separate array entry, then loop through the entries, using a regex test to skip the bot articles and the Array Push action to collect the non-bot articles you want to keep. It's likely better suited to the task than regex by itself. Something like below (and attached). This assumes that each article starts with ]]> at the beginning of a line; if there are other delimiters, update the Variable Split action's splitter setting accordingly.

Task: Multi line Regex Split 2
   
    A1: Variable Set [
         Name: %feed
         To: ]]> 1st article
         etc.
       
         ]]> https://www.sanluisobispo.com/news/california/article298476158.html

         Update: Auto Fire burns 56 acres in Ventura County since Jan. 13
         https://www.sanluisobispo.com/news/california/fires/

         article298496048.html#storylink=rss Click to Continue »]]> Mon, 13 Jan
         2025 20:19:53 PST By CA WILDFIRE BOT
       
         ]]> 3rd article
         etc.By CA WILDFIRE BOT
       
         ]]> 4th article...
         Structure Output (JSON, etc): On ]
   
    A2: Variable Split [
         Name: %feed
         Splitter: \n]]>
         Delete Base: On
         Regex: On ]
   
    A3: Array Clear [
         Variable Array: %nonbot_articles ]
   
    A4: For [
         Variable: %article
         Items: %feed()
         Structure Output (JSON, etc): On ]
   
        A5: Variable Search Replace [
             Variable: %article
             Search: ]]>
             Replace Matches: On ]
   
        A6: If [ %article !~R By CA WILDFIRE BOT ]
   
            A7: Array Push [
                 Variable Array: %nonbot_articles
                 Position: 999999
                 Value: %article ]
   
        A8: End If
   
    A9: End For
   
    A10: Popup [
          Title: Result
          Text: # non-bot articles: %nonbot_articles(#)
       
         %nonbot_articles()
          Layout: Popup
          Timeout (Seconds): 418 ]
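
For anyone who wants to try the same split-then-filter idea outside Tasker, here is a minimal Python sketch of it. It is not a translation of the task file: the function name non_bot_articles, the "By Staff" byline, and the sample feed are made up for illustration, and it assumes, like the task, that every article starts with ]]> at the beginning of a line.

```python
import re

BOT_SIGNATURE = "By CA WILDFIRE BOT"


def non_bot_articles(feed: str) -> list[str]:
    """Split the feed into articles and keep those without the bot signature."""
    # Normalize CRLF to LF so the splitter only has to deal with "\n".
    text = feed.replace("\r\n", "\n")
    # Split wherever "]]>" starts a line, like the task's "\n]]>" splitter.
    chunks = re.split(r"\n\]\]>", text)
    kept = []
    for chunk in chunks:
        # Drop any remaining "]]>" markers, as step A5 does.
        article = chunk.replace("]]>", "").strip()
        if article and BOT_SIGNATURE not in article:
            kept.append(article)
    return kept


# Hypothetical sample: two staff articles around one bot entry.
sample = (
    "]]> 1st article\r\nBy Staff\r\n"
    "]]> Update: Auto Fire\r\nClick to Continue »]]> Mon, 13 Jan\r\n"
    "2025 By CA WILDFIRE BOT\r\n"
    "]]> 4th article\r\nBy Staff\r\n"
)
articles = non_bot_articles(sample)
print(len(articles))
```

Because each entry is tested on its own, a ]]> in the middle of a line (as in "Click to Continue »]]>") does not confuse the filter; it only gets stripped from the article text.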
Multi_line_Regex_Split_2.tsk.xml