Years ago I created a routine that uses my local newspaper's RSS feed to produce a digest of stories without ads, etc. About a year ago the "California" feed began to include wildfire bot entries (ranging from a dumpster fire somewhere to an actual cataclysm). Each bot entry is only one or two sentences long, but there are a lot of them, and actual news stories are sprinkled among them depending on when a reporter posted.
I'd like to eliminate the fire bot stuff altogether before I start massaging the RSS input. I've tried a variety of regular expressions to isolate and delete the offending entries, with no success. The presence of a CR and LF at the end of each line of text is making it doubly difficult.
I've tried replacing every \r and \n with the actual letters CRLF and then attacking the problem that way (restoring them afterward works fine), but I still can't seem to isolate the text I want to delete. Here's a brief extract with the \r and \n represented as CRLF
The key for detection is in the last line of the entry (of course), "CA WILDFIRE BOT", but I want to eliminate the entire entry, and others like it, before I process the RSS file.
Any pointers? Maybe an entirely different approach?
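To make the "entirely different approach" concrete, here is a rough sketch of what I imagine it might look like: parse the feed as XML and drop whole entries instead of fighting CR/LF with regex. It assumes the feed is ordinary RSS 2.0, with each entry in an <item> element under <channel>, and that "CA WILDFIRE BOT" appears somewhere in the text of the item. The function name strip_bot_items is made up for illustration, and I haven't checked any of this against the real feed.

```python
import xml.etree.ElementTree as ET

# Detection key quoted in the post; everything else here is an assumption.
MARKER = "CA WILDFIRE BOT"


def strip_bot_items(rss_text: str, marker: str = MARKER) -> str:
    """Return the feed with every <item> whose text contains `marker` removed.

    Assumes RSS 2.0 layout: <rss><channel><item>...</item></channel></rss>.
    """
    root = ET.fromstring(rss_text)
    # Materialize the channel list first so removals don't disturb iteration.
    for channel in list(root.iter("channel")):
        for item in channel.findall("item"):
            # itertext() joins the text of all child elements, so line
            # endings inside the entry don't matter for the match.
            if marker in "".join(item.itertext()):
                channel.remove(item)
    return ET.tostring(root, encoding="unicode")
```

The idea is that the parser handles the entry boundaries, so the line endings stop being a problem. One caveat I'm aware of: if the feed uses XML namespaces, ElementTree may rewrite the prefixes on output unless they are registered first.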