GREP question: how to find and remove tags in XML document


cosmo

May 12, 2019, 8:48:38 AM
to BBEdit Talk
Hi everyone,

trying to wrap my head around grep to accomplish this: i have several very large XML docs and need to remove some tags from them (just the tags and attributes, not the content inside the tags). There is just too much to do this manually so i'm wondering if this is possible with BBedit.

Within the document i have these kind of elements:

<structured-content content-type="task" vocab="work" vocab-term="main" id="id-c416e7d8-cd3d-4c85-bb5a-d5496d0aa54a">Some text content here</structured-content>



sometimes there can be one or several other elements inside these elements like for example:


<structured-content content-type="task" vocab="work" vocab-term="main" id="id-c416e7d8-cd3d-4c85-bb5a-d5496d0aa54a">Some <xref ref-type="ctrl" rid="id-a1847df5-8e01-21d8-f1d1-88b03728498b">text content</xref> here.</structured-content>


So i need to remove the <structured-content> tags entirely but preserve the content and tags WITHIN these elements, is this at all possible with grep in bbedit ?

Note that the code above is just a very simple example to illustrate, it gets much more complicated than that with tables, mathml, figures etc inside the tags to be removed ... there isn't a repetitive pattern to the content that can be inside the tags to be stripped, it could be anything so i want to be sure that i don't delete anything other than the enclosing tag.

I'm working my way through tutorials and examples but have a long way to go to figure this out on my own, so maybe one of you has a way to do this ?

many thx in advance for your help.

Christopher Stone

May 12, 2019, 2:46:16 PM
to BBEdit-Talk
On 05/12/2019, at 03:54, cosmo <cosmop...@gmail.com> wrote:
Trying to wrap my head around grep to accomplish this …


Hey Cosmo,

The first thing to try is:

BBEdit Menu > Markup > Utilities > Translate HTML to Text

If you need to resort to a RegEx then try this:

Find:

<[^>]+?>

Replace:

Nothing.

--
Best Regards,
Chris

cosmo

May 12, 2019, 3:28:20 PM
to BBEdit Talk


On Sunday, May 12, 2019 at 8:46:16 PM UTC+2, Christopher Stone wrote:

BBEdit Menu > Markup > Utilities > Translate HTML to Text

If you need to resort to a RegEx then try this:

Find: <[^>]+?>

Replace: Nothing.




thx for answering but it seems you did not read my post, i don't want to convert the xml to text or remove all tags ... i need to find and delete SPECIFIC tags WHILE PRESERVING ALL OTHER TAGS

Christopher Stone

May 12, 2019, 4:26:42 PM
to BBEdit-Talk
On 05/12/2019, at 14:26, cosmo <cosmop...@gmail.com> wrote:
thx for answering but it seems you did not read my post, i don't want to convert the xml to text or remove all tags ... i need to find and delete SPECIFIC tags WHILE PRESERVING ALL OTHER TAGS


Hey Cosmo,

Clearly I did read your post, because I responded to it – but obviously I overlooked this requirement.

Note that the code above is just a very simple example to illustrate, it gets much more complicated than that with tables, mathml, figures etc inside the tags to be removed ... there isn't a repetitive pattern to the content that can be inside the tags to be stripped, it could be anything so i want to be sure that i don't delete anything other than the enclosing tag.

You really can't do this job with a regular expression search and replace, because of the unpredictable complexity of the tag.

You should post one or two examples of really complex tags, because it might be possible to use AppleScript or Perl to do the job.

--
Best Regards,
Chris


Bruce Linde

May 12, 2019, 5:53:51 PM
to BBEdit Talk
wait - i re-read your problem statement and read it as removing the structured-content tags and attributes only... is that correct?

if so, this will remove all opening and closing structured-content tags and their attributes only and leave everything else:

<\/*structured-content[^>]*>

"find any amount of leading slashes followed by 'structured-content' followed by any amount of anything as long as it's not a closing tag greater than sign"
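
For anyone who wants to try the pattern outside BBEdit, here is a minimal Python sketch of the same search and replace. The function name `strip_structured_content` is made up for illustration, and Python's `re` module stands in for BBEdit's grep engine, so treat it as an approximation rather than a record of what BBEdit does internally.

```python
import re

# Bruce's pattern, unchanged: "<", optional slash(es), the tag name,
# then anything that is not ">" up to the closing ">".
TAG = re.compile(r"<\/*structured-content[^>]*>")

def strip_structured_content(xml_text: str) -> str:
    """Delete structured-content open/close tags, keep everything between them.

    Hypothetical helper name, not part of BBEdit or any library.
    """
    return TAG.sub("", xml_text)

sample = (
    '<structured-content content-type="task" vocab="work" vocab-term="main" '
    'id="id-c416e7d8-cd3d-4c85-bb5a-d5496d0aa54a">Some '
    '<xref ref-type="ctrl" rid="id-a1847df5-8e01-21d8-f1d1-88b03728498b">'
    'text content</xref> here.</structured-content>'
)

# The xref element survives; only the enclosing tags are gone.
print(strip_structured_content(sample))
```

Because opening and closing tags are matched one at a time, the pattern never has to work out which closing tag belongs to which opening tag.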

cosmo

May 12, 2019, 5:53:56 PM
to BBEdit Talk
just to clarify, i am looking for a way to find and delete specific tags ONLY while preserving all other tags, including those contained WITHIN the tags i want to remove. So for the above example the output i'm looking for is:


Original Markup:

<structured-content content-type="task" vocab="work" vocab-term="main" id="id-c416e7d8-cd3d-4c85-bb5a-d5496d0aa54a">Some <xref ref-type="ctrl" rid="id-a1847df5-8e01-21d8-f1d1-88b03728498b">text content</xref> here.</structured-content>


Processed markup:

Some <xref ref-type="ctrl" rid="id-a1847df5-8e01-21d8-f1d1-88b03728498b">text content</xref> here.

Bruce Linde

May 12, 2019, 6:06:03 PM
to BBEdit Talk
you specifically said: "So i need to remove the <structured-content> tags entirely but preserve the content and tags WITHIN these elements, is this at all possible with grep in bbedit ?"

my solution does just that.

if a yutz like me and a monster like christopher are both unclear on what you're looking to accomplish, it's clearly on you to provide specific before and after examples of EXACTLY what you're starting with and would like to end up with... the more, the merrier.

Christopher Stone

May 12, 2019, 6:06:45 PM
to BBEdit-Talk
On 05/12/2019, at 14:30, cosmo <cosmop...@gmail.com> wrote:
Just to clarify, i am looking for a way to find and delete specific tags ONLY while preserving all other tags, including those contained WITHIN the tags i want to remove. So for the above example the output i'm looking for is:


Hey Cosmo,

Aha!

Okay, that's doable.

Try this:

Find:

<structured-content[^>]+>(.+?)</structured-content>

Replace:

\1


--
Best Regards,
Chris

Christopher Stone

May 12, 2019, 6:10:12 PM
to BBEdit-Talk
On 05/12/2019, at 17:04, Bruce Linde <bli...@5happy.com> wrote:
you specifically said: "So i need to remove the <structured-content> tags entirely but preserve the content and tags WITHIN these elements, is this at all possible with grep in bbedit ?"

my solution does just that.


I meant to mention that Bruce's solution also works.  😎

--
Best Regards,
Chris


cosmo

May 13, 2019, 9:28:24 AM
to BBEdit Talk
Morning everyone,

first of all apologies if my initial post wasn't specific enough and my first replies came across as impolite, i've had some rather frustrating forum experiences last week and think i might have inadvertently slipped into 'oh no not again'-mode, and clearly this is not that :)
Also not a fan of this google forum software which doesn't let me edit my initial post for clarification and seems to put replies where it wants and not where they should be, might be just me but i find it very hard to follow threads where several people are replying to different posts.

hope you won't hold it against me, i'm not usually such an ass :P


so back to the issue at hand:

Chris your solution:

<structured-content[^>]+>(.+?)</structured-content>
 
works great, however i forgot to mention that the <structured-content ... > tags can also contain other <structured-content> tags like this:

<structured-content>some random text here
<random-tag>with some other random content</random-tag> and more text
<structured-content>can be here also
<structured-content>or <more-random>here</more-random> as well
</structured-content> maybe here too
</structured-content>and also here
</structured-content>


in these cases it will use the end tag of the inner tag as closing tag .. so that doesn't work in these cases .. but still very cool to see this solution and i'm learning a lot from it. I don't suppose there is a way to modify your solution to work in this case since i don't know upfront if or how many sub-tags there are in each.
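
A small Python sketch of that failure, in case it helps to see it in isolation. The nested sample is a shortened, single-line version of the example above, with made-up `id` attributes added so that the `[^>]+` part of the pattern has something to match. Python's `re` module is used as a stand-in for BBEdit's grep, so this is an illustration under those assumptions, not a trace of BBEdit itself.

```python
import re

# Chris's paired pattern: opening tag, lazy capture, closing tag.
PAIRED = re.compile(r"<structured-content[^>]+>(.+?)</structured-content>")

# Hypothetical nested input; the ids "a" and "b" are invented for the demo.
nested = (
    '<structured-content id="a">outer '
    '<structured-content id="b">inner</structured-content>'
    ' tail</structured-content>'
)

# The lazy (.+?) stops at the FIRST closing tag, which belongs to the
# inner element, so the outer opening tag gets paired with the wrong close.
one_pass = PAIRED.sub(r"\1", nested)
print(one_pass)
# → outer <structured-content id="b">inner tail</structured-content>
```

After one pass an inner opening tag and the outer closing tag are still in the text, which is the mismatch described above.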


Bruce your solution

<\/*structured-content[^>]*>

seems safer since it finds the opening and ending tags separately ... since i need to remove all <structured-content> tags this works for me (i think .. still stepping through one file testing to make sure i don't remove anything i shouldn't and there aren't other special cases i hadn't thought of)



Rod Buchanan

May 13, 2019, 10:09:26 AM
to BBEdit-Talk List

Find: </?structured-content.*?>
Replace:

The only issue would be tags whose names start with "structured-content" and continue with more characters, e.g. "<structured-content-data". Those will also be removed.
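
One possible way around that, sketched in Python: require that the tag name is followed by whitespace, "/" or ">" before accepting the match. The `structured-content-data` element and the names `LOOSE` and `STRICT` are made up for the demo, and the lookahead variant is only a suggestion that has not been checked against the real files.

```python
import re

# The pattern as posted: also matches any tag name that merely STARTS
# with "structured-content".
LOOSE = re.compile(r"</?structured-content.*?>")

# Assumed stricter variant: the lookahead insists that the name ends here,
# i.e. the next character is whitespace, "/" or ">".
STRICT = re.compile(r"</?structured-content(?=[\s/>])[^>]*>")

# Hypothetical input with a similarly named element that must be kept.
sample = (
    '<structured-content-data x="1">keep me</structured-content-data>'
    '<structured-content id="a">text</structured-content>'
)

print(LOOSE.sub("", sample))   # → keep metext
print(STRICT.sub("", sample))
# → <structured-content-data x="1">keep me</structured-content-data>text
```

The lookahead consumes nothing, so the rest of the pattern still picks up the attributes and the closing ">" as before.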

--
Rod Buchanan
Kelly Supply Company
1004 W Oklahoma Ave
Grand Island, NE 68802-1328
308 382-5670
308 382-8764 x1120
