Question about GREP search in XML files with weird CDATA fields


Miguel Perez

Feb 24, 2020, 11:44:36 AM
to BBEdit Talk
Hi,

I'm fairly new to RegEx and I need your help.

I process many XML files in my job. Most of them are formatted correctly, that is:
<key1>Value</key1>
<key2>Value</key2>

For those I search for values using:

<key1>.*?</key1>

And it works like a charm.

But then I have this one source that formats its XML files with CDATA fields like this:
<field>
    <key><![CDATA[NAME]]></key>
    <value><![CDATA[John Appleseed]]></value>
</field>
In this example they are trying to say that the value NAME is John Appleseed. Rather than putting it as a key/value pair, they do that weird syntax.

What GREP pattern can I use to extract all the names for this formatting?

I am open to other solutions, like BASH scripts and Applescript. I'm desperate.

Thank you for your help, friends.

🙂

ThePorgie

Feb 24, 2020, 12:01:22 PM
to BBEdit Talk
Put "\1" (no quotes) in the replace field and then Extract with
<value><!\[CDATA\[(.+?)\]

Will that work for ya?

Miguel Perez

Feb 24, 2020, 12:54:43 PM
to BBEdit Talk
Thank you, ThePorgie.

Unfortunately it doesn't work for me.

I should've said that there are many values using this syntax, like this:
<field>
    <key1><![CDATA[NAME]]></key1>
    <value><![CDATA[John Appleseed]]></value>
</field>
<field>
    <key2><![CDATA[Company]]></key2>
    <value><![CDATA[Google]]></value>
</field>


As you can see, there are two keys, but the very next line says value for both of them. That is my main concern.

I want value1 for each item on the list, but its defining key is in the line above with that CDATA formatting.

Any ideas?

ThePorgie

Feb 24, 2020, 1:08:24 PM
to BBEdit Talk
I would then include the line above in the pattern, so that the match requires the "NAME" key:
<key1><!\[CDATA\[NAME\]\]></key1>\n\s+<value><!\[CDATA\[(.+?)\]

again with a \1 in the replace before hitting extract.

Will that do what you need?
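
For anyone who would rather script this than use BBEdit's Extract, here is a minimal Python sketch of the same idea. The sample text and its four-space indentation are assumptions modeled on the example in this thread; the pattern is the one posted above.

```python
import re

# Sample data modeled on the example in this thread (indentation is assumed).
xml_text = """<field>
    <key1><![CDATA[NAME]]></key1>
    <value><![CDATA[John Appleseed]]></value>
</field>
<field>
    <key2><![CDATA[Company]]></key2>
    <value><![CDATA[Google]]></value>
</field>
"""

# Same pattern as above: the NAME key line must come right before the value line.
pattern = re.compile(
    r"<key1><!\[CDATA\[NAME\]\]></key1>\n\s+<value><!\[CDATA\[(.+?)\]"
)

# findall returns only the captured group, i.e. what "\1" would extract.
names = pattern.findall(xml_text)
print(names)  # ['John Appleseed']
```

The Company value is skipped because its key line does not match the NAME part of the pattern.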

Sam Hathaway

Feb 24, 2020, 1:09:51 PM
to BBEdit Talk

Can you give us a real-world example? I’m not clear on whether “key1” and “key2” literally appear in your document or if they are placeholders.

In any case, you should probably use a tool that is designed to work with XML. Such a tool would take care of the CDATA sections for you and let you search for things in a hierarchical way.

You might be able to “get the job done” with text-oriented tools, but it will eventually drive you insane.
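
As a rough illustration of what an XML-aware tool buys you, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The wrapping <fields> root element is an assumption (the fragments posted in this thread have no single root, which a parser requires), and the sketch also assumes the key is always the first child of each <field>.

```python
import xml.etree.ElementTree as ET

# Sample data from this thread, wrapped in an assumed <fields> root element.
xml_text = """<fields>
<field>
    <key1><![CDATA[NAME]]></key1>
    <value><![CDATA[John Appleseed]]></value>
</field>
<field>
    <key2><![CDATA[Company]]></key2>
    <value><![CDATA[Google]]></value>
</field>
</fields>"""

root = ET.fromstring(xml_text)

pairs = {}
for field in root.iter("field"):
    key_elem = list(field)[0]          # assumed: the key is the first child
    value_elem = field.find("value")   # the sibling <value> element
    # The parser unwraps CDATA sections, so .text is the bare string.
    pairs[key_elem.text] = value_elem.text

print(pairs["NAME"])  # John Appleseed
```

No escaping of the CDATA brackets is needed, and the lookup does not depend on line breaks or indentation.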
-sam

--
This is the BBEdit Talk public discussion group. If you have a feature request or need technical support, please email "sup...@barebones.com" rather than posting here. Follow @bbedit on Twitter: <https://twitter.com/bbedit>
---
You received this message because you are subscribed to the Google Groups "BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bbedit+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bbedit/8c4b1b3f-c421-472f-967f-9e3d5d3dfff8%40googlegroups.com.

ThePorgie

Feb 24, 2020, 1:12:16 PM
to BBEdit Talk
One other thing about an XML tool: the latest version of Mac Excel will now open XML. Just an FYI, in case that would work to get the names you're looking for.




GP

Feb 24, 2020, 1:44:54 PM
to BBEdit Talk


On Monday, February 24, 2020 at 9:54:43 AM UTC-8, Miguel Perez wrote:
Thank you, ThePorgie.

Unfortunately it doesn't work for me.

I should've said that there are many values using this syntax, like this:
<field>
    <key1><![CDATA[NAME]]></key1>
    <value><![CDATA[John Appleseed]]></value>
</field>
<field>
    <key2><![CDATA[Company]]></key2>
    <value><![CDATA[Google]]></value>
</field>


As you can see, there are two keys, but the very next line says value for both of them. That is my main concern.

I want value1 for each item on the list, but its defining key is in the line above with that CDATA formatting.

Any ideas?

As others have noted, expand your match pattern to reject the portions of the XML data you don't want to match.

Using your examples, the following regular expression will handle both types of match cases:

(?:\s*<key1><!\[CDATA\[NAME\]\]><\/key1>\s+<value><!\[CDATA\[(.*)\]\]><\/value>)|(?:<key1>(.*)<\/key1>)

The (?: ... ) constructs are non-capturing groups that organize the alternative matching cases.

Note that you want the longest match case (i.e., the CDATA pattern) as the first alternative, since the first alternative is the first case tried for a pattern match. It will thus correctly match the desired CDATA pattern, rather than letting the "<key1>(.*)<\/key1>" expression wrongly find a match in the CDATA text.

The second alternative is the expression which matches your "formatted correctly" non-CDATA XML case. (This alternative is only tried after the first alternative fails to find a match.)

Also note there are two capture fields in the regular expression. $1 captures the name text in the first CDATA case alternative and $2 captures the name text in the second alternative.
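
To see the two capture groups in action outside BBEdit, here is a small Python sketch. The pattern is the one above, unchanged; the helper name extract_names and the plain-format sample (with the made-up value "Jane Doe") are hypothetical and only there for illustration.

```python
import re

# The alternation from above: CDATA case first, plain <key1> case second.
pattern = re.compile(
    r"(?:\s*<key1><!\[CDATA\[NAME\]\]><\/key1>\s+<value><!\[CDATA\[(.*)\]\]><\/value>)"
    r"|(?:<key1>(.*)<\/key1>)"
)

def extract_names(text):
    """Return the captured name from whichever alternative matched."""
    results = []
    for m in pattern.finditer(text):
        # Group 1 is set by the CDATA alternative, group 2 by the plain one.
        results.append(m.group(1) if m.group(1) is not None else m.group(2))
    return results

# CDATA-style sample modeled on the thread (indentation is assumed).
cdata_sample = """<field>
    <key1><![CDATA[NAME]]></key1>
    <value><![CDATA[John Appleseed]]></value>
</field>
<field>
    <key2><![CDATA[Company]]></key2>
    <value><![CDATA[Google]]></value>
</field>
"""

# Hypothetical "formatted correctly" sample.
plain_sample = "<key1>Jane Doe</key1>\n<key2>Acme</key2>\n"

print(extract_names(cdata_sample))  # ['John Appleseed']
print(extract_names(plain_sample))  # ['Jane Doe']
```

One caveat with this approach: if a file contained a <key1> CDATA field whose key is something other than NAME, the first alternative would fail and the second would capture the raw CDATA markup, so the results may need a sanity check.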

Miguel Perez

Feb 24, 2020, 1:50:04 PM
to BBEdit Talk
This formula did the trick.
You are the best!
😊

Miguel Perez

Feb 24, 2020, 1:50:05 PM
to BBEdit Talk
This could be an option too. Thank you.
Excel ended up not working in this case because, while it reads the file, it shows it with weird formatting too and I can't work with it any better.
The formula posted above works much better for me and is what I was looking for.

Gustave Stresen-Reuter

Feb 25, 2020, 3:25:12 AM
to bbe...@googlegroups.com
Have you looked into xslt? That's really the tool you should be using for xml transformations. The learning curve is steep-ish but this is the exact use case it was designed for.

Have a look. The Mac comes with a command line tool called xsltproc preinstalled that is normally more than sufficient for these types of transformations. I usually pipe the output to bbedit for verification of my work when the transformation is complete.

Ted


Sam Hathaway

Feb 25, 2020, 1:54:06 PM
to bbe...@googlegroups.com

You might also look into the command-line tool XMLStarlet. It can be installed on macOS using Homebrew. I played with it for 15 minutes and here’s what I came up with for extracting data from your example:

xml sel -t -m '//field' -v 'name(*[1])' -o $'\t' -v '*[1]' -o $'\t' -v 'value' -o $'\n' example.xml

Output is:

key1    NAME    John Appleseed
key2    Company    Google

Documentation here: http://xmlstar.sourceforge.net/doc/UG/

Hope this helps.
-sam
