Extracting a substring with regex_replace fails

jean-christophe manciot

Feb 14, 2018, 10:10:35 AM
to Ansible Project
ansible 2.6.0 (devel da5cf72236) last updated 2018/02/14 14:29:49 (GMT +200)

The task seems obvious at first glance but it appears to be difficult to implement in this context.
The string is:
host_meta: "<XRD xmlns='http://docs.oasis-open.org/ns/xri/xrd-1.0'>\n    <Link rel='restconf' href='/restconf'/>\n</XRD>"

I need to extract the values of rel and href into api_release and api_root.
I unsuccessfully tried the following:
- set_fact: api_release="{{ host_meta | regex_replace('^.* rel=(.*) .*$', '\\1') }}"
  when: host_meta is defined

- set_fact: api_root="{{ host_meta | regex_replace('^.* href=(.*)/>.*$', '\\1') }}"
  when: host_meta is defined

Both variables contain the whole string instead of the corresponding substring, which should be:
api_release: 'restconf'
api_root: /restconf

I have already successfully used this filter in other contexts. 
What am I missing here? Is the filter confused by the string?

jean-christophe manciot

Feb 14, 2018, 10:13:05 AM
to Ansible Project


I have the same difficulty with:
ansible 2.4.3.0 (detached HEAD 8a7f9beab7) last updated 2018/02/14 16:11:16 (GMT +200)
 

Felix Fontein

Feb 14, 2018, 10:37:04 AM
to ansible...@googlegroups.com
Hi,
You are missing the newlines the string contains. regex_replace uses
re.sub() in Python and does not offer a way to set the MULTILINE flag.
If you remove '^' and '$' from your regexps, you can see what happens:
only the matching part between two '\n's is replaced.

Instead of regex_replace() you should use the regex_search() filter;
that works as expected:

- set_fact: api_release="{{ host_meta | regex_search('rel=(.*)') }}"
  when: host_meta is defined

- set_fact: api_root="{{ host_meta | regex_search('href=(.*)/>') }}"
  when: host_meta is defined

Yields:

"api_release": "rel='restconf' href='/restconf'/>"
"api_root": "href='/restconf'/>"

(That's not exactly equal to what you expected, but that is what your
original regexes would have returned if the multiline flag had been
set. The problem is that regexes are notoriously bad for matching
XML and/or HTML.)
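
To make the behaviour above easier to reproduce outside a playbook, here is a minimal Python sketch. It assumes that regex_replace and regex_search behave like plain re.sub() and re.search() on the same string; the last two patterns (with [^']*) are only a suggestion and were not part of the original post.

```python
import re

host_meta = (
    "<XRD xmlns='http://docs.oasis-open.org/ns/xri/xrd-1.0'>\n"
    "    <Link rel='restconf' href='/restconf'/>\n"
    "</XRD>"
)

# Original pattern: '.' cannot cross '\n', so the anchored pattern never
# matches and re.sub() hands back the input unchanged.
unchanged = re.sub(r"^.* rel=(.*) .*$", r"\1", host_meta)

# Without '^' and '$' only the middle line matches and gets replaced.
middle_only = re.sub(r".* rel=(.*) .*", r"\1", host_meta)

# re.search() returns just the matched text instead of rewriting the string.
found = re.search(r"rel=(.*)", host_meta).group(0)

# Suggested narrower patterns: capture only what sits between the quotes.
release = re.search(r"rel='([^']*)'", host_meta).group(1)
root = re.search(r"href='([^']*)'", host_meta).group(1)

print(repr(unchanged == host_meta), repr(middle_only), found, release, root)
```

The narrower patterns avoid the greedy (.*) running to the end of the line, which is why the regex_search results above still contain the trailing attributes.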

Cheers,
Felix





--
Felix Fontein -- fe...@fontein.de -- http://felix.fontein.de/

Felix Fontein

Feb 14, 2018, 10:38:46 AM
to ansible...@googlegroups.com
Hi,

instead of using regexes, you might want to use the XML module
(https://docs.ansible.com/ansible/2.4/xml_module.html) with the
xmlstring argument. That should allow you to do this in a much cleaner
way.

Cheers,
Felix

Kai Stian Olstad

Feb 14, 2018, 1:00:11 PM
to ansible...@googlegroups.com
On Wednesday, 14 February 2018 16.36.33 CET 'Felix Fontein' via Ansible Project wrote:
> > The task seems obvious at first glance but it appears to be difficult
> > to implement in this context.
> > The string is:
> > host_meta: "<XRD
> > xmlns='http://docs.oasis-open.org/ns/xri/xrd-1.0'>\n <Link
> > rel='restconf' href='/restconf'/>\n</XRD>"
> >
> > I need to extract the values of rel and href into api_release and
> > api_root. I unsuccessfully tried the following:
> > - set_fact: api_release="{{ host_meta | regex_replace('^.*
> > rel=(.*) .*$', '\\1') }}"
> > when: host_meta is defined
> >
> > - set_fact: api_root="{{ host_meta | regex_replace('^.*
> > href=(.*)/>.*$', '\\1') }}"
> > when: host_meta is defined
>
> You are missing the newlines the string contains. regex_replace uses
> re.sub() in Python and does not offer a way to set the MULTILINE flag.
> If you remove '^' and '$' from your regexps, you can see what happens:
> only the matching part between two '\n's is replaced.


It's possible to turn on multiline, but the dot doesn't match newline, so the DOTALL flag needs to be set too.

This can be done with (?ms): "m" is MULTILINE and "s" is DOTALL.

So this should work:

- set_fact:
    api_release: "{{ host_meta | regex_replace('(?ms)^.* rel=(.*) .*$', '\\1') }}"
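
As a quick check outside Ansible, the same inline flags can be tried directly with Python's re.sub(). This is only a sketch, assuming the filter passes the pattern through to re.sub() unchanged; note that the captured values keep their surrounding quotes.

```python
import re

host_meta = (
    "<XRD xmlns='http://docs.oasis-open.org/ns/xri/xrd-1.0'>\n"
    "    <Link rel='restconf' href='/restconf'/>\n"
    "</XRD>"
)

# (?ms) turns on MULTILINE and DOTALL, so '.*' can run across the newlines
# and the whole string is replaced by the first group.
api_release = re.sub(r"(?ms)^.* rel=(.*) .*$", r"\1", host_meta)
api_root = re.sub(r"(?ms)^.* href=(.*)/>.*$", r"\1", host_meta)

# The same thing spelled with explicit flags instead of inline ones.
same_release = re.sub(r"^.* rel=(.*) .*$", r"\1", host_meta,
                      flags=re.MULTILINE | re.DOTALL)

print(api_release, api_root)

# The quotes are part of the capture; strip them if they are not wanted.
print(api_release.strip("'"), api_root.strip("'"))
```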


--
Kai Stian Olstad

jean-christophe manciot

Feb 15, 2018, 8:48:51 AM
to Ansible Project
@Kai Stian Olstad
Thanks for your answer: yes, it works.

@Felix Fontein
I have no XML skills. I nevertheless took a look at the ansible xml module, and it's obscure to me.
I think you're right though: it'd be better to read XML appropriately with that module rather than use some regex filter which can be easily defeated if the input changes in the future.
So I tried something, but of course it does not work (api_root_filename is the file containing the xml multi-line string):

- name: Reading RESTconf release 
  xml:
        attribute: rel
        content: attribute
        path: "{{ api_root_filename }}"
        xpath: /XRD/Link
  register: return_restconf_release

- name: Showing attribute value
  debug:
        var: return_restconf_release.matches[0].Link.rel

Felix Fontein

Feb 15, 2018, 9:44:10 AM
to ansible...@googlegroups.com
Hi,
The problem here is that your XML uses a namespace. I'm not very
familiar with xpaths either, but after a little googling I came up with
this:

- name: Reading RESTconf release
  xml:
    attribute: rel
    content: attribute
    path: "{{ api_root_filename }}"
    xpath: "/*[name()='XRD']/*[name()='Link']"
  register: return_restconf_release

- name: Showing attribute value
  debug:
    var: return_restconf_release.matches[0]['{http://docs.oasis-open.org/ns/xri/xrd-1.0}Link']

This gives for me:

ok: [localhost] => {
    "failed": false,
    "return_restconf_release.matches[0]['{http://docs.oasis-open.org/ns/xri/xrd-1.0}Link']": {
        "href": "/restconf",
        "rel": "restconf"
    }
}
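
For anyone wondering where the '{http://...}Link' key comes from, here is a small Python sketch using the standard library's xml.etree.ElementTree (not the lxml library the xml module relies on, so take it as an illustration of the namespace effect rather than of the module itself). The prefix name "xrd" is an arbitrary choice for this example.

```python
import xml.etree.ElementTree as ET

host_meta = (
    "<XRD xmlns='http://docs.oasis-open.org/ns/xri/xrd-1.0'>\n"
    "    <Link rel='restconf' href='/restconf'/>\n"
    "</XRD>"
)

root = ET.fromstring(host_meta)

# The default namespace becomes part of every tag name.
print(root.tag)

# A plain, unqualified lookup therefore finds nothing.
plain = root.find("Link")

# Mapping a prefix (hypothetical name "xrd") to the namespace URI works.
ns = {"xrd": "http://docs.oasis-open.org/ns/xri/xrd-1.0"}
link = root.find("xrd:Link", ns)

api_release = link.get("rel")
api_root = link.get("href")
print(plain, api_release, api_root)
```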

Cheers,
Felix


jean-christophe manciot

Feb 16, 2018, 5:40:45 AM
to Ansible Project
You made it, bravo!
However, this solution is no better than the regex search, because the whole point of using the xml module was to make the task resilient to change. As soon as the namespace URI {http://docs.oasis-open.org/ns/xri/xrd-1.0} changes, the lookup of both variables will fail.
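
One possible way around the hard-coded URI is to compare only the local part of each tag name. The following is a rough sketch in plain Python with xml.etree.ElementTree, not something the xml module does by itself; local_name and link_attributes are hypothetical helper names made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a leading '{namespace-uri}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def link_attributes(xml_text):
    """Return the attributes of the first <Link> element, whatever its namespace."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == "Link":
            return dict(element.attrib)
    return None


host_meta = (
    "<XRD xmlns='http://docs.oasis-open.org/ns/xri/xrd-1.0'>\n"
    "    <Link rel='restconf' href='/restconf'/>\n"
    "</XRD>"
)

attrs = link_attributes(host_meta)
print(attrs["rel"], attrs["href"])
```

Because only the local name is compared, the lookup keeps working if the namespace URI changes or disappears, at the cost of also matching a Link element from an unrelated namespace.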