[newbie] Trying to pull items out of an XML file.

leam hall

Mar 16, 2015, 7:21:32 AM
to nokogi...@googlegroups.com
I have an xml doc with a lot of elements like <title> and such that I am trying to organize outside of XML.

The main section of the document looks something like this, with lots of Groups:

  <Group id="V-112233">
    <title> ... </title>
    <description> .... </description>
    <Rule id="SV-345">
      <version> ... </version>
      <title> .... </title>
      <description> .... </description>
      <reference>
        <dc:title> .... </dc:title>
        <dc:type> .... </dc:type>
      </reference>
     </Rule>
  </Group>

What I am trying to do is pull out the different parts and associate them with the Group. What I am getting is pretty much nothing. Here's the code I have so far:

###

#!/opt/puppet/bin/ruby

require 'rubygems'
require 'nokogiri'

infile = 'my_cool_file'
f = File.open(infile)
doc = Nokogiri::XML(f)

doc.xpath('//title').each do |title|
  puts title.class
  puts title
end

f.close

###

I can have the script read and print the infile, but it doesn't seem to grab any of the node information. My guess is that it's a simple error, but the XPath tutorial didn't seem to show how to actually grab and store or print the results.

Hassan Schroeder

Mar 16, 2015, 8:25:38 AM
to nokogi...@googlegroups.com
On Fri, Mar 13, 2015 at 10:57 AM, leam hall <leam...@gmail.com> wrote:

> I can have the script read and print the infile but it doesn't seem to grab
> any of the node information. My guess is that it's a simple error but the
> XPath tutorial didn't seem to show how to actually grab and store/print
> stuff.

Not sure exactly which tutorial you're referring to but ...

2.1.5 (main):0 > doc = Nokogiri.XML("<x><group id='1'><title>Something</title></group><group id='2'><title>Other</title></group></x>")
=> <?xml version="1.0"?>
<x>
<group id="1">
<title>Something</title>
</group>
<group id="2">
<title>Other</title>
</group>
</x>
2.1.5 (main):0 > doc.xpath("//group").each do |group|
2.1.5 (main):0 * puts "#{group.attributes['id'].value}: #{group.xpath('title').text}"
2.1.5 (main):0 * end
1: Something
2: Other
=> 0

HTH,
--
Hassan Schroeder ------------------------ hassan.s...@gmail.com
http://about.me/hassanschroeder
twitter: @hassan
Consulting Availability : Silicon Valley or remote

leam hall

Mar 19, 2015, 2:03:13 PM
to nokogi...@googlegroups.com
Hassan, thanks!

I'm working on the next issue, trying to access sub-elements. For example, in the below I want to find Group => Rule => title, which is different than Group => title. I'm trying to reference it but not successful yet. Suggestions?

####

        <Group id="V-38439">
                <title>SRG-OS-000001</title>
                <description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt;</description>
                <Rule id="SV-50239r1_rule" severity="medium" weight="10.0">
                        <version>RHEL-06-000524</version>
                        <title>The system must provide automated support for account management functions.</title>
                        <description>VulnDiscussion</description>
                        <reference>
                                <dc:title>DPMS Target Red Hat 6</dc:title>
                                <dc:publisher>Someone</dc:publisher>
                                <dc:type>DPMS Target</dc:type>
                                <dc:subject>Red Hat 6</dc:subject>
                                <dc:identifier>2367</dc:identifier>
                        </reference>
                        <ident system="http://server.example.com/cci">CCI-000015</ident>
                        <fixtext fixref="F-43384r1_fix">Implement an automated system for managing user accounts.</fixtext>
                        <fix id="F-43384r1_fix"/>
                        <check system="C-45994r1_chk">
                                <check-content-ref name="M" href="DPMS_XCCDF_Benchmark_RHEL_6_STIG.xml"/>
                                <check-content>Interview the SA.
                                If not, this is a finding.</check-content>
                        </check>
                </Rule>
        </Group>

leam hall

Mar 19, 2015, 4:22:02 PM
to nokogi...@googlegroups.com
Small step forward, then it blows up again. The "Rule/title" thing is solved with the following:

 puts "\t#{group.xpath('Rule/title').text}"

It does a few and then blows up:

./open_stig.rb:39:in `write': Broken pipe - <STDOUT> (Errno::EPIPE)
        from ./open_stig.rb:39:in `puts'
        from ./open_stig.rb:39:in `puts'
        from ./open_stig.rb:39:in `block in <main>'
        from /opt/puppet/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.10/lib/nokogiri/xml/node_set.rb:237:in `block in each'
        from /opt/puppet/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.10/lib/nokogiri/xml/node_set.rb:236:in `upto'
        from /opt/puppet/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.10/lib/nokogiri/xml/node_set.rb:236:in `each'
        from ./open_stig.rb:35:in `<main>'

 

leam hall

Mar 19, 2015, 4:25:09 PM
to nokogi...@googlegroups.com
On the other hand, maybe never mind. It seems that the "head" part of
the command line was the culprit. Neat!
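
For anyone hitting the same Errno::EPIPE: when the output is piped into something like `head`, the reader exits after a few lines and the next write to STDOUT fails. One possible way to exit quietly is to rescue that error around the printing loop. This is a sketch under assumptions: `print_lines` is a hypothetical helper, not something from open_stig.rb.

```ruby
# Hypothetical helper: print each line to `out`, stopping quietly if the
# reader has gone away. Returns the number of lines actually written.
def print_lines(lines, out = $stdout)
  written = 0
  lines.each do |line|
    out.puts line
    written += 1
  end
  written
rescue Errno::EPIPE
  # The consumer (for example `head`) closed the pipe; treat that as done.
  written
end

print_lines(["first title", "second title", "third title"])
```

Whether swallowing the error is appropriate depends on the script; for a report that is routinely piped through `head` or `less` it is usually harmless.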

--
Mind on a Mission

Hassan Schroeder

Mar 19, 2015, 4:28:47 PM
to nokogi...@googlegroups.com
On Thu, Mar 19, 2015 at 11:03 AM, leam hall <leam...@gmail.com> wrote:

> I'm working on the next issue, trying to access sub-elements. For example,
> in the below I want to find Group => Rule => title, which is different than
> Group => title. I'm trying to reference it but not successful yet.

Using the above example, either of these:

2.2.1 (main):0 > doc.xpath('//Group').xpath('Rule').xpath('title').text
=> "The system must provide automated support for account management functions."
2.2.1 (main):0 > doc.xpath('//Group/Rule').xpath('title').text
=> "The system must provide automated support for account management functions."
2.2.1 (main):0 >

:: will do it.

leam hall

Mar 23, 2015, 10:33:45 AM
to nokogi...@googlegroups.com
Hassan, thanks! I'm making progress on this and can sort the Rules and Titles. I have another question, though, as my script only works if I change the first three lines. Otherwise it just runs but outputs nothing.

####
Bad lines:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='STIG_unclass.xsl'?>
<Benchmark xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cpe="http://cpe.mitre.org/language/2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" id="RHEL_5_STIG" xml:lang="en" xsi:schemaLocation="http://checklists.nist.gov/xccdf/1.1 http://nvd.nist.gov/schema/xccdf-1.1.4.xsd http://cpe.mitre.org/dictionary/2.0 http://cpe.mitre.org/files/cpe-dictionary_2.1.xsd" xmlns="http://checklists.nist.gov/xccdf/1.1">

####
Script works if I change those three to just:

<Benchmark>
####

Is there a way to handle those first three lines so we don't have to manually edit the XML file before processing it? Do I just need to remove the first two and then shorten the third? Or is there a Nokogiri way to handle it?

Thanks!

Leam

Hassan Schroeder

Mar 23, 2015, 1:23:05 PM
to nokogi...@googlegroups.com
On Mon, Mar 23, 2015 at 7:33 AM, leam hall <leam...@gmail.com> wrote:

> Is there a way to handle those first three lines so we don't have to
> manually edit the XML file before processing it? Do I just need to remove
> the first two and then shorten the third? Or is there a Nokogiri way to
> handle it?

It shouldn't be necessary to delete or alter any valid XML to achieve
your goal.

Can you gist a complete sample XML doc and your current code so
we can run it ourselves?

leam hall

Mar 23, 2015, 2:00:55 PM
to nokogi...@googlegroups.com
On Mon, Mar 23, 2015 at 1:23 PM, Hassan Schroeder
<hassan.s...@gmail.com> wrote:

> Can you gist a complete sample XML doc and your current code so
> we can run it ourselves?


Hehe...the two target files total ~35k lines. Here's the git repo:

https://github.com/LeamHall/SecComFrame

The script is open_stig.rb, and the sample_stig.xml file is a very
short version. It has the header lines I was talking about.

Let me know if that gives you what you need.

Thanks!

Leam

leam hall

Mar 23, 2015, 4:16:00 PM
to nokogi...@googlegroups.com
I've updated the code based on recommendations from ruby-talk.

Still should perform the same.

Leam

Hassan Schroeder

Mar 23, 2015, 5:10:35 PM
to nokogi...@googlegroups.com
On Mon, Mar 23, 2015 at 11:00 AM, leam hall <leam...@gmail.com> wrote:

> The script is open_stig.rb, and the sample_stig.xml file is a very
> short version. It has the header lines I was talking about.

Like I said, valid XML :-)

Try running it through e.g. http://www.xmlvalidation.com/

leam hall

Mar 25, 2015, 6:03:15 AM
to nokogi...@googlegroups.com
That extra space was in my cut and paste. The actual XML doesn't have it, so the problem is still in the first three lines.

Thanks!

Leam 

Hassan Schroeder

Mar 25, 2015, 8:03:04 AM
to nokogi...@googlegroups.com
On Wed, Mar 25, 2015 at 3:03 AM, leam hall <leam...@gmail.com> wrote:

>> Like I said, valid XML :-)

> That extra space was in my cut and paste. The actual XML doesn't have it, so
> the problem is still in the first three lines.

There's more than one problem in that example file.

Can you provide a valid well-formed sample?

leam hall

Mar 25, 2015, 9:30:07 AM
to nokogi...@googlegroups.com
On Wed, Mar 25, 2015 at 8:03 AM, Hassan Schroeder
<hassan.s...@gmail.com> wrote:
> On Wed, Mar 25, 2015 at 3:03 AM, leam hall <leam...@gmail.com> wrote:
>
>>> Like I said, valid XML :-)
>
>> That extra space was in my cut and paste. The actual XML doesn't have it, so
>> the problem is still in the first three lines.
>
> There's more than one problem in that example file.
>
> Can you provide a valid well-formed sample?
>
> --
> Hassan Schroeder ------------------------ hassan.s...@gmail.com

I'm digesting someone else's XML, for example:

https://github.com/LeamHall/SecComFrame/tree/master/tools/U_RedHat_5_V1R9_Manual-xccdf.xml

Not sure they are providing well formed XML. That's the question I'm
trying to resolve; is it my code or their file?

Hassan Schroeder

Mar 25, 2015, 12:00:17 PM
to nokogi...@googlegroups.com
On Wed, Mar 25, 2015 at 6:30 AM, leam hall <leam...@gmail.com> wrote:

> I'm digesting someone else's XML, for example:
>
> https://github.com/LeamHall/SecComFrame/tree/master/tools/U_RedHat_5_V1R9_Manual-xccdf.xml
>
> Not sure they are providing well formed XML. That's the question I'm
> trying to resolve; is it my code or their file?

Ah, I see two issues:

1) One of those URLs in the xsi:schemaLocation is returning a 301
redirect, which is causing the document to be non-well-formed.

2) Once you fix that, you need to reference your targets as e.g.

2.2.1 (main):0 > doc.xpath("//xmlns:Group").count
=> 571

HTH,
--
Hassan Schroeder ------------------------ hassan.s...@gmail.com