Parsing XML with no closing tags

Jonathan Pan

Feb 25, 2017, 3:38:36 PM
to nokogiri-talk
I'm really sorry if I'm asking too much; this is my third post in a row. I'm doing a student project for my software engineering course, and I've been experimenting with parsing multiple sites. Each site is different and has its own set of issues.

My current issue is that I'm parsing an XML page of courses from my university.


It has no closing tags, and I'm not sure if that's the source of all my errors.  

1. My current error comes from calling xpath(".//parastyle:course"). I get:

C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/nokogiri-1.7.0-x64-mingw32/lib/nokogiri/xml/searchable.rb:165:in `evaluate': Undefined namespace prefix: .//parastyle:course (Nokogiri::XML::XPath::SyntaxError)
from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/nokogiri-1.7.0-x64-mingw32/lib/nokogiri/xml/searchable.rb:165:in `block in xpath'
from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/nokogiri-1.7.0-x64-mingw32/lib/nokogiri/xml/searchable.rb:156:in `map'
from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/nokogiri-1.7.0-x64-mingw32/lib/nokogiri/xml/searchable.rb:156:in `xpath'
from C:/Users/JP/Desktop/Project/lib/scraper.rb:25:in `<main>'
[Finished in 2.2s with exit code 1]

My code:

  doc = Nokogiri::XML(open(courses_url)) do |config|
    config.huge
  end

  doc.xpath(".//parastyle:course").each do |node|
    puts node
  end

2. My second error is that the output of doc has a huge block of closing tags that Nokogiri inserts at the end. I see there's a method for no_empty_tags, but that's only for nodes. I'm not sure how to use it when I can't xpath each node.

Thanks guys, this group has been very helpful.


Aaron Patterson

Feb 25, 2017, 4:24:40 PM
to nokogi...@googlegroups.com
On Sat, Feb 25, 2017 at 12:38 PM Jonathan Pan <jpa...@gmail.com> wrote:
> I'm really sorry if I'm asking too much; this is my third post in a row. I'm doing a student project for my software engineering course, and I've been experimenting with parsing multiple sites. Each site is different and has its own set of issues.
>
> My current issue is that I'm parsing an XML page of courses from my university.
>
> It has no closing tags, and I'm not sure if that's the source of all my errors.

Not only is it missing closing tags, it's also missing namespace declarations.  This is definitely not valid XML.

> 1. My current error comes from calling xpath(".//parastyle:course"). I get:
>
> C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/nokogiri-1.7.0-x64-mingw32/lib/nokogiri/xml/searchable.rb:165:in `evaluate': Undefined namespace prefix: .//parastyle:course (Nokogiri::XML::XPath::SyntaxError)
> from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/nokogiri-1.7.0-x64-mingw32/lib/nokogiri/xml/searchable.rb:165:in `block in xpath'
> from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/nokogiri-1.7.0-x64-mingw32/lib/nokogiri/xml/searchable.rb:156:in `map'
> from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/nokogiri-1.7.0-x64-mingw32/lib/nokogiri/xml/searchable.rb:156:in `xpath'
> from C:/Users/JP/Desktop/Project/lib/scraper.rb:25:in `<main>'
> [Finished in 2.2s with exit code 1]
>
> My code:
>
>   doc = Nokogiri::XML(open(courses_url)) do |config|
>     config.huge
>   end
>
>   doc.xpath(".//parastyle:course").each do |node|
>     puts node
>   end
>
> 2. My second error is that the output of doc has a huge block of closing tags that Nokogiri inserts at the end. I see there's a method for no_empty_tags, but that's only for nodes. I'm not sure how to use it when I can't xpath each node.

I suggest reading up on XPath and namespaces, but give this a try:

  doc.xpath("//*[name()='ParaStyle:Course']")

Hope that helps.

Jonathan Pan

Feb 25, 2017, 4:35:36 PM
to nokogiri-talk
Thanks, that does work, but for some reason the loop never exits. It seems to loop back to the start when it reaches the end.

  doc = Nokogiri::XML(open(courses_url)) do |config|
    config.huge
  end

  courses = []

  doc.xpath("//*[name()='ParaStyle:Course']").each do |node|
    courses.push(node)
  end

  puts courses

The XML I'm scraping is only 6,000 lines, so I terminated my build once the output went over 440,000 lines. I scrolled through it and it was repeating.