Errors from removed tags

Anatol Broder

Mar 8, 2015, 11:21:37 PM
to nokogi...@googlegroups.com
I want to do basic validation of HTML files using the parsing errors
reported by Nokogiri. Some tags are always invalid, so I remove them. But
the errors are still there after I remove the corresponding invalid tag.
This code shows where I’m stuck.

```ruby
require "nokogiri"

doc = Nokogiri::HTML "<svg xmlns=http://www.w3.org/2000/svg />"

puts "Document: #{doc}", "Errors: #{doc.errors}"
# Document: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html><body><svg xmlns="http://www.w3.org/2000/svg"></svg></body></html>
# Errors: [#<Nokogiri::XML::SyntaxError: Tag svg invalid>]

doc.xpath("//svg").each(&:unlink)

puts "Document: #{doc}", "Errors: #{doc.errors}"
# Document: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html><body></body></html>
# Errors: [#<Nokogiri::XML::SyntaxError: Tag svg invalid>]
```

The best solution I can figure out is to reparse the modified document.

```ruby
require "nokogiri"
doc = Nokogiri::HTML "<svg xmlns=http://www.w3.org/2000/svg />"
doc.xpath("//svg").each(&:unlink)
doc = Nokogiri::HTML doc.to_html
puts "Document: #{doc}", "Errors: #{doc.errors}"
# Document: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html><body></body></html>
# Errors: []
```

How would you get errors from the modified document? What is the most
efficient way to do it?
