Q: Nokogiri::XML::Reader with a stream from network

Jonathan Rochkind

Jun 22, 2018, 4:36:22 AM
to nokogiri-talk
I'd like to use Nokogiri::XML::Reader to read/parse a stream from an HTTP response. Ideally without reading the whole thing into memory, because that's how Nokogiri::XML::Reader rolls, right?

I am not sure how to do this. 

It has a .from_io API, which works with Ruby File objects at least... but streaming HTTP clients generally don't give you one of those. I can turn what they give you into whatever Nokogiri::XML::Reader wants, but I'm not sure what that is.

What's the duck type? Just an #each_line method?  Does it really have to be a newline-terminated _line_? 

Thanks for any advice, 

Jonathan

Mike Dalessio

Jun 24, 2018, 4:39:15 AM
to nokogiri-talk
Hey Jonathan,

Thanks for asking this question, it's not obvious how to do this in Ruby.

The OpenURI library provides an IO object for HTTP responses. You can try using it like this:

#! /usr/bin/env ruby

require "nokogiri"
require "open-uri"

# URI.open is the open-uri entry point; bare open(url) stopped working in Ruby 3.0
URI.open("https://yahoo.com/") do |io|
  reader = Nokogiri::XML::Reader.from_io io
  reader.each do |node|
    puts node.name
  end
end

though I'll note that libxml2 doesn't provide HTML Reader functionality, so you may end up with warts between the XML parser and the HTML content.

Another alternative is to use non-blocking reads and Nokogiri::HTML::SAX::PushParser, if you feel like writing the callback handler for it.

I hope this response helps?


--
You received this message because you are subscribed to the Google Groups "nokogiri-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nokogiri-tal...@googlegroups.com.
To post to this group, send email to nokogi...@googlegroups.com.
Visit this group at https://groups.google.com/group/nokogiri-talk.
For more options, visit https://groups.google.com/d/optout.

Jonathan Rochkind

Jun 24, 2018, 10:54:18 AM
to nokogi...@googlegroups.com
Can you say anything about what API is expected on the "io" object? Or how would we figure this out? I'm getting stymied by my difficulty reading C. 

In some experimentation, it _seems_ like perhaps all Nokogiri::XML::Reader expects from its `io` argument is a `read(num_bytes)` method. Does this seem plausible? Would there be a way to confirm it? Could it be documented as API?
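One way to check this empirically is a recording proxy: wrap a known-good IO, hand the wrapper to the reader, and look at which methods were called. This is only a sketch; CallSpy and __calls are hypothetical names, not part of Nokogiri, and the result tells you what the installed version does, not what is guaranteed as API.

```ruby
require "stringio"

# Hypothetical helper (not part of Nokogiri): wraps any object, forwards
# every call to it, and records the method names in call order.
class CallSpy < BasicObject
  def initialize(target)
    @target = target
    @calls = []
  end

  # Method names seen so far, in call order.
  def __calls
    @calls
  end

  # Answer capability checks on behalf of the wrapped object.
  def respond_to?(name, include_all = false)
    @target.respond_to?(name, include_all)
  end

  # Everything else lands here, gets recorded, and is passed through.
  def method_missing(name, *args, &block)
    @calls << name
    @target.__send__(name, *args, &block)
  end
end

spy = CallSpy.new(StringIO.new("<root/>"))
spy.read(4)     # forwarded to StringIO#read
p spy.__calls   # prints [:read]
```

Passing such a spy to Nokogiri::XML::Reader.from_io, iterating the reader to the end, and then printing spy.__calls.uniq should list every method the reader invoked on it.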

Experimentation also reveals that, while Nokogiri::XML::Reader can tolerate `read(some_num_bytes)` returning fewer bytes than requested (I think this is not unusual for #read on Ruby IO-like objects?), it can _not_ tolerate the call returning 0 bytes (the empty string). I found this out when trying to use it with http.rb, wrapped in a little shim that exposes http.rb's #readpartial method as the #read method that Nokogiri::XML::Reader seems to want. http.rb sometimes returns 0 bytes (which may or may not be a bug), and this caused Nokogiri::XML::Reader to raise.
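A shim along these lines might look like the sketch below. It assumes the source responds to a no-argument #readpartial that returns nil at end of stream (which is how I understand http.rb's response body to behave), and that the reader only ever calls read(n). ChunkReadAdapter is a made-up name. Empty chunks are skipped, and leftover bytes are buffered so that read never returns more than was asked for.

```ruby
# Hypothetical adapter: exposes #read(length) on top of a chunk source.
class ChunkReadAdapter
  def initialize(source)
    @source = source
    @buffer = String.new(encoding: Encoding::BINARY)
    @eof = false
  end

  # Returns between 1 and +length+ bytes, or nil at end of stream.
  # Never returns the empty string for a positive +length+.
  def read(length)
    fill_buffer while @buffer.empty? && !@eof
    return nil if @buffer.empty?

    @buffer.slice!(0, length)
  end

  private

  # Pulls one chunk from the source; an empty chunk leaves the buffer
  # empty, so the loop in #read simply asks again.
  def fill_buffer
    chunk = @source.readpartial # assumption: no-arg, nil at end of stream
    if chunk.nil?
      @eof = true
    else
      @buffer << chunk.b
    end
  end
end
```

With http.rb this would be used roughly as Nokogiri::XML::Reader.from_io(ChunkReadAdapter.new(response.body)); for a source whose #readpartial takes a size argument or raises EOFError, fill_buffer would need adjusting.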

Whatever open-uri returns may or may not work (without weird edge-case bugs waiting?) with Nokogiri::XML::Reader. But I don't particularly want to use open-uri; I don't find it to be a sufficiently flexible library.

What I really want to know is what API nokogiri expects on its `io` objects, so I can figure out the best way for my needs/context to provide an `io` object with the expected API. Any thoughts?
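One option that sidesteps the duck-type question is to hand over a real IO: run the HTTP client in a background thread, write each chunk it yields into a pipe, and give the read end to the parser. A minimal sketch, with io_from_chunks as a hypothetical helper and hard-coded chunks standing in for a client callback (for example Net::HTTP's read_body with a block):

```ruby
# Hypothetical helper: runs a chunk producer in a background thread and
# returns [read_end, thread]. The read end is a genuine IO object.
def io_from_chunks(&producer)
  reader, writer = IO.pipe
  reader.binmode
  writer.binmode

  thread = Thread.new do
    begin
      # The producer receives a callable and invokes it once per chunk.
      producer.call(->(chunk) { writer.write(chunk) })
    ensure
      writer.close # closing the write end signals EOF to the reader
    end
  end

  [reader, thread]
end

reader, thread = io_from_chunks do |emit|
  # Stand-in for a streaming HTTP client's per-chunk callback.
  ["<root>", "<item/>", "</root>"].each { |chunk| emit.call(chunk) }
end

xml = reader.read # a parser would consume `reader` here instead
thread.join       # re-raises any error from the producer thread
reader.close
puts xml
```

The pipe gives backpressure for free: once its buffer fills, the producer thread blocks until the consumer catches up, so the whole response is never held in memory at once.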

Jonathan Rochkind

Jun 24, 2018, 10:55:51 AM
to nokogi...@googlegroups.com
Oh, and for what it's worth, I am not working with HTML here, so lack of HTML pull-parsing is not an issue. I am working with XML over the network, not HTML. 

Руслан Корнев

Jul 13, 2020, 9:32:48 PM
to nokogiri-talk
Found these solutions
https://gist.github.com/kmile/827475/e2459379c76dc181ec2b3bb723810cc2b4753e43

On Friday, June 22, 2018 at 11:36:22 UTC+3, Jonathan Rochkind wrote: