Nokogiri HTML parsing missing content


김영록

Jul 25, 2014, 1:41:33 AM
to nokogi...@googlegroups.com

Hello!

I'm learning how to program with Ruby and Nokogiri, and it is a lot of fun.
I recently ran into an issue that I can't figure out.

require 'net/http'

url = "http://www.skin79mall.com/Search/"
params = {'id' => '1','SearchTxt' => '', 'hdLayerView' => '40'}
uri = URI.parse(url)
uri.query = URI.encode_www_form(params)
response = Net::HTTP.get_response(uri)
puts response.body


From the above code, you can see that it will print out the body in HTML. What is interesting is that if you try to parse it with Nokogiri::HTML, it will not parse the entire HTML!
If you parse it with Nokogiri::XML instead, it seems to work, but the XML-parsed document doesn't seem to return xpath/css results the way an HTML-parsed one does.

Hope someone can teach me why this is.

Thank you in advance!

Hassan Schroeder

Jul 26, 2014, 12:23:29 PM
to nokogi...@googlegroups.com
On Thu, Jul 24, 2014 at 10:41 PM, 김영록 <yrok...@gmail.com> wrote:

> url = "http://www.skin79mall.com/Search/"
> params = {'id' => '1','SearchTxt' => '', 'hdLayerView' => '40'}
> uri = URI.parse(url)
> uri.query = URI.encode_www_form(params)
> response = Net::HTTP.get_response(uri)
> puts response.body

> From the above code, you can see that it will print out the body in HTML.
> What is interesting, is that if you try to parse it with Nokogiri::HTML, it
> will not parse the entire HTML!

What makes you think that?

--
Hassan Schroeder ------------------------ hassan.s...@gmail.com
http://about.me/hassanschroeder
twitter: @hassan

김영록

Jul 27, 2014, 7:04:27 AM
to nokogi...@googlegroups.com
What makes me think that the HTML is not parsed fully?

If that is your question, it's because when I print them out it is missing content. =]
Not sure if you meant otherwise!

Hassan Schroeder

Jul 27, 2014, 10:50:00 AM
to nokogi...@googlegroups.com
On Sun, Jul 27, 2014 at 4:04 AM, 김영록 <yrok...@gmail.com> wrote:
> What makes me think that the HTML is not parsed fully?
>
> If that is your question, it's because when I print them out it is missing
> content. =]

What is "them" and what exactly do you mean by "print them out"?

I ask because I tried your code and the results looked exactly as I'd
expect. There are no errors reported, and the document appears to
be complete.

So if you could explain better *exactly* what you're doing that isn't
giving you the results you expect it would help.

Young Rok Kim

Jul 27, 2014, 11:37:12 PM
to nokogi...@googlegroups.com
Ah, I see.

By "them" I mean my page variable, which is built in one of two ways:

params = {'id' => '1','SearchTxt' => '', 'hdLayerView' => '40'}
uri = URI.parse(url)
uri.query = URI.encode_www_form(params)
response = Net::HTTP.get_response(uri)
page = Nokogiri::HTML(response.body)


or 

params = {'id' => '1','SearchTxt' => '', 'hdLayerView' => '40'}
uri = URI.parse(url)
uri.query = URI.encode_www_form(params)
response = Net::HTTP.get_response(uri)
page = Nokogiri::XML(response.body)


I use the puts command to print the page and they are different: the HTML output is missing some content that the XML output has. I've attached the two outputs, both saved as HTML files.

You mentioned that the result was as you would expect, so maybe this is how it is supposed to be. Is there something I do not understand about the web page's structure, or about what Nokogiri parses as HTML versus XML?

Thanks for the help!






skin79 nokogiri parsed with HTML.html
skin79 nokogiri parsed with xml.html

Hassan Schroeder

Jul 28, 2014, 11:05:23 AM
to nokogi...@googlegroups.com
On Sun, Jul 27, 2014 at 8:36 PM, Young Rok Kim <yrok...@gmail.com> wrote:

> I use the puts command to print the page and they are different

OK, I totally don't understand what you're trying to do.

You have a site with nominally HTML5 (but invalid) markup - why
try to parse that as XML?

Either way, what's the point of printing the parsed representation?

Does the parse tree itself *not* contain any element you can see in
the raw markup? If so, what is it?

Young Rok Kim

Aug 3, 2014, 4:18:42 AM
to nokogi...@googlegroups.com
Hassan,

Sorry for the delay in response.
I wasn't trying to parse it as XML. I was trying to parse it with:

Nokogiri::HTML(open(url))

I'm just learning how to program, so maybe what I am doing above actually parses it as XML?

Either way, the point of printing the parsed representation was to show that Nokogiri was not returning a complete parsed result. My understanding is that it failed to parse that page. I then showed that with:

Nokogiri::XML(open(url))

it was able to parse the content of the page. So my question is essentially: why can this page be parsed with XML but not with HTML?

I tried to show this because Nokogiri has always been able to parse the pages I tried to parse, but it failed to do this for the URL provided. Therefore, the element I would like to access using XPath or CSS isn't available.

If you are trying to understand what I am trying to do, it's to parse a page using Nokogiri, and then use XPath or CSS to get to elements that contain data I need.

Thanks!


Hassan Schroeder

Aug 3, 2014, 11:06:59 AM
to nokogi...@googlegroups.com
On Sun, Aug 3, 2014 at 1:18 AM, Young Rok Kim <yrok...@gmail.com> wrote:

> My understanding is that it failed to parse that page

You've shown no basis for that "understanding".

> I tried to show this because Nokogiri has always been able to parse the
> pages I tried to parse, but it failed to do this for the URL provided.
> Therefore, the element I would like to access using XPath or CSS isn't
> available.
>
> If you are trying to understand what I am trying to do, it's to parse a page
> using Nokogiri, and then use XPath or CSS to get to elements that contain
> data I need.

And again: what element in the doc are you unable to get?

Showing us the code demonstrating a failure to find an element in
the document would be meaningful.

Young Rok Kim

Aug 4, 2014, 12:11:35 AM
to nokogi...@googlegroups.com
Hassan,

It seems like there is a communication difficulty, so let me simplify the question. I think I clearly stated the issue, but you don't seem to understand it, probably because I am new to programming and do not speak the language well.


If you visit the above url and inspect the element, you will find the search result. Here is a screenshot.


Inline image 1

Consider the two examples I provided in my original question about using XML and HTML.

page = Nokogiri::XML(open(url))

vs

page = Nokogiri::HTML(open(url))


When I print the variable page to see if the element I am trying to parse is there (in this case <div id="SearchbodyList"> ), XML seems to have that element while the HTML does not.

I am asking:

1. Is this supposed to be the case?
2. How can I parse that element using the Nokogiri::HTML method? 

I am trying to achieve #2 so that I can use the xpath and css selectors to access the elements within the page.

Thanks!


Hassan Schroeder

Aug 8, 2014, 12:36:23 PM
to nokogi...@googlegroups.com
Apologies for the delay in responding (traveling)

On Sun, Aug 3, 2014 at 9:11 PM, Young Rok Kim <yrok...@gmail.com> wrote:

> It seems like there is a communication difficulty, so let me simplify the
> question. I think I clearly stated the issue, but you don't seem to
> understand it, probably because I am new to programming and do not speak
> the language well.

No, I understand your issue, but  

> When I print the variable page to see if the element I am trying to parse
> is there (in this case <div id="SearchbodyList"> ), XML seems to have that
> element while the HTML does not

the point I'm trying to make is that "print the variable page" is not
an optimal way of approaching this. Use the appropriate Nokogiri
methods to examine the nodes of the parsed result (which is, as
you suspect, incomplete). 
 
> I am asking:
>
> 1. Is this supposed to be the case?
> 2. How can I parse that element using the Nokogiri::HTML method?

1) If you parse this page using XML in strict mode, it fails also.

2) If you examine the last node of the parsed HTML, you'll see that
    parsing stops on the *first non-ASCII character*, a good indication
    you're seeing an encoding problem.

    Look at the encoding of the page returned by this web server -- 
    address that and you'll see the whole page parsed in HTML mode. 
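
One way to look at the encoding, sketched with the standard library only. Both helper names are hypothetical, the sample bytes stand in for response.body, and "euc-kr" is just an example value; the thread never states which encoding the server actually uses. With a real response, the header value would come from response['Content-Type'].

```ruby
# Hypothetical helper: pull the charset out of a Content-Type header value.
def charset_from_content_type(header)
  header.to_s[/charset=["']?([^\s;"']+)/i, 1]
end

# Hypothetical helper: byte offset of the first non-ASCII byte, or nil.
def first_non_ascii_offset(body)
  body.bytes.index { |byte| byte > 0x7F }
end

# Sample bytes standing in for response.body (not the real page).
sample_body = "<title>\xC7\xD1\xB1\xDB</title>".b

offset = first_non_ascii_offset(sample_body)
declared = charset_from_content_type('text/html; charset=euc-kr')

puts "first non-ASCII byte at offset #{offset.inspect}"
puts "declared charset: #{declared.inspect}"

# UTF-8 is only one candidate to test the raw bytes against.
puts "valid as UTF-8? #{sample_body.dup.force_encoding('UTF-8').valid_encoding?}"
```

If the offset of the first non-ASCII byte lines up with where the parsed tree ends, and the bytes are not valid in the encoding the parser assumed, an encoding mismatch is the likely cause.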

HTH,

Young Rok Kim

Aug 11, 2014, 8:48:47 AM
to nokogi...@googlegroups.com
Hassan,

Not a problem.
I apologize for the confusion. Now I understand what was missing from my question.
I am obviously trying to get at product information. Printing the entire document was just to help show that Nokogiri::HTML wasn't parsing the entire document as it should, while Nokogiri::XML does.

You can see this by saving the parsed HTML data to one .html file and the parsed XML data to another.

The contents of the two documents will not be the same. When this page is parsed with HTML, the output ends at the tag:

    <meta http-equiv="Keywords" content="WizcozStore,">

However, when the page is parsed with XML, the printed output contains the information after that tag, including what I am after. It is what I would expect (the same as what I see when I inspect the page using Chrome or Firefox).

If it is an encoding issue, does that mean Nokogiri::HTML and Nokogiri::XML handle encodings differently?

How can I work around the issue and parse it with HTML?

I ask this for the sake of understanding the issue; my immediate problem has been solved, because I can use the XML-parsed data from Nokogiri to get at the elements I need.






Hassan Schroeder

Aug 11, 2014, 11:47:01 AM
to nokogi...@googlegroups.com
On Mon, Aug 11, 2014 at 5:48 AM, Young Rok Kim <yrok...@gmail.com> wrote:

> If it is an encoding issue, does that mean Nokogiri::HTML and Nokogiri::XML
> handle encodings differently?

As I said, in strict mode, XML parsing also fails.

The phrase "garbage in, garbage out" comes to mind :-)

> How can I work around the issue and parse it with HTML?

Example: https://gist.github.com/hassan/4bb163bfb9c61fcb241e