Nokogiri converts & to &amp; in query strings


Shivasubramanian A

Nov 18, 2015, 3:49:01 AM
to nokogiri-talk
Hi,

When using this code sample,
Nokogiri::HTML.fragment('<a href="http://localhost:8080?a=b&c=d">Click here</a>')

we find that Nokogiri converts it into
<a href="http://localhost:8080?a=b&amp;c=d">Click here</a>

The & in the query string of the href is converted into &amp;. Is there a way to avoid this, or is this a bug? The impact is that browsers treat the link as having two query parameters, named a and amp;c. Clearly not what we want!!

Apologies if this has been posted somewhere (we seem to have missed it!)

Regards,
Shivasubramanian A

Robert Klemme

Nov 18, 2015, 9:54:15 AM
to nokogiri-talk


On Wednesday, November 18, 2015 at 9:49:01 AM UTC+1, Shivasubramanian A wrote:
> When using this code sample,
> Nokogiri::HTML.fragment('<a href="http://localhost:8080?a=b&c=d">Click here</a>')

I think your input is wrong. It should be like this:

Nokogiri::HTML.fragment('<a href="http://localhost:8080?a=b%26c=d">Click here</a>')

Cheers

robert

Shivasubramanian A

Nov 19, 2015, 1:22:01 AM
to nokogiri-talk
Hello Robert,

Thanks for the reply.

OK, I should have explained our business need earlier, but here it is: we are passed a large HTML string, which also contains such anchor tags. We use Nokogiri to parse this string, then search for specific non-<a> tags and replace them with other content. This results in the <a> tags being changed as shown above. We then render the modified HTML string in the browser.

If we were to use the method you specified, then once we parse the HTML, the URL will be http://localhost:8080?a=b%26c=d, which means that the value of the 'a' parameter is b&c=d, which is not what we want.
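The difference between a literal & and %26 in a query string can be checked with Ruby's standard library alone. This is a minimal sketch that only parses the query portion; the variable names are made up for illustration, and it assumes a reasonably recent Ruby where URI.decode_www_form splits on & by default.

```ruby
require 'uri'

# A literal "&" separates parameters: two key/value pairs.
separated = URI.decode_www_form('a=b&c=d')

# "%26" is a percent-encoded ampersand: it is data, not a separator,
# so everything after the first "=" becomes the value of "a".
encoded = URI.decode_www_form('a=b%26c=d')

p separated # [["a", "b"], ["c", "d"]]
p encoded   # [["a", "b&c=d"]]
```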

FWIW, we are using Nokogiri 1.6.1.

Robert Klemme

Nov 19, 2015, 3:08:38 AM
to nokogiri-talk

Hm. First, you write:

> The impact of this is that browsers treat the link as having two query parameters, named a & amp;c. Clearly not what we want!!

So, you do not want two query parameters. For me doing this:

d = Nokogiri.HTML('<a href="https://www.google.de/search?q=a&b=c">link</a>')
File.open("x.html","wb"){|io|d.write_to io}

Leads to a file with this content:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><a href="https://www.google.de/search?q=a&amp;b=c">link</a></body></html>

When I open this with a browser (Chrome in my case) and click on the link, Google searches for "a" and receives another query parameter "b" with value "c". This is different from what you state. Did you test with a browser?

Then you write

On Thursday, November 19, 2015 at 7:22:01 AM UTC+1, Shivasubramanian A wrote:

> If we were to use the method you had specified, then once we parse the HTML, the URL will be http://localhost:8080?a=b%26c=d, which will mean that the value of 'a' parameter is b&c=d, which is not what we want.

OK, I thought you wanted the & in the value, since the original version did exactly what it was supposed to do (i.e. using "&amp;" as a query parameter separator).

> FWIW, we are using Nokogiri 1.6.1.

irb(main):014:0> puts Nokogiri::VERSION_INFO
{"warnings"=>[], "nokogiri"=>"1.6.6.2", "ruby"=>{"version"=>"2.2.3", "platform"=>"x86_64-cygwin", "description"=>"ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-cygwin]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "source"=>"system", "compiled"=>"2.9.2", "loaded"=>"2.9.2"}}

Behavior seems the same. Maybe it's browsers that handle it differently, but I doubt it. So far Nokogiri works correctly.

Cheers

robert

Shivasubramanian A

Nov 19, 2015, 3:55:17 AM
to nokogiri-talk
Hi Robert,

> When opening this with a browser (Chrome in my case) and clicking on the link Google will search for a and have another query parameter "b" with value "c". This is different from what you state. Did you test with a browser?

We did test with Chrome. Additionally, we didn't trust the browser's address bar, but we looked at the Developer Tools, and found that it does not submit query parameter "b" with value "c", but query parameter "amp;b" with value "c". Can you look at Developer Tools and verify this?


> So, you do not want two query parameters
No, we do want two query parameters. We just don't want the browser to submit "amp;b" with value "c". This can be achieved if Nokogiri keeps the "&" in query strings as is.

Hope it's clear now.
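One way a parameter named "amp;c" can reach the server is when markup that is already escaped gets HTML-escaped a second time before it is rendered. Whether that happens in this application is an assumption, not something the thread confirms; the sketch below only shows the mechanism using Ruby's standard library, with CGI.unescapeHTML standing in for the single round of entity decoding a browser performs on an attribute value.

```ruby
require 'cgi'
require 'uri'

# Attribute value as an HTML serializer would emit it.
serialized_href = 'http://localhost:8080?a=b&amp;c=d'

# Assumption: a later step (for example a template helper) escapes it again.
double_escaped = CGI.escapeHTML(serialized_href)

# The browser decodes character references once, leaving a literal "&amp;".
seen_by_browser = CGI.unescapeHTML(double_escaped)

# Splitting the remaining query on "&" now yields a key called "amp;c".
query  = seen_by_browser.split('?', 2).last
params = URI.decode_www_form(query)

p double_escaped # "http://localhost:8080?a=b&amp;amp;c=d"
p params         # [["a", "b"], ["amp;c", "d"]]
```

If the rendered page source shows &amp;amp; inside the href, the second escaping step is the place to look rather than the parser.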

Robert Klemme

Nov 19, 2015, 5:28:46 AM
to nokogiri-talk


On Thursday, November 19, 2015 at 9:55:17 AM UTC+1, Shivasubramanian A wrote:
> Hi Robert,
>
>> When opening this with a browser (Chrome in my case) and clicking on the link Google will search for a and have another query parameter "b" with value "c". This is different from what you state. Did you test with a browser?
>
> We did test with Chrome. Additionally, we didn't trust the browser's address bar, but we looked at the Developer Tools, and found that it does not submit query parameter "b" with value "c", but query parameter "amp;b" with value "c". Can you look at Developer Tools and verify this?

Verified. Additionally, I changed the parameter names to make the second one "q", and Google's search page properly shows the value in the search bar on the results page.

>> So, you do not want two query parameters
> No, we do want two query parameters. We just don't want the browser to submit "amp;b" with value "c". This can be achieved if Nokogiri keeps the "&" in query strings as is.
>
> Hope it's clear now.

Yes.

robert

Shivasubramanian A

Nov 19, 2015, 5:37:55 AM
to nokogiri-talk
Hello Robert,

> Verified. Additionally I changed parameter names to make the second one "q" and Google's search page properly shows the value in the search bar on the results page.

Does that mean you accept that this is a bug? Or do you have a possible workaround? Your reply is ambiguous.

Robert Klemme

Nov 19, 2015, 7:16:59 AM
to nokogi...@googlegroups.com
On Thu, Nov 19, 2015 at 11:37 AM, Shivasubramanian A
<ashivasu...@gmail.com> wrote:
> Hello Robert,
>
>> Verified. Additionally I changed parameter names to make the second one
>> "q" and Google's search page properly shows the value in the search bar on
>> the results page.
>
> Does that mean you accept that this is a bug? Or do you have a possible
> workaround? Your reply is ambiguous.

I said Nokogiri works as expected - that clearly means "no bug". You
seem to be the only one having an issue. I cannot reproduce it.

robert



--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/

Walter Lee Davis

Nov 19, 2015, 8:23:46 AM
to nokogi...@googlegroups.com


In XHTML Strict, having an unescaped ampersand anywhere in your code, even in a URL, makes the page invalid.

Browsers know how to deal with this kind of code: they decode &amp; back to & before following the link. Nokogiri is correcting your HTML to be valid, not breaking it.
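A rough way to see why the escaped form is harmless: decode the character reference once, as an HTML parser does for attribute values, and compare the result with the unescaped URL. This is a sketch using only Ruby's standard library; CGI.unescapeHTML stands in for the browser's decoding step, and query_params is a hypothetical helper written for this example.

```ruby
require 'cgi'
require 'uri'

raw_href     = 'https://www.google.de/search?q=a&b=c'     # as written by hand
escaped_href = 'https://www.google.de/search?q=a&amp;b=c' # as serialized markup

# Decode character references once, as happens when the attribute is read.
decoded_href = CGI.unescapeHTML(escaped_href)

# Hypothetical helper: parse the query string of a URL into key/value pairs.
def query_params(url)
  URI.decode_www_form(URI.parse(url).query)
end

p decoded_href == raw_href   # true
p query_params(decoded_href) # [["q", "a"], ["b", "c"]]
```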

Walter


Walter Lee Davis

Nov 19, 2015, 9:54:54 AM
to nokogi...@googlegroups.com

On Nov 19, 2015, at 8:23 AM, Walter Lee Davis <wa...@wdstudio.com> wrote:

> In XHTML Strict, having an unescaped ampersand anywhere in your code, even in a url, makes the page invalid.

Even in HTML5: http://scripty.walterdavisstudio.com/link-test.html

Make sure to enable the "show URL when you hover a link" feature in your browser, and note that the two bottom links are displayed exactly the same way (unescaped) despite the fact that one of those links has escaped &amp; in its literal code. View source. Note that all of these examples submit the exact same data to the reflector (which is a trivial PHP script I wrote years and years ago, source code here: https://gist.github.com/walterdavis/84c75ce31ce7e7e792e3).

Walter

Walter Lee Davis

Nov 19, 2015, 10:04:43 AM
to nokogi...@googlegroups.com

On Nov 19, 2015, at 9:53 AM, Walter Lee Davis <wa...@wdstudio.com> wrote:

>> In XHTML Strict, having an unescaped ampersand anywhere in your code, even in a url, makes the page invalid.
>
> Even in HTML5: http://scripty.walterdavisstudio.com/link-test.html

Sorry to bang on about this; by saying "Even in HTML5", I don't mean that it is invalid in HTML5. HTML5 took the "pave the cowpaths" approach of taking every "wrong" (by XHTML Strict standards) thing that people were already doing in millions and billions of Web pages and declaring it legal. So there are a lot of things you can get away with in HTML5 that would be laughable in a stricter DOCTYPE. Un-closed (closed by inference only) tags: I am looking at you! I cut my teeth on XHTML Strict, so I use that syntax out of muscle memory or sheer pig-headedness.

Walter
