URL objects in PhantomJS (surfacing QUrl)


Zack Weinberg

May 9, 2014, 11:46:52 AM
to phan...@googlegroups.com
So I just filed https://github.com/ariya/phantomjs/issues/12216. The quick summary is that `page.url` calls `QUrl::toString()` instead of `QUrl::toEncoded()`, which is inconvenient if you want to store page URLs in a database and later re-access them using some other tool. In fact, it can destroy information and make it *impossible* to re-access the URL.

Here's a concrete example where this matters: right now, if you attempt to load the page `http://genesis-ec.com/`, you get a 302 Moved Temporarily which puts you at `http://genesis-ec.com/err.asp?shopcd=99999&errmsg=%81E%83V%83%87%83b%83v%82%AA%91I%91%F0%82%B3%82%EA%82%C4%82%A2%82%DC%82%B9%82%F1%3CBR%3E`. That query string is not valid UTF-8. (Based on playing around with the Character Encoding menu in Firefox and then pasting stuff into Google Translate, I think it's Shift_JIS.) `page.url` for the redirected page comes out as 'http://genesis-ec.com/err.asp?shopcd=99999&errmsg=\ufffdE\ufffdV\ufffd\ufffd\ufffdb\ufffdv\ufffd\ufffd\ufffdI\ufffd\ufffd\ufffd\ufffd\u0102\ufffd\ufffd\u0702\ufffd\ufffd\ufffd<BR>', which as you can see has lost information.
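To make the information loss concrete, here is a minimal sketch of the general effect, not of Qt's actual code path. It assumes a JavaScript runtime with `TextDecoder` (a current Node.js or browser, not PhantomJS itself), and `percentDecodeToBytes` is a hypothetical helper written for this illustration. It takes the first few escaped bytes of the query string above, decodes them as if they were UTF-8, and re-encodes the result:

```javascript
// Hypothetical helper: percent-decode a URL component into raw bytes,
// without assuming any character set.
function percentDecodeToBytes(s) {
  var bytes = [];
  for (var i = 0; i < s.length; i++) {
    var hex = s.substr(i + 1, 2);
    if (s.charAt(i) === "%" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // an escaped byte such as %81
      i += 2;
    } else {
      bytes.push(s.charCodeAt(i) & 0xff); // a literal ASCII character
    }
  }
  return new Uint8Array(bytes);
}

// The first four bytes of the errmsg parameter from the example above.
var encoded = "%81E%83V";

// Decoding as UTF-8 replaces each invalid byte with U+FFFD.
var decoded = new TextDecoder("utf-8").decode(percentDecodeToBytes(encoded));

// Re-encoding the decoded string does not give back the original bytes.
var reencoded = encodeURIComponent(decoded);

console.log(reencoded);            // %EF%BF%BDE%EF%BF%BDV
console.log(reencoded === encoded); // false
```

Once the bytes 0x81 and 0x83 have both become U+FFFD, nothing in the decoded string says which was which, so the original URL cannot be rebuilt from it.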

This may be impossible to fix 100% without modifying Qt itself -- the QUrl documentation leads me to believe that it internally assumes URLs are always encoded in UTF-8, which, as the above example demonstrates, is wrong -- but it would be a step in the right direction to give access to `QUrl::toEncoded`.  Now, it would be trivial to add a `page.encoded_url` property, but I'm wondering if it would be *better* to define a "URL object" which exposes as much of the QUrl API as makes sense, and make that be the value of `page.url` and various other properties (basically wherever pjs internally stores a QUrl).  For backward compatibility it would stringify as it always has, but one could also access page.url.encoded, page.url.hostname, and so on.
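As a rough illustration of the "URL object" idea, here is a sketch in plain JavaScript. Nothing in it is PhantomJS API: `makeUrlObject` and the `display` field are hypothetical names, the `example.com` URL is made up, and in a real implementation the values would come from the underlying QUrl rather than being passed in by hand.

```javascript
// Hypothetical sketch: an object that stringifies like the old page.url
// but also carries the encoded form and the hostname.
function makeUrlObject(parts) {
  return {
    encoded: parts.encoded,   // what QUrl::toEncoded() would supply
    hostname: parts.hostname,
    toString: function () {   // what QUrl::toString() would supply
      return parts.display;
    }
  };
}

var url = makeUrlObject({
  encoded: "http://example.com/?q=%81E",
  display: "http://example.com/?q=\ufffdE",
  hostname: "example.com"
});

var asString = "" + url; // string concatenation goes through toString()

console.log(asString === "http://example.com/?q=\ufffdE"); // true
console.log(url.encoded);  // http://example.com/?q=%81E
console.log(url.hostname); // example.com
```

One caveat with a plain object like this: code that concatenates or prints the value keeps working, but `typeof url` is `"object"`, a strict comparison such as `url === "http://..."` is false, and string methods like `indexOf` are not available on it, so backward compatibility would only be partial.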

This is a blocker for me on a project I'm using PhantomJS for, so I am volunteering to do the programming, but CONTRIBUTING.md says discuss new features here first. :-)  What do you think?

zw

Zack Weinberg

May 15, 2014, 5:35:51 PM
to phan...@googlegroups.com
On Friday, May 9, 2014 11:46:52 AM UTC-4, Zack Weinberg wrote:
> I'm wondering if it would be *better* to define a "URL object" which exposes as much of the QUrl API as makes sense, and make that be the value of `page.url` and various other properties (basically wherever pjs internally stores a QUrl).

This turns out to be infeasible because PhantomJS's embedded copy of Qt doesn't include QScriptable. I have taken a different approach, which is now pull request https://github.com/ariya/phantomjs/issues/12233.

zw