Character encoding problem: apostrophe etc.


Dwight Lewis

Jun 23, 2017, 11:50:55 PM
to Selenium Users
I put together a Java app to download articles from the WSJ.  Everything is working fine, except that things like "Trump's" become "Trump?s".  The details:

Firefox browser 54.0 with text encoding set to Unicode.  When I right-click and look at the page source I can find "Trump's".

Windows 7 x64, Selenium is 3.4.0, geckodriver is v0.17.0-win64

setup for selenium is:

     System.setProperty("webdriver.gecko.driver", "c:/selenium/geckodriver.exe");

     DesiredCapabilities capabilities = DesiredCapabilities.firefox();
     capabilities.setCapability("marionette", true);

     WebDriver driver = new FirefoxDriver(capabilities);
     driver.get("articleUrl");

     String pageSource = driver.getPageSource();

     // at this point if I do pageSource.indexOf("'") it returns -1  ... it can't find the apostrophe

     Document doc = Jsoup.parse(pageSource);



When I print out the elements in doc in the NetBeans Output area I find "Trump?s", not "Trump's".

I've searched for this on the web and can't find an answer to this problem.  My searching suggests that:

Java String is always UTF-16. 

pageSource contains:    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">  ... from the WSJ

so the WSJ is sending me UTF-8 and driver.getPageSource() is converting it into a Java String which is in UTF-16

It appears that this conversion is changing the apostrophe to a question mark.

Which leads me to believe that there is some sort of encoding problem, but I can't find anything on how to fix it.

What am I missing?

Thanks in advance
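
One possible reading of the symptoms above, offered as an assumption rather than a confirmed diagnosis: the page may contain the typographic apostrophe U+2019 instead of the ASCII apostrophe, and the Java String may hold it intact. In that case indexOf("'") would return -1, and the "?" would only appear when the text is encoded with a charset that lacks that character (for example, a console that is not set to UTF-8). The small sketch below illustrates this with a hard-coded sample; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class ApostropheCheck {
    // Round-trips text through US-ASCII, which cannot represent U+2019;
    // unmappable characters are replaced with '?'.
    static String throughAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Assumed sample: "Trump" followed by U+2019 and "s".
        String sample = "Trump\u2019s";
        System.out.println(sample.indexOf('\''));      // -1: no ASCII apostrophe
        System.out.println(sample.indexOf('\u2019'));  // 5: typographic apostrophe
        System.out.println(throughAscii(sample));      // Trump?s
    }
}
```

If this is what is happening, searching pageSource for '\u2019' instead of "'" should find the character.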



  




Anish Pillai

Jun 25, 2017, 9:32:13 PM
to Selenium Users
Hi,

I have also faced this issue and didn't find a solution. It mostly looks like an issue when HTML gets converted to plain text.

It's not just a problem with the apostrophe. Several other characters are affected too, such as –, …, “ ”, • (in a list) and emoticons.


Cheers, Anish
AutomationTestingHub

Dwight Lewis

Jun 26, 2017, 4:18:35 PM
to Selenium Users
After some more experimentation I now have a solution.

see: http://utf8-chartable.de/unicode-utf8-table.pl?start=8064&names=-&utf8=0x

This will clean up the text:

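import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class PageSourceCleaner {

    // Formats one byte as hex (e.g. 0xe2); handy when exploring unknown sequences.
    public static String toHexStr(byte b) {
        return "0x" + String.format("%02x", b);
    }

    // Replaces UTF-8 punctuation of the form 0xE2 0x80 0x.. with ASCII equivalents.
    public static String clean(String pageSource) {
        byte[] allBytes = pageSource.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(allBytes.length);

        for (int i = 0; i < allBytes.length; i++) {
            boolean punctuation = allBytes[i] == (byte) 0xE2
                    && i + 2 < allBytes.length
                    && allBytes[i + 1] == (byte) 0x80;
            if (!punctuation) {
                out.write(allBytes[i]); // copy every other byte unchanged
                continue;
            }
            // if you find another sequence, include this print stmt to explore the contents
            //System.out.println("0xE2 " + toHexStr(allBytes[i + 1]) + " " + toHexStr(allBytes[i + 2]));
            switch (allBytes[i + 2]) {
                case (byte) 0x98: // opening single quote
                case (byte) 0x99: // closing single quote
                    out.write('\'');
                    i += 2;
                    break;
                case (byte) 0x94: // dash
                    out.write('-');
                    i += 2;
                    break;
                case (byte) 0x9c: // opening double quote
                case (byte) 0x9d: // closing double quote
                    out.write('"');
                    i += 2;
                    break;
                default:
                    out.write(allBytes[i]); // unknown sequence: keep it as is
                    break;
            }
        }
        // Decode as UTF-8 so any remaining multi-byte characters survive.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}

// usage: String pageSource = PageSourceCleaner.clean(driver.getPageSource());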

I'm probably missing some other chars that you have experienced, but this solves my immediate problem.  There may be a more elegant way to solve the problem, but I'll leave that to someone else to discover.  If you find other sequences, please post them.  I've been trying to solve this for several days.
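
As one possible take on the "more elegant way" mentioned above, here is a sketch that works on the decoded String and replaces characters by Unicode code point instead of scanning UTF-8 bytes. The class name, the method name, and the choice of ASCII substitutes (for example "*" for the bullet) are assumptions for illustration; the list also covers the dash, ellipsis, and bullet characters reported earlier in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PunctuationNormalizer {
    // Typographic character -> ASCII substitute (substitutes are an assumption).
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\u2018", "'");   // opening single quote
        REPLACEMENTS.put("\u2019", "'");   // closing single quote / apostrophe
        REPLACEMENTS.put("\u201C", "\"");  // opening double quote
        REPLACEMENTS.put("\u201D", "\"");  // closing double quote
        REPLACEMENTS.put("\u2013", "-");   // en dash
        REPLACEMENTS.put("\u2014", "-");   // em dash
        REPLACEMENTS.put("\u2026", "..."); // ellipsis
        REPLACEMENTS.put("\u2022", "*");   // bullet
    }

    // Returns a copy of text with each listed character replaced.
    static String normalize(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Trump\u2019s \u201Cdeal\u201D"));
    }
}
```

Because it never casts bytes to chars, characters outside the table (accented letters, emoticons) pass through unchanged.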

Anish Pillai

Jun 26, 2017, 9:51:15 PM
to Selenium Users
Thank you. This looks like a good approach to start with.

However, in my case, the content from the HTML pages has already been written to text files (which I have to test), so I'm not sure this would work in my scenario. The text files show ? for all these special chars. It would still be a useful solution when fetching the data directly from the web.
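
For the text-file case, one thing that may be worth checking (an assumption, since the thread does not say how the files were produced) is whether the files are written and read with an explicit UTF-8 charset rather than the platform default. A minimal sketch, with made-up class and method names:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8Files {
    // Writes text as UTF-8 regardless of the platform default charset.
    static void write(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the whole file back, decoding it as UTF-8.
    static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("article", ".txt");
        String text = "Trump\u2019s \u2013 \u2026";
        write(file, text);
        System.out.println(read(file).equals(text)); // round trip keeps the characters
        Files.delete(file);
    }
}
```

Note that if a file already contains literal "?" bytes, the original characters are gone and no decoding step can bring them back; the fix would have to happen where the file is written.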


Thanks,
Anish


Krishnan Mahadevan

Jun 26, 2017, 10:57:04 PM
to seleniu...@googlegroups.com

Dwight,

 

You might want to go one step further and move this logic into a custom org.openqa.selenium.remote.HttpCommandExecutor#execute implementation: check for the command org.openqa.selenium.remote.DriverCommand#GET_PAGE_SOURCE and, if that’s the command, apply your cleanup directly within your custom execute() method.

That way your code would not be making any extra calls.

 

For details on how to inject a custom CommandExecutor implementation into your webdriver instance, you can take a look at my blog here.

 

 

Thanks & Regards

Krishnan Mahadevan

 

"All the desirable things in life are either illegal, expensive, fattening or in love with someone else!"

My Scribblings @ http://wakened-cognition.blogspot.com/

My Technical Scribbings @ http://rationaleemotions.wordpress.com/
