Re: reinterpret() problems


David Huynh

Nov 29, 2011, 1:40:51 PM
to google-r...@googlegroups.com
+google-refine-dev@ since this is getting more technical.
bcc'ing google-refine@ for people to know where this discussion is going.

On Tue, Nov 29, 2011 at 1:23 AM, Tom Morris <tfmo...@gmail.com> wrote:
On Tue, Nov 29, 2011 at 12:31 AM, David Huynh <dfh...@gmail.com> wrote:
> Hi Thad,
> My latest checkin should fix your clipboard scenario. For some reason,
> neither Chrome nor Safari sends any charset in the POST.

Based on Thad's previous feedback, his data was Windows 1252, so we
may need to use the platform's default encoding rather than UTF-8
across all platforms.

So we do set accept-charset in the POST form 


Do you think that tells the browser how to encode the POST body? I looked at the POST headers in Chrome and Safari and can't see any charset information at all, not even in the individual parts of the multipart form. All of my attempts to retrieve the charset encoding in Java return null.
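
Since the request carries no charset, one option is to look for a charset parameter in the Content-Type value and fall back to a default when it is absent. The following is only a minimal sketch of that idea, not Refine's actual code: the class CharsetFallback and the method charsetOf are hypothetical names, the fallback is whatever the caller chooses, and the naive split on ";" ignores quoted parameter values.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Refine.
class CharsetFallback {
    /** Returns the charset named in a Content-Type value, or the fallback when none is usable. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Simplification: assumes no ';' appears inside a quoted parameter value.
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // A header like the ones Chrome and Safari send: a boundary but no charset.
        String header = "multipart/form-data; boundary=----WebKitFormBoundaryx5853XdJAN62XZAw";
        System.out.println(charsetOf(header, StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/plain; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

Whether the fallback should be UTF-8 or the platform default is exactly the open question in this thread; the sketch leaves that choice to the caller.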

> Hi Tom,
> The bigger issue here is that a single project can now be created from
> several files, each with potentially a different encoding. A project-wide
> encoding setting no longer seems to make sense.

Importing multiple files with different encodings into the same
project seems like a recipe for disaster.  Do we think it's going to be a
likely case?

Unlikely, but possible. You can paste in several URLs, and each might return a different encoding. If you select an encoding in the UI, then that should override each individual file's encoding.
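
The precedence described here (UI selection first, then the file's own encoding) could be sketched as follows. This is an illustration under assumptions, not Refine's importer: the class EncodingPrecedence, the method resolve, and the final fallback argument are all hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Refine.
class EncodingPrecedence {
    /** Picks the first usable charset: the UI choice, then the file's own, then the fallback. */
    static Charset resolve(String uiChoice, String fileEncoding, Charset fallback) {
        for (String name : new String[] { uiChoice, fileEncoding }) {
            if (name != null && !name.trim().isEmpty()) {
                try {
                    return Charset.forName(name.trim());
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed name: try the next candidate.
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // The UI selection wins over what the file reported.
        System.out.println(resolve("ISO-8859-1", "UTF-16LE", StandardCharsets.UTF_8));
        // No UI selection: each file keeps its own encoding.
        System.out.println(resolve(null, "UTF-16LE", StandardCharsets.UTF_8));
    }
}
```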


> I think in Reinterpret.java, line 67, encoder should still be o2 and decoder
> should be args[2], so that the decoder would be an optional parameter. What
> do you think?

Changing reinterpret to be reinterpret(new-encoding,old-encoding)
instead of vice-versa is fine with me.  Optional arguments
traditionally follow mandatory arguments, but I didn't really think
there was a backward compatibility issue in this case.

OK, I'll swap that.
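
For anyone following along, the general technique behind a reinterpret-style function is to turn the string back into bytes with one charset and decode those bytes with another. Here is a rough sketch, assuming nothing about the real Reinterpret.java: the class name, the parameter names, and the sample input are made up for illustration, and the argument order is not meant to match whatever we end up with.

```java
import java.nio.charset.Charset;

// Hypothetical sketch of the technique, not the code in Reinterpret.java.
class ReinterpretSketch {
    /**
     * Re-encodes value with the charset it was wrongly decoded as,
     * then decodes the resulting bytes with the charset they were really in.
     */
    static String reinterpret(String value, String wrongCharset, String rightCharset) {
        // Note: characters that wrongCharset cannot represent are replaced, so this can be lossy.
        byte[] bytes = value.getBytes(Charset.forName(wrongCharset));
        return new String(bytes, Charset.forName(rightCharset));
    }

    public static void main(String[] args) {
        // Made-up input: UTF-8 bytes of "è" (C3 A8) that were decoded as windows-1252.
        String garbled = "Trois-Rivi\u00C3\u00A8res Port Authority";
        String fixed = reinterpret(garbled, "windows-1252", "UTF-8");
        System.out.println(fixed.equals("Trois-Rivi\u00E8res Port Authority")); // prints true
    }
}
```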

David

David Huynh

Nov 29, 2011, 1:52:58 PM
to google-r...@googlegroups.com
Also found this


"lacking any other indications, a browser will submit the data from a form using the same character encoding that the page is served in".

David

Thad Guidry

Nov 29, 2011, 1:59:22 PM
to google-r...@googlegroups.com
On Tue, Nov 29, 2011 at 3:23 AM, Tom Morris <tfmo...@gmail.com> wrote:
> On Tue, Nov 29, 2011 at 12:31 AM, David Huynh <dfh...@gmail.com> wrote:
>> Hi Thad,
>> My latest checkin should fix your clipboard scenario. For some reason,
>> neither Chrome nor Safari sends any charset in the POST.
>
> Based on Thad's previous feedback, his data was Windows 1252, so we
> may need to use the platform's default encoding rather than UTF-8
> across all platforms.
>

The paste into the clipboard form is handled by the OS as an array of bytes,
specifically a MemoryStream on Windows, I think. It seems that either
the encoding chosen by the user for rendering a document in the
browser or the application itself can affect the rendering/display of a
given DOM document. The DOM document's encoding is held in a
property called documentCharacterSet, I think.

Regardless, I do not think the platform's encoding is an issue. I
should be able to run my Windows system with Greek as my locale
while Refine operates as an application with UTF-8 as the default
encoding for the DOM document, which it appears Refine has always
done correctly. So we're good there. And browsers are now set to
auto-detect by parsing <html lang="en"><head><meta charset="utf-8">.

A side note: pasting a single character that was previously copied by
highlighting it on an HTML page actually does different things depending
on the application that you're pasting into. Some applications, like
PSPad (and its built-in hex editor) on Windows, allow you to paste as
Text, Unicode Text, HTML Format, OEM Text, or Text Locale, and will carry
along the metadata with the clipboard's MemoryStream, even including the
source URL of the document you copied from.

The FORM element that Refine uses for the Clipboard can utilize this:

https://developer.mozilla.org/en/DOM/form.acceptCharset

The only real abilities that current browsers have with the Form
element are noted here:

http://www.w3.org/TR/1999/REC-html401-19991224/interact/forms.html#adef-enctype

We probably always want to ignore the browser encoding preference the
user has set (under the covers, that is set with the DOM
documentCharacterSet).

The multipart/form-data that we are using for the Clipboard function
seems to pass along the raw byte stream correctly for the Korean char
챔 (in HEX: 54CC)

Parts (multipart/form-data):
clipboard    Trois-Riviì±”res Port Authority

Source:
Content-Type: multipart/form-data; boundary=---------------------------491299511942
Content-Length: 173

-----------------------------491299511942
Content-Disposition: form-data; name="clipboard"

Trois-Riviì±”res Port Authority
-----------------------------491299511942--
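
The "ì±”" in the dump above looks like what you would get if the UTF-8 bytes of 챔 were displayed as windows-1252, and 54CC matches the UTF-16LE bytes of the same character. A small sketch to check both, assuming that is how the viewer decoded the bytes (the class and method names are just for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only; names are made up.
class MojibakeDemo {
    /** Upper-case hex string of the given bytes. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text with one charset and shows the bytes as another charset would. */
    static String viewAs(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        String ch = "\uCC54"; // the Korean character 챔
        System.out.println(hex(ch.getBytes(StandardCharsets.UTF_16LE))); // prints 54CC
        System.out.println(hex(ch.getBytes(StandardCharsets.UTF_8)));    // prints ECB194
        // Assumption: the header viewer decoded the UTF-8 bytes as windows-1252.
        String seen = viewAs(ch, StandardCharsets.UTF_8, Charset.forName("windows-1252"));
        System.out.println(seen.equals("\u00EC\u00B1\u201D")); // prints true
    }
}
```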

-Thad
http://www.freebase.com/view/en/thad_guidry

clipboard - Google Refine.png

Thad Guidry

Nov 29, 2011, 2:08:08 PM
to google-r...@googlegroups.com
Here's the POST itself in my Chrome Win7 for that same text example:

  1. Request URL:
  2. Request Method:
    POST
  3. Status Code:
    200 OK
  4. Request Headers
    1. Accept:
      text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    2. Accept-Charset:
      windows-1251,utf-8;q=0.7,*;q=0.3
    3. Accept-Encoding:
      gzip,deflate,sdch
    4. Accept-Language:
      en-US,en;q=0.8
    5. Cache-Control:
      max-age=0
    6. Connection:
      keep-alive
    7. Content-Length:
      171
    8. Content-Type:
      multipart/form-data; boundary=----WebKitFormBoundaryx5853XdJAN62XZAw
    9. Cookie:
      splashShown=1; www.freebase.com_access="oauth_token=fafXSSTNXm68aDqQxFbVH17y5KaLMd6NbX5h8gsjBu28KQYiK5KoRzbyGdkWyyz28ptV9Iwv4YB455ZiACpM6ZSzjbkCIfbfSKBX&oauth_token_secret=b26c08cb40c2075fa760b2c1f53bea923c09346d"; authsub_token=1/w2nkhIkIlkWpwLyd6hEWBYP40ToRztCUeUB3oUhBKVk; host=.butterfly
    10. Host:
    11. Origin:
    12. Referer:
    13. User-Agent:
      Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2
  5. Query String Parameters
    1. controller:
      core/default-importing-controller
    2. jobID:
      1322594324937
    3. subCommand:
      load-raw-data
  6. Request Payload
    1. ------WebKitFormBoundaryx5853XdJAN62XZAw
       Content-Disposition: form-data; name="clipboard"

       Trois-Riviì±”res Port Authority
       ------WebKitFormBoundaryx5853XdJAN62XZAw--
  7. Response Headers
    1. Content-Length:
      0
    2. Server:
      Jetty(6.1.22)
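
One way to sanity-check which encoding the browser used is to rebuild the request payload and compare its byte length with the Content-Length of 171 shown above. The sketch below assumes the usual multipart layout with CRLF line endings, including one after the closing boundary; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only; names are made up.
class MultipartLength {
    /** Byte length of a one-field multipart/form-data body encoded with the given charset. */
    static int bodyLength(String boundary, String field, String value, Charset cs) {
        String crlf = "\r\n";
        String body = "--" + boundary + crlf
                + "Content-Disposition: form-data; name=\"" + field + "\"" + crlf
                + crlf
                + value + crlf
                + "--" + boundary + "--" + crlf;
        return body.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Boundary taken from the Content-Type header in the Chrome request above.
        String boundary = "----WebKitFormBoundaryx5853XdJAN62XZAw";
        String value = "Trois-Rivi\uCC54res Port Authority"; // contains 챔
        System.out.println(bodyLength(boundary, "clipboard", value, StandardCharsets.UTF_8)); // prints 171
    }
}
```

With 챔 as three UTF-8 bytes the total comes to 171, which matches the header, so under these assumptions the body appears to have been sent as UTF-8. A two-byte encoding of that character would give 170 instead.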


--
-Thad
http://www.freebase.com/view/en/thad_guidry
