Charset Encoding problem with Play 2.1


daniel...@elo7.com

Apr 15, 2013, 7:33:31 PM
to play-fr...@googlegroups.com
I have a web service that receives a parameter in ISO-8859-1 encoding.

But when I try to read it from the request, I get these characters:

�����

I've tried all of these approaches, but none of them converts the given string to the expected one (áéíóú):
//    val a = new String(_html.getBytes());
//    val b = new String(_html.getBytes(), "UTF-8")
//    val c = new String(_html.getBytes(), "ISO-8859-1")
//    val d = new String(_html.getBytes("ISO-8859-1"), "UTF-8")
//    val e = new String(_html.getBytes("ISO-8859-1"), "ISO-8859-1")
//    val f = new String(_html.getBytes("UTF-8"), "UTF-8")
//    val g = new String(_html.getBytes("UTF-8"), "ISO-8859-1")

Here is my action:

  val inboundMessageForm = Form(
    mapping(
      "html" -> text)(InboundMessage.apply)(InboundMessage.unapply))

  def forward = Action(parse.multipartFormData) { implicit request =>
    val inboundMessage = inboundMessageForm.bindFromRequest.get

        // inboundMessage.html =>  �����
   }

What can I do to solve this problem?

Thanks

Daniel

Adam Hooper

Apr 17, 2013, 6:54:35 PM
to play-fr...@googlegroups.com
On Monday, April 15, 2013 7:33:31 PM UTC-4, daniel...@elo7.com wrote:
I have a web service that receives a parameter in ISO-8859-1 encoding.

But when I try to read it from the request, I get these characters:

�����

What are the bytes of the request? In particular, make sure it includes the following:

1) "Content-Type: application/x-www-form-urlencoded; charset=ISO-8859-1"
2) nothing but ISO-8859-1
 
Here is my action:

  val inboundMessageForm = Form(
    mapping(
      "html" -> text)(InboundMessage.apply)(InboundMessage.unapply))

You can't solve the problem here, because by the time you see the String the damage is already done. Decoding an Array[Byte] to a String with the wrong character set replaces invalid byte sequences with substitute characters, so converting back to Array[Byte] does not return the original bytes; in other words, it's lossy.

In this case, Play parsed the request in the wrong character set--most likely UTF-8. It is impossible for you to retrieve the original bytes from the Form.
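To make the lossy round trip concrete, here is a minimal sketch. It assumes the client sent "áéíóú" as ISO-8859-1 bytes and that those bytes were decoded as UTF-8; the object and value names are made up for illustration.

```scala
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

object LossyRoundTrip {
  // "áéíóú" as the client sent it: one byte per character (E1 E9 ED F3 FA).
  val original: Array[Byte] = "\u00e1\u00e9\u00ed\u00f3\u00fa".getBytes(ISO_8859_1)

  // Decoding with the wrong charset: these bytes are not valid UTF-8,
  // so the decoder substitutes U+FFFD (the replacement character).
  val misdecoded: String = new String(original, UTF_8)

  // Encoding the damaged String again yields the bytes of U+FFFD,
  // not the bytes the client sent.
  val roundTripped: Array[Byte] = misdecoded.getBytes(UTF_8)

  // Decoding the untouched bytes with the right charset works fine.
  val recovered: String = new String(original, ISO_8859_1)
}
```

Once the String holds replacement characters, none of the getBytes/new String combinations from the first post can bring the original bytes back.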

What you need to do is tell Play to use the correct encoding when it parses your request. You do that by sending the Content-Type header, including its charset, with the request.

In other words: it's a bug in your client, not in Play. Don't do any conversion in Play.

Enjoy life,
Adam

daniel...@elo7.com

Apr 18, 2013, 3:48:37 PM
to play-fr...@googlegroups.com
Actually I don't have much control over my client's code. :-(

I'm using SendGrid InboundParse client.


It calls my web service, and the only thing it sends me is an HTTP parameter telling which encoding it uses for each of the other parameters.
For example:
it sends me the "subject" and "text" parameters;
then, in the "charsets" parameter, it tells me the encoding of the other ones.
[charsets] => {"subject":"UTF-8","text":"iso-8859-1"}

Is there any elegant way to "hack" Play so that it parses the parameters according to the charsets entries?

I know this is not standard HTTP, but it is a requirement for my application.

Thanks!!!

Adam Hooper

Apr 18, 2013, 6:15:37 PM
to play-fr...@googlegroups.com
On Thursday, April 18, 2013 3:48:37 PM UTC-4, daniel...@elo7.com wrote:
It calls my web service, and the only thing it sends me is an HTTP parameter telling which encoding it uses for each of the other parameters.
For example:
it sends me the "subject" and "text" parameters;
then, in the "charsets" parameter, it tells me the encoding of the other ones.
[charsets] => {"subject":"UTF-8","text":"iso-8859-1"}

Is there any elegant way to "hack" Play so that it parses the parameters according to the charsets entries?

I don't understand how one parameter can describe the character set of another parameter--the "charsets" parameter needs to be decoded first, but how can you read it when it doesn't have a character set?

Are these requests double-encoded? multipart-encoded?

Could you post an example request, e.g., as text and as output of "xxd"?
 
I know this is not standard HTTP, but it is a requirement for my application.

There are a few approaches I'd ponder, and they depend on what Sendgrid is doing. You'll have to understand that before you proceed.

Single encoding

If you know each message will come with every parameter in the same encoding, but that's not the encoding Play is using, you can override Play's way of determining the character set. Create an app/Global.scala that looks something like this:

import play.api.GlobalSettings

import play.api.mvc.{Handler,RequestHeader}

object Global extends GlobalSettings {
  override def onRouteRequest(request: RequestHeader): Option[Handler] = {
    // Override character set
    val requestWithCharset = new RequestHeader {
        // Like RequestHeader.copy(), but overriding charset.
        // Every abstract member of RequestHeader has to be forwarded.
        val id = request.id
        val tags = request.tags
        val uri = request.uri
        val path = request.path
        val method = request.method
        val version = request.version
        val queryString = request.queryString
        val headers = request.headers
        val remoteAddress = request.remoteAddress
        override lazy val charset : Option[String] = Some("iso-8859-1")
    }

    super.onRouteRequest(requestWithCharset)
  }
}

This method will be called for every request. You can add a condition (for example, on request.path) so the charset is only overridden for the routes that matter to you, and pass the original request to super.onRouteRequest otherwise. Adjust the "lazy val charset" to the character set you want.

Multiple encodings

As I said, I don't quite understand how the message is encoded.

It's possible the Sendgrid developers don't, either :). If that's the case, you'll have to follow *their* logic, whatever that is. The brute-force way is to parse the request as a big Array[Byte] instead of as a String. In your controller:

  val MaxMemory = 1 * 1024 * 1024 // 1MB
  def post = Action(parse.raw(MaxMemory)) { request =>
    // asBytes() returns None when the body is larger than MaxMemory
    val bytes : Array[Byte] = request.body.asBytes().getOrElse(
      throw new Exception("Need to handle case when memory exceeded? Use request.body.asFile"))
    // parse the bytes yourself, then return a Result
    Ok("received " + bytes.length + " bytes")
  }

You won't get Java's or Scala's String libraries, or any other goodies to help you.

You'll want to first split the bytes into a Map[Array[Byte],Array[Byte]] by splitting on "&" and "=" (if it's URL-encoded), then you'll want to call new String() on each key and value using the appropriate encodings. Then you can pass the resulting Map[String,String] to Play's Forms.
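Here is a rough sketch of that splitting step, assuming the body really is URL-encoded (a multipart body would need different splitting) and assuming the "charsets" JSON has already been parsed into a Map. PerFieldDecoding, splitPairs and decodeFields are made-up names, not part of Play. Since a URL-encoded body is plain ASCII with percent escapes, the sketch splits it as text and lets java.net.URLDecoder turn the escapes into bytes and decode them with the charset of each field.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

object PerFieldDecoding {
  // Split "k1=v1&k2=v2" into still-percent-encoded pairs.
  // ISO-8859-1 maps each byte to one char, so nothing is lost at this stage.
  def splitPairs(body: Array[Byte]): Map[String, String] = {
    val raw = new String(body, StandardCharsets.ISO_8859_1)
    raw.split("&").filter(_.nonEmpty).map { pair =>
      val idx = pair.indexOf('=')
      if (idx < 0) (pair, "") else (pair.substring(0, idx), pair.substring(idx + 1))
    }.toMap
  }

  // Decode each value with the charset listed for its key.
  // `charsets` is assumed to come from the already-parsed "charsets" parameter.
  def decodeFields(body: Array[Byte],
                   charsets: Map[String, String],
                   default: String = "UTF-8"): Map[String, String] =
    splitPairs(body).map { case (rawKey, rawValue) =>
      val key = URLDecoder.decode(rawKey, "US-ASCII")
      val value = URLDecoder.decode(rawValue, charsets.getOrElse(key, default))
      key -> value
    }
}
```

For example, with the body "subject=caf%C3%A9&text=%E1%E9%ED%F3%FA" and the charsets Map("subject" -> "UTF-8", "text" -> "iso-8859-1"), decodeFields should give "café" for subject and "áéíóú" for text.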

Dummy encoding?

This is a hack, but it might help....

You could define a "dummy" encoding--that is, make a Java String whose only purpose is to carry bytes. You could convert an Array[Byte] to and from this String losslessly. Think about it:

1. Play would "decode" from the dummy charset
2. Play would do all its processing with these Strings (it would understand the ASCII characters; the rest would be garbage)
3. You would "encode" back into the dummy charset, getting the original Array[Byte]
4. You would use the character set named in the "charsets" parameter (which you read during step 2) to properly decode the bytes

To do this, you'd need to define a CharsetDecoder (http://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetDecoder.html) with a decodeLoop() method that's essentially a cast operation. Ditto for CharsetEncoder. Then define a Charset (http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html) and finally you'd need to equip your app with a CharsetProvider (http://docs.oracle.com/javase/7/docs/api/java/nio/charset/spi/CharsetProvider.html). I've never done this before, so I'm out of my league here.

This hack could make sense, under certain conditions. But those would be wonky conditions indeed.
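One thing that may spare you the custom Charset: ISO-8859-1 maps every byte value 0x00-0xFF to the code point with the same number, so it already behaves like such a byte-carrying encoding. Below is a minimal sketch of steps 1, 3 and 4 under that assumption. ByteCarrier, carry, unwrap and redecode are made-up names, and the sketch says nothing about whether Play can be made to use this charset for every part of the request.

```scala
import java.nio.charset.StandardCharsets.ISO_8859_1

object ByteCarrier {
  // Step 1: "decode" bytes into a String that merely carries them (one char per byte).
  def carry(bytes: Array[Byte]): String = new String(bytes, ISO_8859_1)

  // Step 3: "encode" back, which returns exactly the original bytes.
  def unwrap(carried: String): Array[Byte] = carried.getBytes(ISO_8859_1)

  // Step 4: decode the recovered bytes with the charset the field really uses.
  def redecode(carried: String, charsetName: String): String =
    new String(unwrap(carried), charsetName)
}
```

The carried String looks like garbage for any non-ASCII content until redecode is applied, which is the trade-off described in step 2.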

Whatever approach you choose, be sure you understand what's happening. (Other web frameworks might make this easier, but they might not! You need to first understand what the heck Sendgrid is doing.) Always understand when you're dealing with bytes and when you're dealing with Strings, and always understand the encoding of every array of bytes you deal with. (Strings don't have an encoding--at least not one that could possibly be useful to you here.)

Enjoy life,
Adam

daniel...@elo7.com

Apr 19, 2013, 8:03:04 AM
to play-fr...@googlegroups.com

I can't do asBytes, I get this error:

request.body.asBytes (value asBytes is not a member of play.api.mvc.MultipartFormData)


| I don't understand how one parameter can describe the character set of another parameter--the "charsets" parameter needs to be decoded first, but how can you read it when it doesn't have a character set?

It looks like you need to decode the "charsets" parameter as UTF-8 and then decode the other parameters based on the content of the "charsets" JSON.


SendGrid says:
"Messages, and their headers, can have character set data associated with them. In order to simplify the parsing of messages for the end user, SendGrid will decode the to, from, cc, and subject headers if needed. All headers will be converted to UTF-8 for uniformity, since technically a header can be in many different character sets.
The charsets variable will contain a JSON encoded hash of the header / field name and its respective character set. For instance, it may look like:
[charsets] => {"to":"UTF-8","cc":"UTF-8","subject":"UTF-8","from":"UTF-8","text":"iso-8859-1"}
This shows that all headers should be treated as UTF-8, and the text body is latin1."

SendGrid says that everything comes in UTF-8, but I don't think it does, because when I debug the parameter content I can't read the "áéíóú". And the charsets JSON it sends me says that the encoding is latin1.

I think (not sure) the problem is in ContentTypes.scala, line 712, in the BodyParsers class (Play 2.1.1):

      def handleDataPart: PartHandler[Part] = {
        case headers @ PartInfoMatcher(partName) if !FileInfoMatcher.unapply(headers).isDefined =>
          Traversable.takeUpTo[Array[Byte]](DEFAULT_MAX_TEXT_LENGTH)
            .transform(Iteratee.consume[Array[Byte]]().map(bytes => DataPart(partName, new String(bytes, "utf-8"))))
            .flatMap { data =>
              Cont({
                case Input.El(_) => Done(MaxDataPartSizeExceeded(partName), Input.Empty)
                case in => Done(data, in)
              })
            }
      }

It looks like it always uses UTF-8! Is this correct?




Eytan Biala

Jun 24, 2014, 11:26:04 AM
to play-fr...@googlegroups.com, daniel...@elo7.com
Did you end up figuring this out?