[RavenDB] [Client API] special characters encoding/decoding in queries

styx31

May 20, 2010, 1:23:09 PM
to ravendb
It seems that special characters (such as "ç") are not correctly
encoded in Where clause commands.

Example : .Where("Type:Feats AND Language:Français")

is encoded as "query=Type:Feats%20AND%20Language:Fran%C3%A7ais" and
returns nothing.

By crafting the url by hand, it works when I use the following :
"query=Type:Feats%20AND%20Language:Fran%E7ais"

Please note that the difference is that I used the ANSI encoding of
"ç" (a single byte, E7) instead of UTF-8 (two bytes, C3 A7).

My data was previously created using the .Store() method, so there must
be something related to Lucene here.
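
To make the difference between the two URLs concrete, here is a small sketch in Python (chosen only because it makes the bytes easy to inspect; it is not the .NET client code). It percent-encodes the same query once as UTF-8 and once as ISO-8859-1, which is assumed here to be the "ANSI" code page in question. The helper name encode_query is hypothetical.

```python
from urllib.parse import quote

def encode_query(query: str, encoding: str) -> str:
    # Percent-encode a Lucene-style query string. ":" is left readable;
    # spaces and non-ASCII characters are escaped byte by byte.
    return quote(query, safe=":", encoding=encoding)

query = "Type:Feats AND Language:Français"

utf8_form = encode_query(query, "utf-8")          # "ç" becomes two bytes: C3 A7
latin1_form = encode_query(query, "iso-8859-1")   # "ç" becomes one byte: E7

print(utf8_form)
print(latin1_form)
```

The two results differ only in how "ç" is escaped, so whichever charset the server uses to decode the query string decides which form matches the stored data.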

Ayende Rahien

May 20, 2010, 1:26:22 PM
to rav...@googlegroups.com
Hm, I'll add a test case for that. I already knew that I am doing something bad in the encoding.

Ayende Rahien

May 20, 2010, 4:15:32 PM
to rav...@googlegroups.com
This has proven to be surprisingly difficult to solve.
The failing test can be found here: DocumentStoreServerTests.Can_query_using_special_characters

The problem is that I don't understand how the query string is being parsed. More to the point, nothing that I tried (Uri.EscapeDataString, Uri.EscapeUriString, HttpUtility.UrlEncode) works, with any of the encodings that I tried.
Any idea how to generate the working string from .NET?

Jan Benny Thomas

May 20, 2010, 4:31:54 PM
to rav...@googlegroups.com

As a Norwegian I have to take a look at this; the Norwegian å-Å, æ-Æ and ø-Ø characters are often used in names.

 

Benny

Ayende Rahien

May 20, 2010, 4:36:22 PM
to rav...@googlegroups.com
Run the test, you'll be able to see what is going on.
The received string is located in Index.Query(IndexQuery query), in the query.Query property.



Ayende Rahien

May 20, 2010, 4:36:37 PM
to rav...@googlegroups.com
We might need to test this on IIS as well, this is running using the embedded web server.

Asbjørn Ulsberg

May 20, 2010, 6:20:16 PM
to rav...@googlegroups.com, Ayende Rahien
The query string is usually parsed (and constructed) using UTF-8. The
encoding used can be set in ASP.NET with the <globalization/> element:

http://msdn.microsoft.com/en-us/library/hy4kkhe0.aspx

I'm pretty sure it's possible to set the encoding by means other than
web.config, but I don't have Visual Studio available at the moment to test
with.

What I'm wondering either way, though, is why ANSI is working while UTF-8
isn't, because UTF-8 is obviously highly preferable over e.g. ISO-8859-1.
Either way, HttpUtility.UrlEncode() and .UrlDecode() have overloads that
take a System.Text.Encoding object:

http://msdn.microsoft.com/en-us/library/system.web.httputility.urlencode.aspx
http://msdn.microsoft.com/en-us/library/system.web.httputility.urldecode.aspx

Is perhaps the web server itself using ISO-8859-1 (or even worse:
Windows-1252) as a default encoding?
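
As a sketch of why the charset used on the decoding side matters, the following Python snippet (an illustration only, not ASP.NET or Raven code) decodes the hand-crafted %E7 form with two different charsets. The helper name decode_value is an assumption made for this example.

```python
from urllib.parse import unquote

def decode_value(raw: str, encoding: str) -> str:
    # Turn percent-escapes back into bytes, then decode those bytes with
    # the given charset; bytes that cannot be decoded become U+FFFD.
    return unquote(raw, encoding=encoding, errors="replace")

as_latin1 = decode_value("Fran%E7ais", "iso-8859-1")
as_utf8 = decode_value("Fran%E7ais", "utf-8")
round_trip = decode_value("Fran%C3%A7ais", "utf-8")

print(as_latin1, as_utf8, round_trip)
```

A lone E7 byte is not valid UTF-8, so a UTF-8 decoder cannot recover "ç" from it, while an ISO-8859-1 decoder can. The reverse mismatch (UTF-8 bytes read as ISO-8859-1) would produce two wrong characters instead.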


-Asbjørn



Jan Benny Thomas

May 20, 2010, 6:33:22 PM
to rav...@googlegroups.com

It looks like HttpListenerRequest.QueryString is the bad ****** here. I replaced it with System.Web.HttpUtility.ParseQueryString(Uri.UnescapeDataString(request.Url.Query)); in the HttpListenerRequestAdapter, and the query now looks correct on the server side. But the test still fails.
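
One general pitfall with unescaping a whole query string before parsing it is that escaped separators become real ones. The Python sketch below illustrates this with a made-up request; it is not Raven's code, and whether this affects the replacement described above is not known.

```python
from urllib.parse import parse_qs, unquote

# Hypothetical raw query string; the intended value of "query" is "Name:A&B".
raw = "query=Name%3AA%26B&start=0"

# Unescaping everything first turns the escaped "&" into a real
# separator, so the value is cut short.
unescaped_first = parse_qs(unquote(raw))

# Splitting on "&" and "=" first, then decoding each part, keeps it intact.
parsed_first = parse_qs(raw)

print(unescaped_first)
print(parsed_first)
```

Parsing first and decoding each name and value afterwards is the safer order for values that may contain "&", "=" or "+".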

Jan Benny Thomas

May 20, 2010, 8:40:20 PM
to rav...@googlegroups.com

It looks like I have a fix for this now. I’ll send a pull request soon.

 

It seems like the server screws things up in the QueryString, so I made a temporary fix for this. I have to look into what happens behind the scenes in the web server.

 

We also have a little problem with separators: DocumentStoreServerTests.Can_get_correct_averages_from_map_reduce_index fails in my environment because the server responds with age: 26,5, and this value is interpreted as 265 by the Newtonsoft.Json library.
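
To show why a culture-formatted number is a problem in JSON, here is a minimal Python sketch. It is only an illustration: format_with_comma and try_parse are hypothetical helpers, and Python's strict parser rejects the payload outright, whereas the .NET library in this thread is reported to have read it as 265.

```python
import json

def format_with_comma(value: float) -> str:
    # Mimic a culture that writes decimals with "," instead of ".".
    return str(value).replace(".", ",")

def try_parse(text: str):
    # Return the parsed object, or None if the text is not valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

culture_payload = '{"age": ' + format_with_comma(26.5) + "}"
invariant_payload = json.dumps({"age": 26.5})

print(culture_payload, "->", try_parse(culture_payload))
print(invariant_payload, "->", try_parse(invariant_payload))
```

Either way, the number has to be written with "." regardless of the server's culture settings for the response to be valid JSON.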

Asbjørn Ulsberg

May 21, 2010, 3:58:43 AM
to rav...@googlegroups.com, Jan Benny Thomas
The separator issue sounds like an i18n problem. Not sure what culture
options you have in Newtonsoft.Json.


-Asbjørn



Ayende Rahien

May 21, 2010, 4:11:31 AM
to rav...@googlegroups.com
That one I know how to fix, no worries here.


Benny Thomas

May 21, 2010, 6:53:34 AM
to ravendb
I have now tested it using Curl. It seems that with Curl the correct
data is sent in, but the wrong characters are stored in the database.

From the Raven UI, when you save a new or changed document with
the correct øæå, you can query it using Curl. So I have to take a look
at the PUT problem.



Jan Benny Thomas

May 21, 2010, 8:16:39 AM
to rav...@googlegroups.com

It looks like it works when adding documents from the UI and querying the index.

 

I tried using curl to add information:

 

curl -X PUT http://localhost:8080/docs/bob -d "{ Name: 'Bob', HomeState: 'Småland', ObjectType: 'User' }"

curl -X PUT http://localhost:8080/docs/sarah -d "{ Name: 'Sarah', HomeState: 'Illinois', ObjectType: 'User' }"

curl -X PUT http://localhost:8080/docs/paul -d "{ Name: 'Paul', HomeState: 'Småland', ObjectType: 'User' }"

curl -X PUT http://localhost:8080/docs/mary -d "{ Name: 'Mary', HomeState: 'Småland', ObjectType: 'User' }"

 

Creating the index:

curl -X PUT http://localhost:8080/indexes/usersByHomeState -d "{ Map:'from doc in docs\r\nwhere doc.ObjectType==\"User\"\r\nselect new { doc.HomeState }' }"

 

Querying the index:

curl -X GET http://localhost:8080/indexes/usersByHomeState?query=HomeState:Småland

This gives us the result:

{
  "Results": [],
  "IsStale": false,
  "TotalResults": 0
}

 

 

Looking at the document from the Raven UI:

Edit Document for Bob shows us that Småland was saved with the wrong character.

I changed it to å in the UI, like it should be, and it looks pretty, both in the list and in the Edit Document dialog:

[screenshot: image011-1.png]

 

I don’t like that it shows us the \u00e5, but I can live with that for now as it shows correctly in the Edit Document dialog.
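
For what it's worth, \u00e5 is just the JSON escape for å, and both spellings parse to the same string. A small Python sketch (illustrative only, unrelated to how the Raven UI renders documents):

```python
import json

doc = {"HomeState": "Småland"}

escaped = json.dumps(doc)                      # ASCII-only output
literal = json.dumps(doc, ensure_ascii=False)  # keeps "å" as-is

print(escaped)
print(literal)

# Both spellings decode back to the same document.
same = json.loads(escaped) == json.loads(literal) == doc
```

So the escaped form is a display choice rather than a sign that the stored value is wrong.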

 

 

 

When I now run the query from Curl I get what I want:

curl -X GET http://localhost:8080/indexes/usersByHomeState?query=HomeState%3ASmåland

{
  "Results": [
    {
      "Name": "Bob",
      "HomeState": "SmÑland",
      "ObjectType": "User",
      "@metadata": {
        "Content-Type": "application/x-www-form-urlencoded",
        "Last-Modified": "Fri, 21 May 2010 09:46:03 GMT",
        "@id": "bob",
        "@etag": "37d6aef2-001e-9f31-11df-64bdaace59f8"
      }
    }
  ],
  "IsStale": false,
  "TotalResults": 1
}

 

Voila, almost perfect! So it looks like I have to go a round with the PUT responder to fix that.

 

 


Ayende Rahien

May 21, 2010, 11:53:24 AM
to rav...@googlegroups.com
With your recent patch, is this still an issue?

Ayende Rahien

May 21, 2010, 11:57:30 AM
to rav...@googlegroups.com
This is now fixed


Jan Benny Thomas

May 21, 2010, 2:53:38 PM
to rav...@googlegroups.com

It’s still an issue when using Curl.

Ayende Rahien

May 21, 2010, 4:12:14 PM
to rav...@googlegroups.com
Okay, I'll take a look at those.


Jan Benny Thomas

May 21, 2010, 6:51:29 PM
to rav...@googlegroups.com

It looks like it is a Curl issue. Raven's implementation defaults the encoding to UTF-8, but Curl sends the data as UTF-7.

 

If we demand the content to be encoded as UTF-8 we should be safe.

Ayende Rahien

May 21, 2010, 7:01:48 PM
to rav...@googlegroups.com
Does Curl send the appropriate encoding in its headers?


Jan Benny Thomas

May 22, 2010, 3:55:55 AM
to rav...@googlegroups.com

I could only see application/x-www-form-urlencoded when we use Curl, so we have to do a UrlDecode on the input stream. Using UTF-7 seems to work as well.

Ayende Rahien

May 22, 2010, 4:50:45 AM
to rav...@googlegroups.com
That means that Curl doesn't tell us that it is using UTF-7, which is strange.
If it did, we could detect and handle it somehow. If it doesn't, I don't see what we can do.


Jan Benny Thomas

May 22, 2010, 5:30:16 AM
to rav...@googlegroups.com

It looks like the HttpListener doesn’t automatically do the URL decoding on our behalf like it should.

Jan Benny Thomas

May 22, 2010, 5:37:28 AM
to rav...@googlegroups.com

It seems that the HttpListener uses the raw stream instead of the unescaped data.

 

http://msdn.microsoft.com/en-us/library/system.net.configuration.httplistenerelement.unescaperequesturl.aspx

The UnescapeRequestUrl property indicates if HttpListener uses the raw unescaped URI instead of the converted URI where any percent-encoded values are converted and other normalization steps are taken. Default value true. It looks like setting this to false changes things.

Ayende Rahien

May 22, 2010, 5:47:27 AM
to rav...@googlegroups.com
Okay, I figured it out.
Curl is right, it sends out requests in the default format: ISO-8859-1
We are wrong, because we aren't trying to figure out what the charset of the request is

Will be fixed shortly.
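
A minimal sketch of the idea, in Python rather than the server's .NET code: read the charset parameter from the request's Content-Type header and fall back to ISO-8859-1 when none is given. The function name decode_body and the fallback constant are assumptions for illustration; this is not the actual fix that was pushed.

```python
from email.message import Message

DEFAULT_CHARSET = "iso-8859-1"  # assumed fallback, as described in this thread

def decode_body(body: bytes, content_type: str) -> str:
    # Parse the Content-Type header and pick up its charset parameter, if any.
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or DEFAULT_CHARSET
    return body.decode(charset)

# Curl-style request: no charset declared, body bytes are ISO-8859-1.
no_charset = decode_body("Småland".encode("iso-8859-1"),
                         "application/x-www-form-urlencoded")

# Client that declares its charset explicitly.
with_charset = decode_body("Småland".encode("utf-8"),
                           "application/json; charset=utf-8")

print(no_charset, with_charset)
```

With this approach a client that sends UTF-8 only needs to say so in its header, and clients that declare nothing are decoded with the fallback.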


Ayende Rahien

May 22, 2010, 5:49:53 AM
to rav...@googlegroups.com
Not sure how this is related, since we are talking about the forms data here.


Jan Benny Thomas

May 22, 2010, 6:10:44 AM
to rav...@googlegroups.com

It worked and gave the correct result for the Curl PUT operation, but it made everything else I did fail…

Ayende Rahien

May 22, 2010, 12:51:00 PM
to rav...@googlegroups.com
Okay, I just pushed the fix for that. I would be grateful if you can confirm this.


Jan Benny Thomas

May 22, 2010, 4:35:12 PM
to rav...@googlegroups.com

All good now… almost. Everything looks in line with my recent attempts.

One remaining case:

curl -X GET "http://localhost:8080/indexes/Raven/DocumentsByEntityName?query=&start=0&pageSize=10&cutOff=2010-05-22T11:47:21.9098944+02:00"

What happens here is that the + sign gets stripped in the QueryString handling.

Ayende Rahien

May 22, 2010, 5:05:56 PM
to rav...@googlegroups.com
Huh?
We have a freaking test for that, it passes.
See DocumentStoreServerTests.Can_specify_cutoff_using_server.


Jan Benny Thomas

May 22, 2010, 5:25:12 PM
to rav...@googlegroups.com

Yes, been there, done that! The test works, but the Curl attempt gives us an error. If we can live with that, there is no problem.

Ayende Rahien

May 23, 2010, 6:45:17 AM
to rav...@googlegroups.com
Okay, I looked at that, and this is actually expected: you are sending +, which is decoded as a space.
If you send %2B, which is + encoded, it works.
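
The same behaviour can be reproduced outside Raven. The Python sketch below (an illustration only; the parameter name cutOff is taken from the request earlier in the thread) shows the raw "+" arriving as a space and the %2B form surviving the round trip.

```python
from urllib.parse import parse_qs, quote

cutoff = "2010-05-22T11:47:21.9098944+02:00"

# Sent as-is: form-style decoding turns "+" into a space.
sent_raw = parse_qs("cutOff=" + cutoff)["cutOff"][0]

# Sent percent-encoded: "+" travels as %2B and survives decoding.
encoded = quote(cutoff, safe="")
sent_encoded = parse_qs("cutOff=" + encoded)["cutOff"][0]

print(repr(sent_raw))
print(encoded)
print(repr(sent_encoded))
```

Curl does not encode the URL it is given, so the value has to be escaped before it is placed on the command line.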



Sebastien Lambla

May 23, 2010, 7:12:41 AM
to rav...@googlegroups.com

Funnily enough, the + turning into a space is a long-running browser compat feature, as space encoded as + hasn’t been in any spec for a long time. It’s not even an accepted character, so its presence itself is a bug. :)

Jan Benny Thomas

May 23, 2010, 10:26:50 AM
to rav...@googlegroups.com

I have read about that. I thought that Curl would URL-encode it, but it didn’t.

So it is not a problem.
