Rexster 2.4.0 & 2.5.0: Unable to look up node with special (unicode) characters in property

Georg Walther

Apr 15, 2014, 5:05:17 AM
to gremli...@googlegroups.com
Hi,

I have tried this with Rexster 2.4.0 and 2.5.0 and am using bulbs 0.3.x from a couple of days ago.

After restarting the Rexster server, I generate one node in emptygraph and attempt to look up the same node right after I create it.
See the screenshot I took inside Rexster's Dog House, which shows that this node is actually created and stored in Rexster.

As you can see, when trying to look up this node by its name property Rexster just returns None (or an empty list when curl'ing the GET url that the debugger spits out).
When storing nodes without special characters (e.g. name=u'University of Cambridge') Rexster returns the expected node.

Could anyone please kindly point me to a fix for this or let me know what I am doing wrong?

Thank you!


# -*- coding: utf-8 -*-


from bulbs.rexster import Graph
from bulbs.model import Node
from bulbs.property import String
from bulbs.config import DEBUG
import bulbs

class University(Node):
    element_type = 'university'
    name = String(nullable=False, indexed=True)


g = Graph()
g.add_proxy('university', University)
g.config.set_logger(DEBUG)

name = u'Université de Montréal'

g.university.create(name=name)

print g.university.index.lookup(name=name)

print bulbs.__version__
Output in console:

POST url: http://localhost:8182/graphs/emptygraph/tp/gremlin
POST body: {"params": {"keys": null, "index_name": "university", "data": {"element_type": "university", "name": "Universit\u00e9 de Montr\u00e9al"}}, "script": "def createIndexedVertex = {\n vertex = g.addVertex()\n index = g.idx(index_name)\n for (entry in data.entrySet()) {\n if (entry.value == null) continue;\n vertex.setProperty(entry.key,entry.value)\n if (keys == null || keys.contains(entry.key))\n\tindex.put(entry.key,String.valueOf(entry.value),vertex)\n }\n return vertex\n }\n def transaction = { final Closure closure ->\n try {\n results = closure();\n g.commit();\n return results; \n } catch (e) {\n g.rollback();\n throw e;\n }\n }\n return transaction(createIndexedVertex);"}
GET url: http://localhost:8182/graphs/emptygraph/indices/university?value=Universit%C3%A9+de+Montr%C3%A9al&key=name
GET body: None
None
0.3


James Thornton

Apr 15, 2014, 9:03:54 AM
to gremli...@googlegroups.com
Hi Georg -

Stephen and I have been looking into this issue this morning. See this gist for where we are...


- James


--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Georg Walther

Apr 15, 2014, 1:27:47 PM
to gremli...@googlegroups.com
Fantastic guys, thank you so much for looking into this!

Looking at your code snippets I gleaned the following bit of code that works for me for the time being:

script = 'g.idx(index_name).get(key, value)'
params = dict(index_name="university", key="name", value=name)
resp = g.gremlin.execute(script, params)
for rexster_result in resp:
    vertex = g.vertices.get(rexster_result.get_id())

However that seems a bit convoluted and it seems to me as if I am doing two lookups for each vertex.

James Thornton

Apr 16, 2014, 4:02:36 PM
to gremli...@googlegroups.com
Hi Georg -

Ok, I finally got to the bottom of this. 

Since TinkerGraph uses a HashMap for its index, you can see what's being stored in the index by using Gremlin to return the contents of the map. 

Here's what's being stored in the TinkerGraph index using your Bulbs `g.university.create(name=name)` method above...

    $ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="g.idx(\"university\").index"
    
    {"results":[{"name":{"Université de Montréal":[{"name":"Université de Montréal","element_type":"university","_id":"0","_type":"vertex"}]},"element_type":{"university":[{"name":"Université de Montréal","element_type":"university","_id":"0","_type":"vertex"}]}}],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":3.732632}

All that looks good -- the encodings look right.

To create and index a vertex like the one above, Bulbs uses a custom Gremlin script via an HTTP POST request with a JSON content type.  

Here's the problem...

Rexster's index lookup REST endpoint uses URL query params, and Bulbs encodes URL params as UTF-8 byte strings. 

To see how Rexster handles URL query params encoded as UTF-8 byte strings, I executed a Gremlin script via a URL query param that simply returns the encoded string...

    $ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="'Universit%C3%A9%20de%20Montr%C3%A9al'"
    
    {"results":["UniversitÃ© de MontrÃ©al"],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":16.59432}

Egad! That's not right. As you can see, that text is mangled. 

In a twist of irony, we have Gremlin returning gremlins, and that's what Rexster is using for the key's value in the index lookup, which as we can see is not what's stored in TinkerGraph's HashMap index.

Here's what's going on...

This is what the unquoted byte string looks like in Bulbs: 

    >>> name
    u'Universit\xe9 de Montr\xe9al'

    >>> bulbs.utils.to_bytes(name)
    'Universit\xc3\xa9 de Montr\xc3\xa9al'

`'\xc3\xa9'` is the UTF-8 encoding of the unicode character `u'\xe9'` (which can also be specified as `u'\u00e9'`).

UTF-8 is a variable-width encoding (1 to 4 bytes per character) that uses 2 bytes for this character, and Jersey/Grizzly 1.x (Rexster's app server) has a bug where it doesn't properly handle multi-byte character encodings like UTF-8.
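
The mangling can be reproduced without a server. The following is a minimal Python 3 sketch (the code elsewhere in this thread is Python 2), assuming the server percent-decodes the query param and then interprets the resulting bytes as latin1 instead of UTF-8. It uses only the standard library and does not reflect Rexster's or Grizzly's actual code.

```python
# Python 3 sketch; illustrates the decoding mismatch only.
from urllib.parse import quote, unquote_to_bytes

name = "Université de Montréal"

# Percent-encode the name as UTF-8 bytes, as in the curl command above.
encoded = quote(name, encoding="utf-8")

# These are the raw bytes a server sees after percent-decoding.
raw = unquote_to_bytes(encoded)

# Decoding with the matching charset round-trips the name...
as_utf8 = raw.decode("utf-8")

# ...while decoding the same bytes as latin1 turns each 2-byte
# sequence into two separate characters.
as_latin1 = raw.decode("latin-1")

print(encoded)    # Universit%C3%A9%20de%20Montr%C3%A9al
print(as_utf8)    # Université de Montréal
print(as_latin1)  # UniversitÃ© de MontrÃ©al
```

The latin1 result is a different string from the one stored in the index, so a lookup keyed on it finds nothing.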


It looks like this is fixed in Jersey/Grizzly 2.0, but switching Rexster from Jersey/Grizzly 1.x to Jersey/Grizzly 2.x is a big ordeal. 

Last year TinkerPop decided to switch to Netty instead, and so for the TinkerPop 3 release this summer, Rexster is in the process of morphing into Gremlin Server, which is based on Netty rather than Grizzly.

Until then, here are a few workarounds...

Since Grizzly can't handle multi-byte encodings like UTF-8, client libraries need to encode URL params as 1-byte latin1 encodings (AKA ISO-8859-1), which is Grizzly's default encoding. Note that this only helps for characters that latin1 can represent.

Here's the same value encoded as a latin1 byte string...

    $ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="'Universit%E9%20de%20Montr%E9al'"
    
    {"results":["Université de Montréal"],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":17.765313}

As you can see, using a latin1 encoding works in this case.
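
As a rough illustration of this workaround, here is a Python 3 sketch that percent-encodes a value as latin1 using the standard library. The helper name `can_encode_latin1` is made up for this example and is not part of Bulbs; it just shows why the workaround only covers some characters.

```python
# Python 3 sketch of the latin1 workaround; not Bulbs code.
from urllib.parse import quote

name = "Université de Montréal"

# Each character becomes a single byte, so é is sent as %E9.
latin1_param = quote(name, encoding="latin-1")
print(latin1_param)  # Universit%E9%20de%20Montr%E9al


def can_encode_latin1(text):
    """Hypothetical helper: True if every character fits in latin1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(can_encode_latin1(name))    # True
print(can_encode_latin1("Łódź"))  # False: Ł is outside latin1
```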

However, for general purposes, it's probably best for client libraries to use a custom Gremlin script via an HTTP POST request with a JSON content type and thus avoid the URL param encoding issue altogether -- this is what Bulbs is going to do, and I'll push the Bulbs update to GitHub later today.
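
One reason the JSON route sidesteps the problem: a JSON serializer can escape non-ASCII characters as `\uXXXX`, so the request body is plain ASCII no matter which charset the server assumes. The Python 3 sketch below builds such a body with the standard `json` module, reusing the script and params from Georg's workaround earlier in the thread. It only constructs the payload and is not how Bulbs itself is implemented.

```python
# Python 3 sketch: build an ASCII-only JSON body for a Gremlin script.
import json

name = "Université de Montréal"

# Script and params taken from the workaround posted earlier in this thread.
payload = {
    "script": "g.idx(index_name).get(key, value)",
    "params": {"index_name": "university", "key": "name", "value": name},
}

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
body = json.dumps(payload)
print(body)

# The body survives an ASCII encode and decodes back to the original name.
wire_bytes = body.encode("ascii")
round_tripped = json.loads(wire_bytes.decode("ascii"))["params"]["value"]
print(round_tripped == name)  # True
```

This matches the `Universit\u00e9 de Montr\u00e9al` escape visible in the POST body logged in the first post.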

- James

James Thornton

Apr 17, 2014, 7:12:32 AM
to gremli...@googlegroups.com
Hi All -

UPDATE: It turns out that even though we can't change Grizzly's default encoding type, we can specify UTF-8 as the charset in the HTTP request Content-Type header and Grizzly will use it. Bulbs 0.3.29 has been updated to include the UTF-8 charset in its request header, and all tests pass. The update has been pushed to both GitHub and PyPI.

With charset=UTF-8 in request header...

$ curl -H "Content-Type: text/xml; charset=utf-8" http://localhost:8182/graphs/emptygraph/tp/gremlin?script="'Universit%C3%A9%20de%20Montr%C3%A9al'"
{"results":["Université de Montréal"],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":17.627604}

Without charset=UTF-8 in request header...

$ curl -H "Content-Type: text/xml;" http://localhost:8182/graphs/emptygraph/tp/gremlin?script="'Universit%C3%A9%20de%20Montr%C3%A9al'"
{"results":["UniversitÃ© de MontrÃ©al"],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":13.48543}


- James




Stephen Mallette

Apr 17, 2014, 7:18:17 AM
to gremli...@googlegroups.com
Wow, James...thanks for going so deep with that.  Character encoding is such an annoying problem to deal with.  Good to see that there were a number of workarounds and finally a solution to the issue.



Georg Walther

Apr 21, 2014, 12:53:46 AM
to gremli...@googlegroups.com
James, thank you so much for diving into this and your thorough explanation!

If I understand correctly, Grizzly does handle UTF-8 correctly, but it assumes some other encoding by default unless the encoding used is passed in the Content-Type header of the request?

At any rate, I can confirm that the test code from my initial post works without a flaw now after doing a `pip install bulbs --upgrade`.

Thank you again and Happy Easter!


Georg