SQL Server -- nvarchar and varchar

Alex Vidal

unread,

Jul 20, 2012, 12:34:11 PM7/20/12

to pyo...@googlegroups.com

I'm using pyodbc against a SQL Server database; some of the columns are nvarchar, others are varchar. I'm having a problem where if I feed all unicode to pyodbc, particularly where I query against a column of type varchar, that the results take a significantly longer time to return than they would if I feed it a string. However, if I feed strings to pyodbc (encoded as utf8), then I lose the characters that normally can be represented in my nvarchar fields, if that makes sense. The documentation for types on the pyodbc site says that when given strings it assumes ASCII and uses SQL_CHAR which I'm assuming is why the data in the utf8 string that cannot be represented in ASCII is lost.

To be more specific, I'm working in Django, using a fork of django-pyodbc that I've been working on: https://github.com/avidal/django-pyodbc and I'm trying to figure out a way to sanely determine how to encode the data that the library sends to pyodbc. Because of the nature of the ORM there's currently no way to indicate the encoding of a specific field without creating new fields types (which I'll do if I must), so my current goal is to come up with the right combination of encoding to Do The Right Thing.

I'm using FreeTDS 0.91 with client charset set to UTF-8 in freetds.conf. I'm guessing that at some point, parameters that are used to query against a column of type varchar/char/text need to be encoded to the native type (latin1), while using unicode for the rest of the parameters (to support multiple languages and such). What's the best approach? Should I use `cursor.columns` to determine the column types (and appropriate encoding) and then use that information to do the actual encoding for the non-unicode fields?

The table I'm querying against currently contains 24 million rows; and the column that I query on is varchar(50). Using a string for the query returns results almost immediately, using unicode (which I'm assuming generates a query such as: SELECT * FROM tbl WHERE id = N'<param>') takes upwards of 90 seconds.

Alex Vidal

unread,

Jul 20, 2012, 1:52:41 PM7/20/12

to pyo...@googlegroups.com

In case anyone is curious, what I decided to do was not do any special encoding when calling out to pyodbc. The library (django-pyodbc) currently will encode any unicode as utf8, so I had to disable that. What I did to alleviate the other problem was to specify that my custom field in my model should encode itself to latin1 before going to the database. The end result is that django-pyodbc does no specific encoding, and pyodbc receives unicode in most cases, except where it's troublesome.

I still wonder though, if perhaps it's better to use cursor.columns to determine the column datatypes and encode accordingly for varchar/char/text columns.

Michael Kleehammer

unread,

Aug 16, 2012, 10:44:23 PM8/16/12

to pyo...@googlegroups.com

First, if you can, try the Microsoft's SQL Server driver for Linux. All of my FreeTDS issues were fixed by it.

pyodbc expects everything to be "unencoded", so you shouldn't need any encoding. I use Unicode with SQL Server every day and don't know of any specific issues. Can you give me an example that doesn't work? You also might try checking out the sqlservertests.py module in the tests2 directory.

On Windows, neither string type needs conversion. Windows uses UCS2 internally for Unicode and ODBC uses the same. The only thing necessary on Windows is to call the right function, the ANSI version or the Unicode (W) version.

On other OSs, the Python unicode size, the native wchar_t size, and SQLWCHAR size can differ, so I manaully make the conversion. ODBC does not support any UTF-type encodings however. Since drivers *do* write encodings other than UCS2, I plan on adding a way to tell pyodbc to use UTF8 or UTF16.

Reply all

Reply to author

Forward