Load a Julia DataFrame from a Microsoft SQL Server table


Charles Brauer

Mar 17, 2015, 3:21:39 PM
to julia...@googlegroups.com
Hi,

I'm considering diving into Julia. However, all of my data is in a Microsoft SQL Server database.
I would really appreciate a Julia code example showing how to load a Julia DataFrame from a SQL Server table.

Thanks
Charles

Jacob Quinn

Mar 17, 2015, 3:22:50 PM
to julia...@googlegroups.com
Check out the https://github.com/quinnj/ODBC.jl package for connecting to a DSN defined in your ODBC manager.
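For anyone looking for a concrete starting point, here is a minimal sketch. The DSN name and credentials are placeholders, and the exact function names are assumptions: the ODBC.jl API has changed between releases, so check the package README for your installed version.

```
# Minimal sketch, assuming a DSN named "MSSQL" is defined in the ODBC
# manager; "myuser"/"mypassword" are placeholders. Function names may
# differ across ODBC.jl versions.
using ODBC, DataFrames

conn = ODBC.connect("MSSQL"; usr="myuser", pwd="mypassword")
df = ODBC.query(conn, "SELECT TOP 10 * FROM my_table")  # returns a DataFrame
ODBC.disconnect(conn)
```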

Terry Seaward

Feb 3, 2016, 2:46:25 AM
to julia-users, quinn....@gmail.com
Is there a DBI-compliant version of this? Also, I have endless string conversion issues with this package.

Stefan Karpinski

Feb 3, 2016, 9:48:24 AM
to Julia Users
I've been working through encoding issues with this as well. To connect to Microsoft SQL Server, this patch made things work:

```
diff --git a/src/backend.jl b/src/backend.jl
index b5f24af..bf4ee11 100644
--- a/src/backend.jl
+++ b/src/backend.jl
@@ -40,7 +40,7 @@ end

 # Send query to DMBS
 function ODBCQueryExecute(stmt::Ptr{Void}, querystring::AbstractString)
-    if @FAILED SQLExecDirect(stmt, utf16(querystring))
+    if @FAILED SQLExecDirect(stmt, utf8(querystring))
         ODBCError(SQL_HANDLE_STMT,stmt)
         error("[ODBC]: SQLExecDirect failed; Return Code: $ret")
     end
```

The query string gets passed through to SQLExecDirect:

```
#SQLExecDirect
#Description: executes a preparable statement
#Status:
function SQLExecDirect(stmt::Ptr{Void},query::AbstractString)
    @windows_only ret = ccall( (:SQLExecDirect, odbc_dm), stdcall,
        Int16, (Ptr{Void},Ptr{UInt8},Int),
        stmt,query,sizeof(query))
    @unix_only ret = ccall( (:SQLExecDirect, odbc_dm),
            Int16, (Ptr{Void},Ptr{UInt8},Int),
            stmt,query,sizeof(query))
    return ret
end
```
 
This function just converts whatever its argument is to a pointer. Looking at the docs for this function, the signature is this:

```
SQLRETURN SQLExecDirect(
     SQLHSTMT     StatementHandle,
     SQLCHAR *    StatementText,
     SQLINTEGER   TextLength);
```

and SQLCHAR is defined as unsigned char. So this would seem to be a non-wide character string – i.e. ASCII or UTF-8. And indeed, that's what the Microsoft SQL driver seems to be expecting.

The question I have is this: how the heck is this working for other ODBC drivers? How are they getting pointers to UTF-16 data and interpreting it correctly? The correct fix would seem to be to make this always send UTF-8 strings. But when I made a PR that did that, it seemed to break other setups.
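To make the mismatch concrete, here is a small illustration in plain Julia (using the current string API rather than the 0.4-era `utf8`/`utf16`, so adjust accordingly) of how differently the same query looks to a driver that reads 8-bit code units:

```
# The same ASCII query as UTF-8 vs UTF-16 code units. A driver reading
# the buffer as SQLCHAR (8-bit units) sees the UTF-16LE version as
# "S\0E\0L\0..." -- the zero high bytes look like NUL terminators.
q = "SELECT 1"
transcode(UInt8, q)   # 0x53 0x45 0x4c 0x45 0x43 0x54 0x20 0x31
transcode(UInt16, q)  # 0x0053 0x0045 ... twice the bytes, half of them zero
```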

Terry Seaward

Feb 3, 2016, 2:44:13 PM
to julia-users
From R, it seems like the encoding is based on the connection (as opposed to being hard coded). See `enc <- attr(channel, "encoding")` below:

```
> RODBC::odbcQuery
function (channel, query, rows_at_time = attr(channel, "rows_at_time")) 
{
    if (!odbcValidChannel(channel)) 
        stop("first argument is not an open RODBC channel")
    if (nchar(enc <- attr(channel, "encoding"))) 
        query <- iconv(query, to = enc)
    .Call(C_RODBCQuery, attr(channel, "handle_ptr"), as.character(query), 
        as.integer(rows_at_time))
}
<environment: namespace:RODBC>
```

Digging down, `odbcConnect` is just a wrapper for `odbcDriverConnect`, which has the parameter `DBMSencoding = ""`. This calls the `C` function `C_RODBCDriverConnect` (available here: RODBC_1.3-12.tar.gz), which has no reference to encodings. So `attr(channel, "encoding")` is simply `DBMSencoding`, i.e. `""`.

It seems to come down to `iconv(..., to = "")`, which, from the R source code, uses the attached `win_iconv.c`. I can't seem to find how `""` is handled, i.e. is there some default value based on the system?


```
> RODBC::odbcDriverConnect
function (connection = "", case = "nochange", believeNRows = TRUE, 
    colQuote, tabQuote = colQuote, interpretDot = TRUE, DBMSencoding = "", 
    rows_at_time = 100, readOnlyOptimize = FALSE) 
{
    id <- as.integer(1 + runif(1, 0, 1e+05))
    stat <- .Call(C_RODBCDriverConnect, as.character(connection), 
        id, as.integer(believeNRows), as.logical(readOnlyOptimize))
    if (stat < 0L) {
        warning("ODBC connection failed")
        return(stat)
    }
    Call <- match.call()
    res <- .Call(C_RODBCGetInfo, attr(stat, "handle_ptr"))
    isMySQL <- res[1L] == "MySQL"
    if (missing(colQuote)) 
        colQuote <- ifelse(isMySQL, "`", "\"")
    if (missing(case)) 
        case <- switch(res[1L], MySQL = "mysql", PostgreSQL = "postgresql", 
            "nochange")
    switch(case, toupper = case <- 1L, tolower = case <- 2L, 
        postgresql = case <- 2L, nochange = case <- 0L, msaccess = case <- 0L, 
        mysql = case <- ifelse(.Platform$OS.type == "windows", 
            2L, 0L), stop("Invalid case parameter: nochange | toupper | tolower | common db names"))
    case <- switch(case + 1L, "nochange", "toupper", "tolower")
    rows_at_time <- max(1, min(1024, rows_at_time))
    cs <- attr(stat, "connection.string")
    if (grepl("PWD=", cs)) {
        attr(stat, "connection.string") <- sub("PWD=[^;]+($|;)", 
            "PWD=******;", cs)
        Call$connection <- sub("PWD=[^;]+($|;)", "PWD=******;", 
            connection)
    }
    structure(stat, class = "RODBC", case = case, id = id, believeNRows = believeNRows, 
        colQuote = colQuote, tabQuote = tabQuote, interpretDot = interpretDot, 
        encoding = DBMSencoding, rows_at_time = rows_at_time, 
        isMySQL = isMySQL, call = Call)
}
<environment: namespace:RODBC>
```

Attachments: win_iconv.c, iconv.h

Terry Seaward

Feb 3, 2016, 3:01:50 PM
to julia-users
PS, thanks for the patch.

Tony Kelman

Feb 3, 2016, 5:43:56 PM
to julia-users
Try to avoid posting snippets of GPL-licensed code if at all possible. That license is viral and it's best not to run any risk of using code with that license in more-permissively-licensed Julia projects. Links are fine, as long as the license status of what you're linking to is clear.

Scott Jones

Feb 4, 2016, 12:33:35 AM
to julia-users
The question I have is this: how the heck is this working for other ODBC drivers? How are they getting pointers to UTF-16 data and interpreting it correctly? The correct fix would seem to be to make this always send UTF-8 strings. But when I made a PR that did that, it seemed to break other setups.

SQLCHAR is for encodings with 8-bit code units.  It doesn't imply ASCII or UTF-8 (probably one of the more common character sets used with that is actually Microsoft's CP1252, which is often mistakenly described as ANSI Latin-1 - of which it is a superset).
Even when something says it is UTF-8, it frequently is not *really* valid UTF-8. There are two common variations of UTF-8: CESU-8, used by MySQL and others, which encodes any non-BMP code point as the two UTF-16 surrogates, i.e. 6 bytes instead of the correct 4-byte UTF-8 sequence; and Java's Modified UTF-8, which is the same as CESU-8 except that embedded \0s are encoded in a "long" form (0xc0 0x80), so that you can still use old-fashioned C \0 termination for strings while allowing embedded \0s.
(Note: handling those formats, common in databases, was part of the reason I did the work to rewrite the `convert` functions for strings for Julia.)
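To make the CESU-8 difference concrete, here is a sketch in plain Julia (modern string API, no packages assumed) that builds the CESU-8 bytes for a non-BMP character by hand:

```
# U+1F600 in correct UTF-8 vs CESU-8 (each UTF-16 surrogate encoded as
# its own 3-byte sequence).
s = "\U1F600"
collect(codeunits(s))   # UTF-8: 0xf0 0x9f 0x98 0x80 (4 bytes)

# 3-byte encoding of one 16-bit code unit, as CESU-8 does for surrogates:
enc3(u::UInt16) = UInt8[0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f)]
vcat((enc3(u) for u in transcode(UInt16, s))...)
# CESU-8: 0xed 0xa0 0xbd 0xed 0xb8 0x80 (6 bytes)
```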

The reason your PR didn't work is that you need to find out from the database which character set it is using, and then do the conversions necessary to go back and forth between that character set and whichever Unicode type you are using in Julia (UTF8String, UTF16String, or UTF32String).
One thing that could help is the following package that @nalimilan has been working on: https://github.com/nalimilan/StringEncodings.jl
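For example, a sketch of the round trip with it (the encoding names here are assumptions; what's available depends on the iconv backend on your system):

```
# Sketch: converting between a database character set and Julia strings
# with StringEncodings.jl; "UTF-16LE" and "CP1252" are example names.
using StringEncodings

bytes = encode("SELECT * FROM my_table", "UTF-16LE")  # String -> driver bytes
str   = decode(bytes, "UTF-16LE")                     # driver bytes -> String
cp    = decode([0x80], "CP1252")                      # "€" under Windows-1252
```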
You should also be able to use the SQLWCHAR API, but that can have its own issues; see the following note from Oracle:

    NOTE: The mapping of SQLWCHAR type is somewhat complicated and it can create hidden pitfalls for programmers porting their code from Windows to Linux. Usually a SQLWCHAR character is a 16-bit unit and we will not consider the exotic cases when SQLWCHAR is different. Windows uses UTF-16 and maps SQLWCHAR to 16-bit wchar_t type. However, many Linux versions such as Ubuntu Linux use UTF-32 as an internal character set and therefore their 32-bit wchar_t is not compatible with SQLWCHAR, which is always 16-bit.
 
Your working theory https://github.com/JuliaDB/ODBC.jl/pull/71#issuecomment-175095266 is incorrect; it is an issue of the character encodings used, not of platforms. You see platform differences because the databases you are connecting to have different default character encodings on different platforms.

Páll Haraldsson

Feb 4, 2016, 4:52:33 AM
to julia-users
On Thursday, February 4, 2016 at 5:33:35 AM UTC, Scott Jones wrote:
SQLCHAR is for encodings with 8-bit code units.  It doesn't imply ASCII or UTF-8 (probably one of the more common character sets used with that is actually Microsoft's CP1252, which is often mistakenly described as ANSI Latin-1 - of which it is a superset).

When I read that, I thought it couldn't be true: you can't have a superset (adding letters) without dropping others (which would only make it a superset of a subset). So I looked it up:

"differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range. Notable additional characters are curly quotation marks, the Euro sign, and all the printable characters that are in ISO 8859-15.
[..]
This is now standard behavior in the HTML 5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.[1] In January 2016 1.0% of all web sites use Windows-1252."

Still, despite this 1.0%, I think we should support this encoding in some way, if not with its own 8-bit-only type (I'm not sure we need to support any other 8-bit one); it's no longer just some Microsoft thing, as I had assumed. It is ideal for most of Europe (and even the US/world, because of the curly quotation marks). I've been thinking of doing a string type that does the same as Python: encode in 8-bit when possible, possibly 7-bit (then it can still say it's UTF-8 and fast indexing is known; note that the strings are immutable).

It wouldn't surprise me if "UTF-8" sometimes, incorrectly, included this as the "Latin-1" subset.

I wonder if this screws up sorting. It's not as if the exact position of the Euro sign is too important in alphabetical sorting; I could argue it should be sorted with E/e, but I assume just after A-Z a-z is OK for most.

I had never heard of the "control characters in the 80 to 9F (hex) range", so I assume it's a very obscure/ancient thing that can be taken to never be used anymore.


Even when something says it is UTF-8, it frequently is not *really* valid UTF-8, for example, there are two common variations of UTF-8, CESU-8, used by MySQL and others, which encodes any non-BMP code point using the two UTF-16 surrogate pairs, i.e. to 6 bytes instead of the correct 4-byte UTF-8 sequence, and Java's Modified UTF-8, which is the same as CESU-8, plus embedded \0s are encoded in a "long" form (0xc0 0x80)

Not only those..

I thought the WTF-8 variant of UTF-8 (important for us, because of Windows filenames?) was a joke/vandalism at Wikipedia, until I read more closely what I just saw:


-- 
Palli.

Scott Jones

Feb 4, 2016, 7:48:05 AM
to julia-users
I should have been clearer - it is a superset only of the printable characters; since it reuses assigned (even though pretty much never used) control character positions, it is not truly a superset.  ASCII is a 7-bit subset of ANSI Latin-1, which is an 8-bit subset of UCS-2, which is a 16-bit subset that can represent only the BMP, which is a subset of the Unicode code points (which need 21 bits).
I hadn't ever run across that in my work, but my work was in databases, usually Unix (AIX, Solaris, etc.) or Linux, not so much on Windows any longer; and I added Unicode support before surrogates even existed (they were added in Unicode 2.0, but not actually used until Unicode 3.0).
I'm not sure what you'd want to do to convert that for use in Julia. (btw, I think the initials of the "format" say it all!)

Milan Bouchet-Valat

Feb 4, 2016, 7:53:42 AM
to julia...@googlegroups.com
On Wednesday, February 3, 2016 at 11:44 -0800, Terry Seaward wrote:
> From R, it seems like the encoding is based on the connection (as
> opposed to being hard coded). See `enc <- attr(channel, "encoding")`
> below:
>
> ```
> [...]
>
> Digging down `odbcConnect` is just a wrapper for `odbcDriverConnect`
> which has the following parameter `DBMSencoding = ""`. This calls the
> `C` function `C_RODBCDriverConnect` (available here:RODBC_1.3-
> 12.tar.gz), which has no reference to encodings. So `attr(channel,
> "encoding")` is simply `DBMSencoding`, i.e. `""`.
>
> It seems to come down to `iconv(..., to = "")` which, from the R
> source code, uses `win_iconv.c` attached. I can't seem to find how
> `""` is handled, i.e. is there some default value based on the
> system?
"" refers to the encoding of the current system locale. This is a
reasonable guess, but it will probably be wrong in many cases (else, R
wouldn't have provided this option at all).


Regards

Stefan Karpinski

Feb 4, 2016, 1:13:12 PM
to Julia Users
The real issue is this:

SQLCHAR is for encodings with 8-bit code units.

Condescending lecture on encodings notwithstanding, UTF-16 is not such an encoding, yet UTF-16 is what the ODBC package is currently sending to SQLExecDirect for an argument of type SQLCHAR * – and somehow it seems to be working for many drivers, which still makes no sense to me. I can only conclude that some ODBC drivers are treating this as a void * argument and they expect pointers to data in whatever encoding they prefer, not specifically in encodings with 8-bit code units.

Querying the database about what encoding it expects is a good idea, but how does one do that? The SQLGetInfo function seems like a good candidate but this page doesn't include "encoding" or "utf" anywhere.

Stefan Karpinski

Feb 4, 2016, 1:33:33 PM
to Julia Users
Not a model of clarity (ANSI and Unicode are not encodings), but this page seems to be the best resource on this:


It seems that there's a parallel "Unicode" API for ODBC drivers that support it. Moreover:

Currently, the only Unicode encoding that ODBC supports is UCS-2, which uses a 16-bit integer (fixed length) to represent a character. Unicode allows applications to work in different languages.

So using Klingon is off the table. Although the design of UTF-16 is such that sending UTF-16 to an application that expects UCS-2 will probably work reasonably well, as long as it treats it as "just data".

This still doesn't explain why some drivers are accepting UCS-2/UTF-16 when called with the non-Unicode API.

Scott Jones

Feb 4, 2016, 4:25:15 PM
to julia-users
My reply wasn't meant to be condescending at all, just trying to explain the issue.
UTF-16LE & UTF-16BE *are* encodings of the 16-bit UTF-16 encoding of Unicode onto 8-bit code units.
If the server is sending UTF-16 or UTF-32, you should simply use the *W API, period, because in some places the 8-bit API can have problems with the embedded 0x00 bytes.

Scott Jones

Feb 4, 2016, 4:25:16 PM
to julia-users


On Thursday, February 4, 2016 at 1:33:33 PM UTC-5, Stefan Karpinski wrote:
Not a model of clarity (ANSI and Unicode are not encodings), but this page seems to be the best resource on this:


It seems that there's a parallel "Unicode" API for ODBC drivers that support it. Moreover:

Currently, the only Unicode encoding that ODBC supports is UCS-2, which uses a 16-bit integer (fixed length) to represent a character. Unicode allows applications to work in different languages.

So using Klingon is off the table. Although the design of UTF-16 is such that sending UTF-16 to an application that expects UCS-2 will probably work reasonably well, as long as it treats it as "just data".

That comment is probably years (decades?) out of date.  There is no longer a limit of just UCS-2.
 
This still doesn't explain why some drivers are accepting UCS-2/UTF-16 when called with the non-Unicode API.

When you do so, are you actually calling the functions with the A, or just the macro without either A or W?
The macro will compile to either the A or the W form, depending on how your application is built.

This is a better page in MSDN: https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx describing what is going on.

Stefan Karpinski

Feb 4, 2016, 4:29:46 PM
to Julia Users
On Thu, Feb 4, 2016 at 1:50 PM, Scott Jones <scott.pa...@gmail.com> wrote:

This still doesn't explain why some drivers are accepting UCS-2/UTF-16 when called with the non-Unicode API.

When you do so, are you actually calling the functions with the A, or just the macro without either A or W?
The macro will compile to either the A or the W form, depending on how your application is built.

This is a better page in MSDN: https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx describing what is going on.

The ODBC package calls the functions without A or W. What it's calling can't be a macro since macros aren't callable via ccall. But changing ODBC to call the W version of everything may be the fix here.
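A sketch of what that could look like, adapting the wrapper quoted earlier (untested; note that the W entry points take TextLength in characters, not bytes, and `odbc_dm` is the driver-manager library handle from the package's own code):

```
# Hypothetical: call the wide-character entry point explicitly with
# UTF-16 code units instead of the suffix-less SQLExecDirect symbol.
function SQLExecDirectW(stmt::Ptr{Cvoid}, query::AbstractString)
    u16 = transcode(UInt16, query)
    ccall((:SQLExecDirectW, odbc_dm), stdcall, Int16,
          (Ptr{Cvoid}, Ptr{UInt16}, Cint),
          stmt, u16, length(u16))   # length in characters for the W API
end
```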

Scott Jones

Feb 4, 2016, 11:39:08 PM
to julia-users
That very well may be the solution: looking, for example, at libiodbc on the Mac, it has 3 different versions of all those functions, and I'm not sure just what behavior you get when using the form without the A or W.  I've always used ODBC via the C headers, unlike the direct linking that Julia is doing, so I always get the W version since I compile as a Unicode build.

Jacob Quinn

Feb 4, 2016, 11:42:40 PM
to julia...@googlegroups.com
That's a big part of the "remodel" I've been working on: always calling the "W" version of the functions and using UTF8 consistently (see here: https://github.com/JuliaDB/ODBC.jl/blob/jq/remodel/src/API.jl). I would certainly welcome those willing to test various configurations/setups.

-Jacob

David Anthoff

Feb 5, 2016, 1:18:19 PM
to julia...@googlegroups.com

https://msdn.microsoft.com/en-us/library/ms716246%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396

suggests that if you call the version without the A or W suffix, you get the ANSI version.

Stefan Karpinski

Feb 5, 2016, 1:44:20 PM
to Julia Users
It does, but that's not what we're seeing – at least with some ODBC drivers.

Jameson

Feb 9, 2016, 12:38:10 AM
to julia-users
Calling the ANSI version doesn't preclude the possibility of getting UTF-16 data back. In particular, that would be code page 1200 (UTF-16LE) or 1201 (UTF-16BE) on Microsoft Windows. MSDN is inconsistent in its usage of A (whether it means ANSI, OEM, localized-locale, or application-dependent, aka other) and generally makes no statement about how the bytes may need to be handled or interpreted.

Stefan Karpinski

Feb 9, 2016, 7:22:36 AM
to Julia Users
So that's basically a useless API then? I guess we should use the W version everywhere.