How to represent a "table" type with potential missing values?


Jacob Quinn

Jun 30, 2015, 11:22:42 PM
to juli...@googlegroups.com
There have been many discussions scattered across mailing lists and packages on the subject, but I'm looking to crowdsource some discussion/ideas on several outstanding issues I'm currently trying to resolve across a few packages. The packages in question are:

* ODBC.jl: returning data from a DBMS system
* SQLite.jl: data is stored in an SQLite database table; data can be pulled out
* CSV.jl: reading CSV files into memory, with potential missing values

I've considered a few different ideas for a consistent, "table" type, with the key design decisions including:

* How to represent NULL values; some native NULL type or sentinel values
* Whether to preserve type information of individual columns (obviously we probably want to)
* How to associate column headers with table data; do we discard the column names or utilize some type that incorporates them?

Some concrete representations I've been considering are:

* Matrix{Any} or Vector{Vector{T}}: 
    * Matrix{Any} totally punts on types and stores every value as `Any`; this approach isn't without precedent: it's actually how SQLite stores data, and it makes for extremely simple parsing/storing/fetching code
    * Vector{Vector{T}} is a step up because it preserves the element type `T` of each column vector, though the collection of columns itself is still held in a Vector{Any}
    * In both of these cases, NULL values are represented by sentinel values; e.g., CSV.read() has keyword arguments to the effect of null_int=typemin(Int) and null_float=NaN (see the sketch following this list)
* NullableArrays
    * I had a really great conversation and introduction to the NullableArrays.jl package from John Myles White and David Gold last week at JuliaCon; David showed me the extensive testing/performance benchmarking/designing that went into the package and they really have some core ideas figured out there
    * The "table" type would still need to be a NullableArray{Any,2} or Vector{NullableArray{T}}, depending on if we preserve types of the columns or not
* AxisArrays
    * Also got introduced to this new nifty package from Matt Bauman last week; while it's extremely new (and depends on unregistered packages), the cool idea is that this approach allows incorporating the column names (as opposed to discarding them or returning a separate Vector{String}, as Base.readcsv does currently)
* Tables.jl
    * This is a new package I'm developing for a Table type that wraps an SQLite table underneath and has efficient sub-Table operations (through SQLite views) and allows direct usage of SQL statements through convenience indexing. 
    * While this has been emerging in my mind as the potential "universal table type", it's also reliant on the functionality/limitations of SQLite underneath.
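
To make the trade-offs concrete, here's a tiny sketch of one integer column with a missing value under each layout (the NullColumn type is my own stand-in for the NullableArrays idea, not its actual API):

# Hypothetical 3-row integer column, second entry missing, in each layout.

# (1) Matrix{Any}: no column types; NULL via a sentinel
anycol = Any[1, typemin(Int), 3]          # typemin(Int) "is" NULL

# (2) Vector{Vector{T}}: typed columns, still sentinel-based NULLs
cols = Any[["a", "b", "c"], [1, typemin(Int), 3]]

# (3) NullableArrays-style: typed values plus an explicit per-element mask,
#     so no value of T has to be sacrificed as a sentinel
immutable NullColumn{T}
    values::Vector{T}
    isnull::Vector{Bool}
end
intcol = NullColumn{Int}([1, 0, 3], [false, true, false])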

I'd really appreciate other ideas or suggestions people have and other considerations worth thinking about here.

-Jacob

Tom Short

Jul 1, 2015, 7:45:49 AM
to julia-dev

Another prominent candidate is a DataFrame or AbstractDataFrame with NullableArray columns.

Tom Breloff

Jul 1, 2015, 9:56:18 AM
to juli...@googlegroups.com
Hi Jacob.  The reason you're having such a hard time with it is that "it depends".  There are lots of different ways to store table data, and the best way depends entirely on what type of data is there and how you plan on accessing it.  Reading in a 1000-column table of which you plan on using only a few columns?  Use something similar to DataFrame (or, equivalently, Dict{Symbol, Vector{T}}).  Have a big time-series table where you typically access one row at a time?  Vector{CustomImmutableType} might be a good option (generated from a macro?).  I frequently stream datasets (as opposed to loading them all into memory), so simple iteration through an on-disk dataset would be huge for me.

I think the most useful approach would be to pick one arbitrarily (maybe Vector{TableRow}, where TableRow is a Vector{Any}), but then have a simple framework for the user to specify an alternative structure given their expected use case and table type, and provide iteration for cases where you don't need the whole dataset all at once.  A sketch of what I mean follows.
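
(Hypothetical names throughout; TickRow stands in for any user-defined schema.)

# Default representation: untyped rows.
typealias TableRow Vector{Any}

# A user-supplied specialized row type for a known schema.
immutable TickRow
    time::Float64
    price::Float64
    size::Int
end
TickRow(r::TableRow) = TickRow(r[1], r[2], r[3])

# Generic streaming: consume rows one at a time, never materializing the table.
function stream_rows(f, rows)           # `rows` is any iterable of TableRow
    for r in rows
        f(TickRow(r))                   # convert to the user's structure on the fly
    end
end

stream_rows(r -> println(r.price), Any[Any[1.0, 101.5, 300], Any[2.0, 101.7, 250]])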

If you want more collaboration on this, let me know.

David Anthoff

Jul 1, 2015, 2:06:46 PM
to juli...@googlegroups.com

I also struggled with similar questions when I created the ExcelReaders.jl package. In the end I support reading into DataMatrix and DataFrame, which seem the most commonly used at the moment; I just hand coded those two… But I guess in reality we will have N different types of table-like data structures and M different types of sources, and ideally users could just combine them freely. I think CSVReaders.jl had some ideas that looked very promising, but I never got around to investigating that fully.

Scott Jones

Jul 1, 2015, 2:48:10 PM
to juli...@googlegroups.com


On Tuesday, June 30, 2015 at 11:22:42 PM UTC-4, Jacob Quinn wrote:
There have been many discussions scattered across mailing lists and packages on the subject, but I'm looking to crowdsource some discussion/ideas on several outstanding issues I'm currently trying to resolve across a few packages. The packages in question are:

* ODBC.jl: returning data from a DBMS system
* SQLite.jl: data is stored in an SQLite database table; data can be pulled out
* CSV.jl: reading CSV files into memory, with potential missing values

I've considered a few different ideas for a consistent, "table" type, with the key design decisions including:

* How to represent NULL values; some native NULL type or sentinel values

If you want this to be general, I'd forget about sentinel values. (I've had very bad experiences with sentinel values in the past also!)
 
* Whether to preserve type information of individual columns (obviously we probably want to)

That can be difficult: you want to preserve as much information as you can, but since every DB's types are different, you can't really have a one-to-one mapping between the types in the DB and the types you need to use in Julia.
 
* How to associate column headers with table data; do we discard the column names or utilize some type that incorporates them?

I'd have a type that contains both the metadata (not just for columns) and the data itself.
 
Some concrete representations I've been considering are:

* Matrix{Any} or Vector{Vector{T}}: 
    * Matrix{Any} totally punts on types and stores every value as `Any`; this approach isn't without precedent: it's actually how SQLite stores data, and it makes for extremely simple parsing/storing/fetching code
    * Vector{Vector{T}} is a step up because it preserves the element type `T` of each column vector, though the collection of columns itself is still held in a Vector{Any}
    * In both of these cases, NULL values are represented by sentinel values; e.g., CSV.read() has keyword arguments to the effect of null_int=typemin(Int) and null_float=NaN
 
I'd *strongly* recommend avoiding using sentinel values.
If you don't already have a Vector{Any}, I'd want something that just used a bit per column.
I think I saw something recently that looked like it did that.

* NullableArrays
    * I had a really great conversation and introduction to the NullableArrays.jl package from John Myles White and David Gold last week at JuliaCon; David showed me the extensive testing/performance benchmarking/designing that went into the package and they really have some core ideas figured out there
    * The "table" type would still need to be a NullableArray{Any,2} or Vector{NullableArray{T}}, depending on if we preserve types of the columns or not
Maybe that's what I was thinking of.  I do think you need to be able to represent some subset of a table: say, you load the first 10K rows of a petabyte table, then the next, etc.

* AxisArrays
    * Also got introduced to this new nifty package from Matt Bauman last week; while it's extremely new (and depends on unregistered packages), the cool idea is that this approach allows incorporating the column names (as opposed to discarding them or returning a separate Vector{String}, as Base.readcsv does currently)
  What is this package?  Sounds interesting.

Matt Bauman

Jul 1, 2015, 3:28:09 PM
to juli...@googlegroups.com
I should clean up AxisArrays and register it so others can start using it more easily. Right now there's not much functionality, except as a simple array wrapper that labels dimensions and individual rows/cols/etc. in a very generic way. Since it just wraps an array, its major limitation for a database application is that it doesn't support heterogeneous column types. You can find it at github.com/mbauman/AxisArrays.jl.
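
To illustrate the wrapping idea (and the homogeneous-eltype limitation), here's a toy version; these are my names, not AxisArrays' actual API:

# A plain array wrapper carrying column names; every column shares one eltype.
immutable Labeled{T,N,A<:AbstractArray}
    data::A
    colnames::Vector{Symbol}
end
Labeled{T,N}(data::AbstractArray{T,N}, names::Vector{Symbol}) =
    Labeled{T,N,typeof(data)}(data, names)

getcol(l::Labeled, name::Symbol) = l.data[:, findfirst(l.colnames, name)]

m = Labeled([1 2; 3 4], [:a, :b])
getcol(m, :b)    # => [2, 4]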

I think Tom is right: I'm not sure there's one data structure that will always work. There are two levels where we can choose either array-of-structs or structs-of-arrays. Is a table of heterogeneous columns a vector of its rows? Or is it a group of column vectors? The same transformation can be made between Nullable and NullableArrays. And then there are times when you want a homogeneous matrix to send to BLAS for a regression or PCA...

But there is something extremely powerful in having just one data structure. So can we find something like R's data.frame that works for most uses? My knowledge here is limited. But could we have both a vector of rows and a strided vector view into that data structure?

Tom Breloff

Jul 1, 2015, 3:59:43 PM
to juli...@googlegroups.com
I haven't used AxisArrays yet, but I think that's the right approach: have a simple wrapper around AbstractArrays with named rows/cols.  Most functionality should be defined as abstractly as possible, and the actual table data structure could be data-dependent (either specified by the user, or automatically chosen somehow).  This way you can swap out the underlying structure easily without redefining all associated logic.

One example I'd like to see is a select-type call that populates a Matrix{Float64} given a list of column indices/names.  It would be a shame if you have to load a table into some massive structure only to convert it to a matrix after the fact.

Having an iterator on a table is really important as well; with one, many specialized structures could be built from a user-specified row type.  I'm thinking of something like:

itr = SQLTableIterator(tablename)
for row_ptr in itr
    update!(somestruct, MySpecializedRowType(row_ptr))
end

I can't comment on how easy this is for SQLite, but for csv text files it should be somewhat straightforward.
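
For illustration, here's a minimal sketch of streaming rows out of a csv file without materializing it (hypothetical helper and file name; real parsing needs quoting/escaping handling):

# Stream one parsed row at a time to a callback; nothing is held in memory.
function each_csv_row(f, path; delim=',')
    open(path) do io
        for line in eachline(io)
            f(split(chomp(line), delim))
        end
    end
end

each_csv_row(row -> println(row[1]), "data.csv")   # "data.csv" is hypothetical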

David Gold

Jul 1, 2015, 6:48:43 PM
to juli...@googlegroups.com
@Jacob have you seen the latest developments in https://github.com/JuliaStats/DataFrames.jl/issues/744 ? That thread has some good discussion of these issues, in particular the question of how to encode column type information in the table type. You may also consider Simon's and Jarrett's sketches of DataFrames revisions as other potential candidates.

For what it's worth, here are my thoughts after spending some time talking to a number of folks who are much more well-informed w/r/t these matters than I am.

In short, I see a lot of merit in having the go-to table type (hereafter TT) be a thin interface to a SQLite backend as in Jacob's Tables.jl design. To expand on this position, I'll reiterate one of John's points from #744 above: in order to evaluate the above options it would help to specify the functions for which TT will be responsible. Off the top of my head, I imagine users will at least want to be able:

1) to read data -- either wholesale, or iteratively via some buffer -- into a Julia object via TT's interface
2) to query/manipulate/recode their data via TT's interface
3) to conduct exploratory analysis and to summarize/visualize their data via TT's interface with appropriate packages (e.g. Gadfly.jl)
4) to pipe their processed data into a modeling/learning method via TT's interface with appropriate packages (e.g. GLM.jl, Mocha.jl)
5) to save their processed data, export it to an external site or pipe it to some other process via TT's interface

Having SQL querying/munging functionality out of the box would let Julia developers focus on refining the interface between Julia's modeling, visualization/exploration and data management facilities. For one, designing TT as a wrapped SQLite database would (based on my superficial familiarity with these issues) at least allow us to build on the latter's extant, and in some cases highly optimized, facilities for performing functions (1) and (2) (and (5)?) above.

For two, having the primary data management engine live "outside" of TT may actually support better abstractions. In many -- perhaps most -- use cases, data needs to be transformed in order to work with it. For instance, fitting a linear model to the data from a DataFrame involves deriving ModelFrame and ModelMatrix objects from the relevant Formula object and the DataFrame in question. As Tom points out below, and as John mentioned in his talk last week, trying to design the primary tabular data storage type to lend itself to all or even a majority of use cases is just not going to work, especially when these use cases are evolving by the day. Purposefully designing TT as an abstraction over the actual data management engine, as opposed to the engine itself, allows us to put the design emphasis on "materialization" methods (I didn't come up with this terminology, but I like it). That is, instead of focusing on the design of a single tabular data type, we instead focus on designing generic, extensible methods for materializing data from the backend into in-memory Julia structures that are appropriate for the use case at hand. Here I can (tentatively) see some significant opportunities for decoupling data management functionality/abstraction from modeling functionality/abstraction. So implementing TT as a SQLite wrapper and embracing a "materialization"-centric framework has the potential to score well with regards to (3) and (4), too. 
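
To make the materialization idea concrete, here is a hedged toy sketch (all names hypothetical; an in-memory backend stands in for whatever primitive interface TT actually exposes):

abstract AbstractTT

# A toy backend implementing two assumed primitives: colnames and fetchcol.
immutable MemTT <: AbstractTT
    names::Vector{Symbol}
    cols::Vector{Vector{Float64}}
end
colnames(tt::MemTT) = tt.names
fetchcol(tt::MemTT, name::Symbol) = tt.cols[findfirst(tt.names, name)]

# A materializer written once against the primitives: a dense design matrix
# for modeling code, pulling only the requested columns.
function materialize(::Type{Matrix{Float64}}, tt::AbstractTT, cols::Vector{Symbol})
    hcat([fetchcol(tt, c) for c in cols]...)
end

tt = MemTT([:x1, :x2, :y], Vector{Float64}[[1., 2.], [3., 4.], [5., 6.]])
materialize(Matrix{Float64}, tt, [:x1, :y])   # 2x2 design matrix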

For three, my sense is that such a design may also simplify implementing streaming data access, which would then amount to designing a specialized materialization pathway for streaming data into a Julia object via some buffering scheme. (At least, this is my impression, but I have very little experience with IO, so I do hope others will chime in.)

For many modeling use cases, we'd want to supply (say in StatsBase) methods for materializing data (with respect to a formula) as a design matrix, i.e. a Matrix{Float64}, and as whatever other structures are most amenable to learning / probabilistic programming. 

To be sure, there will be use cases in which one would want to materialize tabular data as an actual, in-Julia table. I imagine that a good number of these use cases will fall under (3) above, i.e. exploration/visualization/summarization or related objectives, since one may wish to have missing values represented in the materialized structure during these activities. Here I am in favor of incorporating NullableArrays as a means of representing missing values. So, the materialization could be a Vector{NullableVector}, or some wrapped version that includes column type information in the materialization's type. My hope is that by circumscribing an in-Julia tabular data structure to just one particular materialization pathway amongst others, it will then be easier to identify a type design since the metrics against which to evaluate the type's performance need only concern the typical use cases for which we expect this pathway to be called. Furthermore, it seems that a widely performant in-Julia tabular data structure simply won't be possible (at least without unwieldy syntax and exposed metaprogramming) until Base develops along a number of fronts. So I think it makes most sense to design this structure with as limited as possible a set of use cases in mind at first, and then expand it where appropriate as relevant changes land in Base. 

I'm also beginning to consider that, in the spirit of extensible, diverse materialization and also in response to some comments that Tom Breloff made at JuliaCon, it may be worth trying to generalize the scheme of NullableArrays to an abstract type. The goal here would be to allow users to extend their own custom-defined container/iterator types to Nullable versions that (a) make use of the `mask`, `values` structure of NullableArrays for optimized performance and (b) save users from having to rewrite a handful of sufficiently generic, inheritable methods for working with containers of Nullable objects. I'm not sure how feasible/useful that would be, but the prospect is interesting.
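
A hedged sketch of what that generalization might look like (field and method names are my own guesses, not NullableArrays' actual design):

# Subtypes supply `values` and `isnull` fields; generic methods come for free.
abstract AbstractNullableContainer{T}

Base.getindex{T}(c::AbstractNullableContainer{T}, i::Int) =
    c.isnull[i] ? Nullable{T}() : Nullable(c.values[i])

countnull(c::AbstractNullableContainer) = sum(c.isnull)

# A user-defined container hooking into the interface:
type MyNullVec{T} <: AbstractNullableContainer{T}
    values::Vector{T}
    isnull::Vector{Bool}
end

v = MyNullVec{Int}([1, 0, 3], [false, true, false])
get(v[1]), countnull(v)    # => (1, 1)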

Anyway, them's my thoughts.

On Tuesday, June 30, 2015 at 11:22:42 PM UTC-4, Jacob Quinn wrote:

John Myles White

Jul 1, 2015, 8:24:25 PM
to juli...@googlegroups.com
FWIW, here's my current take on this:

* The best type representation for table data in general is Ptr{Void}. Seriously. Unless you need to say how a table is laid out in memory, do everything you possibly can to avoid exposing a well-defined memory layout. Define a table by the operations you can perform on it (e.g. SQL ops) and by the ways you can convert it to Julia types with specific memory layouts. In general, most data analysis should start in a remote database system and should stay there as long as humanly possible. The best possible data analysis is the one that never uses anything but SQL.

* When you're forced to materialize things into Julia objects with precise memory layouts, you really do need to expose a variety of options. But I'm increasingly happy with something like an OrderedDict{NullableVector} representation. The only problem there is the absence of type constraints on the columns, but it's not clear how often that's really a problem unless you're going to try looping over rows. For that use case, I think we need something more custom, like Matt Bauman's suggestion of an array-of-structs rather than a struct-of-arrays.

For database libraries, I think the right representation should be a cursor object, which comes along with a lot of convert-style methods that let you turn the current row (or group of rows) into an arbitrary Julia data structure depending upon your needs.
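
A hedged toy sketch of the cursor idea (hypothetical names, with an in-memory vector standing in for a live result set):

type Cursor
    rows::Vector{Vector{Any}}   # stand-in for a live DB result
    pos::Int
end

advance!(c::Cursor) = (c.pos += 1; c.pos <= length(c.rows))
current(c::Cursor) = c.rows[c.pos]

# A convert-style materializer against the current row:
rowdict(c::Cursor, names::Vector{Symbol}) = Dict(zip(names, current(c)))

c = Cursor(Vector{Any}[Any[1, "a"], Any[2, "b"]], 0)
while advance!(c)
    println(rowdict(c, [:id, :label]))
end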

 -- John

Jeffrey Sarnoff

Jul 1, 2015, 11:25:59 PM
to juli...@googlegroups.com
I agree that database-based analytics are much better done with SQL. SQLite is one of the most widely incorporated sizeable third-party applications,
and it is going to be better yet with version 4 (dunno when that happens).   While it is not nearly as capable as e.g. PostgreSQL, it supports PDQ in-memory databases with a comparatively small footprint.  The "oh, well" part comes with missing stored procedures and coping [or not] with a way-too-cumbersome extension mechanism.

Tables will out... big data is getting bigger, and there is going to be spread.  Absent disagreement, I think it would be significant if there were a user-welcoming, quick-enough, general way to accept [present] tabled information for [from] safe-keeping that lets me fetch and SQL-query already-refined remote data. 
On that:

        We can view each possible table as a well-typed, well, type.

        When receiving something to be tabled in julia, one may use the type specs to construct an envelope type that wraps data, so it becomes easier to convey. If type matching to an external app is required, that information is used with the envelope type to generate an invertible multimap.  Save the envelope type with the direct and inverse multimaps as operational metadata (functional metas).  When preparing something obtained from a tabled source, the operational metadata, where given, is used to re-present information.  Where no metainfo has been stored, column types may be deduced or, optionally, queried.


        # a tiny strawman

        import Base: show, getindex

        abstract AbstractTable

        type VARCHAR{n}         # <: String   (not doing this for the example)
           s :: AbstractString
           VARCHAR(s) =
              (length(s) <= n) ? new(s) :
                 throw(ArgumentError("string exceeds $n characters"))
        end
        show(io::IO, s::VARCHAR) = show(io, s.s)
        getindex(s::VARCHAR, v::UnitRange{Int64}) = getindex(s.s, v)
        # etc

        StringOf4 = VARCHAR{4};

        testme = StringOf4("abc");
        (testme, testme[1:2])

        ohno = StringOf4("12345");    # throws: too long for StringOf4

        StringOf60 = VARCHAR{60}

        type ExampleTable <: AbstractTable
           name :: StringOf60
           email:: StringOf60
           age  :: Int32
        end
        

Scott Jones

Jul 2, 2015, 9:34:46 AM
to juli...@googlegroups.com
PDQ?  Physician's Data Query or Paint Data Query?  Those are the two acronyms I'm used to, but could you explain your use?

Scott Jones

Jul 2, 2015, 9:51:00 AM
to juli...@googlegroups.com
The tricky bits come in because there are `CHAR`, `VARCHAR`, `NVARCHAR`, etc., and you have to deal with different encodings;
your check for 4 "characters" will fail in general, because for many DBMSes the database really wants 4 "bytes", not 4 characters.
There are also as many issues with numbers. Julia would really need good decimal float support; @stevengj has gotten off to a great start with `DecFP.jl`, but that may need to be changed to use the `decNumber` library instead of the Intel library (he's looking into that). Sizes of numbers matter too: I seem to remember one database having a 3-byte integer value.

I've tended to map things to a simpler set of types for use, i.e. a single Unicode-capable text string/blob type, a binary string/blob type,
integral values, binary float and decimal float values, date and date+TZ, and of course null, true, false. A sketch of that kind of mapping follows.
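
(Illustrative names and choices only; real DBMS type systems vary by vendor, and this assumes a 0.4-era Base where Date is available.)

const DBTYPE_TO_JULIA = Dict{ASCIIString, DataType}(
    "VARCHAR"  => UTF8String,    # one Unicode-capable text type
    "NVARCHAR" => UTF8String,
    "BLOB"     => Vector{UInt8}, # binary string/blob
    "INTEGER"  => Int64,         # widest integral type covers 3-byte ints etc.
    "FLOAT"    => Float64,
    "DATE"     => Date,
)

juliatype(dbtype::AbstractString) =
    get(DBTYPE_TO_JULIA, uppercase(dbtype), Any)   # fall back to Any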

Your ideas are very interesting.
I wouldn't want to have different sets of types defined for each DB; were you thinking of having things like VARCHAR defined separately from the databases?
I think some unification would be needed to make it practical if you are dealing with multiple databases (common these days).


Milan Bouchet-Valat

Jul 2, 2015, 3:01:59 PM
to juli...@googlegroups.com
On Wednesday, July 1, 2015, at 17:24 -0700, John Myles White wrote:
> FWIW, here's my current take on this:
>
> * The best type representation for table data in general is
> Ptr{Void}. Seriously. Unless you need to say how a table is laid out
> in memory, do everything you possibly can to avoid exposing a well
> -defined memory layout. Define a table by the operations you can
> perform on it (e.g. SQL ops) and by the ways you can convert it to
> Julia types with specific memory layouts. In general, most data
> analysis should start in a remote database system and should stay
> there as long as humanly possible. The best possible data analysis is
> the one that never uses anything but SQL.
So basically what we need is a general API like that defined by
DataFramesMeta.jl. It would either run the operations in SQL when
called on a SQL table, or in Julia when called on a DataFrame. Is that
what you have in mind?
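
Something along these lines, perhaps (a hedged sketch with hypothetical names; DataFramesMeta.jl's real API is macro-based and richer):

abstract AbstractTable

immutable SQLTable <: AbstractTable
    name::UTF8String
end

immutable MemTable <: AbstractTable
    cols::Dict{Symbol, Vector}
end

# SQL backend: emit the query text (a real package would execute it remotely).
selectwhere(t::SQLTable, clause::AbstractString) =
    string("SELECT * FROM ", t.name, " WHERE ", clause)

# Julia backend: run the predicate directly, returning matching row indices.
selectwhere(t::MemTable, col::Symbol, pred::Function) =
    find(pred, t.cols[col])

selectwhere(SQLTable("people"), "age > 30")
selectwhere(MemTable(Dict{Symbol,Vector}(:age => [25, 40])), :age, x -> x > 30)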

> * When you're forced to materialize things into Julia objects with
> precise memory layouts, you really do need to expose a variety of
> options. But I'm increasingly happy with something like a
> OrderedDict{NullableVector} representation. The only problems there
> are the absence of type constraints on the columns, but it's not
> clear how often that's really a problem unless you're going to try
> looping over rows. For that use case, I think we need something more
> custom, like Matt Bauman's suggestions of an array-of-structs rather
> than a struct-of-arrays.
Even for looping over rows, isn't it easy to compile a Julia function
and pass it to SQL so that it takes care of running it? With macros, it
can be both generic, easy to write and very efficient. That doesn't
mean Julia structs cannot be used for an alternative pure-Julia
implementation.


Overall I totally agree with the ideas developed in this thread.


Regards

Jeffrey Sarnoff

Jul 2, 2015, 7:14:21 PM
to juli...@googlegroups.com
PDQ .. "pretty darn quick"

Jeffrey Sarnoff

Jul 2, 2015, 7:19:44 PM
to juli...@googlegroups.com
Yes, absolutely: julia abstracts.
I agree that a priority would be not getting deep into the weeds of any particular database.

David Anthoff

Jul 2, 2015, 7:58:36 PM
to juli...@googlegroups.com
I think MS really nailed the query part of such a design with the underpinnings of LINQ. You essentially get a completely storage-agnostic, rich API for data access that works for in-memory data, SQL databases, or anything else you can think of. They had some really smart people come up with that design; they always lost me when they described how it was all based on monads/monoids etc., but as a user it was an incredibly handy thing to work with.

The thing they never solved with anything even close to the same elegance was data modification...

But I do think that having another look at some of the design ideas there might be helpful when designing something for julia. F# also later had a feature called type providers that essentially allowed you to get strongly typed representations of data into the system, that might also have some nice ideas.


Scott Jones

Jul 2, 2015, 9:25:13 PM
to juli...@googlegroups.com


On Thursday, July 2, 2015 at 7:14:21 PM UTC-4, Jeffrey Sarnoff wrote:
PDQ .. "pretty darn quick"

Ha, that shows I spent too much time dealing with the healthcare industry!  I need to get out more!  Thanks!

David Anthoff

Jul 3, 2015, 7:39:18 PM
to juli...@googlegroups.com
I went back and found the original whitepaper that introduced LINQ and the language features in C# that support it. It still is a good read:

https://msdn.microsoft.com/en-us/library/bb308959.aspx

I feel that a lot of the discussion around how the data handling could be improved in Julia would benefit tremendously from trying to incorporate some of the LINQ stuff. For example, the way type information is preserved even through super complicated data transformation steps is just impressive in LINQ, and probably very applicable to the problems of type stability that come up in the discussions around julia.

The whole philosophy also (I think) lines up really nicely with John's suggestion that the best representation would just be a Ptr{Void}. It is not as radical, but the basic representation for data is an IEnumerable, which just defines a very minimal set of behaviors, and then the whole framework sits nicely on top.
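
For reference, Julia's existing iteration protocol can already express that minimal IEnumerable-like contract; a hedged toy sketch (0.4-era start/next/done, hypothetical names):

immutable RangeRows          # a toy "table" that generates its rows lazily
    n::Int
end
Base.start(::RangeRows) = 1
Base.done(r::RangeRows, i) = i > r.n
Base.next(r::RangeRows, i) = (Any[i, i^2], i + 1)

# A generic operation written only against iteration, storage-agnostic:
function takerows(src, k::Int)
    out = Vector{Any}[]
    for row in src
        push!(out, row)
        length(out) == k && break
    end
    out
end

takerows(RangeRows(1000), 3)   # first three rows; nothing else materialized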

Cheers,
David

PS: And probably one of the more crazy manifestations of the real flexibility of the LINQ design is this:

https://msdn.microsoft.com/en-us/library/dn749872.aspx

where you can run a normal LINQ query as a distributed HIVE query. There are countless examples like this, where the original LINQ design was later used for things probably no-one had imagined originally.

John Myles White

Jul 4, 2015, 12:07:43 AM
to juli...@googlegroups.com
I'm sure there are lots of great ideas in LINQ. My feeling is that we're quite far from being able to implement something like LINQ, since we first need to achieve consistency at a much lower level of abstraction for basic database mechanisms. That said, it would be great if somebody started working on implementing something like LINQ for Julia. I suspect it's going to be hard, given that LINQ was optimized for languages with features that Julia lacks: perfect static knowledge of types and functions with return types. Still, I think we'd learn a lot from seeing how someone tries to translate LINQ into Julia.

-- John

David Anthoff

Jul 6, 2015, 5:55:50 PM
to juli...@googlegroups.com
> My feeling is that we're quite far from being able to implement something
> like LINQ since we need to achieve consistency at a much lower level of
> abstraction for basic database mechanisms.

Can you elaborate a bit? It is not clear to me what you have in mind here.

> That said, it would be great if somebody started working on
> implementing something like LINQ for Julia. I kind of suspect it's going to be
> hard given that LINQ was optimized for languages with features that Julia
> lacks -- perfect static knowledge of types and functions with return types.

Agreed, they essentially had one release where the C# and VB language teams
and all the other involved parties coordinated and got all the moving parts
in order to enable the LINQ scenario.

> That said, I think we'd learn a lot from seeing how someone tries to translate
> LINQ into Julia.

Well, maybe this is just something to keep in the back of someone's mind as a
killer julia 2.0 feature. Not sure whether this is a good idea, but one could
even go with a plan where for now there isn't some grand unification of data
access packages/structures/whatever, but the current (creative!) chaos
continues. And then, in a couple of years, one tries to pull something like
LINQ off, including the language features that are needed.

John Myles White

Jul 7, 2015, 4:45:57 AM
to juli...@googlegroups.com
What I mean is that it's currently not possible even to connect to most databases, because we don't have fully fleshed-out wrappers for the relevant C APIs. We still need to get that stuff straight before more abstractions would offer any new functionality.

-- John

Jacob Quinn

Jul 8, 2015, 12:34:38 AM
to juli...@googlegroups.com
Thanks everyone for the great replies. It's great to hear that others are grappling with some of the same issues I'm bumping into.

One thought I've had, though I'm not sure it will necessarily be feasible, is to have an abstract DataTable type/package that would define a minimal interface for "table"-like types. I've looked at AbstractDataFrame, but I think even that is too restrictive and "dataframe"-y. I'm imagining a more minimal interface that other types could subtype and that packages like Gadfly, Mocha, GLM, etc. could define methods against in order to accept a wider variety of input types. Obviously the complication, and possible infeasibility, comes from those packages (Gadfly, GLM, etc.) needing to "reach into the internals" too much for performance reasons, but my hope would be that they could just define a single method against a "DataTable" type that would work out of the box for an SQLite table, NullableMatrix, DataFrame, etc. Perhaps without multiple inheritance this needs a "trait" approach, but I'm not sure. Another idea would be to define it something like:

type DataTable{T}
    table::T
end

which would basically wrap any kind of "table" type, and the "DataTables.jl" package could provide a consistent external API while adding all the "internal" methods for a bunch of different table types. That way other packages could define methods against `DataTable` and use the external API while "accepting" a variety of input types. This would be an "interface-through-wrapper-type" approach, which I actually haven't heard of before, so that probably means it's a bad idea. It does, however, kind of get to John's point of having an "opaque" pointer table type, since you would define methods against DataTable without really knowing what you'd be getting as the internal storage. A rough sketch of how the dispatch might work follows.
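
(Hypothetical backend and internal method names throughout; this is just the shape of the idea, not working package code.)

ncols(dt::DataTable) = _ncols(dt.table)       # public API, dispatches inward

# per-backend internals (e.g. a toy in-memory columns-and-names backend):
immutable MemCols
    names::Vector{Symbol}
    cols::Vector
end
_ncols(t::MemCols) = length(t.names)
# _ncols(t::SQLite.Table) = ...               # another backend, same contract

dt = DataTable(MemCols([:a, :b], Vector[[1, 2], [3, 4]]))
ncols(dt)    # => 2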

Any other ideas or concerns on something like this?

-Jacob

David Gold

Jul 8, 2015, 9:39:11 AM
to juli...@googlegroups.com
@Jacob:

Based on what I've been learning about interfaces, I suspect it would be preferable to have users define as few methods as possible for T <: DataTable in order to hook into the DataTable interface: define this and that method against T <: DataTable, and you know precisely what methods you get in return. I wonder if much confusion would result from everybody defining methods directly against the DataTable wrapper -- if, for instance, it would become difficult to tell at first glance which DataTable methods are endemic to the original implementation and which are due to external packages. Would you say that this issue is, at least in part, a question of whether to emphasize extending methods versus "extending" a single type?

What sort of functionality would you like the DataTable interface to provide? "Define methods _______ for T<:DataTable and the following methods _____________ will magically work for T". How would you fill in those blanks?
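
One hedged way to fill in those blanks, reading DataTable as an abstract type per the T <: DataTable framing (method names hypothetical):

abstract DataTable

# derived methods, written once against the two required methods:
ncol(t::DataTable)    = length(colnames(t))
eltypes(t::DataTable) = DataType[eltype(c) for c in columns(t)]
nrow(t::DataTable)    = isempty(columns(t)) ? 0 : length(first(columns(t)))

# a minimal concrete table hooking in by defining colnames and columns:
type ColTable <: DataTable
    names::Vector{Symbol}
    cols::Vector{Any}
end
colnames(t::ColTable) = t.names
columns(t::ColTable)  = t.cols

nrow(ColTable([:a], Any[[1, 2, 3]]))    # => 3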

In particular, I think a very strong part of this idea is that it saves one from having to proliferate methods such as `fit(GLM, SQLiteTable, ...)`, `fit(GLM, DataFrame, ...)`, `fit(GLM, Matrix)`. But this benefit could be reaped without having to include, say, querying functionality in the interface.

Also, maybe it's best to get the general scheme working for a single "DataTable" type or small set of such types and then think about abstraction?

Scott Jones

Jul 8, 2015, 5:30:56 PM
to juli...@googlegroups.com
I haven't had time to reply to all of this, but this research/work you are doing is great.
I thought maybe you could post a pointer to this discussion in the julia-users group, because I think
there very well may be people with expertise in this area who are not in the julia-dev group (because they are only interested in using Julia at this point), but maybe I'm wrong.

