Another prominent candidate is a DataFrame or AbstractDataFrame with NullableArray columns.
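For concreteness, a rough sketch of what that candidate could look like, assuming DataFrames would accept NullableArray vectors directly as columns (which is essentially what is being proposed here rather than guaranteed behavior at the time); the column names and values are invented for illustration:

```julia
using DataFrames, NullableArrays

# Hypothetical table: each column is a typed NullableArray, and missing entries
# are tracked by the boolean mask rather than by sentinel values.
df = DataFrame(
    id    = NullableArray([1, 2, 3], [false, false, true]),       # third id is NULL
    score = NullableArray([1.5, 2.5, 3.5], [false, true, false])  # second score is NULL
)
```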
I also struggled with similar questions when I created the ExcelReaders.jl package. In the end I now support reading into DataMatrix and DataFrame, which seem to be the most commonly used types at the moment. I just hand-coded those two… But I guess in reality we will have N different types of table-like data structures and M different types of sources, and ideally users could just combine them freely. I think CSVReaders.jl had some ideas that looked very promising, but I never got around to investigating that fully.
There have been many discussions scattered across mailing lists and packages on the subject, but I'm looking to crowdsource some discussion/ideas on several outstanding issues I'm currently trying to resolve across a few packages. The packages in question are:

* ODBC.jl: returning data from a DBMS system
* SQLite.jl: data is stored in an SQLite database table; data can be pulled out
* CSV.jl: reading CSV files into memory, with potential missing values

I've considered a few different ideas for a consistent "table" type, with the key design decisions including:

* How to represent NULL values: some native NULL type or sentinel values
* Whether to preserve type information of individual columns (obviously we probably want to)
* How to associate column headers with table data; do we discard the column names or utilize some type that incorporates them?
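To make the first design decision above concrete, here is a tiny sketch (with invented column data) contrasting sentinel values with Base's Nullable wrapper:

```julia
# Sentinel approach: missingness is smuggled into the element type itself.
ints_sentinel   = [1, 2, typemin(Int)]    # typemin(Int) stands in for NULL
floats_sentinel = [1.5, NaN, 2.5]         # NaN stands in for NULL

# Native-NULL approach: missingness is carried by the type (Base's Nullable here).
ints_nullable = [Nullable(1), Nullable(2), Nullable{Int}()]   # Vector{Nullable{Int}}
```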
Some concrete representations I've been considering are:

* Matrix{Any} or Vector{Vector{T}}:
  * Matrix{Any} totally punts on types and just stores all values as the `Any` type; this approach isn't without precedent: it's actually how SQLite stores data, and it makes for extremely simple parsing/storing/fetching code
  * Vector{Vector{T}} is a step up because it preserves the type `T` of each column vector, while storing the columns themselves inside an outer Vector{Any}
  * In both these cases, NULL values are represented by sentinel values; e.g. in CSV.read(), there are keyword arguments to the effect of null_int=typemin(Int) and null_float=NaN
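A rough sketch of those two layouts with invented data, showing where the type information is kept or lost:

```julia
# Matrix{Any}: every cell is boxed as Any, so column types are gone,
# but the parsing/storing code stays dead simple.
tbl_any = Any[1 1.5;
              2 NaN]                      # NaN doubling as the float NULL sentinel

# Vector of typed column vectors: each column keeps its element type T,
# while the outer container is a Vector{Any}.
tbl_cols = Any[[1, 2, typemin(Int)],      # Vector{Int}, typemin(Int) as NULL
               [1.5, NaN, 2.5]]           # Vector{Float64}, NaN as NULL
```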
* NullableArrays
  * I had a really great conversation and introduction to the NullableArrays.jl package from John Myles White and David Gold last week at JuliaCon; David showed me the extensive testing/performance benchmarking/design work that went into the package, and they really have some core ideas figured out there
  * The "table" type would still need to be a NullableArray{Any,2} or a Vector{NullableArray{T}}, depending on whether we preserve the types of the columns or not
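A small sketch of that second option, assuming NullableArrays' two-argument (values, isnull-mask) constructor; the columns are invented for illustration:

```julia
using NullableArrays

# One typed NullableArray per column; missingness lives in the mask, not in sentinels.
id    = NullableArray([1, 2, 3], [false, false, true])         # third entry is NULL
score = NullableArray([1.5, 2.5, 3.5], [false, true, false])   # second entry is NULL

table = Any[id, score]    # Vector of typed NullableArray columns
```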
* AxisArrays
  * Also got introduced to this nifty new package from Matt Bauman last week; while extremely new (with dependencies on unregistered packages), the cool idea is that this approach allows the incorporation of the column names (as opposed to discarding them or returning them as a separate Vector{String} like Base.readcsv does currently)
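A sketch of how that might look, based on my reading of the (very new, possibly still shifting) AxisArrays API, where Axis{name}(values) attaches names to a dimension; the data and column names are made up:

```julia
using AxisArrays

# Column names become part of the array itself via a named :col axis.
data = [1.0 10.0;
        2.0 20.0;
        3.0 30.0]                 # homogeneous element type; a truly heterogeneous
                                  # table would still need Any or per-column storage
tbl = AxisArray(data, Axis{:row}(1:3), Axis{:col}([:price, :volume]))

tbl[Axis{:col}(:price)]           # pull a column out by name instead of by position
```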
I think Tom is right - I'm not sure there's one data structure that will always work. There are two levels where we can choose either array-of-structs or structs-of-arrays. Is a table of heterogeneous columns a vector of its rows? Or is it a group of column vectors? And then the same transformation can be made between Nullable and NullableArrays. And then there are times when you want a homogeneous matrix to send to BLAS for a regression or PCA, or times when…
But there is something extremely powerful in having just one data structure. So can we find something like R's data.frame that works for most uses? My knowledge here is limited, but could we have both a vector of rows and a strided vector view into that data structure?
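As a tiny illustration of the two layouts being weighed here (all data invented):

```julia
# "Vector of rows": the table as a vector of (id, score) tuples.
rows = [(1, 1.5), (2, 2.5), (3, 3.5)]

# "Group of column vectors": the same table as a tuple of typed columns.
cols = ([1, 2, 3], [1.5, 2.5, 3.5])

# Converting between the two views:
rows_from_cols = collect(zip(cols...))
cols_from_rows = ([r[1] for r in rows], [r[2] for r in rows])

# Flattening to a homogeneous matrix for BLAS-style work (promotes element types):
mat = hcat(cols...)
```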
# rough sketch: stream rows out of a table and fold them into some user structure
itr = SQLTableIterator(tablename)   # hypothetical iterator over a database table
for row_ptr in itr
    update!(somestruct, MySpecializedRowType(row_ptr))   # user-specialized row wrapper
end
PDQ: "pretty darn quick"