I think the interfaces and functionality of readdlm, readcsv, and readtable are a bit muddled right now. I would like to propose some clarified meaning and interfaces.
`readdlm(input, T::Type=Float64; delim='\t', eol=r"\r?\n") => Matrix{T}`
I'm not sure whether the delimiter should be positional or a keyword argument, but that's not super important. The important parts are:
- It always returns a homogeneous matrix of a single type, defaulting to Float64, since, hey, that's our bread and butter.
- It does not do any fancy escaping of any kind: if the delimiter is tab, there is no way to include a tab in a field; likewise, there is no way to include an end-of-line sequence in a field.
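To make those semantics concrete, here is a minimal sketch of what such a no-escaping reader amounts to. `simple_readdlm` is a hypothetical name, not the proposed implementation, and it assumes a rectangular input with fields parseable as `T`:

```julia
# Minimal sketch of the proposed semantics (hypothetical helper, not the
# real implementation): split on eol, split each row on delim, and parse
# every field as T. No quoting or escaping of any kind.
function simple_readdlm(io::IO, ::Type{T}=Float64; delim='\t', eol=r"\r?\n") where {T}
    rows = split(read(io, String), eol; keepempty=false)
    fields = [split(row, delim) for row in rows]
    # Assumes every row has the same number of fields.
    [parse(T, fields[i][j]) for i in 1:length(fields), j in 1:length(fields[1])]
end
```

For example, `simple_readdlm(IOBuffer("1\t2\n3\t4\n"))` yields the 2×2 `Matrix{Float64}` `[1.0 2.0; 3.0 4.0]`.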
`readcsv(input, T::Type=Float64) => Matrix{T}`
Like readdlm, this should always return a matrix of homogeneous type. However, I think that this should *not* be a simple wrapper for readdlm with comma as the delimiter – instead, it should support correct CSV reading. Relegating the ability to properly and efficiently read the ubiquitous CSV format to DataFrames doesn't make any sense to me – that ability has nothing to do with a tabular data format that allows each column to have a different type. If you want to read a CSV file with escaped commas into a matrix of strings, you should be able to do that without having to load it into a DataFrame.
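The part a plain delimited reader cannot do is field splitting with quoting. A minimal sketch of RFC 4180-style splitting shows what "correct CSV reading" entails (`csv_split` is a hypothetical helper, not an existing function): quoted fields may contain commas, and a doubled quote inside a quoted field is a literal quote.

```julia
# Hypothetical sketch of RFC 4180-style field splitting for one line:
# handles quoted fields with embedded commas and doubled quotes, which
# a naive split(line, ',') cannot.
function csv_split(line::AbstractString)
    fields = String[]
    buf = IOBuffer()
    inq = false                     # currently inside a quoted field?
    i = firstindex(line)
    while i <= lastindex(line)
        c = line[i]
        if inq
            if c == '"'
                nxt = nextind(line, i)
                if nxt <= lastindex(line) && line[nxt] == '"'
                    write(buf, '"')  # doubled quote -> literal quote
                    i = nxt
                else
                    inq = false      # closing quote
                end
            else
                write(buf, c)        # delimiters are data inside quotes
            end
        elseif c == '"'
            inq = true
        elseif c == ','
            push!(fields, String(take!(buf)))
        else
            write(buf, c)
        end
        i = nextind(line, i)
    end
    push!(fields, String(take!(buf)))
    fields
end
```

For example, `csv_split("a,\"b,c\",\"d\"\"e\"")` returns the three fields `a`, `b,c`, and `d"e` – a result no fixed-delimiter split can produce.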
`readtable(input, ???)`
I'm not sure what the signature should be here, but the *key* distinction of readtable [1] should be that it allows reading data where each column has a different type – i.e. a DataFrame. You should *only* need the functionality provided by DataFrames if you want to produce a DataFrame. Like I said above, you shouldn't need DataFrames to read a CSV file of strings into a Matrix{UTF8String}.
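That distinction can be made concrete with a toy illustration (nothing here is a proposed API): the same parsed rows held as a homogeneous string matrix versus per-column typed vectors, which is the essence of what a DataFrame adds.

```julia
# Toy illustration only: the same parsed rows as (a) a homogeneous
# string matrix and (b) a vector of per-column typed vectors.
rows = [["1", "alice"], ["2", "bob"]]

# (a) Homogeneous: every cell is a String -- readcsv/readdlm territory.
as_matrix = permutedims(reduce(hcat, rows))

# (b) Heterogeneous columns: the first column parsed as Int, the second
# left as strings -- what a readtable-style reader would produce.
cols = Any[parse.(Int, getindex.(rows, 1)), getindex.(rows, 2)]
```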
All of this should be built on common lower-level functionality. In particular, there are two pieces of core functionality that it seems crucial to have fast, general versions of, which can then be used to build data structures:
- reading delimited data, a la readdlm
- reading CSV data, a la readcsv
It's unclear to me that constructing a DataFrame – i.e. a data structure that allows tabular data with heterogeneous column types – should be tied to either format. You should be able to read a DataFrame using a simple dlm-style reader or a fancy csv-style reader that allows escaping and whatnot. It should also be possible to read ragged data, where each row has a different number of values; that can be done in either simple delimited style or CSV style. Reading either delimited or CSV data should be supported in Base; reading such data into a DataFrame should live in the DataFrames package. This is going to require some careful refactoring to make sure that it remains fast, but I think it should be quite possible using a produce-consume model or carefully designed iterators.
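Ragged reading is straightforward under this model, since nothing forces rows into a rectangular container. A one-line sketch (`read_ragged` is a hypothetical name):

```julia
# Hypothetical sketch: ragged reading just keeps each row at its own
# length, returning a Vector of Vectors instead of a Matrix.
read_ragged(io::IO; delim='\t') = [split(line, delim) for line in eachline(io)]
```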
The ideal way this should work is that you compose a data reader – that gets elements and figures out when lines are done – with a data structure builder that takes those values, parses them and builds the data structure. Thus readdlm is the composition of a DLMReader with a MatrixBuilder or something like that, while readcsv is the composition of a CSVReader with a MatrixBuilder. There should be versions of readtable that use DLMReader or CSVReader, depending on the data format; the common piece should be the data structure side, which could be a DataFrameBuilder or something like that. I don't really care about the names, which are kind of Javaesque – what I care about is the composable design.
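Here is a toy sketch of that composition under stated assumptions: a "reader" is anything that lazily yields rows of raw string fields, and a "builder" consumes those rows. `dlm_reader`, `matrix_builder`, and `my_readdlm` are all hypothetical names standing in for DLMReader, MatrixBuilder, and readdlm.

```julia
# Toy sketch of the reader/builder composition; all names hypothetical.
# A reader yields rows of raw fields lazily...
dlm_reader(io::IO; delim='\t') = (split(line, delim) for line in eachline(io))

# ...and a builder consumes rows to construct a concrete data structure
# (here, a rectangular Matrix{T}).
function matrix_builder(rows, ::Type{T}) where {T}
    collected = collect(rows)
    [parse(T, collected[i][j]) for i in 1:length(collected), j in 1:length(collected[1])]
end

# readdlm is then literally the composition of the two; swapping in a
# CSV-aware reader or a DataFrame-aware builder changes only one side.
my_readdlm(io::IO, ::Type{T}=Float64; delim='\t') where {T} =
    matrix_builder(dlm_reader(io; delim=delim), T)
```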
[1] I'm not super happy about the name "readtable", but it's ok. It seems a bit weird to call readtable and get back an object of type DataFrame. Maybe call it readdata or readframe or something?