read{dlm,csv,table}

Stefan Karpinski

Jun 29, 2013, 12:09:33 PM
to Julia Dev
I think the interfaces and functionality of readdlm, readcsv, and readtable are a bit muddled right now. I would like to propose some clarified meaning and interfaces.

readdlm(input, T::Type=Float64; delim='\t', eol=r"\r?\n?") => Matrix{T}

I'm not sure if the delimiter should be positional or keyword, but that's not super important. The important parts are:
  1. It always returns a homogeneous matrix of a single type, defaulting to Float64, since, hey, that's our bread-and-butter.
  2. It does not do any fancy escaping of any kind: if the delimiter is tab, there is no way to include tabs in a field; likewise, there is no way to include an end-of-line sequence in a field.
readcsv(input, T::Type=Float64) => Matrix{T}

Like readdlm, this should always return a matrix of homogeneous type. However, I think that this should *not* be a simple wrapper for readdlm with comma as the delimiter – instead, it should support correct CSV reading. Relegating the ability to properly and efficiently read the ubiquitous CSV format to DataFrames doesn't make any sense to me – that ability has nothing to do with a tabular data format that allows each column to have a different type. If you want to read a CSV file with escaped commas into a matrix of strings, you should be able to do that without having to load it into a DataFrame.
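To be concrete about what "correct CSV reading" means beyond splitting on commas, here is a rough sketch of RFC 4180-style quoting for a single line. This is just an illustration of the required logic, not a proposed implementation, and csv_split is a made-up name:

# Handles quoted fields with embedded commas and doubled quotes ("").
function csv_split(line::AbstractString)
    fields = String[]
    buf = IOBuffer()
    inquotes = false
    i = firstindex(line)
    while i <= lastindex(line)
        c = line[i]
        if inquotes
            if c == '"'
                nxt = nextind(line, i)
                if nxt <= lastindex(line) && line[nxt] == '"'
                    print(buf, '"')      # escaped quote ("")
                    i = nxt
                else
                    inquotes = false     # closing quote
                end
            else
                print(buf, c)
            end
        elseif c == '"'
            inquotes = true
        elseif c == ','
            push!(fields, String(take!(buf)))
        else
            print(buf, c)
        end
        i = nextind(line, i)
    end
    push!(fields, String(take!(buf)))
    return fields
end

csv_split("a,\"hello, world\",\"she said \"\"hi\"\"\"")
# => ["a", "hello, world", "she said \"hi\""]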

readtable(input, ???)

I'm not sure what the signature should be here, but the *key* distinction of readtable [1] should be that it allows reading data where each column has a different type – i.e. a DataFrame. You should *only* need to use the functionality provided by DataFrames if you want to produce a DataFrame. Like I said above, you shouldn't need DataFrames to read a CSV file of strings into a Matrix{UTF8String}.

All of this should have common lower-level functionality to support it. In particular, there are two pieces of core functionality for which it seems crucial to have fast, general versions that can be used to build data structures:
  1. reading delimited data, a la readdlm
  2. reading CSV data, a la readcsv
It's unclear to me that constructing a DataFrame – i.e. a data structure that allows tabular data with heterogeneous types – should be tied to either format. You should be able to read a DataFrame using a simple dlm-style reader or a fancy csv-style reader that allows escaping and whatnot. It should also be possible to read ragged data where each row has a different number of values on it. That can be done in either simple delimited style or CSV style. Reading either delimited or CSV data should be supported in Base. Reading such data into a DataFrame should be in the DataFrames package. This is going to require some careful refactoring to make sure that it remains fast, but I think it should be quite possible using a produce-consume model or carefully designed iterators.

The ideal way this should work is that you compose a data reader – that gets elements and figures out when lines are done – with a data structure builder that takes those values, parses them and builds the data structure. Thus readdlm is the composition of a DLMReader with a MatrixBuilder or something like that, while readcsv is the composition of CSVReader with MatrixBuilder. There should be versions of readtable that use DLMReader or CSVReader, depending on the data format; the common thing should be the data structure side, which could be DataFrameBuilder or something like that. I don't really care about the names, which are kind of Javaesque, so much as the composable design.
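To make the composition idea a bit more concrete, here is a rough sketch with entirely made-up names (RowReader, DLMReader, readrow, build_matrix), just to show the shape of the design, not an actual API:

# A reader produces one row of raw field strings at a time; a builder consumes rows.
abstract type RowReader end

struct DLMReader <: RowReader
    io::IO
    delim::Char
end

readrow(r::DLMReader) = eof(r.io) ? nothing : split(readline(r.io), r.delim)

# A matrix builder: parse every field to T and stack the rows into a Matrix{T}.
function build_matrix(::Type{T}, reader::RowReader) where T
    rows = Vector{T}[]
    while (row = readrow(reader)) !== nothing
        push!(rows, [parse(T, f) for f in row])
    end
    return permutedims(reduce(hcat, rows))
end

# readdlm-as-composition, over an in-memory example:
io = IOBuffer("1\t2\t3\n4\t5\t6\n")
build_matrix(Float64, DLMReader(io, '\t'))   # 2x3 Matrix{Float64}

A CSVReader would plug into the same build_matrix, and a DataFrame builder would plug into the same readers.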

[1] I'm not super happy about the name "readtable", but it's ok. It seems a bit weird to say readtable and get an object of type DataFrame. Maybe calling this readdata or readframe or something?

Stefan Karpinski

Jun 29, 2013, 12:22:45 PM
to Julia Dev
While I'm at it, I should point out that readdlm can be generalized to arbitrary dimensions, where the standard version has two delimiters and produces two-dimensional arrays. You could do something like this instead:

readdlm(input, T::Type, delims::Tuple=(',', r"\r?\n?"))
  => Array{T,length(delims)}

Thus, if you wanted to read a vector of integers, you could do this:

readdlm(input, Int, (r"\r?\n?",))

If you wanted to read a 3-tensor of floats, you could do something like this:

readdlm(input, Float64, (',', r"\r?\n?", r"\r?\n?\r?\n?"))

This would treat blank lines as the separator in the next dimension. Note that for this to work, you have to search for the delimiters in reverse order (blank lines, then line ends, then commas).
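A minimal sketch of that reverse-order splitting (with the made-up name split_nested, and with non-optional line-end patterns so that empty matches don't get in the way):

# Split on the outermost (last) delimiter first, then recurse on each chunk.
function split_nested(s::AbstractString, delims::Tuple)
    length(delims) == 1 && return split(s, delims[1], keepempty=false)
    return [split_nested(chunk, delims[1:end-1])
            for chunk in split(s, delims[end], keepempty=false)]
end

data = "1,2\n3,4\n\n5,6\n7,8\n"
split_nested(data, (',', r"\r?\n", r"\r?\n\r?\n"))
# => two "planes", each a vector of rows, each a vector of value strings

The real readdlm would of course parse the leaf strings and fill an Array{T,length(delims)} rather than return nested vectors.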

Glen Hertz

Jun 29, 2013, 9:04:56 PM
to juli...@googlegroups.com
Hi,

+1 on the proposal.

Have you thought of making the base definition take a file handle to the actual data blob? Sometimes you have to deal with non-standard headers, so it would be nice to call read{dlm,csv} with a file handle pointing to the start of the data. Perhaps an End of Data delimiter could also be supported if the data ended before the end of file. Another situation is that at times you want to read in only certain columns, or do a streamed read-in.

Perhaps both the data format type and the type of data you want to receive could be keyword arguments:

readdata(...., format=CSV, result=DataFrame)

Sorry for asking for so much, but parsing data is one of those things that is really nice if someone else has done it for you :).  People don't want to waste their time writing parsers; they want to work with the data.

Cheers,

Glen

Viral Shah

Jun 30, 2013, 4:04:09 AM
to juli...@googlegroups.com
This is roughly the direction in which the new implementations have proceeded, but there is more work to be done in making readdlm and readcsv composable with the rest of the array infrastructure, and with readtable (I vote for readdata as the new name).

A DLMReader / CSVReader interface (perhaps where each row is accessed through an iterator) should also allow filtering and processing while reading large files, which I find essential in working with data that is unlikely to fit in memory. The current interfaces are a bit restrictive, and one has to roll their own code whenever one needs such capabilities - as I ended up doing recently. Having an interface where you can process the file as you are reading it also makes it possible to handle weirdly or incorrectly formatted data, since, as John Myles White said, every CSV file is broken in its own way. One can never address all breakages, but it may be possible to accommodate some by providing ways for the user to work around parts of a file while reading it.

-viral

Viral Shah

Jun 30, 2013, 4:07:12 AM
to juli...@googlegroups.com
I just mentioned streamed reading in my earlier email, but forgot about reading only certain columns. Both of these are incredibly useful. We may benefit from having a general way to think about data streams - currently, DataFrames.jl has DataStreams, and HDFS.jl has internal interfaces to streaming large distributed files.

-viral

Stefan Karpinski

Jun 30, 2013, 8:50:01 AM
to Julia Dev
On Sat, Jun 29, 2013 at 9:04 PM, Glen Hertz <glen....@gmail.com> wrote:
Sorry for asking for so much, but parsing data is one of those things that is really nice if someone else has done it for you :).  People don't want to waste their time writing parsers; they want to work with the data.

No, this is exactly the point. You should ask for that much from a language – and if the language doesn't provide it, then it's falling down on the job.

A compositional design has the benefit that you can add support for new formats to represent CSV-like data very easily. In a previous existence, I spent a lot of time working with data stored in the Hadoop Sequence File format, which would be a natural source for the data to populate a DataFrame. Adding generic support for Thrift, ProtoBuf, and Avro is equally reasonable. The point of the composition idea is that one should be able to add generic support for reading these formats without needing to know anything specific about DataFrames, while DataFrames should be able to be created from these formats without needing to know anything specific about these formats.

Stefan Karpinski

Jun 30, 2013, 9:17:50 AM
to Julia Dev
On Sun, Jun 30, 2013 at 4:04 AM, Viral Shah <vi...@mayin.org> wrote:
A DLMReader / CSVReader interface (perhaps where each row is accessed through an iterator) should also allow filtering and processing while reading large files, which I find essential in working with data that is unlikely to fit in memory.

I suspect that something like a producer/consumer event-driven parser model may be the way to go. The data parsing side could produce a series of "events", including "value" events – which would include the actual content of a field, unescaped and such – and "row end" events, etc., which signal structure in the data. The consumer would use these events to construct the actual data structure as appropriate, including converting string values to the appropriate types, saving them into the data structure, and potentially inferring types and such.
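Here is a very rough sketch of what such an event stream could look like; all the names are invented for illustration, a real producer would handle CSV quoting, and it would presumably stream events rather than collect them eagerly:

abstract type ParseEvent end

struct ValueEvent <: ParseEvent
    field::String          # already unescaped by the producer
end

struct RowEndEvent <: ParseEvent end

# A trivial producer for plain delimited data.
function produce_events(io::IO, delim::Char)
    events = ParseEvent[]
    for line in eachline(io)
        for f in split(line, delim)
            push!(events, ValueEvent(String(f)))
        end
        push!(events, RowEndEvent())
    end
    return events
end

# A consumer that just collects rows of strings; a DataFrame builder would instead
# infer column types and push parsed values into columns.
function consume(events)
    rows, current = Vector{String}[], String[]
    for ev in events
        if ev isa ValueEvent
            push!(current, ev.field)
        else                          # RowEndEvent: finish the current row
            push!(rows, current)
            current = String[]
        end
    end
    return rows
end

consume(produce_events(IOBuffer("a,b\nc,d\n"), ','))   # [["a", "b"], ["c", "d"]]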

The current interfaces are a bit restrictive, and one has to roll their own code whenever one needs such capabilities - as I ended up doing recently. Having an interface where you can process the file as you are reading it also makes it possible to handle weirdly or incorrectly formatted data, since, as John Myles White said, every CSV file is broken in its own way. One can never address all breakages, but it may be possible to accommodate some by providing ways for the user to work around parts of a file while reading it.

Ideally, the CSV data producer would just handle weirdness in the CSV format so that the consumer would never need to know about that stuff. It would be interesting to have interactive loading procedures where you decide, while processing the data, how to handle certain kinds of irregularities. One can imagine it prompting you about whether the first row is a header, later encountering oddly formatted lines and asking which of various interpretations is correct, and then applying that interpretation consistently until you're done reading the data. Maybe it could also return a specification of the rules you just implicitly picked, so that the same kind of data can be parsed again automatically in the future. That would certainly be far more pleasant than the usual fail, tweak, try again approach, which is incredibly tedious and time-consuming.

Tomas Lycken

Jul 1, 2013, 4:39:19 AM
to juli...@googlegroups.com
+1 for the composition idea - separating the specification of input data from the specification of the data structure into which to load the data is a Good Thing. It'll make it really easy for users to extend existing functionality for their own file formats - or data structures - without having to reinvent both wheels.

I would also like to request a built-in reader for .mat files (i.e. MATLAB data files). Since these can store several variables in one file, it probably makes sense to force the user to specify the name of the variable in the file that one wants to read, but with its own method ("readmat", maybe?) that shouldn't be a problem.

// T

Tim Holy

Jul 1, 2013, 6:00:39 AM
to juli...@googlegroups.com
On Monday, July 01, 2013 01:39:19 AM Tomas Lycken wrote:
> I would also like to request a built-in reader for .mat files (i.e. matlab
> data files).

Why does it need to be built-in, rather than the existing package?

--Tim

John Myles White

Jul 1, 2013, 8:38:50 AM
to juli...@googlegroups.com
I'm having a hard time figuring out what benefits we're going to get from this. The existing DataFrames IO system is already broken up into chunks that are completely reusable: there's a readnrows() function that consumes N correctly formatted rows from a CSV file into a data structure that could be used by readdlm() to produce output. Then there's a separate function that does parsing/type inference to determine how to populate a DataFrame using the data structure that readdlm() produces. This machinery can easily be generalized -- but I see almost no settings in which the generalization will help anyone, since any realistic tabular data should be read into something like a DataFrame. If you know that you don't have heterogeneous columns and you also know that you don't have missing data, then the tabular file format isn't for you.

Most of the "hard" work that's being done in DataFrames IO is a work-around for Julia's parse*() functions, which expect to get strings rather than byte vectors with start and end points. In terms of reusable infrastructure that we're lacking, my feeling is that the parse*() functions need to be rewritten from the ground up since they are the major bottleneck. What we need is a generic mechanism for telling Julia to translate subsequences of byte vectors into Ints, Float64s, Bools and ASCIIStrings/UTF8Strings. You can build up any parsing system you want once you have access to those tools: you just need a mechanism for telling you where the field breaks are (which readnrows provides) and a mechanism for populating a data structure with fields.
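For concreteness, this is the kind of primitive I have in mind, sketched here as a standalone helper rather than an actual Base API: parsing an Int directly out of a byte range, without materializing a String first.

# Hypothetical helper: parse the bytes in positions lo:hi as a (possibly signed) integer.
function parse_int_range(bytes::Vector{UInt8}, lo::Int, hi::Int)
    neg = bytes[lo] == UInt8('-')
    i = neg ? lo + 1 : lo
    n = 0
    while i <= hi
        b = bytes[i]
        UInt8('0') <= b <= UInt8('9') || error("invalid digit at byte $i")
        n = 10n + (b - UInt8('0'))
        i += 1
    end
    return neg ? -n : n
end

buf = Vector{UInt8}("42,-17\n")
parse_int_range(buf, 1, 2), parse_int_range(buf, 4, 6)   # (42, -17)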

Honestly, I think there aren't many use cases for what readcsv() is doing. Do a Google search for "filetype:csv". Look at the results and count how many have heterogeneous columns and/or missing entries. You'll see that it's essentially all of them. If you have a CSV file that was generated in the real world, you want a DataFrame as output. Using CSV files to store things like a matrix of Float64s strikes me as an anti-pattern we shouldn't be encouraging. I'm particularly perplexed by the idea of reading ragged rows: either you're not getting a matrix back or you're getting one that has something like #undef in it. But why aren't you using NA then?

-- John

Viral Shah

Jul 1, 2013, 10:41:30 AM
to juli...@googlegroups.com
There are two simultaneous issues under discussion here, and I will try to separate them and offer my viewpoints on both:

1. There is the re-organization of all the building blocks of readdlm. This includes having a DLMReader so that you can stream data (readnrows), filter / process as you read, improved parse*() functions, transparently mmapping files inside DLMReader, etc. I prefer having all these building blocks in base, which should make it possible for people to roll their own file format parsers should readdlm or DataFrames.readtable not cut it for them.

2. Then there is the whole CSV file format, with quoting, heterogeneous data, missing data, handling corrupt files, etc. In my opinion, all this really belongs to DataFrames. DataFrames should reuse the building blocks from Base, and then add these additional capabilities. These capabilities will also constantly keep improving as we keep tackling real world data, which will mean APIs that may change faster than APIs in base do.

Once we have this structure largely in place, we can always move some stuff back and forth, as we gain experience. I feel that we have already made huge strides in the last few weeks.

-viral

John Myles White

Jul 1, 2013, 10:46:45 AM
to juli...@googlegroups.com
I agree with this approach.

-- John

Stefan Karpinski

Jul 1, 2013, 3:19:49 PM
to Julia Dev
I have actually done a lot of work with datasets that are nominally in CSV or TSV format, yet just represent matrices. It's not a great format, but *everything* can read it. There's also a fair amount of utility to being able to read a matrix of strings. I agree, however, that most of the time that people are reading CSV files a DataFrame is the most appropriate data structure to represent the data. To that end, maybe we should just get rid of readcsv in base altogether and only have readdlm? It's the duplication of confusingly similar functionality that bothers me. It makes no sense that readcsv exists but if I want to read a CSV file I *should* use DataFrames.readtable instead of readcsv. That's just broken.

The point of the composability I'm talking about is so that you can do this:

using Hadoop.SequenceFiles # knows nothing about DataFrames
using DataFrames # knows nothing about Sequence File format

df = readtable(SequenceFile("file"))

or something like that and have it just work. It's the same reason I wanted DataFrames to use bitmasks for NAs – so that you can have a DataArray{SomeType} where DataFrames and SomeType don't know anything about each other and still have it just work.

Tomas Lycken

Jul 1, 2013, 3:46:53 PM
to juli...@googlegroups.com
This is exactly what I'm talking about as well - the DataFrames package is great, but if we think it's so crucial for IO, why is it still a package and not part of base? With clear separation between parsing data (reading it into memory) and structuring it (placing it in a data structure that can be used by the user), it becomes really easy to extend existing functionality with new file formats - or new data structures.

"Built-in" functionality for reading .mat files was probably a stricter wording than what I really meant - but it should be just as easy to read mat files as e.g. csv files. If I need to write an own parser, that's OK as long as the tools I need to do so are there - but I don't want to be bound to a data structure when I parse the file. After all, most of the data I need from these files is just plain matrices of floats, so I should be able to read it into a Matrix{Float64} and never bother about e.g. DataFrames.

Given this approach, I don't think CSV parsing should necessarily be part of DataFrames, but could rather be its own package (if the functionality does not fit into base). For most applications where you actually need the CSV format you probably also want DataFrames, but there might be other times when you know you'll get a nice matrix of all-valid, never-missing values, and DataFrames is just overkill.

Strict decoupling is the key here, but of course it's also a process that has to take its time - I really agree with Viral's last comments about moving stuff back and forth along the way.

As a side note: given emerging packages like Phylogenics.jl, there's good reason not to assume tabular data at all - but it would be a real strength for Julia if parsing data files for other structures (trees etc) worked the same way.

//T 

Viral Shah

Jul 1, 2013, 3:50:59 PM
to juli...@googlegroups.com
We are differentiating between readdlm, which reads well-formed delimited files, and CSV files, which support more bells and whistles, missing data, etc. Thus, if you have a well-formed file containing homogeneous data, you will always be able to read it with readdlm in base.

-viral

Stefan Karpinski

Jul 1, 2013, 4:32:31 PM
to Julia Dev
On Mon, Jul 1, 2013 at 3:50 PM, Viral Shah <vi...@mayin.org> wrote:
We are differentiating between readdlm, which reads well-formed delimited files, and CSV files, which support more bells and whistles, missing data, etc. Thus, if you have a well-formed file containing homogeneous data, you will always be able to read it with readdlm in base.

This is simply not true. There are many cases where you might be reading homogeneous string data and simple delimited reading will not work because there are embedded delimiters in the fields. Sure, for simple numeric CSV files, this is usually not the case, but that's not the only use case for CSV files that doesn't require DataFrames. The point of separating CSV-reading from the DataFrame structure is that there might be other things you'd want to read into a DataFrame and other things that you'd want to read CSV data into.

Tim Holy

Jul 1, 2013, 5:13:05 PM
to juli...@googlegroups.com
On Monday, July 01, 2013 12:46:53 PM Tomas Lycken wrote:
> "Built-in" functionality for reading .mat files was probably a stricter
> wording than what I really meant - but it should be just as easy to read
> mat files as e.g. csv files.

I'm confused; in what way is it hard to read a .mat file?

using MAT
vars = readmat("myfile.mat")

--Tim

Stefan Karpinski

Jul 1, 2013, 5:22:24 PM
to Julia Dev
It's not at all. I think that MAT files may not be the best example here. The main idea is that there are many data formats (F) for structured data and many data structures (S) one may wish to read that data into. It would be better to have a design where one doesn't need to write specific code for any of the F*S possible combinations of formats and data structures one may want to use. Or maybe that's just wrong thinking and there is only a single appropriate data structure for each data format. That's why MAT files seem like a bad example – each variable in a MAT file does actually inherently map to a specific data structure, because that's the point of MAT files.

tanmay

Jul 1, 2013, 5:50:23 PM
to juli...@googlegroups.com
+1 The event driven parser model would be neat. So would be the separation of source (producer) and target (consumer) formats.

The current readcsv in base is just a thin wrapper over readdlm, so moving it out of base would not break much. In my thinking, apart from the framework to support generic parsing, base should include all that is needed to work with frequently encountered formats (including cases like delimiters within a field). It should, however, stick to one well-defined format, e.g. http://tools.ietf.org/html/rfc4180 for CSV. Handling the numerous format deviations found in the wild can be part of packages.

My idea is that for most simple conditions, base should suffice. If necessary, it should be possible to use a package to correct the format of an erroneous file once and use base henceforth. Since handling broken formats can be complex and the code handling it is prone to bugs, it's prudent to have it outside, where there can be many alternatives to choose from to suit a specific case. It should be possible to contain simpler format handlers within more complex ones for maximum reuse.

Thinking of CSV along these lines, stuff like filtering and massaging the data while reading can be done through packages. But base should have support for heterogeneous columns, which it does to some extent today and could do better.

Stefan Karpinski

Jul 1, 2013, 6:20:53 PM
to Julia Dev
I want to point out that there are two very different aspects to handling a CSV file:
  1. Decoding the incoming bytes and dealing with commas, newlines, etc. to figure out what the string data that the CSV file encodes are.
  2. Inferring the types of columns and interpreting those strings as actual data.
The latter is part of the DataFrame construction logic and could be just as well applied to Sequence Files or TSV, while the former is inherently part of the CSV format.
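To illustrate that the second step stands on its own, here is a toy sketch of column type inference over already-decoded string fields, independent of where they came from; infer_column is just an illustrative name:

# Try a few candidate types in order; fall back to strings if none fits every field.
function infer_column(fields::Vector{String})
    for T in (Int, Float64, Bool)
        parsed = map(f -> tryparse(T, f), fields)
        all(!isnothing, parsed) && return identity.(parsed)   # narrow away Nothing
    end
    return fields
end

infer_column(["1", "2", "3"])       # Vector{Int}
infer_column(["1.5", "2", "x"])     # stays Vector{String}

The same inference could sit behind a DataFrameBuilder regardless of whether the fields came from a DLMReader, a CSVReader, or a Sequence File reader.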

Tomas Lycken

Jul 1, 2013, 6:43:15 PM
to juli...@googlegroups.com
@Tim: Nice, I was not aware of the MAT package - it looks nice. And given that it exists, my .mat example was really not the best one.

But the point is that, with the IO logic separated into those two (disjoint) categories and a well-defined, consistent interface between them, it becomes (almost) trivial to implement readers for new formats that can be used with existing data structures - and vice versa.

// T

Simon Kornblith

Jul 1, 2013, 8:08:38 PM
to juli...@googlegroups.com
I think one needs to differentiate between binary formats (where the desired in-memory representation is usually isomorphic to the on-disk representation) and string formats (where the desired in-memory representation is often quite different). But I'm not sure it's going to be possible to hide the complexities of all string formats behind a single abstraction layer. I can't imagine a single event-based parser API that would be well-suited to CSV, JSON, and XML. If you take a look at SAX, which is the standard event-based XML parser API, it specifies a lot of events that are specific to XML. It might be possible to come up with something more general, but generality isn't necessarily desirable, since those XML-specific events make parsing XML files easier. While it makes sense to have a single parsing API for flat record-based data formats (e.g. CSV and TSV as discussed here), IMHO it makes little sense to try to expand this effort to arbitrary file formats.

Simon

John Myles White

Jul 8, 2013, 12:04:38 PM
to juli...@googlegroups.com
Is anything happening here?

I'm in total agreement with Simon: I think the producer/consumer dynamics only make sense for file formats that are basically a variant of the tabular data format. Unless you can read a minibatch of rows and extract something useful, there's no point in building a streaming parser. For example, JSON and XML don't satisfy the requirement that subsets of the input are, in isolation, complete data sets.

Our IO system has code that can easily be used in Base now because it just tells you which bytes from the input IO stream are inside of which fields. Type inference and DataFrame construction are totally separate. There seems to be a memory bug in it that I will track down over the next two days, but this should be a trivial concern. We can copy the mmapping approach from readdlm(), but my impression was that this was actually slower than the approach we had taken -- although that may have since changed.

 -- John