[Python-ideas] csv.DictReader could handle headers more intelligently.

J. Cliff Dyer

Jan 22, 2013, 8:06:08 PM
to python...@python.org
Idea folks,

I'm working with some poorly-formed CSV files, and I noticed that
DictReader always and only pulls headers off of the first row. But many
of the files I see have blank lines before the row of headers, sometimes
with commas to the appropriate field count, sometimes without. The
current implementation's behavior in this case is likely never correct,
and certainly always annoying. Given the following file:

---Start File 1---
,,
A,B,C
1,2,3
2,4,6
---End File 1---

csv.DictReader yields the rows:

{'': 'C'}
{'': '3'}
{'': '6'}


And given a file starting with a zero-length line, like the following:

---Start File 2---

A,B,C
1,2,3
2,4,6
---End File 2---

It yields the following:

{None: ['A', 'B', 'C']}
{None: ['1', '2', '3']}
{None: ['2', '4', '6']}
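
Both results can be reproduced without files on disk. The following is a minimal sketch, assuming Python 3 and using io.StringIO objects as stand-ins for File 1 and File 2; the variable names are illustrative only.

```python
import csv
import io

# File 1: a delimiter-only line precedes the real header.
file_1 = io.StringIO(",,\nA,B,C\n1,2,3\n2,4,6\n")
# File 2: a zero-length line precedes the real header.
file_2 = io.StringIO("\nA,B,C\n1,2,3\n2,4,6\n")

# The header is read as ['', '', ''], so all three columns collapse
# onto the single key ''.
rows_1 = [dict(row) for row in csv.DictReader(file_1)]

# The header is read as [], so every value ends up under restkey,
# which defaults to None.
rows_2 = [dict(row) for row in csv.DictReader(file_2)]

print(rows_1)
print(rows_2)
```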

I think that in both cases, the proper response would be to treat the A,B,C
line as the header line. The change that makes this work is pretty
simple. In the fieldnames getter property, the "if not
self._fieldnames:" conditional becomes "while not self._fieldnames or
not any(self._fieldnames):" As a subclass:

import csv


class DictReader(csv.DictReader):
    @property
    def fieldnames(self):
        # Keep reading until a row with at least one non-empty field turns up.
        while self._fieldnames is None or not any(self._fieldnames):
            try:
                self._fieldnames = next(self.reader)
            except StopIteration:
                break
        self.line_num = self.reader.line_num
        return self._fieldnames

    # Same as the original setter, just rewritten to associate with the
    # new getter property
    @fieldnames.setter
    def fieldnames(self, value):
        self._fieldnames = value

There might be some issues with existing code that depends on the {None:
['1','2','3']} construction, but I can't imagine a time when programmers
would want to see {'': '3'} with the 1 and 2 values getting lost.

Thoughts? Do folks think this is worth adding to the csv library, or
should I just keep using my subclass?

Cheers,
Cliff


_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas

alex23

Jan 22, 2013, 8:51:38 PM
to python...@python.org
On Jan 23, 11:06 am, "J. Cliff Dyer" <j...@sdf.lonestar.org> wrote:
> I'm working with some poorly-formed CSV files, and I noticed that
> DictReader always and only pulls headers off of the first row.  But many
> of the files I see have blank lines before the row of headers, sometimes
> with commas to the appropriate field count, sometimes without.  The
> current implementation's behavior in this case is likely never correct,
> and certainly always annoying.

I don't think we should start adding support for every malformed type
of csv file that exists. It's easy enough to remove the unnecessary
lines yourself before passing them to DictReader:

    from csv import DictReader

    with open('malformed.csv','rb') as csvfile:
        csvlines = list(l for l in csvfile if l.strip())
        csvreader = DictReader(csvlines)

Personally, if I was dealing with this as often as you are, I'd
probably make a custom context manager instead. The problem lies in
the files themselves, not in csv's response to them.
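
A variant of this pre-filtering idea for Python 3 is sketched below. It is only an illustration: the helper names is_blank and skip_leading_blanks are hypothetical, the delimiter is assumed to be a comma, and only lines before the header are dropped. Unlike the l.strip() test above, it also skips delimiter-only lines such as ",,". With a real file, open it in text mode with newline='' rather than 'rb'.

```python
import csv
import io
from itertools import dropwhile


def is_blank(line, delimiter=","):
    """True for empty lines and lines made only of delimiters/whitespace."""
    return not line.strip(delimiter + " \t\r\n")


def skip_leading_blanks(lines, delimiter=","):
    """Drop blank and delimiter-only lines that come before the header."""
    return dropwhile(lambda line: is_blank(line, delimiter), lines)


# io.StringIO stands in for open('malformed.csv', newline='').
source = io.StringIO(",,\n\nA,B,C\n1,2,3\n2,4,6\n")
rows = [dict(row) for row in csv.DictReader(skip_leading_blanks(source))]
print(rows)
```

Filtering only the leading lines leaves the body of the file untouched, so quoted fields that contain line breaks further down are not affected.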

J. Cliff Dyer

Jan 23, 2013, 11:51:05 AM
to python...@python.org
On Tue, 2013-01-22 at 17:51 -0800, alex23 wrote:
> I don't think we should start adding support for every malformed type
> of csv file that exists. It's easy enough to remove the unnecessary
> lines yourself before passing them to DictReader:
>
>     from csv import DictReader
>
>     with open('malformed.csv','rb') as csvfile:
>         csvlines = list(l for l in csvfile if l.strip())
>         csvreader = DictReader(csvlines)
>
> Personally, if I was dealing with this as often as you are, I'd
> probably make a custom context manager instead. The problem lies in
> the files themselves, not in csv's response to them.

With all due respect, while you make a good point that we don't want to
start special casing every malformed type of CSV, there is absolutely
something wrong with DictReader's response to files that have duplicate
headers. It throws away data silently.

If you (and others on this list) aren't in favor of trying to find the
right header row (which I can understand: "In the face of ambiguity,
refuse the temptation to guess."), maybe a better solution would be to
raise a (suppressible) exception if the headers aren't uniquely named.
("Errors should never pass silently. Unless explicitly silenced.")

Cheers,
Cliff

Amaury Forgeot d'Arc

Jan 23, 2013, 12:08:32 PM
to J. Cliff Dyer, python...@python.org
Hi,

2013/1/23 J. Cliff Dyer <j...@sdf.lonestar.org>

> On Tue, 2013-01-22 at 17:51 -0800, alex23 wrote:
> > I don't think we should start adding support for every malformed type
> > of csv file that exists. It's easy enough to remove the unnecessary
> > lines yourself before passing them to DictReader:
> >
> >     from csv import DictReader
> >
> >     with open('malformed.csv','rb') as csvfile:
> >         csvlines = list(l for l in csvfile if l.strip())
> >         csvreader = DictReader(csvlines)
> >
> > Personally, if I was dealing with this as often as you are, I'd
> > probably make a custom context manager instead. The problem lies in
> > the files themselves, not in csv's response to them.
>
> With all due respect, while you make a good point that we don't want to
> start special casing every malformed type of CSV, there is absolutely
> something wrong with DictReader's response to files that have duplicate
> headers. It throws away data silently.

That's how Python dictionaries work, by design:
    d = {'a': 1, 'a': 2}
"silently" discards the first value.

> If you (and others on this list) aren't in favor of trying to find the
> right header row (which I can understand: "In the face of ambiguity,
> refuse the temptation to guess."), maybe a better solution would be to
> raise a (suppressible) exception if the headers aren't uniquely named.
> ("Errors should never pass silently.  Unless explicitly silenced.")

What about a subclass then:

class CarefulDictReader(csv.DictReader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        fieldnames = self.fieldnames
        if len(fieldnames) != len(set(fieldnames)):
            raise ValueError("Duplicate field names", fieldnames)


--
Amaury Forgeot d'Arc

Antoine Pitrou

Jan 23, 2013, 12:15:51 PM
to python...@python.org
Le Wed, 23 Jan 2013 18:08:32 +0100,
"Amaury Forgeot d'Arc"
<amau...@gmail.com> a écrit :

> Hi,
>
> 2013/1/23 J. Cliff Dyer <j...@sdf.lonestar.org>
>
> > On Tue, 2013-01-22 at 17:51 -0800, alex23 wrote:
> > > I don't think we should start adding support for every malformed
> > > type of csv file that exists. It's easy enough to remove the
> > > unnecessary lines yourself before passing them to DictReader:
> > >
> > >     from csv import DictReader
> > >
> > >     with open('malformed.csv','rb') as csvfile:
> > >         csvlines = list(l for l in csvfile if l.strip())
> > >         csvreader = DictReader(csvlines)
> > >
> > > Personally, if I was dealing with this as often as you are, I'd
> > > probably make a custom context manager instead. The problem lies
> > > in the files themselves, not in csv's response to them.
> >
> > With all due respect, while you make a good point that we don't
> > want to start special casing every malformed type of CSV, there is
> > absolutely something wrong with DictReader's response to files that
> > have duplicate headers. It throws away data silently.
> >
>
> That's how Python dictionaries work, by design:
> d = {'a': 1, 'a': 2}
> "silently" discards the first value.

It's still rather surprising (and, in many cases, undesired). I would
suggest adding a parameter to DictReader to raise an exception when
there are duplicate column headers.

Regards

Antoine.

Yuval Greenfield

Jan 23, 2013, 12:26:07 PM
to Antoine Pitrou, python-ideas
On Wed, Jan 23, 2013 at 7:15 PM, Antoine Pitrou <soli...@pitrou.net> wrote:
> It's still rather surprising (and, in many cases, undesired). I would
> suggest adding a parameter to DictReader to raise an exception when
> there are duplicate column headers.
>
> Regards
>
> Antoine.


Completely agree, it's a big surprise and a quiet bug. 

This is one of those changes we should remember for python 4.0. Until 4.0, give an option to raise an exception upon duplicates. After 4.0 throw an exception on duplicate headers by default with an option to ignore them.

Yuval


J. Cliff Dyer

Jan 23, 2013, 12:37:01 PM
to python...@python.org
Whether it's a subclass or a change to the existing class is worth
having a discussion about. Obviously, the change could be made in a
subclass. Currently, that's what I do. The question at issue is
whether it should be made in the original. My position is that
something should change in the standard library, whether that is
modifying the code in some way to handle edge cases more robustly, or
updating the documentation to advise programmers on how to handle files
that aren't perfectly formed.

This might include documenting that self.reader is an available
attribute (where the programmer could iterate to find the header row
they're looking for, if needed, and then assign it to self.fieldnames).
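
That approach can be sketched as follows. It leans on DictReader.reader, which is not documented, so treat it as an assumption about the implementation rather than supported API. Note that fieldnames must not be read before the loop, or the first row would be consumed as the header.

```python
import csv
import io

# io.StringIO stands in for an open file with junk lines before the header.
source = io.StringIO(",,\n\nA,B,C\n1,2,3\n2,4,6\n")
dict_reader = csv.DictReader(source)

# Walk the underlying csv.reader until a row with at least one
# non-empty field appears, then use that row as the header.
for candidate in dict_reader.reader:
    if any(candidate):
        dict_reader.fieldnames = candidate
        break

rows = [dict(row) for row in dict_reader]
print(rows)
```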

I do like the idea of assigning the fieldnames variable and then raising
the ValueError, so if the user silences the exception, they still have
access to the field names found. However, I think the behavior should
be overridden on the fieldnames property, so as not to change the
semantics of the DictReader.
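
One possible reading of that suggestion is sketched below. The class name CheckedDictReader and the _checked flag are hypothetical; the sketch assumes the base class stores the header in _fieldnames, as the subclass earlier in this thread does. The header is stored first and the ValueError is raised only once, so a caller who catches it can carry on with the names that were found.

```python
import csv
import io


class CheckedDictReader(csv.DictReader):
    """Hypothetical subclass: complains once about duplicate header names."""

    _checked = False

    @property
    def fieldnames(self):
        names = super().fieldnames  # base getter reads and stores the header
        if names is not None and not self._checked:
            self._checked = True  # raise at most once; the names stay set
            if len(names) != len(set(names)):
                raise ValueError("Duplicate field names", names)
        return names

    @fieldnames.setter
    def fieldnames(self, value):
        self._fieldnames = value


reader = CheckedDictReader(io.StringIO("A,B,A\n1,2,3\n"))
caught = None
try:
    reader.fieldnames
except ValueError as exc:
    caught = exc  # the caller chooses to silence the error

# Later accesses no longer raise, so iteration proceeds with the usual
# "last column wins" behaviour.
rows = [dict(row) for row in reader]
print(caught, rows)
```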

Mark Hackett

Jan 23, 2013, 1:24:07 PM
to python...@python.org
On Wednesday 23 Jan 2013, J. Cliff Dyer wrote:
>
> Whether it's a subclass or a change to the existing class is worth
> having a discussion about. Obviously, the change could be made in a
> subclass. Currently, that's what I do. The question at issue is
> whether it should be made in the original. My position is that
> something should change in the standard library, whether that is
> modifying the code in some way to handle edge cases more robustly, or
> updating the documentation to advise programmers on how to handle files
> that aren't perfectly formed.
>

It looks entirely like a format checking on something that doesn't necessarily
have a format.

It therefore belongs in something else. I.e. you define your "csv schema", pass
it on to something that creates a "lint check" on the entire bytestream and/or
checks each input as read, and passed in like any decoration on a base
function in python.

CSV format checking isn't, IMO any different than the socket service decorators
that embed policy on the base function.

Bruce Leban

Jan 23, 2013, 1:20:21 PM
to Antoine Pitrou, python...@python.org

On Wed, Jan 23, 2013 at 9:15 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
> > That's how Python dictionaries work, by design:
> >     d = {'a': 1, 'a': 2}
> > "silently" discards the first value.
>
> It's still rather surprising (and, in many cases, undesired). I would
> suggest adding a parameter to DictReader to raise an exception when
> there are duplicate column headers.

If there are duplicate column headers, they are probably there for a reason. I can't imagine a case where the desired result is to discard one of the columns. If DictReader is going to recognize this case, perhaps:

A,B,A
1,2,3
4,5,6

would be better as 

{'A': [1,3], 'B': 2}
{'A': [4,6], 'B': 5}

I realize that sometimes getting a single value and sometimes an array is potentially messy, but bear in mind that in most cases the reader of the csv file has some idea of what they are reading. There could be an optional parameter multivalue="A" that lists the columns that are allowed to have multiple values and if not present it raises an exception. To allow any column to be multivalued, you could use multivalue=True.

As to skipping over a leading blank line, this happened to me just yesterday. I was saving some data in csv files and all the files ended up with an extra blank line at the top. I'd be +1 for skipping over a blank line at the top, +0 for skipping over more than one blank line.

Mark Hackett

Jan 23, 2013, 1:32:20 PM
to python...@python.org
On Wednesday 23 Jan 2013, Bruce Leban wrote:
> If there are duplicate column headers, they are probably there for a
> reason. I can't imagine a case where the desired result is to discard one
> of the columns. If DictReader is going to recognize this case, perhaps:
>

I can't see why there would be duplicate column headers for a valid reason.

Someone may have written their CSV export incorrectly, but that's not actually
valid.

It would therefore be arguable for the program to give at least a WARNING that
it's throwing data away.

However, since python is mechanising this as a dictionary, and since in python
setting A to 1 then setting A to 3 would throw away the earlier value for A,
the import function is working AS EXPECTED in Python.

Hence a decorator to insist on some formatting issues (e.g. turning A into a
list of values 1,3 rather than throwing away the 1 or the 3). To do otherwise
would have someone in the official library have to write their own format
conversion and shove it in the middle and telling people what they should be
doing.

Jerry Hill

Jan 23, 2013, 2:59:42 PM
to python...@python.org
On Wed, Jan 23, 2013 at 1:32 PM, Mark Hackett
<mark.h...@metoffice.gov.uk> wrote:
> I can't see why there would be duplicate column headers for valid reason.
>
> Someone may have written their CSV export incorrectly, but that's not actually
> valid.

Sure it is. Since there is no formal spec for .csv files, having
multiple columns with the same text in the header is a perfectly valid
.csv file. For what it's worth, the informal spec for csv files seems
to be "whatever Excel does" and Excel (and every other
spreadsheet-oriented program) is happy to let you have duplicated
headers too.

> It would therefore be arguable for the program to give at least a WARNING that
> it's throwing data away.

I think the library should give the programmer some sort of indication
that they are losing data. Personally, I'd prefer an exception which
can either be caught or not, depending on whether the program is
designed to handle the situation or not.

> However, since python is mechanising this as a dictionary and since in python
> setting A to 1 then setting A to 3 would throw away the earlier value for A
> and the import function working AS EXPECTED in Python.

I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
It's not terribly surprising once you sit down and think about it, but
it's certainly at least a little unexpected to me that data is being
thrown away with no notice. It's unusual for errors to pass silently
in python.

--
Jerry

J. Cliff Dyer

Jan 23, 2013, 4:13:54 PM
to Jerry Hill, python...@python.org
On Wed, 2013-01-23 at 14:59 -0500, Jerry Hill wrote:
> > However, since python is mechanising this as a dictionary and since
> in python
> > setting A to 1 then setting A to 3 would throw away the earlier
> value for A
> > and the import function working AS EXPECTED in Python.
>
> I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
> It's not terribly surprising once you sit down and think about it, but
> it's certainly at least a little unexpected to me that data is being
> thrown away with no notice. It's unusual for errors to pass silently
> in python.

Moreover, I think while it might be expected for a dict to do this, it
does not follow that a DictReader should be expected to silently throw
away the user's data. Just because it uses the dict format for storage
does not mean that it's okay to throw away user's data silently. Dicts
need to be blazingly fast for a host of reasons. DictReaders do not.
They're usually dealing with file input, so any slowness in the
DictReader itself is going to be dwarfed by the file access. As such we
can afford to be more programmer-friendly here.

Cheers,
Cliff

Yuval Greenfield

Jan 23, 2013, 4:54:05 PM
to J. Cliff Dyer, python-ideas
On Wed, Jan 23, 2013 at 11:13 PM, J. Cliff Dyer <j...@sdf.lonestar.org> wrote:

> Moreover, I think while it might be expected for a dict to do this, it
> does not follow that a DictReader should be expected to silently throw
> away the user's data.  Just because it uses the dict format for storage
> does not mean that it's okay to throw away user's data silently.  Dicts
> need to be blazingly fast for a host of reasons.  DictReaders do not.
> They're usually dealing with file input, so any slowness in the
> DictReader itself is going to be dwarfed by the file access.  As such we
> can afford to be more programmer-friendly here.

If it were a NamedTupleReader, this wouldn't be an issue.

>>> from collections import namedtuple
>>> namedtuple('x', 'a b a c')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    namedtuple('x', 'a b a c')
  File "C:\Python27\lib\collections.py", line 288, in namedtuple
    raise ValueError('Encountered duplicate field name: %r' % name)
ValueError: Encountered duplicate field name: 'a' 


Steven D'Aprano

Jan 23, 2013, 7:19:34 PM
to python...@python.org
On 24/01/13 05:20, Bruce Leban wrote:

> I realize that sometimes getting a single value and sometimes an array is
> potentially messy, but bear in mind that in most cases the reader of the
> csv file has some idea of what they are reading. There could be an optional
> parameter multivalue="A" that lists the columns that are allowed to have
> multiple values and if not present it raises an exception. To allow any
> column to be multivalued, you could use multivalue=True.

-1 to adding optional parameters that change the behaviour of a class.

To deal with cases where you expect multiple columns with the same name,
add a new reader class that treats all columns to be multi-valued. The
standard DictReader class should continue to behave like a dict.

Don't over-engineer this MultiDictReader -- it should stay simple and treat
all column names as potentially multivalued. If the caller has some
requirements for which names can have how many columns -- "there should be
exactly three columns named X, and only one Y, and at least four Z" -- they
can check the result and decide for themselves if there is a problem.
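
A reader along these lines could look roughly like the sketch below. The function name multi_dict_rows is hypothetical, and a plain generator stands in for a full MultiDictReader class to keep the example short. Every column name maps to a list, and values stay strings, as the csv module delivers them.

```python
import csv
import io


def multi_dict_rows(csvfile, **reader_kwargs):
    """Yield one dict per row, mapping each header name to a list of values."""
    reader = csv.reader(csvfile, **reader_kwargs)
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    for row in reader:
        if not row:  # skip blank rows, as DictReader does
            continue
        record = {}
        for name, value in zip(header, row):
            record.setdefault(name, []).append(value)
        yield record


rows = list(multi_dict_rows(io.StringIO("A,B,A\n1,2,3\n4,5,6\n")))
print(rows)
```

Because every value is a list, the caller can check len(record['A']) to enforce whatever column counts they expect.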


> As to skipping over a leading blank line, this happened to me just
> yesterday. I was saving some data in csv files and all the files ended up
> with an extra blank line at the top. I'd be +1 for skipping over a blank
> line at the top, +0 for skipping over more than one blank line.


I don't see any reason not to skip blank lines at the top of the file.



--
Steven

Steven D'Aprano

Jan 23, 2013, 7:26:52 PM
to python...@python.org
On 24/01/13 06:59, Jerry Hill wrote:
> On Wed, Jan 23, 2013 at 1:32 PM, Mark Hackett
> <mark.h...@metoffice.gov.uk> wrote:
>> I can't see why there would be duplicate column headers for valid reason.
>>
>> Someone may have written their CSV export incorrectly, but that's not actually
>> valid.
>
> Sure it is. Since there is no formal spec for .csv files, having a
> multiple columns with the same text in the header is a perfectly valid
> .csv file. For what it's worth, the informal spec for csv files seems
> to be "whatever Excel does" and Excel (and every other
> spreadsheet-oriented program) is happy to let you have duplicated
> headers too.

+1

I think keeping DictReader as it is now is fine for backward compatibility.
Or better, simply have DictReader raise an exception rather than silently
eat data. I don't expect that anyone is relying on that behaviour, nor is
it behaviour promised by the class.

But we should add a MultiDictReader that supports the multiple columns with
the same name.


>> It would therefore be arguable for the program to give at least a WARNING that
>> it's throwing data away.
>
> I think the library should give the programmer some sort of indication
> that they are losing data. Personally, I'd prefer an exception which
> can either be caught or not, depending on whether the program is
> designed to handle the situation or not.
>
>> However, since python is mechanising this as a dictionary and since in python
>> setting A to 1 then setting A to 3 would throw away the earlier value for A
>> and the import function working AS EXPECTED in Python.
>
> I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
> It's not terribly surprising once you sit down and think about it, but
> it's certainly at least a little unexpected to me that data is being
> thrown away with no notice. It's unusual for errors to pass silently
> in python.

Yes, we should not forget that a CSV file is not a dict. Just because DictReader
is implemented with a dict as the storage, doesn't mean that it should behave
exactly like a dict in all things. Multiple columns with the same name are legal
in CSV, so there should be a reader for that situation.



--
Steven

Amaury Forgeot d'Arc

Jan 24, 2013, 3:37:55 AM
to Steven D'Aprano, python...@python.org
2013/1/24 Steven D'Aprano <st...@pearwood.info>

> -1 to adding optional parameters that change the behaviour of a class.

Unfortunately there is a precedent with csv.DictWriter: extrasaction='raise' or 'ignore'.
And the feature is close to the one proposed here: how to deal with "invalid" data.
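
For reference, the existing DictWriter switch behaves as in the short sketch below (Python 3, in-memory buffers instead of files; the variable names are illustrative). The default is extrasaction='raise'.

```python
import csv
import io

record = {"A": 1, "B": 2, "C": 3}  # "C" is not among the declared fieldnames

# extrasaction='ignore': the unknown key is dropped without complaint.
buf = io.StringIO()
lenient = csv.DictWriter(buf, fieldnames=["A", "B"], extrasaction="ignore")
lenient.writeheader()
lenient.writerow(record)
written = buf.getvalue()

# extrasaction='raise' (the default): the unknown key is an error.
strict = csv.DictWriter(io.StringIO(), fieldnames=["A", "B"])
error = None
try:
    strict.writerow(record)
except ValueError as exc:
    error = exc

print(repr(written))
print(error)
```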

Mark Hackett

Jan 24, 2013, 5:33:01 AM
to python...@python.org
Just because you did wrong before doesn't mean you need to do it wrong again!

Mark Hackett

Jan 24, 2013, 5:41:41 AM
to python...@python.org
On Wednesday 23 Jan 2013, J. Cliff Dyer wrote:
>
> Moreover, I think while it might be expected for a dict to do this, it
> does not follow that a DictReader should be expected to silently throw
> away the user's data.
> Cheers,
> Cliff
>
>

Cliff, the name of the routine is "DictReader".

It is a very big hint.

Like I said, the situation here is putting formatting expectations on the file
being read in.

It's pretty identical with sockets or threading libraries in python. If you
want a specific action done that isn't "normal" for just "make one of them",
you put policy on it as a decoration. But if you wanted some specific action
and don't use the decorator to do so, you don't get an error, you get what you
get without the decorator.

Mark Hackett

Jan 24, 2013, 5:37:57 AM
to python...@python.org
On Wednesday 23 Jan 2013, Jerry Hill wrote:
> On Wed, Jan 23, 2013 at 1:32 PM, Mark Hackett
>
> <mark.h...@metoffice.gov.uk> wrote:
> > I can't see why there would be duplicate column headers for valid reason.
> >
> > Someone may have written their CSV export incorrectly, but that's not
> > actually valid.
>
> Sure it is. Since there is no formal spec for .csv files, having a
> multiple columns with the same text in the header is a perfectly valid
> .csv file. For what it's worth, the informal spec for csv files seems

Then you don't want it put in a dictionary, since a dictionary doesn't allow
duplicate fields.

> to be "whatever Excel does" and Excel (and every other
> spreadsheet-oriented program) is happy to let you have duplicated
> headers too.

You don't, in Excel, use the name of the column in your calculation, you use
the unique column ID (A, B, C..AA, AB, ...).

>
> > It would therefore be arguable for the program to give at least a WARNING
> > that it's throwing data away.
>
> I think the library should give the programmer some sort of indication
> that they are losing data. Personally, I'd prefer an exception which
> can either be caught or not, depending on whether the program is
> designed to handle the situation or not.
>
> > However, since python is mechanising this as a dictionary and since in
> > python setting A to 1 then setting A to 3 would throw away the earlier
> > value for A and the import function working AS EXPECTED in Python.
>
> I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
> It's not terribly surprising once you sit down and think about it, but
> it's certainly at least a little unexpected to me that data is being
> thrown away with no notice. It's unusual for errors to pass silently
> in python.
>

Python doesn't warn about duplicate addition to keys, so as expected, it isn't
warning about them now.

Programming languages are hard enough to understand (why does everyone use a
different way of stopping a loop???), so it's not a good idea to have little
codas to the way things are done "oh, unless you're putting it into a
dictionary via this call...".

I can understand the library call doing so, mind, but I can also see the
writer of the library going "You're putting it into a dictionary. Well, you
know what happens when you put duplicate entries in them, right, else you
wouldn't be using this routine that puts csv entries into a dictionary".

Mark Hackett

Jan 24, 2013, 5:47:17 AM
to python...@python.org
On Thursday 24 Jan 2013, Steven D'Aprano wrote:

> > I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
> > It's not terribly surprising once you sit down and think about it, but
> > it's certainly at least a little unexpected to me that data is being
> > thrown away with no notice. It's unusual for errors to pass silently
> > in python.
>
> Yes, we should not forget that a CSV file is not a dict. Just because
> DictReader is implemented with a dict as the storage, doesn't mean that it
> should behave exactly like a dict in all things. Multiple columns with the
> same name are legal in CSV, so there should be a reader for that
> situation.
>

But just because it's reading a csv file, we shouldn't change how a dictionary
works if you add the same key again.

Duplicate headings in a csv file are as legal as using the same name for
something else in a programming language.

e.g.

endvalue=a+b+c/5
...code using that result...
endvalue = os.printerr(file_descriptor)
...print out an error string...

this is "legal" but really REALLY smelly.

Similarly a multivalued csv file.

Excel uses the column ID not the name on the first row, to identify the columns
in its macro language. Because otherwise which "endvalue" column did you mean?

Shane Green

Jan 24, 2013, 6:55:05 AM
to Mark Hackett, python...@python.org
Not sure if I'm reading the discussion correctly, but it sounds like the
discussion is about whether to swallow CSV values when confronted with
multiple columns of the same name, which seems very incorrect if so.  CSV
doesn't even mandate that column headers exist at all, as far as I know.  If
anything I would think mapping column positions to header values would make
sense, such that header.items() -> [(0, header1), (1, header2), (2, header3),
etc.], and header1 and header2 could be equal.  To work with rows as
dictionaries they can follow the FieldStorage model and have lists of
values–either when there's a collision, or always–so all column values are
contained.

Nick Coghlan

Jan 24, 2013, 7:33:07 AM
to Shane Green, python...@python.org
On Thu, Jan 24, 2013 at 9:55 PM, Shane Green <sh...@umbrellacode.com> wrote:
> Not sure if I'm reading the discussion correctly, but it sounds like there's
> discussion about whether swallowing CSV values when confronted with multiple
> columns by the same name, which seems very incorrect if so. CSV doesn't
> even mandate column headers exist at all, as far as I know. If anything I
> would think mapping column positions to header values would make sense, such
> that header.items() -> [(0, header1), (1, header2), (2, header3), etc.], and
> header1 and header2 could be equal. To work with rows as dictionaries they
> can follow the FieldStorage model and have lists of values–either when
> there's a collision, or always–so all column values are contained.

That's not quite the discussion. The discussion is specifically about
*DictReader*, and whether it should:

1. Do any data conditioning by ignoring empty lines and lines of just
field delimiters before the header row (consensus seems to be "no")
2. Give an error when encountering a duplicate field name (which will
lead to data loss when reading from the file) (consensus seems to be
"yes")

The problem with the latter suggestion is that it's a backwards
incompatible change - code where "use the last column with that name"
is the correct behaviour currently works, but would be broken if that
situation was declared an error.

Rather than messing with DictReader, it seems more fruitful to further
investigate the idea of a namedtuple based reader
(http://bugs.python.org/issue1818). The "multiple columns with the
same name" use case seems specialised enough that the standard readers
can continue to ignore it (although, as noted earlier in this thread,
a namedtuple based reader will correctly reject duplicate column
names)
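
To illustrate that last point, here is a rough sketch of what a namedtuple based reader might do. It is not the implementation discussed in the linked issue; the function name namedtuple_rows and the default type name "Row" are hypothetical. It assumes the header cells are valid Python identifiers and that every row has as many fields as the header.

```python
import csv
import io
from collections import namedtuple


def namedtuple_rows(csvfile, typename="Row", **reader_kwargs):
    """Yield each data row as a namedtuple built from the header row."""
    reader = csv.reader(csvfile, **reader_kwargs)
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    # namedtuple() raises ValueError for duplicate or invalid field names.
    row_type = namedtuple(typename, header)
    for row in reader:
        if row:  # skip blank rows
            yield row_type(*row)


rows = list(namedtuple_rows(io.StringIO("A,B,C\n1,2,3\n2,4,6\n")))

duplicate_error = None
try:
    list(namedtuple_rows(io.StringIO("A,B,A\n1,2,3\n")))
except ValueError as exc:
    duplicate_error = exc  # the duplicate header is rejected up front

print(rows)
print(duplicate_error)
```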

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Antoine Pitrou

Jan 24, 2013, 7:38:58 AM
to python...@python.org
Le Thu, 24 Jan 2013 22:33:07 +1000,
Nick Coghlan <ncog...@gmail.com> a
écrit :

> On Thu, Jan 24, 2013 at 9:55 PM, Shane Green
> <sh...@umbrellacode.com> wrote:
> > Not sure if I'm reading the discussion correctly, but it sounds
> > like there's discussion about whether swallowing CSV values when
> > confronted with multiple columns by the same name, which seems very
> > incorrect if so. CSV doesn't even mandate column headers exist at
> > all, as far as I know. If anything I would think mapping column
> > positions to header values would make sense, such that
> > header.items() -> [(0, header1), (1, header2), (2, header3), etc.],
> > and header1 and header2 could be equal. To work with rows as
> > dictionaries they can follow the FieldStorage model and have lists
> > of values–either when there's a collision, or always–so all column
> > values are contained.
>
> That's not quite the discussion. The discussion is specifically about
> *DictReader*, and whether it should:
>
> 1. Do any data conditioning by ignoring empty lines and lines of just
> field delimiters before the header row (consensus seems to be "no")
> 2. Give an error when encountering a duplicate field name (which will
> lead to data loss when reading from the file) (consensus seems to be
> "yes")
>
> The problem with the latter suggestion is that it's a backwards
> incompatible change - code where "use the last column with that name"
> is the correct behaviour currently works, but would be broken if that
> situation was declared an error.

It's not really a problem if the new behaviour is conditioned by a
constructor argument.

Regards

Antoine.

J. Cliff Dyer

Jan 24, 2013, 10:11:34 AM
to Antoine Pitrou, python...@python.org
On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
> > 1. Do any data conditioning by ignoring empty lines and lines of
> > just field delimiters before the header row (consensus seems to be
> > "no")

Well, I wouldn't necessarily say we have a consensus on this one. This
idea received a +1 from Bruce Leban and an "I don't see any reason not
to" from Steven D'Aprano.

Objections are:

1. It's a backwards-incompatible change. (This could be mitigated in a
couple ways, as with the duplicate header problem, below). I don't think
anyone has argued that programmers would ever actually want to use the
blank line as the headers, only that they may be doing it now as a
workaround, and breaking the workarounds is undesirable.

2. You should pre-process the CSV instead of adapting the reader to
malformations. (In which case, I think the DictReader.reader attribute
should be better documented, so programmers have some guidance how to do
the pre-processing, as the current DictReader can cause data loss which
would make it difficult to recover the real headers without using the
underlying reader).


> > 2. Give an error when encountering a duplicate field name (which
> > will lead to data loss when reading from the file) (consensus seems
> > to be "yes")

Mostly, but with a strong objection from Mark Hackett, and hesitation
about altering current behavior from Amaury Forgeot d'Arc.

Proposals to solve this problem:

1. Raise an exception (After setting the fieldnames, I think, so if you
wanted to catch and continue or catch and edit the conflicting
fieldnames, you could do so).

2. Combine multiple fields with the same header into a list under the
same key.

2a. Make lists when there are multiple fields, but otherwise, key to
strings as is currently done

2b. For consistency, make all values lists, regardless of the number of
columns.
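
To make 2a and 2b concrete, here is a minimal sketch built on the plain csv.reader. The name iter_multidict_rows and the always_lists flag are made up for illustration; nothing like them exists in the csv module, and rows shorter or longer than the header are simply truncated by zip().

```python
import csv
import io


def iter_multidict_rows(lines, always_lists=False, **reader_kwargs):
    """Yield one dict per data row, keeping every value of a repeated header.

    Hypothetical helper, not part of the csv module.
    always_lists=False -> option 2a (lists only for repeated names)
    always_lists=True  -> option 2b (every value is a list)
    """
    reader = csv.reader(lines, **reader_kwargs)
    try:
        header = next(reader)
    except StopIteration:
        return  # empty input: no header, no rows
    repeated = {name for name in header if header.count(name) > 1}
    for row in reader:
        if not row:
            continue  # skip blank lines, as DictReader does for data rows
        record = {}
        for name, value in zip(header, row):
            if always_lists or name in repeated:
                record.setdefault(name, []).append(value)
            else:
                record[name] = value
        yield record


sample = "A,B,A\n1,2,3\n"
rows_2a = list(iter_multidict_rows(io.StringIO(sample)))
rows_2b = list(iter_multidict_rows(io.StringIO(sample), always_lists=True))
```

With 2a the caller has to check whether a value is a string or a list; 2b avoids that check at the cost of wrapping every value.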

Proposals for implementation:

1. Create a new Reader class. Suggestions include
"CarefulDictReader" (for the version that raises an exception) and
"MultiDictReader" (for the versions that make lists of values).

2. Add an option to DictReader. The idea to add an option for a
MultiDictReader-like behavior was objected to, but there were multiple
suggestions to add an option for raising an exception, in one case with
the idea that in the future ("Python 4") the option would be standard
behavior.


Note: If we were to implement a CarefulDictReader, it could, without
backward incompatibility, implement both skipping of blank header lines,
and exception raising on duplicate headers.
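
A rough sketch of what that combined behaviour could look like, written as a helper function rather than a new class. careful_dict_reader is a hypothetical name, and the choice of ValueError is an assumption, not something the thread settled on.

```python
import csv
import io
from collections import Counter


def careful_dict_reader(lines, **reader_kwargs):
    """Return a csv.DictReader whose header is the first non-blank row.

    Hypothetical helper: skips leading rows that are empty or contain only
    empty fields, then raises ValueError if any header name is repeated.
    """
    lines = iter(lines)  # share one iterator between both readers
    header = None
    for row in csv.reader(lines, **reader_kwargs):
        if any(row):
            header = row
            break
    if header is None:
        raise ValueError("no header row found")
    duplicates = sorted(name for name, n in Counter(header).items() if n > 1)
    if duplicates:
        raise ValueError("duplicate field names: %r" % duplicates)
    # csv.reader pulls lines on demand, so `lines` now sits just after the header.
    return csv.DictReader(lines, fieldnames=header, **reader_kwargs)


# "File 1" from the start of the thread: a delimiter-only line, then headers.
file_1 = ",,\nA,B,C\n1,2,3\n2,4,6\n"
rows = list(careful_dict_reader(io.StringIO(file_1)))
```

Because the header is passed in through the existing fieldnames argument, the stock DictReader is left untouched.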

Cheers,
Cliff

J. Cliff Dyer

Jan 24, 2013, 10:23:24 AM
to Nick Coghlan, python...@python.org
On Thu, 2013-01-24 at 22:33 +1000, Nick Coghlan wrote:
> The problem with the latter suggestion is that it's a backwards
> incompatible change - code where "use the last column with that name"
> is the correct behaviour currently works, but would be broken if that
> situation was declared an error.

One example where a programmer would legitimately want to ignore errors
of this kind: A CSV file has a number of named columns, and a few
unnamed ones, and the programmer doesn't care about data from the
unnamed columns. The unnamed columns all have the same name (''), and
would raise this exception. Hence the need to be able to suppress it
somehow (e.g., by instantiation argument or by catching the exception)
without losing the fieldnames.
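
For illustration, a duplicate check that can leave unnamed columns alone might look something like this. The function name and the ignore_empty flag are hypothetical.

```python
from collections import Counter


def duplicate_fieldnames(fieldnames, ignore_empty=True):
    """Return the sorted list of header names that occur more than once.

    Hypothetical helper. With ignore_empty=True, unnamed columns ('')
    are not reported, so a file with several unnamed columns passes.
    """
    counts = Counter(fieldnames)
    return sorted(
        name for name, n in counts.items()
        if n > 1 and not (ignore_empty and name == "")
    )


header = ["id", "", "name", "", "id"]
strict = duplicate_fieldnames(header, ignore_empty=False)  # reports '' and 'id'
lenient = duplicate_fieldnames(header)                     # reports only 'id'
```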

Cheers,
Cliff

Chris Angelico

Jan 24, 2013, 10:24:23 AM
to python-ideas
On Fri, Jan 25, 2013 at 2:11 AM, J. Cliff Dyer <j...@sdf.lonestar.org> wrote:
> On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
>> > 1. Do any data conditioning by ignoring empty lines and lines of
>> > just field delimiters before the header row (consensus seems to be
>> > "no")
>
> Well, I wouldn't necessarily say we have a consensus on this one. This
> idea received a +1 from Bruce Leban and an "I don't see any reason not
> to" from Steven D'Aprano.

I've been lurking this thread, but fwiw, I'd put +1 on ignoring empty
lines/just delimiter lines. For a row of column headers, a completely
blank line makes no sense. It's a backward-incompatible change, yes,
but I can't imagine any code actively relying on this. ISTM this would
probably be safe for a minor release (Python 3.4), though of course
not for Python 3.3.1.

ChrisA

Shane Green

Jan 24, 2013, 10:28:49 AM
to J. Cliff Dyer, Antoine Pitrou, python...@python.org
Since every form of CSV file counts EOL as a line terminator, I think discarding empty lines preceding the headers is arguably acceptable, but do not think discarding lines of just delimiters would be.  What about extending the DictReader API so it was easy to perform these actions explicitly, such as being able to discard() the field names to be re-evaluated on the next line?

Mark Hackett

Jan 24, 2013, 10:29:19 AM
to python...@python.org
On Thursday 24 Jan 2013, J. Cliff Dyer wrote:
> > > 2. Give an error when encountering a duplicate field name (which
> > > will lead to data loss when reading from the file) (consensus seems
> > > to be "yes")
>
> Mostly, but with a strong objection from Mark Hackett, and hesitation
> about altering current behavior from Amaury Forgeot d'Arc.
>


More along the lines of your earlier:

> 1. It's a backwards-incompatible change.

strong objection. :-)

Programs that had been working will stop. Programs that won't work because it
doesn't throw an exception yet are no worse off.

When you change something, you'll hear almost entirely from those for whom the
change will be useful. You don't hear from those who will find it an obstacle
until it's implemented.

Requiring catching an exception means that until the code is changed, your
working program no longer works.

And as you later point out Cliff, empty and uninteresting field names may
legitimately exist and WANT to be ignored.

So although I CAN see a reasoning for an exception, I do not see it as enough
to put it in this version of the library. It's a learning process and for the
next version which will need code changes to incorporate anyway, that
knowledge can be used to make things better *next time*.

J. Cliff Dyer

Jan 24, 2013, 10:55:17 AM
to Mark Hackett, python...@python.org
On Thu, 2013-01-24 at 15:29 +0000, Mark Hackett wrote:
> On Thursday 24 Jan 2013, J. Cliff Dyer wrote:
> > > > 2. Give an error when encountering a duplicate field name (which
> > > > will lead to data loss when reading from the file) (consensus seems
> > > > to be "yes")
> >
> > Mostly, but with a strong objection from Mark Hackett, and hesitation
> > about altering current behavior from Amaury Forgeot d'Arc.
> >
>
>
> More along the lines of your earlier:
>
> > 1. It's a backwards-incompatible change.
>
> strong objection. :-)
>
> Programs that had been working will stop. Programs that won't work because it
> doesn't throw an exception yet are no worse off.
>

Noted. I will say that this doesn't seem any worse than any other
backwards-incompatible change, which are sometimes allowed, so it should
probably be considered by the same standard.

That said, what are your feelings on adding a CarefulDictReader?

J. Cliff Dyer

Jan 24, 2013, 11:08:16 AM
to Shane Green, Antoine Pitrou, python...@python.org
On Thu, 2013-01-24 at 07:28 -0800, Shane Green wrote:
> Since every form of CSV file counts EOL as a line terminator, I think
> discarding empty lines preceding the headers is arguably acceptable,
> but do not think discarding lines of just delimiters would be. What
> about extending the DictReader API so it was easy to perform these
> actions explicitly, such as being able to discard() the field names to
> be re-evaluated on the next line?

I think I like this idea. There's something a little distasteful about
making the user manually delve into the underlying reader, but this
makes it more user-friendly and more obvious how to proceed.

For clarity's sake, what is your objection to discarding lines of
delimiters? The reason I suggest doing it is that it is a common output
situation when exporting Excel files or LibreCalc files that have a
blank row at the top.

Yuval Greenfield

Jan 24, 2013, 11:08:34 AM
to J. Cliff Dyer, Antoine Pitrou, python-ideas
On Thu, Jan 24, 2013 at 5:11 PM, J. Cliff Dyer <j...@sdf.lonestar.org> wrote:
> On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
> > > 1. Do any data conditioning by ignoring empty lines and lines of
> > > just field delimiters before the header row (consensus seems to be
> > > "no")
>
> Well, I wouldn't necessarily say we have a consensus on this one. This
> idea received a +1 from Bruce Leban and an "I don't see any reason not
> to" from Steven D'Aprano.


Count me in that list as well.

If it were urllib handling a special case for a server you don't control then fine. But it's a valid CSV file you can process yourself if you need more control. We should keep DictReader simple. This is also a reason against "CarefulDictReader". If you need to be more specific then use csv.reader.
 

> > > 2. Give an error when encountering a duplicate field name (which
> > > will lead to data loss when reading from the file) (consensus seems
> > > to be "yes")
>
> Mostly, but with a strong objection from Mark Hackett, and hesitation
> about altering current behavior from Amaury Forgeot d'Arc.

In that one too.

Maybe we should ask the people on this list http://hg.python.org/cpython/log/5b02d622d625/Lib/csv.py

Yuval

Yuval Greenfield

Jan 24, 2013, 11:09:28 AM
to J. Cliff Dyer, Antoine Pitrou, python-ideas
To clarify - I agree with the aforementioned "consensus".

MRAB

Jan 24, 2013, 11:12:09 AM
to python-ideas
On 2013-01-24 15:24, Chris Angelico wrote:
> On Fri, Jan 25, 2013 at 2:11 AM, J. Cliff Dyer <j...@sdf.lonestar.org> wrote:
>> On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
>>> > 1. Do any data conditioning by ignoring empty lines and lines of
>>> > just field delimiters before the header row (consensus seems to be
>>> > "no")
>>
>> Well, I wouldn't necessarily say we have a consensus on this one. This
>> idea received a +1 from Bruce Leban and an "I don't see any reason not
>> to" from Steven D'Aprano.
>
> I've been lurking this thread, but fwiw, I'd put +1 on ignoring empty
> lines/just delimiter lines. For a row of column headers, a completely
> blank line makes no sense. It's a backward-incompatible change, yes,
> but I can't imagine any code actively relying on this. ISTM this would
> probably be safe for a minor release (Python 3.4), though of course
> not for Python 3.3.1.
>
Ignoring empty lines before a header seems OK to me, but ignoring
just-delimiter lines doesn't.

To me, a just-delimiter line where it's expecting a header would mean
that all of the columns are unnamed, unless we insist that it's not a
header unless at least one column is named, and I don't think that that
should be the default behaviour.

As for duplicated columns names, I think that it should probably raise
an exception unless you've specified that duplicates should be put into
a list.

Mark Hackett

Jan 24, 2013, 11:23:50 AM
to python...@python.org
On Thursday 24 Jan 2013, J. Cliff Dyer wrote:
> For clarity's sake, what is your objection to discarding lines of
> delimiters? The reason I suggest doing it is that it is a common output
> situation when exporting Excel files or LibreCalc files that have a
> blank row at the top.
>
> Cheers,
> Cliff
>

I'm putting too many pennies in this pot, I feel, but...

What was the purpose of those blank lines? Like duplicate column names at the
first row, what you need to do with them depends on why they are there and what
the program using the output wants to do.

If someone took the repository of macros from the spreadsheet which used
column numbers and this was used to recreate EXACTLY whatever calculations
were done without having to keep two copies of the same algorithm to account
for the dropping of rows in the script, then dropping the rows would break
this.

This really is policy (wrt the source of the CSV and the consumer of the
dictionary).

Make it a pre process of the CSV to be used and configured to fit what the
meaning of the CSV file output was to the producing program and what bits of it
make a difference to the consumer of the dictionary's contents.

J. Cliff Dyer

Jan 24, 2013, 11:40:09 AM
to Mark Hackett, python...@python.org
On Thu, 2013-01-24 at 16:17 +0000, Mark Hackett wrote:
> >
> > That said, what are your feelings on adding a CarefulDictReader?
> >
>
> It's as good a solution to me as any.
>
> However, I'm not that good a programmer, and therefore what *I'd* do
> isn't
> necessarily a good idea, it's just one of the better ones out of the
> limited
> toolbox I have available.
>
> I'd prefer (for aesthetic reasons) some sort of stream converter. Much
> like
> freeze/thaw serialisation of data, it'd be a step between the raw csv
> and the
> reader that reads it.
>
>

I think my reason for wanting to have a CarefulDictReader (or a careful
DictReader), and why I think a stream converter isn't the best solution,
is that CSVs are very commonly used by people just starting to get their
feet wet with programming. Consider the use case: I've got my excel
file, and I'm just getting to the point where excel isn't cutting it
anymore. I want to start manipulating my data with python, and everyone
is telling me to use the csv library. DictReader sounds cool, because I
don't want to have to remember column numbers, and this is going to make my
code much more readable. But I can't make it read my headers simply
because I put some blank space at the top of my excel file, above my
headers.

A stream converter is another layer of complexity that keeps this
potential new programmer from having a good experience with programming,
for what gain? So that the csv library can "properly" (?) treat a line
without data as a header? I think it would be fully reasonable (and add
little to no complexity to the code) to have a DictReader that treats
the first non-empty line as the header row.

The csv module is one of the big gateways into python programming for a
lot of people. That's also one of the reasons I think the sockets
library is a poor analogue here. A new programmer is unlikely to reach
the sockets library until they've been through a few of the urllibs, the
httplibs, requests, some part of http or an external web framework,
smtplib, or some other higher-level networking-related libraries.

For the same reason, I think if the solution isn't something handled
automatically by the library, it needs to be accompanied by improvements
to the documentation. If we're going to provide a DictReader that is
this easy to break, we need to answer the question: How do I fix it?


Cheers,
Cliff

Shane Green

Jan 24, 2013, 11:41:40 AM
to J. Cliff Dyer, Antoine Pitrou, python...@python.org
Well, my objection to doing it automatically was based in part on not being familiar with the common scenarios you've brought up, but the other reasons I had in mind were that it seemed like the kind of thing that might also be indicative of an error–something wrong with the data someone might want to know was happening rather than have masked; and also because discarding such rows leaves a question about the delimiter: it's now known, but knowing it based on rows we've discarded seems unclean.  

J. Cliff Dyer

Jan 24, 2013, 11:41:07 AM
to Mark Hackett, python...@python.org
On Thu, 2013-01-24 at 16:23 +0000, Mark Hackett wrote:

> If someone took the repository of macros from the spreadsheet which used
> column numbers and this was used to recreate EXACTLY whatever calculations
> were done without having to keep two copies of the same algorithm to account
> for the dropping of rows in the script, then dropping the rows would break
> this.
>

If that's the case, then why are you using a DictReader instead of a raw
csv.reader? You're already losing the first row.

Serhiy Storchaka

Jan 24, 2013, 3:35:14 PM
to python...@python.org
On 23.01.13 03:51, alex23 wrote:
> with open('malformed.csv','rb') as csvfile:
>     csvlines = list(l for l in csvfile if l.strip())
>     csvreader = DictReader(csvlines)

csvreader = DictReader(l for l in csvfile if l.strip())

Steven D'Aprano

Jan 24, 2013, 6:15:14 PM
to python...@python.org
On 25/01/13 03:08, J. Cliff Dyer wrote:
> On Thu, 2013-01-24 at 07:28 -0800, Shane Green wrote:
>> Since every form of CSV file counts EOL as a line terminator, I think
>> discarding empty lines preceding the headers is arguably acceptable,
>> but do not think discarding lines of just delimiters would be. What
>> about extending the DictReader API so it was easy to perform these
>> actions explicitly, such as being able to discard() the field names to
>> be re-evaluated on the next line?
>
> I think I like this idea. There's something a little distasteful about
> making the user manually delve into the underlying reader, but this
> makes it more user-friendly and more obvious how to proceed.

I couldn't disagree more. I think:

- it adds burden to the caller, since the caller is now expected to manually
inspect the field names and decide whether some should be discarded;

- it is less obvious: *how* does the caller decide that there are too many
field names?

- incomplete: if there is a discard(), where is the add()?

- completely irrelevant for the topic being discussed ("DictReader should
ignore leading blank lines... I know, let's give the caller the ability
to *discard* field names" -- but auto-detecting *too many* field names is
not the problem);

- and being able to change the field names on the fly is so far beyond
anything required for ordinary CSV that it doesn't belong in the CSV
module.


> For clarity's sake, what is your objection to discarding lines of
> delimiters? The reason I suggest doing it is that it is a common output
> situation when exporting Excel files or LibreCalc files that have a
> blank row at the top.


A row of delimiters should be treated by the reader object as a row with
explicitly empty fields. If the caller wishes to discard them, they can.
But the reader object shouldn't make that decision.

An empty row, on the other hand, should be just ignored. DictReader *already*
ignores empty rows, provided that they are not in the first row.
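
The distinction is visible at the level of the underlying csv.reader, which is a quick way to check what each kind of line turns into. A small sketch (the variable names are just for illustration):

```python
import csv

# An empty line, a delimiter-only line, and a normal header line.
lines = ["", ",,", "A,B,C"]
rows = list(csv.reader(lines))

# The empty line parses to an empty list; the delimiter-only line parses
# to a list of three empty strings, i.e. three explicitly empty fields.
blank_rows = [row for row in rows if not row]
delimiter_only_rows = [row for row in rows if row and not any(row)]
```

So "not row" picks out truly empty lines, while "not any(row)" also catches rows whose fields are all empty.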



--
Steven

Steven D'Aprano

Jan 24, 2013, 6:53:51 PM
to python...@python.org
On 25/01/13 02:11, J. Cliff Dyer wrote:
> On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
>>> 1. Do any data conditioning by ignoring empty lines and lines of
>>> just field delimiters before the header row (consensus seems to be
>>> "no")
>
> Well, I wouldn't necessarily say we have a consensus on this one. This
> idea received a +1 from Bruce Leban and an "I don't see any reason not
> to" from Steven D'Aprano.
>
> Objections are:
>
> 1. It's a backwards-incompatible change.

All bug fixes are backwards-incompatible changes. The question is, is
there anyone relying on this behaviour?

DictReader already ignores blank lines, *except for the very first line*.
Using Python 3.3:

py> from io import StringIO
py> from csv import DictReader
py> data = StringIO('spam,ham,eggs\n\n\n\n1,2,3\n\n\n\n\n4,5,6\n')
py> x = DictReader(data)
py> next(x)
{'eggs': '3', 'ham': '2', 'spam': '1'}
py> next(x)
{'eggs': '6', 'ham': '5', 'spam': '4'}


I don't expect that there is anyone relying on a CSV file with a leading
blank line to be treated as one having no columns at all:

py> data = StringIO('\n\n\n\nspam,ham,eggs\n1,2,3\n4,5,6\n')
py> x = DictReader(data)
py> next(x)
{None: ['spam', 'ham', 'eggs']}
py> x.fieldnames
[]


I expect that there is probably code that works around this issue, by
skipping blank lines somehow, e.g.

DictReader(row for row in data if row.strip())

These work-arounds may (or not) be fragile or buggy, but they ought
to continue working even if DictReader changes its header detection.



--
Steven

Shane Green

Jan 24, 2013, 7:05:43 PM
to Steven D'Aprano, python...@python.org
If this is part of the same response…

> A row of delimiters should be treated by the reader object as a row with
> explicitly empty fields. If the caller wishes to discard them, they can.
> But the reader object shouldn't make that decision.
>
> An empty row, on the other hand, should be just ignored. DictReader *already*
> ignores empty rows, provided that they are not in the first row.

Then I think my description was unclear. I wasn't suggesting we add methods for manipulating individual headers, only for telling the DictReader to drop existing headers and reevaluate them on the next row. To make it easy to do something like

while not any(records.fieldnames):
    records.discard_fieldnames() # or something to that effect…

without changing any existing behaviour.
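
For what it's worth, something close to this can be sketched today as a small subclass. discard_fieldnames() is a hypothetical method, and the sketch assumes CPython's current DictReader behaviour of reading a fresh header row whenever the stored fieldnames are None, which the docs do not promise.

```python
import csv
import io


class DiscardingDictReader(csv.DictReader):
    """DictReader with a hypothetical discard_fieldnames() method."""

    def discard_fieldnames(self):
        # Assigning None relies on CPython's DictReader re-reading the
        # header from the next row whenever the stored names are None.
        self.fieldnames = None


data = io.StringIO("\n,,\nA,B,C\n1,2,3\n")
records = DiscardingDictReader(data)
# fieldnames stays None once the input is exhausted, hence the extra check.
while records.fieldnames is not None and not any(records.fieldnames):
    records.discard_fieldnames()
first = next(records)
```

The loop discards both the empty line and the delimiter-only line, so the decision about which rows count as headers stays with the caller.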

alex23

Jan 24, 2013, 8:49:53 PM
to python...@python.org
On 25 Jan, 06:35, Serhiy Storchaka <storch...@gmail.com> wrote:
> On 23.01.13 03:51, alex23 wrote:
>
> >      with open('malformed.csv','rb') as csvfile:
> >          csvlines = list(l for l in csvfile if l.strip())
> >          csvreader = DictReader(csvlines)
>
> csvreader = DictReader(l for l in csvfile if l.strip())

Uh, thanks, although I'm not sure what you think you're showing me
that I'm not already aware of. I spelled it out as two separate
expressions for clarity, I didn't realise we were playing code golf in
our examples.

Stephen J. Turnbull

Jan 24, 2013, 9:38:30 PM
to Steven D'Aprano, python...@python.org
Steven D'Aprano writes:

> - it adds burden to the caller, since the caller is now expected to
> manually inspect the field names and decide whether some should
> be discarded;

It's a dirty job but somebody has to do it.

And that ultimately has to be the *writer* of the CSV file, not the
reader. Both csv.DictReader and the caller are merely guessing unless
there's a private agreement with the writer. csv.DictReader, as a
stdlib module, can't know about that agreement. The caller can
(although one obvious use case for csv.DictReader is that the caller
doesn't and is hoping csv.DictReader can guess better, oops).

Unless somebody has figured out how to give stdlib code "channeling"
capability?

Ethan Furman

Jan 24, 2013, 10:20:23 PM
to python...@python.org
On 01/24/2013 02:47 AM, Mark Hackett wrote:
> On Thursday 24 Jan 2013, Steven D'Aprano wrote:
>
>>> I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
>>> It's not terribly surprising once you sit down and think about it, but
>>> it's certainly at least a little unexpected to me that data is being
>>> thrown away with no notice. It's unusual for errors to pass silently
>>> in python.
>>
>> Yes, we should not forget that a CSV file is not a dict. Just because
>> DictReader is implemented with a dict as the storage, doesn't mean that it
>> should behave exactly like a dict in all things. Multiple columns with the
>> same name are legal in CSV, so there should be a reader for that
>> situation.
>>
>
> But just because it's reading a csv file, we shouldn't change how a dictionary
> works if you add the same key again.

The proposal is not to change how a dict works, but what the proper
response is for DictReader when a duplicate key is found.

~Ethan~

Ethan Furman

Jan 24, 2013, 10:25:38 PM
to python...@python.org
On 01/22/2013 05:06 PM, J. Cliff Dyer wrote:

> Thoughts? Do folks think this is worth adding to the csv library, or
> should I just keep using my subclass?

+1 for ignoring blank lines (including delimiter-only lines)

+1 for raising an exception on duplicate headers

+1 for a flag to not raise on duplicate empty headers (but a completely
empty header line is still ignored)

Terry Reedy

Jan 24, 2013, 11:26:19 PM
to python...@python.org
On 1/24/2013 6:53 PM, Steven D'Aprano wrote:

> DictReader already ignores blank lines, *except for the very first line*.

Interesting. A proper csv file does not contain blank lines. The csv doc
is silent on what it does when they are present. (The word 'blank' does not
appear.) Ignoring them seems reasonable, but then all should be ignored.
And the doc should say so.

> Using Python 3.3:
>
> py> from io import StringIO
> py> from csv import DictReader
> py> data = StringIO('spam,ham,eggs\n\n\n\n1,2,3\n\n\n\n\n4,5,6\n')
> py> x = DictReader(data)
> py> next(x)
> {'eggs': '3', 'ham': '2', 'spam': '1'}
> py> next(x)
> {'eggs': '6', 'ham': '5', 'spam': '4'}
>
>
> I don't expect that there is anyone relying on a CSV file with a leading
> blank line to be treated as one having no columns at all:
>
> py> data = StringIO('\n\n\n\nspam,ham,eggs\n1,2,3\n4,5,6\n')
> py> x = DictReader(data)
> py> next(x)
> {None: ['spam', 'ham', 'eggs']}
> py> x.fieldnames
> []
>
>
> I expect that there is probably code that works around this issue, by
> skipping blank lines somehow, e.g.
>
> DictReader(row for row in data if row.strip())
>
> These work-arounds may (or not) be fragile or buggy, but they ought
> to continue working even if DictReader changes its header detection.

--
Terry Jan Reedy

Serhiy Storchaka

Jan 25, 2013, 5:01:08 AM
to python...@python.org
On 25.01.13 03:49, alex23 wrote:
> On 25 Jan, 06:35, Serhiy Storchaka <storch...@gmail.com> wrote:
>> csvreader = DictReader(l for l in csvfile if l.strip())
>
> Uh, thanks, although I'm not sure what you think you're showing me
> that I'm not already aware of. I spelled it out as two separate
> expressions for clarity, I didn't realise we were playing code golf in
> our examples.

My point is that you don't need to read the whole file into memory. You can
use an iterator and process it line by line.
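
A self-contained version of the lazy approach, as a sketch. io.StringIO stands in for the open file, and non_blank_lines is a made-up helper name. Note that this filter only drops lines that are empty or whitespace; delimiter-only lines such as ",," pass through, and a blank line inside a quoted multi-line field would be dropped as well.

```python
import csv
import io


def non_blank_lines(lines):
    """Lazily yield only the lines that contain something besides whitespace."""
    for line in lines:
        if line.strip():
            yield line


# io.StringIO stands in for an open text file; with a real file you would
# use open(path, newline="") as the csv docs recommend.
csvfile = io.StringIO("\n\nA,B,C\n1,2,3\n\n2,4,6\n")
rows = list(csv.DictReader(non_blank_lines(csvfile)))
```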

Mark Hackett

Jan 25, 2013, 5:58:28 AM
to python...@python.org
On Friday 25 Jan 2013, Ethan Furman wrote:
> On 01/24/2013 02:47 AM, Mark Hackett wrote:
> > On Thursday 24 Jan 2013, Steven D'Aprano wrote:
> >>> I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
> >>> It's not terribly surprising once you sit down and think about it, but
> >>> it's certainly at least a little unexpected to me that data is being
> >>> thrown away with no notice. It's unusual for errors to pass silently
> >>> in python.
> >>
> >> Yes, we should not forget that a CSV file is not a dict. Just because
> >> DictReader is implemented with a dict as the storage, doesn't mean
> >> that it should behave exactly like a dict in all things. Multiple
> >> columns with the same name are legal in CSV, so there should be a reader
> >> for that situation.
> >
> > But just because it's reading a csv file, we shouldn't change how a
> > dictionary works if you add the same key again.
>
> The proposal is not to change how a dict works, but what the proper
> response is for DictReader when a duplicate key is found.
>

Ethan, the proposal is predicated on the "silent abandonment" (which isn't
actually the case any more than doing:

a=4
a=9

is abandoning silently the 4.) being unexpected.

Except, just like the assignment in the aside above, this is entirely what IS
expected if you're putting a CSV line into a dictionary with duplicate key
names.

If you don't want it to do what a dictionary does, then don't use DictReader,
as Chris proposes.

My only niggle with that idea is that you'd be writing a lot of "SumptyReader"s,
one for each case, which is redundant. But that may, in practice, be no problem
at all.

If you didn't want it to do what a dict does, don't use a dict.

Mark Hackett

Jan 25, 2013, 6:00:31 AM
to python...@python.org
On Thursday 24 Jan 2013, Steven D'Aprano wrote:
> - it is less obvious: how does the caller decide that there are too many
> field names?
>

Additionally, the user of the library now has to read much more about the
library (either code or documentation, which has to track the code too), to
decide what it is going to do.

If you have to read the code, then it's not really OO, is it. It's light grey,
not black box.

Ethan Furman

Jan 25, 2013, 11:30:25 AM
to python...@python.org
On 01/25/2013 03:00 AM, Mark Hackett wrote:
> On Thursday 24 Jan 2013, Steven D'Aprano wrote:
>> - it is less obvious: how does the caller decide that there are too many
>> field names?
>>
>
> Additionally, the user of the library now has to read much more about the
> library (either code or documentation, which has to track the code too), to
> decide what it is going to do.
>
> If you have to read the code, then it's not really OO, is it. It's light grey,
> not black box.

If you have to read the code, the documentation needs improvement.

~Ethan~

Mark Hackett

Jan 25, 2013, 11:53:46 AM
to python...@python.org
On Friday 25 Jan 2013, Ethan Furman wrote:
> On 01/25/2013 03:00 AM, Mark Hackett wrote:
> > On Thursday 24 Jan 2013, Steven D'Aprano wrote:
> >> - it is less obvious: how does the caller decide that there are too many
> >> field names?
> >
> > Additionally, the user of the library now has to read much more about the
> > library (either code or documentation, which has to track the code too),
> > to decide what it is going to do.
> >
> > If you have to read the code, then it's not really OO, is it. It's light
> > grey, not black box.
>
> If you have to read the code, the documentation needs improvement.
>

And if you put your feet too close to the fire, your feet will burn.

Neither have anything to do with the subject at hand, however.

Which is if a dictionary acts a certain way and calling a routine that creates
a dictionary AND WORKS DIFFERENTLY, then why did you use a routine that
creates a dictionary?

You see, the option here is to leave it operating as a dictionary operates.
And in that case, you do not need to document anything. The documentation of
how it works is already covered by the python basics: "How does a dictionary
work in Python?".

So don't change it, and you don't have to improve the documentation.

Ethan Furman

Jan 25, 2013, 11:48:43 AM
to python...@python.org
On 01/25/2013 02:58 AM, Mark Hackett wrote:
> On Friday 25 Jan 2013, Ethan Furman wrote:
>> On 01/24/2013 02:47 AM, Mark Hackett wrote:
>>> On Thursday 24 Jan 2013, Steven D'Aprano wrote:
>>>>> I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
>>>>> It's not terribly surprising once you sit down and think about it, but
>>>>> it's certainly at least a little unexpected to me that data is being
>>>>> thrown away with no notice. It's unusual for errors to pass silently
>>>>> in python.
>>>>
>>>> Yes, we should not forget that a CSV file is not a dict. Just because
>>>> DictReader is implemented with a dict as the storage, doesn't mean
>>>> that it should behave exactly like a dict in all things. Multiple
>>>> columns with the same name are legal in CSV, so there should be a reader
>>>> for that situation.
>>>
>>> But just because it's reading a csv file, we shouldn't change how a
>>> dictionary works if you add the same key again.
>>
>> The proposal is not to change how a dict works, but what the proper
>> response is for DictReader when a duplicate key is found.
>
> Ethan, the proposal is predicated on the "silent abandonment" (which isn't
> actually the case any more than doing:
>
> a=4
> a=9
>
> is abandoning silently the 4.) being unexpected.

We're going to have to agree to disagree on this point -- I think there
is a huge difference between reassigning a variable which is completely
under your control and losing entire columns of data from a file which
you may have never seen before.
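
To show the loss concretely, here is a small sketch with made-up data, using the duplicated 'item_no' column name from the example below:

```python
import csv
import io

data = io.StringIO("item_no,name,item_no\n100,widget,200\n")
reader = csv.DictReader(data)
row = next(reader)

# The header still lists three columns, but the row dict has only two keys;
# the value from the first item_no column ("100") is gone.
header_columns = len(reader.fieldnames)
row_columns = len(row)
```

Nothing is raised or logged along the way; comparing len(reader.fieldnames) with len(row) is about the only hint the caller gets.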


> Except, just like the assignment in the aside above, this is entirely what IS
> expected if you're putting a CSV line into a dictionary with duplicate key
> names.

Expected by whom? The library writer? Sure. The application writer?
Maybe. The person creating the spreadsheet that's going to be dumped to
csv and imported into the program, who thought, "This field also needs
an item number... I'll call it 'item_no', just like that other column"
-- Nope.


> If you don't want it to do what a dictionary does, then don't use DictReader,
> as Chris proposes.

DictReader puts a name on a column -- that's its primary use; I don't
think the designers had the goal of dropping data when they implemented
it -- I suspect it was just missed as a possibility (not being the
"normal" type of csv file) or putting a warning in the docs was missed.

~Ethan~

ru...@yahoo.com

Jan 25, 2013, 1:03:03 PM
to python...@googlegroups.com
On 01/25/2013 09:53 AM, Mark Hackett wrote:
> On Friday 25 Jan 2013, Ethan Furman wrote:
>> On 01/25/2013 03:00 AM, Mark Hackett wrote:
>> > On Thursday 24 Jan 2013, Steven D'Aprano wrote:
>> >> - it is less obvious: how does the caller decide that there are too many
>> >>     field names?
>> >
>> > Additionally, the user of the library now has to read much more about the
>> > library (either code or documentation, which has to track the code too),
>> > to decide what it is going to do.
>> >
>> > If you have to read the code, then it's not really OO, is it. It's light
>> > grey, not black box.
>>
>> If you have to read the code, the documentation needs improvement.
>>
>
> And if you put your feet too close to the fire, your feet will burn.
>
> Neither have anything to do with the subject at hand, however.
>
> Which is if a dictionary acts a certain way and calling a routine that creates
> a dictionary AND WORKS DIFFERENTLY, then why did you use a routine that
> creates a dictionary?
>
> You see, the option here is to leave it operating as a dictionary operates.
> And in that case, you do not need to document anything. The documentation of
> how it works is already covered by the python basics: "How does a dictionary
> work in Python?".

The csv DictReader *uses* a dictionary for its output. That
it does so imposes no requirements on how it should parse or
otherwise handle the input that eventually goes into that
dict.

I can understand the appeal of keeping things simple and
simply cramming whatever comes out of a simple parse of
the header into the dict keys.  Simplicity is good and
that is a valid opinion.  However it is not a-priori the
obviously best one no matter how much hand-waving and
foot stomping comes with it.

I would prefer to see a suppressible exception when header
keys are duplicated on the grounds that such a csv file
is not in general an appropriate input for the DictReader.
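
For illustration, a minimal sketch of what such a suppressible check might look like. The names `DuplicateHeaderError`, `checked_dict_reader` and the `allow_duplicates` flag are hypothetical and are not part of the csv module; only `csv.DictReader` and its `fieldnames` attribute are real.

```python
import csv
import io
from collections import Counter


class DuplicateHeaderError(ValueError):
    """Hypothetical exception; not part of the csv module."""


def checked_dict_reader(f, allow_duplicates=False, **kwargs):
    """Return a csv.DictReader, refusing duplicated header names by default."""
    reader = csv.DictReader(f, **kwargs)
    # Accessing .fieldnames reads the header row if it has not been read yet.
    names = reader.fieldnames or []
    dupes = sorted(name for name, count in Counter(names).items() if count > 1)
    if dupes and not allow_duplicates:
        raise DuplicateHeaderError("duplicate header names: %r" % (dupes,))
    return reader


data = "item_no,qty,item_no\n17,2,99\n"

try:
    checked_dict_reader(io.StringIO(data))
except DuplicateHeaderError as exc:
    print("refused:", exc)

# Suppressing the check falls back to today's behaviour: the last column wins.
rows = list(checked_dict_reader(io.StringIO(data), allow_duplicates=True))
print(rows)
```

A wrapper function is used here rather than a subclass so that nothing depends on DictReader internals.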


> So don't change it, and you don't have to improve the documentation.

If it's not changed then documentation definitely should
be fixed.  The very fact that when the behaviour was pointed
out here, the result was a long discussion rather than one
or two responses that said, "of course it behaves that way"
is the strongest evidence that the current description
is inadequate.

Shane Green

Jan 26, 2013, 2:43:11 AM1/26/13
to ru...@yahoo.com, python...@googlegroups.com
I've been trying to avoid the wrath, but can't any longer.  Let me start by clarifying that I know what a dictionary is, how it works, and what Python is, so we can bypass calling that into question.  I also know what CSV is, and I've dealt with a lot of real-life examples of CSV data: not just exports from Excel, but also log data from the energy management space, sensor values, and critical electrical fault data generated by very legacy, stupid equipment.  And while it's true that a dictionary is a dictionary and it works the way it works, the real point that drives home is that it's an inappropriate mechanism for dealing with ordered rows of sequential values.  Regardless of what choices were made for the implementation, if the module's name is csv, it should be able to do the things it says it does with any legal CSV content without losing information.  Just because it's how a dictionary works doesn't mean column 3's value replacing column 1's value is something other than the loss of data.  One CSV file I worked with had headers for five columns of information, then the header "VALUE" for every 5 minute period in an hour.  Using this CSV parser would leave the client with one sample an hour: how dictionaries work isn't going to bring back the other 11 values, so information was lost.  
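
To make the loss concrete, here is a small illustration. The data is invented, and only three VALUE columns are shown to keep it short.

```python
import csv
import io

# Invented sample: one named column followed by repeated "VALUE" headers.
data = "meter,VALUE,VALUE,VALUE\nm1,10,11,12\n"

# DictReader keeps a single entry per header name, so only the last VALUE survives.
dict_rows = list(csv.DictReader(io.StringIO(data)))
print(dict_rows[0]["VALUE"])

# csv.reader returns every field in order, so nothing is lost.
plain_rows = list(csv.reader(io.StringIO(data)))
print(plain_rows[1])
```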

The final point is a simple one: while that CSV file format was stupid, it was perfectly legal.  Something that deals with CSV content should not be losing any of its content.  It also should be barfing or throwing exceptions, by the way.  

Shane Green

Jan 26, 2013, 2:52:15 AM1/26/13
to ru...@yahoo.com, python...@googlegroups.com
And what about fixing it by implementing a class that does it correctly: one that maps values to column numbers and keeps values as lists, modeled after FieldStorage?  Make iterating it work just like it does now by replacing the values with the last value in each list before returning it, and provide iterator methods for getting at the new functionality, which includes iterating items with repeating header names in order, etc.; and also iter records, or something like that, to iterate the header: [value, …] maps?

Shane Green

Jan 26, 2013, 3:00:48 AM1/26/13
to python...@googlegroups.com
I love it when the single word I skip completely changes the sentence's meaning…

The final point is a simple one: while that CSV file format was stupid, it was perfectly legal.  Something that deals with CSV content should not be losing any of its content.  It also should not be barfing or throwing exceptions, by the way.  




Shane Green

Jan 26, 2013, 6:55:48 AM1/26/13
to python...@python.org
Sorry if this is a dupe–it went to the google groups address the first time around, and I think that's different…



Stephen J. Turnbull

Jan 26, 2013, 8:53:53 AM1/26/13
to Shane Green, python...@python.org
Shane Green writes:

> And while it's true that a dictionary is a dictionary and it works
> the way it works, the real point that drives home is that it's an
> inappropriate mechanism for dealing with ordered rows of sequential
> values.

Right! So use csv.reader, or csv.DictReader with an explicit
fieldnames argument.
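
A minimal sketch of the explicit-fieldnames route, for illustration. The replacement names `value_1` and `value_2` are made up by the caller; the `fieldnames` argument itself is a documented part of csv.DictReader.

```python
import csv
import io

data = "ts,VALUE,VALUE\n1359210019,10,11\n"
f = io.StringIO(data)

# Skip the file's own header row; with explicit fieldnames DictReader
# would otherwise treat it as the first data row.
next(f)

# Caller-chosen unique names (hypothetical) replace the duplicated headers.
reader = csv.DictReader(f, fieldnames=["ts", "value_1", "value_2"])
rows = list(reader)
print(rows[0])
```

This only works when the caller already knows the column layout, which is the restriction being discussed.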

The point of csv.DictReader with default fieldnames is to take a
"well-behaved" table and turn it into a sequence of "poor-man's"
objects.

> The final point is a simple one: while that CSV file format was
> stupid, it was perfectly legal. Something that deals with CSV
> content should not be losing any of its content.

That's a reasonable requirement.

> It also should [not] be barfing or throwing exceptions, by the way.

That's not. As long as the module provides classes capable of
handling any CSV format (it does), it may also provide convenience
classes for special purposes with restricted formats. Those classes
may throw exceptions on input that doesn't satisfy the restrictions.

> And what about fixing it by implementing a class that
> does it correctly, [...]?

Doesn't help users who want automatically detected access-by-name.
They must have unique field names. (I don't have a use case. I
assume the implementer of csv.DictReader did.<wink/>)

Shane Green

Jan 26, 2013, 9:39:11 AM1/26/13
to Stephen J. Turnbull, python...@python.org
Okay, I like your point about DictReader having a place with a subset of CSV tables, and agree that, given that definition, it should throw an exception when it's fed something that doesn't conform to this definition.  I like that.

One thing, though, the new version would let you access column data by name as well: 

Instead of
row["timestamp"] == 1359210019.299478

It would be
row["timestamp"] == [1359210019.299478]

And potentially 
row["timestamp"] == [1359210019.299478,1359210019.299478]

It could also be accessed as: 
row.headers[0] == "timestamp"
row.headers[1] == "timestamp"
row.values[0] == 1359210019.299478
row.values[1] == 1359210019.299478

Could still provide: 
for name,value in records.iterfirstitems(): # get the first value for each column with a given name.
  - or - 
for name,value in records.iterlastitems(): # get the last value for each column with a given name.

And the exact functionality you have now: 
records.itervaluemaps() # or something… just a map(dict(records.iterlastitems()))
Overkill, but really simple things to add… 

The only thing this really adds to the "convenience" of the current DictReader for well-behaved tables, is the ability to access values sequentially or by name; other than that, the only difference would be iterating on a generator method's output instead of the instance itself.  


Shane Green

Jan 27, 2013, 9:10:49 AM1/27/13
to python...@python.org
Something as simple as this (straw man) demonstrates what I mean: 

from collections import defaultdict

class Record(defaultdict):
    def __init__(self, headers, fields):
        super(Record, self).__init__(list)
        self.headers = headers
        self.fields = fields
        # An explicit loop: map() is lazy on Python 3 and would never call enter().
        for header, field in zip(self.headers, self.fields):
            self.enter(header, field)
    def valuemap(self, first=False):
        index = 0 if first else -1
        return dict((key, values[index]) for key, values in self.items())
    def enter(self, header, *values):
        if isinstance(header, int):
            header = self.headers[header]
        self[header].extend(values)
    def itemseq(self):
        return zip(self.headers, self.fields)
    def __getitem__(self, spec):
        # Integer and slice lookups are positional; anything else is a header name.
        # (This replaces __getslice__, which only exists on Python 2.)
        if isinstance(spec, (int, slice)):
            return self.fields[spec]
        return super(Record, self).__getitem__(spec)


This would let you access column values using header names, just like before.  Each column's value(s) is now in a list, which contains multiple values whenever a column name is repeated in the header.  
Values can also be accessed sequentially using integer indexes, and valuemap() returns a standard dictionary that conforms to the previous standard exactly: there is a one-to-one mapping between column headers and values, with the last value associated with a given column name being the value. 

While I think the changes should be added without changing what exists for backward compatibility reasons, I've started to think the existing version should also be deprecated, rather than maintained as a special case.  Even when the format is perfect for the existing code, I don't see any big advantages to using it over this approach. 

Keep in mind the example is just a quick straw man: performance is a big difference (and there are plenty of bugs), but that doesn't seem like the right thing to base the decision on, as performance can easily be enhanced later.  

In summary, given headers: A, B, C, D, E, B, G

record.headers == ["A", "B", "C", "D", "E", "B", "G"]
record.fields == [0, 1, 2, 3, 4, 5, 6]

record["A"] == [0]
record["B"] == [1, 5]

# Note sequential access values are not in lists, and the second "B" column's value 5 is in its original position, index 5. 
record[0] == 0
record[1] == 1
record[2] == 2
record[3] == 3
record[4] == 4
record[5] == 5

record.items() == [("A", [0]), ("B", [1, 5]), …]
record.valuemap() == {"A": 0, "B": 5, …} # This returns exactly what DictReader does today, a single value per named column, with the last value being the one used. 


Mark Hackett

Jan 28, 2013, 7:06:39 AM1/28/13
to python...@python.org
On Sunday 27 Jan 2013, Shane Green wrote:
> While I think the changes should be added without changing what exists for
> backward compatibility reasons, I've started to think the existing version
> should also be deprecated, rather than maintained as a special case
>

That sounds effective.

Mark Hackett

Jan 28, 2013, 7:13:45 AM1/28/13
to python...@python.org
On Saturday 26 Jan 2013, Stephen J. Turnbull wrote:
> Shane Green writes:
> > And while it's true that a dictionary is a dictionary and it works
> > the way it works, the real point that drives home is that it's an
> > inappropriate mechanism for dealing with ordered rows of sequential
> > values.
>
> Right! So use csv.reader, or csv.DictReader with an explicit
> fieldnames argument.
>
> The point of csv.DictReader with default fieldnames is to take a
> "well-behaved" table and turn it into a sequence of "poor-man's"
> objects.
>

Well, though there's another example out there of what to do next, I was
thinking of being able to define the csv file format so that you could write it
out correctly too.

And to that end, some form of description of the csv file is needed. I was
thinking something like this:

A,B,C,A,D,E
{(A:2,A:1),B,C,D,E}

which would put columns 4 and 1 in the first entry (under the name A) as a
list, in that order, followed by B, C, D and E all expected to be single
unique names.

This also allows the same definition to be used to write it out.

Blank headers are denoted with:

A,,,,,,B,C

And headers not used in the dictionary (discarded) are handled by not being
put in the "where do we put this" line:
A,B,C,D
{A,D}

When writing out, you cannot have empty headers (since these values get
dropped and the output format spec is now no longer suitable), and you must
assign each header a dictionary entry (else again the dictionary doesn't
contain all the data that was in the input).

To write out these two types of input file, you need to create a new csv format
spec which CAN be written out.

Therefore you will have to deliberately define an output that loses data.

Mark Hackett

Jan 28, 2013, 7:21:19 AM1/28/13
to python...@python.org
On Friday 25 Jan 2013, ru...@yahoo.com wrote:
>
> The csv DictReader *uses* a dictionary for its output. That
> it does so imposes no requirements on how it should parse or
> otherwise handle the input that eventually goes into that
> dict.

And that doesn't mean that writing

dict[A]=1
dict[A]=9

results in dict[A] being a list containing 1 and 9.

A program using a dictionary entry has to know whether the input has duplicate
headers because in the case where only the first line is done, writing out the
value of dict[A] gives you "1". Writing out dict[A] if it's a list gives you
"[1,9]" which must be parsed differently.

Mark Hackett

Jan 28, 2013, 7:21:58 AM1/28/13
to python...@python.org
On Friday 25 Jan 2013, Ethan Furman wrote:
> We're going to have to agree to disagree on this point -- I think there
> is a huge difference between reassigning a variable which is completely
> under your control from losing entire columns of data from a file which
> you may have never seen before.
>

But if you've never seen it before, how do you know that you're going to get a
LIST in one column?

Ethan Furman

Jan 28, 2013, 10:53:44 AM1/28/13
to python...@python.org
On 01/28/2013 04:21 AM, Mark Hackett wrote:
> On Friday 25 Jan 2013, Ethan Furman wrote:
>> We're going to have to agree to disagree on this point -- I think there
>> is a huge difference between reassigning a variable which is completely
>> under your control from losing entire columns of data from a file which
>> you may have never seen before.
>>
>
> But if you've never seen it before, how do you know that you're going to get a
> LIST in one column?

I don't, which is why an exception should be raised.

~Ethan~

Mark Hackett

Jan 28, 2013, 12:13:52 PM1/28/13
to python...@python.org
On Monday 28 Jan 2013, Ethan Furman wrote:
> On 01/28/2013 04:21 AM, Mark Hackett wrote:
> > On Friday 25 Jan 2013, Ethan Furman wrote:
> >> We're going to have to agree to disagree on this point -- I think there
> >> is a huge difference between reassigning a variable which is completely
> >> under your control from losing entire columns of data from a file which
> >> you may have never seen before.
> >
> > But if you've never seen it before, how do you know that you're going to
> > get a LIST in one column?
>
> I don't, which is why an exception should be raised.
>
> ~Ethan~

And there's an argument for that that I've agreed to before.

There's a counter that this will cause programs that used to work to fail.

Whether the pro is higher than the con or the other way round is what I
question.

You, however, seem to believe this is a foregone conclusion.

And that's where I disagree.

MRAB

Jan 28, 2013, 12:26:31 PM1/28/13
to python-ideas
On 2013-01-28 15:53, Ethan Furman wrote:
> On 01/28/2013 04:21 AM, Mark Hackett wrote:
>> On Friday 25 Jan 2013, Ethan Furman wrote:
>>> We're going to have to agree to disagree on this point -- I think there
>>> is a huge difference between reassigning a variable which is completely
>>> under your control from losing entire columns of data from a file which
>>> you may have never seen before.
>>>
>>
>> But if you've never seen it before, how do you know that you're going to get a
>> LIST in one column?
>
> I don't, which is why an exception should be raised.
>
+1

It shouldn't silently drop the columns, nor should it silently merge
the columns into a list. It should complain, unless you state that it
should merge if necessary because, presumably, you're prepared for such
an eventuality.
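
As a rough sketch of that policy (complain by default, merge only on request), assuming a hypothetical helper `read_rows` with a made-up `merge_duplicates` flag built on the real `csv.reader`:

```python
import csv
import io
from collections import Counter


def read_rows(f, merge_duplicates=False):
    """Yield one dict per row; duplicated header names are an error unless merging is requested."""
    reader = csv.reader(f)
    try:
        headers = next(reader)
    except StopIteration:
        return
    duplicated = {name for name, count in Counter(headers).items() if count > 1}
    if duplicated and not merge_duplicates:
        raise ValueError("duplicate header names: %r" % sorted(duplicated))
    for row in reader:
        record = {}
        for name, value in zip(headers, row):
            if name in duplicated:
                # Only names that really are duplicated become lists.
                record.setdefault(name, []).append(value)
            else:
                record[name] = value
        yield record


merged = list(read_rows(io.StringIO("A,B,A\n1,2,3\n"), merge_duplicates=True))
print(merged)
```

Because this is a generator, the ValueError surfaces when iteration starts rather than when `read_rows` is called.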

Mark Hackett

Jan 28, 2013, 12:45:16 PM1/28/13
to python-ideas
On Monday 28 Jan 2013, MRAB wrote:
> It shouldn't silently drop the columns
>

Why not?

It's adding to a dictionary and adding a duplicate key replaces the earlier
one.

If it dropped the columns and shouldn't have, then the results will be seen to
be wrong anyway, so there's not a huge amount of need for this.

If it WANTED to keep both columns with the duplicate names, it won't work and
needs abandoning. So no different from now.

If it WANTED duplicate keys (e.g. blanks which aren't imported and aren't
wanted), then you've just broken it. They can't necessarily change the csv file
to put headers in. So now you've made the call useless for this case.

And why, really, are there duplicate column names in there anyway? You can
come up with the assertion that this might be wanted, but they're not normally
what you see in a csv file.

I've never seen nor used a csv file that duplicated column names other than
being blank.

If it had been such a problem, the call would already have been abandoned.

Bruce Leban

Jan 28, 2013, 7:01:56 PM1/28/13
to python-ideas
The reader could return a multidict. If you know it's a multidict you can access the 'discarded' values. Otherwise, it appears just like the dict that we have today. A middle ground between people that don't want the interface changed and those who want to get the multiple values. Personally, I prefer code that raises exceptions when it gets unreasonable input, and I think duplicate field names qualifies. But if that's not the general sentiment then a multidict is a potential compromise.
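
One way such a multidict might look, as a sketch only. `MultiDict`, `getall` and `multidict_rows` are hypothetical names, not an existing stdlib API.

```python
import csv
import io


class MultiDict(dict):
    """Hypothetical mapping: plain lookups see the last value, getall() sees every value."""

    def __init__(self, pairs):
        super().__init__()
        self._all = {}
        for key, value in pairs:
            self[key] = value  # last value wins, as in DictReader today
            self._all.setdefault(key, []).append(value)

    def getall(self, key):
        # Return a copy so callers cannot modify the stored values.
        return list(self._all.get(key, []))


def multidict_rows(f):
    reader = csv.reader(f)
    headers = next(reader, None)
    if headers is None:
        return
    for row in reader:
        yield MultiDict(zip(headers, row))


rows = list(multidict_rows(io.StringIO("A,B,A\n1,2,3\n")))
print(rows[0]["A"], rows[0].getall("A"))
```

The sketch does not keep `getall()` in sync with later assignments to the dict; a real implementation would need to decide what that should mean.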

Alexandre Zani

Jan 28, 2013, 8:15:22 PM1/28/13
to Bruce Leban, python-ideas
I think raising an exception on duplicate headers is actually very likely to cause working code to break. Consider that all you need for that to happen is an extra couple of empty separators on the first line creating two "" headers. That seems like the sort of behavior that is easy to occur in spreadsheet programs. (Empty cells are usually not very well differentiated from non-existent cells in spreadsheet UIs IME) A StrictDictReader is better, but I think this is overkill.
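
A small illustration of that case, using invented data: two trailing separators on the header line are enough to produce a repeated empty header name.

```python
import csv
import io

# Two trailing separators on each line, as a spreadsheet export might produce.
data = "A,B,,\n1,2,,\n"
reader = csv.DictReader(io.StringIO(data))
fieldnames = reader.fieldnames
rows = list(reader)
print(fieldnames)
print(rows)
```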

As for a MultiDictReader, I don't think this is superior to csv.reader. In both cases, you need to keep track of the column orders. And if you already know the column order, you might as well just manually specify the field names in DictReader.


Shane Green

Jan 29, 2013, 12:24:06 AM1/29/13
to Mark Hackett, python-ideas
Actually I've seen many real-life examples of CSV files with repeated column names, working with log data in the energy management space.  CSV has been around for a very long time, and is used for a lot more than spreadsheets; there are a lot of funky formats out there.  Things like, every "VALUE" column is a 15 minute reading.  It seems like we're getting too hung up on dicts: all the information about a record is precisely stored by two sequences of values: the headers, and the field values.  Those entries and their order can both be useful to a consumer of CSV records, and should be made available.  The record also maps headers to corresponding value sequences for mapped access.  

Stephen J. Turnbull

Jan 29, 2013, 3:17:46 AM1/29/13
to Shane Green, python-ideas
Shane Green writes:

> Actually I've seen many real life examples of CSV files with
> repeated column names,

Sure, but this really isn't the issue. If it were, "csv.reader is
your friend" would be all the answer that the issue deserves IMHO.

> It seems like we're getting too hung up on dicts:

Not at all. (For reasons I don't understand) Somebody has a use case
where it's useful to have the field names stored in each record,
rather than stored once and have both field names and field values
accessed by position as needed. The point is to return a name-value
*mapping object* for *each* row, and that may as well be a dict.

The people who suggest a multidict or a list-valued dict are missing
that point, AFAICS. Eg, in your "BLABLA", "VALUE", ..., "VALUE"
example, position really is what matters, so a dict of any kind is
inappropriate IMO. Again, it's arbitrary whether the list-valued dict
does d["VALUE"].append(x) or d["VALUE"].insert(0,x), and it's hard for
me to guess which it would do in practice: .append is easier to write,
but .insert seems closer to the behavior of csv.reader (which is what
we really want in your example IMO).

Shane Green

Jan 29, 2013, 5:18:21 AM1/29/13
to Stephen J. Turnbull, python-ideas
So I wasn't really questioning the usefulness of the dictionary representation, but couldn't the returned object also let you access the header and value sequences, etc?  I was also thinking the conversion to simple dict with single (non-list) values per column could be part of the API.  

Appending duplicate field values as they're read reflects the order the duplicate entries appear in the source (when I've encountered CSV that purposely used duplicate column headers, the sequence they appear was critical).  The output from the current implementation should reflect the last duplicate value, as that always replaces previous ones in the dict, so my conversions returned the last value (-1), which should do the same…I think.  It was a straw man ;-).

I see your point about the point.  I think it would be good to have an implementation that kept all the information but still put the most usable API possible on it, rather than saying you can't have dictionary access unless you want to lose duplicate values, for example.  I mean, I've needed to consume CSV a lot, and that's what would have made the module useful to me, and the implementation that keeps all the information and lets it easily be trimmed as needed seems better than one that just wipes it out to start.  

Oscar Benjamin

Jan 29, 2013, 6:16:02 AM1/29/13
to Shane Green, python-ideas
On 29 January 2013 10:18, Shane Green <sh...@umbrellacode.com> wrote:
> So I wasn't really questioning the usefulness of the dictionary
> representation, but couldn't the returned object also let you access the
> header and value sequences, etc? I was also thinking the conversion to
> simple dict with single (non-list) values per column could be part of the
> API.
>
> Appending duplicate field values as they're read reflects the order the
> duplicate entries appear in the source (when I've encountered CSV that
> purposely used duplicate column headers, the sequence they appear was
> critical). The output from the current implementation should reflect the
> last duplicate value, as that always replaces previous ones in the dict, so
> my conversions returned the last value (-1), which should do the same…I
> think. It was a straw man ;-).
>
> I see your point about the point. I think it would be good to have an
> implementation that kept all the information but still put the most usable
> API on it possible, rather than saying you can't have dictionary access
> unless you want to lose duplicate values, for example. I mean, I've needed
> to consume CSV a lot, and that's what would have made the module useful to
> me, and the implementation that keeps all the information and lets it easily
> be trimmed as needed seems better than one that just wipes it out to
> start.

This is exactly what the csv.reader objects do.

While it is a problem that csv.DictReader silently discards data when
that is very likely an error, there's no need to try and guess how
people want to deal with duplicate column headers and invent a new
class for it. It's easy enough to write your own wrapper that exactly
performs whatever processing you happen to want:

from collections import defaultdict

def multireader(csvreader):
    try:
        headers = next(csvreader)
    except StopIteration:
        raise ValueError('No header')
    for row in csvreader:
        d = defaultdict(list)
        for h, v in zip(headers, row):
            d[h].append(v)
        yield d


Oscar

Shane Green

Jan 29, 2013, 6:33:05 AM1/29/13
to Oscar Benjamin, python-ideas
Okay, sure.  I guess the starting point of my argument is: DictReader is nice, so why not make one that supports duplicate columns, and then implement the other behaviors in terms of it, whether that's discarding values from duplicate columns so there's a one-to-one mapping, or just raising an exception when a duplicate column is encountered to start with?  It would handle the superset of legal CSV formats that do in fact specify exactly what header name each of their values should be mapped to.  

Mark Hackett

Jan 29, 2013, 6:39:28 AM1/29/13
to python...@python.org
On Tuesday 29 Jan 2013, Alexandre Zani wrote:
>
> As for a MultiDictReader, I don't think this is superior to csv.reader. In
> both cases, you need to keep track of the column orders. And if you already
> know the column order, you might as well just manually specify the field
> names in DictReader.
>

But it would allow you to access the index by name.

value=csv_array[indices["Total Cost"]]

A little more verbose than

value=csv_dict["Total Cost"]

But it's easier to read what it's doing than

value=csv_array[3]
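
A sketch of that name-to-index lookup, extended so that a repeated name maps to every position where it occurs. The data and the variable names `indices`, `total` and `items` are illustrative assumptions.

```python
import csv
import io
from collections import defaultdict

data = "Item,Qty,Item,Total Cost\nwidget,2,gadget,9.50\n"
reader = csv.reader(io.StringIO(data))
headers = next(reader)

# Map each header name to every position where it occurs.
indices = defaultdict(list)
for position, name in enumerate(headers):
    indices[name].append(position)

for csv_array in reader:
    # A unique name has exactly one position; a repeated name has several.
    total = csv_array[indices["Total Cost"][0]]
    items = [csv_array[i] for i in indices["Item"]]
    print(total, items)
```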

Shane Green

Jan 29, 2013, 6:54:09 AM1/29/13
to python-ideas
> And funky CSV formats don't make the current version not work for anyone. It
> works for the people it's been working for all along. Why stop that?

Agreed: I'm actually not for changing the existing stuff. I don't think something that used to return single values should start returning lists, and if it's going to start raising exceptions, I think that should be an option you enable explicitly.  I think maybe this should be deprecated, in favor of something that implements what we're discussing.  I'm also realizing that way of thinking means it's slightly off topic, and apologize for that ;-)

Steven D'Aprano

Jan 29, 2013, 7:26:13 AM1/29/13
to python...@python.org
On 29/01/13 04:45, Mark Hackett wrote:
> On Monday 28 Jan 2013, MRAB wrote:
>> It shouldn't silently drop the columns
>>
>
> Why not?
>
> It's adding to a dictionary and adding a duplicate key replaces the earlier
> one.

Then adding to a dictionary was a mistake.

The choice of a dict is *implementation*, not *interface*. The interface needed
is to return a mapping of column names to values. The nature of that mapping is
an implementation detail, and dict is only the simplest solution, not necessarily
the correct solution.

There is nothing about CSV files that imply that the right behaviour is to drop
columns. The nature of CSV files is to allow duplicate column names, and so CSV
readers should too. That implies that using a dict, which silently drops duplicate
keys, was the wrong choice.

We might argue that using duplicate column names is stupid, but CSV supports it,
and so should CSV readers.


> If it dropped the columns and shouldn't have, then the results will be seen to
> be wrong anyway, so there's not a huge amount of need for this.

You cannot assume that the caller knows that there are duplicated column names.
That's why dropping columns is problematic: it *silently* drops them, giving the
caller no idea that it has happened.

Given that DictReader already exists, and that there probably is someone out
there who is relying on it silently eating columns, I think that the only
reasonable way forward is to add a new reader that supports multiple columns
with the same name. The caller can then use whichever reader suits their
use-case:


* I don't care about duplicate-name columns, just give me some arbitrary one;
- use DictReader

* I want all of the duplicate-name columns;
- use MultiDictReader

* I want some of the duplicate-name columns;
- use MultiDictReader, and then filter the results as you get them


(When I put it like that, DictReader sounds even less useful. But as I said,
I daresay *somebody* is relying on it right now, so we can't change it.)


> And why, really, are there duplicate column names in there anyway? You can
> come up with the assertion that this might be wanted, but they're not normally
> what you see in a csv file.
>
> I've never seen nor used a csv file that duplicated column names other than
> being blank.

Well there you go. That is exactly one such example of duplicate column names.




--
Steven

Mark Hackett

Jan 29, 2013, 7:30:49 AM1/29/13
to python...@python.org
On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
> On 29/01/13 04:45, Mark Hackett wrote:
> > On Monday 28 Jan 2013, MRAB wrote:
> >> It shouldn't silently drop the columns
> >
> > Why not?
> >
> > It's adding to a dictionary and adding a duplicate key replaces the
> > earlier one.
>
> Then adding to a dictionary was a mistake.
>

I agree.

So don't use DictReader in that case.

We have Oscar with the method to do your own (and looked fairly simple and
straightforward).
Chris with carefuldictreader.
Shane with his dual-retention object.

Mark Hackett

Jan 29, 2013, 7:35:01 AM1/29/13
to python...@python.org
On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
> > If it dropped the columns and shouldn't have, then the results will be
> > seen to be wrong anyway, so there's not a huge amount of need for this.
>
> You cannot assume that the caller knows that there are duplicated column
> names
>

You cannot assume they wanted them as a list.

You cannot assume that duplicate replacement is what they want.

If someone is using a csv file with header names they have never read, how are
they going to use the data? They won't even know the name to access the value
in the dictionary! So I discard the claim that the caller may not know the
column names are duplicated. They have to know what the headers are to use
DictReader.

Shane Green

unread,
Jan 29, 2013, 8:08:25 AM1/29/13
to Mark Hackett, python...@python.org
Let's remove the assumptions about their information by retaining all of it, and make an assumption that everyone is capable of dealing with lists. 

Steven D'Aprano

unread,
Jan 29, 2013, 8:28:19 AM1/29/13
to python...@python.org
On 29/01/13 23:35, Mark Hackett wrote:
> On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
>>> If it dropped the columns and shouldn't have, then the results will be
>>> seen to be wrong anyway, so there's not a huge amount of need for this.
>>
>> You cannot assume that the caller knows that there are duplicated column
>> names
>>
>
> You cannot assume they wanted them as a list.

I don't need to assume that. They can take the list and post-process it into
any data type they want.

A list is a natural fit for associating multiple values to a single key,
because it doesn't lose data: it is variable-sized, so it can handle "no
values" or "1000 values" equally easily; it is ordered, and it is iterable.
If the caller wants something else, they can convert it.
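A minimal sketch of that idea, assuming a hypothetical generator named multidict_rows (nothing by that name exists in the csv module; it only stands in for the MultiDictReader discussed in this thread). Each header maps to a list of the values found under it, so duplicated column names lose nothing:

```python
import csv
import io


def multidict_rows(f, **kwargs):
    """Yield one dict per data row, mapping each header to a list of values."""
    rows = csv.reader(f, **kwargs)
    try:
        headers = next(rows)  # the first row is taken as the header row
    except StopIteration:
        return  # empty input: nothing to yield
    for row in rows:
        record = {}
        # zip() stops at the shorter sequence, so ragged rows are truncated
        # here rather than reported; a stricter reader might raise instead.
        for header, value in zip(headers, row):
            record.setdefault(header, []).append(value)
        yield record


# Column "A" appears twice in this sample; both values are kept.
sample = io.StringIO("A,B,A\n1,2,3\n")
records = list(multidict_rows(sample))
```

A caller who only wants one value per name can still take the first or last element of each list afterwards.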

> You cannot assume that duplicate replacement is what they want.

I don't think I ever suggested that it was.


> If someone is using a csv file with header names they have never read, how are
> they going to use the data?

reader = csv.DictReader(whatever)
for mapping in reader:
    for key, value in mapping.items():
        process(key, value)


Or perhaps you only care about one column, and don't care about the other, unknown,
columns:

for mapping in reader:
    value = mapping.get('spam', 'some default')
    process(value)



> They won't even know the name to access the value in the dictionary!

Dealing with arbitrary field names in data you read from a file is not hard.



--
Steven

Mark Hackett

unread,
Jan 29, 2013, 8:44:35 AM1/29/13
to python...@python.org
On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
> On 29/01/13 23:35, Mark Hackett wrote:
> > On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
> >>> If it dropped the columns and shouldn't have, then the results will be
> >>> seen to be wrong anyway, so there's not a huge amount of need for this.
> >>
> >> You cannot assume that the caller knows that there are duplicated column
> >> names
> >
> > You cannot assume they wanted them as a list.
>
> I don't need to assume that. They can take the list and post-process it
> into any data type they want.

Yes you ARE assuming it. You want them to post process it. But if they don't
know there are duplicates there and have found their script works for their
needs and therefore never looked, they will now get the wrong answer.

As Oscar says, they could process the csv file themselves by hand and code in
EXACTLY what they want. They don't have to put it in a dictionary then.

And you've already said

> Then adding to a dictionary was a mistake.

So they shouldn't be using DictReader.

Shane Green

unread,
Jan 29, 2013, 8:45:25 AM1/29/13
to python-ideas
On Jan 29, 2013, at 5:10 AM, Mark Hackett <mark.h...@metoffice.gov.uk> wrote:

> On Tuesday 29 Jan 2013, you wrote:
>> Let's remove the assumptions about their information by retaining all of
>> it, and make an assumption that everyone is capable of dealing with lists.
>>
>
> Then lets not use a dictionary. And leave the DictReader alone.
>

Yes, I think a more useful CSV construct would map header names to lists of values, provide access to the original header and value sequences, and offer methods for iterating over sequential (header, value) items (with possibly repeating header values, and which could be fed to dict() to produce exactly what DictReader produces). As such, it would not be a DictReader, because it would produce something that merely extends the dictionary API. I would think something like CSVRecord, or just Record, would be more accurate.

Shane Green

unread,
Jan 29, 2013, 9:09:12 AM1/29/13
to Mark Hackett, python...@python.org
I'm not sure this is constructive.

I think it's safe to assume changing something in an API that used to return single values, into something that now returns lists of those values, will be a problem for folks.  

I also think it's safe to assume folks can design their applications for an API that returns lists of values.  In support of this assumption, I will point out that's precisely what CGI's FieldStorage does to represent all HTML form values because some form values (radio buttons, checkboxes, etc.), can have more than one value associated with their name on submission.
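The same name-to-list convention can be seen with a small standard-library example. Note that urllib.parse.parse_qs is swapped in here for cgi.FieldStorage purely for illustration; it parses a query string rather than a full form submission, but it also returns every value for a name as a list:

```python
from urllib.parse import parse_qs

# A form-style submission in which the name "colour" appears twice.
fields = parse_qs("colour=red&colour=blue&size=L")

colours = fields["colour"]  # every value for the name, in submission order
sizes = fields["size"]      # still a list, even with a single value
```

Callers always index into a list, so code written for the single-value case keeps working when a second value shows up.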

Finally, I would assert that the more legally formatted content your content reader accurately reads and handles, the better.



Chris Angelico

unread,
Jan 29, 2013, 9:55:09 AM1/29/13
to python-ideas
On Wed, Jan 30, 2013 at 1:09 AM, Shane Green <sh...@umbrellacode.com> wrote:
> I think it's safe to assume changing something in an API that used to return
> single values, into something that now returns lists of those values, will
> be a problem for folks.
>
> I also think it's safe to assume folks can design their applications for an
> API that returns lists of values.

Agreed on both points. A new API that returns lists of everything
would be a lot safer than fiddling with the current one.

ChrisA

Mark Lawrence

unread,
Jan 29, 2013, 12:38:50 PM1/29/13
to python...@python.org
On 29/01/2013 12:30, Mark Hackett wrote:
> On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
>> On 29/01/13 04:45, Mark Hackett wrote:
>>> On Monday 28 Jan 2013, MRAB wrote:
>>>> It shouldn't silently drop the columns
>>>
>>> Why not?
>>>
>>> It's adding to a dictionary and adding a duplicate key replaces the
>>> earlier one.
>>
>> Then adding to a dictionary was a mistake.
>>
>
> I agree.
>
> So don't use DictReader in that case.
>
> We have Oscar with the method to do your own (and looked fairly simple and
> straightforward).
> Chris with carefuldictreader.
> Shane with his dual-retention object.
>

Please can we also have a
RemoveTheNullByteThatsPutAtheEndOfTheFileByBrainDeadMicrosoftMoney? :)

--
Cheers.

Mark Lawrence

Eric V. Smith

unread,
Jan 29, 2013, 1:49:01 PM1/29/13
to python...@python.org
On 01/29/2013 07:35 AM, Mark Hackett wrote:
> On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
>>> If it dropped the columns and shouldn't have, then the results will be
>>> seen to be wrong anyway, so there's not a huge amount of need for this.
>>
>> You cannot assume that the caller knows that there are duplicated column
>> names
>>
>
> You cannot assume they wanted them as a list.
>
> You cannot assume that duplicate replacement is what they want.
>
> If someone is using a csv file with header names they have never read, how are
> they going to use the data? They won't even know the name to access the value
> in the dictionary! So I discard the claim that the caller may not know the
> column names are duplicated. They have to know what the headers are to use
> DictReader.

Not true: I process some csv files just to translate them into another
format, say tab delimited. I don't care about the column names, but
dropping columns would sure bother me. I don't think any of the files
I've processed have duplicate columns, but I wouldn't swear to it. And
if they did, that would be an error I'd like to know about.
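One way to get that notification today, sketched under the assumption that a small helper is acceptable (duplicate_headers is a hypothetical name, not part of the csv module). It relies only on DictReader.fieldnames, which holds the header row as read from the file, duplicates included:

```python
import csv
import io
from collections import Counter


def duplicate_headers(fieldnames):
    """Return the sorted header names that occur more than once."""
    counts = Counter(fieldnames)
    return sorted(name for name, n in counts.items() if n > 1)


# The "id" column is duplicated in this sample file.
reader = csv.DictReader(io.StringIO("id,name,id\n1,spam,2\n"))
dups = duplicate_headers(reader.fieldnames)
if dups:
    print("duplicate column names:", dups)
```

Whether to raise, warn, or carry on when the list is non-empty is left to the caller.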

Eric.

Stephen J. Turnbull

unread,
Jan 29, 2013, 2:19:30 PM1/29/13
to Eric V. Smith, python...@python.org
Eric V. Smith writes:

> Not true: I process some csv files just to translate them into another
> format, say tab delimited. I don't care about the column names,

Then you'd be nuts to use csv.DictReader! csv.reader does exactly
what you want.

DictReader is about transforming a data format from a sequence of rows
of values accessed by position, one of which might be a header, to a
headerless sequence of objects with values accessed by name. If your
use case doesn't involve access by name, it is irrelevant.
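For the translation case, a sketch using only csv.reader and csv.writer might look like the following; csv_to_tsv is a hypothetical helper name. Rows are copied by position, so duplicated or unknown column names do not matter:

```python
import csv
import io


def csv_to_tsv(src, dst):
    """Copy every row from a CSV file object to a tab-delimited one."""
    writer = csv.writer(dst, delimiter="\t")
    for row in csv.reader(src):
        writer.writerow(row)


# The duplicated "A" column survives because nothing is keyed by name.
src = io.StringIO("A,B,A\n1,2,3\n")
dst = io.StringIO()
csv_to_tsv(src, dst)
```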

Eric V. Smith

unread,
Jan 29, 2013, 2:21:58 PM1/29/13
to Stephen J. Turnbull, python...@python.org
On 01/29/2013 02:19 PM, Stephen J. Turnbull wrote:
> Eric V. Smith writes:
>
> > Not true: I process some csv files just to translate them into another
> > format, say tab delimited. I don't care about the column names,
>
> Then you'd be nuts to use csv.DictReader! csv.reader does exactly
> what you want.
>
> DictReader is about transforming a data format from a sequence of rows
> of values accessed by position, one of which might be a header, to a
> headerless sequence of objects with values accessed by name. If your
> use case doesn't involve access by name, it is irrelevant.

True. But my point stands: it's possible to read the data (even with a
DictReader), do something with the data, and not know the column names
in advance. It's not an impossible use case.

Eric.

Stephen J. Turnbull

unread,
Jan 29, 2013, 3:37:38 PM1/29/13
to Eric V. Smith, python...@python.org
Eric V. Smith writes:

> True. But my point stands: it's possible to read the data (even with a
> DictReader), do something with the data, and not know the column names
> in advance. It's not an impossible use case.

But it is. Dicts don't guarantee iteration order, so you will most
likely get an output file that not only has a different delimiter, but
a different order of fields.
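For what it is worth, the column order is still recoverable when DictReader is used, because the reader keeps the header row in reader.fieldnames. The sketch below (an illustration, not something proposed in this thread) passes that list to csv.DictWriter, which writes fields in the order given rather than in dict iteration order. It does nothing for duplicated names, which are still collapsed by the dict:

```python
import csv
import io

src = io.StringIO("b,a\n2,1\n")
dst = io.StringIO()

reader = csv.DictReader(src)
# reader.fieldnames is the header row as a list, in file order.
writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, delimiter="\t")
writer.writeheader()
for row in reader:
    writer.writerow(row)
```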

The right use case here is duck-typing. Something like "I have a
bunch of tables of data about car models from different manufacturers
which have different sets of columns, and I know that all of them have
a column labeled 'MSRP', but which column might vary across tables."

Of course, I don't actually believe you'd get that lucky.

Eric V. Smith

unread,
Jan 29, 2013, 3:59:42 PM1/29/13
to Stephen J. Turnbull, python...@python.org
On 1/29/2013 3:37 PM, Stephen J. Turnbull wrote:
> Eric V. Smith writes:
>
> > True. But my point stands: it's possible to read the data (even with a
> > DictReader), do something with the data, and not know the column names
> > in advance. It's not an impossible use case.
>
> But it is. Dicts don't guarantee iteration order, so you will most
> likely get an output file that not only has a different delimiter, but
> a different order of fields.

We're going to have to agree to disagree. Order is not always important.

--
Eric.

Mark Hackett

unread,
Jan 30, 2013, 5:32:54 AM1/30/13
to python...@python.org
On Tuesday 29 Jan 2013, Eric V. Smith wrote:
> On 1/29/2013 3:37 PM, Stephen J. Turnbull wrote:
> > Eric V. Smith writes:
> > > True. But my point stands: it's possible to read the data (even with a
> > > DictReader), do something with the data, and not know the column names
> > > in advance. It's not an impossible use case.
> >
> > But it is. Dicts don't guarantee iteration order, so you will most
> > likely get an output file that not only has a different delimiter, but
> > a different order of fields.
>
> We're going to have to agree to disagree. Order is not always important.
>

It's not impossible that we're living in a simulated world.

If you don't know what's in the csv file at all, then how do you know what
you're supposed to do with it?

Reading into a list will ensure order, so that is usable if order is
important. If the names aren't important at all, then you should drop the first
line and read it into a list again. If the names are important, you'd better
know what names the headers are using.

Steven D'Aprano

unread,
Jan 30, 2013, 7:09:20 AM1/30/13
to python...@python.org
On 30/01/13 21:32, Mark Hackett wrote:

> If you don't know what's in the csv file at all, then how do you know what
> you're supposed to do with it.

Maybe you're processing the file without caring what the column names are,
but you still need to map column name to column contents. This is no more
unusual than processing a dict where you don't know the keys: you just iterate
over them.

Or maybe you're scanning the file for one specific column name, and you don't
care what the other names are.

Or, most likely, you know what you are *expecting* in the CSV file, but because
data files don't always contain what you expect, you want to be notified if
there is something unexpected rather than just have it silently do the wrong
thing.
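A sketch of that kind of check, assuming a hypothetical helper called check_headers that compares the header row actually read against the names the caller expects, and complains loudly instead of carrying on:

```python
def check_headers(actual, expected):
    """Raise ValueError if the header row differs from what the caller expects."""
    missing = [name for name in expected if name not in actual]
    unexpected = [name for name in actual if name not in expected]
    if missing or unexpected:
        raise ValueError(
            "bad header row: missing=%r unexpected=%r" % (missing, unexpected)
        )


# Passes silently: the file contains exactly the expected columns.
check_headers(["spam", "eggs"], ["spam", "eggs"])
```

With a DictReader, the first argument would typically be reader.fieldnames.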



--
Steven

Mark Hackett

unread,
Jan 30, 2013, 7:14:09 AM1/30/13
to python...@python.org
On Wednesday 30 Jan 2013, Steven D'Aprano wrote:
> On 30/01/13 21:32, Mark Hackett wrote:
> > If you don't know what's in the csv file at all, then how do you know
> > what you're supposed to do with it.
>
> Maybe you're processing the file without caring what the column names are,

If you don't care, then you shouldn't be using a dictionary, because you have
to know the names to say which one you want.

> but you still need to map column name to column contents.

Why? You said this hypothetical reckless person doesn't care.

> This is no more
> unusual than processing a dict where you don't know the keys: you just
> iterate over them.
>

Which is only used for printing the info out.

There's a much easier way to do that:

"cat file.csv"

> Or maybe you're scanning the file for one specific column name, and you
> don't care what the other names are.
>

Then you'll know if it's duplicated or not.

> Or, most likely, you know what you are *expecting* in the CSV file, but
> because data files don't always contain what you expect, you want to be
> notified if there is something unexpected rather than just have it
> silently do the wrong thing.
>

There's a way to do that:

"head -n1 file.csv".

You know, have a look.

Shane Green

unread,
Jan 30, 2013, 7:24:53 AM1/30/13
to python-ideas
So I've done some thinking on it, a bit of research, etc., and have worked with a lot of different CSV content.  There are a lot of parallels between the name/value pairs of an HTML form submission, and our use case.  

Namely:
- There's typically only one value per name, but it's perfectly legal to have multiple values assigned to a name.
- When there are multiple values assigned to a name, order can be very important.
- They made the mistake of mapping field names to singular values when there was only one value, and to multiple values when there were multiple values.
- Each of these has been deprecated, and their FieldStorage now always maps field names to lists of values.

I've implemented a Record class I'm going to pitch for feedback.  Although I followed the FieldStorage API for a couple of methods, it didn't translate very well because their values are complex objects.  This Record class is a dictionary type that maps header names to lists of the values from columns labeled by that same header.  Most lists hold a single value because headers usually aren't duplicated.  When a list holds multiple values, they are listed in the order they were read from the CSV file.  The API provides convenience methods for getting the first or last value listed for a given column name, making it very easy to work with singular values when desired.  The dictionary API will likely be the primary mechanism for interacting with it; however, it knows the header and row sequences it was built from, and provides sequential access to them as well.  In addition to working with non-standard CSV, performing transformations, etc., this information makes it possible to reproduce correctly ordered CSV.

I don't really know yet whether it would make sense to support any kind of manipulation of values on the record instances themselves, versus using more of a copy()/update() approach to defining modified records or something, but I did decide to wrap the row values in a tuple, making them read-only.  This was for two reasons.  One was to address a potential inconsistency that might arise should we decide to support editing, and the other is that the record is the representation of the row read from the source file, and so it should always accurately reflect that content.

About the code: I wrote it tonight and tested it for an hour, so it's not meant to be perfect or final, but it should stir up a very concrete discussion about the API, if nothing else ;-)  I included a generator that seemed to work on some test files.  It most definitely is not meant to be critiqued or to be a distraction, but I've included it in case anyone ends up wanting to investigate things further.  Although the iterator function provides a slightly different signature than DictReader, that's not because I'm trying to change anything; please keep in mind the generator was just a test.  Also, I'd like to mention one last time that I don't think we should change what exists to reflect any of these changes: I was thinking it would be a new set of classes and functions that would become the preferred implementation in the future.




class Record(dict):
    def __init__(self, headers, fields):
        if len(headers) != len(fields):
            # I don't make decisions about how gaps should be filled.
            raise ValueError("header/field size mismatch")
        self._headers = headers
        self._fields = tuple(fields)
        super(Record, self).__init__()
        # Group values by header; duplicate headers accumulate in file order.
        for h, v in self.fielditems():
            self.setdefault(h, []).append(v)
    def fielditems(self):
        """
            Get header,value sequence that reflects CSV source.  
        """
        return zip(self.headers(),self.fields())
    def headers(self):
        """
            Get ordered sequence of headers reflecting CSV source. 
        """
        return self._headers
    def fields(self):
        """
            Get ordered sequence of values reflecting CSV row source. 
        """
        return self._fields
    def getfirst(self, name, default=None):
        """
            Get value of first field associated with header named  
            'name'; return 'default' if no such value exists. 
        """
        return self[name][0] if name in self else default
    def getlast(self, name, default=None):
        """
            Get value of last field associated with header named  
            'name'; return 'default' if no such value exists. 
        """
        return self[name][-1] if name in self else default
    def getlist(self, name): 
        """
            Get values of all fields associated with header named 'name'.
        """
        return self.get(name, [])
    def pretty(self, header=True):
        lines = []
        if header:
            lines.append(
                ["%s".ljust(10).rjust(20) % h for h in self.headers()])
        lines.append(
            ["%s".ljust(10).rjust(20) % v for v in self.fields()])
        return "\n\n".join(["|".join(line).strip() for line in lines])
    def __getslice__(self, start=0, stop=None):
        return self.fields()[start: stop]


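from csv import reader  # csv.reader; imported so this also runs outside the csv module

Undefined = object()
def iterrecords(f, headers=None, bucketheader=Undefined, 
    missingfieldsok=False, dialect="excel", *args, **kw):
    rows = reader(f, dialect, *args, **kw)
    for row in rows:
        if not row:
            # Skip blank lines.
            continue
        if not headers:
            # The first non-blank row supplies the headers.
            headers = row
            continue
        headcount = len(headers)
        rowcount = len(row)
        rowheaders = headers
        if rowcount < headcount:
            if not missingfieldsok:
                raise KeyError("row has fewer values than headers")
            # Assumption: label only the fields that are present, so that
            # Record's header/field size check passes.
            rowheaders = list(headers)[:rowcount]
        elif rowcount > headcount: 
            if bucketheader is Undefined:
                raise KeyError("row has more values than headers")
            # Build a new list so the shared headers are not mutated.
            rowheaders = list(headers) + [bucketheader] * (rowcount - headcount)
        record = Record(rowheaders, row)
        yield record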


# That's run within the context of the "csv" module to work… maybe.  

Shane Green

unread,
Jan 30, 2013, 7:59:17 AM1/30/13
to python-ideas
I should probably also have noted the dictionary API behaviour, since it's not explicit: 
keys() -> list of unique header names.
values() -> list of field values lists.
items() -> [(header, field-list),] pairs.

And then of course dictionary lookup.  One thing that comes to mind is that there's really no value to the unordered sequence of value lists; there could be some value in extending an OrderedDict, making all the iteration methods consistent and therefore something that could be used to do something like write values, etc….
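A rough sketch of the OrderedDict variant, assuming a hypothetical function named ordered_record rather than a full Record subclass. Headers are grouped as before, but keys(), values() and items() now all iterate in first-seen header order:

```python
from collections import OrderedDict


def ordered_record(headers, fields):
    """Group field values by header, keeping headers in first-seen order."""
    record = OrderedDict()
    for header, value in zip(headers, fields):
        record.setdefault(header, []).append(value)
    return record


# "A" is duplicated; its values are collected in the order they were read.
record = ordered_record(["A", "B", "A"], ["1", "2", "3"])
```

This keeps the grouped view consistent with the header order, although the exact interleaving of duplicated columns still has to come from the original header and field sequences.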




Jeff Jenkins

unread,
Jan 30, 2013, 9:04:47 AM1/30/13
to Shane Green, python-ideas
I think this may have been lost somewhere in the last 90 messages, but adding a warning to DictReader in the docs seems like it solves almost the entire problem.  New csv.DictReader users are informed, no one's old code breaks, and a separate discussion can be had about whether it's worth adding a csv.MultiDictReader which uses lists.


Shane Green

unread,
Jan 30, 2013, 9:44:26 AM1/30/13
to Jeff Jenkins, python-ideas


"""Also, I'd like to mention one last time that I don't think we should change what exists to reflect any of these changes: I was thinking it would be a new set of classes and functions that would become the preferred implementation in the future."""


This is kind of that new discussion.  I agree…

Mark Hackett

unread,
Jan 30, 2013, 10:16:37 AM1/30/13
to python...@python.org
On Wednesday 30 Jan 2013, Jeff Jenkins wrote:
> I think this may have been lost somewhere in the last 90 messages, but
> adding a warning to DictReader in the docs seems like it solves almost the
> entire problem.

Jeff, it breaks code that works now because duplicates aren't cared about.

Shane is putting code up for a NEW call that you can use if you're worried
about how the current one works and consideration for this issue is being
included in the derivation of a new library for the next (and therefore
allowed to be incompatible) python library version.