[Python-ideas] csv.DictReader could handle headers more intelligently.

Nick Coghlan

Jan 24, 2013, 7:33:07 AM

to Shane Green, python...@python.org

On Thu, Jan 24, 2013 at 9:55 PM, Shane Green <sh...@umbrellacode.com> wrote:
> Not sure if I'm reading the discussion correctly, but it sounds like there's
> discussion about whether swallowing CSV values when confronted with multiple
> columns by the same name, which seems very incorrect if so. CSV doesn't
> even mandate column headers exist at all, as far as I know. If anything I
> would think mapping column positions to header values would make sense, such
> that header.items() -> [(0, header1), (1, header2), (2, header3), etc.], and
> header1 and header2 could be equal. To work with rows as dictionaries they
> can follow the FieldStorage model and have lists of values–either when
> there's a collision, or always–so all column values are contained.

That's not quite the discussion. The discussion is specifically about
*DictReader*, and whether it should:

1. Do any data conditioning by ignoring empty lines and lines of just
field delimiters before the header row (consensus seems to be "no")
2. Give an error when encountering a duplicate field name (which will
lead to data loss when reading from the file) (consensus seems to be
"yes")

The problem with the latter suggestion is that it's a backwards
incompatible change - code where "use the last column with that name"
is the correct behaviour currently works, but would be broken if that
situation was declared an error.
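
For readers who have not seen the "last column wins" behaviour, here is a minimal sketch of what DictReader currently does with a repeated header. The sample data is made up for illustration.

```python
import csv
from io import StringIO

# Hypothetical data: two columns share the name "value".
data = StringIO("id,value,value\n1,old,new\n")
reader = csv.DictReader(data)
row = next(reader)

print(reader.fieldnames)  # ['id', 'value', 'value']
print(row["value"])       # new
print(len(row))           # 2
```

The header list keeps both names, but the row dict only has room for one of them, so "old" is dropped without any warning.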

Rather than messing with DictReader, it seems more fruitful to further
investigate the idea of a namedtuple based reader
(http://bugs.python.org/issue1818). The "multiple columns with the
same name" use case seems specialised enough that the standard readers
can continue to ignore it (although, as noted earlier in this thread,
a namedtuple based reader will correctly reject duplicate column
names)
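
A rough sketch of what such a namedtuple based reader might look like follows. This is not the code attached to the issue; the helper name `namedtuple_reader` is hypothetical, and it assumes the headers are valid Python identifiers and that every row has as many fields as the header.

```python
import csv
from collections import namedtuple
from io import StringIO

def namedtuple_reader(lines, typename="Row", **fmtparams):
    """Yield one namedtuple per non-empty CSV row (hypothetical helper)."""
    reader = csv.reader(lines, **fmtparams)
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    # namedtuple() raises ValueError for duplicate or invalid field names.
    row_type = namedtuple(typename, header)
    for fields in reader:
        if fields:  # skip completely empty lines
            yield row_type(*fields)

rows = list(namedtuple_reader(StringIO("name,qty\nspam,3\n")))
print(rows[0].name, rows[0].qty)  # spam 3

try:
    list(namedtuple_reader(StringIO("a,b,a\n1,2,3\n")))
except ValueError as exc:
    print("rejected:", exc)
```

The duplicate check comes for free from `collections.namedtuple`, which is why no explicit validation appears in the sketch.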

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Antoine Pitrou

Jan 24, 2013, 7:38:58 AM

to python...@python.org

On Thu, 24 Jan 2013 22:33:07 +1000,
Nick Coghlan <ncog...@gmail.com> wrote:

> On Thu, Jan 24, 2013 at 9:55 PM, Shane Green
> <sh...@umbrellacode.com> wrote:
> > Not sure if I'm reading the discussion correctly, but it sounds
> > like there's discussion about whether swallowing CSV values when
> > confronted with multiple columns by the same name, which seems very
> > incorrect if so. CSV doesn't even mandate column headers exist at
> > all, as far as I know. If anything I would think mapping column
> > positions to header values would make sense, such that
> > header.items() -> [(0, header1), (1, header2), (2, header3), etc.],
> > and header1 and header2 could be equal. To work with rows as
> > dictionaries they can follow the FieldStorage model and have lists
> > of values–either when there's a collision, or always–so all column
> > values are contained.
>
> That's not quite the discussion. The discussion is specifically about
> *DictReader*, and whether it should:
>
> 1. Do any data conditioning by ignoring empty lines and lines of just
> field delimiters before the header row (consensus seems to be "no")
> 2. Give an error when encountering a duplicate field name (which will
> lead to data loss when reading from the file) (consensus seems to be
> "yes")
>
> The problem with the latter suggestion is that it's a backwards
> incompatible change - code where "use the last column with that name"
> is the correct behaviour currently works, but would be broken if that
> situation was declared an error.

It's not really a problem if the new behaviour is conditioned by a
constructor argument.
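
One way this could look, sketched as a subclass so it can be tried today. The class name `StrictDictReader` and the keyword `unique_fieldnames` are made up for illustration and are not part of the csv module.

```python
import csv
from io import StringIO

class StrictDictReader(csv.DictReader):
    """Hypothetical subclass: duplicate checking is opt-in via a keyword."""

    def __init__(self, f, *args, unique_fieldnames=False, **kwargs):
        super().__init__(f, *args, **kwargs)
        self.unique_fieldnames = unique_fieldnames
        self._names_checked = False

    def __next__(self):
        if self.unique_fieldnames and not self._names_checked:
            names = self.fieldnames or []
            duplicates = sorted({n for n in names if names.count(n) > 1})
            if duplicates:
                # fieldnames is already set here, so a caller can catch
                # this, repair reader.fieldnames, and carry on reading.
                raise ValueError(f"duplicate field names: {duplicates!r}")
            self._names_checked = True
        return super().__next__()

text = "a,b,a\n1,2,3\n"
print(next(StrictDictReader(StringIO(text)))["a"])  # 3

try:
    next(StrictDictReader(StringIO(text), unique_fieldnames=True))
except ValueError as exc:
    print(exc)  # duplicate field names: ['a']
```

With the default of `False`, existing code sees no change, which is the point of conditioning the behaviour on an argument.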

Regards

Antoine.

J. Cliff Dyer

Jan 24, 2013, 10:11:34 AM

to Antoine Pitrou, python...@python.org

On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
> > 1. Do any data conditioning by ignoring empty lines and lines of
> > just field delimiters before the header row (consensus seems to be
> > "no")

Well, I wouldn't necessarily say we have a consensus on this one. This
idea received a +1 from Bruce Leban and an "I don't see any reason not
to" from Steven D'Aprano.

Objections are:

1. It's a backwards-incompatible change. (This could be mitigated in a
couple ways, as with the duplicate header problem, below). I don't think
anyone has argued that programmers would ever actually want to use the
blank line as the headers, only that they may be doing it now as a
workaround, and breaking the workarounds is undesirable.

2. You should pre-process the CSV instead of adapting the reader to
malformations. (In which case, I think the DictReader.reader attribute
should be better documented, so programmers have some guidance how to do
the pre-processing, as the current DictReader can cause data loss which
would make it difficult to recover the real headers without using the
underlying reader).
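
As a sketch of the kind of guidance the second objection asks for, the pre-processing can be done through the `reader` attribute and the settable `fieldnames` property. The sample data is made up, and the choice to skip only completely empty lines is an assumption.

```python
import csv
from io import StringIO

data = StringIO("\n\nspam,ham,eggs\n1,2,3\n")
records = csv.DictReader(data)

# records.reader is the plain csv.reader that DictReader wraps.
# Read from it directly until a non-empty row turns up, then use
# that row as the header.
for row in records.reader:
    if row:  # csv.reader yields [] for an empty line
        records.fieldnames = row
        break

print(records.fieldnames)  # ['spam', 'ham', 'eggs']
rows = list(records)
print(rows[0]["spam"])     # 1
```

This has to happen before `fieldnames` is first read or the first row is fetched, because that is when DictReader consumes the header line.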

> > 2. Give an error when encountering a duplicate field name (which
> > will lead to data loss when reading from the file) (consensus seems
> > to be "yes")

Mostly, but with a strong objection from Mark Hackett, and hesitation
about altering current behavior from Amaury Forgeot d'Arc.

Proposals to solve this problem:

1. Raise an exception (After setting the fieldnames, I think, so if you
wanted to catch and continue or catch and edit the conflicting
fieldnames, you could do so).

2. Combine multiple fields with the same header into a list under the
same key.

2a. Make lists when there are multiple fields, but otherwise, key to
strings as is currently done

2b. For consistency, make all values lists, regardless of the number of
columns.
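
For concreteness, variant 2b could be sketched as a small generator over csv.reader. The name `multi_dict_rows` is hypothetical, and the sketch ignores DictReader's restkey/restval handling: extra or missing fields are simply dropped by `zip`.

```python
import csv
from io import StringIO

def multi_dict_rows(lines, **fmtparams):
    """Yield dicts mapping each header to a list of values (variant 2b)."""
    reader = csv.reader(lines, **fmtparams)
    header = next(reader, None)
    if header is None:  # empty input
        return
    for row in reader:
        if not row:  # skip empty lines, as DictReader does
            continue
        record = {}
        for name, value in zip(header, row):
            # Every value goes into a list, so repeated names keep all values.
            record.setdefault(name, []).append(value)
        yield record

rows = list(multi_dict_rows(StringIO("a,b,a\n1,2,3\n")))
print(rows[0]["a"])  # ['1', '3']
print(rows[0]["b"])  # ['2']
```

Variant 2a would differ only in unwrapping single-element lists, at the cost of callers having to check the type of each value.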

Proposals for implementation:

1. Create a new Reader class. Suggestions include
"CarefulDictReader" (for the version that raises an exception) and
"MultiDictReader" (for the versions that make lists of values).

2. Add an option to DictReader. The idea to add an option for a
MultiDictReader-like behavior was objected to, but there were multiple
suggestions to add an option for raising an exception, in one case with
the idea that in the future ("Python 4") the option would be standard
behavior.

Note: If we were to implement a CarefulDictReader, it could, without
backward incompatibility, implement both skipping of blank header lines,
and exception raising on duplicate headers.

Cheers,
Cliff

J. Cliff Dyer

Jan 24, 2013, 10:23:24 AM

to Nick Coghlan, python...@python.org

On Thu, 2013-01-24 at 22:33 +1000, Nick Coghlan wrote:
> The problem with the latter suggestion is that it's a backwards
> incompatible change - code where "use the last column with that name"
> is the correct behaviour currently works, but would be broken if that
> situation was declared an error.

One example where a programmer would legitimately want to ignore errors
of this kind: A CSV file has a number of named columns, and a few
unnamed ones, and the programmer doesn't care about data from the
unnamed columns. The unnamed columns all have the same name (''), and
would raise this exception. Hence the need to be able to suppress it
somehow (e.g., by instantiation argument or by catching the exception)
without losing the fieldnames.
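
A sketch of a duplicate check that tolerates this case: names listed in `ignore` (by default only the empty string) are allowed to repeat. The function name and its `ignore` parameter are hypothetical.

```python
from collections import Counter

def duplicate_fieldnames(fieldnames, ignore=("",)):
    """Return the sorted names that occur more than once, skipping `ignore`."""
    counts = Counter(name for name in fieldnames if name not in ignore)
    return sorted(name for name, n in counts.items() if n > 1)

print(duplicate_fieldnames(["id", "", "name", "", ""]))  # []
print(duplicate_fieldnames(["id", "name", "id", ""]))    # ['id']
print(duplicate_fieldnames(["", ""], ignore=()))         # ['']
```

Because the check only inspects a list of names, it can be run on `reader.fieldnames` without disturbing the reader.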

Cheers,
Cliff

Chris Angelico

Jan 24, 2013, 10:24:23 AM

to python-ideas

On Fri, Jan 25, 2013 at 2:11 AM, J. Cliff Dyer <j...@sdf.lonestar.org> wrote:
> On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
>> > 1. Do any data conditioning by ignoring empty lines and lines of
>> > just field delimiters before the header row (consensus seems to be
>> > "no")
>
> Well, I wouldn't necessarily say we have a consensus on this one. This
> idea received a +1 from Bruce Leban and an "I don't see any reason not
> to" from Steven D'Aprano.

I've been lurking this thread, but fwiw, I'd put +1 on ignoring empty
lines/just delimiter lines. For a row of column headers, a completely
blank line makes no sense. It's a backward-incompatible change, yes,
but I can't imagine any code actively relying on this. ISTM this would
probably be safe for a minor release (Python 3.4), though of course
not for Python 3.3.1.

ChrisA

Shane Green

Jan 24, 2013, 10:28:49 AM

to J. Cliff Dyer, Antoine Pitrou, python...@python.org

Since every form of CSV file counts EOL as a line terminator, I think discarding empty lines preceding the headers is arguably acceptable, but do not think discarding lines of just delimiters would be. What about extending the DictReader API so it was easy to perform these actions explicitly, such as being able to discard() the field names to be re-evaluated on the next line?

Shane Green

Mark Hackett

Jan 24, 2013, 10:29:19 AM

to python...@python.org

On Thursday 24 Jan 2013, J. Cliff Dyer wrote:
> > > 2. Give an error when encountering a duplicate field name (which
> > > will lead to data loss when reading from the file) (consensus seems
> > > to be "yes")
>
> Mostly, but with a strong objection from Mark Hackett, and hesitation
> about altering current behavior from Amaury Forgeot d'Arc.
>

More along the lines of your earlier:

> 1. It's a backwards-incompatible change.

strong objection. :-)

Programs that had been working will stop. Programs that won't work because it
doesn't throw an exception yet are no worse off.

When you change something, you'll hear almost entirely from those for whom the
change will be useful. You don't hear from those who will find it an obstacle
until it's implemented.

Requiring catching an exception means that until the code is changed, your
working program no longer works.

And as you later point out Cliff, empty and uninteresting field names may
legitimately exist and WANT to be ignored.

So although I CAN see a reasoning for an exception, I do not see it as enough
to put it in this version of the library. It's a learning process and for the
next version which will need code changes to incorporate anyway, that
knowledge can be used to make things better *next time*.

J. Cliff Dyer

Jan 24, 2013, 10:55:17 AM

to Mark Hackett, python...@python.org

On Thu, 2013-01-24 at 15:29 +0000, Mark Hackett wrote:
> On Thursday 24 Jan 2013, J. Cliff Dyer wrote:
> > > > 2. Give an error when encountering a duplicate field name (which
> > > > will lead to data loss when reading from the file) (consensus seems
> > > > to be "yes")
> >
> > Mostly, but with a strong objection from Mark Hackett, and hesitation
> > about altering current behavior from Amaury Forgeot d'Arc.
> >
>
>
> More along the lines of your earlier:
>
> > 1. It's a backwards-incompatible change.
>
> strong objection. :-)
>
> Programs that had been working will stop. Programs that won't work because it
> doesn't throw an exception yet are no worse off.
>

Noted. I will say that this doesn't seem any worse than any other
backwards-incompatible change, which are sometimes allowed, so it should
probably be considered by the same standard.

That said, what are your feelings on adding a CarefulDictReader?

J. Cliff Dyer

Jan 24, 2013, 11:08:16 AM

to Shane Green, Antoine Pitrou, python...@python.org

On Thu, 2013-01-24 at 07:28 -0800, Shane Green wrote:
> Since every form of CSV file counts EOL as a line terminator, I think
> discarding empty lines preceding the headers is arguably acceptable,
> but do not think discarding lines of just delimiters would be. What
> about extending the DictReader API so it was easy to perform these
> actions explicitly, such as being able to discard() the field names to
> be re-evaluated on the next line?

I think I like this idea. There's something a little distasteful about
making the user manually delve into the underlying reader, but this
makes it more user-friendly and more obvious how to proceed.

For clarity's sake, what is your objection to discarding lines of
delimiters? The reason I suggest doing it is that it is a common output
situation when exporting Excel files or LibreCalc files that have a
blank row at the top.

Yuval Greenfield

Jan 24, 2013, 11:08:34 AM

to J. Cliff Dyer, Antoine Pitrou, python-ideas

On Thu, Jan 24, 2013 at 5:11 PM, J. Cliff Dyer <j...@sdf.lonestar.org> wrote:

> On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
> > > 1. Do any data conditioning by ignoring empty lines and lines of
> > > just field delimiters before the header row (consensus seems to be
> > > "no")
>
> Well, I wouldn't necessarily say we have a consensus on this one. This
> idea received a +1 from Bruce Leban and an "I don't see any reason not
> to" from Steven D'Aprano.

Count me in that list as well.

If it were urllib handling a special case for a server you don't control then fine. But it's a valid CSV file you can process yourself if you need more control. We should keep DictReader simple. This is also a reason against "CarefulDictReader". If you need to be more specific then use csv.reader.

> > > 2. Give an error when encountering a duplicate field name (which
> > > will lead to data loss when reading from the file) (consensus seems
> > > to be "yes")
>
> Mostly, but with a strong objection from Mark Hackett, and hesitation
> about altering current behavior from Amaury Forgeot d'Arc.

In that one too.

Maybe we should ask the people on this list http://hg.python.org/cpython/log/5b02d622d625/Lib/csv.py

Yuval

Yuval Greenfield

Jan 24, 2013, 11:09:28 AM

to J. Cliff Dyer, Antoine Pitrou, python-ideas

To clarify - I agree with the aforementioned "consensus".

MRAB

Jan 24, 2013, 11:12:09 AM

to python-ideas

On 2013-01-24 15:24, Chris Angelico wrote:
> On Fri, Jan 25, 2013 at 2:11 AM, J. Cliff Dyer <j...@sdf.lonestar.org> wrote:
>> On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
>>> > 1. Do any data conditioning by ignoring empty lines and lines of
>>> > just field delimiters before the header row (consensus seems to be
>>> > "no")
>>
>> Well, I wouldn't necessarily say we have a consensus on this one. This
>> idea received a +1 from Bruce Leban and an "I don't see any reason not
>> to" from Steven D'Aprano.
>
> I've been lurking this thread, but fwiw, I'd put +1 on ignoring empty
> lines/just delimiter lines. For a row of column headers, a completely
> blank line makes no sense. It's a backward-incompatible change, yes,
> but I can't imagine any code actively relying on this. ISTM this would
> probably be safe for a minor release (Python 3.4), though of course
> not for Python 3.3.1.
>

Ignoring empty lines before a header seems OK to me, but ignoring
just-delimiter lines doesn't.

To me, a just-delimiter line where it's expecting a header would mean
that all of the columns are unnamed, unless we insist that it's not a
header unless at least one column is named, and I don't think that that
should be the default behaviour.
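
To make the "all of the columns are unnamed" reading concrete, here is a small sketch of what DictReader currently does with a delimiter-only first line. The data is made up.

```python
import csv
from io import StringIO

data = StringIO(",,\nspam,ham,eggs\n1,2,3\n")
reader = csv.DictReader(data)
rows = list(reader)

print(reader.fieldnames)  # ['', '', '']
print(len(rows))          # 2
print(rows[0][""])        # eggs
```

All three columns end up under the single key `''`, so only the last value in each row survives, and the intended header row is returned as data.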

As for duplicated columns names, I think that it should probably raise
an exception unless you've specified that duplicates should be put into
a list.

Mark Hackett

Jan 24, 2013, 11:23:50 AM

to python...@python.org

On Thursday 24 Jan 2013, J. Cliff Dyer wrote:

> For clarity's sake, what is your objection to discarding lines of
> delimiters? The reason I suggest doing it is that it is a common output
> situation when exporting Excel files or LibreCalc files that have a
> blank row at the top.
>
> Cheers,
> Cliff
>

I'm putting too many pennies in this pot, I feel, but...

What was the purpose of those blank lines? Like duplicate column names at the
first row, what you need to do with them depends on why they are there and what
the program using the output wants to do.

If someone took the repository of macros from the spreadsheet which used
column numbers and this was used to recreate EXACTLY whatever calculations
were done without having to keep two copies of the same algorithm to account
for the dropping of rows in the script, then dropping the rows would break
this.

This really is policy (wrt the source of the CSV and the consumer of the
dictionary).

Make it a pre process of the CSV to be used and configured to fit what the
meaning of the CSV file output was to the producing program and what bits of it
make a difference to the consumer of the dictionary's contents.

J. Cliff Dyer

Jan 24, 2013, 11:40:09 AM

to Mark Hackett, python...@python.org

On Thu, 2013-01-24 at 16:17 +0000, Mark Hackett wrote:
> > That said, what are your feelings on adding a CarefulDictReader?
>
> It's as good a solution to me as any.
>
> However, I'm not that good a programmer, and therefore what *I'd* do isn't
> necessarily a good idea, it's just one of the better ones out of the limited
> toolbox I have available.
>
> I'd prefer (for aesthetic reasons) some sort of stream converter. Much like
> freeze/thaw serialisation of data, it'd be a step between the raw csv and the
> reader that reads it.

I think my reason for wanting to have a CarefulDictReader (or a careful
DictReader), and why I think a stream converter isn't the best solution,
is that CSVs are very commonly used by people just starting to get their
feet wet with programming. Consider the use case: I've got my excel
file, and I'm just getting to the point where excel isn't cutting it
anymore. I want to start manipulating my data with python, and everyone
is telling me to use the csv library. DictReader sounds cool, because I
don't want to have to remember column numbers, and this is going to make my
code much more readable. But I can't make it read my headers simply
because I put some blank space at the top of my excel file, above my
headers.

A stream converter is another layer of complexity that keeps this
potential new programmer from having a good experience with programming,
for what gain? So that the csv library can "properly" (?) treat a line
without data as a header? I think it would be fully reasonable (and add
little to no complexity to the code) to have a DictReader that treats
the first non-empty line as the header row.
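
A sketch of how little code that takes, written as a wrapper function rather than a change to DictReader itself. The function name is hypothetical, and treating delimiter-only lines as empty is an assumption that not everyone in this thread shares.

```python
import csv
from io import StringIO

def dict_reader_skipping_blank_header(lines, **fmtparams):
    """Return a DictReader whose header is the first non-empty row."""
    lines = iter(lines)  # one shared iterator, so reading resumes after the header
    header = None
    for row in csv.reader(lines, **fmtparams):
        # A row counts as empty if every field is blank; this covers
        # both empty lines ([]) and delimiter-only lines (['', '', '']).
        if any(field.strip() for field in row):
            header = row
            break
    return csv.DictReader(lines, fieldnames=header, **fmtparams)

data = StringIO("\n,,\nspam,ham,eggs\n1,2,3\n")
records = dict_reader_skipping_blank_header(data)
rows = list(records)
print(records.fieldnames)  # ['spam', 'ham', 'eggs']
print(rows[0]["eggs"])     # 3
```

The header is located with a throwaway csv.reader, and the same line iterator is then handed to DictReader with the field names supplied explicitly.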

The csv module is one of the big gateways into python programming for a
lot of people. That's also one of the reasons I think the sockets
library is a poor analogue here. A new programmer is unlikely to reach
the sockets library until they've been through a few of the urllibs, the
httplibs, requests, some part of http or an external web framework,
smtplib, or some other higher-level networking-related libraries.

For the same reason, I think if the solution isn't something handled
automatically by the library, it needs to be accompanied by improvements
to the documentation. If we're going to provide a DictReader that is
this easy to break, we need to answer the question: How do I fix it?

Cheers,
Cliff

Shane Green

Jan 24, 2013, 11:41:40 AM

to J. Cliff Dyer, Antoine Pitrou, python...@python.org

Well, my objection to doing it automatically was based in part on not being familiar with the common scenarios you've brought up, but the other reasons I had in mind were that it seemed like the kind of thing that might also be indicative of an error–something wrong with the data someone might want to know was happening rather than have masked; and also because discarding such rows leaves a question about the delimiter: it's now known, but knowing it based on rows we've discarded seems unclean.

Shane Green

J. Cliff Dyer

Jan 24, 2013, 11:41:07 AM

to Mark Hackett, python...@python.org

On Thu, 2013-01-24 at 16:23 +0000, Mark Hackett wrote:

> If someone took the repository of macros from the spreadsheet which used
> column numbers and this was used to recreate EXACTLY whatever calculations
> were done without having to keep two copies of the same algorithm to account
> for the dropping of rows in the script, then dropping the rows would break
> this.
>

If that's the case, then why are you using a DictReader instead of a raw
csv.reader? You're already losing the first row.

Serhiy Storchaka

Jan 24, 2013, 3:35:14 PM

to python...@python.org

On 23.01.13 03:51, alex23 wrote:
> with open('malformed.csv','rb') as csvfile:
>     csvlines = list(l for l in csvfile if l.strip())
>     csvreader = DictReader(csvlines)

csvreader = DictReader(l for l in csvfile if l.strip())

Steven D'Aprano

Jan 24, 2013, 6:15:14 PM

to python...@python.org

On 25/01/13 03:08, J. Cliff Dyer wrote:
> On Thu, 2013-01-24 at 07:28 -0800, Shane Green wrote:
>> Since every form of CSV file counts EOL as a line terminator, I think
>> discarding empty lines preceding the headers is arguably acceptable,
>> but do not think discarding lines of just delimiters would be. What
>> about extending the DictReader API so it was easy to perform these
>> actions explicitly, such as being able to discard() the field names to
>> be re-evaluated on the next line?
>
> I think I like this idea. There's something a little distasteful about
> making the user manually delve into the underlying reader, but this
> makes it more user-friendly and more obvious how to proceed.

I couldn't disagree more. I think:

- it adds burden to the caller, since the caller is now expected to manually
inspect the field names and decide whether some should be discarded;

- it is less obvious: *how* does the caller decide that there are too many
field names?

- incomplete: if there is a discard(), where is the add()?

- completely irrelevant for the topic being discussed ("DictReader should
ignore leading blank lines... I know, let's give the caller the ability
to *discard* field names" -- but auto-detecting *too many* field names is
not the problem);

- and being able to change the field names on the fly is so far beyond
anything required for ordinary CSV that it doesn't belong in the CSV
module.

> For clarity's sake, what is your objection to discarding lines of
> delimiters? The reason I suggest doing it is that it is a common output
> situation when exporting Excel files or LibreCalc files that have a
> blank row at the top.

A row of delimiters should be treated by the reader object as a row with
explicitly empty fields. If the caller wishes to discard them, they can.
But the reader object shouldn't make that decision.

An empty row, on the other hand, should be just ignored. DictReader *already*
ignores empty rows, provided that they are not in the first row.
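
The distinction between the two kinds of row is visible at the csv.reader level, as this small sketch with made-up data shows.

```python
import csv
from io import StringIO

text = "a,b,c\n\n,,\n1,2,3\n"

# An empty line becomes [], a delimiter-only line becomes three empty fields.
print(list(csv.reader(StringIO(text))))
# [['a', 'b', 'c'], [], ['', '', ''], ['1', '2', '3']]

rows = list(csv.DictReader(StringIO(text)))
print(len(rows))           # 2
print(rows[0]["a"] == "")  # True
```

After the header, DictReader drops the `[]` row but keeps the delimiter-only row as a record whose fields are all empty strings.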

--
Steven

Steven D'Aprano

Jan 24, 2013, 6:53:51 PM

to python...@python.org

On 25/01/13 02:11, J. Cliff Dyer wrote:
> On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
>>> 1. Do any data conditioning by ignoring empty lines and lines of
>>> just field delimiters before the header row (consensus seems to be
>>> "no")
>
> Well, I wouldn't necessarily say we have a consensus on this one. This
> idea received a +1 from Bruce Leban and an "I don't see any reason not
> to" from Steven D'Aprano.
>
> Objections are:
>
> 1. It's a backwards-incompatible change.

All bug fixes are backwards-incompatible changes. The question is, is
there anyone relying on this behaviour?

DictReader already ignores blank lines, *except for the very first line*.
Using Python 3.3:

py> from io import StringIO
py> from csv import DictReader
py> data = StringIO('spam,ham,eggs\n\n\n\n1,2,3\n\n\n\n\n4,5,6\n')
py> x = DictReader(data)
py> next(x)
{'eggs': '3', 'ham': '2', 'spam': '1'}
py> next(x)
{'eggs': '6', 'ham': '5', 'spam': '4'}

I don't expect that there is anyone relying on a CSV file with a leading
blank line to be treated as one having no columns at all:

py> data = StringIO('\n\n\n\nspam,ham,eggs\n1,2,3\n4,5,6\n')
py> x = DictReader(data)
py> next(x)
{None: ['spam', 'ham', 'eggs']}
py> x.fieldnames
[]

I expect that there is probably code that works around this issue, by
skipping blank lines somehow, e.g.

DictReader(row for row in data if row.strip())

These work-arounds may (or not) be fragile or buggy, but they ought
to continue working even if DictReader changes its header detection.
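
One way such a work-around can be fragile, sketched with made-up data: filtering physical lines before parsing also removes blank lines that sit inside a quoted multi-line field.

```python
import csv
from io import StringIO

# The "note" field is quoted and contains an empty line.
text = 'id,note\n1,"first paragraph\n\nsecond paragraph"\n'

plain = list(csv.DictReader(StringIO(text)))
filtered = list(csv.DictReader(line for line in StringIO(text) if line.strip()))

print(repr(plain[0]["note"]))     # 'first paragraph\n\nsecond paragraph'
print(repr(filtered[0]["note"]))  # 'first paragraph\nsecond paragraph'
```

The filter cannot tell a blank line between records from a blank line inside a quoted field, because it runs before the csv parser has seen the quotes.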

--
Steven

Shane Green

Jan 24, 2013, 7:05:43 PM

to Steven D'Aprano, python...@python.org

If this is part of the same response…

> A row of delimiters should be treated by the reader object as a row with
> explicitly empty fields. If the caller wishes to discard them, they can.
> But the reader object shouldn't make that decision.
>
> An empty row, on the other hand, should be just ignored. DictReader *already*
> ignores empty rows, provided that they are not in the first row.

Then I think my description was unclear. I wasn't suggesting we add methods for manipulating individual headers, only for telling the DictReader to drop existing headers and reevaluate them on the next row. To make it easy to do something like

    while not any(records.fieldnames):
        records.discard_fieldnames() # or something to that effect…

without changing any existing behaviour.
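
Something close to this can already be approximated without a new method, as sketched here. The sketch relies on the current CPython implementation re-reading the header when `fieldnames` has been set back to `None`. That is observed behaviour, not something the documentation promises, so treat it as an assumption.

```python
import csv
from io import StringIO

records = csv.DictReader(StringIO("\n,,\nspam,ham\n1,2\n"))

# Reading .fieldnames consumes one row as the header. Resetting it to
# None makes the next read of .fieldnames consume the following row.
# The "is not None" guard stops the loop if the input runs out.
while records.fieldnames is not None and not any(records.fieldnames):
    records.fieldnames = None

print(records.fieldnames)  # ['spam', 'ham']
rows = list(records)
print(rows[0]["ham"])      # 2
```

A dedicated method would mainly give this pattern a documented, discoverable name.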

Shane Green

alex23

Jan 24, 2013, 8:49:53 PM

to python...@python.org

On 25 Jan, 06:35, Serhiy Storchaka <storch...@gmail.com> wrote:
> On 23.01.13 03:51, alex23 wrote:
>
> > with open('malformed.csv','rb') as csvfile:
> >     csvlines = list(l for l in csvfile if l.strip())
> >     csvreader = DictReader(csvlines)
>
> csvreader = DictReader(l for l in csvfile if l.strip())

Uh, thanks, although I'm not sure what you think you're showing me
that I'm not already aware of. I spelled it out as two separate
expressions for clarity, I didn't realise we were playing code golf in
our examples.

Stephen J. Turnbull

Jan 24, 2013, 9:38:30 PM

to Steven D'Aprano, python...@python.org

Steven D'Aprano writes:

> - it adds burden to the caller, since the caller is now expected to
> manually inspect the field names and decide whether some should
> be discarded;

It's a dirty job but somebody has to do it.

And that ultimately has to be the *writer* of the CSV file, not the
reader. Both csv.DictReader and the caller are merely guessing unless
there's a private agreement with the writer. csv.DictReader, as a
stdlib module, can't know about that agreement. The caller can
(although one obvious use case for csv.DictReader is that the caller
doesn't and is hoping csv.DictReader can guess better, oops).

Unless somebody has figured out how to give stdlib code "channeling"
capability?

Ethan Furman

Jan 24, 2013, 10:20:23 PM

to python...@python.org

On 01/24/2013 02:47 AM, Mark Hackett wrote:
> On Thursday 24 Jan 2013, Steven D'Aprano wrote:
>
>>> I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
>>> It's not terribly surprising once you sit down and think about it, but
>>> it's certainly at least a little unexpected to me that data is being
>>> thrown away with no notice. It's unusual for errors to pass silently
>>> in python.
>>
>> Yes, we should not forget that a CSV file is not a dict. Just because
>> DictReader is implemented with a dict as the storage, doesn't mean that it
>> should behave exactly like a dict in all things. Multiple columns with the
>> same name are legal in CSV, so there should be a reader for that
>> situation.
>>
>
> But just because it's reading a csv file, we shouldn't change how a dictionary
> works if you add the same key again.

The proposal is not to change how a dict works, but what the proper
response is for DictReader when a duplicate key is found.

~Ethan~

Ethan Furman

Jan 24, 2013, 10:25:38 PM

to python...@python.org

On 01/22/2013 05:06 PM, J. Cliff Dyer wrote:

> Thoughts? Do folks think this is worth adding to the csv library, or
> should I just keep using my subclass?

+1 for ignoring blank lines (including delimiter-only lines)

+1 for raising an exception on duplicate headers

+1 for a flag to not raise on duplicate empty headers (but a completely
empty header line is still ignored)

Terry Reedy

Jan 24, 2013, 11:26:19 PM

to python...@python.org

On 1/24/2013 6:53 PM, Steven D'Aprano wrote:

> DictReader already ignores blank lines, *except for the very first line*.

Interesting. A proper csv file does not contain blank lines. The csv doc
is silent on what it does when they are present. (The word 'blank' does not
appear.) Ignoring them seems reasonable, but then all should be ignored.
And the doc should say so.

> Using Python 3.3:
>
> py> from io import StringIO
> py> from csv import DictReader
> py> data = StringIO('spam,ham,eggs\n\n\n\n1,2,3\n\n\n\n\n4,5,6\n')
> py> x = DictReader(data)
> py> next(x)
> {'eggs': '3', 'ham': '2', 'spam': '1'}
> py> next(x)
> {'eggs': '6', 'ham': '5', 'spam': '4'}
>
>
> I don't expect that there is anyone relying on a CSV file with a leading
> blank line to be treated as one having no columns at all:
>
> py> data = StringIO('\n\n\n\nspam,ham,eggs\n1,2,3\n4,5,6\n')
> py> x = DictReader(data)
> py> next(x)
> {None: ['spam', 'ham', 'eggs']}
> py> x.fieldnames
> []
>
>
> I expect that there is probably code that works around this issue, by
> skipping blank lines somehow, e.g.
>
> DictReader(row for row in data if row.strip())
>
> These work-arounds may (or not) be fragile or buggy, but they ought
> to continue working even if DictReader changes its header detection.

--

Terry Jan Reedy

Serhiy Storchaka

Jan 25, 2013, 5:01:08 AM

to python...@python.org

On 25.01.13 03:49, alex23 wrote:
> On 25 Jan, 06:35, Serhiy Storchaka <storch...@gmail.com> wrote:
>> csvreader = DictReader(l for l in csvfile if l.strip())
>
> Uh, thanks, although I'm not sure what you think you're showing me
> that I'm not already aware of. I spelled it out as two separate
> expressions for clarity, I didn't realise we were playing code golf in
> our examples.

My point is that you don't need to read the whole file into memory. You can
use an iterator and process it line by line.

Mark Hackett

Jan 25, 2013, 5:58:28 AM

to python...@python.org

On Friday 25 Jan 2013, Ethan Furman wrote:
> On 01/24/2013 02:47 AM, Mark Hackett wrote:
> > On Thursday 24 Jan 2013, Steven D'Aprano wrote:
> >>> I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
> >>> It's not terribly surprising once you sit down and think about it, but
> >>> it's certainly at least a little unexpected to me that data is being
> >>> thrown away with no notice. It's unusual for errors to pass silently
> >>> in python.
> >>
> >> Yes, we should not forget that a CSV file is not a dict. Just because
> >> DictReader is implemented with a dict as the storage, doesn't mean
> >> that it should behave exactly like a dict in all things. Multiple
> >> columns with the same name are legal in CSV, so there should be a reader
> >> for that situation.
> >
> > But just because it's reading a csv file, we shouldn't change how a
> > dictionary works if you add the same key again.
>
> The proposal is not to change how a dict works, but what the proper
> response is for DictReader when a duplicate key is found.
>

Ethan, the proposal is predicated on the "silent abandonment" (which isn't
actually the case any more than doing:

a=4
a=9

is abandoning silently the 4.) being unexpected.

Except, just like the assignment in the aside above, this is entirely what IS
expected if you're putting a CSV line into a dictionary with duplicate key
names.

If you don't want it to do what a dictionary does, then don't use DictReader,
as Chris proposes.

My only niggle with that idea is that you'd be writing a lot of "SumptyReader"
classes, one for each case, which is redundant. But that may, in practice, be no
problem at all.

If you didn't want it to do what a dict does, don't use a dict.

Mark Hackett

Jan 25, 2013, 6:00:31 AM

to python...@python.org

On Thursday 24 Jan 2013, Steven D'Aprano wrote:

> - it is less obvious: how does the caller decide that there are too many
> field names?
>

Additionally, the user of the library now has to read much more about the
library (either code or documentation, which has to track the code too), to
decide what it is going to do.

If you have to read the code, then it's not really OO, is it. It's light grey,
not black box.

Ethan Furman

Jan 25, 2013, 11:30:25 AM

to python...@python.org

On 01/25/2013 03:00 AM, Mark Hackett wrote:
> On Thursday 24 Jan 2013, Steven D'Aprano wrote:
>> - it is less obvious: how does the caller decide that there are too many
>> field names?
>>
>
> Additionally, the user of the library now has to read much more about the
> library (either code or documentation, which has to track the code too), to
> decide what it is going to do.
>
> If you have to read the code, then it's not really OO, is it. It's light grey,
> not black box.

If you have to read the code, the documentation needs improvement.

~Ethan~

Mark Hackett

Jan 25, 2013, 11:53:46 AM

to python...@python.org

On Friday 25 Jan 2013, Ethan Furman wrote:

> On 01/25/2013 03:00 AM, Mark Hackett wrote:
> > On Thursday 24 Jan 2013, Steven D'Aprano wrote:
> >> - it is less obvious: how does the caller decide that there are too many
> >> field names?
> >
> > Additionally, the user of the library now has to read much more about the
> > library (either code or documentation, which has to track the code too),
> > to decide what it is going to do.
> >
> > If you have to read the code, then it's not really OO, is it. It's light
> > grey, not black box.
>
> If you have to read the code, the documentation needs improvement.
>

And if you put your feet too close to the fire, your feet will burn.

Neither has anything to do with the subject at hand, however.

Which is: if a dictionary acts a certain way, and you call a routine that
creates a dictionary AND IT WORKS DIFFERENTLY, then why did you use a routine
that creates a dictionary?

You see, the option here is to leave it operating as a dictionary operates.
And in that case, you do not need to document anything. The documentation of
how it works is already covered by the python basics: "How does a dictionary
work in Python?".

So don't change it, and you don't have to improve the documentation.

Ethan Furman

Jan 25, 2013, 11:48:43 AM

to python...@python.org

On 01/25/2013 02:58 AM, Mark Hackett wrote:
> On Friday 25 Jan 2013, Ethan Furman wrote:
>> On 01/24/2013 02:47 AM, Mark Hackett wrote:
>>> On Thursday 24 Jan 2013, Steven D'Aprano wrote:
>>>>> I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
>>>>> It's not terribly surprising once you sit down and think about it, but
>>>>> it's certainly at least a little unexpected to me that data is being
>>>>> thrown away with no notice. It's unusual for errors to pass silently
>>>>> in python.
>>>>
>>>> Yes, we should not forget that a CSV file is not a dict. Just because
>>>> DictReader is implemented with a dict as the storage, doesn't mean
>>>> that it should behave exactly like a dict in all things. Multiple
>>>> columns with the same name are legal in CSV, so there should be a reader
>>>> for that situation.
>>>
>>> But just because it's reading a csv file, we shouldn't change how a
>>> dictionary works if you add the same key again.
>>
>> The proposal is not to change how a dict works, but what the proper
>> response is for DictReader when a duplicate key is found.
>
> Ethan, the proposal is predicated on the "silent abandonment" (which isn't
> actually the case any more than doing:
>
> a=4
> a=9
>
> is abandoning silently the 4.) being unexpected.

We're going to have to agree to disagree on this point -- I think there
is a huge difference between reassigning a variable which is completely
under your control and losing entire columns of data from a file which
you may have never seen before.

> Except, just like the assignment in the aside above, this is entirely what IS
> expected if you're putting a CSV line into a dictionary with duplicate key
> names.

Expected by whom? The library writer? Sure. The application writer?
Maybe. The person creating the spreadsheet that's going to be dumped to
csv and imported into the program, who thought, "This field also needs
an item number... I'll call it 'item_no', just like that other column"?
-- Nope.

> If you don't want it to do what a dictionary does, then don't use DictReader,
> as Chris proposes.

DictReader puts a name on a column -- that's its primary use; I don't
think the designers had the goal of dropping data when they implemented
it -- I suspect it was just missed as a possibility (not being the
"normal" type of csv file) or putting a warning in the docs was missed.

~Ethan~

ru...@yahoo.com

Jan 25, 2013, 1:03:03 PM

to python...@googlegroups.com

On 01/25/2013 09:53 AM, Mark Hackett wrote:
> On Friday 25 Jan 2013, Ethan Furman wrote:
>> On 01/25/2013 03:00 AM, Mark Hackett wrote:
>> > On Thursday 24 Jan 2013, Steven D'Aprano wrote:
>> >> - it is less obvious: how does the caller decide that there are too many
>> >> field names?
>> >
>> > Additionally, the user of the library now has to read much more about the
>> > library (either code or documentation, which has to track the code too),
>> > to decide what it is going to do.
>> >
>> > If you have to read the code, then it's not really OO, is it. It's light
>> > grey, not black box.
>>
>> If you have to read the code, the documentation needs improvement.
>>
>
> And if you put your feet too close to the fire, your feet will burn.
>
> Neither have anything to do with the subject at hand, however.
>
> Which is if a dictionary acts a certain way and calling a routine that creates
> a dictionary AND WORKS DIFFERENTLY, then why did you use a routine that
> creates a dictionary?
>
> You see, the option here is to leave it operating as a dictionary operates.
> And in that case, you do not need to document anything. The documentation of
> how it works is already covered by the python basics: "How does a dictionary
> work in Python?".

The csv DictReader *uses* a dictionary for its output. That
it does so imposes no requirements on how it should parse or
otherwise handle the input that eventually goes into that
dict.

I can understand the appeal of keeping things simple and
simply cramming whatever comes out of a simple parse of
the header into the dict keys. Simplicity is good and
that is a valid opinion. However it is not a-priori the
obviously best one no matter how much hand-waving and
foot stomping comes with it.

I would prefer to see a suppressible exception when header
keys are duplicated on the grounds that such a csv file
is not in general an appropriate input for the DictReader.

> So don't change it, and you don't have to improve the documentation.

If it's not changed then the documentation definitely should
be fixed. The very fact that when the behaviour was pointed
out here, the result was a long discussion rather than one
or two responses that said, "of course it behaves that way"
is the strongest evidence that the current description
is inadequate.
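
A "suppressible exception" of this kind could be sketched as a thin wrapper around csv.DictReader. The function name checked_dict_reader and its allow_duplicates flag are hypothetical, invented for this illustration; nothing like them exists in the csv module:

```python
import csv
import io
from collections import Counter


def checked_dict_reader(f, allow_duplicates=False, **kwargs):
    """Return a csv.DictReader, raising ValueError on duplicate headers.

    Hypothetical helper, not part of the csv module.
    """
    reader = csv.DictReader(f, **kwargs)
    # Accessing .fieldnames makes DictReader read the header row.
    names = reader.fieldnames or []
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates and not allow_duplicates:
        raise ValueError("duplicate field names: %r" % duplicates)
    return reader


data = "item_no,name,item_no\n1001,widget,2002\n"

try:
    checked_dict_reader(io.StringIO(data))
except ValueError as exc:
    print(exc)

# Opting out restores today's "last column wins" behaviour.
rows = list(checked_dict_reader(io.StringIO(data), allow_duplicates=True))
```

Whether the check should be on or off by default is exactly the backwards-compatibility question raised elsewhere in this thread; the sketch only shows that either default is cheap to build.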

Shane Green

Jan 26, 2013, 2:43:11 AM

to ru...@yahoo.com, python...@googlegroups.com

I've been trying to avoid the wrath, but can't any longer. Let me start by clarifying that I know what a dictionary is, how it works, and what Python is, so we can bypass calling that into question. I also know what CSV is, and I've dealt with a lot of real-life examples of CSV data: not just exports from Excel, but log data from the energy management space, sensor values, etc., and critical electrical fault data generated by very legacy, stupid equipment. And while it's true that a dictionary is a dictionary and it works the way it works, the real point that drives home is that it's an inappropriate mechanism for dealing with ordered rows of sequential values. Regardless of what choices were made for the implementation, if the module's name is csv, it should be able to do the things it says it does with any legal CSV content without losing information. Just because it's how a dictionary works doesn't mean column 3's value replacing column 1's value is something other than the loss of data. One CSV file I worked with had headers for five columns of information, then the header "VALUE" for every 5 minute period in an hour. Using this CSV parser would leave the client with one sample an hour: how dictionaries work isn't going to bring back the other 11 values, so information was lost.

The final point is a simple one: while that CSV file format was stupid, it was perfectly legal. Something that deals with CSV content should not be losing any of its content. It also should be barfing or throwing exceptions, by the way.

Shane Green

Shane Green

Jan 26, 2013, 2:52:15 AM

to ru...@yahoo.com, python...@googlegroups.com

And what about fixing it by implementing a class that does it correctly: one that maps values to column numbers and keeps values as lists, modeled after FieldStorage? Make iterating it work just like it does now by replacing the values with the last value in each list before returning it, and provide iterator methods for getting at the new functionality, which includes iterating items with repeating header names in order, etc.; and also iter records, or something like that, to iterate the header: [value, …] maps?

Shane Green

Shane Green

Jan 26, 2013, 3:00:48 AM

to python...@googlegroups.com

I love it when the single word I skip completely changes the sentence's meaning…

The final point is a simple one: while that CSV file format was stupid, it was perfectly legal. Something that deals with CSV content should not be losing any of its content. It also should not be barfing or throwing exceptions, by the way.

Shane Green

Jan 26, 2013, 6:55:48 AM

to python...@python.org

Sorry if this is a dupe–it went to the google groups address the first time around, and I think that's different…

Shane Green

Stephen J. Turnbull

Jan 26, 2013, 8:53:53 AM

to Shane Green, python...@python.org

Shane Green writes:

> And while it's true that a dictionary is a dictionary and it works
> the way it works, the real point that drives home is that it's an
> inappropriate mechanism for dealing ordered rows of sequential
> values.

Right! So use csv.reader, or csv.DictReader with an explicit
fieldnames argument.

The point of csv.DictReader with default fieldnames is to take a
"well-behaved" table and turn it into a sequence of "poor-man's"
objects.

> The final point is a simple one: while that CSV file format was
> stupid, it was perfectly legal. Something that deals with CSV
> content should not be losing any of its content.

That's a reasonable requirement.

> It also should [not] be barfing or throwing exceptions, by the way.

That's not. As long as the module provides classes capable of
handling any CSV format (it does), it may also provide convenience
classes for special purposes with restricted formats. Those classes
may throw exceptions on input that doesn't satisfy the restrictions.

> And what about fixing it by replacing implementing a class that
> does it correctly, [...]?

Doesn't help users who want automatically detected access-by-name.
They must have unique field names. (I don't have a use case. I
assume the implementer of csv.DictReader did.<wink/>)
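
The "explicit fieldnames" route mentioned above can be sketched as follows. The renaming scheme in unique_names (suffixing repeats with "_2", "_3", ...) is an assumption made for this example, as are the sample column names; only csv.reader and the fieldnames argument of csv.DictReader come from the standard library:

```python
import csv
import io
from collections import Counter


def unique_names(names):
    """Suffix repeated names with their occurrence number (hypothetical scheme)."""
    seen = Counter()
    result = []
    for name in names:
        seen[name] += 1
        result.append(name if seen[name] == 1 else "%s_%d" % (name, seen[name]))
    return result


data = "site,VALUE,VALUE,VALUE\nA1,10,20,30\n"
f = io.StringIO(data)

# Consume the header row ourselves, then hand DictReader unique names.
header = next(csv.reader(f))
reader = csv.DictReader(f, fieldnames=unique_names(header))
row = next(reader)

print(dict(row))
# {'site': 'A1', 'VALUE': '10', 'VALUE_2': '20', 'VALUE_3': '30'}
```

Note that this naive scheme could itself collide with a real header that happens to be called "VALUE_2"; a caller who knows the file layout can simply pass a hand-written list of names instead.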

Shane Green

Jan 26, 2013, 9:39:11 AM

to Stephen J. Turnbull, python...@python.org

Okay, I like your point about DictReader having a place with a subset of CSV tables, and agree that, given that definition, it should throw an exception when it's fed something that doesn't conform to this definition. I like that.

One thing, though, the new version would let you access column data by name as well:

Instead of

row["timestamp"] == 1359210019.299478

It would be

row["timestamp"] == [1359210019.299478]

And potentially

row["timestamp"] == [1359210019.299478,1359210019.299478]

It could also be accessed as:

row.headers[0] == "timestamp"

row.headers[1] == "timestamp"

row.values[0] == 1359210019.299478

row.values[1] == 1359210019.299478

Could still provide:

for name,value in records.iterfirstitems(): # get the first value for each column with a given name.

- or -

for name,value in records.iterlastitems(): # get the last value for each column with a given name.

And the exact functionality you have now:

records.itervaluemaps() # or something… just a map(dict(records.iterlastitems()))

Overkill, but really simple things to add…

The only thing this really adds to the "convenience" of the current DictReader for well-behaved tables, is the ability to access values sequentially or by name; other than that, the only difference would be iterating on a generator method's output instead of the instance itself.

Shane Green

Shane Green

Jan 27, 2013, 9:10:49 AM

to python...@python.org

Something as simple as this (straw man) demonstrates what I mean:

from collections import defaultdict

class Record(defaultdict):
    def __init__(self, headers, fields):
        super(Record, self).__init__(list)
        self.headers = headers
        self.fields = fields
        # Explicit loop: map() is lazy on Python 3 and would never call enter().
        for header, field in zip(self.headers, self.fields):
            self.enter(header, field)
    def valuemap(self, first=False):
        # One value per header: the first duplicate if first=True, else the last.
        index = 0 if first else -1
        return dict((key, values[index]) for key, values in self.items())
    def enter(self, header, *values):
        # An integer header is treated as a column position.
        if isinstance(header, int):
            header = self.headers[header]
        self[header].extend(values)
    def itemseq(self):
        return zip(self.headers, self.fields)
    def __getitem__(self, spec):
        # Integers and slices index the raw field sequence; names map to lists.
        if isinstance(spec, (int, slice)):
            return self.fields[spec]
        return super(Record, self).__getitem__(spec)

This would let you access column values using header names, just like before. Each column's value(s) is now in a list, which would contain multiple values for any column name repeated in the header.

Values can also be accessed sequentially using integer indexes, and valuemap() returns a standard dictionary that conforms to the previous standard exactly: there is a one-to-one mapping between column headers and values, with the last value associated with a given column name being the value.

While I think the changes should be added without changing what exists for backward compatibility reasons, I've started to think the existing version should also be deprecated, rather than maintained as a special case. Even when the format is perfect for the existing code, I don't see any big advantages to using it over this approach.

Keep in mind the example is just a quick straw man: performance is a big difference (and there are plenty of bugs), but that doesn't seem like the right thing to base the decision on, as performance can easily be enhanced later.

In summary, given headers: A, B, C, D, E, B, G

record.headers == ["A", "B", "C", "D", "E", "B", "G"]
record.fields == [0, 1, 2, 3, 4, 5, 6]

record["A"] == [0]
record["B"] == [1, 5]

# Note sequential access values are not in lists, and the second "B" column's value 5 is in its original position (index 5).
record[0] == 0
record[1] == 1
record[2] == 2
record[3] == 3
record[4] == 4
record[5] == 5

record.items() == [("A", [0]), ("B", [1, 5]), …]

record.valuemap() == {"A": 0, "B": 5, …} # This returns exactly what DictReader does today, a single value per named column, with the last value being the one used.

Shane Green

Mark Hackett

Jan 28, 2013, 7:06:39 AM

to python...@python.org

On Sunday 27 Jan 2013, Shane Green wrote:
> While I think the changes should be added without changing what exists for
> backward compatibility reasons, I've started to think the existing version
> should also be deprecated, rather than maintained as a special case
>

That sounds effective.

Mark Hackett

Jan 28, 2013, 7:13:45 AM

to python...@python.org

On Saturday 26 Jan 2013, Stephen J. Turnbull wrote:
> Shane Green writes:
> > And while it's true that a dictionary is a dictionary and it works
> > the way it works, the real point that drives home is that it's an
> > inappropriate mechanism for dealing ordered rows of sequential
> > values.
>
> Right! So use csv.reader, or csv.DictReader with an explicit
> fieldnames argument.
>
> The point of csv.DictReader with default fieldnames is to take a
> "well-behaved" table and turn it into a sequence of "poor-man's"
> objects.
>

Well, though there's another example out there of what to do next, I was
thinking of being able to define the csv file format so that you could write it
out correctly too.

And to that end, some form of description of the csv file is needed. I was
thinking something like this:

A,B,C,A,D,E
{(A:2,A:1),B,C,D,E}

which would put columns 4 and 1 in the first entry (under the name A) as a
list, in that order, followed by B, C, D and E all expected to be single
unique names.

This also allows the same definition to be used to write it out.

Blank headers are denoted with:

A,,,,,,B,C

And headers not used in the dictionary (discarded) are handled by not being
put in the "where do we put this" line:
A,B,C,D
{A,D}

When writing out, you cannot have empty headers (since these values get
dropped and the output format spec is now no longer suitable), and you must
assign each header to a dictionary entry (else again the dictionary doesn't
contain all the data that was in the input).

To write out these two types of input file, you need to create a new csv format
spec which CAN be written out.

Therefore you will have to deliberately define an output that loses data.
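
One possible reading of the layout line above, expressed as a plain Python dict rather than the brace notation, is sketched here. The names column_index, apply_layout and layout are hypothetical, and the interpretation of "A:2" as "the second column named A" is an assumption based on the description of columns 4 and 1:

```python
import csv
import io


def column_index(header, name, occurrence=1):
    """Position of the n-th (1-based) column called `name`."""
    matches = [i for i, h in enumerate(header) if h == name]
    return matches[occurrence - 1]


def apply_layout(header, row, layout):
    """Build one dict per row; headers absent from `layout` are discarded."""
    record = {}
    for key, spec in layout.items():
        if isinstance(spec, list):
            # A list of (name, occurrence) pairs becomes a list of values.
            record[key] = [row[column_index(header, n, occ)] for n, occ in spec]
        else:
            record[key] = row[column_index(header, spec)]
    return record


# Rough equivalent of {(A:2,A:1),B,C,D,E} for the header A,B,C,A,D,E.
layout = {"A": [("A", 2), ("A", 1)], "B": "B", "C": "C", "D": "D", "E": "E"}

data = "A,B,C,A,D,E\n1,2,3,4,5,6\n"
reader = csv.reader(io.StringIO(data))
header = next(reader)
records = [apply_layout(header, row, layout) for row in reader]
print(records[0])
# {'A': ['4', '1'], 'B': '2', 'C': '3', 'D': '5', 'E': '6'}
```

Because the layout names every source column explicitly, the same structure could in principle drive a writer as well, which is the point of the proposal.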

Mark Hackett

Jan 28, 2013, 7:21:19 AM

to python...@python.org

On Friday 25 Jan 2013, ru...@yahoo.com wrote:
>
> The csv DictReader *uses* a dictionary for its output. That
> it does so imposes no requirements on how it should parse or
> otherwise handle the input that eventually goes into that
> dict.

And that doesn't mean that writing

d["A"]=1
d["A"]=9

results in d["A"] being a list containing 1 and 9.

A program using a dictionary entry has to know whether the input has duplicate
headers, because in the case where only the first line is done, writing out the
value of d["A"] gives you "1". Writing out d["A"] if it's a list gives you
"[1,9]", which must be parsed differently.

Mark Hackett

Jan 28, 2013, 7:21:58 AM

to python...@python.org

On Friday 25 Jan 2013, Ethan Furman wrote:
> We're going to have to agree to disagree on this point -- I think there
> is a huge difference between reassigning a variable which is completely
> under your control from losing entire columns of data from a file which
> you may have never seen before.
>

But if you've never seen it before, how do you know that you're going to get a
LIST in one column?

Ethan Furman

Jan 28, 2013, 10:53:44 AM

to python...@python.org

On 01/28/2013 04:21 AM, Mark Hackett wrote:
> On Friday 25 Jan 2013, Ethan Furman wrote:
>> We're going to have to agree to disagree on this point -- I think there
>> is a huge difference between reassigning a variable which is completely
>> under your control from losing entire columns of data from a file which
>> you may have never seen before.
>>
>
> But if you've never seen it before, how do you know that you're going to get a
> LIST in one column?

I don't, which is why an exception should be raised.

~Ethan~

Mark Hackett

Jan 28, 2013, 12:13:52 PM

to python...@python.org

On Monday 28 Jan 2013, Ethan Furman wrote:
> On 01/28/2013 04:21 AM, Mark Hackett wrote:
> > On Friday 25 Jan 2013, Ethan Furman wrote:
> >> We're going to have to agree to disagree on this point -- I think there
> >> is a huge difference between reassigning a variable which is completely
> >> under your control from losing entire columns of data from a file which
> >> you may have never seen before.
> >
> > But if you've never seen it before, how do you know that you're going to
> > get a LIST in one column?
>
> I don't, which is why an exception should be raised.
>
> ~Ethan~

And there's an argument for that that I've agreed to before.

There's a counter that this will cause programs that used to work to fail.

Whether the pro is higher than the con or the other way round is what I
question.

You, however, seem to believe this is a foregone conclusion.

And that's where I disagree.

MRAB

Jan 28, 2013, 12:26:31 PM

to python-ideas

On 2013-01-28 15:53, Ethan Furman wrote:
> On 01/28/2013 04:21 AM, Mark Hackett wrote:
>> On Friday 25 Jan 2013, Ethan Furman wrote:
>>> We're going to have to agree to disagree on this point -- I think there
>>> is a huge difference between reassigning a variable which is completely
>>> under your control from losing entire columns of data from a file which
>>> you may have never seen before.
>>>
>>
>> But if you've never seen it before, how do you know that you're going to get a
>> LIST in one column?
>
> I don't, which is why an exception should be raised.
>

+1

It shouldn't silently drop the columns, nor should it silently merge
the columns into a list. It should complain, unless you state that it
should merge if necessary because, presumably, you're prepared for such
an eventuality.
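
A rough sketch of that policy, "complain unless told to merge", might look like this. The function dict_rows and its merge_duplicates flag are hypothetical names chosen for the example, not an existing or proposed csv API:

```python
import csv
import io
from collections import Counter


def dict_rows(f, merge_duplicates=False):
    """Yield one dict per row; duplicate headers are an error unless merged.

    Hypothetical helper. Being a generator, it raises on first iteration,
    not at call time.
    """
    reader = csv.reader(f)
    try:
        header = next(reader)
    except StopIteration:
        return
    repeated = {name for name, count in Counter(header).items() if count > 1}
    if repeated and not merge_duplicates:
        raise ValueError("duplicate field names: %r" % sorted(repeated))
    for row in reader:
        record = {}
        for name, value in zip(header, row):
            if name in repeated:
                # Only repeated names become lists, in column order.
                record.setdefault(name, []).append(value)
            else:
                record[name] = value
        yield record


data = "t,VALUE,VALUE\n0,1,2\n"
merged = list(dict_rows(io.StringIO(data), merge_duplicates=True))
print(merged)  # [{'t': '0', 'VALUE': ['1', '2']}]
```

The caller who passes merge_duplicates=True has, by doing so, declared that some values may be lists, which answers the "how do you know you're going to get a LIST" objection for that caller only.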

Mark Hackett

Jan 28, 2013, 12:45:16 PM

to python-ideas

On Monday 28 Jan 2013, MRAB wrote:
> It shouldn't silently drop the columns
>

Why not?

It's adding to a dictionary and adding a duplicate key replaces the earlier
one.

If it dropped the columns and shouldn't have, then the results will be seen to
be wrong anyway, so there's not a huge amount of need for this.

If it WANTED to keep both columns with the duplicate names, it won't work and
needs abandoning. So no different from now.

If it WANTED duplicate keys (e.g. blanks which aren't imported and aren't
wanted), then you've just broken it. They can't necessarily change the csv file
to put headers in. So now you've made the call useless for this case.

And why, really, are there duplicate column names in there anyway? You can
come up with the assertion that this might be wanted, but they're not normally
what you see in a csv file.

I've never seen nor used a csv file that duplicated column names other than
being blank.

If it had been such a problem, the call would already have been abandoned.

Bruce Leban

Jan 28, 2013, 7:01:56 PM

to python-ideas

The reader could return a multidict. If you know it's a multidict you can access the 'discarded' values. Otherwise, it appears just like the dict that we have today. A middle ground between people that don't want the interface changed and those who want to get the multiple values. Personally, I prefer code that raises exceptions when it gets unreasonable input, and I think duplicate field names qualifies. But if that's not the general sentiment, then a multidict is a potential compromise.

--- Bruce

Follow me: http://www.twitter.com/Vroo http://www.vroospeak.com
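
The multidict compromise could be sketched roughly as below. The class name MultiDict, its getall method and the multidict_rows helper are all assumptions for illustration (the standard library has no multidict type), and the sketch deliberately ignores later item assignment:

```python
import csv
import io


class MultiDict(dict):
    """Looks like a plain dict (last value wins); getall() returns every value."""

    def __init__(self, pairs=()):
        super().__init__()
        self._all = {}
        for key, value in pairs:
            self._all.setdefault(key, []).append(value)
            self[key] = value

    def getall(self, key):
        # Values in column order; empty list for an unknown key.
        return list(self._all.get(key, []))


def multidict_rows(f):
    """Yield one MultiDict per data row (hypothetical helper)."""
    reader = csv.reader(f)
    header = next(reader, None)
    if header is None:
        return
    for row in reader:
        yield MultiDict(zip(header, row))


data = "t,VALUE,VALUE\n0,1,2\n"
row = next(multidict_rows(io.StringIO(data)))
print(row["VALUE"])         # 2
print(row.getall("VALUE"))  # ['1', '2']
```

Code that treats the row as an ordinary dict sees the same result as with DictReader today, while code that knows about getall() can recover the earlier columns.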

Alexandre Zani

Jan 28, 2013, 8:15:22 PM

to Bruce Leban, python-ideas

I think raising an exception on duplicate headers is actually very likely to cause working code to break. Consider that all you need for that to happen is an extra couple of empty separators on the first line creating two "" headers. That seems like the sort of thing that can easily occur in spreadsheet programs. (Empty cells are usually not very well differentiated from non-existent cells in spreadsheet UIs IME.) A StrictDictReader is better, but I think this is overkill.

As for a MultiDictReader, I don't think this is superior to csv.reader. In both cases, you need to keep track of the column orders. And if you already know the column order, you might as well just manually specify the field names in DictReader.
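
The trailing-separator case is easy to reproduce. The sample data below is invented; the last few lines show one assumed way a duplicate check could exempt blank headers, which is not something anyone in the thread has specified:

```python
import csv
import io
from collections import Counter

# Two trailing separators, as a spreadsheet export might produce.
data = "name,qty,,\nwidget,3,,\n"

header = next(csv.reader(io.StringIO(data)))
print(header)  # ['name', 'qty', '', '']

row = next(csv.DictReader(io.StringIO(data)))
print(dict(row))  # {'name': 'widget', 'qty': '3', '': ''}

# A naive duplicate check would reject this file because "" appears twice.
naive_duplicates = [n for n, c in Counter(header).items() if c > 1]

# Assumed refinement: only non-blank repeated names count as duplicates.
real_duplicates = [n for n in naive_duplicates if n != ""]
```

Here naive_duplicates contains the empty string while real_duplicates is empty, so a strict reader that ignored blank headers would still accept the file.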

Shane Green

Jan 29, 2013, 12:24:06 AM

to Mark Hackett, python-ideas

Actually I've seen many real-life examples of CSV files with repeated column names, working with log data in the energy management space. CSV has been around for a very long time, and is used for a lot more than spreadsheets; there are a lot of funky formats out there. Things like, every "VALUE" column is a 15 minute reading. It seems like we're getting too hung up on dicts: all the information about a record is precisely stored by two sequences of values: the headers, and the field values. Those entries and their order can both be useful to a consumer of CSV records, and should be made available. The record also maps headers to corresponding value sequences for mapped access.

Stephen J. Turnbull

Jan 29, 2013, 3:17:46 AM

to Shane Green, python-ideas

Shane Green writes:

> Actually I've seen a many real life examples of CSV files with
> repeated column names,

Sure, but this really isn't the issue. If it were, "csv.reader is
your friend" would be all the answer that the issue deserves IMHO.

> It seems like we're getting too hung up on dicts:

Not at all. (For reasons I don't understand) Somebody has a use case
where it's useful to have the field names stored in each record,
rather than stored once and have both field names and field values
accessed by position as needed. The point is to return a name-value
*mapping object* for *each* row, and that may as well be a dict.

The people who suggest a multidict or a list-valued dict are missing
that point, AFAICS. Eg, in your "BLABLA", "VALUE", ..., "VALUE"
example, position really is what matters, so a dict of any kind is
inappropriate IMO. Again, it's arbitrary whether the list-valued dict
does d["VALUE"].append(x) or d["VALUE"].insert(0,x), and it's hard for
me to guess which it would do in practice: .append is easier to write,
but .insert seems closer to the behavior of csv.reader (which is what
we really want in your example IMO).

Shane Green

Jan 29, 2013, 5:18:21 AM

to Stephen J. Turnbull, python-ideas

So I wasn't really questioning the usefulness of the dictionary representation, but couldn't the returned object also let you access the header and value sequences, etc? I was also thinking the conversion to simple dict with single (non-list) values per column could be part of the API.

Appending duplicate field values as they're read reflects the order the duplicate entries appear in the source (when I've encountered CSV that purposely used duplicate column headers, the sequence in which they appear was critical). The output from the current implementation should reflect the last duplicate value, as that always replaces previous ones in the dict, so my conversions returned the last value (-1), which should do the same…I think. It was a straw man ;-).

I see your point about the point. I think it would be good to have an implementation that kept all the information but still put the most usable API possible on it, rather than saying you can't have dictionary access unless you want to lose duplicate values, for example. I mean, I've needed to consume CSV a lot, and that's what would have made the module useful to me, and the implementation that keeps all the information and lets it easily be trimmed as needed seems better than one that just wipes it out to start.

Shane Green

Oscar Benjamin

Jan 29, 2013, 6:16:02 AM

to Shane Green, python-ideas

On 29 January 2013 10:18, Shane Green <sh...@umbrellacode.com> wrote:
> So I wasn't really questioning the usefulness of the dictionary
> representation, but couldn't the returned object also let you access the
> header and value sequences, etc? I was also thinking the conversion to
> simple dict with single (non-list) values per column could be part of the
> API.
>
> Appending duplicate field values as they're read reflects the order the
> duplicate entries appear in the source (when I've encountered CSV that
> purposely used duplicate column headers, the sequence they appear was
> critical). The output from the current implementation should reflect the
> last duplicate value, as that always replaces previous ones in the dict, so
> my conversions returned the last value (-1), which should do the same…I
> think. It was a straw man ;-).
>
> I see your point about the point. I think it would be good to have an
> implementation that kept all the information but still put the most usable
> API on it possible, rather than saying you can't have dictionary access
> unless you want to lose duplicate values, for example. I mean, I've needed
> to consume CSV a lot, and that's what would have made the module useful to
> me, and the implementation that keeps all the information and lets it easily
> to trimmed as-not-needed seems better than one that just wipes it out to
> start.

This is exactly what the csv.reader objects do.

While it is a problem that csv.DictReader silently discards data when
that is very likely an error, there's no need to try and guess how
people want to deal with duplicate column headers and invent a new
class for it. It's easy enough to write your own wrapper that exactly
performs whatever processing you happen to want:

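from collections import defaultdict

def multireader(csvreader):
    try:
        headers = next(csvreader)
    except StopIteration:
        raise ValueError('No header')
    for row in csvreader:
        # Collect every value under its header, so duplicates become lists.
        d = defaultdict(list)
        for h, v in zip(headers, row):
            d[h].append(v)
        yield d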

Oscar

Shane Green

Jan 29, 2013, 6:33:05 AM

to Oscar Benjamin, python-ideas

Okay, sure. I guess the starting point of my argument is: DictReader is nice, so why not make one that supports duplicate columns, and then implement the other behaviors (whether that's discarding values from duplicate columns so there's a one-to-one mapping, or just raising an exception when a duplicate column is encountered to start with) in terms of it? It would handle the superset of legal CSV formats that do in fact specify exactly which header name each of their values should be mapped to.

Shane Green

Mark Hackett

Jan 29, 2013, 6:39:28 AM

to python...@python.org

On Tuesday 29 Jan 2013, Alexandre Zani wrote:
>
> As for a MultiDictReader, I don't think this is superior to csv.reader. In
> both cases, you need to keep track of the column orders. And if you already
> know the column order, you might as well just manually specify the field
> names in DictReader.
>

But it would allow you to access the index by name.

value = csv_array[indices["Total Cost"]]

A little more verbose than

value = csv_dict["Total Cost"]

But it's easier to read what it's doing than

value = csv_array[3]
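
A minimal sketch of that name-to-index lookup, assuming the first row of the file is the header. The sample data and the `indices` mapping are illustrative and not part of the csv module.

```python
import csv
import io

# Made-up sample data.
sample = io.StringIO("Item,Qty,Unit Cost,Total Cost\nwidget,2,1.25,2.50\n")
reader = csv.reader(sample)
headers = next(reader)

# Map each header name to its column position. With duplicate names the
# last position wins, so this has the same blind spot as DictReader.
indices = {name: i for i, name in enumerate(headers)}

for csv_array in reader:
    value = csv_array[indices["Total Cost"]]
    print(value)
```

Each row stays an ordered list, so column order is preserved while the lookup still reads by name.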

Shane Green

Jan 29, 2013, 6:54:09 AM

to python-ideas

And funky CSV formats don't make the current version not work for anyone. It
works for the people it's been working for all along. Why stop that?

Agreed: I'm actually not for changing the existing stuff. I don't think something that used to return single values should start returning lists, and if it's going to start raising exceptions, I think that should be an option you enable explicitly. I think maybe this should be deprecated in favor of something that implements what we're discussing. I'm also realizing that this way of thinking means it's slightly off topic, and I apologize for that ;-)

Shane Green

Steven D'Aprano

Jan 29, 2013, 7:26:13 AM

to python...@python.org

On 29/01/13 04:45, Mark Hackett wrote:
> On Monday 28 Jan 2013, MRAB wrote:
>> It shouldn't silently drop the columns
>>
>
> Why not?
>
> It's adding to a dictionary and adding a duplicate key replaces the earlier
> one.

Then adding to a dictionary was a mistake.

The choice of a dict is *implementation*, not *interface*. The interface needed
is to return a mapping of column names to values. The nature of that mapping is
an implementation detail, and dict is only the simplest solution, not necessarily
the correct solution.

There is nothing about CSV files that imply that the right behaviour is to drop
columns. The nature of CSV files is to allow duplicate column names, and so CSV
readers should too. That implies that using a dict, which silently drops duplicate
keys, was the wrong choice.

We might argue that using duplicate column names is stupid, but CSV supports it,
and so should CSV readers.

> If it dropped the columns and shouldn't have, then the results will be seen to
> be wrong anyway, so there's not a huge amount of need for this.

You cannot assume that the caller knows that there are duplicated column names.
That's why dropping columns is problematic: it *silently* drops them, giving the
caller no idea that it has happened.
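
To make the silent loss concrete, here is a small sketch with made-up data. It relies on DictReader building each row from header/value pairs, so a later column with the same name overwrites an earlier one.

```python
import csv
import io

# Made-up sample data with two columns named "score".
sample = io.StringIO("id,score,score\n1,10,20\n")
row = next(csv.DictReader(sample))

# The first "score" value (10) is gone, and nothing was raised or logged.
print(dict(row))
```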

Given that DictReader already exists, and that there probably is someone out
there who is relying on it silently eating columns, I think that the only
reasonable way forward is to add a new reader that supports multiple columns
with the same name. The caller can then use whichever reader suits their
use-case:

* I don't care about duplicate-name columns, just give me some arbitrary one;
- use DictReader

* I want all of the duplicate-name columns;
- use MultiDictReader

* I want some of the duplicate-name columns;
- use MultiDictReader, and then filter the results as you get them

(When I put it like that, DictReader sounds even less useful. But as I said,
I daresay *somebody* is relying on it right now, so we can't change it.)

> And why, really, are there duplicate column names in there anyway? You can
> come up with the assertion that this might be wanted, but they're not normally
> what you see in a csv file.
>
> I've never seen nor used a csv file that duplicated column names other than
> being blank.

Well there you go. That is exactly one such example of duplicate column names.

--
Steven

Mark Hackett

Jan 29, 2013, 7:30:49 AM

to python...@python.org

On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
> On 29/01/13 04:45, Mark Hackett wrote:
> > On Monday 28 Jan 2013, MRAB wrote:
> >> It shouldn't silently drop the columns
> >
> > Why not?
> >
> > It's adding to a dictionary and adding a duplicate key replaces the
> > earlier one.
>
> Then adding to a dictionary was a mistake.
>

I agree.

So don't use DictReader in that case.

We have Oscar with the method to do your own (and it looked fairly simple and
straightforward).
Chris with carefuldictreader.
Shane with his dual-retention object.

Mark Hackett

Jan 29, 2013, 7:35:01 AM

to python...@python.org

On Tuesday 29 Jan 2013, Steven D'Aprano wrote:

> > If it dropped the columns and shouldn't have, then the results will be
> > seen to be wrong anyway, so there's not a huge amount of need for this.
>
> You cannot assume that the caller knows that there are duplicated column
> names
>

You cannot assume they wanted them as a list.

You cannot assume that duplicate replacement is what they want.

If someone is using a csv file with header names they have never read, how are
they going to use the data? They won't even know the name to access the value
in the dictionary! So I discard the claim that the caller may not know the
column names are duplicated. They have to know what the headers are to use
DictReader.

Shane Green

Jan 29, 2013, 8:08:25 AM

to Mark Hackett, python...@python.org

Let's remove the assumptions about their information by retaining all of it, and make an assumption that everyone is capable of dealing with lists.

Shane Green

Steven D'Aprano

Jan 29, 2013, 8:28:19 AM

to python...@python.org

On 29/01/13 23:35, Mark Hackett wrote:
> On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
>>> If it dropped the columns and shouldn't have, then the results will be
>>> seen to be wrong anyway, so there's not a huge amount of need for this.
>>
>> You cannot assume that the caller knows that there are duplicated column
>> names
>>
>
> You cannot assume they wanted them as a list.

I don't need to assume that. They can take the list and post-process it into
any data type they want.

A list is a natural fit for associating multiple values to a single key,
because it doesn't lose data: it is variable-sized, so it can handle "no
values" or "1000 values" equally easily; it is ordered, and it is iterable.
If the caller wants something else, they can convert it.

> You cannot assume that duplicate replacement is what they want.

I don't think I ever suggested that it was.

> If someone is using a csv file with header names they have never read, how are
> they going to use the data?

reader = csv.DictReader(whatever)
for mapping in reader:
    for key, value in mapping.items():
        process(key, value)

Or perhaps you only care about one column, and don't care about the other, unknown,
columns:

for mapping in reader:
    value = mapping.get('spam', 'some default')
    process(value)

> They won't even know the name to access the value in the dictionary!

Dealing with arbitrary field names in data you read from a file is not hard.

--
Steven

Mark Hackett

Jan 29, 2013, 8:44:35 AM

to python...@python.org

On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
> On 29/01/13 23:35, Mark Hackett wrote:
> > On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
> >>> If it dropped the columns and shouldn't have, then the results will be
> >>> seen to be wrong anyway, so there's not a huge amount of need for this.
> >>
> >> You cannot assume that the caller knows that there are duplicated column
> >> names
> >
> > You cannot assume they wanted them as a list.
>
> I don't need to assume that. They can take the list and post-process it
> into any data type they want.

Yes you ARE assuming it. You want them to post process it. But if they don't
know there are duplicates there and have found their script works for their
needs and therefore never looked, they will now get the wrong answer.

As Oscar says, they could process the csv file themselves by hand and code in
EXACTLY what they want. They don't have to put it in a dictionary then.

And you've already said

> Then adding to a dictionary was a mistake.

So they shouldn't be using DictReader.

Shane Green

Jan 29, 2013, 8:45:25 AM

to python-ideas

On Jan 29, 2013, at 5:10 AM, Mark Hackett <mark.h...@metoffice.gov.uk> wrote:

> On Tuesday 29 Jan 2013, you wrote:
>> Let's remove the assumptions about their information by retaining all of
>> it, and make an assumption that everyone is capable of dealing with lists.
>>
>

> Then lets not use a dictionary. And leave the DictReader alone.
>

Yes, I think a more useful CSV construct would map header names to lists of values, provide access to the original header and value sequences, and offer methods for iterating over the sequential (header, value) items (with possibly repeating header values; these could be fed to dict() to produce exactly what DictReader produces). As such, it would not be a DictReader, because it would produce something that extends the dictionary API rather than a plain dictionary. I would think something like CSVRecord, or just Record, would be more accurate.

Shane Green

Jan 29, 2013, 9:09:12 AM

to Mark Hackett, python...@python.org

I'm not sure this is constructive.

I think it's safe to assume changing something in an API that used to return single values, into something that now returns lists of those values, will be a problem for folks.

I also think it's safe to assume folks can design their applications for an API that returns lists of values. In support of this assumption, I will point out that's precisely what CGI's FieldStorage does to represent all HTML form values, because some form values (radio buttons, checkboxes, etc.) can have more than one value associated with their name on submission.
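
The same "every name maps to a list" convention can be seen in the standard library's query-string parser. The sketch below uses `urllib.parse.parse_qs` as a stand-in, not FieldStorage itself, and the form data is made up.

```python
from urllib.parse import parse_qs

# "colour" is submitted twice, "size" once; both come back as lists.
form = parse_qs("colour=red&colour=blue&size=large")
print(form)
```

Callers always index into a list, so code written for the single-value case keeps working when a second value shows up.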

Finally, I would assert that the more legally formatted content your content reader accurately reads and handles, the better.

Shane Green

Chris Angelico

Jan 29, 2013, 9:55:09 AM

to python-ideas

On Wed, Jan 30, 2013 at 1:09 AM, Shane Green <sh...@umbrellacode.com> wrote:
> I think it's safe to assume changing something in an API that used to return
> single values, into something that now returns lists of those values, will
> be a problem for folks.
>
> I also think it's safe to assume folks can design their applications for an
> API that returns lists of values.

Agreed on both points. A new API that returns lists of everything
would be a lot safer than fiddling with the current one.

ChrisA

Mark Lawrence

Jan 29, 2013, 12:38:50 PM

to python...@python.org

On 29/01/2013 12:30, Mark Hackett wrote:
> On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
>> On 29/01/13 04:45, Mark Hackett wrote:
>>> On Monday 28 Jan 2013, MRAB wrote:
>>>> It shouldn't silently drop the columns
>>>
>>> Why not?
>>>
>>> It's adding to a dictionary and adding a duplicate key replaces the
>>> earlier one.
>>
>> Then adding to a dictionary was a mistake.
>>
>
> I agree.
>
> So don't use DictReader in that case.
>
> We have Oscar with the method to do your own (and looked fairly simple and
> straightforward).
> Chris with carefuldictreader.
> Shane with his dual-retention object.
>

Please can we also have a
RemoveTheNullByteThatsPutAtheEndOfTheFileByBrainDeadMicrosoftMoney? :)

--
Cheers.

Mark Lawrence

Eric V. Smith

Jan 29, 2013, 1:49:01 PM

to python...@python.org

On 01/29/2013 07:35 AM, Mark Hackett wrote:
> On Tuesday 29 Jan 2013, Steven D'Aprano wrote:
>>> If it dropped the columns and shouldn't have, then the results will be
>>> seen to be wrong anyway, so there's not a huge amount of need for this.
>>
>> You cannot assume that the caller knows that there are duplicated column
>> names
>>
>
> You cannot assume they wanted them as a list.
>
> You cannot assume that duplicate replacement is what they want.
>
> If someone is using a csv file with header names they have never read, how are
> they going to use the data? They won't even know the name to access the value
> in the dictionary! So I discard the claim that the caller may not know the
> column names are duplicated. They have to know what the headers are to use
> DictReader.

Not true: I process some csv files just to translate them into another
format, say tab delimited. I don't care about the column names, but
dropping columns would sure bother me. I don't think any of the files
I've processed have duplicate columns, but I wouldn't swear to it. And
if they did, that would be an error I'd like to know about.
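
One way to get that notification without changing DictReader is to check the header row before reading any data. This is only a sketch: `check_unique_headers` is a hypothetical helper, not something the csv module provides, and the sample data is made up.

```python
import csv
import io
from collections import Counter


def check_unique_headers(fieldnames):
    """Raise ValueError if any header name appears more than once."""
    duplicates = [name for name, n in Counter(fieldnames).items() if n > 1]
    if duplicates:
        raise ValueError("duplicate column names: %r" % duplicates)


# Made-up sample data with a repeated "a" column.
sample = io.StringIO("a,b,a\n1,2,3\n")
reader = csv.DictReader(sample)
try:
    # Accessing fieldnames reads the header row from the file.
    check_unique_headers(reader.fieldnames)
except ValueError as exc:
    message = str(exc)
    print(message)
```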

Eric.

Stephen J. Turnbull

Jan 29, 2013, 2:19:30 PM

to Eric V. Smith, python...@python.org

Eric V. Smith writes:

> Not true: I process some csv files just to translate them into another
> format, say tab delimited. I don't care about the column names,

Then you'd be nuts to use csv.DictReader! csv.reader does exactly
what you want.

DictReader is about transforming a data format from a sequence of rows
of values accessed by position, one of which might be a header, to a
headerless sequence of objects with values accessed by name. If your
use case doesn't involve access by name, it is irrelevant.
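
As a sketch of that translation use case, the conversion to tab-delimited output needs only csv.reader and csv.writer. The sample data is made up; in-memory buffers stand in for real files.

```python
import csv
import io

# Made-up input with a duplicated "a" column.
source = io.StringIO("a,b,a\n1,2,3\n")
target = io.StringIO()

writer = csv.writer(target, delimiter="\t", lineterminator="\n")
for row in csv.reader(source):
    # Rows are plain lists, so every column survives in its original order.
    writer.writerow(row)

print(target.getvalue())
```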

Eric V. Smith

Jan 29, 2013, 2:21:58 PM

to Stephen J. Turnbull, python...@python.org

On 01/29/2013 02:19 PM, Stephen J. Turnbull wrote:
> Eric V. Smith writes:
>
> > Not true: I process some csv files just to translate them into another
> > format, say tab delimited. I don't care about the column names,
>
> Then you'd be nuts to use csv.DictReader! csv.reader does exactly
> what you want.
>
> DictReader is about transforming a data format from a sequence of rows
> of values accessed by position, one of which might be a header, to a
> headerless sequence of objects with values accessed by name. If your
> use case doesn't involve access by name, it is irrelevant.

True. But my point stands: it's possible to read the data (even with a
DictReader), do something with the data, and not know the column names
in advance. It's not an impossible use case.

Eric.

Stephen J. Turnbull

Jan 29, 2013, 3:37:38 PM

to Eric V. Smith, python...@python.org

Eric V. Smith writes:

> True. But my point stands: it's possible to read the data (even with a
> DictReader), do something with the data, and not know the column names
> in advance. It's not an impossible use case.

But it is. Dicts don't guarantee iteration order, so you will most
likely get an output file that not only has a different delimiter, but
a different order of fields.

The right use case here is duck-typing. Something like "I have a
bunch of tables of data about car models from different manufacturers
which have different sets of columns, and I know that all of them have
a column labeled 'MSRP', but which column might vary across tables."

Of course, I don't actually believe you'd get that lucky.

Eric V. Smith

Jan 29, 2013, 3:59:42 PM

to Stephen J. Turnbull, python...@python.org

On 1/29/2013 3:37 PM, Stephen J. Turnbull wrote:
> Eric V. Smith writes:
>
> > True. But my point stands: it's possible to read the data (even with a
> > DictReader), do something with the data, and not know the column names
> > in advance. It's not an impossible use case.
>
> But it is. Dicts don't guarantee iteration order, so you will most
> likely get an output file that not only has a different delimiter, but
> a different order of fields.

We're going to have to agree to disagree. Order is not always important.

--
Eric.

Mark Hackett

Jan 30, 2013, 5:32:54 AM

to python...@python.org

On Tuesday 29 Jan 2013, Eric V. Smith wrote:
> On 1/29/2013 3:37 PM, Stephen J. Turnbull wrote:
> > Eric V. Smith writes:
> > > True. But my point stands: it's possible to read the data (even with a
> > > DictReader), do something with the data, and not know the column names
> > > in advance. It's not an impossible use case.
> >
> > But it is. Dicts don't guarantee iteration order, so you will most
> > likely get an output file that not only has a different delimiter, but
> > a different order of fields.
>
> We're going to have to agree to disagree. Order is not always important.
>

It's not impossible that we're living in a simulated world.

If you don't know what's in the csv file at all, then how do you know what
you're supposed to do with it?

Reading into a list will ensure order, so that is usable if order is
important. If the names aren't important at all, then you should drop the first
line and read it into a list again. If the names are important, you'd better
know what names the headers are using.

Steven D'Aprano

Jan 30, 2013, 7:09:20 AM

to python...@python.org

On 30/01/13 21:32, Mark Hackett wrote:

> If you don't know what's in the csv file at all, then how do you know what
> you're supposed to do with it.

Maybe you're processing the file without caring what the column names are,
but you still need to map column name to column contents. This is no more
unusual than processing a dict where you don't know the keys: you just iterate
over them.

Or maybe you're scanning the file for one specific column name, and you don't
care what the other names are.

Or, most likely, you know what you are *expecting* in the CSV file, but because
data files don't always contain what you expect, you want to be notified if
there is something unexpected rather than just have it silently do the wrong
thing.

--
Steven

Mark Hackett

Jan 30, 2013, 7:14:09 AM

to python...@python.org

On Wednesday 30 Jan 2013, Steven D'Aprano wrote:
> On 30/01/13 21:32, Mark Hackett wrote:
> > If you don't know what's in the csv file at all, then how do you know
> > what you're supposed to do with it.
>
> Maybe you're processing the file without caring what the column names are,

If you don't care, then you shouldn't be using a dictionary, because you have
to know the name to say which one you want.

> but you still need to map column name to column contents.

Why? You said this hypothetical reckless person doesn't care.

> This is no more
> unusual than processing a dict where you don't know the keys: you just
> iterate over them.
>

Which is only used for printing the info out.

There's a much easier way to do that:

"cat file.csv"

> Or maybe you're scanning the file for one specific column name, and you
> don't care what the other names are.
>

Then you'll know if it's duplicated or not.

> Or, most likely, you know what you are *expecting* in the CSV file, but
> because data files don't always contain what you expect, you want to be
> notified if there is something unexpected rather than just have it
> silently do the wrong thing.
>

There's a way to do that:

"head -n1 file.csv".

You know, have a look.

Shane Green

Jan 30, 2013, 7:24:53 AM

to python-ideas

So I've done some thinking on it, a bit of research, etc., and have worked with a lot of different CSV content. There are a lot of parallels between the name/value pairs of an HTML form submission, and our use case.

Namely:

- There's typically only one value per name, but it's perfectly legal to have multiple values assigned to a name.

- When there are multiple values assigned to a name, order can be very important.

- They initially made the mistake of mapping field names to a single value when there was only one value, and to a list of values when there were several.

- Each of these has been deprecated, and their FieldStorage now always maps field names to lists of values.

I've implemented a Record class I'm going to pitch for feedback. Although I followed the FieldStorage API for a couple of methods, it didn't translate very well because their values are complex objects. This Record class is a dictionary type that maps header names to lists of the values from the columns labeled by that header. Most lists hold a single value because headers usually aren't duplicated. When a list holds multiple values, they appear in the order they were read from the CSV file. The API provides convenience methods for getting the first or last value listed for a given column name, making it very easy to work with singular values when desired. The dictionary API will likely be the primary mechanism for interacting with it; however, the record also knows the header and row sequences it was built from, and provides sequential access to them as well. In addition to working with non-standard CSV, performing transformations, etc., this information makes it possible to reproduce correctly ordered CSV.

I don't really know yet whether it would make sense to support any kind of manipulation of values on the record instances themselves, versus using more of a copy()/update() approach to defining modified records or something, but I did decide to wrap the row values in a tuple, making them read-only. This was for a couple of reasons. One was to address a potential inconsistency that might arise should we decide to support editing; the other is that the record is the representation of the row read from the source file, and so it should always accurately reflect that content.

About the code: I wrote it tonight and tested it for an hour, so it's not meant to be perfect or final, but it should stir up a very concrete discussion about the API, if nothing else ;-) I included a generator that seemed to work on some test files. It is most definitely not meant to be critiqued or to be a distraction, but I've included it in case anyone ends up wanting to investigate things further. Although the iterator function has a slightly different signature than DictReader, that's not because I'm trying to change anything; please keep in mind the generator was just a test. Also, I'd like to mention one last time that I don't think we should change what exists to reflect any of these changes: I was thinking it would be a new set of classes and functions that would become the preferred implementation in the future.

class Record(dict):
    def __init__(self, headers, fields):
        if len(headers) != len(fields):
            # I don't make decisions about how gaps should be filled.
            raise ValueError("header/field size mismatch")
        super(Record, self).__init__()
        self._headers = headers
        self._fields = tuple(fields)
        # Group values by header name, preserving CSV column order.
        for h, v in self.fielditems():
            self.setdefault(h, []).append(v)
    def fielditems(self):
        """
        Get header,value sequence that reflects CSV source.
        """
        return zip(self.headers(), self.fields())
    def headers(self):
        """
        Get ordered sequence of headers reflecting CSV source.
        """
        return self._headers
    def fields(self):
        """
        Get ordered sequence of values reflecting CSV row source.
        """
        return self._fields
    def getfirst(self, name, default=None):
        """
        Get value of first field associated with header named
        'name'; return 'default' if no such value exists.
        """
        return self[name][0] if name in self else default
    def getlast(self, name, default=None):
        """
        Get value of last field associated with header named
        'name'; return 'default' if no such value exists.
        """
        return self[name][-1] if name in self else default
    def getlist(self, name):
        """
        Get values of all fields associated with header named 'name'.
        """
        return self.get(name, [])
    def pretty(self, header=True):
        lines = []
        if header:
            lines.append(
                ["%s".ljust(10).rjust(20) % h for h in self.headers()])
        lines.append(
            ["%s".ljust(10).rjust(20) % v for v in self.fields()])
        return "\n\n".join(["|".join(line).strip() for line in lines])
    def __getslice__(self, start=0, stop=None):
        # Python 2 only: slicing a record slices its row values.
        return self.fields()[start: stop]

from csv import reader

Undefined = object()
def iterrecords(f, headers=None, bucketheader=Undefined,
                missingfieldsok=False, dialect="excel", *args, **kw):
    rows = reader(f, dialect, *args, **kw)
    for row in rows:
        if not row:
            # Skip empty rows.
            continue
        if not headers:
            # The first non-empty row supplies the headers.
            headers = row
            continue
        headcount = len(headers)
        rowcount = len(row)
        rowheaders = headers
        if rowcount < headcount:
            if not missingfieldsok:
                raise KeyError("row has fewer values than headers")
            # Assumption: drop the trailing headers that have no value,
            # so that Record's size check passes.
            rowheaders = headers[:rowcount]
        elif rowcount > headcount:
            if bucketheader is Undefined:
                raise KeyError("row has more values than headers")
            # Build a new list so the shared headers are not mutated.
            rowheaders = list(headers) + [bucketheader] * (rowcount - headcount)
        record = Record(rowheaders, row)
        yield record

# Originally meant to run within the context of the "csv" module; the
# import above lets it run standalone… maybe.
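
As a rough illustration of the intended API, the sketch below restates a trimmed-down Record for Python 3 and exercises the convenience methods on a row with a duplicated header. The trimmed class and the sample values are assumptions made for the example, not a replacement for the code above.

```python
class Record(dict):
    """Trimmed-down Python 3 restatement of the Record idea (a sketch)."""

    def __init__(self, headers, fields):
        if len(headers) != len(fields):
            raise ValueError("header/field size mismatch")
        super().__init__()
        self._headers = tuple(headers)
        self._fields = tuple(fields)
        # Group values by header name, preserving CSV column order.
        for h, v in zip(self._headers, self._fields):
            self.setdefault(h, []).append(v)

    def getfirst(self, name, default=None):
        return self[name][0] if name in self else default

    def getlast(self, name, default=None):
        return self[name][-1] if name in self else default

    def getlist(self, name):
        return self.get(name, [])


# Made-up row with a duplicated "score" header.
record = Record(["id", "score", "score"], ["1", "10", "20"])
print(record.getlist("score"))
print(record.getfirst("score"), record.getlast("score"))

# Feeding the sequential (header, value) pairs to dict() gives the
# last-one-wins mapping that DictReader would have produced.
flat = dict(zip(record._headers, record._fields))
print(flat)
```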

Shane Green

Shane Green

Jan 30, 2013, 7:59:17 AM

to python-ideas

I should probably also have noted the dictionary API behaviour, since it's not explicit:

keys() -> list of unique header names.

values() -> list of field values lists.

items() -> [(header, field-list),] pairs.

And then of course dictionary lookup. One thing that comes to mind is that there's really no value to the unordered sequence of value lists; there could be some value in extending an OrderedDict, making all the iteration methods consistent and therefore something that could be used to do something like write values, etc….

Jeff Jenkins

Jan 30, 2013, 9:04:47 AM

to Shane Green, python-ideas

I think this may have been lost somewhere in the last 90 messages, but adding a warning to DictReader in the docs seems like it solves almost the entire problem. New csv.DictReader users are informed, no one's old code breaks, and a separate discussion can be had about whether it's worth adding a csv.MultiDictReader which uses lists.

Shane Green

Jan 30, 2013, 9:44:26 AM

to Jeff Jenkins, python-ideas

"""Also, I'd like to mention one last time that I don't think we should change what exists to reflect any of these changes: I was thinking it would be a new set of classes and functions that would become the preferred implementation in the future."""

This is kind of that new discussion. I agree…

Shane Green