On Tue, 2013-01-22 at 17:51 -0800, alex23 wrote:With all due respect, while you make a good point that we don't want to
> I don't think we should start adding support for every malformed type
> of csv file that exists. It's easy enough to remove the unnecessary
> lines yourself before passing them to DictReader:
>
> from csv import DictReader
>
> with open('malformed.csv','rb') as csvfile:
> csvlines = list(l for l in csvfile if l.strip())
> csvreader = DictReader(csvlines)
>
> Personally, if I was dealing with this as often as you are, I'd
> probably make a custom context manager instead. The problem lies in
> the files themselves, not in csv's response to them.
> _______________________________________________
> Python-ideas mailing list
> Python...@python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>
start special casing every malformed type of CSV, there is absolutely
something wrong with DictReader's response to files that have duplicate
headers. It throws away data silently.
If you (and others on this list) aren't in favor of trying to find the
right header row (which I can understand: "In the face of ambiguity,
refuse the temptation to guess."), maybe a better solution would be to
raise a (suppressible) exception if the headers aren't uniquely named.
("Errors should never pass silently. Unless explicitly silenced.")
It's still rather surprising (and, in many cases, undesired). I would
suggest adding a parameter to DictReader to raise an exception when
there are duplicate column headers.
Regards
Antoine.
It's still rather surprising (and, in many cases, undesired). I wouldsuggest adding a parameter to DictReader to raise an exception when
there are duplicate column headers.
Regards
Antoine.
> That's how Python dictionaries work, by design:It's still rather surprising (and, in many cases, undesired). I would
> d = {'a': 1, 'a': 2}
> "silently" discards the first value.
suggest adding a parameter to DictReader to raise an exception when
there are duplicate column headers.
Moreover, I think while it might be expected for a dict to do this, it
does not follow that a DictReader should be expected to silently throw
away the user's data. Just because it uses the dict format for storage
does not mean that it's okay to throw away user's data silently. Dicts
need to be blazingly fast for a host of reasons. DictReaders do not.
They're usually dealing with file input, so any slowness in the
DictReader itself is going to be dwarfed by the file access. As such we
can afford to be more programmer-friendly here.
-1 to adding optional parameters that change the behaviour of a class.
It's not really a problem if the new behaviour is conditioned by a
constructor argument.
Regards
Antoine.
On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:Well, I wouldn't necessarily say we have a consensus on this one. This
> > 1. Do any data conditioning by ignoring empty lines and lines of
> > just field delimiters before the header row (consensus seems to be
> > "no")
idea received a +1 from Bruce Leban and an "I don't see any reason not
to" from Steven D'Aprano.
Mostly, but with a strong objection from Mark Hackett, and hesitation
> > 2. Give an error when encountering a duplicate field name (which
> > will lead to data loss when reading from the file) (consensus seems
> > to be "yes")
about altering current behavior from Amaury Forgeot d'Arc.
A row of delimiters should be treated by the reader object as a row with
explicitly empty fields. If the caller wishes to discard them, they can.
But the reader object shouldn't make that decision.
An empty row, on the other hand, should be just ignored. DictReader *already*
ignores empty rows, provided that they are not in the first row.
The final point is a simple one: while that CSV file format was stupid, it was perfectly legal. Something that deals with CSV content should not be losing any of its content. It also should not be barfing or throwing exceptions, by the way.
I've been trying to avoid the wrath, but can't any longer. Let me start but clarifying that I know what a dictionary is, how it works, and what Python is, so we can bypass calling that into question. I also know what CSV is, and I've dealt with a lot of real-life examples of CSV data: not just exports from excel, log data from the energy management space, sensor values, etc.; critical electrical fault data generated by very legacy, stupid equipment. And while it's true that a dictionary is a dictionary and it works the way it works, the real point that drives home is that it's an inappropriate mechanism for dealing ordered rows of sequential values. Regardless of what choices were made for the implementation, if the module's name is csv, it should be able to do the things it says it does with any legal CSV content without losing information. Just because its how a dictionary works doesn't mean column 3's value replacing column 1's value is something other than the loss of data. One CSV file I worked with had headers for five columns of information, then the header "VALUE" for every 5 minute period in an hour. Using this CSV parser would leave the client with one sample an hour: how dictionaries work isn't going to bring back 10 values, so information was lost.
The final point is a simple one: while that CSV file format was stupid, it was perfectly legal. Something that deals with CSV content should not be losing any of its content. It also should [not] be barfing or throwing exceptions, by the way.
And what about fixing it by replacing implementing a class that does it correctly, maps values to column numbers, keeps values as lists modeled after FieldStorage. Make iterating it work just like it does now by replacing the values with the last value in each least before returning it, and provide iterator methods for getting at the new functionality, which includes iterating items with repeating header names in order, etc; and also iter records, or something like that, to iterate the head: [value, …] maps?
class Record(defaultdict):def __init__(self, headers, fields):super(Record, self).__init__(list)self.headers = headersself.fields = fieldsmap(self.enter, self.headers, self.fields)def valuemap(self, first=False):index = 0 if first else -1return dict([(key,values[index]) for key,values in self.items()])def enter(self, header, *values):if isinstance(header, int):header = self.headers[header]self[header].extend(values)def itemseq(self):return zip(self.headers,self.fields)def __getitem__(self, spec):if isinstance(spec, int):return self.fields[spec]return super(Record, self).__getitem__(spec)def __getslice__(self, *args):return self.fields.__getslice__(*args)
And funky CSV formats don't make the current version not work for anyone. It
works for the people it's been working for all along. Why stop that?
class Record(dict):def __init__(self, headers, fields):if len(headers) != len(fields):# I don't make decicions about how gaps should be filled.raise ValueError("header/field size mismatch")self._headers = headersself._fields = tuple(fields)[self.setdefault(h,[]).append(v) for h,v in self.fielditems()]super(Record, self).__init__()def fielditems(self):"""Get header,value sequence that reflects CSV source."""return zip(self.headers(),self.fields())def headers(self):"""Get ordered sequence of headers reflecting CSV source."""return self._headersdef fields(self):"""Get ordered sequence of values reflecting CSV row source."""return self._fieldsdef getfirst(self, name, default=None):"""Get value of last field associated with header named'name'; return 'default' if no such value exists."""return self[name][0] if name in self else defaultdef getlast(self, name, default=None):"""Get value of last field associated with header named'name'; return 'default' if no such value exists."""return self[name][-1] if name in self else defaultdef getlist(self, name):"""Get values of all fields associated with header named 'name'."""return self.get(name, [])def pretty(self, header=True):lines = []if header:lines.append(["%s".ljust(10).rjust(20) % h for h in self.headers()])lines.append(["%s".ljust(10).rjust(20) % v for v in self.fields()])return "\n\n".join(["|".join(line).strip() for line in lines])def __getslice__(self, start=0, stop=None):return self.fields()[start: stop]import itertoolsUndefined = object()def iterrecords(f, headers=None, bucketheader=Undefined,missingfieldsok=False, dialect="excel", *args, **kw):rows = reader(f, dialect, *args, **kw)for row in itertools.ifilter(None, rows):if not headers:headers = rowheadcount = len(headers)print headerscontinuerowcount = len(row)rowheaders = headersif rowcount < headcount:if not missingfieldsok:raise KeyError("row has more values than headers")elif rowcount > headcount:if bucketheader is Undefined:raise KeyError("row has more values than headers")rowheaders += [bucketheader] * (rowcount - headcount)record = Record(rowheaders, row)yield record