Tim, can you take a look at these if you get the chance? If not, I
should be able to spend some time on this around March 22nd.
-param
So, regarding the patches: I think the features would all be good
additions. The tricky thing will be to figure out a way of
integrating the functionality without affecting compatibility.
BOM removal: I think this would be best implemented within
readline_iterator(). The file should be opened with codecs.open()
instead of the builtin open(), and if calling readline() on the file
returns a unicode string, the BOM would be stripped from it (for the
first line only). What do you think?
For convenience, maybe there should also be a function that takes a
filename and an encoding, instead of an open file object the way
readfp() does.
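The two ideas above might be sketched roughly like this. This is a
minimal sketch in modern Python, where decoded text is str (in the
Python 2 terms of this discussion, the check would be
isinstance(line, unicode)); readline_iterator is the name of the
existing iniparse function, but this body is only an illustration of
the idea, and readfile is a hypothetical name for the convenience
function:

```python
import codecs

BOM = u'\ufeff'

def readline_iterator(fp):
    # Yield lines from an open file object; if the file yields decoded
    # text, strip a BOM from the first line only.  (A sketch of the
    # idea above, not iniparse's actual implementation.)
    first = True
    while True:
        line = fp.readline()
        if not line:
            break
        if first and isinstance(line, str) and line.startswith(BOM):
            line = line[len(BOM):]
        first = False
        yield line

def readfile(filename, encoding='utf-8'):
    # Hypothetical convenience function: take a filename and an
    # encoding (defaulting to utf-8) instead of an open file object.
    with codecs.open(filename, encoding=encoding) as fp:
        return list(readline_iterator(fp))
```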
INI file with quotes: This needs to be added in such a way that the
old behavior - i.e. the quotes being part of the value - continues to
work by default.
This may fit in with something else I had in mind: an optional
function that transforms the value when it is read or written. This
could be used to implement interpolation (right now interpolation of
%(VAR)s only works if compat.ConfigParser is used). The application
could then provide a value-transforming function that removes and adds
the quotes as needed. Or, given that the quotes are probably a
commonly needed functionality, the library could provide such a
function, and all the application would need to do is to hook up the
function to the INIConfig object.
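A pair of such transforms for the quoting case might look like this
(the function names and the hook mechanism are hypothetical; no such
API exists in iniparse today):

```python
def unquote_value(value):
    # Hypothetical read-transform: strip one pair of matching
    # surrounding quotes, leaving unquoted values untouched.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'"):
        return value[1:-1]
    return value

def quote_value(value):
    # Hypothetical write-transform: re-add double quotes when the value
    # has leading or trailing whitespace that would otherwise be lost.
    if value != value.strip():
        return '"%s"' % value
    return value
```

The application would hook these up to the INIConfig object, or the
library could ship them ready-made as suggested above.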
INI files with parts outside sections: The comment about backward
compatibility applies here as well. I think the easiest way to
achieve this would be to inject a "[DEFAULTSECT]" line at the start of
the file - the rest of the parsing would then just work. This could
be done by a wrapper class.
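Such a wrapper might be as simple as this (a sketch; the class name
and the sectname parameter are hypothetical):

```python
class DefaultSectionWrapper:
    # Hypothetical file-object wrapper: present the underlying file as
    # if it began with a "[DEFAULTSECT]" line, so options appearing
    # before any section header parse as members of a default section.
    def __init__(self, fp, sectname='DEFAULTSECT'):
        self._fp = fp
        self._pending = '[%s]\n' % sectname

    def readline(self):
        # Hand out the injected header first, then defer to the file.
        if self._pending is not None:
            line, self._pending = self._pending, None
            return line
        return self._fp.readline()
```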
I'll see if I can implement some of this in the next few days to see
what it looks like. If you have time to rework the patches, let me
know and we could split the tasks between us.
-param
The user would have to specify the encoding, and then reading from the
file would return unicode objects. So we'd just have to check for
U+FEFF, right?
-param
Ah, so the real question is: should the library attempt to guess the
correct encoding, or should the application be responsible for
explicitly specifying the encoding?
I would be hesitant to add code for guessing the encoding to iniparse,
unless it turns out that INI files use a variety of encodings in the
wild, and applications often have no idea what kind of encoding they
are dealing with.
-param
Tim
My plan is this: modify the function that iterates over the file to
skip a leading BOM *if* the readline() function of the file object
returns Unicode strings. In addition, maybe I'll add a function that
takes a file name and encoding instead of a file object, and its
encoding will default to utf-8.
-param
I implemented this yesterday (and checked it in to SVN), and then I
realized that this was not enough. One of the main goals of iniparse
is the ability to round-trip - and although this approach works just
fine for parsing, it makes round-tripping difficult/impossible. To
recreate the original file on disk, we must know what its encoding
was, whether or not it started with a BOM, etc.
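To illustrate the kind of bookkeeping required (hypothetical function
names, and only a sketch of the idea, not what is checked in):

```python
import codecs

BOM = u'\ufeff'

def read_text(path, encoding='utf-8'):
    # Read and decode the file, remembering the facts needed to
    # recreate it on disk: the encoding and whether a BOM was present.
    with codecs.open(path, encoding=encoding) as fp:
        text = fp.read()
    had_bom = text.startswith(BOM)
    if had_bom:
        text = text[1:]
    return text, {'encoding': encoding, 'bom': had_bom}

def write_text(path, text, meta):
    # Re-encode with the remembered encoding, restoring the BOM if the
    # original file had one.
    if meta['bom']:
        text = BOM + text
    with codecs.open(path, 'w', encoding=meta['encoding']) as fp:
        fp.write(text)
```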
Implementing full-featured round-tripping for arbitrarily encoded
files (including values that are not representable as ASCII) goes
beyond simply ignoring BOMs at the beginning of ASCII files... but I
think that would be a good feature to have. I'm planning to work on
that next.
-param